Phonetic Hasher
PhoneticHasher converts Myanmar text into pronunciation-based hash codes, letting the suggestion engine surface candidates that sound alike even when their spellings differ significantly.
mySpellChecker includes a customPhoneticHasher optimized for Myanmar phonology. It converts a string into a phonetic code, allowing for fuzzy matching based on pronunciation.
Key Features
- Consonant Grouping: Maps similar-sounding consonants (e.g., က vs ဂ) to the same code.
- Tone Normalization: Can ignore tone marks (e.g., ကာ, ကား, က) so that words differing only in tone still match, catching tonal errors.
- Vowel Normalization: Treats short and long vowels (e.g., ိ vs ီ) as identical.
- Adaptive Length Encoding: Automatically extends the code length for compound words to preserve phonetic information.
- Nasal Normalization: Optionally unifies various nasal endings (e.g., န်, မ်, င်) to the Anusvara (ံ) sound. Note: normalize_nasals defaults to False; you must explicitly enable it if needed.
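As a minimal sketch of how consonant grouping and tone normalization might behave, consider the toy encoder below. The group table, tone set, and `rough_code` function are illustrative assumptions, not mySpellChecker's actual data or API:

```python
# Hypothetical group table -- an illustration of the idea only,
# not the library's real mapping.
CONSONANT_GROUPS = {"က": "K", "ခ": "K", "ဂ": "K", "ဃ": "K"}  # velar stops sound alike
TONE_MARKS = {"း", "့"}  # visarga, dot below

def rough_code(word: str) -> str:
    # Drop tone marks, collapse similar-sounding consonants to one group letter.
    return "".join(CONSONANT_GROUPS.get(ch, ch) for ch in word if ch not in TONE_MARKS)

# Sound-alike spellings collapse to the same code:
assert rough_code("ကား") == rough_code("ဂာ")
```

Because the tone mark း is dropped and က/ဂ share a group, the two spellings hash identically, which is exactly the behavior fuzzy matching needs.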
Python API
How It Works
Encoding Process
- Normalization: Text is converted to Unicode NFC form.
- Preprocessing: If normalize_nasals=True, common nasal endings (န်, မ်, င်) are normalized to ံ. (Disabled by default.)
- Mapping: Each character is mapped to a phonetic group code (e.g., KA_GROUP, MEDIAL_R).
- Filtering: Tone marks and viramas are optionally stripped.
- Concatenation: Codes are joined to form the final hash.
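The five steps can be sketched as a small pipeline. Every table below (phonetic groups, nasal finals, tone marks) is an illustrative stand-in for the library's real data, and this `encode` function is a sketch, not the actual implementation:

```python
import unicodedata

# Illustrative stand-in tables -- not the library's real mappings.
PHONETIC_GROUPS = {"က": "KA_GROUP", "ဂ": "KA_GROUP", "ရ": "RA_GROUP"}
NASAL_FINALS = {"န်": "ံ", "မ်": "ံ", "င်": "ံ"}  # unified to Anusvara
TONE_MARKS = {"း", "့"}
VIRAMA = "်"

def encode(text: str, normalize_nasals: bool = False, ignore_tones: bool = True) -> str:
    text = unicodedata.normalize("NFC", text)            # 1. Normalization
    if normalize_nasals:                                 # 2. Preprocessing (off by default)
        for final, anusvara in NASAL_FINALS.items():
            text = text.replace(final, anusvara)
    codes = []
    for ch in text:
        if ignore_tones and (ch in TONE_MARKS or ch == VIRAMA):
            continue                                     # 4. Filtering
        codes.append(PHONETIC_GROUPS.get(ch, ch))        # 3. Mapping
    return "-".join(codes)                               # 5. Concatenation
```

With nasal normalization enabled, ကန် and ကင် both preprocess to ကံ and therefore produce the same hash.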
Scoring
The compute_phonetic_similarity method uses a multi-factor scoring approach:
- Character-level similarity: Compares characters pairwise using Myanmar substitution costs (MYANMAR_SUBSTITUTION_COSTS), visual confusability, and phonetic group membership.
- Length penalty: Proportional penalty for length differences: (max_len - min_len) / max_len * 0.2.
- Phonetic code blending: Levenshtein distance on phonetic codes is blended with character-level similarity, with the code weight scaled by input length (min(0.4, len / 20.0)).
Score = (1 - w) * CharSimilarity + w * CodeSimilarity - LengthPenalty
Where CharSimilarity is the average per-character similarity using substitution costs, CodeSimilarity = 1 - Levenshtein(Code_A, Code_B) / MaxLen, and w = min(0.4, len(input) / 20.0).
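A direct transcription of this formula might look as follows. Two assumptions to flag: the character-level similarity here is a simple exact-match ratio standing in for the real substitution-cost comparison, and using the shorter input's length to drive w is a guess, since the formula only says "input length":

```python
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def phonetic_similarity(word_a: str, word_b: str, code_a: str, code_b: str) -> float:
    max_len = max(len(word_a), len(word_b)) or 1
    min_len = min(len(word_a), len(word_b))
    # Stand-in CharSimilarity: exact-match ratio (the real code weights
    # MYANMAR_SUBSTITUTION_COSTS, visual confusability, group membership).
    char_sim = sum(x == y for x, y in zip(word_a, word_b)) / max_len
    code_max = max(len(code_a), len(code_b)) or 1
    code_sim = 1 - levenshtein(code_a, code_b) / code_max
    w = min(0.4, min_len / 20.0)                      # code weight grows with length
    length_penalty = (max_len - min_len) / max_len * 0.2
    return (1 - w) * char_sim + w * code_sim - length_penalty
```

Identical inputs with identical codes score 1.0; diverging codes or a length mismatch pull the score down exactly as the formula prescribes.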
Usage in Spell Checking
Note: Phonetic hashing is computed at runtime, not stored in the database schema. There is no phonetic_hash column in the database tables. Hashes are generated on the fly by the PhoneticHasher during lookup.
- Lookup: When a word is unknown (OOV), the system computes its phonetic hash at runtime.
- Comparison: The hash is compared against hashes computed for dictionary candidates (from SymSpell delete index).
- Suggestion: Candidates are matched by:
- Exact Hash Match: Words that sound identical.
- Near Hash Match: Words that sound very similar (e.g., slight medial difference).
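A hypothetical end-to-end sketch of this lookup flow follows. The encoder, the candidate list, and the near-match rule are all stand-ins for illustration; the real system draws candidates from the SymSpell delete index and ranks near matches with compute_phonetic_similarity:

```python
# Stand-in group table for the sketch.
GROUPS = {"က": "K", "ဂ": "K", "တ": "T", "ဒ": "T"}
TONES = {"း", "့"}

def hash_word(word: str) -> str:
    # Runtime hash -- nothing is read from a database column.
    return "".join(GROUPS.get(ch, ch) for ch in word if ch not in TONES)

def suggest(oov: str, candidates: list[str]) -> tuple[list[str], list[str]]:
    target = hash_word(oov)                      # 1. Lookup: hash the OOV word
    exact, near = [], []
    for cand in candidates:                      # 2. Comparison against candidates
        h = hash_word(cand)
        if h == target:
            exact.append(cand)                   # 3. Exact hash match: sounds identical
        elif abs(len(h) - len(target)) <= 1 and h[:1] == target[:1]:
            near.append(cand)                    # crude near-match heuristic
    return exact, near

exact, near = suggest("ဂား", ["ကာ", "ကား", "တာ"])
```

Here the misspelled ဂား hashes identically to both ကာ and ကား, so both surface as exact phonetic matches, while တာ is excluded because its leading consonant falls in a different group.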
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| ignore_tones | bool | True | Ignore tone marks in encoding, making it more forgiving of tone mark errors. |
| normalize_length | bool | True | Treat short/long vowels as the same (e.g., ိ and ီ produce the same code). |
| normalize_nasals | bool | False | Normalize nasal endings (န်, မ်, င်) to Anusvara (ံ). Increases recall but may cause false positives between /n/, /m/, and /ŋ/ sounds. |
| max_code_length | int | 10 | Maximum length of phonetic codes. Base limit for simple words; may be extended for compounds when adaptive_length is enabled. |
| adaptive_length | bool | True | Automatically extend max_code_length for compound words to prevent information loss. |
| chars_per_code_unit | int | 6 | Characters per phonetic code unit for adaptive length calculation (~6 input chars per code unit). |
| cache_size | int | 4096 | Maximum size of the LRU encoding cache. Set to 0 to disable caching. |
| config | PhoneticConfig \| None | None | Optional PhoneticConfig for similarity scoring weights. When provided, encoding params (max_code_length, chars_per_code_unit, cache_size) are taken from config unless explicitly overridden. |
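One plausible reading of how adaptive_length and chars_per_code_unit interact is sketched below. The formula is an assumption inferred from the parameter descriptions above, not the library's actual computation:

```python
import math

def effective_code_length(text: str, max_code_length: int = 10,
                          chars_per_code_unit: int = 6,
                          adaptive_length: bool = True) -> int:
    # Assumed rule: one code unit per ~6 input characters,
    # never dropping below the base limit for simple words.
    if not adaptive_length:
        return max_code_length
    needed = math.ceil(len(text) / chars_per_code_unit)
    return max(max_code_length, needed)
```

Under this reading, short words keep the base limit of 10 while a 90-character compound would get 15 code units, preserving phonetic detail that a fixed cap would truncate.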
Configuration
Controlled by use_phonetic in SpellCheckerConfig.