Phonetic Hasher
PhoneticHasher converts Myanmar text into pronunciation-based hash codes, letting the suggestion engine surface candidates that sound alike even when their spellings differ significantly.
mySpellChecker includes a customPhoneticHasher optimized for Myanmar phonology. It converts a string into a phonetic code, allowing for fuzzy matching based on pronunciation.
Key Features
- Consonant Grouping: Maps similar-sounding consonants (e.g., က vs ဂ) to the same code.
- Tone Normalization: Can ignore tone marks (e.g., ကာ, ကား, က) so that words differing only in tone still match, catching tonal errors.
- Vowel Normalization: Treats short and long vowels (e.g., ိ vs ီ) as identical.
- Adaptive Length Encoding: Automatically extends the code length for compound words to preserve phonetic information.
- Nasal Normalization: Optionally unifies various nasal endings (e.g., န်, မ်, င်) to the Anusvara (ံ) sound. Note: normalize_nasals defaults to False; you must explicitly enable it if needed.
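As a minimal sketch of how consonant grouping and tone normalization might behave, consider the toy encoder below. The group table, tone set, and `rough_code` function are illustrative assumptions, not mySpellChecker's actual data or API:

```python
# Hypothetical group table -- an illustration of the idea only,
# not the library's real mapping.
CONSONANT_GROUPS = {"က": "K", "ခ": "K", "ဂ": "K", "ဃ": "K"}  # velar stops sound alike
TONE_MARKS = {"း", "့"}  # visarga, dot below

def rough_code(word: str) -> str:
    # Drop tone marks, collapse similar-sounding consonants to one group letter.
    return "".join(CONSONANT_GROUPS.get(ch, ch) for ch in word if ch not in TONE_MARKS)

# Sound-alike spellings collapse to the same code:
assert rough_code("ကား") == rough_code("ဂာ")
```

Because the tone mark း is dropped and က/ဂ share a group, the two spellings hash identically, which is exactly the behavior fuzzy matching needs.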
Python API
How It Works
Encoding Process
- Normalization: Text is converted to Unicode NFC form.
- Preprocessing: If normalize_nasals=True, common nasal endings (န်, မ်, င်) are normalized to ံ. (Disabled by default.)
- Mapping: Each character is mapped to a phonetic group code (e.g., KA_GROUP, MEDIAL_R).
- Filtering: Tone marks and viramas are optionally stripped.
- Concatenation: Codes are joined to form the final hash.
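The five steps can be sketched as a small pipeline. Every table below (phonetic groups, nasal finals, tone marks) is an illustrative stand-in for the library's real data, and this `encode` function is a sketch, not the actual implementation:

```python
import unicodedata

# Illustrative stand-in tables -- not the library's real mappings.
PHONETIC_GROUPS = {"က": "KA_GROUP", "ဂ": "KA_GROUP", "ရ": "RA_GROUP"}
NASAL_FINALS = {"န်": "ံ", "မ်": "ံ", "င်": "ံ"}  # unified to Anusvara
TONE_MARKS = {"း", "့"}
VIRAMA = "်"

def encode(text: str, normalize_nasals: bool = False, ignore_tones: bool = True) -> str:
    text = unicodedata.normalize("NFC", text)            # 1. Normalization
    if normalize_nasals:                                 # 2. Preprocessing (off by default)
        for final, anusvara in NASAL_FINALS.items():
            text = text.replace(final, anusvara)
    codes = []
    for ch in text:
        if ignore_tones and (ch in TONE_MARKS or ch == VIRAMA):
            continue                                     # 4. Filtering
        codes.append(PHONETIC_GROUPS.get(ch, ch))        # 3. Mapping
    return "-".join(codes)                               # 5. Concatenation
```

With nasal normalization enabled, ကန် and ကင် both preprocess to ကံ and therefore produce the same hash.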
Scoring
The compute_phonetic_similarity method uses a multi-factor scoring approach:
- Character-level similarity: Compares characters pairwise using Myanmar substitution costs (MYANMAR_SUBSTITUTION_COSTS), visual confusability, and phonetic group membership.
- Length penalty: Proportional penalty for length differences: (max_len - min_len) / max_len * 0.2.
- Phonetic code blending: Levenshtein distance on phonetic codes is blended with character-level similarity, with the code weight scaled by input length (min(0.4, len / 20.0)).
Score = (1 - w) * CharSimilarity + w * CodeSimilarity - LengthPenalty
Where CharSimilarity is the average per-character similarity using substitution costs, CodeSimilarity = 1 - Levenshtein(Code_A, Code_B) / MaxLen, and w = min(0.4, len(input) / 20.0).
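A direct transcription of this formula might look as follows. Two assumptions to flag: the character-level similarity here is a simple exact-match ratio standing in for the real substitution-cost comparison, and using the shorter input's length to drive w is a guess, since the formula only says "input length":

```python
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def phonetic_similarity(word_a: str, word_b: str, code_a: str, code_b: str) -> float:
    max_len = max(len(word_a), len(word_b)) or 1
    min_len = min(len(word_a), len(word_b))
    # Stand-in CharSimilarity: exact-match ratio (the real code weights
    # MYANMAR_SUBSTITUTION_COSTS, visual confusability, group membership).
    char_sim = sum(x == y for x, y in zip(word_a, word_b)) / max_len
    code_max = max(len(code_a), len(code_b)) or 1
    code_sim = 1 - levenshtein(code_a, code_b) / code_max
    w = min(0.4, min_len / 20.0)                      # code weight grows with length
    length_penalty = (max_len - min_len) / max_len * 0.2
    return (1 - w) * char_sim + w * code_sim - length_penalty
```

Identical inputs with identical codes score 1.0; diverging codes or a length mismatch pull the score down exactly as the formula prescribes.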
Usage in Spell Checking
Note: Phonetic hashing is computed at runtime, not stored in the database schema. There is no phonetic_hash column in the database tables. Hashes are generated on the fly by the PhoneticHasher during lookup.
- Lookup: When a word is unknown (OOV), the system computes its phonetic hash at runtime.
- Comparison: The hash is compared against hashes computed for dictionary candidates (from SymSpell delete index).
- Suggestion: Candidates are matched by:
- Exact Hash Match: Words that sound identical.
- Near Hash Match: Words that sound very similar (e.g., slight medial difference).
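A hypothetical end-to-end sketch of this lookup flow follows. The encoder, the candidate list, and the near-match rule are all stand-ins for illustration; the real system draws candidates from the SymSpell delete index and ranks near matches with compute_phonetic_similarity:

```python
# Stand-in group table for the sketch.
GROUPS = {"က": "K", "ဂ": "K", "တ": "T", "ဒ": "T"}
TONES = {"း", "့"}

def hash_word(word: str) -> str:
    # Runtime hash -- nothing is read from a database column.
    return "".join(GROUPS.get(ch, ch) for ch in word if ch not in TONES)

def suggest(oov: str, candidates: list[str]) -> tuple[list[str], list[str]]:
    target = hash_word(oov)                      # 1. Lookup: hash the OOV word
    exact, near = [], []
    for cand in candidates:                      # 2. Comparison against candidates
        h = hash_word(cand)
        if h == target:
            exact.append(cand)                   # 3. Exact hash match: sounds identical
        elif abs(len(h) - len(target)) <= 1 and h[:1] == target[:1]:
            near.append(cand)                    # crude near-match heuristic
    return exact, near

exact, near = suggest("ဂား", ["ကာ", "ကား", "တာ"])
```

Here the misspelled ဂား hashes identically to both ကာ and ကား, so both surface as exact phonetic matches, while တာ is excluded because its leading consonant falls in a different group.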
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| ignore_tones | bool | True | Ignore tone marks in encoding, making it more forgiving of tone mark errors. |
| normalize_length | bool | True | Treat short/long vowels as the same (e.g., ိ and ီ produce the same code). |
| normalize_nasals | bool | False | Normalize nasal endings (န်, မ်, င်) to Anusvara (ံ). Increases recall but may cause false positives between /n/, /m/, and /ŋ/ sounds. |
| max_code_length | int | 10 | Maximum length of phonetic codes. Base limit for simple words; may be extended for compounds when adaptive_length is enabled. |
| adaptive_length | bool | True | Automatically extend max_code_length for compound words to prevent information loss. |
| chars_per_code_unit | int | 6 | Characters per phonetic code unit for adaptive length calculation (~6 input chars per code unit). |
| cache_size | int | 4096 | Maximum size of the LRU encoding cache. Set to 0 to disable caching. |
| config | PhoneticConfig \| None | None | Optional PhoneticConfig for similarity scoring weights. When provided, encoding params (max_code_length, chars_per_code_unit, cache_size) are taken from config unless explicitly overridden. |
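One plausible reading of how adaptive_length and chars_per_code_unit interact is sketched below. The formula is an assumption inferred from the parameter descriptions above, not the library's actual computation:

```python
import math

def effective_code_length(text: str, max_code_length: int = 10,
                          chars_per_code_unit: int = 6,
                          adaptive_length: bool = True) -> int:
    # Assumed rule: one code unit per ~6 input characters,
    # never dropping below the base limit for simple words.
    if not adaptive_length:
        return max_code_length
    needed = math.ceil(len(text) / chars_per_code_unit)
    return max(max_code_length, needed)
```

Under this reading, short words keep the base limit of 10 while a 90-character compound would get 15 code units, preserving phonetic detail that a fixed cap would truncate.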
Configuration
Controlled by use_phonetic in SpellCheckerConfig.