
Phonetic Hasher

mySpellChecker includes a custom PhoneticHasher optimized for Myanmar phonology. It converts a string into a phonetic code, allowing for fuzzy matching based on pronunciation.

Key Features

  • Consonant Grouping: Maps similar-sounding consonants to the same group code (e.g., members of KA_GROUP).
  • Tone Normalization: Can ignore tone marks so that tonal variants (e.g., ကာ, ကား, က) share a hash, enabling detection of tonal errors.
  • Vowel Normalization: Treats short and long vowels as identical.
  • Adaptive Length Encoding: Automatically extends the code length for compound words to preserve phonetic information.
  • Nasal Normalization: Optionally unifies various nasal endings (e.g., န်, မ်, င်) to the Anusvara ( ံ ) sound. Note: normalize_nasals defaults to False; you must explicitly enable it if needed.
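As an illustration, the grouping and nasal-normalization behavior can be sketched with toy tables. The group names and mappings below are illustrative stand-ins, not the library's actual data:

```python
import unicodedata

# Illustrative group tables -- NOT the library's actual mappings.
CONSONANT_GROUPS = {
    "က": "KA_GROUP", "ခ": "KA_GROUP", "ဂ": "KA_GROUP",
    "မ": "MA_GROUP", "ဗ": "MA_GROUP",
}
# Nasal endings unified to the anusvara sound (opt-in in the real API).
NASAL_ENDINGS = {"န်": "ံ", "မ်": "ံ", "င်": "ံ"}

def group_consonant(ch: str) -> str:
    """Return the phonetic group for a consonant, or the char itself."""
    return CONSONANT_GROUPS.get(ch, ch)

def normalize_nasals(word: str) -> str:
    """Replace nasal endings with the anusvara, mirroring normalize_nasals=True."""
    for ending, anusvara in NASAL_ENDINGS.items():
        word = word.replace(ending, anusvara)
    return unicodedata.normalize("NFC", word)
```

With these toy tables, က and ခ collapse to the same group, and မြန် and မြမ် normalize to the same nasal form.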

Python API

from myspellchecker.text.phonetic import PhoneticHasher

hasher = PhoneticHasher()

# 1. Encoding
code1 = hasher.encode("မြန်")  # -> 'p-medial_r-vowel_a-n'
code2 = hasher.encode("မျန်")  # -> 'p-medial_y-vowel_a-n' (Note difference)

# 2. Similarity Check
is_similar = hasher.similar(code1, code2, max_distance=1)
# True (Small edit distance in phonetic space)

# 3. Finding Variants
variants = hasher.get_phonetic_variants("မြန်")
# {'မြန်', 'မျန်', 'ဗြန်', ...}

# 4. Tonal Variants (for Real-Word Error detection)
tonal_vars = hasher.get_tonal_variants("ကား")
# {'ကား', 'ကာ', 'က'}

How It Works

Encoding Process

  1. Normalization: Text is converted to Unicode NFC form.
  2. Preprocessing: If normalize_nasals=True, common nasal endings (န်, မ်, င်) are normalized to the Anusvara ( ံ ). (Disabled by default.)
  3. Mapping: Each character is mapped to a phonetic group code (e.g., KA_GROUP, MEDIAL_R).
  4. Filtering: Tone marks and viramas are optionally stripped.
  5. Concatenation: Codes are joined to form the final hash.
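The five steps above can be sketched as a single function. The mapping table here is a toy stand-in for the library's real group data:

```python
import unicodedata

# Illustrative tables; the real PhoneticHasher uses richer mappings.
GROUPS = {
    "က": "ka", "ခ": "ka", "မ": "ma",
    "ြ": "medial_r", "ျ": "medial_y", "ာ": "vowel_a",
}
TONE_MARKS = {"း", "့"}
VIRAMA = "်"

def encode(text: str, strip_tones: bool = True) -> str:
    """Toy version of the encoding pipeline: NFC -> map -> filter -> join."""
    text = unicodedata.normalize("NFC", text)      # 1. normalization
    codes = []
    for ch in text:                                # 3. mapping
        if strip_tones and (ch in TONE_MARKS or ch == VIRAMA):
            continue                               # 4. filtering
        codes.append(GROUPS.get(ch, ch))
    return "-".join(codes)                         # 5. concatenation
```

Note how stripping the tone mark makes ကာ and ကား hash identically, while the medial ြ vs ျ difference is preserved.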

Scoring

The find_phonetically_similar method delegates to compute_phonetic_similarity, which uses a multi-factor scoring approach:
  1. Character-level similarity: Compares characters pairwise using Myanmar substitution costs (MYANMAR_SUBSTITUTION_COSTS), visual confusability, and phonetic group membership.
  2. Length penalty: Proportional penalty for length differences: (max_len - min_len) / max_len * 0.2.
  3. Phonetic code blending: Levenshtein distance on phonetic codes is blended with character-level similarity, with code weight scaled by input length (min(0.4, len / 20.0)).
The final score blends these factors:

    Score = (1 - w) × CharSimilarity + w × CodeSimilarity - LengthPenalty

where CharSimilarity is the average per-character similarity using substitution costs, CodeSimilarity = 1 - Levenshtein(Code_A, Code_B) / MaxLen, and w = min(0.4, len(input) / 20.0).
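A minimal sketch of this blended score; char_similarity is passed in as a stand-in for the substitution-cost comparison the library performs internally:

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance over two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def phonetic_score(word_a: str, word_b: str,
                   code_a: str, code_b: str,
                   char_similarity: float) -> float:
    """Blend char-level and code-level similarity, minus a length penalty."""
    max_len = max(len(word_a), len(word_b))
    min_len = min(len(word_a), len(word_b))
    length_penalty = (max_len - min_len) / max_len * 0.2
    w = min(0.4, len(word_a) / 20.0)               # code weight scales with input length
    code_max = max(len(code_a), len(code_b)) or 1
    code_similarity = 1 - levenshtein(code_a, code_b) / code_max
    return (1 - w) * char_similarity + w * code_similarity - length_penalty
```

For two identical-length words with perfect character similarity and identical codes, the score is exactly 1.0; a length mismatch subtracts up to 0.2.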

Usage in Spell Checking

Note: Phonetic hashing is computed at runtime, not stored in the database schema. There is no phonetic_hash column in the database tables. Hashes are generated on-the-fly using the PhoneticHasher during lookup.
  1. Lookup: When a word is unknown (OOV), the system computes its phonetic hash at runtime.
  2. Comparison: The hash is compared against hashes computed for dictionary candidates (from SymSpell delete index).
  3. Suggestion: Candidates are matched by:
    • Exact Hash Match: Words that sound identical.
    • Near Hash Match: Words that sound very similar (e.g., slight medial difference).
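The lookup flow above can be sketched with a toy hash function in place of the real PhoneticHasher; the bucketing thresholds and the hash itself are illustrative only:

```python
def toy_hash(word: str) -> str:
    """Toy phonetic hash: drop tone marks so tonal variants collide."""
    return "".join(ch for ch in word if ch not in "း့")

def suggest(oov: str, candidates: list[str], max_distance: int = 1) -> dict:
    """Bucket dictionary candidates into exact and near phonetic-hash matches."""
    target = toy_hash(oov)                   # hash computed at lookup time, not stored
    exact, near = [], []
    for cand in candidates:
        h = toy_hash(cand)
        if h == target:
            exact.append(cand)               # sounds identical
        elif abs(len(h) - len(target)) <= max_distance and \
                sum(a != b for a, b in zip(h, target)) <= max_distance:
            near.append(cand)                # sounds very similar
    return {"exact": exact, "near": near}
```

Here ကား hashes identically to ကာ (exact match), while ကျာ differs by one medial (near match).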

Configuration

Phonetic matching is enabled or disabled via the use_phonetic flag in SpellCheckerConfig.
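A hypothetical usage sketch; the import path and constructor keyword are assumed from the flag name and should be checked against the actual SpellCheckerConfig signature:

```python
# Assumed API shape -- verify against the real SpellCheckerConfig.
from myspellchecker import SpellCheckerConfig

config = SpellCheckerConfig(use_phonetic=True)
```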