When generating spelling suggestions, the library needs to know which characters sound alike, look alike, or differ only by tone. These tables power the phonetic hasher, visual confusion detection, and tonal variant generation used throughout the suggestion pipeline.

Overview

from myspellchecker.text.phonetic_data import (
    PHONETIC_GROUPS,
    VISUAL_SIMILAR,
    TONAL_GROUPS,
    COLLOQUIAL_SUBSTITUTIONS,
)

# Check phonetic group for a consonant
labial_consonants = PHONETIC_GROUPS["p"]  # ["ပ", "ဖ", "ဗ", "ဘ"]

# Check visual similarity
similar_to_i = VISUAL_SIMILAR["ိ"]  # {"ီ"}

Phonetic Groups

Characters grouped by phonetic similarity (same sound category):

Consonant Groups

| Group Key | Name | Characters | IPA |
|---|---|---|---|
| p | Labial | ပ, ဖ, ဗ, ဘ | /p, pʰ, b, bʰ/ |
| t | Alveolar | တ, ထ, ဒ, ဓ | /t, tʰ, d, dʰ/ |
| k | Velar | က, ခ, ဂ, ဃ | /k, kʰ, ɡ, ɡʰ/ |
| c | Palatal | စ, ဆ, ဇ, ဈ | /s, sʰ, z, zʰ/ |
| | Retroflex | ဋ, ဌ, ဍ, ဎ | /ʈ, ʈʰ, ɖ, ɖʰ/ |

Nasal Groups

| Group Key | Name | Characters |
|---|---|---|
| m | Nasal M | |
| n | Nasal N | |
| ng | Nasal NG | |
| ny | Nasal NY | ည, ဉ |
| n_retro | Retroflex N | |

Approximants and Liquids

| Group Key | Name | Characters |
|---|---|---|
| l | Liquid L | လ, ဠ |
| r | Liquid R | |
| y | Approximant Y | |
| w | Approximant W | |

Medials

| Group Key | Name | Character | Unicode |
|---|---|---|---|
| medial_y | Ya-pin | ျ | U+103B |
| medial_r | Ya-yit | ြ | U+103C |
| medial_w | Wa-swe | ွ | U+103D |
| medial_h | Ha-htoe | ှ | U+103E |
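The code points in the medials table can be checked against the Unicode character database that ships with Python. The sketch below is a sanity check only: `MEDIAL_CODE_POINTS` and `describe_medials` are illustrative names, not part of `myspellchecker`.

```python
import unicodedata
from typing import Dict

# Code points copied from the medials table above (hypothetical helper table,
# not a myspellchecker constant).
MEDIAL_CODE_POINTS: Dict[str, int] = {
    "medial_y": 0x103B,
    "medial_r": 0x103C,
    "medial_w": 0x103D,
    "medial_h": 0x103E,
}


def describe_medials() -> Dict[str, str]:
    """Map each group key to the official Unicode name of its code point."""
    return {
        key: unicodedata.name(chr(code_point))
        for key, code_point in MEDIAL_CODE_POINTS.items()
    }


for key, name in describe_medials().items():
    print(f"{key}: {name}")
```

The Unicode names identify U+103B as the medial YA sign and U+103C as the medial RA sign, which matches the `medial_y` and `medial_r` group keys.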

Vowels

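| Group Key | Name | Characters | IPA |
|---|---|---|---|
| vowel_a | Vowel A | ာ, ါ, အ | /a/ |
| vowel_i | Vowel I | ိ, ီ | /i/ |
| vowel_u | Vowel U | ု, ူ | /u/ |
| vowel_e | Vowel E | | /e/ |
| vowel_ai | Vowel AI | | /ɛ/ |
| vowel_o | Vowel O | ော, ော် | /o/ |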
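A common use of these groups is asking whether two characters belong to the same sound category. The following is a minimal sketch, assuming each group maps a key to a list of characters as in the Overview example. `SAMPLE_GROUPS`, `group_of`, and `same_group` are hypothetical names, and the sample holds only a subset of the rows listed above.

```python
from typing import Dict, List, Optional

# Subset copied from the tables above; the real PHONETIC_GROUPS in
# myspellchecker.text.phonetic_data may contain more entries.
SAMPLE_GROUPS: Dict[str, List[str]] = {
    "p": ["ပ", "ဖ", "ဗ", "ဘ"],
    "t": ["တ", "ထ", "ဒ", "ဓ"],
    "k": ["က", "ခ", "ဂ", "ဃ"],
    "vowel_i": ["ိ", "ီ"],
    "vowel_u": ["ု", "ူ"],
}


def group_of(char: str, groups: Dict[str, List[str]]) -> Optional[str]:
    """Return the key of the first group containing char, or None."""
    for key, members in groups.items():
        if char in members:
            return key
    return None


def same_group(a: str, b: str, groups: Dict[str, List[str]] = SAMPLE_GROUPS) -> bool:
    """True when both characters are listed in the same phonetic group."""
    group_a = group_of(a, groups)
    return group_a is not None and group_a == group_of(b, groups)


print(same_group("ပ", "ဗ"))  # both labial
print(same_group("ပ", "တ"))  # labial vs alveolar
```

Characters that appear in no group are never treated as similar, so unknown input does not produce false matches.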

Visual Similarity

Characters that look similar and are commonly confused:
VISUAL_SIMILAR = {
    "ိ": {"ီ"},   # short i vs long ii
    "ု": {"ူ"},   # short u vs long uu
    "ာ": {"ါ"},   # different aa marks
    "ျ": {"ြ"},   # ya-pin vs ya-yit
    "န": {"ည"},   # na vs nya
    "င": {"ည", "ဉ"},  # nga vs nya variants
    "ရ": {"ယ"},   # ra vs ya
    "ယ": {"ရ"},   # ya vs ra (bidirectional)
    "သ": {"ဿ"},   # sa vs great sa
    "ဿ": {"သ"},   # great sa vs sa (bidirectional)
    "ပ": {"ဗ"},   # pa vs ba
    "ဗ": {"ပ"},   # ba vs pa (bidirectional)
    "ည": {"ဉ"},   # nya vs archaic nya
    "ဉ": {"ည"},   # archaic nya vs nya (bidirectional)
    # Aspirated vs unaspirated pairs
    "က": {"ခ"}, "ခ": {"က"},
    "ဂ": {"ဃ"}, "ဃ": {"ဂ"},
    "စ": {"ဆ"}, "ဆ": {"စ"},
    "တ": {"ထ"}, "ထ": {"တ"},
    "ဒ": {"ဓ"}, "ဓ": {"ဒ"},
    "ဖ": {"ဘ"}, "ဘ": {"ဖ"},
    # Retroflex pairs
    "ဋ": {"ဌ"}, "ဌ": {"ဋ"},
    "ဍ": {"ဎ"}, "ဎ": {"ဍ"},
    # LA variants
    "လ": {"ဠ"}, "ဠ": {"လ"},
    # Medial confusions
    "ွ": {"ှ"}, "ှ": {"ွ"},
}

Bidirectional Mappings

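Most visually similar pairs are mapped in both directions, so a lookup works whichever character was typed. In the table above, a few entries are listed in one direction only: ိ → ီ, ု → ူ, ာ → ါ, ျ → ြ, န → ည, and င → ည/ဉ.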
# Both directions are mapped for most pairs
VISUAL_SIMILAR["ရ"] = {"ယ"}
VISUAL_SIMILAR["ယ"] = {"ရ"}
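Because only most pairs are bidirectional, it can help to list the entries that lack a reverse mapping. The sketch below assumes the dict-of-sets shape shown in the Visual Similarity section. `SAMPLE_VISUAL` is a small excerpt and `one_way_pairs` is a hypothetical helper, not a library function.

```python
from typing import Dict, List, Set, Tuple

# Excerpt of the VISUAL_SIMILAR table shown above.
SAMPLE_VISUAL: Dict[str, Set[str]] = {
    "ိ": {"ီ"},
    "ရ": {"ယ"},
    "ယ": {"ရ"},
    "င": {"ည", "ဉ"},
    "ည": {"ဉ"},
    "ဉ": {"ည"},
}


def one_way_pairs(table: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (source, target) pairs whose reverse mapping is absent."""
    missing = []
    for source, targets in table.items():
        for target in targets:
            # The pair is one-way if target has no entry pointing back to source.
            if source not in table.get(target, set()):
                missing.append((source, target))
    return sorted(missing)


for source, target in one_way_pairs(SAMPLE_VISUAL):
    print(f"{source} -> {target} has no reverse entry")
```

Whether a one-way entry is intentional depends on the confusion being modeled, so the output is a list to review rather than a list of errors.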

Tonal Groups

Characters that differ by tone, commonly confused in typing:
TONAL_GROUPS = {
    # Vowel 'a' variants
    "ာ": ["ာ", "့", "း", "ား", ""],

    # Vowel 'i' variants
    "ိ": ["ိ", "ီ", "ိ့", "ီး"],
    "ီ": ["ိ", "ီ", "ိ့", "ီး"],

    # Vowel 'u' variants
    "ု": ["ု", "ူ", "ု့", "ူး"],
    "ူ": ["ု", "ူ", "ု့", "ူး"],

    # Vowel 'e' variants
    "ေ": ["ေ", "ေ့", "ေး"],
    "ဲ": ["ဲ", "ဲ့"],

    # Tone marks
    "့": ["", "း"],  # Dot Below -> empty or Visarga
    "း": ["", "့"],  # Visarga -> empty or Dot Below
}

Colloquial Substitutions

Multi-character substitutions found in colloquial/social media text:
COLLOQUIAL_SUBSTITUTIONS = {
    "အုန်း": {"ဦး"},  # Coconut -> Particle
    "အုံး": {"ဦး"},   # Pillow -> Particle
}
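The mapping can be read in either direction: from a colloquial spelling to its standard forms, or from a standard form back to the colloquial spellings that stand in for it. The sketch below shows both lookups on a copy of the table. `standard_forms_for` and `colloquial_variants_for` are hypothetical stand-ins written for this example; the signatures of the library's own helpers are not shown in this page and are not assumed here.

```python
from typing import Dict, Set

# Copy of the table above, used so the example runs without the library.
SAMPLE_COLLOQUIAL: Dict[str, Set[str]] = {
    "အုန်း": {"ဦး"},
    "အုံး": {"ဦး"},
}


def standard_forms_for(word: str, table: Dict[str, Set[str]] = SAMPLE_COLLOQUIAL) -> Set[str]:
    """Standard forms for a colloquial spelling (empty set if unknown)."""
    return set(table.get(word, set()))


def colloquial_variants_for(standard: str, table: Dict[str, Set[str]] = SAMPLE_COLLOQUIAL) -> Set[str]:
    """Colloquial spellings that map to the given standard form."""
    return {colloquial for colloquial, forms in table.items() if standard in forms}


for colloquial in SAMPLE_COLLOQUIAL:
    print(colloquial, "->", standard_forms_for(colloquial))
```

The reverse lookup scans the whole table, which is fine for a table this small; a larger table would warrant a precomputed inverse index.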

Usage Examples

PhoneticHasher Integration

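from myspellchecker.text.phonetic_data import PHONETIC_GROUPS

# Simplified sketch of how a hasher can use PHONETIC_GROUPS. The library's own
# PhoneticHasher is importable from myspellchecker.text.phonetic.
# Only single-character group members are matched here, because hash() walks
# the word one code point at a time.
class PhoneticHasher: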
    def __init__(self):
        # Build reverse mapping from char to group
        self.char_to_group = {}
        for group, chars in PHONETIC_GROUPS.items():
            for char in chars:
                self.char_to_group[char] = group

    def hash(self, word: str) -> str:
        """Generate phonetic hash."""
        result = []
        for char in word:
            if char in self.char_to_group:
                result.append(self.char_to_group[char])
            else:
                result.append(char)
        return "".join(result)
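Once words hash to group keys, candidate lookup reduces to grouping a word list by hash. The following is a minimal sketch of that idea under the same per-character hashing as above. `SAMPLE_GROUPS`, `phonetic_hash`, `build_hash_index`, and `candidates` are illustrative names, and the sample strings are chosen only to exercise the groups.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Subset of the groups listed earlier; not the full library table.
SAMPLE_GROUPS: Dict[str, List[str]] = {
    "k": ["က", "ခ", "ဂ", "ဃ"],
    "t": ["တ", "ထ", "ဒ", "ဓ"],
    "vowel_a": ["ာ", "ါ", "အ"],
}

# Reverse mapping from character to group key.
CHAR_TO_GROUP: Dict[str, str] = {
    char: key for key, chars in SAMPLE_GROUPS.items() for char in chars
}


def phonetic_hash(word: str) -> str:
    """Replace each grouped character with its group key; keep the rest."""
    return "".join(CHAR_TO_GROUP.get(char, char) for char in word)


def build_hash_index(words: Iterable[str]) -> Dict[str, List[str]]:
    """Group words by phonetic hash so look-alike sounds share a bucket."""
    index: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        index[phonetic_hash(word)].append(word)
    return dict(index)


def candidates(word: str, index: Dict[str, List[str]]) -> List[str]:
    """Other indexed words that share the input word's hash."""
    return [w for w in index.get(phonetic_hash(word), []) if w != word]


index = build_hash_index(["ကာ", "ခါ", "တာ"])
print(candidates("ကာ", index))
```

Building the index once and reusing it keeps each lookup to a single dictionary access plus a short list scan.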

Visual Confusion Detection

from typing import List

from myspellchecker.text.phonetic_data import VISUAL_SIMILAR

def find_visual_variants(word: str) -> List[str]:
    """Generate visually similar variants of a word."""
    variants = []
    for i, char in enumerate(word):
        if char in VISUAL_SIMILAR:
            for similar in VISUAL_SIMILAR[char]:
                variant = word[:i] + similar + word[i+1:]
                variants.append(variant)
    return variants

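# Example
variants = find_visual_variants("ကိုယ်")
# With the VISUAL_SIMILAR table shown above, four characters in this word
# have entries (က, ိ, ု, ယ), so four variants come back, including
# "ကီုယ်" (short i replaced with long ii).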

Tonal Variant Generation

from typing import List

from myspellchecker.text.phonetic_data import TONAL_GROUPS

def generate_tonal_variants(word: str) -> List[str]:
    """Generate tonal variants of a word."""
    variants = [word]
    for i, char in enumerate(word):
        if char in TONAL_GROUPS:
            for variant_char in TONAL_GROUPS[char]:
                if variant_char != char:
                    variant = word[:i] + variant_char + word[i+1:]
                    variants.append(variant)
    return variants

# Example
variants = generate_tonal_variants("လာ")
# With the TONAL_GROUPS table shown above, each alternative replaces "ာ":
# ["လာ", "လ့", "လး", "လား", "လ"]

Data Constants

Available constants and functions:
from myspellchecker.text.phonetic_data import (
    PHONETIC_GROUPS,            # Phonetic similarity groups
    VISUAL_SIMILAR,             # Visual confusability mapping
    MYANMAR_SUBSTITUTION_COSTS, # Weighted edit distance costs
    TONAL_GROUPS,               # Tonal variant mappings
    COLLOQUIAL_SUBSTITUTIONS,   # Colloquial -> standard mappings
    is_colloquial_variant,      # Check if word is colloquial
    get_standard_forms,         # Get standard forms for colloquial
    get_colloquial_variants,    # Get colloquial variants for standard
)
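`MYANMAR_SUBSTITUTION_COSTS` is described only as weighted edit distance costs, so its exact structure is not documented here. The sketch below shows how such costs typically feed a weighted Levenshtein distance. It assumes a nested `char -> char -> cost` mapping, and the cost values are made up for illustration; `SAMPLE_COSTS`, `substitution_cost`, and `weighted_edit_distance` are hypothetical names.

```python
from typing import Dict

# ASSUMPTION: nested mapping with illustrative values. The real
# MYANMAR_SUBSTITUTION_COSTS may use a different shape and different costs.
SAMPLE_COSTS: Dict[str, Dict[str, float]] = {
    "ိ": {"ီ": 0.3},
    "ီ": {"ိ": 0.3},
    "က": {"ခ": 0.5},
    "ခ": {"က": 0.5},
}


def substitution_cost(a: str, b: str, costs: Dict[str, Dict[str, float]]) -> float:
    """Zero for identical characters, table cost if listed, otherwise 1.0."""
    if a == b:
        return 0.0
    return costs.get(a, {}).get(b, 1.0)


def weighted_edit_distance(
    source: str, target: str, costs: Dict[str, Dict[str, float]] = SAMPLE_COSTS
) -> float:
    """Levenshtein distance with per-pair substitution costs."""
    # prev[j] holds the distance from the processed source prefix to target[:j].
    prev = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        curr = [float(i)]
        for j, t_char in enumerate(target, start=1):
            curr.append(
                min(
                    prev[j] + 1.0,  # delete s_char
                    curr[j - 1] + 1.0,  # insert t_char
                    prev[j - 1] + substitution_cost(s_char, t_char, costs),
                )
            )
        prev = curr
    return prev[-1]


print(weighted_edit_distance("က", "ခ"))
print(weighted_edit_distance("က", "တ"))
```

Confusable pairs cost less than an arbitrary substitution, so suggestions that differ only by a commonly confused character rank ahead of unrelated words.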

Phoneme-Grapheme Notes

E vs AI Vowels

The module correctly distinguishes:
  • ေ (U+1031) - E vowel, IPA /e/, prefix position
  • ဲ (U+1032) - AI vowel, IPA /ɛ/, suffix position
These are phonetically distinct and should NOT be treated as interchangeable.

Aspirated vs Voiced

Consonant groups contain unaspirated, aspirated, and voiced variants:
  • ပ (unaspirated) vs ဖ (aspirated) vs ဗ (voiced) in the labial group
  • These sound similar and are often confused

See Also