When generating spelling suggestions, the library needs to know which characters sound alike, look alike, or differ only by tone. These tables power the phonetic hasher, visual confusion detection, and tonal variant generation used throughout the suggestion pipeline.

Overview

from myspellchecker.text.phonetic_data import (
    PHONETIC_GROUPS,
    VISUAL_SIMILAR,
    TONAL_GROUPS,
    COLLOQUIAL_SUBSTITUTIONS,
)

# Check phonetic group for a consonant
labial_consonants = PHONETIC_GROUPS["p"]  # labial consonants list

# Check visual similarity
similar_to_i = VISUAL_SIMILAR["ိ"]  # visually similar characters

Phonetic Groups

Characters grouped by phonetic similarity (same sound category):

Consonant Groups

| Group Key | Name | Characters | IPA |
| p | Labial | ပ, ဖ, ဗ, ဘ | /p, pʰ, b, bʰ/ |
| t | Alveolar | တ, ထ, ဒ, ဓ | /t, tʰ, d, dʰ/ |
| k | Velar | က, ခ, ဂ, ဃ | /k, kʰ, ɡ, ɡʰ/ |
| c | Palatal | စ, ဆ, ဇ, ဈ | /s, sʰ, z, zʰ/ |
| | Retroflex | ဋ, ဌ, ဍ, ဎ | /ʈ, ʈʰ, ɖ, ɖʰ/ |
| s | Sibilant | သ, ဿ | /θ/ |
| h | Glottal | | /h/ |

Nasal Groups

| Group Key | Name | Characters |
| m | Nasal M | |
| n | Nasal N | |
| ng | Nasal NG | |
| ny | Nasal NY | ည, ဉ |
| n_retro | Retroflex N | |

Approximants and Liquids

| Group Key | Name | Characters |
| l | Liquid L | လ, ဠ |
| r | Liquid R | |
| y | Approximant Y | |
| w | Approximant W | |

Medials

| Group Key | Name | Character | Unicode |
| medial_y | Ya-pin | ျ | U+103B |
| medial_r | Ya-yit | ြ | U+103C |
| medial_w | Wa-hswe | ွ | U+103D |
| medial_h | Ha-htoe | ှ | U+103E |

Vowels

| Group Key | Name | Characters | IPA |
| vowel_a | Vowel A | ာ, ါ (U+102B) | /a/ |
| vowel_carrier | Vowel Carrier | အ (U+1021) | /ʔa/ |
| vowel_i | Vowel I | ိ, ီ, ဣ (U+1023), ဤ (U+1024) | /i/ |
| vowel_u | Vowel U | ု, ူ, ဥ (U+1025), ဦ (U+1026) | /u/ |
| vowel_e | Vowel E | ေ, ဧ (U+1027) | /e/ |
| vowel_ai | Vowel AI | | /ɛ/ |
| vowel_o | Vowel O | ဩ (U+1029), ဪ (U+102A) | /o/ |

Tone

| Group Key | Name | Characters | Unicode |
| tone | Tone Marks | ံ, ့, း | U+1036 (Anusvara), U+1037 (Dot Below), U+1038 (Visarga) |
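
As a rough illustration of how these groups can be queried, the sketch below checks whether two characters fall in the same phonetic group. It uses a small stand-in dict copied from three rows of the consonant table, not the real PHONETIC_GROUPS, and assumes the real dict has the same key-to-list shape. The helpers group_of and same_phonetic_group are hypothetical names, not part of the library.

```python
from typing import Dict, List, Optional

# Stand-in data copied from the consonant table (assumption: the real
# PHONETIC_GROUPS maps a group key to a list of characters in the same way).
SAMPLE_GROUPS: Dict[str, List[str]] = {
    "p": ["ပ", "ဖ", "ဗ", "ဘ"],
    "t": ["တ", "ထ", "ဒ", "ဓ"],
    "k": ["က", "ခ", "ဂ", "ဃ"],
}


def group_of(char: str, groups: Dict[str, List[str]]) -> Optional[str]:
    """Return the key of the first group containing char, or None."""
    for key, chars in groups.items():
        if char in chars:
            return key
    return None


def same_phonetic_group(a: str, b: str, groups: Dict[str, List[str]]) -> bool:
    """True when both characters belong to the same known group."""
    group_a = group_of(a, groups)
    return group_a is not None and group_a == group_of(b, groups)
```

With the sample data, the first two labial characters compare as equal while a labial and an alveolar character do not.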

Visual Similarity

Characters that look similar and are commonly confused. They are accessed through the VISUAL_SIMILAR dict, and most pairs are mapped in both directions.

Vowel and Medial Confusions

| Character | Confused With | Description |
| | | Short i vs long ii |
| | | Long ii vs short i |
| | | Short u vs long uu |
| | | Long uu vs short u |
| | | Different aa marks |
| | | Tall AA vs regular AA |
| | | Ya-pin vs ya-yit |
| | | Ya-yit vs ya-pin |
| | | Wa-hswe vs ha-htoe |
| | | Ha-htoe vs wa-hswe |

Consonant Confusions

| Character | Confused With | Description |
| | | Na vs nya |
| | ည, ဉ | Nga vs nya variants |
| | | Ra vs ya |
| | | Ya vs ra |
| | | Sa vs great sa |
| | | Great sa vs sa |
| | | Pa vs ba |
| | | Ba vs pa |
| | | Nya vs archaic nya |
| | | Archaic nya vs nya |

Aspirated vs Unaspirated Pairs

| Unaspirated | Aspirated |
| က | |

Other Confusions

| Character | Confused With | Description |
| | | La vs great la |
| | | Great la vs la |
| | | Wa consonant (U+101D) vs zero digit (U+1040) |
| | | Zero digit (U+1040) vs Wa consonant (U+101D) |
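
Because only most pairs are mapped in both directions, it can be useful to list the one-way entries. The sketch below does this over a small stand-in mapping built from the Wa/zero and La/great la rows; it assumes VISUAL_SIMILAR maps a character to a collection of look-alikes, and one_way_pairs is a hypothetical helper rather than a library function. The reverse entry for great la is left out on purpose so the check has something to report.

```python
from typing import Dict, List, Tuple

# Stand-in mapping (assumption: VISUAL_SIMILAR has a char -> look-alikes shape).
SAMPLE_VISUAL: Dict[str, List[str]] = {
    "\u101D": ["\u1040"],  # Wa consonant -> zero digit
    "\u1040": ["\u101D"],  # zero digit -> Wa consonant
    "\u101C": ["\u1020"],  # La -> great la (reverse entry deliberately omitted)
}


def one_way_pairs(mapping: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (char, look_alike) pairs whose reverse entry is missing."""
    missing = []
    for char, similars in mapping.items():
        for other in similars:
            if char not in mapping.get(other, []):
                missing.append((char, other))
    return missing
```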

Tonal Groups

Characters that differ only by tone and are commonly confused when typing. They are accessed through the TONAL_GROUPS dict.

| Base | Tonal Variants | Category |
| | ာ, ့, း, ား | Vowel A |
| | ါ, ့, း, ါး | Vowel A (tall AA, U+102B) |
| | ိ, ီ, ိ့, ီး | Vowel I |
| | ိ, ီ, ိ့, ီး | Vowel I |
| | ု, ူ, ု့, ူး | Vowel U |
| | ု, ူ, ု့, ူး | Vowel U |
| | ေ, ေ့, ေး | Vowel E |
| | ဲ, ဲ့ | Vowel AI |
| ော | ော, ော့, ော် | Vowel O (combined) |
| | (empty), း | Tone mark (Dot Below → Visarga) |
| | (empty), ့ | Tone mark (Visarga → Dot Below) |

Colloquial Substitutions

Multi-character substitutions found in colloquial/social media text. The COLLOQUIAL_SUBSTITUTIONS dict maps colloquial forms to their standard equivalents (25 entries total).

Particles

| Colloquial | Standard | Description |
| အုန်း | ဦး | Coconut → Particle |
| အုံး | ဦး | Pillow → Particle |

Verb Endings

| Colloquial | Standard | Description |
| ပါဘူး | မပါဘူး | Shortened negation |
| တာပဲ | တာပါပဲ | Shortened emphasis |

Pronouns

| Colloquial | Standard | Description |
| ကျနော် | ကျွန်တော် | Male 1st person (colloquial) |
| ကျွနော် | ကျွန်တော် | Male 1st person (variant) |
| ကျမ | ကျွန်မ | Female 1st person (colloquial) |
| မင်း | သင် | 2nd person (informal → formal) |
| ငါ | ကျွန်တော်, ကျွန်မ | 1st person (very informal) |
| သူတို့ | သူများ | 3rd person plural |

Common Words

| Colloquial | Standard | Description |
| ဟုတ် | ဟုတ်ကဲ့ | Yes (shortened) |
| အို | အိုး | Pot/exclamation (without visarga) |
| အဲ | ထို | That (colloquial → formal) |
| အဲဒါ | ထိုအရာ | That thing (colloquial → formal) |
| ဘယ်လို | မည်သို့ | How (colloquial → formal) |
| ဘာကြောင့် | အဘယ်ကြောင့် | Why (colloquial → formal) |

Adverbs and Reduplication

| Colloquial | Standard | Description |
| တော်တော် | အလွန် | Very (colloquial → formal) |
| သိပ် | အလွန် | Very (colloquial → formal) |
| ရမ်းရမ်း | အလွန် | Very (very colloquial) |
| ကောင်းကောင်း | ကောင်းမွန်စွာ | Well |
| မြန်မြန် | မြန်ဆန်စွာ | Quickly |
| နှေးနှေး | နှေးကွေးစွာ | Slowly |

Contractions and Texting

| Colloquial | Standard | Description |
| လို့ပဲ | ထို့ကြောင့် | Because (contracted) |
| ရင် | လျှင် | If (colloquial → formal) |
| 555 | ဟာဟာဟာ | Laughing (Thai style) |
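
One way such a table might be applied is to scan a token list and collect standard-form suggestions for every colloquial token. The sketch below does this with three stand-in entries taken from the tables above. It assumes the dict maps a colloquial form to a list of standard forms, which may not match the real value type of COLLOQUIAL_SUBSTITUTIONS, and suggest_standard is a hypothetical helper.

```python
from typing import Dict, List

# Stand-in entries from the tables above (assumption: colloquial form -> list
# of standard forms; the real COLLOQUIAL_SUBSTITUTIONS may use another type).
SAMPLE_COLLOQUIAL: Dict[str, List[str]] = {
    "ရင်": ["လျှင်"],
    "သိပ်": ["အလွန်"],
    "ငါ": ["ကျွန်တော်", "ကျွန်မ"],
}


def suggest_standard(
    tokens: List[str], table: Dict[str, List[str]]
) -> Dict[int, List[str]]:
    """Map each token index to its standard forms, skipping unknown tokens."""
    return {i: table[tok] for i, tok in enumerate(tokens) if tok in table}
```

Tokens without an entry are simply absent from the result, so the caller can leave them unchanged.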

Reverse Mapping: STANDARD_TO_COLLOQUIAL

The STANDARD_TO_COLLOQUIAL dictionary is the inverse of COLLOQUIAL_SUBSTITUTIONS. It maps each standard form back to its set of colloquial variants. This is built automatically at module load time.
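
The inversion described above can be sketched as follows. This is not the module's actual code: invert_substitutions is a hypothetical helper, and the input shape (colloquial form mapped to an iterable of standard forms) is an assumption. The sample entries come from the tables in the previous section.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def invert_substitutions(
    colloquial_to_standard: Dict[str, Iterable[str]]
) -> Dict[str, Set[str]]:
    """Build a standard -> {colloquial variants} mapping."""
    inverse: Dict[str, Set[str]] = defaultdict(set)
    for colloquial, standards in colloquial_to_standard.items():
        for standard in standards:
            inverse[standard].add(colloquial)
    return dict(inverse)


# Two colloquial adverbs share one standard form, so they end up in one set.
sample = {"တော်တော်": ["အလွန်"], "သိပ်": ["အလွန်"], "ရင်": ["လျှင်"]}
inverse = invert_substitutions(sample)
```

Using sets on the reverse side keeps several colloquial variants of the same standard form together without duplicates.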

Helper Functions

from myspellchecker.text.phonetic_data import (
    is_colloquial_variant,
    get_standard_forms,
    STANDARD_TO_COLLOQUIAL,
)

is_colloquial_variant("ငါ")       # True
get_standard_forms("unknown")     # set() (empty)
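
The behaviour shown above (True for a known colloquial form, an empty set for an unknown word) is consistent with plain dictionary lookups. The sketch below mirrors that behaviour over a stand-in table. The function names carry a _sketch suffix to make clear they are hypothetical and are not the library's implementations, and the table shape is an assumption.

```python
from typing import Dict, List, Set

# Stand-in table with one entry from the Pronouns table (assumed shape).
SAMPLE_TABLE: Dict[str, List[str]] = {
    "ငါ": ["ကျွန်တော်", "ကျွန်မ"],
}


def is_colloquial_variant_sketch(word: str, table: Dict[str, List[str]]) -> bool:
    """True when the word has an entry in the colloquial table."""
    return word in table


def get_standard_forms_sketch(word: str, table: Dict[str, List[str]]) -> Set[str]:
    """Return the standard forms for a word, or an empty set if unknown."""
    return set(table.get(word, ()))
```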

Usage Examples

PhoneticHasher Integration

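from myspellchecker.text.phonetic_data import PHONETIC_GROUPS

# Simplified illustration of how a hasher can consume PHONETIC_GROUPS.
# The library's own class is importable as myspellchecker.text.phonetic.PhoneticHasher;
# it is not imported here so that this illustrative class does not shadow it.
class PhoneticHasher: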
    def __init__(self):
        # Build reverse mapping from char to group
        self.char_to_group = {}
        for group, chars in PHONETIC_GROUPS.items():
            for char in chars:
                self.char_to_group[char] = group

    def hash(self, word: str) -> str:
        """Generate phonetic hash."""
        result = []
        for char in word:
            if char in self.char_to_group:
                result.append(self.char_to_group[char])
            else:
                result.append(char)
        return "".join(result)

Visual Confusion Detection

from typing import List

from myspellchecker.text.phonetic_data import VISUAL_SIMILAR

def find_visual_variants(word: str) -> List[str]:
    """Generate visually similar variants of a word."""
    variants = []
    for i, char in enumerate(word):
        if char in VISUAL_SIMILAR:
            for similar in VISUAL_SIMILAR[char]:
                variant = word[:i] + similar + word[i+1:]
                variants.append(variant)
    return variants

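# Example
variants = find_visual_variants("ကိုယ်")
# Includes "ကီုယ်" (short i replaced with long ii), plus one variant for each
# other character in the word that has an entry in VISUAL_SIMILAR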

Tonal Variant Generation

from typing import List

from myspellchecker.text.phonetic_data import TONAL_GROUPS

def generate_tonal_variants(word: str) -> List[str]:
    """Generate tonal variants of a word."""
    variants = [word]
    for i, char in enumerate(word):
        if char in TONAL_GROUPS:
            for variant_char in TONAL_GROUPS[char]:
                if variant_char != char:
                    variant = word[:i] + variant_char + word[i+1:]
                    variants.append(variant)
    return variants

# Example
variants = generate_tonal_variants("လာ")
# ["လာ", "လာ့", "လား", ...]

Data Constants

Available constants and functions:
from myspellchecker.text.phonetic_data import (
    PHONETIC_GROUPS,            # Phonetic similarity groups
    VISUAL_SIMILAR,             # Visual confusability mapping
    MYANMAR_SUBSTITUTION_COSTS, # Weighted edit distance costs
    TONAL_GROUPS,               # Tonal variant mappings
    COLLOQUIAL_SUBSTITUTIONS,   # Colloquial -> standard mappings
    STANDARD_TO_COLLOQUIAL,     # Standard -> colloquial reverse mapping
    is_colloquial_variant,      # Check if word is colloquial
    get_standard_forms,         # Get standard forms for colloquial
)
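
This page does not describe the structure of MYANMAR_SUBSTITUTION_COSTS beyond "weighted edit distance costs". As a sketch of the general technique only, the function below computes a Levenshtein distance in which substitutions are priced from a lookup table. The (char, char) -> cost dict shape, the default cost of 1.0, and the name weighted_edit_distance are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def weighted_edit_distance(
    a: str,
    b: str,
    sub_costs: Dict[Tuple[str, str], float],
    default_sub: float = 1.0,
) -> float:
    """Levenshtein distance with per-pair substitution costs.

    Insertions and deletions cost 1.0; a substitution costs
    sub_costs[(x, y)] when present, otherwise default_sub.
    """
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            sub = 0.0 if ca == cb else sub_costs.get((ca, cb), default_sub)
            curr.append(
                min(
                    prev[j] + 1.0,       # delete ca
                    curr[j - 1] + 1.0,   # insert cb
                    prev[j - 1] + sub,   # substitute ca -> cb
                )
            )
        prev = curr
    return prev[-1]
```

A table that prices commonly confused pairs below 1.0 makes candidates reached through those confusions rank closer to the input than candidates reached through arbitrary substitutions.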

Phoneme-Grapheme Notes

E vs AI Vowels

The module correctly distinguishes:
  • ေ (U+1031) - E vowel, IPA /e/, prefix position
  • ဲ (U+1032) - AI vowel, IPA /ɛ/, suffix position
These are phonetically distinct and should NOT be treated as interchangeable.
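
The two code points can be confirmed with Python's standard unicodedata module, which is a quick way to check which vowel sign a string actually contains. The constant names below are illustrative only.

```python
import unicodedata

E_VOWEL = "\u1031"   # E vowel sign
AI_VOWEL = "\u1032"  # AI vowel sign

# Print each code point next to its official Unicode character name.
for ch in (E_VOWEL, AI_VOWEL):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```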

Aspirated vs Voiced

Each consonant group contains unaspirated, aspirated, and voiced variants:
  • For example, in the labial group: ပ (unaspirated) vs ဖ (aspirated) vs ဗ (voiced)
  • These sound similar and are often confused

See Also