Phonetic Data - mySpellChecker

When generating spelling suggestions, the library needs to know which characters sound alike, look alike, or differ only by tone. These tables power the phonetic hasher, visual confusion detection, and tonal variant generation used throughout the suggestion pipeline.

Overview

from myspellchecker.text.phonetic_data import (
    PHONETIC_GROUPS,
    VISUAL_SIMILAR,
    TONAL_GROUPS,
    COLLOQUIAL_SUBSTITUTIONS,
)

# Check phonetic group for a consonant
labial_consonants = PHONETIC_GROUPS["p"]  # labial consonants list

# Check visual similarity
similar_to_i = VISUAL_SIMILAR["ိ"]  # visually similar characters

Phonetic Groups

Characters grouped by phonetic similarity (same sound category):

Consonant Groups

Group Key	Name	Characters	IPA
`p`	Labial	ပ, ဖ, ဗ, ဘ	/p, pʰ, b, bʰ/
`t`	Alveolar	တ, ထ, ဒ, ဓ	/t, tʰ, d, dʰ/
`k`	Velar	က, ခ, ဂ, ဃ	/k, kʰ, ɡ, ɡʰ/
`c`	Palatal	စ, ဆ, ဇ, ဈ	/s, sʰ, z, zʰ/
`ṭ`	Retroflex	ဋ, ဌ, ဍ, ဎ	/ʈ, ʈʰ, ɖ, ɖʰ/
`s`	Sibilant	သ, ဿ	/θ/
`h`	Glottal	ဟ	/h/

Nasal Groups

Group Key	Name	Characters
`m`	Nasal M	မ
`n`	Nasal N	န
`ng`	Nasal NG	င
`ny`	Nasal NY	ည, ဉ
`n_retro`	Retroflex N	ဏ

Approximants and Liquids

Group Key	Name	Characters
`l`	Liquid L	လ, ဠ
`r`	Liquid R	ရ
`y`	Approximant Y	ယ
`w`	Approximant W	ဝ

Medials

Group Key	Name	Character	Unicode
`medial_y`	Ya-pin	ျ	U+103B
`medial_r`	Ya-yit	ြ	U+103C
`medial_w`	Wa-hswe	ွ	U+103D
`medial_h`	Ha-htoe	ှ	U+103E

Vowels

Group Key	Name	Characters	IPA
`vowel_a`	Vowel A	ာ, ါ (U+102B)	/a/
`vowel_carrier`	Vowel Carrier	အ (U+1021)	/ʔa/
`vowel_i`	Vowel I	ိ, ီ, ဣ (U+1023), ဤ (U+1024)	/i/
`vowel_u`	Vowel U	ု, ူ, ဥ (U+1025), ဦ (U+1026)	/u/
`vowel_e`	Vowel E	ေ, ဧ (U+1027)	/e/
`vowel_ai`	Vowel AI	ဲ	/ɛ/
`vowel_o`	Vowel O	ဩ (U+1029), ဪ (U+102A)	/o/

Tone

Group Key	Name	Characters	Unicode
`tone`	Tone Marks	ံ, ့, း	U+1036 (Anusvara), U+1037 (Dot Below), U+1038 (Visarga)
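
The group tables above can also be queried directly, for example to ask whether two characters fall into the same sound category. The following is a minimal sketch, assuming PHONETIC_GROUPS maps each group key to a sequence of characters; `PHONETIC_GROUPS_EXCERPT`, `groups_of`, and `sound_alike` are illustrative names, not part of the library.

```python
# Stand-in excerpt copied from the tables above. The real PHONETIC_GROUPS
# is assumed (not confirmed) to map a group key to a sequence of characters.
PHONETIC_GROUPS_EXCERPT = {
    "p": ["ပ", "ဖ", "ဗ", "ဘ"],
    "t": ["တ", "ထ", "ဒ", "ဓ"],
    "k": ["က", "ခ", "ဂ", "ဃ"],
    "vowel_i": ["ိ", "ီ"],
}


def groups_of(char, groups=PHONETIC_GROUPS_EXCERPT):
    """Return the set of group keys whose character list contains char."""
    return {key for key, chars in groups.items() if char in chars}


def sound_alike(a, b, groups=PHONETIC_GROUPS_EXCERPT):
    """True when the two characters share at least one phonetic group."""
    return bool(groups_of(a, groups) & groups_of(b, groups))


print(sound_alike("ပ", "ဗ"))  # True: both are in the labial group "p"
print(sound_alike("ပ", "တ"))  # False: labial vs alveolar
```

Swapping the excerpt for the imported PHONETIC_GROUPS should work unchanged if the real constant has the assumed shape.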

Visual Similarity

Characters that look similar and are commonly confused, exposed through the VISUAL_SIMILAR dict. Most pairs are mapped in both directions; a few in the tables below, such as န → ည, run one way only.

Vowel and Medial Confusions

Character	Confused With	Description
ိ	ီ	Short i vs long ii
ီ	ိ	Long ii vs short i
ု	ူ	Short u vs long uu
ူ	ု	Long uu vs short u
ာ	ါ	Regular AA vs tall AA
ါ	ာ	Tall AA vs regular AA
ျ	ြ	Ya-pin vs ya-yit
ြ	ျ	Ya-yit vs ya-pin
ွ	ှ	Wa-hswe vs ha-htoe
ှ	ွ	Ha-htoe vs wa-hswe

Consonant Confusions

Character	Confused With	Description
န	ည	Na vs nya
င	ည, ဉ	Nga vs nya variants
ရ	ယ	Ra vs ya
ယ	ရ	Ya vs ra
သ	ဿ	Sa vs great sa
ဿ	သ	Great sa vs sa
ပ	ဗ	Pa vs ba
ဗ	ပ	Ba vs pa
ည	ဉ	Nya vs archaic nya
ဉ	ည	Archaic nya vs nya

Aspirated vs Unaspirated Pairs

Unaspirated	Aspirated
က	ခ
ဂ	ဃ
စ	ဆ
တ	ထ
ဒ	ဓ
ဖ	ဘ
ဋ	ဌ
ဍ	ဎ

Other Confusions

Character	Confused With	Description
လ	ဠ	La vs great la
ဠ	လ	Great la vs la
ဝ	၀	Wa consonant (U+101D) vs zero digit (U+1040)
၀	ဝ	Zero digit (U+1040) vs Wa consonant (U+101D)
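
Since not every pair in the tables above is listed in both directions, it can be useful to find the one-way entries. The following is a minimal sketch, assuming VISUAL_SIMILAR maps a character to a list of look-alikes; `VISUAL_SIMILAR_EXCERPT` and `one_way_pairs` are illustrative names, and the excerpt only copies a few rows from the tables.

```python
# Stand-in excerpt copied from the confusion tables above. The real
# VISUAL_SIMILAR is assumed to map a character to a list of look-alikes.
VISUAL_SIMILAR_EXCERPT = {
    "ိ": ["ီ"],
    "ီ": ["ိ"],
    "န": ["ည"],
    "င": ["ည", "ဉ"],
    "ည": ["ဉ"],
    "ဉ": ["ည"],
}


def one_way_pairs(mapping):
    """Return (source, target) pairs whose reverse direction is missing."""
    return sorted(
        (src, dst)
        for src, targets in mapping.items()
        for dst in targets
        if src not in mapping.get(dst, ())
    )


# In this excerpt the one-way pairs are န → ည, င → ည and င → ဉ.
print(one_way_pairs(VISUAL_SIMILAR_EXCERPT))
```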

Tonal Groups

Characters that differ only by tone and are commonly confused in typing, exposed through the TONAL_GROUPS dict.

Base	Tonal Variants	Category
ာ	ာ, ့, း, ား	Vowel A
ါ	ါ, ့, း, ါး	Vowel A (tall AA, U+102B)
ိ	ိ, ီ, ိ့, ီး	Vowel I
ီ	ိ, ီ, ိ့, ီး	Vowel I
ု	ု, ူ, ု့, ူး	Vowel U
ူ	ု, ူ, ု့, ူး	Vowel U
ေ	ေ, ေ့, ေး	Vowel E
ဲ	ဲ, ဲ့	Vowel AI
ော	ော, ော့, ော်	Vowel O (combined)
့	(empty), း	Tone mark (Dot Below → Visarga)
း	(empty), ့	Tone mark (Visarga → Dot Below)

Colloquial Substitutions

Multi-character substitutions found in colloquial/social media text. The COLLOQUIAL_SUBSTITUTIONS dict maps colloquial forms to their standard equivalents (25 entries total).

Particles

Colloquial	Standard	Description
အုန်း	ဦး	Coconut → Particle
အုံး	ဦး	Pillow → Particle

Verb Endings

Colloquial	Standard	Description
ပါဘူး	မပါဘူး	Shortened negation
တာပဲ	တာပါပဲ	Shortened emphasis

Pronouns

Colloquial	Standard	Description
ကျနော်	ကျွန်တော်	Male 1st person (colloquial)
ကျွနော်	ကျွန်တော်	Male 1st person (variant)
ကျမ	ကျွန်မ	Female 1st person (colloquial)
မင်း	သင်	2nd person (informal → formal)
ငါ	ကျွန်တော်, ကျွန်မ	1st person (very informal)
သူတို့	သူများ	3rd person plural

Common Words

Colloquial	Standard	Description
ဟုတ်	ဟုတ်ကဲ့	Yes (shortened)
အို	အိုး	Pot/exclamation (without visarga)
အဲ	ထို	That (colloquial → formal)
အဲဒါ	ထိုအရာ	That thing (colloquial → formal)
ဘယ်လို	မည်သို့	How (colloquial → formal)
ဘာကြောင့်	အဘယ်ကြောင့်	Why (colloquial → formal)

Adverbs and Reduplication

Colloquial	Standard	Description
တော်တော်	အလွန်	Very (colloquial → formal)
သိပ်	အလွန်	Very (colloquial → formal)
ရမ်းရမ်း	အလွန်	Very (very colloquial)
ကောင်းကောင်း	ကောင်းမွန်စွာ	Well
မြန်မြန်	မြန်ဆန်စွာ	Quickly
နှေးနှေး	နှေးကွေးစွာ	Slowly

Contractions and Texting

Colloquial	Standard	Description
လို့ပဲ	ထို့ကြောင့်	Because (contracted)
ရင်	လျှင်	If (colloquial → formal)
555	ဟာဟာဟာ	Laughing (Thai style)

Reverse Mapping: STANDARD_TO_COLLOQUIAL

The STANDARD_TO_COLLOQUIAL dictionary is the inverse of COLLOQUIAL_SUBSTITUTIONS: it maps each standard form back to the set of colloquial variants that point to it. It is built automatically at module load time.
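
One way such an inverse can be derived is sketched below. It assumes the forward mapping stores a set of standard forms per colloquial form, which is a guess based on ငါ having two equivalents in the Pronouns table; `COLLOQUIAL_EXCERPT` and `invert` are illustrative names, and the library's own construction may differ.

```python
from collections import defaultdict

# Stand-in excerpt copied from the tables above. Values are assumed to be
# sets because one colloquial form can have several standard equivalents.
COLLOQUIAL_EXCERPT = {
    "အုန်း": {"ဦး"},
    "အုံး": {"ဦး"},
    "ငါ": {"ကျွန်တော်", "ကျွန်မ"},
    "ကျမ": {"ကျွန်မ"},
}


def invert(colloquial_to_standard):
    """Build a standard -> {colloquial variants} mapping from the forward table."""
    inverse = defaultdict(set)
    for colloquial, standards in colloquial_to_standard.items():
        for standard in standards:
            inverse[standard].add(colloquial)
    return dict(inverse)


STANDARD_EXCERPT = invert(COLLOQUIAL_EXCERPT)
print(STANDARD_EXCERPT["ဦး"])  # both colloquial spellings of the particle
```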

Helper Functions

from myspellchecker.text.phonetic_data import (
    is_colloquial_variant,
    get_standard_forms,
    STANDARD_TO_COLLOQUIAL,
)

is_colloquial_variant("ငါ")       # True
get_standard_forms("unknown")     # set() (empty)
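
To make the behaviour of these helpers concrete, here is a minimal sketch of equivalent lookups over a stand-in table. The names `is_colloquial_sketch`, `standard_forms_sketch`, and `normalize_tokens` are hypothetical and not part of the library; the empty-set result for unknown words mirrors the example above.

```python
# Stand-in excerpt copied from the tables above; the value type (set) is an
# assumption, chosen because get_standard_forms is shown returning a set.
COLLOQUIAL_EXCERPT = {
    "ငါ": {"ကျွန်တော်", "ကျွန်မ"},
    "ရင်": {"လျှင်"},
}


def is_colloquial_sketch(word, table=COLLOQUIAL_EXCERPT):
    """True when the word appears as a colloquial form in the table."""
    return word in table


def standard_forms_sketch(word, table=COLLOQUIAL_EXCERPT):
    """Return a copy of the standard forms, or an empty set if unknown."""
    return set(table.get(word, ()))


def normalize_tokens(tokens, table=COLLOQUIAL_EXCERPT):
    """Replace tokens that have exactly one standard form; keep the rest."""
    result = []
    for token in tokens:
        forms = standard_forms_sketch(token, table)
        # Ambiguous tokens (several standard forms) are left untouched.
        result.append(next(iter(forms)) if len(forms) == 1 else token)
    return result


print(normalize_tokens(["ငါ", "ရင်", "လာ"]))  # only ရင် is rewritten
```

Leaving ambiguous forms such as ငါ unchanged is a design choice of this sketch; picking between ကျွန်တော် and ကျွန်မ needs context the table does not carry.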

Usage Examples

PhoneticHasher Integration

from myspellchecker.text.phonetic_data import PHONETIC_GROUPS

# Illustrative sketch of how a hasher can use PHONETIC_GROUPS. The library's
# own PhoneticHasher lives in myspellchecker.text.phonetic and may differ.
class PhoneticHasher:
    def __init__(self) -> None:
        # Build reverse mapping from char to group
        self.char_to_group = {}
        for group, chars in PHONETIC_GROUPS.items():
            for char in chars:
                self.char_to_group[char] = group

    def hash(self, word: str) -> str:
        """Generate a phonetic hash: each grouped character becomes its group key."""
        # Characters that belong to no group pass through unchanged.
        return "".join(self.char_to_group.get(char, char) for char in word)
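
A hash like this is typically used to bucket words so that phonetic neighbours end up together. The following is a minimal, self-contained sketch using a small stand-in excerpt of the group tables; `GROUPS_EXCERPT`, `phonetic_hash`, and `bucket_by_hash` are illustrative names, not library API.

```python
from collections import defaultdict

# Stand-in excerpt copied from the group tables; not the full library data.
GROUPS_EXCERPT = {
    "p": ["ပ", "ဖ", "ဗ", "ဘ"],
    "k": ["က", "ခ", "ဂ", "ဃ"],
    "vowel_a": ["ာ", "ါ"],
}
CHAR_TO_GROUP = {ch: key for key, chars in GROUPS_EXCERPT.items() for ch in chars}


def phonetic_hash(word):
    """Replace each grouped character with its group key; keep the rest."""
    return "".join(CHAR_TO_GROUP.get(ch, ch) for ch in word)


def bucket_by_hash(words):
    """Group words that share a phonetic hash, preserving input order."""
    buckets = defaultdict(list)
    for word in words:
        buckets[phonetic_hash(word)].append(word)
    return dict(buckets)


buckets = bucket_by_hash(["ပါ", "ဗာ", "ကာ"])
print(buckets["pvowel_a"])  # ပါ and ဗာ land in the same bucket
```

One caveat of plain concatenation: with the full tables, န followed by ယ hashes to "ny", the same string as the key for ည. Joining keys with a separator that cannot occur in a key would avoid that collision.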

Visual Confusion Detection

from typing import List

from myspellchecker.text.phonetic_data import VISUAL_SIMILAR

def find_visual_variants(word: str) -> List[str]:
    """Generate visually similar variants of a word, one substitution at a time."""
    variants = []
    for i, char in enumerate(word):
        # Characters without a VISUAL_SIMILAR entry are skipped.
        for similar in VISUAL_SIMILAR.get(char, ()):
            variants.append(word[:i] + similar + word[i + 1:])
    return variants

# Example
variants = find_visual_variants("ကိုယ်")
# Includes "ကီုယ်" (short i replaced with long ii). Other characters in the
# word that have VISUAL_SIMILAR entries contribute further variants.

Tonal Variant Generation

from typing import List

from myspellchecker.text.phonetic_data import TONAL_GROUPS

def generate_tonal_variants(word: str) -> List[str]:
    """Generate tonal variants of a word; the original word comes first."""
    variants = [word]
    # Iterates one code point at a time, so multi-character keys such as
    # ော are not matched by this simple loop.
    for i, char in enumerate(word):
        for variant_char in TONAL_GROUPS.get(char, ()):
            if variant_char != char:
                variants.append(word[:i] + variant_char + word[i + 1:])
    return variants

# Example
variants = generate_tonal_variants("လာ")
# ["လာ", "လာ့", "လား", ...]

Data Constants

Available constants and functions:

from myspellchecker.text.phonetic_data import (
    PHONETIC_GROUPS,            # Phonetic similarity groups
    VISUAL_SIMILAR,             # Visual confusability mapping
    MYANMAR_SUBSTITUTION_COSTS, # Weighted edit distance costs
    TONAL_GROUPS,               # Tonal variant mappings
    COLLOQUIAL_SUBSTITUTIONS,   # Colloquial -> standard mappings
    STANDARD_TO_COLLOQUIAL,     # Standard -> colloquial reverse mapping
    is_colloquial_variant,      # Check if word is colloquial
    get_standard_forms,         # Get standard forms for colloquial
)
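
The layout of MYANMAR_SUBSTITUTION_COSTS is not described on this page. As an illustration of what weighted edit distance costs are for, the sketch below runs a Levenshtein distance with per-pair substitution costs over a hypothetical table; `SUB_COSTS_SKETCH`, its keys, and its values are all assumptions made for the example, not the library's data.

```python
# Hypothetical cost table: unordered character pair -> substitution cost.
# The real MYANMAR_SUBSTITUTION_COSTS layout and values may differ.
SUB_COSTS_SKETCH = {
    frozenset(("ိ", "ီ")): 0.3,
    frozenset(("ပ", "ဗ")): 0.5,
}


def substitution_cost(a, b, costs=SUB_COSTS_SKETCH):
    """Cost of replacing a with b; unlisted pairs cost a full edit."""
    if a == b:
        return 0.0
    return costs.get(frozenset((a, b)), 1.0)


def weighted_distance(source, target, costs=SUB_COSTS_SKETCH):
    """Levenshtein distance with per-pair substitution costs."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [float(i)]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1.0,      # delete s_char
                current[j - 1] + 1.0,   # insert t_char
                previous[j - 1] + substitution_cost(s_char, t_char, costs),
            ))
        previous = current
    return previous[-1]


print(weighted_distance("ပိ", "ဗိ"))  # 0.5: a cheap, commonly confused swap
print(weighted_distance("ပိ", "တိ"))  # 1.0: an unlisted pair costs a full edit
```

Lower costs for confusable pairs let a ranking step prefer suggestions that differ from the input only by a likely slip.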

Phoneme-Grapheme Notes

E vs AI Vowels

The module correctly distinguishes:

ေ (U+1031) - E vowel, IPA /e/, prefix position
ဲ (U+1032) - AI vowel, IPA /ɛ/, suffix position

These are phonetically distinct and should NOT be treated as interchangeable.

Aspirated vs Voiced

Consonant groups contain unaspirated, aspirated, and voiced variants of the same place of articulation:

ပ (unaspirated) vs ဖ (aspirated) vs ဗ (voiced)

These sound similar and are often confused, which is why they share a group.
