Constants Reference - mySpellChecker

This page catalogs every character set, validation set, enum, and default value defined in the constants modules, from the consonant set through medial compatibility tables to POS tag constants. Source: src/myspellchecker/core/constants/myanmar_constants.py, src/myspellchecker/core/constants/core_constants.py, src/myspellchecker/core/constants/detector_thresholds.py, and src/myspellchecker/core/constants/pipeline_constants.py.

Unicode Ranges

Main Myanmar Block

# Tuple: (start, end) inclusive boundaries
MYANMAR_RANGE = (0x1000, 0x109F)

Extended Blocks

# Myanmar Extended-A: U+AA60 to U+AA7F (tuple)
MYANMAR_EXTENDED_A_RANGE = (0xAA60, 0xAA7F)

# Myanmar Extended-B: U+A9E0 to U+A9FF (tuple)
MYANMAR_EXTENDED_B_RANGE = (0xA9E0, 0xA9FF)

# Regex pattern for all Myanmar ranges
MYANMAR_RANGE_REGEX_STR = r"[\u1000-\u109F\uA9E0-\uA9FF\uAA60-\uAA7F]"
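
As an illustration of how the combined pattern can be used, the sketch below compiles it and counts Myanmar code points in a string. The pattern is copied from the listing above so the snippet runs on its own; `count_myanmar_chars` and `myanmar_ratio` are hypothetical helper names, not part of the library.

```python
import re

# Copied from the listing above; real code would import it from the constants package.
MYANMAR_RANGE_REGEX_STR = r"[\u1000-\u109F\uA9E0-\uA9FF\uAA60-\uAA7F]"
MYANMAR_CHAR_RE = re.compile(MYANMAR_RANGE_REGEX_STR)


def count_myanmar_chars(text: str) -> int:
    """Count code points that fall in any of the three Myanmar blocks."""
    return len(MYANMAR_CHAR_RE.findall(text))


def myanmar_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Myanmar (0.0 if none)."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    hits = sum(1 for ch in visible if MYANMAR_CHAR_RE.match(ch))
    return hits / len(visible)
```

A ratio like this could serve as a cheap pre-filter before running heavier validation on mixed-language input.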

Extended-A Character Ranges (U+AA60–U+AA7F)

Range	Usage
U+AA60–U+AA6F	Shan consonants
U+AA70–U+AA76	Shan vowels and tones
U+AA77–U+AA79	Shan symbols

Extended-B Character Ranges (U+A9E0–U+A9FF)

Range	Usage
U+A9E0–U+A9E4	Shan letters
U+A9E5	Shan sign
U+A9E6	Reduplication mark

Character Set Constants

# Characters in Extended-A block (set)
MYANMAR_EXTENDED_A_CHARS = set(chr(c) for c in range(0xAA60, 0xAA80))

# Characters in Extended-B block (set)
MYANMAR_EXTENDED_B_CHARS = set(chr(c) for c in range(0xA9E0, 0xAA00))

# All Myanmar characters combined (set)
ALL_MYANMAR_CHARS = (
    set(chr(c) for c in range(0x1000, 0x10A0))
    | MYANMAR_EXTENDED_A_CHARS
    | MYANMAR_EXTENDED_B_CHARS
)

# Core Burmese-only characters (U+1000-U+104F minus non-standard chars)
MYANMAR_CORE_CHARS: Set[str]

# Extended Core Block (U+1050-U+109F) - Shan/Mon/Karen additions
MYANMAR_EXTENDED_CORE_BLOCK: Set[str]

# All extended blocks combined (out of scope for v1.0)
EXTENDED_MYANMAR_CHARS: Set[str]

Helper Functions

def get_myanmar_char_set(allow_extended: bool = False) -> Set[str]:
    """Get Myanmar character set based on scope."""

def has_extended_myanmar_chars(text: str) -> bool:
    """Check if text contains Extended Myanmar characters."""

def is_myanmar_text(text: str, allow_extended: bool = False) -> bool:
    """Check if text contains any Myanmar characters."""

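The signatures above do not show the bodies, so the following is only a sketch of how such helpers could behave. It assumes the non-extended scope is U+1000 to U+104F and that "extended" means U+1050 to U+109F plus the Extended-A and Extended-B blocks; the library's real sets (for example, the removal of non-standard characters from the core set) may differ. All `*_sketch` names are hypothetical.

```python
from typing import Set

# Stand-in sets rebuilt from the documented ranges (assumption, see lead-in).
CORE_SKETCH: Set[str] = {chr(c) for c in range(0x1000, 0x1050)}
EXTENDED_SKETCH: Set[str] = (
    {chr(c) for c in range(0x1050, 0x10A0)}
    | {chr(c) for c in range(0xAA60, 0xAA80)}
    | {chr(c) for c in range(0xA9E0, 0xAA00)}
)


def char_set_sketch(allow_extended: bool = False) -> Set[str]:
    """Return the core set, optionally widened with the extended blocks."""
    if allow_extended:
        return CORE_SKETCH | EXTENDED_SKETCH
    return set(CORE_SKETCH)


def has_extended_sketch(text: str) -> bool:
    """True if any character belongs to one of the extended blocks."""
    return any(ch in EXTENDED_SKETCH for ch in text)


def is_myanmar_text_sketch(text: str, allow_extended: bool = False) -> bool:
    """True if at least one character is in the selected Myanmar set."""
    allowed = char_set_sketch(allow_extended)
    return any(ch in allowed for ch in text)
```
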
Character Sets

Consonants

# 34 consonants (U+1000 to U+1021) plus Great Sa (U+103F)
# U+1021 (အ) is the vowel carrier but functions as a consonant syllable-initially
CONSONANTS = set(chr(i) for i in range(0x1000, 0x1022))
CONSONANTS.add(GREAT_SA)  # U+103F added separately

Char	Code Point	Name	Romanization
က	U+1000	KA	ka
ခ	U+1001	KHA	kha
ဂ	U+1002	GA	ga
ဃ	U+1003	GHA	gha
င	U+1004	NGA	nga
စ	U+1005	CA	sa
ဆ	U+1006	CHA	hsa
ဇ	U+1007	JA	za
ဈ	U+1008	JHA	zha
ဉ	U+1009	NYA (archaic)	nya
ည	U+100A	NNYA	nya
ဋ	U+100B	TTA	tta
ဌ	U+100C	TTHA	ttha
ဍ	U+100D	DDA	dda
ဎ	U+100E	DDHA	ddha
ဏ	U+100F	NNA	nna
တ	U+1010	TA	ta
ထ	U+1011	THA	hta
ဒ	U+1012	DA	da
ဓ	U+1013	DHA	dha
န	U+1014	NA	na
ပ	U+1015	PA	pa
ဖ	U+1016	PHA	pha
ဗ	U+1017	BA	ba
ဘ	U+1018	BHA	bha
မ	U+1019	MA	ma
ယ	U+101A	YA	ya
ရ	U+101B	RA	ra
လ	U+101C	LA	la
ဝ	U+101D	WA	wa
သ	U+101E	SA	tha
ဟ	U+101F	HA	ha
ဠ	U+1020	LLA	lla
အ	U+1021	A (vowel carrier)	a
ဿ	U+103F	GREAT SA (added separately)	ssa
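
A quick way to confirm that the construction and the table agree is to rebuild the set and count it. The snippet repeats the two-line definition from the listing so that it runs without the package installed; `is_consonant` is a hypothetical convenience wrapper.

```python
GREAT_SA = "\u103f"

# Same construction as the listing above: U+1000..U+1021, then Great Sa.
CONSONANTS = {chr(i) for i in range(0x1000, 0x1022)}
CONSONANTS.add(GREAT_SA)


def is_consonant(char: str) -> bool:
    """Membership test against the rebuilt consonant set."""
    return char in CONSONANTS


# 34 code points from the contiguous range plus Great Sa: one per table row.
consonant_count = len(CONSONANTS)
```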

Non-Standard Characters

# Mon/Shan specific chars in Core Block -- not used in standard Burmese
NON_STANDARD_CHARS = {
    "\u1022",  # SHAN LETTER A
    "\u1028",  # MYANMAR LETTER MON E
    "\u1033",  # MYANMAR VOWEL SIGN MON II
    "\u1034",  # MYANMAR VOWEL SIGN MON O
    "\u1035",  # MYANMAR VOWEL SIGN E ABOVE
}

Independent Vowels

# Stand-alone vowels (U+1021 to U+102A)
# Type: set
# Full set includes vowel carrier (U+1021), Shan A (U+1022), and Mon E (U+1028)
INDEPENDENT_VOWELS = set(chr(i) for i in range(0x1021, 0x102B))

# Strict set: standard Burmese only (excludes carrier and non-standard)
INDEPENDENT_VOWELS_STRICT = {
    "\u1023",  # I
    "\u1024",  # II
    "\u1025",  # U
    "\u1026",  # UU
    "\u1027",  # E
    "\u1029",  # O
    "\u102a",  # AU
}

# Vowel carrier -- behaves like a consonant in syllable structure
VOWEL_CARRIER = "\u1021"  # (U+1021)

Note: Use INDEPENDENT_VOWELS_STRICT for standard Burmese-only validation.
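
To make the difference between the two sets concrete, the sketch below rebuilds both from the listing and computes what the strict set leaves out. The variable `extra` and the function name are illustrative only.

```python
# Both sets are copied from the listing above.
INDEPENDENT_VOWELS = {chr(i) for i in range(0x1021, 0x102B)}
INDEPENDENT_VOWELS_STRICT = {
    "\u1023", "\u1024", "\u1025", "\u1026", "\u1027", "\u1029", "\u102a",
}

# Characters accepted by the full set but rejected by the strict one:
# the vowel carrier (U+1021), Shan A (U+1022), and Mon E (U+1028).
extra = INDEPENDENT_VOWELS - INDEPENDENT_VOWELS_STRICT


def is_standard_independent_vowel(char: str) -> bool:
    """True only for the seven standard Burmese independent vowels."""
    return char in INDEPENDENT_VOWELS_STRICT
```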

Vowel Signs (Dependent Vowels)

# Vowel signs attached to consonants (U+102B to U+1032)
# Type: set
VOWEL_SIGNS = set(chr(i) for i in range(0x102B, 0x1033))

Char	Code Point	Name
ါ	U+102B	TALL AA
ာ	U+102C	AA
ိ	U+102D	I
ီ	U+102E	II
ု	U+102F	U
ူ	U+1030	UU
ေ	U+1031	E (left-side, stored after consonant)
ဲ	U+1032	AI

Vowel Classification

# Position-based vowel subsets (used in validation)
UPPER_VOWELS = {"\u102d", "\u102e", "\u1032"}  # I, II, AI
LOWER_VOWELS = {"\u102f", "\u1030"}             # U, UU

# Invalid E-vowel combinations
INVALID_E_COMBINATIONS = {"\u102d", "\u102e", "\u102f", "\u1030"}

# Valid Vowel Combinations (Digraphs)
VALID_VOWEL_COMBINATIONS = {
    frozenset({"\u1031", "\u102c"}),  # E + Aa
    frozenset({"\u1031", "\u102b"}),  # E + Tall A
    frozenset({"\u102d", "\u102f"}),  # I + U
}

# Anusvara (U+1036) Compatibility
ANUSVARA_ALLOWED_VOWELS = {"\u102d", "\u102f"}  # I, U
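
Storing each digraph as a `frozenset` makes the lookup independent of the order in which the two vowels appear. The sketch below shows one way these sets might be queried; the function names are hypothetical, and reading INVALID_E_COMBINATIONS as "vowels that must not co-occur with E" is an assumption based on its name and comment.

```python
# Copied from the listing above.
VOWEL_E = "\u1031"
INVALID_E_COMBINATIONS = {"\u102d", "\u102e", "\u102f", "\u1030"}
VALID_VOWEL_COMBINATIONS = {
    frozenset({"\u1031", "\u102c"}),  # E + AA
    frozenset({"\u1031", "\u102b"}),  # E + Tall AA
    frozenset({"\u102d", "\u102f"}),  # I + U
}


def is_valid_vowel_pair(first: str, second: str) -> bool:
    """Order-independent lookup of a two-vowel combination."""
    return frozenset({first, second}) in VALID_VOWEL_COMBINATIONS


def clashes_with_e(vowels: str) -> bool:
    """True if E appears together with a vowel from INVALID_E_COMBINATIONS."""
    return VOWEL_E in vowels and any(v in INVALID_E_COMBINATIONS for v in vowels)
```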

Medials

# Consonant modifiers (U+103B to U+103E)
# Type: set
MEDIALS = {"\u103b", "\u103c", "\u103d", "\u103e"}

# Individual medial constants
MEDIAL_YA = "\u103b"      # Ya-pin (ျ) - palatal glide /j/
MEDIAL_RA = "\u103c"      # Ya-yit (ြ) - rhotic glide /r/
MEDIAL_WA = "\u103d"      # Wa-hswe (ွ)
MEDIAL_HA = "\u103e"      # Ha-htoe (ှ)

# Phonetic aliases
MEDIAL_YA_PIN = MEDIAL_YA  # ျ - /j/ glide (palatal approximant)
MEDIAL_YA_YIT = MEDIAL_RA  # ြ - /r/ glide (rhotic approximant)

Signs and Marks

# Tone marks
# Type: set -- includes all three tone/nasal markers
TONE_MARKS = {
    "\u1036",  # Anusvara (Thay-thay-tin) - nasalization
    "\u1037",  # Dot Below (Auk-myit) - creaky tone
    "\u1038",  # Visarga (Wit-sa-pauk) - emphatic/final
}

# Special signs (individual constants)
ANUSVARA = "\u1036"   # U+1036 - Nasal mark
DOT_BELOW = "\u1037"  # U+1037 - Creaky tone
VISARGA = "\u1038"     # U+1038 - Emphatic/final
VIRAMA = "\u1039"      # U+1039 - Stacker (pat sint)
ASAT = "\u103a"        # U+103A - Vowel killer
NGA = "\u1004"         # U+1004 - Nga consonant
GREAT_SA = "\u103f"    # U+103F - Great Sa

# Individual vowel signs
VOWEL_E = "\u1031"     # E vowel (displayed before the consonant, stored after it)
VOWEL_AI = "\u1032"    # AI vowel

# English token placeholder
ENG_TOKEN = "<ENG>"

# Dependent various signs (U+1032 - U+103E)
DEPENDENT_VARIOUS_SIGNS = set(chr(i) for i in range(0x1032, 0x103F))

Myanmar Numerals

# Myanmar digits (U+1040 to U+1049)
# Type: set
MYANMAR_NUMERALS = {
    "\u1040",  # 0
    "\u1041",  # 1
    "\u1042",  # 2
    "\u1043",  # 3
    "\u1044",  # 4
    "\u1045",  # 5
    "\u1046",  # 6
    "\u1047",  # 7
    "\u1048",  # 8
    "\u1049",  # 9
}

# Myanmar numeral words (written form, Dict[str, int])
MYANMAR_NUMERAL_WORDS = {
    "တစ်": 1,
    "နှစ်": 2,
    "သုံး": 3,
    ...
    "သန်း": 1000000,
}
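
Because the ten digits occupy consecutive code points starting at U+1040, a digit's value is its offset from that code point. The sketch below uses this to convert a digit string to an integer; `myanmar_digits_to_int` is a hypothetical helper, not a documented library function.

```python
MYANMAR_ZERO = 0x1040  # code point of Myanmar digit zero


def myanmar_digits_to_int(text: str) -> int:
    """Convert a string of Myanmar digits (U+1040..U+1049) to an int."""
    if not text:
        raise ValueError("empty digit string")
    value = 0
    for ch in text:
        digit = ord(ch) - MYANMAR_ZERO
        if not 0 <= digit <= 9:
            raise ValueError(f"not a Myanmar digit: {ch!r}")
        value = value * 10 + digit
    return value
```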

Punctuation

# Myanmar punctuation (U+104A to U+104F)
# Type: set at this point; the module later redefines MYANMAR_PUNCTUATION as a frozenset
MYANMAR_PUNCTUATION = set(chr(i) for i in range(0x104A, 0x1050))

# Common punctuation (mixed Myanmar and ASCII)
COMMON_PUNCTUATION = set(...)

# Myanmar-specific separators
SENTENCE_SEPARATOR = "။"  # Myanmar full stop (U+104B)
PHRASE_SEPARATOR = "၊"    # Myanmar comma (U+104A)

Section Marks and Logographic Particles

# Section marks (frozenset)
SECTION_MARKS = frozenset({"\u104c", "\u104d"})  # ၌ ၍

# Reference marks (frozenset)
REFERENCE_MARKS = frozenset({"\u104e", "\u104f"})  # ၎ ၏

# All logographic particles combined (frozenset)
LOGOGRAPHIC_PARTICLES = SECTION_MARKS | REFERENCE_MARKS

# Valid particles (frozenset)
VALID_PARTICLES = frozenset(["\u104c", "\u104d", "\u104e", "\u104f"])

# Myanmar special symbols (frozenset)
MYANMAR_SPECIAL_SYMBOLS = LOGOGRAPHIC_PARTICLES | MYANMAR_PUNCTUATION

Validation Sets

Valid Medial Sequences

# All valid medial combinations in canonical order: Ya > Ra > Wa > Ha
# Type: set of strings
VALID_MEDIAL_SEQUENCES = {
    # Four-medial
    "ျြွှ",
    # Three-medial
    "ျြွ", "ျြှ", "ျွှ", "ြွှ",
    # Two-medial
    "ျြ", "ျွ", "ျှ", "ြွ", "ြှ", "ွှ",
    # Single medials
    "ျ", "ြ", "ွ", "ှ",
}
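
The 15 entries listed above are exactly the non-empty subsequences of the canonical order Ya, Ra, Wa, Ha. As a sketch, the same set can be generated rather than typed out; `all_medial_sequences` is a hypothetical name, and whether the library builds the set this way is not stated in the source.

```python
from itertools import combinations

CANONICAL_MEDIAL_ORDER = "\u103b\u103c\u103d\u103e"  # Ya, Ra, Wa, Ha


def all_medial_sequences(order: str = CANONICAL_MEDIAL_ORDER) -> set:
    """Every non-empty subsequence of `order`, keeping the canonical order."""
    return {
        "".join(combo)
        for size in range(1, len(order) + 1)
        # combinations() emits items in input order, so order is preserved.
        for combo in combinations(order, size)
    }


GENERATED_SEQUENCES = all_medial_sequences()
```

Generating the set this way guards against a typo in a hand-written literal, at the cost of hiding the individual sequences from readers of the source.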

Normalization Order Weights

# UTN #11 canonical order for medials and vowels
# Type: dict (character -> numeric weight)
ORDER_WEIGHTS = {
    "\u103b": 10,    # Medial Ya
    "\u103c": 11,    # Medial Ra
    "\u103d": 12,    # Medial Wa
    "\u103e": 13,    # Medial Ha
    "\u1031": 20,    # Vowel E
    "\u102d": 21,    # Vowel I (Upper)
    "\u102e": 21,    # Vowel II (Upper)
    "\u1032": 21,    # Vowel AI (Upper)
    "\u102f": 22,    # Vowel U (Lower)
    "\u1030": 22,    # Vowel UU (Lower)
    "\u102b": 21.4,  # Vowel A (Tall)
    "\u102c": 21.4,  # Vowel AA
    "\u1036": 30,    # Anusvara
    "\u103a": 21.5,  # Asat
    "\u1037": 32,    # Dot Below
    "\u1038": 33,    # Visarga
    "\u1039": 40,    # Virama
}
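
One plausible use of these weights is a stable sort over the dependent marks that follow a base consonant. The sketch below assumes a run made up only of characters that have a weight (no stacked or final consonants); `reorder_marks` and `is_canonical` are hypothetical names, and the library's actual normalizer may handle more cases.

```python
# Weights copied from the ORDER_WEIGHTS listing above.
ORDER_WEIGHTS = {
    "\u103b": 10, "\u103c": 11, "\u103d": 12, "\u103e": 13,  # medials
    "\u1031": 20,                                            # vowel E
    "\u102d": 21, "\u102e": 21, "\u1032": 21,                # upper vowels
    "\u102b": 21.4, "\u102c": 21.4,                          # tall A, AA
    "\u103a": 21.5,                                          # asat
    "\u102f": 22, "\u1030": 22,                              # lower vowels
    "\u1036": 30, "\u1037": 32, "\u1038": 33,                # anusvara, dot below, visarga
    "\u1039": 40,                                            # virama
}


def reorder_marks(marks: str) -> str:
    """Sort a run of dependent marks by weight.

    sorted() is stable, so marks with equal weight keep their input order.
    Raises KeyError if a character has no entry in ORDER_WEIGHTS.
    """
    return "".join(sorted(marks, key=lambda ch: ORDER_WEIGHTS[ch]))


def is_canonical(marks: str) -> bool:
    """True if the run is already in non-decreasing weight order."""
    return marks == reorder_marks(marks)
```

For example, a run typed as vowel E, medial Wa, medial Ya would be reordered to medial Ya, medial Wa, vowel E.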

Zero-Width Characters

# Characters to remove during normalization
# Type: set
ZERO_WIDTH_CHARS = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (BOM)
}
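
Removing these characters is a one-pass operation with `str.translate`. The sketch below copies the set from the listing; `strip_zero_width` is a hypothetical helper name.

```python
# Copied from the listing above.
ZERO_WIDTH_CHARS = {"\u200b", "\u200c", "\u200d", "\ufeff"}

# str.translate deletes any code point that maps to None.
_ZERO_WIDTH_TABLE = {ord(ch): None for ch in ZERO_WIDTH_CHARS}


def strip_zero_width(text: str) -> str:
    """Return `text` with all zero-width characters removed."""
    return text.translate(_ZERO_WIDTH_TABLE)
```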

Medial Compatibility Sets

Defines which consonants can validly combine with each medial.

# Consonants that can take Medial Ya-pin (ျ U+103B)
COMPATIBLE_YA: set     # Also aliased as COMPATIBLE_YA_PIN

# Consonants that can take Medial Ya-yit (ြ U+103C)
COMPATIBLE_RA: set     # Also aliased as COMPATIBLE_YA_YIT

# Consonants that can take Medial Wa (ွ U+103D)
COMPATIBLE_WA: set     # Broadest compatibility

# Consonants that can take Medial Ha (ှ U+103E)
COMPATIBLE_HA: set     # Sonorant consonants only

Phonetic Character Sets

# Sonorant consonants (nasals, liquids, glides)
SONORANTS: set

# Stop/obstruent consonants for syllable-final position
STOP_FINALS: set

Stacking and Kinzi

# Valid consonant stacking pairs for Pali/Sanskrit loanwords
# Type: set of (upper_consonant, lower_consonant) tuples
STACKING_EXCEPTIONS: set

# Kinzi valid followers (consonants that can follow Kinzi pattern)
# Type: set
KINZI_VALID_FOLLOWERS: set

Part-of-Speech Tag Constants

Granular particle tag constants defined in core_constants.py:

P_SUBJ = "P_SUBJ"      # Subject/topic marker
P_OBJ = "P_OBJ"        # Object marker
P_SENT = "P_SENT"      # Sentence ending particle
P_MOD = "P_MOD"        # Modifier particle
P_LOC = "P_LOC"        # Location/direction marker

Skipped Context Words

High-frequency particles skipped by the Context Validator:

# Type: set
SKIPPED_CONTEXT_WORDS = {
    # Interjections and emphasis particles
    "ကွာ", "ဗျာ", "နော်", "ဟေ့",
    "ကွ", "လေ", "ပါ", "ပဲ", "ပေါ့",
    # Subject/object markers
    "က", "ကို", "သည်", "တယ်",
    # Locative particles
    "မှာ", "မှ", "တွင်",
    # Comitative/conjunctive
    "နဲ့", "နှင့်", "နှင်",
    # Genitive/possessive
    "ရဲ့", "၏",
    # Other common particles
    "များ", "လည်း", "တော့", "ပြီး", "ဖို့", "အတွက်",
    # Question and ending particles
    "လား", "လဲ", "လို့",
    # Common verb complements (too short for meaningful n-gram validation)
    "ကျ", "ပြ", "ချ",
}

Morphology Constants

Suffix sets for OOV POS guessing (from core_constants.py):

# Type: frozenset[str]
VERB_SUFFIXES: frozenset[str]     # e.g., "ပြီ", "ပြီး", "ခဲ့", ...
NOUN_SUFFIXES: frozenset[str]     # e.g., "များ", "တွေ", "ခြင်း", ...
ADVERB_SUFFIXES: frozenset[str]   # e.g., "စွာ", "တိုင်း", ...
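
As a rough sketch of suffix-based POS guessing, the code below tries the longest suffix first and returns a tag on the first match. The sets contain only the examples shown in the comments above (the real frozensets hold more entries), and the tag labels "V", "N", and "ADV" as well as `guess_pos` are hypothetical.

```python
from typing import Optional

# Partial stand-ins: only the example suffixes listed in the comments above.
VERB_SUFFIXES = frozenset({"ပြီ", "ပြီး", "ခဲ့"})
NOUN_SUFFIXES = frozenset({"များ", "တွေ", "ခြင်း"})
ADVERB_SUFFIXES = frozenset({"စွာ", "တိုင်း"})

# Flatten to (suffix, tag) pairs and try longer suffixes before shorter ones.
_SUFFIX_TABLE = [
    (suffix, tag)
    for tag, suffixes in (
        ("V", VERB_SUFFIXES),
        ("N", NOUN_SUFFIXES),
        ("ADV", ADVERB_SUFFIXES),
    )
    for suffix in suffixes
]
_SUFFIX_TABLE.sort(key=lambda pair: len(pair[0]), reverse=True)


def guess_pos(word: str) -> Optional[str]:
    """Guess a coarse tag from the word's ending, or None if nothing matches."""
    for suffix, tag in _SUFFIX_TABLE:
        # Require a non-empty stem so a bare suffix is not tagged.
        if len(word) > len(suffix) and word.endswith(suffix):
            return tag
    return None
```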

Enums

class ValidationLevel(str, Enum):
    SYLLABLE = "syllable"  # Fast, only checks valid syllables
    WORD = "word"          # Comprehensive, checks words and context

class ErrorType(str, Enum):
    SYLLABLE = "invalid_syllable"
    WORD = "invalid_word"
    CONTEXT_PROBABILITY = "context_probability"
    GRAMMAR = "grammar_error"
    PARTICLE_TYPO = "particle_typo"
    MEDIAL_CONFUSION = "medial_confusion"
    COLLOQUIAL_VARIANT = "colloquial_variant"
    HOMOPHONE_ERROR = "homophone_error"
    TONE_AMBIGUITY = "tone_ambiguity"
    ZAWGYI_ENCODING = "zawgyi_encoding"
    MISSING_ASAT = "missing_asat"
    PARTICLE_MISUSE = "particle_misuse"
    COLLOCATION_ERROR = "collocation_error"
    # See /reference/error-types for the complete list of error types
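
Because both enums mix in `str`, members compare equal to their plain string values and can be constructed from them, which is convenient when a level arrives from a config file or CLI flag. The sketch below redefines `ValidationLevel` locally so it runs without the package; `parse_level` is a hypothetical helper.

```python
from enum import Enum


class ValidationLevel(str, Enum):
    # Local copy of the enum shown above.
    SYLLABLE = "syllable"
    WORD = "word"


def parse_level(raw: str) -> ValidationLevel:
    """Map user input to a member; raises ValueError for unknown values."""
    return ValidationLevel(raw.strip().lower())
```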

Algorithm Thresholds

@dataclass(frozen=True, slots=True)
class AlgorithmThresholds:
    # SymSpell defaults
    max_edit_distance: int = 2
    prefix_length: int = 10

    # Beam search defaults (for Viterbi, joint tagging, etc.)
    beam_width_default: int = 25
    beam_width_minimal: int = 10
    beam_width_pos_tagger: int = 15

    # N-gram context checker defaults
    ngram_threshold: float = 0.01

    # Suggestion ranking
    candidate_limit: int = 50
    min_probability_denom: float = 0.001

    # LRU cache size for edit distance, POS lookups, etc.
    lru_cache_size: int = 4096
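
A frozen dataclass cannot be mutated in place, so overriding a default means building a new instance. The sketch below shows this with `dataclasses.replace` on a trimmed stand-in for the class above; it keeps only four of the fields and omits `slots=True` so that it also runs on Python versions before 3.10.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AlgorithmThresholds:
    # Trimmed local copy; defaults match the listing above.
    max_edit_distance: int = 2
    prefix_length: int = 10
    beam_width_default: int = 25
    ngram_threshold: float = 0.01


DEFAULTS = AlgorithmThresholds()

# replace() returns a new instance and leaves DEFAULTS untouched.
fast = replace(DEFAULTS, max_edit_distance=1, beam_width_default=10)
```

Attempting `DEFAULTS.max_edit_distance = 3` raises `dataclasses.FrozenInstanceError`, a subclass of `AttributeError`.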

Usage in Code

Importing Constants

from myspellchecker.core.constants import (
    CONSONANTS,
    MEDIALS,
    VOWEL_SIGNS,
    INDEPENDENT_VOWELS,
    ORDER_WEIGHTS,
    ZERO_WIDTH_CHARS,
    MYANMAR_RANGE,
    MYANMAR_NUMERALS,
    MYANMAR_PUNCTUATION,
    TONE_MARKS,
    VALID_MEDIAL_SEQUENCES,
    SKIPPED_CONTEXT_WORDS,
    ValidationLevel,
    ErrorType,
)

Character Classification

def classify_myanmar_char(char: str) -> str:
    """Classify a Myanmar character by type."""
    # Order matters: U+1021 (အ) is in both CONSONANTS and INDEPENDENT_VOWELS,
    # so it is reported as a consonant.
    if char in CONSONANTS:
        return 'consonant'
    elif char in MEDIALS:
        return 'medial'
    elif char in VOWEL_SIGNS:
        return 'vowel_sign'
    elif char in INDEPENDENT_VOWELS:
        return 'independent_vowel'
    elif char in TONE_MARKS:
        return 'tone_mark'
    elif char in MYANMAR_NUMERALS:
        return 'numeral'
    elif char in MYANMAR_PUNCTUATION:
        return 'punctuation'
    else:
        # Also reached for ASAT (U+103A), VIRAMA (U+1039), and non-Myanmar input
        return 'unknown'

Validation Helper

def validate_syllable_structure(syllable: str) -> tuple[bool, str]:
    """Validate Myanmar syllable structure."""
    if not syllable:
        return False, "Empty syllable"

    if syllable[0] not in CONSONANTS:
        return False, "Must start with consonant"

    # Check medial order using ORDER_WEIGHTS
    medials = [c for c in syllable if c in MEDIALS]
    if medials:
        weights = [ORDER_WEIGHTS[m] for m in medials]
        if weights != sorted(weights):
            return False, "Invalid medial order"
        if len(medials) != len(set(medials)):
            return False, "Duplicate medials"

    # Check medial sequence is valid
    medial_str = ''.join(medials)
    if medial_str and medial_str not in VALID_MEDIAL_SEQUENCES:
        return False, "Invalid medial combination"

    return True, "Valid"

Canonical Character Ordering (UTN #11)

Correct canonical order for Myanmar syllable characters (Unicode storage order):

1. Consonant (required)
2. Virama (္) + stacked consonant (if stacking)
3. Medial YA (ျ) - slot 3
4. Medial RA (ြ) - slot 4
5. Medial WA (ွ) - slot 5
6. Medial HA (ှ) - slot 6
7. Vowel E (ေ) - visually left but stored here
8. Upper vowels (ိ, ီ, ဲ)
9. Tall A (ါ) or AA (ာ)
10. Asat (်) - when forming final consonant
11. Lower vowels (ု, ူ)
12. Anusvara (ံ)
13. Tone marks: Dot Below (့), Visarga (း)
14. Final Virama (္) - rare, for special stacking

Note on Asat position: The Asat (်) appears in position 10 when it forms a final consonant cluster (e.g., မြန်), interleaving with vowels. This is distinct from its use in Kinzi patterns where it appears earlier.