This page catalogs every character set, validation set, enum, and default value defined in the constants modules — from the 33 base consonants through medial compatibility tables to POS tag constants.
Source: src/myspellchecker/core/constants/myanmar_constants.py, src/myspellchecker/core/constants/core_constants.py, src/myspellchecker/core/constants/detector_thresholds.py, and src/myspellchecker/core/constants/pipeline_constants.py
Unicode Ranges
Main Myanmar Block
# Tuple: (start, end) inclusive boundaries
MYANMAR_RANGE = (0x1000, 0x109F)
Extended Blocks
# Myanmar Extended-A: U+AA60 to U+AA7F (tuple)
MYANMAR_EXTENDED_A_RANGE = (0xAA60, 0xAA7F)
# Myanmar Extended-B: U+A9E0 to U+A9FF (tuple)
MYANMAR_EXTENDED_B_RANGE = (0xA9E0, 0xA9FF)
# Regex pattern for all Myanmar ranges
MYANMAR_RANGE_REGEX_STR = r"[\u1000-\u109F\uA9E0-\uA9FF\uAA60-\uAA7F]"
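As a rough illustration of how these ranges might be used, the sketch below checks membership by code point and counts matches with the combined regex. The constant values are copied from the listing above; `in_any_myanmar_range` and `count_myanmar_chars` are hypothetical helper names, not part of the library.

```python
import re

# Values copied from the listing above.
MYANMAR_RANGE = (0x1000, 0x109F)
MYANMAR_EXTENDED_A_RANGE = (0xAA60, 0xAA7F)
MYANMAR_EXTENDED_B_RANGE = (0xA9E0, 0xA9FF)
MYANMAR_RANGE_REGEX_STR = r"[\u1000-\u109F\uA9E0-\uA9FF\uAA60-\uAA7F]"

_ALL_RANGES = (MYANMAR_RANGE, MYANMAR_EXTENDED_A_RANGE, MYANMAR_EXTENDED_B_RANGE)
_MYANMAR_RE = re.compile(MYANMAR_RANGE_REGEX_STR)


def in_any_myanmar_range(char: str) -> bool:
    """Hypothetical helper: True if char lies in any of the three blocks."""
    code_point = ord(char)
    # Both tuple boundaries are inclusive, hence <= on each side.
    return any(start <= code_point <= end for start, end in _ALL_RANGES)


def count_myanmar_chars(text: str) -> int:
    """Hypothetical helper: count characters matched by the combined regex."""
    return len(_MYANMAR_RE.findall(text))
```

The tuple form suits single-character checks, while the regex form is convenient for scanning mixed-script text.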
Extended-A Character Ranges (U+AA60–U+AA7F)
| Range | Usage |
|---|---|
| U+AA60–U+AA6F | Shan consonants |
| U+AA70–U+AA76 | Shan vowels and tones |
| U+AA77–U+AA79 | Shan symbols |
Extended-B Character Ranges (U+A9E0–U+A9FF)
| Range | Usage |
|---|---|
| U+A9E0–U+A9E4 | Shan letters |
| U+A9E5 | Shan sign |
| U+A9E6 | Reduplication mark |
Character Set Constants
# Characters in Extended-A block (set)
MYANMAR_EXTENDED_A_CHARS = set(chr(c) for c in range(0xAA60, 0xAA80))
# Characters in Extended-B block (set)
MYANMAR_EXTENDED_B_CHARS = set(chr(c) for c in range(0xA9E0, 0xAA00))
# All Myanmar characters combined (set)
ALL_MYANMAR_CHARS = (
set(chr(c) for c in range(0x1000, 0x10A0))
| MYANMAR_EXTENDED_A_CHARS
| MYANMAR_EXTENDED_B_CHARS
)
# Core Burmese-only characters (U+1000-U+104F minus non-standard chars)
MYANMAR_CORE_CHARS: Set[str]
# Extended Core Block (U+1050-U+109F) - Shan/Mon/Karen additions
MYANMAR_EXTENDED_CORE_BLOCK: Set[str]
# All extended blocks combined (out of scope for v1.0)
EXTENDED_MYANMAR_CHARS: Set[str]
Helper Functions
def get_myanmar_char_set(allow_extended: bool = False) -> Set[str]:
    """Get Myanmar character set based on scope."""
def has_extended_myanmar_chars(text: str) -> bool:
    """Check if text contains Extended Myanmar characters."""
def is_myanmar_text(text: str, allow_extended: bool = False) -> bool:
    """Check if text contains any Myanmar characters."""
Character Sets
Consonants
# 34 consonants (U+1000 to U+1021) plus Great Sa (U+103F)
# U+1021 (အ) is the vowel carrier but functions as a consonant syllable-initially
CONSONANTS = set(chr(i) for i in range(0x1000, 0x1022))
CONSONANTS.add(GREAT_SA) # U+103F added separately
| Char | Code Point | Name | Romanization |
|---|---|---|---|
| က | U+1000 | KA | ka |
| ခ | U+1001 | KHA | kha |
| ဂ | U+1002 | GA | ga |
| ဃ | U+1003 | GHA | gha |
| င | U+1004 | NGA | nga |
| စ | U+1005 | CA | sa |
| ဆ | U+1006 | CHA | hsa |
| ဇ | U+1007 | JA | za |
| ဈ | U+1008 | JHA | zha |
| ဉ | U+1009 | NYA (archaic) | nya |
| ည | U+100A | NYA | nya |
| ဋ | U+100B | TTA | tta |
| ဌ | U+100C | TTHA | ttha |
| ဍ | U+100D | DDA | dda |
| ဎ | U+100E | DDHA | ddha |
| ဏ | U+100F | NNA | nna |
| တ | U+1010 | TA | ta |
| ထ | U+1011 | THA | hta |
| ဒ | U+1012 | DA | da |
| ဓ | U+1013 | DHA | dha |
| န | U+1014 | NA | na |
| ပ | U+1015 | PA | pa |
| ဖ | U+1016 | PHA | pha |
| ဗ | U+1017 | BA | ba |
| ဘ | U+1018 | BHA | bha |
| မ | U+1019 | MA | ma |
| ယ | U+101A | YA | ya |
| ရ | U+101B | RA | ra |
| လ | U+101C | LA | la |
| ဝ | U+101D | WA | wa |
| သ | U+101E | SA | tha |
| ဟ | U+101F | HA | ha |
| ဠ | U+1020 | LLA | lla |
| အ | U+1021 | A (vowel carrier) | a |
| ဿ | U+103F | GREAT SA (added separately) | ssa |
Non-Standard Characters
# Mon/Shan specific chars in Core Block -- not used in standard Burmese
NON_STANDARD_CHARS = {
    "\u1022",  # MYANMAR LETTER SHAN A
    "\u1028",  # MYANMAR LETTER MON E
    "\u1033",  # MYANMAR VOWEL SIGN MON II
    "\u1034",  # MYANMAR VOWEL SIGN MON O
    "\u1035",  # MYANMAR VOWEL SIGN E ABOVE
}
Independent Vowels
# Stand-alone vowels (U+1021 to U+102A)
# Type: set
# Full set includes vowel carrier (U+1021) and Shan letter (U+1022)
INDEPENDENT_VOWELS = set(chr(i) for i in range(0x1021, 0x102B))
# Strict set: standard Burmese only (excludes carrier and non-standard)
INDEPENDENT_VOWELS_STRICT = {
"\u1023", # I
"\u1024", # II
"\u1025", # U
"\u1026", # UU
"\u1027", # E
"\u1029", # O
"\u102a", # AU
}
# Vowel carrier -- behaves like a consonant in syllable structure
VOWEL_CARRIER = "\u1021" # (U+1021)
Note: Use INDEPENDENT_VOWELS_STRICT for standard Burmese-only validation.
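To see exactly what the strict set leaves out, the following sketch rebuilds both sets from the definitions above and lists their difference. `excluded_code_points` is a hypothetical helper name used only for this illustration.

```python
# Both sets are rebuilt from the definitions shown above.
INDEPENDENT_VOWELS = set(chr(i) for i in range(0x1021, 0x102B))
INDEPENDENT_VOWELS_STRICT = {
    "\u1023", "\u1024", "\u1025", "\u1026", "\u1027", "\u1029", "\u102a",
}


def excluded_code_points() -> list:
    """Hypothetical helper: code points present in the full set only."""
    excluded = INDEPENDENT_VOWELS - INDEPENDENT_VOWELS_STRICT
    return sorted(f"U+{ord(ch):04X}" for ch in excluded)


print(excluded_code_points())  # ['U+1021', 'U+1022', 'U+1028']
```

The three excluded code points are the vowel carrier and the two non-standard letters that fall inside the U+1021 to U+102A range.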
Vowel Signs (Dependent Vowels)
# Vowel signs attached to consonants (U+102B to U+1032)
# Type: set
VOWEL_SIGNS = set(chr(i) for i in range(0x102B, 0x1033))
| Char | Code Point | Name |
|---|---|---|
| ါ | U+102B | TALL AA |
| ာ | U+102C | AA |
| ိ | U+102D | I |
| ီ | U+102E | II |
| ု | U+102F | U |
| ူ | U+1030 | UU |
| ေ | U+1031 | E (left-side, stored after consonant) |
| ဲ | U+1032 | AI |
Vowel Classification
# Position-based vowel subsets (used in validation)
UPPER_VOWELS = {"\u102d", "\u102e", "\u1032"} # I, II, AI
LOWER_VOWELS = {"\u102f", "\u1030"} # U, UU
# Invalid E-vowel combinations
INVALID_E_COMBINATIONS = {"\u102d", "\u102e", "\u102f", "\u1030"}
# Valid Vowel Combinations (Digraphs)
VALID_VOWEL_COMBINATIONS = {
frozenset({"\u1031", "\u102c"}), # E + Aa
frozenset({"\u1031", "\u102b"}), # E + Tall A
frozenset({"\u102d", "\u102f"}), # I + U
}
# Anusvara (U+1036) Compatibility
ANUSVARA_ALLOWED_VOWELS = {"\u102d", "\u102f"} # I, U
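Because `VALID_VOWEL_COMBINATIONS` stores each digraph as a `frozenset`, a lookup does not depend on the order in which the two vowels are supplied. The sketch below shows one way such a check could look; `is_valid_vowel_pair` and `e_vowel_conflict` are hypothetical names, and the library's own validation may be structured differently.

```python
# Values copied from the listing above.
VOWEL_E = "\u1031"
INVALID_E_COMBINATIONS = {"\u102d", "\u102e", "\u102f", "\u1030"}
VALID_VOWEL_COMBINATIONS = {
    frozenset({"\u1031", "\u102c"}),  # E + Aa
    frozenset({"\u1031", "\u102b"}),  # E + Tall A
    frozenset({"\u102d", "\u102f"}),  # I + U
}


def is_valid_vowel_pair(first: str, second: str) -> bool:
    """Hypothetical check: order-insensitive digraph lookup."""
    return frozenset({first, second}) in VALID_VOWEL_COMBINATIONS


def e_vowel_conflict(vowels: str) -> bool:
    """Hypothetical check: E vowel combined with a vowel it cannot take."""
    return VOWEL_E in vowels and any(v in INVALID_E_COMBINATIONS for v in vowels)
```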
Medials
# Consonant modifiers (U+103B to U+103E)
# Type: set
MEDIALS = {"\u103b", "\u103c", "\u103d", "\u103e"}
# Individual medial constants
MEDIAL_YA = "\u103b" # Ya-pin (ျ) - palatal glide /j/
MEDIAL_RA = "\u103c" # Ya-yit (ြ) - rhotic glide /r/
MEDIAL_WA = "\u103d" # Wa-hswe (ွ)
MEDIAL_HA = "\u103e" # Ha-htoe (ှ)
# Phonetic aliases
MEDIAL_YA_PIN = MEDIAL_YA # ျ - /j/ glide (palatal approximant)
MEDIAL_YA_YIT = MEDIAL_RA # ြ - /r/ glide (rhotic approximant)
Signs and Marks
# Tone marks
# Type: set -- includes all three tone/nasal markers
TONE_MARKS = {
"\u1036", # Anusvara (Thay-thay-tin) - nasalization
"\u1037", # Dot Below (Auk-myit) - creaky tone
"\u1038", # Visarga (Wit-sa-pauk) - emphatic/final
}
# Special signs (individual constants)
ANUSVARA = "\u1036" # U+1036 - Nasal mark
DOT_BELOW = "\u1037" # U+1037 - Creaky tone
VISARGA = "\u1038" # U+1038 - Emphatic/final
VIRAMA = "\u1039" # U+1039 - Stacker (pat sint)
ASAT = "\u103a" # U+103A - Vowel killer
NGA = "\u1004" # U+1004 - Nga consonant
GREAT_SA = "\u103f" # U+103F - Great Sa
# Specific vowel combinations
VOWEL_E = "\u1031" # E vowel (pre-consonant)
VOWEL_AI = "\u1032" # AI vowel
# English token placeholder
ENG_TOKEN = "<ENG>"
# Dependent various signs (U+1032 - U+103E)
DEPENDENT_VARIOUS_SIGNS = set(chr(i) for i in range(0x1032, 0x103F))
Myanmar Numerals
# Myanmar digits (U+1040 to U+1049)
# Type: set
MYANMAR_NUMERALS = {
"\u1040", # 0
"\u1041", # 1
"\u1042", # 2
"\u1043", # 3
"\u1044", # 4
"\u1045", # 5
"\u1046", # 6
"\u1047", # 7
"\u1048", # 8
"\u1049", # 9
}
# Myanmar numeral words (written form, Dict[str, int])
MYANMAR_NUMERAL_WORDS = {
"တစ်": 1,
"နှစ်": 2,
"သုံး": 3,
...
"သန်း": 1000000,
}
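Since the ten digits occupy consecutive code points starting at U+1040, a digit string can be converted arithmetically. The following is a minimal sketch; `myanmar_digits_to_int` is a hypothetical helper, not a documented library function.

```python
MYANMAR_ZERO = 0x1040  # code point of the Myanmar digit zero


def myanmar_digits_to_int(text: str) -> int:
    """Hypothetical helper: convert a string of Myanmar digits to an int."""
    if not text:
        raise ValueError("empty digit string")
    value = 0
    for ch in text:
        offset = ord(ch) - MYANMAR_ZERO
        if not 0 <= offset <= 9:
            raise ValueError(f"not a Myanmar digit: {ch!r}")
        # Standard positional accumulation, most significant digit first.
        value = value * 10 + offset
    return value
```

For example, the digit string U+1042 U+1040 U+1042 U+1045 converts to 2025.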
Punctuation
# Myanmar punctuation (U+104A to U+104F)
# Type: set at this point; the module later redefines MYANMAR_PUNCTUATION as a frozenset
MYANMAR_PUNCTUATION = set(chr(i) for i in range(0x104A, 0x1050))
# Common punctuation (mixed Myanmar and ASCII)
COMMON_PUNCTUATION = set(...)
# Myanmar-specific separators
SENTENCE_SEPARATOR = "။" # Myanmar full stop (U+104B)
PHRASE_SEPARATOR = "၊" # Myanmar comma (U+104A)
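One plausible use of the two separators is coarse text segmentation. The sketch below splits on the full stop first and on the comma second; `split_sentences` and `split_phrases` are hypothetical helpers and do not describe the library's segmenter.

```python
# Values copied from the listing above, written as escapes.
SENTENCE_SEPARATOR = "\u104b"  # Myanmar full stop
PHRASE_SEPARATOR = "\u104a"    # Myanmar comma


def _split_clean(text: str, separator: str) -> list:
    """Split on a separator and drop empty or whitespace-only pieces."""
    return [part.strip() for part in text.split(separator) if part.strip()]


def split_sentences(text: str) -> list:
    """Hypothetical helper: split text on the Myanmar full stop."""
    return _split_clean(text, SENTENCE_SEPARATOR)


def split_phrases(sentence: str) -> list:
    """Hypothetical helper: split one sentence on the Myanmar comma."""
    return _split_clean(sentence, PHRASE_SEPARATOR)
```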
Section Marks and Logographic Particles
# Section marks (frozenset)
SECTION_MARKS = frozenset({"\u104c", "\u104d"}) # ၌ ၍
# Reference marks (frozenset)
REFERENCE_MARKS = frozenset({"\u104e", "\u104f"}) # ၎ ၏
# All logographic particles combined (frozenset)
LOGOGRAPHIC_PARTICLES = SECTION_MARKS | REFERENCE_MARKS
# Valid particles (frozenset)
VALID_PARTICLES = frozenset(["\u104c", "\u104d", "\u104e", "\u104f"])
# Myanmar special symbols (frozenset)
MYANMAR_SPECIAL_SYMBOLS = LOGOGRAPHIC_PARTICLES | MYANMAR_PUNCTUATION
Validation Sets
# All valid medial combinations in canonical order: Ya > Ra > Wa > Ha
# Type: set of strings
VALID_MEDIAL_SEQUENCES = {
# Four-medial
"ျြွှ",
# Three-medial
"ျြွ", "ျြှ", "ျွှ", "ြွှ",
# Two-medial
"ျြ", "ျွ", "ျှ", "ြွ", "ြှ", "ွှ",
# Single medials
"ျ", "ြ", "ွ", "ှ",
}
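The 15 entries listed above correspond to every non-empty subset of the four medials written in canonical order. As a sketch of that relationship, the set can be regenerated with `itertools.combinations`, which preserves input order; `generate_medial_sequences` is a hypothetical name used only here.

```python
from itertools import combinations

# Canonical order: Ya, Ra, Wa, Ha.
CANONICAL_MEDIALS = ["\u103b", "\u103c", "\u103d", "\u103e"]


def generate_medial_sequences() -> set:
    """Hypothetical helper: all non-empty medial subsets in canonical order."""
    return {
        "".join(combo)
        for size in range(1, len(CANONICAL_MEDIALS) + 1)
        for combo in combinations(CANONICAL_MEDIALS, size)
    }


sequences = generate_medial_sequences()
print(len(sequences))  # 15
```

A sequence such as Ra followed by Ya is absent from the generated set because it violates the canonical order.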
Normalization Order Weights
# UTN #11 canonical order for medials and vowels
# Type: dict (character -> numeric weight)
ORDER_WEIGHTS = {
"\u103b": 10, # Medial Ya
"\u103c": 11, # Medial Ra
"\u103d": 12, # Medial Wa
"\u103e": 13, # Medial Ha
"\u1031": 20, # Vowel E
"\u102d": 21, # Vowel I (Upper)
"\u102e": 21, # Vowel II (Upper)
"\u1032": 21, # Vowel AI (Upper)
"\u102f": 22, # Vowel U (Lower)
"\u1030": 22, # Vowel UU (Lower)
"\u102b": 21.4, # Vowel A (Tall)
"\u102c": 21.4, # Vowel AA
"\u1036": 30, # Anusvara
"\u103a": 21.5, # Asat
"\u1037": 32, # Dot Below
"\u1038": 33, # Visarga
"\u1039": 40, # Virama
}
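To illustrate how a weight table like this can drive reordering, the sketch below stable-sorts each run of consecutive weighted characters and leaves all other characters in place. It is a simplified assumption about usage, not the library's normalizer, and `sort_weighted_runs` is a hypothetical name.

```python
# Weights copied from the table above.
ORDER_WEIGHTS = {
    "\u103b": 10, "\u103c": 11, "\u103d": 12, "\u103e": 13,
    "\u1031": 20,
    "\u102d": 21, "\u102e": 21, "\u1032": 21,
    "\u102b": 21.4, "\u102c": 21.4,
    "\u103a": 21.5,
    "\u102f": 22, "\u1030": 22,
    "\u1036": 30, "\u1037": 32, "\u1038": 33, "\u1039": 40,
}


def sort_weighted_runs(text: str) -> str:
    """Hypothetical helper: sort each run of weighted marks by weight."""
    result = []
    run = []
    for ch in text:
        if ch in ORDER_WEIGHTS:
            run.append(ch)
            continue
        # An unweighted character (e.g. a consonant) ends the current run.
        result.extend(sorted(run, key=ORDER_WEIGHTS.__getitem__))
        run = []
        result.append(ch)
    result.extend(sorted(run, key=ORDER_WEIGHTS.__getitem__))
    return "".join(result)
```

Because `sorted` is stable, characters that share a weight keep their original relative order.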
Zero-Width Characters
# Characters to remove during normalization
# Type: set
ZERO_WIDTH_CHARS = {
"\u200b", # ZERO WIDTH SPACE
"\u200c", # ZERO WIDTH NON-JOINER
"\u200d", # ZERO WIDTH JOINER
"\ufeff", # ZERO WIDTH NO-BREAK SPACE (BOM)
}
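Removing these characters can be done in a single pass with `str.translate`. The following is a minimal sketch; `strip_zero_width` is a hypothetical helper name.

```python
# Values copied from the listing above.
ZERO_WIDTH_CHARS = {"\u200b", "\u200c", "\u200d", "\ufeff"}

# Mapping a code point to None tells str.translate to delete it.
_ZERO_WIDTH_TABLE = {ord(ch): None for ch in ZERO_WIDTH_CHARS}


def strip_zero_width(text: str) -> str:
    """Hypothetical helper: drop all zero-width characters from text."""
    return text.translate(_ZERO_WIDTH_TABLE)
```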
Medial Compatibility
Defines which consonants can validly combine with each medial.
# Consonants that can take Medial Ya-pin (ျ U+103B)
COMPATIBLE_YA: set # Also aliased as COMPATIBLE_YA_PIN
# Consonants that can take Medial Ya-yit (ြ U+103C)
COMPATIBLE_RA: set # Also aliased as COMPATIBLE_YA_YIT
# Consonants that can take Medial Wa (ွ U+103D)
COMPATIBLE_WA: set # Broadest compatibility
# Consonants that can take Medial Ha (ှ U+103E)
COMPATIBLE_HA: set # Sonorant consonants only
Phonetic Character Sets
# Sonorant consonants (nasals, liquids, glides)
SONORANTS: set
# Stop/obstruent consonants for syllable-final position
STOP_FINALS: set
Stacking and Kinzi
# Valid consonant stacking pairs for Pali/Sanskrit loanwords
# Type: set of (upper_consonant, lower_consonant) tuples
STACKING_EXCEPTIONS: set
# Kinzi valid followers (consonants that can follow Kinzi pattern)
# Type: set
KINZI_VALID_FOLLOWERS: set
Part-of-Speech Tag Constants
Granular particle tag constants defined in core_constants.py:
P_SUBJ = "P_SUBJ" # Subject/topic marker
P_OBJ = "P_OBJ" # Object marker
P_SENT = "P_SENT" # Sentence ending particle
P_MOD = "P_MOD" # Modifier particle
P_LOC = "P_LOC" # Location/direction marker
Skipped Context Words
High-frequency particles skipped by the Context Validator:
# Type: set
SKIPPED_CONTEXT_WORDS = {
# Interjections and emphasis particles
"ကွာ", "ဗျာ", "နော်", "ဟေ့",
"ကွ", "လေ", "ပါ", "ပဲ", "ပေါ့",
# Subject/object markers
"က", "ကို", "သည်", "တယ်",
# Locative particles
"မှာ", "မှ", "တွင်",
# Comitative/conjunctive
"နဲ့", "နှင့်", "နှင်",
# Genitive/possessive
"ရဲ့", "၏",
# Other common particles
"များ", "လည်း", "တော့", "ပြီး", "ဖို့", "အတွက်",
# Question and ending particles
"လား", "လဲ", "လို့",
# Common verb complements (too short for meaningful n-gram validation)
"ကျ", "ပြ", "ချ",
}
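A set like this lends itself to a simple membership filter applied before context scoring. The sketch below uses only a small subset of the entries listed above and a hypothetical function name; how the Context Validator actually applies the set is not shown on this page.

```python
# Illustrative subset of SKIPPED_CONTEXT_WORDS (not the full set).
SKIPPED_SUBSET = {"က", "ကို", "သည်", "မှာ"}


def words_for_context_check(words: list) -> list:
    """Hypothetical helper: keep only words that are not skipped particles."""
    return [word for word in words if word not in SKIPPED_SUBSET]
```

Set membership keeps the filter at constant time per word, which matters for particles that occur in nearly every sentence.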
Morphology Constants
Suffix sets for OOV POS guessing (from core_constants.py):
# Type: frozenset[str]
VERB_SUFFIXES: frozenset[str] # e.g., "ပြီ", "ပြီး", "ခဲ့", ...
NOUN_SUFFIXES: frozenset[str] # e.g., "များ", "တွေ", "ခြင်း", ...
ADVERB_SUFFIXES: frozenset[str] # e.g., "စွာ", "တိုင်း", ...
Enums
class ValidationLevel(str, Enum):
    SYLLABLE = "syllable"  # Fast, only checks valid syllables
    WORD = "word"  # Comprehensive, checks words and context
class ErrorType(str, Enum):
    SYLLABLE = "invalid_syllable"
    WORD = "invalid_word"
    CONTEXT_PROBABILITY = "context_probability"
    GRAMMAR = "grammar_error"
    PARTICLE_TYPO = "particle_typo"
    MEDIAL_CONFUSION = "medial_confusion"
    COLLOQUIAL_VARIANT = "colloquial_variant"
    HOMOPHONE_ERROR = "homophone_error"
    TONE_AMBIGUITY = "tone_ambiguity"
    ZAWGYI_ENCODING = "zawgyi_encoding"
    MISSING_ASAT = "missing_asat"
    PARTICLE_MISUSE = "particle_misuse"
    COLLOCATION_ERROR = "collocation_error"
    # See /reference/error-types for the complete list of error types
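Both enums mix in `str`, so members compare equal to their string values and serialize to JSON as plain strings. The sketch below demonstrates this on a trimmed local copy of `ErrorType` with two members; `parse_error_type` is a hypothetical helper.

```python
import json
from enum import Enum
from typing import Optional


class ErrorType(str, Enum):
    # Trimmed local copy for illustration; see the full listing above.
    SYLLABLE = "invalid_syllable"
    WORD = "invalid_word"


def parse_error_type(raw: str) -> Optional[ErrorType]:
    """Hypothetical helper: look up a member by value, or return None."""
    try:
        return ErrorType(raw)
    except ValueError:
        return None


# str mixin: members compare equal to their values.
print(ErrorType.SYLLABLE == "invalid_syllable")  # True
# json treats the member as an ordinary string.
print(json.dumps({"type": ErrorType.WORD}))  # {"type": "invalid_word"}
```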
Algorithm Thresholds
@dataclass(frozen=True, slots=True)
class AlgorithmThresholds:
    # SymSpell defaults
    max_edit_distance: int = 2
    prefix_length: int = 10
    # Beam search defaults (for Viterbi, joint tagging, etc.)
    beam_width_default: int = 25
    beam_width_minimal: int = 10
    beam_width_pos_tagger: int = 15
    # N-gram context checker defaults
    ngram_threshold: float = 0.01
    # Suggestion ranking
    candidate_limit: int = 50
    min_probability_denom: float = 0.001
    # LRU cache size for edit distance, POS lookups, etc.
    lru_cache_size: int = 4096
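Because the dataclass is frozen, instances cannot be mutated in place; a modified copy is produced with `dataclasses.replace` instead. The sketch below uses a trimmed local copy of the class (three fields, and `slots=True` omitted so it also runs on Python versions before 3.10); `can_mutate` is a hypothetical helper.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class AlgorithmThresholds:
    # Trimmed local copy for illustration; see the full listing above.
    max_edit_distance: int = 2
    beam_width_default: int = 25
    ngram_threshold: float = 0.01


def can_mutate(thresholds: AlgorithmThresholds) -> bool:
    """Hypothetical helper: report whether in-place assignment succeeds."""
    try:
        thresholds.max_edit_distance = 1  # rejected on a frozen dataclass
    except FrozenInstanceError:
        return False
    return True


defaults = AlgorithmThresholds()
# replace() builds a new instance with the given fields overridden.
strict = replace(defaults, max_edit_distance=1)
```

After this runs, `defaults.max_edit_distance` is still 2 while `strict.max_edit_distance` is 1.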
Usage in Code
Importing Constants
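from myspellchecker.core.constants import (
    CONSONANTS,
    INDEPENDENT_VOWELS,  # used by classify_myanmar_char below
    MEDIALS,
    VOWEL_SIGNS,
    ORDER_WEIGHTS,
    ZERO_WIDTH_CHARS,
    MYANMAR_RANGE,
    MYANMAR_NUMERALS,
    MYANMAR_PUNCTUATION,  # used by classify_myanmar_char below
    TONE_MARKS,
    VALID_MEDIAL_SEQUENCES,
    SKIPPED_CONTEXT_WORDS,
    ValidationLevel,
    ErrorType,
)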
Character Classification
def classify_myanmar_char(char: str) -> str:
    """Classify a Myanmar character by type."""
    if char in CONSONANTS:
        return 'consonant'
    elif char in MEDIALS:
        return 'medial'
    elif char in VOWEL_SIGNS:
        return 'vowel_sign'
    elif char in INDEPENDENT_VOWELS:
        return 'independent_vowel'
    elif char in TONE_MARKS:
        return 'tone_mark'
    elif char in MYANMAR_NUMERALS:
        return 'numeral'
    elif char in MYANMAR_PUNCTUATION:
        return 'punctuation'
    else:
        return 'unknown'
Validation Helper
def validate_syllable_structure(syllable: str) -> tuple[bool, str]:
    """Validate Myanmar syllable structure."""
    if not syllable:
        return False, "Empty syllable"
    if syllable[0] not in CONSONANTS:
        return False, "Must start with consonant"
    # Check medial order using ORDER_WEIGHTS
    medials = [c for c in syllable if c in MEDIALS]
    if medials:
        weights = [ORDER_WEIGHTS[m] for m in medials]
        if weights != sorted(weights):
            return False, "Invalid medial order"
        if len(medials) != len(set(medials)):
            return False, "Duplicate medials"
    # Check medial sequence is valid
    medial_str = ''.join(medials)
    if medial_str and medial_str not in VALID_MEDIAL_SEQUENCES:
        return False, "Invalid medial combination"
    return True, "Valid"
Canonical Character Ordering (UTN #11)
Correct canonical order for Myanmar syllable characters (Unicode storage order):
1. Consonant (required)
2. Virama (္) + stacked consonant (if stacking)
3. Medial YA (ျ) - slot 3
4. Medial RA (ြ) - slot 4
5. Medial WA (ွ) - slot 5
6. Medial HA (ှ) - slot 6
7. Vowel E (ေ) - visually left but stored here
8. Upper vowels (ိ, ီ, ဲ)
9. Tall A (ါ) or AA (ာ)
10. Asat (်) - when forming final consonant
11. Lower vowels (ု, ူ)
12. Anusvara (ံ)
13. Tone marks: Dot Below (့), Visarga (း)
14. Final Virama (္) - rare, for special stacking
Note on Asat position: The Asat (်) appears in position 10 when it forms
a final consonant cluster (e.g., မြန်), interleaving with vowels. This is
distinct from its use in Kinzi patterns where it appears earlier.
See Also