v1.0 supports Standard Burmese (Myanmar) only. Other Myanmar-script languages (Shan, Karen, Mon, etc.) are planned for future releases.
Quickstart
Install, build a dictionary, and check your first text in 5 minutes.
How It Works, Layer by Layer
You can’t check what you can’t split. So mySpellChecker splits smaller, starting from syllables, the smallest reliable unit, then builds up layer by layer until nothing gets through.
Layer 1: Syllable Validation
22 structural rules, no dictionary needed. Catches invalid syllable structures, medial ordering and compatibility, tone mark errors, virama and stacking issues, kinzi patterns, vowel exclusivity, diacritic uniqueness, Great Sa rules, particle typos, and corruption detection. This alone catches ~90% of errors in O(1) time.
Layer 2: Word Validation
SymSpell O(1) symmetric delete lookup with a neural reranker (19-feature ONNX MLP). Handles out-of-vocabulary words, compound word resolution, ambiguous segmentation, morphological root recovery, colloquial variant detection, phonetic similarity matching, and Zawgyi detection. Only runs on text that passed syllable validation.
Layer 2.5: Grammar Rules
POS tagging (3 pluggable backends: rule-based, Viterbi HMM, Transformer) plus 8 grammar checkers driven by YAML rule files. Catches register mixing (colloquial/formal), aspect marker errors, classifier-noun agreement, negation pattern mismatches, compound formation issues, merged word/particle detection, particle context errors, and tense-time agreement violations.
Layer 3: Context & AI
12 priority-ordered validation strategies combining N-gram context, statistical confusable detection, MLP classification, phonetic analysis, and custom-trained AI models. Catches homophones, confusable variants (statistical + semantic via MLM), broken compounds, question structure errors, POS sequence violations, tone ambiguity, and deep semantic errors. AI strategies are opt-in and require trained models.
Faster rules always run first. Each layer builds on validated output from the layer below, and positions already flagged are skipped by later strategies. This means most errors are caught cheaply, and expensive operations (N-gram lookups, AI inference) only run on text that has already passed basic validation.
Why Myanmar Spell Checking Is Hard
Building a spell checker for Myanmar is fundamentally different from English. These are the challenges that made this problem unsolved for decades and that shaped every design decision in mySpellChecker.
No word boundaries
Most spell checkers assume whitespace = word boundary. Myanmar text is a continuous stream of characters with no delimiters. You can’t even begin checking until you figure out where one word ends and the next begins. mySpellChecker starts from syllables, the smallest reliable unit, and builds up.
Ambiguous segmentation
Without spaces, the same character sequence can be segmented into different valid words with different meanings. Only context tells you which split is correct. This makes every stage of the pipeline, from syllable grouping to word lookup, dependent on disambiguation.
Valid syllable ≠ Valid word
Myanmar syllable structure has strict rules: consonant, optional medials, vowel, optional tone. A syllable can pass every structural check and still form a word that doesn’t exist. Syllable validation catches invalid structures immediately, but real-word errors need dictionary lookup, grammar rules, and context analysis.
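The structure described above (consonant, optional medials, vowel, optional tone) can be approximated with a regular expression. This is a minimal sketch under stated assumptions: the pattern, the code point ranges, and the name `is_plausible_syllable` are illustrative and are not taken from mySpellChecker. It ignores stacking, kinzi, vowel exclusivity, and the other rules listed for Layer 1.

```python
import re

# Deliberately simplified Myanmar syllable shape (an assumption for
# illustration, not the library's 22 structural rules).
SYLLABLE = re.compile(
    "[\u1000-\u1021]"               # base consonant
    "\u103b?\u103c?\u103d?\u103e?"  # medials in canonical order: ya-pin, ya-yit, wa-hswe, ha-hto
    "[\u102b-\u1032\u1036]*"        # vowel signs and anusvara
    "(?:[\u1000-\u1021]\u103a)?"    # optional final consonant + asat
    "[\u1037\u1038]*"               # tone marks: dot below, visarga
)


def is_plausible_syllable(text: str) -> bool:
    """True if the whole string fits the simplified syllable shape."""
    return SYLLABLE.fullmatch(text) is not None


SCHOOL = "\u1000\u103b\u1031\u102c\u1004\u103a\u1038"  # the "school" example from this page
BAD_MEDIAL_ORDER = "\u1000\u103d\u103b"                # wa-hswe typed before ya-pin

print(is_plausible_syllable(SCHOOL))
print(is_plausible_syllable(BAD_MEDIAL_ORDER))
```

A structural check like this needs no dictionary, which is why it can run first. It also cannot tell whether a well-formed syllable belongs to a real word.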
Look-alike medials (ျ vs ြ)
Myanmar has four medial consonants (ျ ြ ွ ှ) that attach to base characters. The ya-pin (ျ) and ya-yit (ြ) look nearly identical but produce completely different words: ကျောင်း (school) vs ကြောင်း (reason). This is the most common Myanmar typo, and mySpellChecker has 49+ dedicated correction rules for medial confusion alone.
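One simple way to target this confusion is to generate candidates by flipping ya-pin and ya-yit and keeping those found in a dictionary. The sketch below assumes a plain set of words as the dictionary. The function names are hypothetical and this is not a reproduction of the library's 49+ rules.

```python
YA_PIN = "\u103b"  # ya-pin medial
YA_YIT = "\u103c"  # ya-yit medial


def medial_swap_candidates(word: str) -> list[str]:
    """Every variant of `word` with exactly one ya-pin/ya-yit flipped."""
    swap = {YA_PIN: YA_YIT, YA_YIT: YA_PIN}
    return [
        word[:i] + swap[ch] + word[i + 1:]
        for i, ch in enumerate(word)
        if ch in swap
    ]


def correct_medial(word: str, dictionary: set[str]) -> list[str]:
    """Suggest dictionary words reachable by a single medial flip."""
    if word in dictionary:
        return []  # already a known word; real-word errors need context
    return [c for c in medial_swap_candidates(word) if c in dictionary]


SCHOOL = "\u1000\u103b\u1031\u102c\u1004\u103a\u1038"
REASON = "\u1000\u103c\u1031\u102c\u1004\u103a\u1038"
print(medial_swap_candidates(SCHOOL) == [REASON])
```

When both spellings are valid words, as with the pair above, a flip-and-lookup step finds nothing to correct. That case falls to the context-based layers.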
Confusables: sound the same, mean something different
These aren’t just visual typos; they’re phonetically identical or near-identical words with different meanings. A simple dictionary lookup won’t catch them because both words exist. You need phonetic hashing, frequency analysis, and context to pick the right one.
Right word, wrong place (real-word errors)
Every word in the sentence is a real word. The sentence is still wrong. Real-word errors are the hardest class of spelling mistakes in any language. Catching these requires N-gram probabilities, POS sequence validation, or AI, which means layer 3 and beyond in the pipeline.
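To make the N-gram idea concrete, the sketch below scores two valid candidate words by how well each fits between its left and right neighbours, using add-one smoothed bigram probabilities. The counts are invented toy numbers, the tokenization is assumed, and the function names are hypothetical. It illustrates the technique and does not show mySpellChecker's actual scoring.

```python
from collections import Counter


def context_score(left, word, right, bigrams, unigrams, vocab_size):
    """Add-one smoothed P(word | left) * P(right | word)."""
    p_left = (bigrams[(left, word)] + 1) / (unigrams[left] + vocab_size)
    p_right = (bigrams[(word, right)] + 1) / (unigrams[word] + vocab_size)
    return p_left * p_right


def best_in_context(left, candidates, right, bigrams, unigrams):
    """Pick the candidate with the highest two-sided bigram score."""
    vocab_size = len(unigrams)
    return max(
        candidates,
        key=lambda w: context_score(left, w, right, bigrams, unigrams, vocab_size),
    )


DO = "\u101c\u102f\u1015\u103a"   # left neighbour
STAY = "\u1014\u1031"             # candidate 1
RED = "\u1014\u102e"              # candidate 2
Q = "\u101c\u1032"                # right neighbour (question particle)

# Toy counts, made up for this example only.
unigrams = Counter({DO: 50, STAY: 40, RED: 40, Q: 30})
bigrams = Counter({(DO, STAY): 20, (STAY, Q): 15})

print(best_in_context(DO, [STAY, RED], Q, bigrams, unigrams) == STAY)
```

Both candidates are dictionary words, so only the surrounding context separates them.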
Negation wraps around (circumfix)
Myanmar negation uses a circumfix, which is a prefix (မ) and a matching sentence-final particle (ဘူး) that wrap around the verb. Drop or mismatch either part and the sentence is grammatically broken, even though every individual word is valid. This pattern is rare across world languages and requires structural grammar checking, not just dictionary lookup.
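A structural check for this pattern can be sketched as follows. It assumes the clause has already been segmented into tokens with the negation prefix as its own token, and it only knows the one prefix and ending named above. Other negative constructions are ignored, and `check_negation` is a hypothetical name.

```python
from typing import Optional

NEG_PREFIX = "\u1019"            # negation prefix named in the text
NEG_ENDING = "\u1018\u1030\u1038"  # matching sentence-final particle


def check_negation(tokens: list[str]) -> Optional[str]:
    """Return a message if only one half of the circumfix is present."""
    has_prefix = NEG_PREFIX in tokens
    has_ending = bool(tokens) and tokens[-1] == NEG_ENDING
    if has_prefix and not has_ending:
        return "negation prefix without negative ending"
    if has_ending and not has_prefix:
        return "negative ending without negation prefix"
    return None


GO = "\u101e\u103d\u102c\u1038"
AFFIRMATIVE = "\u1010\u101a\u103a"

print(check_negation([NEG_PREFIX, GO, AFFIRMATIVE]))  # mismatched halves
print(check_negation([NEG_PREFIX, GO, NEG_ENDING]))   # well-formed
```

Every token in the failing example is a valid word. The error is visible only when the two halves are checked together.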
Colloquial vs Formal register mixing
Neither form is “wrong” on its own, but mixing them in one sentence is a grammar error. A news article using colloquial endings, or a chat message using literary forms, both need flagging. You need POS tagging and register-aware grammar rules, not just a dictionary.
Compound word explosion
Myanmar builds words by combining simpler ones: စာ (text) + အုပ် (bundle) = စာအုပ် (book). But the number of possible combinations explodes. You can’t just check each part independently; you need a compound resolution algorithm to know which combinations actually exist.
Reduplication is grammar, not a typo
Myanmar uses reduplication as a grammatical pattern: လှလှ (very beautiful), ခဏခဏ (frequently). A naive spell checker would flag these as copy-paste errors. You need grammar rules that recognize reduplication as intentional and know which words can reduplicate.
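The simplest case, a word made of two identical halves, can be recognised as shown below. This sketch handles only that AA shape, compares code points and not syllables, and takes the set of words allowed to reduplicate as a hypothetical input. The function names are assumptions.

```python
from typing import Optional


def reduplication_base(word: str) -> Optional[str]:
    """Return the repeated half if `word` is exactly base + base."""
    half, remainder = divmod(len(word), 2)
    if remainder == 0 and half > 0 and word[:half] == word[half:]:
        return word[:half]
    return None


def is_valid_reduplication(word: str, reduplicable: set[str]) -> bool:
    """Accept the word only if its base is known to reduplicate."""
    base = reduplication_base(word)
    return base is not None and base in reduplicable


BEAUTIFUL = "\u101c\u103e"  # base of the first example in the text
MOMENT = "\u1001\u100f"     # base of the second example

allowed = {BEAUTIFUL, MOMENT}  # hypothetical whitelist
print(is_valid_reduplication(BEAUTIFUL * 2, allowed))
print(is_valid_reduplication(MOMENT * 2, allowed))
```

Checking the base against a whitelist is what separates intentional reduplication from an accidental repeat.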
Pali & Sanskrit loanwords
Words like သမ္မတ (president) use subscript stacked consonants that follow Pali/Sanskrit rules, not native Myanmar rules. The spell checker needs two rule systems running in parallel: one for native words and one for loanwords.
No capitalization, so names look like common words
English uses capitals to signal proper nouns and sentence starts. Myanmar has no such mechanism, so မြန်မာ could be the country or the adjective. Without named entity recognition (NER), the spell checker would flag people’s names and place names as errors.
Encoding chaos (Unicode vs Zawgyi)
~30% of Myanmar text online still uses Zawgyi, a legacy encoding that looks identical to Unicode but uses completely different code points. Even within Unicode, the same character can have multiple valid code point orderings. Text must be detected, converted, and normalized before spell checking can even begin.
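The normalization step can be illustrated with the Python standard library alone. This sketch strips common zero-width characters and applies a Unicode normalization form. It does not detect or convert Zawgyi and does not reorder diacritics. The name `normalize_text` and the choice of characters to strip are assumptions.

```python
import unicodedata

# Zero-width space, non-joiner, joiner, and byte order mark.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}


def normalize_text(text: str, form: str = "NFC") -> str:
    """Remove zero-width characters, then apply a Unicode normalization form."""
    stripped = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return unicodedata.normalize(form, stripped)


raw = "\u1000\u200b\u1001"  # two consonants with a hidden zero-width space
print(len(raw), len(normalize_text(raw)))
```

Running this before any lookup keeps visually identical strings from missing the dictionary because of invisible characters.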
Edit distance lies for Myanmar
Standard Levenshtein treats all swaps equally. But in Myanmar, medial confusion (ျ↔ြ) accounts for ~30% of all errors and should cost 0.2, while a cross-class swap costs 0.5. Without Myanmar-weighted edit distance, the most common corrections rank equal to implausible ones. mySpellChecker uses a custom cost matrix derived from corpus analysis so plausible errors always surface first.
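A weighted edit distance of this kind can be sketched by making the substitution cost depend on the character pair. The 0.2 medial cost comes from the paragraph above. Everything else is an assumption: insertions, deletions, and all other substitutions are charged a placeholder 1.0, because the character classes behind the 0.5 cross-class figure are not defined here. This is not the library's cost matrix.

```python
MEDIAL_PAIR = frozenset({"\u103b", "\u103c"})  # ya-pin and ya-yit


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing character a with character b."""
    if a == b:
        return 0.0
    if {a, b} == MEDIAL_PAIR:
        return 0.2  # figure quoted in the text
    return 1.0      # placeholder for every other substitution


def weighted_distance(s: str, t: str) -> float:
    """Levenshtein distance with pair-dependent substitution costs."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                          # delete a
                curr[j - 1] + 1.0,                      # insert b
                prev[j - 1] + substitution_cost(a, b),  # substitute or match
            ))
        prev = curr
    return prev[-1]


SCHOOL = "\u1000\u103b\u1031\u102c\u1004\u103a\u1038"
REASON = "\u1000\u103c\u1031\u102c\u1004\u103a\u1038"
print(weighted_distance(SCHOOL, REASON))
```

With these costs a medial flip ranks well ahead of any other single-character change, which is the ordering the paragraph argues for.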
Grammar beats spelling in suggestion ranking
Edit distance finds the nearest word, but the nearest word may be grammatically impossible in context. A misspelled verb might have a noun as its closest spelling match. You need POS tagging and bigram probabilities to promote the candidate that actually fits the sentence, even if it’s farther in edit distance.
Sounds collapse, spellings don't
Myanmar speech merges distinctions that the writing system preserves. Aspirated pairs (က/ခ, စ/ဆ), three nasal endings (န်, မ်, ံ) all sounding /n/, three stop endings (က်, တ်, ပ်) all becoming a glottal stop. The corrector must know which written form is correct when they all sound identical, using frequency, context, and semantic analysis.
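One way to find such sound-alikes is a phonetic key that maps the merged spellings to a single symbol, so that words with equal keys become confusable candidates. The sketch below covers only the finals and the two aspirated pairs listed above. The key symbols `N` and `Q` and the function name are arbitrary choices, and a real phonetic hash would need far more rules.

```python
ASAT = "\u103a"
NASAL_FINALS = ("\u1014" + ASAT, "\u1019" + ASAT, "\u1036")       # the three nasal endings
STOP_FINALS = ("\u1000" + ASAT, "\u1010" + ASAT, "\u1015" + ASAT)  # the three stop endings
ASPIRATED_TO_PLAIN = {"\u1001": "\u1000", "\u1006": "\u1005"}      # aspirated -> unaspirated


def phonetic_key(syllable: str) -> str:
    """Collapse spellings that the text describes as sounding alike."""
    key = syllable
    for final in NASAL_FINALS:
        key = key.replace(final, "N")
    for final in STOP_FINALS:
        key = key.replace(final, "Q")
    # Finals are rewritten first so a final consonant is never treated as an initial.
    return "".join(ASPIRATED_TO_PLAIN.get(ch, ch) for ch in key)


a = "\u1000\u1014\u103a"  # ka + first nasal ending
b = "\u1000\u1019\u103a"  # ka + second nasal ending
c = "\u1000\u1036"        # ka + anusvara
print(phonetic_key(a) == phonetic_key(b) == phonetic_key(c))
```

Equal keys only nominate candidates. Choosing the correct spelling still depends on frequency and context, as the paragraph notes.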
Key Features
12-Strategy Validation Pipeline
The checking pipeline runs up to 12 composable strategies in priority order. Each strategy builds on the output of previous ones, and positions already flagged are skipped by later strategies:
| Priority | Strategy | Method | Speed | What It Catches |
|---|---|---|---|---|
| 10 | Tone Validation | Rule-based | Fast | Tone mark errors and disambiguation |
| 15 | Orthography | Rule-based | Fast | Medial order, compatibility violations |
| 20 | Syntactic Rule | YAML rules | Fast | Grammar rule violations |
| 24 | Statistical Confusable | Bigram ratio | Fast | Bigram-based confusable word detection |
| 25 | Broken Compound | Dictionary | Fast | Wrongly split compound words |
| 30 | POS Sequence | POS tagger | Moderate | Invalid POS tag sequences |
| 40 | Question | Pattern matching | Fast | Question structure errors |
| 45 | Homophone | N-gram + frequency | Moderate | Sound-alike word confusion |
| 47 | Confusable Compound Classifier | MLP (ONNX) | Fast | MLP-based confusable/compound detection |
| 48 | Confusable Semantic | AI MLM | Slow | MLM-enhanced confusable detection |
| 50 | N-gram Context | Bigram/Trigram | Moderate | Real-word errors (correct spelling, wrong context) |
| 70 | Semantic | AI masked language model | Slow | Deep context errors (ONNX model) |
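The control flow behind the table (lower priority number runs first, flagged positions are not flagged again) can be sketched as below. The `Issue` type, the strategy signature, and the toy strategies are all hypothetical. The sketch shows the ordering and skipping behaviour only, not how mySpellChecker wires its strategies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Issue:
    position: int
    strategy: str


# A strategy receives the tokens and the positions flagged so far,
# and returns the positions it considers wrong.
Strategy = Callable[[list[str], set[int]], set[int]]


def run_pipeline(tokens: list[str],
                 strategies: list[tuple[int, str, Strategy]]) -> list[Issue]:
    """Run strategies by ascending priority; each position is reported once."""
    flagged: set[int] = set()
    issues: list[Issue] = []
    for _priority, name, strategy in sorted(strategies, key=lambda s: s[0]):
        for pos in sorted(strategy(tokens, flagged) - flagged):
            flagged.add(pos)
            issues.append(Issue(pos, name))
    return issues


# Toy strategies for illustration only.
def tone(tokens, flagged):
    return {1}


def ngram(tokens, flagged):
    return {0, 1}


result = run_pipeline(["a", "b", "c"], [(50, "ngram", ngram), (10, "tone", tone)])
print(result)
```

Passing the flagged set into each strategy also lets an expensive strategy skip inference for positions that a cheaper one has already reported.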
Syllable Validation
The foundation layer. Uses SyllableRuleValidator (with Cython acceleration) plus dictionary lookup to validate Myanmar character combinations against orthographic rules. Catches invalid medial stacking, tone mark placement, and character sequences in O(1) time, before any dictionary or AI is needed. See Syllable Validation.
Word Validation with SymSpell
Uses the SymSpell symmetric delete algorithm for O(1) correction suggestions, which is 1,000x faster than traditional Levenshtein search. Pre-computes deletion candidates at build time so lookups are hash table hits, not edit distance calculations. Includes Myanmar-specific substitution costs that weight medial confusion (ျ↔ြ) lower than unrelated character swaps. See Word Validation and SymSpell Algorithm.
Context Checking & Homophones
N-gram Context uses bigram/trigram probability tables to detect real-word errors, which are words that are spelled correctly but wrong for the context. For example, “နီ” (red) vs “နေ” (stay) are both valid words, but only one fits “ဘာလုပ်___လဲ” (what are you doing?). Homophone Detection extends this with bidirectional N-gram analysis (checks both forward and backward context) and frequency-aware guards that prevent high-frequency words from being incorrectly flagged. See Context Checking and Homophones.
Grammar Checking
Eight specialized grammar checkers, each handling a different aspect of Myanmar grammar:
| Checker | What It Catches | Example |
|---|---|---|
| Aspect | Aspect marker misuse | ပြီနေ (“completed + progressive”), invalid sequence |
| Classifier | Classifier-noun agreement | ခွေးဦး (“dog + polite-people classifier”) → ခွေးကောင် (“dog + animal classifier”) |
| Compound | Compound word errors | ပန်ခြံ (“missing tone”) → ပန်းခြံ (“flower garden”) |
| MergedWord | Particle-verb merging | သွားကို (“verb + object particle merged”), should be separate |
| Negation | Negation pattern errors | မသွားတယ် (“negation + affirmative ending”) → မသွားဘူး (“negation + negative ending”) |
| Particle | Particle context errors | Wrong particle usage based on surrounding POS context |
| TenseAgreement | Tense-time agreement | Tense marker contradicts temporal context |
| Register | Formal/informal mixing | ငါသွားပါသည် (“colloquial pronoun + formal ending”), register clash |
POS Tagging
Pluggable part-of-speech tagging with three backends:
| Backend | Speed | Use Case |
|---|---|---|
| Rule-Based | Fast | Simple validation, resource-constrained |
| Viterbi HMM | Medium | Balanced accuracy and speed |
| Transformer | Slow | Maximum accuracy, ~93% (requires transformers package) |
AI Semantic Checking
Trains a masked language model from scratch on your corpus. At inference, masks each word and asks “what should go here?”. If the model strongly disagrees with the original word, it flags a semantic error and suggests alternatives. Handles Myanmar-specific challenges: word-aligned multi-token masking for BPE-split words, beam search for multi-token prediction, and per-model confidence calibration (XLM-RoBERTa, BERT, DistilBERT). See Semantic Checking and Semantic Algorithm.
Compound & Morpheme Handling
- CompoundResolver: DP-based compound word synthesis that breaks OOV words into known components
- ReduplicationEngine: Validates productive reduplication patterns (e.g., ရှင်းရှင်းလင်းလင်း)
- Morpheme-level correction: Corrects individual morphemes within compound words instead of replacing the entire word
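The dynamic-programming idea behind breaking an out-of-vocabulary word into known components can be sketched as a classic word-break search. This version prefers the split with the fewest parts and works on code points. The real CompoundResolver may use different units and scoring, and `split_compound` is a hypothetical name.

```python
from typing import Optional


def split_compound(word: str, lexicon: set[str]) -> Optional[list[str]]:
    """Fewest-parts split of `word` into lexicon entries, or None if impossible."""
    n = len(word)
    best: list[Optional[list[str]]] = [None] * (n + 1)
    best[0] = []  # the empty prefix needs no parts
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in lexicon:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


TEXT = "\u1005\u102c"                # first component of the "book" example
BUNDLE = "\u1021\u102f\u1015\u103a"  # second component

print(split_compound(TEXT + BUNDLE, {TEXT, BUNDLE}) == [TEXT, BUNDLE])
```

A successful split only shows that the parts exist. Whether the combination is an attested compound is a separate check.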
Named Entity Recognition
Reduces false positives by identifying names and places before spell checking. Three implementations: heuristic (fast), CRF-based, and Transformer-based (93% accuracy). NER-flagged tokens are skipped by downstream validation strategies. See NER.
Dictionary Building Pipeline
Build custom dictionaries from your own text corpora:
- Multi-Format Ingestion: .txt, .csv, .tsv, .json, .jsonl, .parquet
- Parallel Processing: Cython + OpenMP batch processor for fast segmentation
- N-gram Frequency: Bigram/Trigram probability tables for context checking
- Incremental Builds: Resume processing without reprocessing completed files
- Pluggable Storage: SQLite (default, disk-based) or MemoryProvider (RAM-based) with thread-safe connection pooling
AI Model Training
Two end-to-end training pipelines that handle tokenizer creation, model training, and ONNX export with INT8 quantization:
| Pipeline | Model Type | Base | Inference | CLI |
|---|---|---|---|---|
| Semantic | Masked Language Model | Train from scratch (RoBERTa/BERT) | ~200 ms | train-model |
| Neural Reranker | MLP (19-feature) | Train on synthetic errors | ~50 µs | reranker_trainer |
Myanmar Language Support
- Text Normalization: Unified service for zero-width character removal, NFC/NFD normalization, and diacritic reordering (with Cython acceleration)
- Zawgyi Detection: Built-in detection and warning for legacy Zawgyi encoded text, with automatic conversion
- Phonetic Hashing: Sound-based similarity matching for Myanmar characters, powering homophone detection
- Colloquial Variants: Detection of informal spellings (e.g., ကျနော် → ကျွန်တော်) with configurable strictness (strict, lenient, off)
- Tone Processing: Tone mark validation, disambiguation, and context-based correction
- Bilingual Error Messages: Error reporting in English and Myanmar (မြန်မာ) via i18n system
Performance & Production
- 11 Cython/C++ Extensions: Performance-critical paths (normalization, edit distance, batch processing, Viterbi, word segmentation) compiled to C++ with OpenMP parallelization
- Streaming & Batch APIs: check_batch for parallel processing and check_async for non-blocking operations
- Configurable Profiles: Pre-defined profiles (DEFAULT, FAST, ACCURATE) or custom configuration with environment/file-based loading
- Connection Pooling: Thread-safe SQLite connection management for multi-threaded applications
- DI Container: Dependency injection for advanced component wiring and testability
Feature Matrix
| Feature | Method | Speed |
|---|---|---|
| Syllable Validation | Rule-based + dictionary | Fast |
| Word Validation | SymSpell (O(1) symmetric delete) | Fast |
| Context Checking | Bigram/Trigram N-gram | Moderate |
| Grammar Checking | POS + YAML rules | Moderate |
| Grammar Checkers | Aspect/Classifier/Compound/MergedWord/Negation/Particle/TenseAgreement/Register | Fast |
| NER | Heuristic / CRF / Transformer | Fast to Slow |
| Semantic Checking | AI masked language model (ONNX) | Slow |
| Batch Processing | Parallel processing | Varies |
For measured end-to-end performance (F1 96.2% without semantic, 98.3% with semantic v2.3), see the benchmarks page.
Acknowledgments
Models & Resources
| Resource | Author | Description |
|---|---|---|
| Myanmar POS Model | Chuu Htet Naing | XLM-RoBERTa-based POS tagger (93.37% accuracy) |
| myWord Segmentation | Ye Kyaw Thu | Viterbi-based Myanmar word segmentation |
| CRF Word Segmenter | Ye Kyaw Thu | CRF-based syllable-to-word segmentation model |
Key Dependencies
| Library | Purpose | License |
|---|---|---|
| pycrfsuite | CRF model inference | MIT |
| transformers | Transformer model inference | Apache 2.0 |
Algorithm References
| Algorithm | Author | Description |
|---|---|---|
| SymSpell | Wolf Garbe | Symmetric delete spelling correction. mySpellChecker includes a custom implementation with Myanmar-specific variant generation. |
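For readers unfamiliar with the symmetric delete idea credited above, here is a minimal generic sketch. It indexes every dictionary word under its deletion variants and looks up a query through the query's own deletion variants. It is not mySpellChecker's implementation: the function names are hypothetical, there is no Myanmar-specific variant generation, and the final distance verification and ranking step is omitted.

```python
from collections import defaultdict
from itertools import combinations


def deletes(word: str, max_distance: int) -> set[str]:
    """The word itself plus every string made by deleting up to max_distance characters."""
    out = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            out.add("".join(word[i] for i in keep))
    return out


def build_index(dictionary: set[str], max_distance: int = 1) -> dict[str, set[str]]:
    """Precompute delete variant -> dictionary words (done once, at build time)."""
    index: dict[str, set[str]] = defaultdict(set)
    for word in dictionary:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index


def lookup(term: str, index: dict[str, set[str]], max_distance: int = 1) -> set[str]:
    """Candidate corrections found via hash lookups only."""
    found: set[str] = set()
    for variant in deletes(term, max_distance):
        found |= index.get(variant, set())
    return found


index = build_index({"hello", "help"})
print(sorted(lookup("helo", index)))
```

Candidates found this way can be up to twice `max_distance` edits from the query, so a full implementation verifies each one with a true edit distance before ranking.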