Most spell checkers split text on spaces. Myanmar script does not put spaces between words, so that approach fails completely. mySpellChecker works around this by starting from syllables, which can be identified without a dictionary, and building up from there: syllable validation catches ~90% of errors cheaply, then word lookup, grammar rules, and AI handle the rest. You build your own dictionary from a text corpus, optionally train AI models on your own data, and get a checking pipeline that adapts to your domain.
How It Works, Layer by Layer
You can’t check what you can’t split. So mySpellChecker splits smaller, starting from syllables, the smallest reliable unit, then builds up layer by layer until nothing gets through.
Layer 1: Syllable Validation
22 structural rules, no dictionary needed. Catches invalid syllable structures, medial ordering and compatibility, tone mark errors, virama and stacking issues, kinzi patterns, vowel exclusivity, diacritic uniqueness, Great Sa rules, particle typos, and corruption detection. This alone catches ~90% of errors in O(1) time.
Layer 2: Word Validation
SymSpell O(1) symmetric delete lookup with a neural reranker (19-feature ONNX MLP). Handles out-of-vocabulary words, compound word resolution, ambiguous segmentation, morphological root recovery, colloquial variant detection, phonetic similarity matching, and Zawgyi detection. Only runs on text that passed syllable validation.
Layer 2.5: Grammar Rules
POS tagging (3 pluggable backends: rule-based, Viterbi HMM, Transformer) plus 8 grammar checkers driven by YAML rule files. Catches register mixing (colloquial/formal), aspect marker errors, classifier-noun agreement, negation pattern mismatches, compound formation issues, merged word/particle detection, particle context errors, and tense-time agreement violations.
Layer 3: Context & AI
12 priority-ordered validation strategies combining N-gram context, statistical confusable detection, MLP classification, phonetic analysis, and custom-trained AI models. Catches homophones, confusable variants (statistical + semantic via MLM), broken compounds, question structure errors, POS sequence violations, tone ambiguity, and deep semantic errors. AI strategies are opt-in and require trained models.
Why Myanmar Spell Checking Is Hard
Building a spell checker for Myanmar is fundamentally different from building one for English. These are the challenges that kept this problem unsolved for decades and that shaped every design decision in mySpellChecker:
- No word boundaries
- Ambiguous segmentation
- Valid syllable ≠ Valid word
- Look-alike medials (ျ vs ြ)
- Confusables: sound the same, mean something different
- Right word, wrong place (real-word errors)
- Negation wraps around (circumfix)
- Colloquial vs Formal register mixing
- Compound word explosion
- Reduplication is grammar, not a typo
- Pali & Sanskrit loanwords
- No capitalization, so names look like common words
- Encoding chaos (Unicode vs Zawgyi)
- Edit distance lies for Myanmar
- Grammar beats spelling in suggestion ranking
- Sounds collapse, spellings don't
Key Features
12-Strategy Validation Pipeline
The checking pipeline runs up to 12 composable strategies in priority order. Each strategy builds on the output of previous ones, and positions already flagged are skipped by later strategies:

| Priority | Strategy | Method | Speed | What It Catches |
|---|---|---|---|---|
| 10 | Tone Validation | Rule-based | Fast | Tone mark errors and disambiguation |
| 15 | Orthography | Rule-based | Fast | Medial order, compatibility violations |
| 20 | Syntactic Rule | YAML rules | Fast | Grammar rule violations |
| 24 | Statistical Confusable | Bigram ratio | Fast | Bigram-based confusable word detection |
| 25 | Broken Compound | Dictionary | Fast | Wrongly split compound words |
| 30 | POS Sequence | POS tagger | Moderate | Invalid POS tag sequences |
| 40 | Question | Pattern matching | Fast | Question structure errors |
| 45 | Homophone | N-gram + frequency | Moderate | Sound-alike word confusion |
| 47 | Confusable Compound Classifier | MLP (ONNX) | Fast | MLP-based confusable/compound detection |
| 48 | Confusable Semantic | AI MLM | Slow | MLM-enhanced confusable detection |
| 50 | N-gram Context | Bigram/Trigram | Moderate | Real-word errors (correct spelling, wrong context) |
| 70 | Semantic | AI masked language model | Slow | Deep context errors (ONNX model) |
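
As a rough illustration of how priority ordering and position skipping could interact, here is a minimal sketch. The `Strategy` and `Issue` classes and the `run_pipeline` function are hypothetical names invented for this example and are not mySpellChecker's API; only the priorities echo the table above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Issue:
    position: int   # index of the flagged token
    strategy: str   # name of the strategy that flagged it


@dataclass
class Strategy:
    priority: int                               # lower number runs earlier
    name: str
    check: Callable[[list[str], int], bool]     # True if tokens[i] looks wrong


def run_pipeline(tokens: list[str], strategies: list[Strategy]) -> list[Issue]:
    """Run strategies in priority order; skip positions already flagged."""
    flagged: dict[int, Issue] = {}
    for strategy in sorted(strategies, key=lambda s: s.priority):
        for i in range(len(tokens)):
            if i in flagged:
                continue  # an earlier strategy already owns this position
            if strategy.check(tokens, i):
                flagged[i] = Issue(i, strategy.name)
    return [flagged[i] for i in sorted(flagged)]


# Toy checks standing in for real validators (assumptions for illustration).
strategies = [
    Strategy(50, "ngram_context", lambda toks, i: i in (1, 2)),
    Strategy(10, "tone_validation", lambda toks, i: i == 1),
]
issues = run_pipeline(["a", "b", "c"], strategies)
```

In this toy run, position 1 is claimed by the priority-10 check, so the priority-50 check only reports position 2.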
Syllable Validation
The foundation layer. Uses `SyllableRuleValidator` (with Cython acceleration) plus dictionary lookup to validate Myanmar character combinations against orthographic rules. Catches invalid medial stacking, misplaced tone marks, and illegal character sequences in O(1) time, before any word-level lookup or AI model runs. See Syllable Validation.
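
To make "syllables can be identified without a dictionary" concrete, here is a minimal sketch of a regex heuristic commonly used for Myanmar syllable breaking: start a new syllable at any consonant that is not stacked (not preceded by a virama) and not a syllable-final (not followed by an asat or virama). This is an assumption-laden illustration, not the rule set of `SyllableRuleValidator`; the function name `split_syllables` is hypothetical, and independent vowels, digits, and punctuation are deliberately ignored.

```python
import re

# A consonant (U+1000..U+1021) starts a syllable unless it is
# stacked under a virama (U+1039) or closed by asat/virama (U+103A/U+1039).
SYLLABLE_START = re.compile("(?<!\u1039)[\u1000-\u1021](?![\u103A\u1039])")


def split_syllables(text: str) -> list[str]:
    """Split Myanmar text into syllables at detected start positions."""
    starts = [m.start() for m in SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)          # keep any leading material
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a != b]


# "မြန်မာ" written with explicit code points to avoid encoding ambiguity.
word = "\u1019\u103C\u1014\u103A\u1019\u102C"
syllables = split_syllables(word)
```

Here the third consonant is followed by an asat, so it stays attached to the first syllable and the word splits into two pieces.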
Word Validation with SymSpell
Uses the SymSpell symmetric delete algorithm for O(1) correction suggestions, which is 1,000x faster than traditional Levenshtein search. Pre-computes deletion candidates at build time so lookups are hash table hits, not edit distance calculations. Includes Myanmar-specific substitution costs that weight medial confusion (ျ↔ြ) lower than unrelated character swaps. See Word Validation and SymSpell Algorithm.
Context Checking & Homophones
N-gram Context uses bigram/trigram probability tables to detect real-word errors, which are words that are spelled correctly but wrong for the context. For example, “နီ” (red) vs “နေ” (stay) are both valid words, but only one fits “ဘာလုပ်___လဲ” (what are you doing?). Homophone Detection extends this with bidirectional N-gram analysis (checks both forward and backward context) and frequency-aware guards that prevent high-frequency words from being incorrectly flagged. See Context Checking and Homophones.
Grammar Checking
Eight specialized grammar checkers, each handling a different aspect of Myanmar grammar:

| Checker | What It Catches | Example |
|---|---|---|
| Aspect | Aspect marker misuse | ပြီနေ (“completed + progressive”), invalid sequence |
| Classifier | Classifier-noun agreement | ခွေးဦး (“dog + polite-people classifier”) → ခွေးကောင် (“dog + animal classifier”) |
| Compound | Compound word errors | ပန်ခြံ (“missing tone”) → ပန်းခြံ (“flower garden”) |
| MergedWord | Particle-verb merging | သွားကို (“verb + object particle merged”), should be separate |
| Negation | Negation pattern errors | မသွားတယ် (“negation + affirmative ending”) → မသွားဘူး (“negation + negative ending”) |
| Particle | Particle context errors | Wrong particle usage based on surrounding POS context |
| TenseAgreement | Tense-time agreement | Tense marker contradicts temporal context |
| Register | Formal/informal mixing | ငါသွားပါသည် (“colloquial pronoun + formal ending”), register clash |
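
The negation row above can be pictured as a small data-driven rule. The sketch below is an illustration only: the rule dictionary stands in for a YAML rule file whose real schema is not shown in this document, `check_negation` is a hypothetical name, and the input is assumed to be already segmented into tokens.

```python
# Hypothetical rule data; the project loads its rules from YAML files,
# and this dictionary only imitates that idea.
NEGATION_RULE = {
    "prefix": "မ",                     # negation prefix
    "ending_fixes": {"တယ်": "ဘူး"},    # affirmative ending -> negative ending
}


def check_negation(tokens, rule=NEGATION_RULE):
    """Flag an affirmative sentence ending that follows a negation prefix.

    Returns a list of (position, found_token, suggested_token) tuples.
    """
    issues = []
    if rule["prefix"] in tokens[:-1]:
        last = tokens[-1]
        fix = rule["ending_fixes"].get(last)
        if fix is not None:
            issues.append((len(tokens) - 1, last, fix))
    return issues


wrong = check_negation(["မ", "သွား", "တယ်"])   # negation + affirmative ending
right = check_negation(["မ", "သွား", "ဘူး"])   # negation + negative ending
```

Keeping the pattern in data rather than code is what lets a rule file grow without touching the checker, although real negation patterns are more varied than this single pair.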
POS Tagging
Pluggable part-of-speech tagging with three backends:

| Backend | Speed | Use Case |
|---|---|---|
| Rule-Based | Fast | Simple validation, resource-constrained |
| Viterbi HMM | Medium | Balanced accuracy and speed |
| Transformer | Slow | Maximum accuracy, ~93% (requires transformers package) |
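
For readers unfamiliar with the Viterbi HMM backend, the following is a generic textbook sketch of Viterbi decoding over a two-tag toy model. The probabilities, the tag set, and the `viterbi` function are made up for illustration and say nothing about the trained model or the API that mySpellChecker ships.

```python
import math


def viterbi(tokens, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence under a first-order HMM."""
    if not tokens:
        return []

    def lp(p):
        return math.log(max(p, floor))   # floor avoids log(0) for unseen events

    scores = [{t: lp(start_p.get(t, 0)) + lp(emit_p[t].get(tokens[0], 0))
               for t in tags}]
    backpointers = []
    for token in tokens[1:]:
        prev = scores[-1]
        current, pointer = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: prev[p] + lp(trans_p[p].get(t, 0)))
            current[t] = (prev[best_prev] + lp(trans_p[best_prev].get(t, 0))
                          + lp(emit_p[t].get(token, 0)))
            pointer[t] = best_prev
        scores.append(current)
        backpointers.append(pointer)

    # Trace the best path backwards from the best final tag.
    path = [max(tags, key=lambda t: scores[-1][t])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


# Toy model: "ခွေး" (dog) is usually a noun, "သွား" (go) usually a verb.
tags = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"ခွေး": 0.9, "သွား": 0.1}, "V": {"ခွေး": 0.1, "သွား": 0.9}}
best_tags = viterbi(["ခွေး", "သွား"], tags, start_p, trans_p, emit_p)
```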
AI Semantic Checking
Trains a masked language model from scratch on your corpus. At inference, masks each word and asks “what should go here?”. If the model strongly disagrees with the original word, it flags a semantic error and suggests alternatives. Handles Myanmar-specific challenges: word-aligned multi-token masking for BPE-split words, beam search for multi-token prediction, and per-model confidence calibration (XLM-RoBERTa, BERT, DistilBERT). See Semantic Checking and Semantic Algorithm.
Compound & Morpheme Handling
- CompoundResolver: DP-based compound word synthesis that breaks OOV words into known components
- ReduplicationEngine: Validates productive reduplication patterns (e.g., ရှင်းရှင်းလင်းလင်း)
- Morpheme-level correction: Corrects individual morphemes within compound words instead of replacing the entire word
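
The dynamic-programming idea behind breaking an out-of-vocabulary word into known components can be sketched as follows. This is a simplified stand-in, not the `CompoundResolver` implementation: `resolve_compound` is a hypothetical name, the input is assumed to be pre-split into syllables, and the scoring (fewest components wins) is an assumption chosen for brevity.

```python
def resolve_compound(syllables, dictionary, max_len=4):
    """Split a syllable sequence into the fewest known dictionary words.

    Returns the list of components, or None if no full cover exists.
    """
    n = len(syllables)
    best = [None] * (n + 1)   # best[i] = best split of syllables[:i]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is None:
                continue      # prefix cannot be covered, so skip it
            piece = "".join(syllables[start:end])
            if piece in dictionary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Toy dictionary; the compound itself is deliberately missing.
dictionary = {"ပန်း", "ခြံ"}
parts = resolve_compound(["ပန်း", "ခြံ"], dictionary)
```

A production resolver would likely weigh word frequency and part-of-speech patterns as well; counting components is only the simplest possible cost.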
Named Entity Recognition
Reduces false positives by identifying names and places before spell checking. Three implementations: heuristic (fast), CRF-based, and Transformer-based (93% accuracy). NER-flagged tokens are skipped by downstream validation strategies. See NER.
Dictionary Building Pipeline
Build custom dictionaries from your own text corpora:
- Multi-Format Ingestion: `.txt`, `.csv`, `.tsv`, `.json`, `.jsonl`, `.parquet`
- Parallel Processing: Cython + OpenMP batch processor for fast segmentation
- N-gram Frequency: Bigram/Trigram probability tables for context checking
- Incremental Builds: Resume processing without reprocessing completed files
- Pluggable Storage: SQLite (default, disk-based) or MemoryProvider (RAM-based) with thread-safe connection pooling
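
The N-gram frequency tables in the list above can be pictured as plain counts gathered from a segmented corpus. The sketch below builds bigram counts and scores a candidate word given its left neighbor with add-one smoothing. The three-sentence corpus, the segmentation, and the function names are toy assumptions for illustration, and the real pipeline's storage format and smoothing method are not described in this document.

```python
from collections import Counter


def build_bigram_table(sentences):
    """Count unigrams and bigrams over pre-segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_probability(prev, word, unigrams, bigrams):
    """P(word | prev) with add-one (Laplace) smoothing."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


# Toy corpus, already segmented into words (an assumption of this sketch).
corpus = [
    ["ဘာ", "လုပ်", "နေ", "လဲ"],
    ["စာ", "ဖတ်", "နေ", "တယ်"],
    ["ပန်း", "နီ", "တယ်"],
]
unigrams, bigrams = build_bigram_table(corpus)

# Which candidate fits better after "လုပ်"?
p_stay = bigram_probability("လုပ်", "နေ", unigrams, bigrams)
p_red = bigram_probability("လုပ်", "နီ", unigrams, bigrams)
```

With this toy data the word "နေ" scores higher than "နီ" after "လုပ်", which is the kind of signal a context checker can use to flag a correctly spelled but misplaced word.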
AI Model Training
Two end-to-end training pipelines that handle tokenizer creation, model training, and ONNX export with INT8 quantization:

| Pipeline | Model Type | Base | Inference | CLI |
|---|---|---|---|---|
| Semantic | Masked Language Model | Train from scratch (RoBERTa/BERT) | ~200ms | train-model |
| Neural Reranker | MLP (19-feature) | Train on synthetic errors | ~50 µs | reranker_trainer |
Myanmar Language Support
- Text Normalization: Unified service for zero-width character removal, NFC/NFD normalization, and diacritic reordering (with Cython acceleration)
- Zawgyi Detection: Built-in detection and warning for legacy Zawgyi encoded text, with automatic conversion
- Phonetic Hashing: Sound-based similarity matching for Myanmar characters, powering homophone detection
- Colloquial Variants: Detection of informal spellings (e.g., ကျနော် → ကျွန်တော်) with configurable strictness (`strict`, `lenient`, `off`)
- Tone Processing: Tone mark validation, disambiguation, and context-based correction
- Bilingual Error Messages: Error reporting in English and Myanmar (မြန်မာ) via i18n system
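
Two of the normalization steps listed above, zero-width character removal and Unicode normalization, can be sketched with the Python standard library alone. The function name `normalize_text` and the exact set of characters stripped are assumptions for this example; diacritic reordering and Zawgyi handling are intentionally left out.

```python
import unicodedata

# Zero-width space, non-joiner, joiner, and byte-order mark.
# Mapping a code point to None makes str.translate delete it.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def normalize_text(text: str, form: str = "NFC") -> str:
    """Strip zero-width characters, then apply Unicode normalization."""
    return unicodedata.normalize(form, text.translate(ZERO_WIDTH))


# "မြန်မာ" with a zero-width space hidden between the two syllables.
dirty = "\u1019\u103C\u1014\u103A\u200b\u1019\u102C"
clean = normalize_text(dirty)
```

Running this step first means that two visually identical strings compare equal before any dictionary lookup takes place.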
Performance & Production
- 11 Cython/C++ Extensions: Performance-critical paths (normalization, edit distance, batch processing, Viterbi, word segmentation) compiled to C++ with OpenMP parallelization
- Streaming & Batch APIs: `check_batch` for parallel processing and `check_async` for non-blocking operations
- Configurable Profiles: Pre-defined profiles (`DEFAULT`, `FAST`, `ACCURATE`) or custom configuration with environment/file-based loading
- Connection Pooling: Thread-safe SQLite connection management for multi-threaded applications
- DI Container: Dependency injection for advanced component wiring and testability
Feature Matrix
| Feature | Method | Speed |
|---|---|---|
| Syllable Validation | Rule-based + dictionary | Fast |
| Word Validation | SymSpell (O(1) symmetric delete) | Fast |
| Context Checking | Bigram/Trigram N-gram | Moderate |
| Grammar Checking | POS + YAML rules | Moderate |
| Grammar Checkers | Aspect/Classifier/Compound/MergedWord/Negation/Particle/TenseAgreement/Register | Fast |
| NER | Heuristic, CRF, or Transformer | Fast to Slow |
| Semantic Checking | AI masked language model (ONNX) | Slow |
| Batch Processing | Parallel processing | Varies |
Acknowledgments
Models & Resources
| Resource | Author | Description |
|---|---|---|
| Myanmar POS Model | Chuu Htet Naing | XLM-RoBERTa-based POS tagger (93.37% accuracy) |
| myWord Segmentation | Ye Kyaw Thu | Viterbi-based Myanmar word segmentation |
| CRF Word Segmenter | Ye Kyaw Thu | CRF-based syllable-to-word segmentation model |
Key Dependencies
| Library | Purpose | License |
|---|---|---|
| pycrfsuite | CRF model inference | MIT |
| transformers | Transformer model inference | Apache 2.0 |
Algorithm References
| Algorithm | Author | Description |
|---|---|---|
| SymSpell | Wolf Garbe | Symmetric delete spelling correction. mySpellChecker includes a custom implementation with Myanmar-specific variant generation. |
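
The symmetric delete idea credited above can be sketched in a few lines: index every dictionary word under all strings obtained by deleting up to a fixed number of characters, then apply the same deletions to the query and look up the results in a hash table. This is a bare-bones illustration, not mySpellChecker's implementation: the function names are hypothetical, deletions operate on single code points, and the final edit-distance verification and ranking that a full SymSpell lookup performs are omitted.

```python
from collections import defaultdict


def deletes(word: str, max_distance: int = 1) -> set[str]:
    """All strings reachable by deleting up to max_distance code points."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


def build_index(words, max_distance: int = 1):
    """Map each delete variant to the dictionary words that produce it."""
    index = defaultdict(set)
    for word in words:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index


def candidates(term: str, index, max_distance: int = 1) -> set[str]:
    """Collect dictionary words sharing a delete variant with the term."""
    found = set()
    for variant in deletes(term, max_distance):
        found |= index.get(variant, set())
    return found


index = build_index(["ပန်းခြံ", "ခွေး"])
typo = "ပန်းခြံ".replace("\u1038", "")   # drop the tone mark, as in ပန်ခြံ
suggestions = candidates(typo, index)
```

All the expensive work happens in `build_index`; at query time each variant costs one hash lookup, which is where the constant-time behavior comes from.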