Most spell checkers split text on spaces. Myanmar doesn’t have spaces, so that approach fails completely. mySpellChecker works around this by starting from syllables, which can be identified without a dictionary, and building up from there: syllable validation catches ~90% of errors cheaply, then word lookup, grammar rules, and AI handle the rest. You build your own dictionary from a text corpus, train optional AI models on your own data, and get a checking pipeline that adapts to your domain.
v1.0 supports Standard Burmese (Myanmar) only. Other Myanmar-script languages (Shan, Karen, Mon, etc.) are planned for future releases.
```
pip install myspellchecker
```

Quickstart

Install, build a dictionary, and check your first text in 5 minutes.

How It Works, Layer by Layer

You can’t check what you can’t split. So mySpellChecker splits smaller, starting from syllables, the smallest reliable unit, then builds up layer by layer until nothing gets through.
English: "I am reading a book"     → 5 words (split on spaces)
Myanmar: "ကျွန်တော်စာဖတ်နေပါတယ်"  → Where are the word boundaries?
Most spell checkers assume whitespace = word boundary. Myanmar has none. Instead of attempting expensive word segmentation on potentially incorrect text, the pipeline processes text through four progressive layers, where each layer catches what the one before it can’t:

Layer 1: Syllable Validation

22 structural rules, no dictionary needed. Catches invalid syllable structures, medial ordering and compatibility, tone mark errors, virama and stacking issues, kinzi patterns, vowel exclusivity, diacritic uniqueness, Great Sa rules, particle typos, and corruption detection. This alone catches ~90% of errors in O(1) time.
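To give a flavor of what a structural rule looks like, here is a toy ordering check, a single regex standing in for one of the 22 rules (illustrative only, not the library's actual validator): a syllable is roughly a base consonant, then medials, then dependent vowel signs, then tone marks.

```python
import re

# Toy structural check: base consonant, then medials, then dependent
# vowel signs, then tone marks / asat. Illustrative sketch only -- the
# real validator applies 22 separate rules.
SYLLABLE = re.compile(
    r"^[\u1000-\u1021]"        # base consonant (ka .. a)
    r"[\u103B-\u103E]{0,3}"    # medials (ya-pin, ya-yit, wa-swe, ha-hto)
    r"[\u102B-\u1035]{0,2}"    # dependent vowel signs
    r"[\u1036-\u103A]{0,2}$"   # anusvara, tone marks, asat
)

def is_structurally_valid(syllable: str) -> bool:
    """Return True if the character ordering matches the rough template."""
    return bool(SYLLABLE.match(syllable))
```

A check like this needs no dictionary at all: ကျ (consonant + medial) passes, while the reversed sequence (medial before consonant) fails immediately.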

Layer 2: Word Validation

SymSpell O(1) symmetric delete lookup with a neural reranker (19-feature ONNX MLP). Handles out-of-vocabulary words, compound word resolution, ambiguous segmentation, morphological root recovery, colloquial variant detection, phonetic similarity matching, and Zawgyi detection. Only runs on text that passed syllable validation.

Layer 2.5: Grammar Rules

POS tagging (3 pluggable backends: rule-based, Viterbi HMM, Transformer) plus 8 grammar checkers driven by YAML rule files. Catches register mixing (colloquial/formal), aspect marker errors, classifier-noun agreement, negation pattern mismatches, compound formation issues, merged word/particle detection, particle context errors, and tense-time agreement violations.

Layer 3: Context & AI

12 priority-ordered validation strategies combining N-gram context, statistical confusable detection, MLP classification, phonetic analysis, and custom-trained AI models. Catches homophones, confusable variants (statistical + semantic via MLM), broken compounds, question structure errors, POS sequence violations, tone ambiguity, and deep semantic errors. AI strategies are opt-in and require trained models.
Faster rules always run first. Each layer builds on validated output from the layer below, and positions already flagged are skipped by later strategies. This means most errors are caught cheaply, and expensive operations (N-gram lookups, AI inference) only run on text that has already passed basic validation.

Why Myanmar Spell Checking Is Hard

Building a spell checker for Myanmar is fundamentally different from English. These are the challenges that left this problem unsolved for decades and that shaped every design decision in mySpellChecker.
Most spell checkers assume whitespace = word boundary. Myanmar text is a continuous stream of characters with no delimiters. You can’t even begin checking until you figure out where one word ends and the next begins. mySpellChecker starts from syllables, the smallest reliable unit, and builds up.
Without spaces, the same character sequence can be segmented into different valid words with different meanings. Only context tells you which split is correct. This makes every stage of the pipeline, from syllable grouping to word lookup, dependent on disambiguation.
Myanmar syllable structure has strict rules: consonant, optional medials, vowel, optional tone. A syllable can pass every structural check and still form a word that doesn’t exist. Syllable validation catches invalid structures immediately, but real-word errors need dictionary lookup, grammar rules, and context analysis.
Myanmar has four medial consonants (ျ ြ ွ ှ) that attach to base characters. The ya-pin (ျ) and ya-yit (ြ) look nearly identical but produce completely different words: ကျောင်း (school) vs ကြောင်း (reason). This is the most common Myanmar typo, and mySpellChecker has 49+ dedicated correction rules for medial confusion alone.
These aren’t just visual typos; they’re phonetically identical or near-identical words with different meanings. A simple dictionary lookup won’t catch them because both words exist. You need phonetic hashing, frequency analysis, and context to pick the right one.
Every word in the sentence is a real word. The sentence is still wrong. Real-word errors are the hardest class of spelling mistakes in any language. Catching these requires N-gram probabilities, POS sequence validation, or AI, which means layer 3 and beyond in the pipeline.
Myanmar negation uses a circumfix, which is a prefix (မ) and a matching sentence-final particle (ဘူး) that wrap around the verb. Drop or mismatch either part and the sentence is grammatically broken, even though every individual word is valid. This pattern is rare across world languages and requires structural grammar checking, not just dictionary lookup.
Myanmar has two registers, colloquial and literary. Neither form is “wrong” on its own, but mixing them in one sentence is a grammar error. A news article using colloquial endings, or a chat message using literary forms, both need flagging. You need POS tagging and register-aware grammar rules, not just a dictionary.
Myanmar builds words by combining simpler ones: စာ (text) + အုပ် (bundle) = စာအုပ် (book). But the number of possible combinations explodes. You can’t just check each part independently; you need a compound resolution algorithm to know which combinations actually exist.
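The idea behind compound resolution can be sketched as a standard word-break dynamic program (an illustrative toy, not the CompoundResolver implementation): find a segmentation of an unknown word into known dictionary parts, preferring fewer parts.

```python
def resolve_compound(word, lexicon):
    """Split `word` into known lexicon entries, preferring fewer parts.

    Returns a list of parts, or None if no full segmentation exists.
    Toy word-break DP; illustrative sketch only.
    """
    best = [None] * (len(word) + 1)   # best[i] = best split of word[:i]
    best[0] = []
    for i in range(1, len(word) + 1):
        for j in range(i):
            if best[j] is not None and word[j:i] in lexicon:
                parts = best[j] + [word[j:i]]
                if best[i] is None or len(parts) < len(best[i]):
                    best[i] = parts
    return best[len(word)]
```

With a lexicon containing စာ and အုပ်, `resolve_compound("စာအုပ်", lexicon)` recovers the two components; a word with no full cover returns None and stays flagged.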
Myanmar uses reduplication as a grammatical pattern: လှလှ (very beautiful), ခဏခဏ (frequently). A naive spell checker would flag these as copy-paste errors. You need grammar rules that recognize reduplication as intentional and know which words can reduplicate.
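A minimal recognition check might look like this (an illustrative sketch, not the ReduplicationEngine; the whitelist of reduplicable stems is an assumption here): treat a word as reduplicated if it is an even-length doubling of a stem known to reduplicate.

```python
def is_reduplication(word, reduplicable_stems):
    """True if `word` is `stem + stem` for a stem allowed to reduplicate.

    Toy check; the real engine also covers longer productive patterns
    such as the four-part form in the docs above.
    """
    half, rem = divmod(len(word), 2)
    return rem == 0 and word[:half] == word[half:] and word[:half] in reduplicable_stems
```

With လှ whitelisted, လှလှ is accepted as intentional rather than flagged as a doubled word.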
Words like သမ္မတ (president) use subscript stacked consonants that follow Pali/Sanskrit rules, not native Myanmar rules. The spell checker needs two rule systems running in parallel: one for native words and one for loanwords.
English uses capitals to signal proper nouns and sentence starts. Myanmar has no such mechanism, so မြန်မာ could be the country or the adjective. Without named entity recognition (NER), the spell checker would flag people’s names and place names as errors.
~30% of Myanmar text online still uses Zawgyi, a legacy encoding that looks identical to Unicode but uses completely different code points. Even within Unicode, the same character can have multiple valid code point orderings. Text must be detected, converted, and normalized before spell checking can even begin.
Standard Levenshtein treats all swaps equally. But in Myanmar, medial confusion (ျ↔ြ) accounts for ~30% of all errors and should cost 0.2, while a cross-class swap costs 0.5. Without Myanmar-weighted edit distance, the most common corrections rank equal to implausible ones. mySpellChecker uses a custom cost matrix derived from corpus analysis so plausible errors always surface first.
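The weighting idea can be sketched with a standard Levenshtein DP whose substitution cost is looked up per character pair. The 0.2 weight below mirrors the example in the text; the library's actual cost matrix is derived from corpus analysis.

```python
def weighted_edit_distance(a, b, sub_cost, default=1.0):
    """Levenshtein distance with per-pair substitution costs (toy sketch)."""
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i * default
    for j in range(1, len(b) + 1):
        d[0][j] = j * default
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else sub_cost.get((a[i - 1], b[j - 1]), default)
            d[i][j] = min(d[i - 1][j] + default,   # delete
                          d[i][j - 1] + default,   # insert
                          d[i - 1][j - 1] + sub)   # substitute
    return d[-1][-1]

# ya-pin <-> ya-yit confusion is cheap; everything else costs the default
COSTS = {("\u103B", "\u103C"): 0.2, ("\u103C", "\u103B"): 0.2}
```

Under these weights the distance from ကျ to ကြ is 0.2 while an unrelated one-character swap costs 1.0, so the common medial confusion always ranks first among candidates.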
Edit distance finds the nearest word, but the nearest word may be grammatically impossible in context. A misspelled verb might have a noun as its closest spelling match. You need POS tagging and bigram probabilities to promote the candidate that actually fits the sentence, even if it’s farther in edit distance.
Myanmar speech merges distinctions that the writing system preserves: aspirated pairs (က/ခ, စ/ဆ), three nasal endings (န်, မ်, ံ) that all sound like /n/, and three stop endings (က်, တ်, ပ်) that all reduce to a glottal stop. The corrector must know which written form is correct when they all sound identical, using frequency, context, and semantic analysis.

Key Features

12-Strategy Validation Pipeline

The checking pipeline runs up to 12 composable strategies in priority order. Each strategy builds on the output of previous ones, and positions already flagged are skipped by later strategies:
| Priority | Strategy | Method | Speed | What It Catches |
|---|---|---|---|---|
| 10 | Tone Validation | Rule-based | Fast | Tone mark errors and disambiguation |
| 15 | Orthography | Rule-based | Fast | Medial order, compatibility violations |
| 20 | Syntactic Rule | YAML rules | Fast | Grammar rule violations |
| 24 | Statistical Confusable | Bigram ratio | Fast | Bigram-based confusable word detection |
| 25 | Broken Compound | Dictionary | Fast | Wrongly split compound words |
| 30 | POS Sequence | POS tagger | Moderate | Invalid POS tag sequences |
| 40 | Question | Pattern matching | Fast | Question structure errors |
| 45 | Homophone | N-gram + frequency | Moderate | Sound-alike word confusion |
| 47 | Confusable Compound Classifier | MLP (ONNX) | Fast | MLP-based confusable/compound detection |
| 48 | Confusable Semantic | AI MLM | Slow | MLM-enhanced confusable detection |
| 50 | N-gram Context | Bigram/Trigram | Moderate | Real-word errors (correct spelling, wrong context) |
| 70 | Semantic | AI masked language model | Slow | Deep context errors (ONNX model) |
Strategies 10–50 are rule- and statistics-based and run by default, with the exception of 47 and 48: those two, along with 70, are AI-powered and require trained models.
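The priority ordering and skip-already-flagged behaviour can be sketched like this (a hypothetical strategy interface; the real strategies return rich error objects, not bare positions):

```python
def run_pipeline(text, strategies):
    """Run (priority, name, fn) strategies in ascending priority order.

    fn(text) returns the set of character positions it would flag.
    Positions flagged by an earlier (cheaper) strategy are never
    re-flagged by a later one. Illustrative sketch only.
    """
    flagged = {}
    for priority, name, fn in sorted(strategies):
        for pos in fn(text):
            if pos not in flagged:  # later strategies skip flagged positions
                flagged[pos] = name
    return flagged
```

If a priority-10 tone check and a priority-50 N-gram check both flag position 0, the report attributes position 0 to the tone check and the slower strategy does no extra work there.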

Syllable Validation

The foundation layer. Uses SyllableRuleValidator (with Cython acceleration) plus dictionary lookup to validate Myanmar character combinations against orthographic rules. Catches invalid medial stacking, tone mark placement, and character sequences in O(1) time, before any dictionary or AI is needed. See Syllable Validation.

Word Validation with SymSpell

Uses the SymSpell symmetric delete algorithm for O(1) correction suggestions, which is 1,000x faster than traditional Levenshtein search. Pre-computes deletion candidates at build time so lookups are hash table hits, not edit distance calculations. Includes Myanmar-specific substitution costs that weight medial confusion (ျ↔ြ) lower than unrelated character swaps. See Word Validation and SymSpell Algorithm.
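The core trick can be sketched in a few lines (a toy index, not the library's implementation, which adds Myanmar-specific variant generation and weighted ranking): precompute every deletion variant of every dictionary word at build time, then answer queries with pure hash lookups.

```python
from itertools import combinations

def deletes(word, max_edits=1):
    """All strings reachable by deleting up to max_edits characters."""
    out = {word}
    for k in range(1, min(max_edits, len(word)) + 1):
        for drop in combinations(range(len(word)), k):
            out.add("".join(c for i, c in enumerate(word) if i not in drop))
    return out

def build_index(dictionary, max_edits=1):
    """Build-time step: map each deletion variant to the words producing it."""
    index = {}
    for word in dictionary:
        for variant in deletes(word, max_edits):
            index.setdefault(variant, set()).add(word)
    return index

def lookup(query, index, max_edits=1):
    """Query-time step: hash hits only, no edit-distance scan of the dictionary."""
    candidates = set()
    for variant in deletes(query, max_edits):
        candidates |= index.get(variant, set())
    return candidates
```

Because both sides only ever delete characters, a misspelling within the edit limit shares at least one variant with its correction, and lookup cost is independent of dictionary size.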

Context Checking & Homophones

N-gram Context uses bigram/trigram probability tables to detect real-word errors, which are words that are spelled correctly but wrong for the context. For example, “နီ” (red) vs “နေ” (stay) are both valid words, but only one fits “ဘာလုပ်___လဲ” (what are you doing?). Homophone Detection extends this with bidirectional N-gram analysis (checks both forward and backward context) and frequency-aware guards that prevent high-frequency words from being incorrectly flagged. See Context Checking and Homophones.
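A toy version of the bigram check (illustrative tokens, counts, and threshold; the real tables are built from your corpus):

```python
def flag_real_word_errors(tokens, bigram_counts, min_count=1):
    """Flag positions whose (previous, current) bigram is unattested.

    Every token is assumed to be a valid dictionary word already; this
    only catches words that are wrong *for the context*. Toy sketch.
    """
    return [i for i in range(1, len(tokens))
            if bigram_counts.get((tokens[i - 1], tokens[i]), 0) < min_count]
```

A well-attested pair passes; a valid word that never follows its neighbour in the corpus gets flagged, which is exactly the “correct spelling, wrong context” case.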

Grammar Checking

Eight specialized grammar checkers, each handling a different aspect of Myanmar grammar:
| Checker | What It Catches | Example |
|---|---|---|
| Aspect | Aspect marker misuse | ပြီနေ (“completed + progressive”), invalid sequence |
| Classifier | Classifier-noun agreement | ခွေးဦး (“dog + polite-people classifier”) → ခွေးကောင် (“dog + animal classifier”) |
| Compound | Compound word errors | ပန်ခြံ (“missing tone”) → ပန်းခြံ (“flower garden”) |
| MergedWord | Particle-verb merging | သွားကို (“verb + object particle merged”), should be separate |
| Negation | Negation pattern errors | မသွားတယ် (“negation + affirmative ending”) → မသွားဘူး (“negation + negative ending”) |
| Particle | Particle context errors | Wrong particle usage based on surrounding POS context |
| TenseAgreement | Tense-time agreement | Tense marker contradicts temporal context |
| Register | Formal/informal mixing | ငါသွားပါသည် (“colloquial pronoun + formal ending”), register clash |
All rules are YAML-driven and customizable. See Grammar Checking and Grammar Checkers.

POS Tagging

Pluggable part-of-speech tagging with three backends:
| Backend | Speed | Use Case |
|---|---|---|
| Rule-Based | Fast | Simple validation, resource-constrained |
| Viterbi HMM | Medium | Balanced accuracy and speed |
| Transformer | Slow | Maximum accuracy, ~93% (requires transformers package) |
Includes joint segmentation + tagging, which performs word breaking and POS tagging in a single pass, avoiding error propagation. See POS Tagging.
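The Viterbi HMM backend follows the textbook algorithm; here is a compact sketch with a toy two-tag model (all tags and probabilities below are made up for illustration, not the library's trained parameters):

```python
def viterbi(observations, states, start_p, trans_p, emit_p, unk=1e-9):
    """Most likely tag sequence under an HMM (textbook Viterbi)."""
    # trellis[t][state] = (best probability, best previous state)
    trellis = [{s: (start_p[s] * emit_p[s].get(observations[0], unk), None)
                for s in states}]
    for t in range(1, len(observations)):
        column = {}
        for s in states:
            prev = max(states, key=lambda p: trellis[t - 1][p][0] * trans_p[p][s])
            column[s] = (trellis[t - 1][prev][0] * trans_p[prev][s]
                         * emit_p[s].get(observations[t], unk), prev)
        trellis.append(column)
    # Backtrack from the best final state.
    tag = max(states, key=lambda s: trellis[-1][s][0])
    path = [tag]
    for t in range(len(observations) - 1, 0, -1):
        tag = trellis[t][tag][1]
        path.append(tag)
    return path[::-1]
```

The same dynamic program underlies joint segmentation + tagging: instead of one observation per word, the lattice carries every candidate segmentation and the best path picks both splits and tags at once.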

AI Semantic Checking

Trains a masked language model from scratch on your corpus. At inference, masks each word and asks “what should go here?”. If the model strongly disagrees with the original word, it flags a semantic error and suggests alternatives. Handles Myanmar-specific challenges: word-aligned multi-token masking for BPE-split words, beam search for multi-token prediction, and per-model confidence calibration (XLM-RoBERTa, BERT, DistilBERT). See Semantic Checking and Semantic Algorithm.
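The decision rule can be sketched independently of any particular model (the `predict` callable below is a stand-in for the ONNX MLM, and the `ratio` threshold stands in for per-model confidence calibration; both are assumptions for illustration):

```python
def semantic_flags(tokens, predict, ratio=0.1):
    """Flag positions where the MLM strongly disagrees with the original word.

    predict(tokens, i) -> {word: probability} for the masked position i
    (a stand-in for the real model). A position is flagged when the
    original word scores far below the model's top choice.
    """
    flags = []
    for i in range(len(tokens)):
        probs = predict(tokens, i)
        top_word, top_p = max(probs.items(), key=lambda kv: kv[1])
        if top_word != tokens[i] and probs.get(tokens[i], 0.0) < ratio * top_p:
            flags.append((i, top_word))
    return flags
```

Raising `ratio` makes the checker more aggressive; the per-model calibration mentioned above exists because different MLM architectures distribute probability mass very differently.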

Compound & Morpheme Handling

  • CompoundResolver: DP-based compound word synthesis that breaks OOV words into known components
  • ReduplicationEngine: Validates productive reduplication patterns (e.g., ရှင်းရှင်းလင်းလင်း)
  • Morpheme-level correction: Corrects individual morphemes within compound words instead of replacing the entire word
See Morphology.

Named Entity Recognition

Reduces false positives by identifying names and places before spell checking. Three implementations: heuristic (fast), CRF-based, and Transformer-based (93% accuracy). NER-flagged tokens are skipped by downstream validation strategies. See NER.

Dictionary Building Pipeline

Build custom dictionaries from your own text corpora:
  • Multi-Format Ingestion: .txt, .csv, .tsv, .json, .jsonl, .parquet
  • Parallel Processing: Cython + OpenMP batch processor for fast segmentation
  • N-gram Frequency: Bigram/Trigram probability tables for context checking
  • Incremental Builds: Resume processing without reprocessing completed files
  • Pluggable Storage: SQLite (default, disk-based) or MemoryProvider (RAM-based) with thread-safe connection pooling
See Dictionary Building.
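The N-gram frequency step amounts to counting adjacent pairs over segmented text; a minimal sketch (the real pipeline streams files in parallel and persists counts to SQLite):

```python
from collections import Counter

def build_ngram_tables(segmented_sentences):
    """Count unigrams and bigrams from already-segmented sentences.

    Toy in-memory version of the N-gram frequency build step.
    """
    unigrams, bigrams = Counter(), Counter()
    for words in segmented_sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams
```

The resulting tables are exactly what the context-checking layer consumes: bigram counts for real-word error detection and unigram counts for frequency-aware guards.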

AI Model Training

Two end-to-end training pipelines that handle tokenizer creation, model training, and ONNX export with INT8 quantization:
| Pipeline | Model Type | Base | Inference | CLI |
|---|---|---|---|---|
| Semantic | Masked Language Model | Train from scratch (RoBERTa/BERT) | ~200ms | train-model |
| Neural Reranker | MLP (19-feature) | Train on synthetic errors | ~50µs | reranker_trainer |
The semantic model handles detection, suggestion scoring, and context validation via MLM objectives with optional word-boundary-aware masking. The neural reranker learns to re-order spelling suggestions using 19 extracted features (edit distance, frequency, phonetic similarity, etc.). Both pipelines support streaming for large corpora and ONNX export. See Training Guide.

Myanmar Language Support

  • Text Normalization: Unified service for zero-width character removal, NFC/NFD normalization, and diacritic reordering (with Cython acceleration)
  • Zawgyi Detection: Built-in detection and warning for legacy Zawgyi encoded text, with automatic conversion
  • Phonetic Hashing: Sound-based similarity matching for Myanmar characters, powering homophone detection
  • Colloquial Variants: Detection of informal spellings (e.g., ကျနော် → ကျွန်တော်) with configurable strictness (strict, lenient, off)
  • Tone Processing: Tone mark validation, disambiguation, and context-based correction
  • Bilingual Error Messages: Error reporting in English and Myanmar (မြန်မာ) via i18n system
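The first two steps of normalization can be approximated with the standard library (a sketch of zero-width stripping plus NFC; the real service also reorders Myanmar diacritics and is Cython-accelerated):

```python
import unicodedata

# Common invisible characters that corrupt Myanmar text comparisons
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def normalize_text(text):
    """Strip zero-width characters, then apply Unicode NFC normalization."""
    stripped = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return unicodedata.normalize("NFC", stripped)
```

Without this step, two visually identical strings can fail dictionary lookup purely because one carries an invisible zero-width space or a decomposed code-point sequence.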

Performance & Production

  • 11 Cython/C++ Extensions: Performance-critical paths (normalization, edit distance, batch processing, Viterbi, word segmentation) compiled to C++ with OpenMP parallelization
  • Streaming & Batch APIs: check_batch for parallel processing and check_async for non-blocking operations
  • Configurable Profiles: Pre-defined profiles (DEFAULT, FAST, ACCURATE) or custom configuration with environment/file-based loading
  • Connection Pooling: Thread-safe SQLite connection management for multi-threaded applications
  • DI Container: Dependency injection for advanced component wiring and testability
See Performance Tuning, Streaming, and Configuration.

Feature Matrix

| Feature | Method | Speed |
|---|---|---|
| Syllable Validation | Rule-based + dictionary | Fast |
| Word Validation | SymSpell (O(1) symmetric delete) | Fast |
| Context Checking | Bigram/Trigram N-gram | Moderate |
| Grammar Checking | POS + YAML rules | Moderate |
| Grammar Checkers | Aspect/Classifier/Compound/MergedWord/Negation/Register | Fast |
| NER | Heuristic + Transformer | Fast to Slow |
| Semantic Checking | AI masked language model (ONNX) | Slow |
| Batch Processing | Parallel processing | Varies |
For measured end-to-end performance (F1 96.2% without semantic, 98.3% with semantic v2.3), see the benchmarks page.

Acknowledgments

Models & Resources

| Resource | Author | Description |
|---|---|---|
| Myanmar POS Model | Chuu Htet Naing | XLM-RoBERTa-based POS tagger (93.37% accuracy) |
| myWord Segmentation | Ye Kyaw Thu | Viterbi-based Myanmar word segmentation |
| CRF Word Segmenter | Ye Kyaw Thu | CRF-based syllable-to-word segmentation model |

Key Dependencies

| Library | Purpose | License |
|---|---|---|
| pycrfsuite | CRF model inference | MIT |
| transformers | Transformer model inference | Apache 2.0 |

Algorithm References

| Algorithm | Author | Description |
|---|---|---|
| SymSpell | Wolf Garbe | Symmetric delete spelling correction. mySpellChecker includes a custom implementation with Myanmar-specific variant generation. |