Most spell checkers split text on spaces. Myanmar doesn’t have spaces, so that approach fails completely. mySpellChecker works around this by starting from syllables, which can be identified without a dictionary, and building up from there: syllable validation catches ~90% of errors cheaply, then word lookup, grammar rules, and AI handle the rest. You build your own dictionary from a text corpus, train optional AI models on your own data, and get a checking pipeline that adapts to your domain.
v1.0 supports Standard Burmese (Myanmar) only. Other Myanmar-script languages (Shan, Karen, Mon, etc.) are planned for future releases.
```
pip install myspellchecker
```

Quickstart

Install, build a dictionary, and check your first text in 5 minutes.

How It Works, Layer by Layer

You can’t check what you can’t split. So mySpellChecker splits smaller, starting from syllables, the smallest reliable unit, then builds up layer by layer until nothing gets through.
English: "I am reading a book"     → 5 words (split on spaces)
Myanmar: "ကျွန်တော်စာဖတ်နေပါတယ်"  → Where are the word boundaries?
Most spell checkers assume whitespace = word boundary. Myanmar has none. Instead of attempting expensive word segmentation on potentially incorrect text, the pipeline processes text through four progressive layers, where each layer catches what the one before it can’t:

Layer 1: Syllable Validation

22 structural rules, no dictionary needed. Catches invalid syllable structures, medial ordering and compatibility, tone mark errors, virama and stacking issues, kinzi patterns, vowel exclusivity, diacritic uniqueness, Great Sa rules, particle typos, and corruption detection. This alone catches ~90% of errors in O(1) time.
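To give a flavor of what a structural rule looks like, here is a toy ordering check, a single regex standing in for one of the 22 rules (illustrative only, not the library's actual validator): a syllable is roughly a base consonant, then medials, then dependent vowel signs, then tone marks.

```python
import re

# Toy structural check: base consonant, then medials, then dependent
# vowel signs, then tone marks / asat. Illustrative sketch only -- the
# real validator applies 22 separate rules.
SYLLABLE = re.compile(
    r"^[\u1000-\u1021]"        # base consonant (ka .. a)
    r"[\u103B-\u103E]{0,3}"    # medials (ya-pin, ya-yit, wa-swe, ha-hto)
    r"[\u102B-\u1035]{0,2}"    # dependent vowel signs
    r"[\u1036-\u103A]{0,2}$"   # anusvara, tone marks, asat
)

def is_structurally_valid(syllable: str) -> bool:
    """Return True if the character ordering matches the rough template."""
    return bool(SYLLABLE.match(syllable))
```

A check like this needs no dictionary at all: ကျ (consonant + medial) passes, while the reversed sequence (medial before consonant) fails immediately.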

Layer 2: Word Validation

SymSpell O(1) symmetric delete lookup with a neural reranker (19-feature ONNX MLP). Handles out-of-vocabulary words, compound word resolution, ambiguous segmentation, morphological root recovery, colloquial variant detection, phonetic similarity matching, and Zawgyi detection. Only runs on text that passed syllable validation.

Layer 2.5: Grammar Rules

POS tagging (3 pluggable backends: rule-based, Viterbi HMM, Transformer) plus 8 grammar checkers driven by YAML rule files. Catches register mixing (colloquial/formal), aspect marker errors, classifier-noun agreement, negation pattern mismatches, compound formation issues, merged word/particle detection, particle context errors, and tense-time agreement violations.

Layer 3: Context & AI

12 priority-ordered validation strategies combining N-gram context, statistical confusable detection, MLP classification, phonetic analysis, and custom-trained AI models. Catches homophones, confusable variants (statistical + semantic via MLM), broken compounds, question structure errors, POS sequence violations, tone ambiguity, and deep semantic errors. AI strategies are opt-in and require trained models.
Faster rules always run first. Each layer builds on validated output from the layer below, and positions already flagged are skipped by later strategies. This means most errors are caught cheaply, and expensive operations (N-gram lookups, AI inference) only run on text that has already passed basic validation.

Why Myanmar Spell Checking Is Hard

Building a spell checker for Myanmar is fundamentally different from English. These are the challenges that left this problem unsolved for decades and that shaped every design decision in mySpellChecker.
Most spell checkers assume whitespace = word boundary. Myanmar text is a continuous stream of characters with no delimiters. You can’t even begin checking until you figure out where one word ends and the next begins. mySpellChecker starts from syllables, the smallest reliable unit, and builds up.
Without spaces, the same character sequence can be segmented into different valid words with different meanings. Only context tells you which split is correct. This makes every stage of the pipeline, from syllable grouping to word lookup, dependent on disambiguation.
Myanmar syllable structure has strict rules: consonant, optional medials, vowel, optional tone. A syllable can pass every structural check and still form a word that doesn’t exist. Syllable validation catches invalid structures immediately, but real-word errors need dictionary lookup, grammar rules, and context analysis.
Myanmar has four medial consonants (ျ ြ ွ ှ) that attach to base characters. The ya-pin (ျ) and ya-yit (ြ) look nearly identical but produce completely different words: ကျောင်း (school) vs ကြောင်း (reason). This is the most common Myanmar typo, and mySpellChecker has 49+ dedicated correction rules for medial confusion alone.
These aren’t just visual typos; they’re phonetically identical or near-identical words with different meanings. A simple dictionary lookup won’t catch them because both words exist. You need phonetic hashing, frequency analysis, and context to pick the right one.
Every word in the sentence is a real word. The sentence is still wrong. Real-word errors are the hardest class of spelling mistakes in any language. Catching these requires N-gram probabilities, POS sequence validation, or AI, which means layer 3 and beyond in the pipeline.
Myanmar negation uses a circumfix, which is a prefix (မ) and a matching sentence-final particle (ဘူး) that wrap around the verb. Drop or mismatch either part and the sentence is grammatically broken, even though every individual word is valid. This pattern is rare across world languages and requires structural grammar checking, not just dictionary lookup.
Myanmar has two registers, colloquial and literary. Neither form is “wrong” on its own, but mixing them in one sentence is a grammar error. A news article using colloquial endings, or a chat message using literary forms, both need flagging. You need POS tagging and register-aware grammar rules, not just a dictionary.
Myanmar builds words by combining simpler ones: စာ (text) + အုပ် (bundle) = စာအုပ် (book). But the number of possible combinations explodes. You can’t just check each part independently; you need a compound resolution algorithm to know which combinations actually exist.
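The idea behind compound resolution can be sketched as a standard word-break dynamic program (an illustrative toy, not the CompoundResolver implementation): find a segmentation of an unknown word into known dictionary parts, preferring fewer parts.

```python
def resolve_compound(word, lexicon):
    """Split `word` into known lexicon entries, preferring fewer parts.

    Returns a list of parts, or None if no full segmentation exists.
    Toy word-break DP; illustrative sketch only.
    """
    best = [None] * (len(word) + 1)   # best[i] = best split of word[:i]
    best[0] = []
    for i in range(1, len(word) + 1):
        for j in range(i):
            if best[j] is not None and word[j:i] in lexicon:
                parts = best[j] + [word[j:i]]
                if best[i] is None or len(parts) < len(best[i]):
                    best[i] = parts
    return best[len(word)]
```

With a lexicon containing စာ and အုပ်, `resolve_compound("စာအုပ်", lexicon)` recovers the two components; a word with no full cover returns None and stays flagged.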
Myanmar uses reduplication as a grammatical pattern: လှလှ (very beautiful), ခဏခဏ (frequently). A naive spell checker would flag these as copy-paste errors. You need grammar rules that recognize reduplication as intentional and know which words can reduplicate.
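A minimal recognition check might look like this (an illustrative sketch, not the ReduplicationEngine; the whitelist of reduplicable stems is an assumption here): treat a word as reduplicated if it is an even-length doubling of a stem known to reduplicate.

```python
def is_reduplication(word, reduplicable_stems):
    """True if `word` is `stem + stem` for a stem allowed to reduplicate.

    Toy check; the real engine also covers longer productive patterns
    such as the four-part form in the docs above.
    """
    half, rem = divmod(len(word), 2)
    return rem == 0 and word[:half] == word[half:] and word[:half] in reduplicable_stems
```

With လှ whitelisted, လှလှ is accepted as intentional rather than flagged as a doubled word.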
Words like သမ္မတ (president) use subscript stacked consonants that follow Pali/Sanskrit rules, not native Myanmar rules. The spell checker needs two rule systems running in parallel: one for native words and one for loanwords.
English uses capitals to signal proper nouns and sentence starts. Myanmar has no such mechanism, so မြန်မာ could be the country or the adjective. Without named entity recognition (NER), the spell checker would flag people’s names and place names as errors.
~30% of Myanmar text online still uses Zawgyi, a legacy encoding that looks identical to Unicode but uses completely different code points. Even within Unicode, the same character can have multiple valid code point orderings. Text must be detected, converted, and normalized before spell checking can even begin.
Standard Levenshtein treats all swaps equally. But in Myanmar, medial confusion (ျ↔ြ) accounts for ~30% of all errors and should cost 0.2, while a cross-class swap costs 0.5. Without Myanmar-weighted edit distance, the most common corrections rank equal to implausible ones. mySpellChecker uses a custom cost matrix derived from corpus analysis so plausible errors always surface first.
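The weighting idea can be sketched with a standard Levenshtein DP whose substitution cost is looked up per character pair. The 0.2 weight below mirrors the example in the text; the library's actual cost matrix is derived from corpus analysis.

```python
def weighted_edit_distance(a, b, sub_cost, default=1.0):
    """Levenshtein distance with per-pair substitution costs (toy sketch)."""
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i * default
    for j in range(1, len(b) + 1):
        d[0][j] = j * default
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else sub_cost.get((a[i - 1], b[j - 1]), default)
            d[i][j] = min(d[i - 1][j] + default,   # delete
                          d[i][j - 1] + default,   # insert
                          d[i - 1][j - 1] + sub)   # substitute
    return d[-1][-1]

# ya-pin <-> ya-yit confusion is cheap; everything else costs the default
COSTS = {("\u103B", "\u103C"): 0.2, ("\u103C", "\u103B"): 0.2}
```

Under these weights the distance from ကျ to ကြ is 0.2 while an unrelated one-character swap costs 1.0, so the common medial confusion always ranks first among candidates.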
Edit distance finds the nearest word, but the nearest word may be grammatically impossible in context. A misspelled verb might have a noun as its closest spelling match. You need POS tagging and bigram probabilities to promote the candidate that actually fits the sentence, even if it’s farther in edit distance.
Myanmar speech merges distinctions that the writing system preserves: aspirated pairs (က/ခ, စ/ဆ), three nasal endings (န်, မ်, ံ) that all sound like /n/, and three stop endings (က်, တ်, ပ်) that all reduce to a glottal stop. The corrector must know which written form is correct when they all sound identical, using frequency, context, and semantic analysis.

Key Features

12-Strategy Validation Pipeline

The checking pipeline runs up to 12 composable strategies in priority order. Each strategy builds on the output of previous ones, and positions already flagged are skipped by later strategies:
| Priority | Strategy | Method | Speed | What It Catches |
|---|---|---|---|---|
| 10 | Tone Validation | Rule-based | Fast | Tone mark errors and disambiguation |
| 15 | Orthography | Rule-based | Fast | Medial order, compatibility violations |
| 20 | Syntactic Rule | YAML rules | Fast | Grammar rule violations |
| 24 | Statistical Confusable | Bigram ratio | Fast | Bigram-based confusable word detection |
| 25 | Broken Compound | Dictionary | Fast | Wrongly split compound words |
| 30 | POS Sequence | POS tagger | Moderate | Invalid POS tag sequences |
| 40 | Question | Pattern matching | Fast | Question structure errors |
| 45 | Homophone | N-gram + frequency | Moderate | Sound-alike word confusion |
| 47 | Confusable Compound Classifier | MLP (ONNX) | Fast | MLP-based confusable/compound detection |
| 48 | Confusable Semantic | AI MLM | Slow | MLM-enhanced confusable detection |
| 50 | N-gram Context | Bigram/Trigram | Moderate | Real-word errors (correct spelling, wrong context) |
| 70 | Semantic | AI masked language model | Slow | Deep context errors (ONNX model) |
Strategies 10–50 are rule- and statistics-based and run by default, with the exception of 47 and 48: those two, along with 70, are AI-powered and require trained models.
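The priority ordering and skip-already-flagged behaviour can be sketched like this (a hypothetical strategy interface; the real strategies return rich error objects, not bare positions):

```python
def run_pipeline(text, strategies):
    """Run (priority, name, fn) strategies in ascending priority order.

    fn(text) returns the set of character positions it would flag.
    Positions flagged by an earlier (cheaper) strategy are never
    re-flagged by a later one. Illustrative sketch only.
    """
    flagged = {}
    for priority, name, fn in sorted(strategies):
        for pos in fn(text):
            if pos not in flagged:  # later strategies skip flagged positions
                flagged[pos] = name
    return flagged
```

If a priority-10 tone check and a priority-50 N-gram check both flag position 0, the report attributes position 0 to the tone check and the slower strategy does no extra work there.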

Syllable Validation

The foundation layer. Uses SyllableRuleValidator (with Cython acceleration) plus dictionary lookup to validate Myanmar character combinations against orthographic rules. Catches invalid medial stacking, tone mark placement, and character sequences in O(1) time, before any dictionary or AI is needed. See Syllable Validation.

Word Validation with SymSpell

Uses the SymSpell symmetric delete algorithm for O(1) correction suggestions, which is 1,000x faster than traditional Levenshtein search. Pre-computes deletion candidates at build time so lookups are hash table hits, not edit distance calculations. Includes Myanmar-specific substitution costs that weight medial confusion (ျ↔ြ) lower than unrelated character swaps. See Word Validation and SymSpell Algorithm.
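The core trick can be sketched in a few lines (a toy index, not the library's implementation, which adds Myanmar-specific variant generation and weighted ranking): precompute every deletion variant of every dictionary word at build time, then answer queries with pure hash lookups.

```python
from itertools import combinations

def deletes(word, max_edits=1):
    """All strings reachable by deleting up to max_edits characters."""
    out = {word}
    for k in range(1, min(max_edits, len(word)) + 1):
        for drop in combinations(range(len(word)), k):
            out.add("".join(c for i, c in enumerate(word) if i not in drop))
    return out

def build_index(dictionary, max_edits=1):
    """Build-time step: map each deletion variant to the words producing it."""
    index = {}
    for word in dictionary:
        for variant in deletes(word, max_edits):
            index.setdefault(variant, set()).add(word)
    return index

def lookup(query, index, max_edits=1):
    """Query-time step: hash hits only, no edit-distance scan of the dictionary."""
    candidates = set()
    for variant in deletes(query, max_edits):
        candidates |= index.get(variant, set())
    return candidates
```

Because both sides only ever delete characters, a misspelling within the edit limit shares at least one variant with its correction, and lookup cost is independent of dictionary size.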

Context Checking & Homophones

N-gram Context uses bigram/trigram probability tables to detect real-word errors, which are words that are spelled correctly but wrong for the context. For example, “နီ” (red) vs “နေ” (stay) are both valid words, but only one fits “ဘာလုပ်___လဲ” (what are you doing?). Homophone Detection extends this with bidirectional N-gram analysis (checks both forward and backward context) and frequency-aware guards that prevent high-frequency words from being incorrectly flagged. See Context Checking and Homophones.
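A toy version of the bigram check (illustrative tokens, counts, and threshold; the real tables are built from your corpus):

```python
def flag_real_word_errors(tokens, bigram_counts, min_count=1):
    """Flag positions whose (previous, current) bigram is unattested.

    Every token is assumed to be a valid dictionary word already; this
    only catches words that are wrong *for the context*. Toy sketch.
    """
    return [i for i in range(1, len(tokens))
            if bigram_counts.get((tokens[i - 1], tokens[i]), 0) < min_count]
```

A well-attested pair passes; a valid word that never follows its neighbour in the corpus gets flagged, which is exactly the “correct spelling, wrong context” case.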

Grammar Checking

Eight specialized grammar checkers, each handling a different aspect of Myanmar grammar:
| Checker | What It Catches | Example |
|---|---|---|
| Aspect | Aspect marker misuse | ပြီနေ (“completed + progressive”), invalid sequence |
| Classifier | Classifier-noun agreement | ခွေးဦး (“dog + polite-people classifier”) → ခွေးကောင် (“dog + animal classifier”) |
| Compound | Compound word errors | ပန်ခြံ (“missing tone”) → ပန်းခြံ (“flower garden”) |
| MergedWord | Particle-verb merging | သွားကို (“verb + object particle merged”), should be separate |
| Negation | Negation pattern errors | မသွားတယ် (“negation + affirmative ending”) → မသွားဘူး (“negation + negative ending”) |
| Particle | Particle context errors | Wrong particle usage based on surrounding POS context |
| TenseAgreement | Tense-time agreement | Tense marker contradicts temporal context |
| Register | Formal/informal mixing | ငါသွားပါသည် (“colloquial pronoun + formal ending”), register clash |
All rules are YAML-driven and customizable. See Grammar Checking and Grammar Checkers.

POS Tagging

Pluggable part-of-speech tagging with three backends:
| Backend | Speed | Use Case |
|---|---|---|
| Rule-Based | Fast | Simple validation, resource-constrained |
| Viterbi HMM | Medium | Balanced accuracy and speed |
| Transformer | Slow | Maximum accuracy, ~93% (requires transformers package) |
Includes joint segmentation + tagging, which performs word breaking and POS tagging in a single pass, avoiding error propagation. See POS Tagging.
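The Viterbi HMM backend follows the textbook algorithm; here is a compact sketch with a toy two-tag model (all tags and probabilities below are made up for illustration, not the library's trained parameters):

```python
def viterbi(observations, states, start_p, trans_p, emit_p, unk=1e-9):
    """Most likely tag sequence under an HMM (textbook Viterbi)."""
    # trellis[t][state] = (best probability, best previous state)
    trellis = [{s: (start_p[s] * emit_p[s].get(observations[0], unk), None)
                for s in states}]
    for t in range(1, len(observations)):
        column = {}
        for s in states:
            prev = max(states, key=lambda p: trellis[t - 1][p][0] * trans_p[p][s])
            column[s] = (trellis[t - 1][prev][0] * trans_p[prev][s]
                         * emit_p[s].get(observations[t], unk), prev)
        trellis.append(column)
    # Backtrack from the best final state.
    tag = max(states, key=lambda s: trellis[-1][s][0])
    path = [tag]
    for t in range(len(observations) - 1, 0, -1):
        tag = trellis[t][tag][1]
        path.append(tag)
    return path[::-1]
```

The same dynamic program underlies joint segmentation + tagging: instead of one observation per word, the lattice carries every candidate segmentation and the best path picks both splits and tags at once.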

AI Semantic Checking

Trains a masked language model from scratch on your corpus. At inference, masks each word and asks “what should go here?”. If the model strongly disagrees with the original word, it flags a semantic error and suggests alternatives. Handles Myanmar-specific challenges: word-aligned multi-token masking for BPE-split words, beam search for multi-token prediction, and per-model confidence calibration (XLM-RoBERTa, BERT, DistilBERT). See Semantic Checking and Semantic Algorithm.
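The decision rule can be sketched independently of any particular model (the `predict` callable below is a stand-in for the ONNX MLM, and the `ratio` threshold stands in for per-model confidence calibration; both are assumptions for illustration):

```python
def semantic_flags(tokens, predict, ratio=0.1):
    """Flag positions where the MLM strongly disagrees with the original word.

    predict(tokens, i) -> {word: probability} for the masked position i
    (a stand-in for the real model). A position is flagged when the
    original word scores far below the model's top choice.
    """
    flags = []
    for i in range(len(tokens)):
        probs = predict(tokens, i)
        top_word, top_p = max(probs.items(), key=lambda kv: kv[1])
        if top_word != tokens[i] and probs.get(tokens[i], 0.0) < ratio * top_p:
            flags.append((i, top_word))
    return flags
```

Raising `ratio` makes the checker more aggressive; the per-model calibration mentioned above exists because different MLM architectures distribute probability mass very differently.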

Compound & Morpheme Handling

  • CompoundResolver: DP-based compound word synthesis that breaks OOV words into known components
  • ReduplicationEngine: Validates productive reduplication patterns (e.g., ရှင်းရှင်းလင်းလင်း)
  • Morpheme-level correction: Corrects individual morphemes within compound words instead of replacing the entire word
See Morphology.

Named Entity Recognition

Reduces false positives by identifying names and places before spell checking. Three implementations: heuristic (fast), CRF-based, and Transformer-based (93% accuracy). NER-flagged tokens are skipped by downstream validation strategies. See NER.

Dictionary Building Pipeline

Build custom dictionaries from your own text corpora:
  • Multi-Format Ingestion: .txt, .csv, .tsv, .json, .jsonl, .parquet
  • Parallel Processing: Cython + OpenMP batch processor for fast segmentation
  • N-gram Frequency: Bigram/Trigram probability tables for context checking
  • Incremental Builds: Resume processing without reprocessing completed files
  • Pluggable Storage: SQLite (default, disk-based) or MemoryProvider (RAM-based) with thread-safe connection pooling
See Dictionary Building.
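The N-gram frequency step amounts to counting adjacent pairs over segmented text; a minimal sketch (the real pipeline streams files in parallel and persists counts to SQLite):

```python
from collections import Counter

def build_ngram_tables(segmented_sentences):
    """Count unigrams and bigrams from already-segmented sentences.

    Toy in-memory version of the N-gram frequency build step.
    """
    unigrams, bigrams = Counter(), Counter()
    for words in segmented_sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams
```

The resulting tables are exactly what the context-checking layer consumes: bigram counts for real-word error detection and unigram counts for frequency-aware guards.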

AI Model Training

Two end-to-end training pipelines that handle tokenizer creation, model training, and ONNX export with INT8 quantization:
| Pipeline | Model Type | Base | Inference | CLI |
|---|---|---|---|---|
| Semantic | Masked Language Model | Train from scratch (RoBERTa/BERT) | ~200ms | train-model |
| Neural Reranker | MLP (19-feature) | Train on synthetic errors | ~50µs | reranker_trainer |
The semantic model handles detection, suggestion scoring, and context validation via MLM objectives with optional word-boundary-aware masking. The neural reranker learns to re-order spelling suggestions using 19 extracted features (edit distance, frequency, phonetic similarity, etc.). Both pipelines support streaming for large corpora and ONNX export. See Training Guide.

Myanmar Language Support

  • Text Normalization: Unified service for zero-width character removal, NFC/NFD normalization, and diacritic reordering (with Cython acceleration)
  • Zawgyi Detection: Built-in detection and warning for legacy Zawgyi encoded text, with automatic conversion
  • Phonetic Hashing: Sound-based similarity matching for Myanmar characters, powering homophone detection
  • Colloquial Variants: Detection of informal spellings (e.g., ကျနော် → ကျွန်တော်) with configurable strictness (strict, lenient, off)
  • Tone Processing: Tone mark validation, disambiguation, and context-based correction
  • Bilingual Error Messages: Error reporting in English and Myanmar (မြန်မာ) via i18n system
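The first two steps of normalization can be approximated with the standard library (a sketch of zero-width stripping plus NFC; the real service also reorders Myanmar diacritics and is Cython-accelerated):

```python
import unicodedata

# Common invisible characters that corrupt Myanmar text comparisons
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def normalize_text(text):
    """Strip zero-width characters, then apply Unicode NFC normalization."""
    stripped = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return unicodedata.normalize("NFC", stripped)
```

Without this step, two visually identical strings can fail dictionary lookup purely because one carries an invisible zero-width space or a decomposed code-point sequence.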

Performance & Production

  • 11 Cython/C++ Extensions: Performance-critical paths (normalization, edit distance, batch processing, Viterbi, word segmentation) compiled to C++ with OpenMP parallelization
  • Streaming & Batch APIs: check_batch for parallel processing and check_async for non-blocking operations
  • Configurable Profiles: Pre-defined profiles (DEFAULT, FAST, ACCURATE) or custom configuration with environment/file-based loading
  • Connection Pooling: Thread-safe SQLite connection management for multi-threaded applications
  • DI Container: Dependency injection for advanced component wiring and testability
See Performance Tuning, Streaming, and Configuration.

Feature Matrix

| Feature | Method | Speed |
|---|---|---|
| Syllable Validation | Rule-based + dictionary | Fast |
| Word Validation | SymSpell (O(1) symmetric delete) | Fast |
| Context Checking | Bigram/Trigram N-gram | Moderate |
| Grammar Checking | POS + YAML rules | Moderate |
| Grammar Checkers | Aspect/Classifier/Compound/MergedWord/Negation/Register | Fast |
| NER | Heuristic + Transformer | Fast to Slow |
| Semantic Checking | AI masked language model (ONNX) | Slow |
| Batch Processing | Parallel processing | Varies |
For measured end-to-end performance (F1 96.2% without semantic, 98.3% with semantic v2.3), see the benchmarks page.

Acknowledgments

Models & Resources

| Resource | Author | Description |
|---|---|---|
| Myanmar POS Model | Chuu Htet Naing | XLM-RoBERTa-based POS tagger (93.37% accuracy) |
| myWord Segmentation | Ye Kyaw Thu | Viterbi-based Myanmar word segmentation |
| CRF Word Segmenter | Ye Kyaw Thu | CRF-based syllable-to-word segmentation model |

Key Dependencies

| Library | Purpose | License |
|---|---|---|
| pycrfsuite | CRF model inference | MIT |
| transformers | Transformer model inference | Apache 2.0 |

Algorithm References

| Algorithm | Author | Description |
|---|---|---|
| SymSpell | Wolf Garbe | Symmetric delete spelling correction. mySpellChecker includes a custom implementation with Myanmar-specific variant generation. |