Validating individual words is not enough. Errors often occur when a word is spelled correctly but used incorrectly in context. mySpellChecker employs two strategies for this: Syntactic Grammar Checking (Layer 2.5) and N-gram Probability (Layer 3).
Example (Myanmar homophone confusion):
  • “သူ စား ချင်တယ်။” (He wants to eat.): Correct
  • “သူ စာ ချင်တယ်။” (literally “He letter wants-to”): Spelling correct, context incorrect. The sentence is ungrammatical because ချင် requires a preceding verb.
Another example:
  • “ကျောင်း သား တစ်ယောက်” (A student): Correct
  • “ကျောင်း သာ တစ်ယောက်” (School only one person): Grammatically awkward

Syntactic Grammar Checking

This layer uses Part-of-Speech (POS) tagging and deterministic rules to catch grammatical errors that statistical models might miss due to data sparsity.

How it Works

  1. POS Tagging: Every word in the dictionary can optionally have a POS tag (e.g., N for Noun, V for Verb, P for Particle).
  2. Rule Engine: A set of linguistic rules defines valid and invalid sequences.

Example Rules

  • Verb + Particle Agreement:
    • Invalid: သွား (Go/Verb) + ကျောင်း (School/Noun) → “Go School” (Grammatically awkward, missing particle)
    • Correction: ကျောင်းသွား or ကျောင်းကို သွား → “Go to school”
  • Nominalizer Particle ကြောင်း vs Noun ကျောင်း:
    • သွားကြောင်း ပြောတယ် (Said that [he] went) — Correct: ကြောင်း nominalizes the verb
    • သွားကျောင်း ပြောတယ် — Invalid: ကျောင်း (school) cannot follow a verb directly
  • Particle Selection (မှာ vs မှ):
    • ရုံးမှာ ရှိတယ် (Is at the office) — Correct: မှာ indicates location
    • ရုံးမှ လာတယ် (Came from the office) — Correct: မှ indicates origin
    • Rule: After a noun, မှာ typically means “at/in”, while မှ typically means “from” or marks a conditional.
  • Subject Marker Agreement (က vs ကို):
    • သူက စာအုပ်ကို ဖတ်တယ် (He reads the book) — Correct
    • သူကို စာအုပ်က ဖတ်တယ် (The book reads him) — Semantically incorrect
    • Rule: Animate subjects typically take က; objects take ကို
  • Question Particle Matching:
    • ဘယ်သူလဲ (Who is it?) — Correct: ဘယ် question word + လဲ particle
    • ဘယ်သူလား — Also valid but different nuance (softer question)
    • ဘာလဲ vs ဘာပဲ — Different meanings: “What?” vs “Whatever”
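
A rule of this kind can be sketched as a lookup over adjacent POS tags. This is only an illustration, not mySpellChecker's actual API: the `POS_TAGS` lexicon, the `INVALID_TAG_PAIRS` table, and the `check_sequence` function are hypothetical names, and the single Verb + Noun rule is deliberately simplified.

```python
# Hypothetical POS lexicon using the N / V / P tags described above.
POS_TAGS = {
    "သွား": "V",       # go
    "ကျောင်း": "N",    # school
    "ကြောင်း": "P",    # nominalizer particle
    "ပြောတယ်": "V",    # said
}

# Hypothetical rule table: adjacent tag pairs treated as invalid.
INVALID_TAG_PAIRS = {
    ("V", "N"): "a noun cannot directly follow a verb",
}


def check_sequence(words):
    """Return (index, word, reason) for every adjacent pair that breaks a rule.

    Words without a POS tag are skipped, since tagging is optional.
    """
    issues = []
    for i in range(1, len(words)):
        prev_tag = POS_TAGS.get(words[i - 1])
        tag = POS_TAGS.get(words[i])
        reason = INVALID_TAG_PAIRS.get((prev_tag, tag))
        if reason:
            issues.append((i, words[i], reason))
    return issues


# The nominalizer example from the rules above.
print(check_sequence(["သွား", "ကျောင်း", "ပြောတယ်"]))   # flags ကျောင်း
print(check_sequence(["သွား", "ကြောင်း", "ပြောတယ်"]))   # no issues
```

Because the rules are deterministic, a violation is reported even when the corpus has no statistics for the sequence.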

N-gram Probability

mySpellChecker uses N-gram models (Bigrams and Trigrams) to calculate the probability of word sequences.
  1. Bigram: Probability of Word B following Word A, $P(B \mid A)$.
  2. Trigram: Probability of Word C following A and B, $P(C \mid A, B)$.
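
One common way to obtain these conditional probabilities is maximum-likelihood estimation from corpus counts. The source does not say how mySpellChecker estimates them, so the sketch below is an assumption. The toy corpus and the function names are made up for illustration.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count every run of n consecutive words in a list of tokenized sentences."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def conditional_probability(context, word, sentences):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    numerator = ngram_counts(sentences, len(context) + 1)[context + (word,)]
    denominator = ngram_counts(sentences, len(context))[context]
    return numerator / denominator if denominator else 0.0


# Toy corpus (illustrative only, not real frequency data).
corpus = [
    ["သူ", "စား", "ချင်တယ်"],
    ["သူ", "စား", "တယ်"],
    ["သူ", "သွား", "တယ်"],
    ["သူ", "စာ", "ဖတ်တယ်"],
]

bigram = conditional_probability(["သူ"], "စား", corpus)             # P(B | A)
trigram = conditional_probability(["သူ", "စား"], "ချင်တယ်", corpus)  # P(C | A, B)
print(bigram, trigram)
```

A production system would precompute the counts once and store them in a database rather than recounting on every query.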

The Algorithm

  1. Detection: When the checker encounters a sequence of words, it queries the database for the frequency of that sequence. If $P(\text{Word}_i \mid \text{Word}_{i-1})$ is below bigram_threshold, the word is flagged as suspicious.
  2. Correction: The system generates candidates for the suspicious word (using SymSpell or Phonetic matching). It then re-calculates probabilities for each candidate in the sentence.
    Example: Input sentence “သူ စာ ချင်တယ်” (suspicious word: စာ)
    • Candidate “စား” (eat): $P(\text{စား} \mid \text{သူ}) = 0.08$ (High: common verb after pronoun)
    • Candidate “စာ” (letter): $P(\text{စာ} \mid \text{သူ}) = 0.002$ (Low: less common as standalone)
    The system suggests “စား” because it fits the context better with the verb-wanting pattern “ချင်တယ်”.
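
The two steps can be sketched as follows, under stated assumptions. The probability table reuses the figures from the example, the threshold value of 0.01 is invented, and a fixed `CANDIDATES` dictionary stands in for SymSpell or phonetic candidate generation. The sketch scores candidates only against the preceding word, whereas the description above re-scores them in the full sentence.

```python
# Hypothetical bigram table holding the probabilities from the example above.
BIGRAM_PROB = {
    ("သူ", "စား"): 0.08,
    ("သူ", "စာ"): 0.002,
}

# Stand-in for SymSpell / phonetic candidate generation (assumption).
CANDIDATES = {"စာ": ["စား"]}

BIGRAM_THRESHOLD = 0.01  # assumed value; the real default is not given here


def bigram_prob(prev, word):
    """Look up P(word | prev); unseen pairs get probability 0."""
    return BIGRAM_PROB.get((prev, word), 0.0)


def suggest(words, threshold=BIGRAM_THRESHOLD):
    """Map each suspicious position to the candidate that fits its left context best."""
    suggestions = {}
    for i in range(1, len(words)):
        prev, word = words[i - 1], words[i]
        current = bigram_prob(prev, word)
        if current >= threshold:
            continue  # detection: the sequence is common enough
        options = CANDIDATES.get(word, [])
        best = max(options, key=lambda c: bigram_prob(prev, c), default=None)
        if best is not None and bigram_prob(prev, best) > current:
            suggestions[i] = best  # correction: a candidate scores higher
    return suggestions


print(suggest(["သူ", "စာ", "ချင်တယ်"]))
```

Positions that fall below the threshold but have no better candidate are left without a suggestion in this sketch.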

Advanced Strategies

The N-gram checker employs several heuristics to handle unseen data and improve accuracy:

1. Backoff Smoothing (Unigram Check)

If a bigram probability is zero (unseen sequence), the checker looks at the unigram frequency of the word.
  • If the word is very common globally (high unigram frequency), we assume it is likely correct but used in a novel context. It is not flagged as an error.
  • If the word is rare, it is more likely to be a typo.
Example:
  • Input: မြန်မာ ဂီတ (Myanmar music) — bigram unseen in corpus
  • ဂီတ has high unigram frequency (common word for “music”)
  • Result: Not flagged as error, assumed to be valid novel combination
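
A minimal sketch of this backoff decision, assuming hypothetical frequency counts and an invented cutoff (`COMMON_WORD_MIN_FREQ`) for what counts as a globally common word:

```python
# Hypothetical corpus statistics (illustrative numbers only).
UNIGRAM_FREQ = {"မြန်မာ": 5200, "ဂီတ": 1800, "စာအုပ်": 2400}
BIGRAM_COUNT = {}  # neither example bigram below was seen in the corpus

COMMON_WORD_MIN_FREQ = 100  # assumed cutoff for "very common globally"


def flag_unseen_bigram(prev, word):
    """Decide whether a word in an unseen bigram should be flagged.

    Seen bigrams are not handled here; they go through the normal
    threshold check in the detection step.
    """
    if BIGRAM_COUNT.get((prev, word), 0) > 0:
        return False
    # Back off to the unigram: common words are trusted in novel contexts.
    return UNIGRAM_FREQ.get(word, 0) < COMMON_WORD_MIN_FREQ


print(flag_unseen_bigram("မြန်မာ", "ဂီတ"))      # common word, not flagged
print(flag_unseen_bigram("စာအုပ်", "ဖတ်တတ်"))   # rare word, flagged
```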

2. Typo Heuristic

For unseen rare words, the checker searches for “neighbors” (words with Edit Distance = 1) that fit the context with high probability.
  • If a neighbor has a high bigram probability ($P > \text{threshold} \times 10$), we assume the current word is a typo of that neighbor and flag it.
Example:
  • Input: စာအုပ် ဖတ်တတ် (rare/unseen word ဖတ်တတ်)
  • Neighbor found: ဖတ်တယ် (reads) — Edit Distance = 1
  • $P(\text{ဖတ်တယ်} \mid \text{စာအုပ်}) = 0.15$ (high bigram probability)
  • Result: Flag ဖတ်တတ် as likely typo of ဖတ်တယ်
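
The neighbor search can be sketched as follows. The vocabulary, the probability table, and the threshold value are assumptions for illustration. Edit distance is measured over Unicode code points here, while a real implementation may compare Myanmar syllables or use SymSpell's precomputed deletes instead.

```python
def edit_distance(a, b):
    """Levenshtein distance over Unicode code points (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical vocabulary and bigram table using the example's figures.
VOCABULARY = ["ဖတ်တယ်", "စာအုပ်", "သွားတယ်"]
BIGRAM_PROB = {("စာအုပ်", "ဖတ်တယ်"): 0.15}
BIGRAM_THRESHOLD = 0.01  # assumed value


def likely_typo_of(prev, word, threshold=BIGRAM_THRESHOLD):
    """Return the distance-1 neighbor with P(neighbor | prev) > threshold * 10, if any."""
    best, best_p = None, threshold * 10
    for candidate in VOCABULARY:
        if edit_distance(word, candidate) != 1:
            continue
        p = BIGRAM_PROB.get((prev, candidate), 0.0)
        if p > best_p:
            best, best_p = candidate, p
    return best


print(likely_typo_of("စာအုပ်", "ဖတ်တတ်"))
```

The factor of 10 means a neighbor must fit the context much better than the minimum bar before the original word is treated as a typo.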

Tone Disambiguation

In the Myanmar language, tone marks (◌့ and ◌း) drastically change the meaning of a word. Many spelling errors involve missing or incorrect tone marks (e.g., ငါ vs ငါး). The ToneDisambiguator module uses a specialized context window to resolve these ambiguities.

How it Works

It maintains a list of Ambiguous Groups (e.g., the “Three Tones of Ka”). When it encounters a word from such a group, it checks the surrounding ±3 words against a set of context patterns.
Example 1: သံ (Sound/Iron) vs သုံး (Three)
  • Input: သံ ယောက် (Iron person?)
  • Context: ယောက် (classifier for people) follows the word.
  • Pattern Match: The pattern ("ယောက်", "ခု", "လုံး") is associated with the number သုံး (Three).
  • Correction: သုံး ယောက် (Three people).
Example 2: ငါ (I/me) vs ငါး (Fish/Five)
  • Input: ငါ ကောင် (I animal?)
  • Context: ကောင် (classifier for animals) follows the word.
  • Pattern Match: Classifiers for counting animals/fish follow numbers.
  • Correction: ငါး ကောင် (Five animals/fish).
Example 3: စ (Beginning) vs စ့ (Pierce) vs စာ (Letter)
  • Input: အစာ စား vs အစ စား
  • Context: စား (eat) follows — eating requires food (အစာ)
  • Correction: အစာ စား (Eat food) — not အစ စား (Eat beginning)
Example 4: ကြ (Plural marker) vs ကြီး (Big)
  • Input: သူတို့ သွားကြီး (They go big?)
  • Context: သူတို့ (they) is a plural pronoun, expects plural verb marker
  • Correction: သူတို့ သွားကြ (They go) — plural marker ကြ after verb
This system operates alongside the N-gram checker but provides higher confidence for specific, well-known ambiguity patterns.
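
The context-window lookup can be sketched as below. This is not the ToneDisambiguator's real interface: the `AMBIGUOUS_GROUPS` structure and the `disambiguate` function are hypothetical, and only the cue words from Examples 1 and 2 are included.

```python
# Hypothetical ambiguous groups: word -> list of (replacement, cue words).
AMBIGUOUS_GROUPS = {
    "သံ": [("သုံး", {"ယောက်", "ခု", "လုံး"})],  # classifiers suggest the number three
    "ငါ": [("ငါး", {"ကောင်"})],                 # animal classifier suggests five/fish
}

WINDOW = 3  # look at up to 3 words on each side


def disambiguate(words):
    """Replace an ambiguous word when a cue word appears within the context window."""
    words = list(words)
    result = list(words)
    for i, word in enumerate(words):
        for replacement, cues in AMBIGUOUS_GROUPS.get(word, []):
            window = words[max(0, i - WINDOW):i] + words[i + 1:i + 1 + WINDOW]
            if cues.intersection(window):
                result[i] = replacement
                break
    return result


print(disambiguate(["သံ", "ယောက်"]))
print(disambiguate(["ငါ", "ကောင်"]))
```

This sketch treats any cue inside the window as a match. A stricter version would also require the cue to sit in a particular position, such as directly after the ambiguous word.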

Common Myanmar Grammar Errors Detected

| Error Type | Example (Incorrect) | Correction | Rule Applied |
|---|---|---|---|
| Homophone confusion | စာ ချင်တယ် | စား ချင်တယ် | Context: verb pattern |
| Missing particle | ကျောင်း သွား | ကျောင်းကို သွား | Verb requires object marker |
| Wrong particle | ရုံးမှာ လာတယ် | ရုံးမှ လာတယ် | Motion verb needs မှ (from) |
| Tone mark error | သုံ ယောက် | သုံး ယောက် | Classifier context |
| Plural marker | သူတို့ သွားတယ် | သူတို့ သွားကြတယ် | Plural subject agreement |
| Nominalizer | သွားကျောင်း | သွားကြောင်း | Verb nominalization |