The diagrams below show how validation, data, utility, and algorithm layers interact within the system.

High-Level Architecture

Component Diagram

Validation Layer Components

• SyllableValidator
  • SyllableRuleValidator: structure rules, medial order, vowel compatibility
  • Dictionary Lookup: syllable exists?, frequency lookup
• WordValidator
  • Dictionary Lookup: word exists?, get frequency, get POS
  • SymSpell Algorithm: generate deletes, find suggestions, rank by distance
• ContextValidator (Strategy-based)
  • SyntacticValidationStrategy (Layer 2.5)
    • POS Tagger: Viterbi HMM, Transformer, or rule-based
    • SyntacticRuleChecker: particle rules, sequence rules, linguistic rules
    • N-gram Checker: bigram probs, trigram probs, smoothing
  • Semantic Checker (Optional): ONNX model, embedding lookup
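The strategy-based ContextValidator can be pictured as a thin dispatcher over pluggable strategies. The sketch below is illustrative only: the `validate` method name, constructor signature, and placeholder bodies are assumptions, not the real API.

```python
class SyntacticValidationStrategy:
    """Layer 2.5 strategy: in the real system this runs the POS tagger,
    syntactic rule checks, and n-gram scoring. Placeholder body here."""

    def validate(self, tokens):
        return []  # no issues found in this sketch


class ContextValidator:
    """Delegates context validation to whichever strategies are plugged in."""

    def __init__(self, strategies):
        self._strategies = list(strategies)

    def validate(self, tokens):
        issues = []
        for strategy in self._strategies:  # each strategy contributes issues
            issues.extend(strategy.validate(tokens))
        return issues
```

The point of the strategy split is that the syntactic layer (and the optional semantic checker) can be enabled, disabled, or swapped without touching the validator itself.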

SpellChecker Mixin Architecture

SpellChecker uses a mixin-based decomposition to organize detection and suggestion logic into focused modules while preserving a single public API surface:
SpellChecker (core/spellchecker.py) composes:
• PreNormalizationDetectorsMixin: 11 pre-normalization detectors, run before text normalization
• PostNormalizationDetectorsMixin: 38 ordered detectors (via detection_registry.py) covering particle confusion, medial confusion, compound typos, etc.
• SentenceDetectorsMixin: 10 sentence-level detectors for register mixing, tense mismatch, and structure issues
• SuggestionPipelineMixin: 24 suggestion/reranking methods with unified ranking across sources
• ErrorSuppressionMixin: 21 suppression/dedup/merge methods that prevent duplicate errors at the same position
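The composition can be sketched as follows. Only the mixin class names come from the listing above; the method names and bodies are illustrative placeholders, not the real implementation.

```python
class PreNormalizationDetectorsMixin:
    def _detect_pre_norm(self, text):
        return []  # stand-in for the 11 pre-normalization detectors


class PostNormalizationDetectorsMixin:
    def _detect_post_norm(self, text):
        return []  # stand-in for the 38 registry-ordered detectors


class SuggestionPipelineMixin:
    def _rank_suggestions(self, candidates):
        return sorted(candidates)  # stand-in for unified ranking


class SpellChecker(
    PreNormalizationDetectorsMixin,
    PostNormalizationDetectorsMixin,
    SuggestionPipelineMixin,
):
    """Single public API surface; the logic lives in the mixins."""

    def check(self, text):
        errors = self._detect_pre_norm(text)
        errors += self._detect_post_norm(text)
        return errors
```

Because each mixin only adds methods (no state of its own), the MRO stays trivial and the public surface remains exactly one class.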

Detection Registry

The post-normalization detection pipeline is controlled by an ordered registry in core/detection_registry.py. Each entry maps to a _detect_* method inherited from detector mixins:
POST_NORM_DETECTOR_SEQUENCE = (
    # Stacking and structural errors (run first)
    "_detect_broken_stacking",              # Asat→virama in Pali words
    "_detect_missing_stacking",             # Missing Pali/Sanskrit virama stacking
    "_detect_missing_asat",                 # Missing asat on normalized text
    "_detect_missing_visarga_suffix",       # Missing visarga in clause-linker suffixes
    "_detect_missing_visarga_in_compound",  # Missing visarga inside compound words

    # Medial and particle confusion
    "_detect_medial_confusion",             # Medial ya-pin/ya-yit confusion
    "_detect_colloquial_contractions",      # Colloquial contraction detection
    "_detect_particle_confusion",           # Particle confusion (ကိ/ကု → ကို)
    "_detect_compound_confusion_typos",     # Compound confusion (ha-htoe + aspirated)
    "_detect_suffix_confusion_typos",       # Suffix confusion on invalid compounds

    # Token repair and frequency-based correction
    "_detect_invalid_token_with_strong_candidates",   # Invalid token repair via strong DB candidates
    "_detect_frequency_dominant_valid_variants",       # Valid-token variant correction via frequency + semantic
    "_detect_broken_compound_morpheme",     # Broken compound morpheme (ed-1 variant)
    "_detect_missegmented_confusable",      # Confusable errors hidden by segmentation

    # Particle and diacritic errors
    "_detect_ha_htoe_particle_typos",       # Ha-htoe particle confusion (မာ → မှာ)
    "_detect_aukmyit_confusion",            # Aukmyit confusion (ထည် → ထည့်)
    "_detect_extra_aukmyit_confusion",      # Extra aukmyit (ပြော့ → ပြော)
    "_detect_sequential_particle_confusion",# Sequential particle (တော် → တော့)
    "_detect_particle_misuse",              # Particle misuse via verb-frame (ကို → မှ/မှာ/တွင်)

    # Context-aware detectors
    "_detect_homophone_left_context",       # Homophone left-context (ဖက် → ဖတ်)
    "_detect_collocation_errors",           # Collocation error (wrong word partner)
    "_detect_semantic_agent_implausibility",# Non-human subject implausibility
    "_detect_merged_classifier_mismatch",   # Merged NUM+classifier mismatch

    # Sentence-level detectors
    "_detect_dangling_particles",           # Dangling sentence-end particles
    "_detect_sentence_structure_issues",    # Dangling word, missing conjunction
    "_detect_tense_mismatch",              # Temporal adverb vs particle mismatch
    "_detect_formal_yi_in_colloquial_context", # Verb+၏ in colloquial context
    "_detect_negation_sfp_mismatch",        # Negation pattern mismatch
    "_detect_merged_sfp_conjunction",       # Merged SFP + conjunction
    "_detect_missing_visarga",             # Missing visarga (း) via frequency ratio

    # Register and style
    "_detect_register_mixing",              # Formal/colloquial register mixing
    "_detect_informal_with_honorific",      # Informal particle + honorific
    "_detect_informal_h_after_completive",  # Terse ဟ after completive

    # Post-processing detectors
    "_detect_vowel_after_asat",             # Vowel after asat (ကျွန်ုတော် → ကျွန်တော်)
    "_detect_missing_diacritic_in_compound",# Missing anusvara/dot-below
    "_detect_unknown_compound_segments",    # Unknown freq=0 compound segments
    "_detect_broken_compound_space",        # Space inside compound word
    "_detect_punctuation_errors",           # Punctuation errors (lowest priority)
)
Ordering is intentional. For example, broken_stacking must run before colloquial_contractions to prevent stacking errors from being claimed as colloquial variants.
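One plausible way such a registry is consumed is a simple `getattr` dispatch loop; the loop below is an assumption (only the tuple name comes from the listing), and the two-entry tuple is a shortened stand-in for the full sequence above.

```python
# Shortened stand-in for the full registry tuple shown above.
POST_NORM_DETECTOR_SEQUENCE = (
    "_detect_broken_stacking",
    "_detect_colloquial_contractions",
)


class Detectors:
    """Toy detector host; real detectors are inherited from the mixins."""

    def _detect_broken_stacking(self, text):
        return ["stacking-error"] if "x" in text else []

    def _detect_colloquial_contractions(self, text):
        return []


def run_detectors(obj, text):
    errors = []
    for name in POST_NORM_DETECTOR_SEQUENCE:  # tuple order = run order
        errors.extend(getattr(obj, name)(text))
    return errors
```

Keeping the order in one flat tuple makes the priority rule above auditable: earlier entries claim errors before later ones see the text.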

Data Layer Components

DictionaryProvider defines the abstract interface:
  • is_valid_syllable(syllable) → bool
  • is_valid_word(word) → bool
  • get_word_frequency(word) → int
  • get_bigram_probability(prev, curr) → float
DictionaryProvider (Abstract)
SQLiteProvider (disk-based, indexed, default)
MemoryProvider (RAM-based, fast, high mem)
JSONProvider (testing, simple)
CSVProvider (testing, simple)
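The interface and one concrete provider might look like the sketch below. The four method signatures match the list above; everything else (constructor, storage format, the zero bigram fallback) is an illustrative assumption.

```python
from abc import ABC, abstractmethod


class DictionaryProvider(ABC):
    """Abstract interface; backed by SQLite, RAM, JSON, or CSV in practice."""

    @abstractmethod
    def is_valid_syllable(self, syllable: str) -> bool: ...

    @abstractmethod
    def is_valid_word(self, word: str) -> bool: ...

    @abstractmethod
    def get_word_frequency(self, word: str) -> int: ...

    @abstractmethod
    def get_bigram_probability(self, prev: str, curr: str) -> float: ...


class MemoryProvider(DictionaryProvider):
    """Toy RAM-based provider: a dict of word -> frequency."""

    def __init__(self, words):
        self._words = dict(words)

    def is_valid_syllable(self, syllable):
        return syllable in self._words  # simplification for the sketch

    def is_valid_word(self, word):
        return word in self._words

    def get_word_frequency(self, word):
        return self._words.get(word, 0)

    def get_bigram_probability(self, prev, curr):
        return 0.0  # placeholder; real provider stores n-gram tables
```

Swapping SQLiteProvider for MemoryProvider then trades disk-backed indexes for speed at the cost of RAM, without changing any caller.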

Algorithm Components

  Algorithm Layer
  ===============

  SymSpell                        N-gram Model
  +--------------------------+    +--------------------------+
  | Input:  misspelled word  |    | Input:  word sequence    |
  | Output: suggestions      |    | Output: probability      |
  |                          |    |                          |
  | • Delete Dict            |    | • Bigram Probs           |
  |   (word -> deletes)      |    |   P(word2 | word1)       |
  | • Prefix Index           |    | • Trigram Probs          |
  |   (fast lookup)          |    |   P(word3 | word1,word2) |
  |                          |    |                          |
  | Complexity: O(1)         |    | Complexity: O(1)         |
  +--------------------------+    +--------------------------+

  Viterbi POS                     Edit Distance (Cython)
  +--------------------------+    +--------------------------+
  | Input:  word sequence    |    | • Levenshtein            |
  | Output: POS tags         |    | • Damerau-Levenshtein    |
  |                          |    | • Optimized C            |
  | • Transition Probs       |    +--------------------------+
  |   P(tag | prev_tag)      |
  | • Emission Probs         |    Semantic Model (ONNX)
  |   P(word | tag)          |    +--------------------------+
  |                          |    | • Word embeddings        |
  | Complexity: O(nT^2)      |    | • Cosine similarity      |
  +--------------------------+    | • Neural network         |
                                  +--------------------------+
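As a reference for what the Cython edit-distance module computes, here is a plain-Python Damerau-Levenshtein (restricted/OSA variant: insertions, deletions, substitutions, adjacent transpositions). This is a readable sketch, not the optimized C implementation.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance via dynamic programming."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all of a's prefix
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all of b's prefix
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

Counting an adjacent swap as one edit matters for spell checking, since transposed characters are among the most common typing errors.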

Data Pipeline Components

  +------------------+     +---------------------+     +------------------+     +------------------+
  | CorpusIngester   | --> | CorpusSegmenter     | --> | FrequencyBuilder | --> | DatabasePackager |
  |                  |     | (Cython)            |     |                  |     |                  |
  | • Read files     |     | • Normalize         |     | • Count tokens   |     | • Create SQLite  |
  | • Parse formats  |     | • Segment           |     | • N-gram stats   |     | • Build indexes  |
  | • Validate       |     | • Parallel          |     | • Build tables   |     | • Optimize       |
  | • Stream         |     |                     |     |                  |     |                  |
  +------------------+     +---------------------+     +------------------+     +------------------+
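The four stages above can be wired as a generator pipeline. The stage internals below are placeholders (whitespace splitting stands in for the real Cython normalizer/segmenter); only the stage order comes from the diagram.

```python
def ingest(lines):
    """Stand-in for CorpusIngester: stream raw lines from the corpus."""
    yield from lines


def segment(lines):
    """Stand-in for CorpusSegmenter: normalize and split into tokens."""
    for line in lines:
        yield line.split()


def build_frequencies(token_lists):
    """Stand-in for FrequencyBuilder: count token occurrences."""
    freqs = {}
    for tokens in token_lists:
        for t in tokens:
            freqs[t] = freqs.get(t, 0) + 1
    return freqs  # a DatabasePackager would then persist this to SQLite
```

Streaming through generators keeps memory flat even on large corpora, which is presumably why the ingester streams rather than loading files whole.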

Component Interactions

Check Operation Flow

See Data Flow for detailed check operation flow.

Suggestion Generation Flow

  +------------------+
  | Unknown word     |
  +--------+---------+
           |
           v
  +----------------------------------+
  | SymSpell                         |
  |                                  |
  | 1. Generate deletes from input   |
  |            |                     |
  |            v                     |
  | 2. Look up each delete in        |
  |    pre-computed dictionary       |
  |            |                     |
  |            v                     |
  | 3. Find candidate words within   |
  |    edit distance                 |
  |            |                     |
  |            v                     |
  | 4. Rank by (edit_distance,       |
  |    frequency)                    |
  +------------+---------------------+
               |
               v
  +----------------------------------+
  | Suggestions [word1, word2, ...]  |
  +----------------------------------+
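The four steps in the box above can be sketched end to end. This is a minimal symmetric-delete implementation for max edit distance 1; the function names, the plain-Levenshtein ranking helper, and the data shapes are illustrative, not the project's actual API.

```python
from itertools import combinations


def deletes(word, max_ed=1):
    """Step 1: every string reachable by deleting up to max_ed characters."""
    out = {word}
    for k in range(1, min(max_ed, len(word)) + 1):
        for idx in combinations(range(len(word)), k):
            out.add("".join(c for i, c in enumerate(word) if i not in idx))
    return out


def build_delete_index(words):
    """Offline: map each delete back to the dictionary words producing it."""
    index = {}
    for w in words:
        for d in deletes(w):
            index.setdefault(d, set()).add(w)
    return index


def levenshtein(a, b):
    """Plain edit distance, used only for the final ranking step."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def suggest(word, index, freqs):
    """Steps 2-4: look up each delete, collect candidates, rank."""
    candidates = set()
    for d in deletes(word):
        candidates |= index.get(d, set())
    return sorted(candidates, key=lambda c: (levenshtein(word, c), -freqs.get(c, 0)))
```

Because the delete dictionary is precomputed, the online cost per query is a handful of hash lookups, which is what makes the O(1)-per-lookup claim in the algorithm diagram possible.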

Dependency Graph

SpellChecker
  ├── SyllableValidator → DictionaryProvider
  ├── WordValidator → DictionaryProvider
  ├── SymSpell → DictionaryProvider
  └── ContextValidator → DictionaryProvider

DictionaryProvider implementations:
  ├── SQLiteProvider (default)
  └── MemoryProvider (alternative)
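The fan-in on DictionaryProvider suggests constructor injection of a single shared provider. The sketch below assumes that wiring; the constructor signatures and the stub provider are hypothetical, not the real API.

```python
class SQLiteProvider:
    """Stub standing in for the disk-based, indexed default provider."""

    def is_valid_word(self, word):
        return True  # stand-in for an indexed SQLite lookup


class WordValidator:
    def __init__(self, provider):
        self.provider = provider


class SymSpell:
    def __init__(self, provider):
        self.provider = provider


class SpellChecker:
    def __init__(self, provider=None):
        provider = provider or SQLiteProvider()  # SQLite is the default
        # One provider instance shared by every dependent component:
        self.word_validator = WordValidator(provider)
        self.symspell = SymSpell(provider)
```

Sharing one provider instance means one connection pool and one cache, and swapping in MemoryProvider for tests touches a single constructor argument.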

See Also