The diagrams below show how validation, data, utility, and algorithm layers interact within the system.

High-Level Architecture

Component Diagram

Validation Layer Components

• SyllableValidator
  • SyllableRuleValidator: structure rules, medial order, vowel compatibility
  • Dictionary Lookup: syllable exists?, frequency lookup
• WordValidator
  • Dictionary Lookup: word exists?, get frequency, get POS
  • SymSpell Algorithm: generate deletes, find suggestions, rank by distance
• ContextValidator (Strategy-based)
  • SyntacticValidationStrategy (Layer 2.5)
    • POS Tagger: Viterbi HMM, Transformer, or rule-based
    • SyntacticRuleChecker: particle rules, sequence rules, linguistic rules
    • N-gram Checker: bigram probs, trigram probs, smoothing
  • Semantic Checker (Optional): ONNX model, embedding lookup
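The strategy-based ContextValidator can be pictured as a thin dispatcher over pluggable strategies. The sketch below is illustrative only: the `validate` method name, constructor signature, and placeholder bodies are assumptions, not the real API.

```python
class SyntacticValidationStrategy:
    """Layer 2.5 strategy: in the real system this runs the POS tagger,
    syntactic rule checks, and n-gram scoring. Placeholder body here."""

    def validate(self, tokens):
        return []  # no issues found in this sketch


class ContextValidator:
    """Delegates context validation to whichever strategies are plugged in."""

    def __init__(self, strategies):
        self._strategies = list(strategies)

    def validate(self, tokens):
        issues = []
        for strategy in self._strategies:  # each strategy contributes issues
            issues.extend(strategy.validate(tokens))
        return issues
```

The point of the strategy split is that the syntactic layer (and the optional semantic checker) can be enabled, disabled, or swapped without touching the validator itself.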

SpellChecker Mixin Architecture

SpellChecker uses a mixin-based decomposition to organize detection and suggestion logic into focused modules while preserving a single public API surface:
SpellChecker (core/spellchecker.py) composes:
• PreNormalizationDetectorsMixin: 11 pre-normalization detectors, run before text normalization
• PostNormalizationDetectorsMixin: 38 ordered detectors (via detection_registry.py) covering particle confusion, medial confusion, compound typos, etc.
• SentenceDetectorsMixin: 10 sentence-level detectors for register mixing, tense mismatch, and structure issues
• SuggestionPipelineMixin: 24 suggestion/reranking methods with unified ranking across sources
• ErrorSuppressionMixin: 21 suppression/dedup/merge methods that prevent duplicate errors at the same position
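The composition can be sketched as follows. Only the mixin class names come from the listing above; the method names and bodies are illustrative placeholders, not the real implementation.

```python
class PreNormalizationDetectorsMixin:
    def _detect_pre_norm(self, text):
        return []  # stand-in for the 11 pre-normalization detectors


class PostNormalizationDetectorsMixin:
    def _detect_post_norm(self, text):
        return []  # stand-in for the 38 registry-ordered detectors


class SuggestionPipelineMixin:
    def _rank_suggestions(self, candidates):
        return sorted(candidates)  # stand-in for unified ranking


class SpellChecker(
    PreNormalizationDetectorsMixin,
    PostNormalizationDetectorsMixin,
    SuggestionPipelineMixin,
):
    """Single public API surface; the logic lives in the mixins."""

    def check(self, text):
        errors = self._detect_pre_norm(text)
        errors += self._detect_post_norm(text)
        return errors
```

Because each mixin only adds methods (no state of its own), the MRO stays trivial and the public surface remains exactly one class.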

Detection Registry

The post-normalization detection pipeline is controlled by an ordered registry in core/detection_registry.py. Each entry maps to a _detect_* method inherited from detector mixins:
POST_NORM_DETECTOR_SEQUENCE = (
    # Stacking and structural errors (run first)
    "_detect_broken_stacking",              # Asat→virama in Pali words
    "_detect_missing_stacking",             # Missing Pali/Sanskrit virama stacking
    "_detect_missing_asat",                 # Missing asat on normalized text
    "_detect_missing_visarga_suffix",       # Missing visarga in clause-linker suffixes
    "_detect_missing_visarga_in_compound",  # Missing visarga inside compound words

    # Medial and particle confusion
    "_detect_medial_confusion",             # Medial ya-pin/ya-yit confusion
    "_detect_colloquial_contractions",      # Colloquial contraction detection
    "_detect_particle_confusion",           # Particle confusion (ကိ/ကု → ကို)
    "_detect_compound_confusion_typos",     # Compound confusion (ha-htoe + aspirated)
    "_detect_suffix_confusion_typos",       # Suffix confusion on invalid compounds

    # Token repair and frequency-based correction
    "_detect_invalid_token_with_strong_candidates",   # Invalid token repair via strong DB candidates
    "_detect_frequency_dominant_valid_variants",       # Valid-token variant correction via frequency + semantic
    "_detect_broken_compound_morpheme",     # Broken compound morpheme (ed-1 variant)
    "_detect_missegmented_confusable",      # Confusable errors hidden by segmentation

    # Particle and diacritic errors
    "_detect_ha_htoe_particle_typos",       # Ha-htoe particle confusion (မာ → မှာ)
    "_detect_aukmyit_confusion",            # Aukmyit confusion (ထည် → ထည့်)
    "_detect_extra_aukmyit_confusion",      # Extra aukmyit (ပြော့ → ပြော)
    "_detect_sequential_particle_confusion",# Sequential particle (တော် → တော့)
    "_detect_particle_misuse",              # Particle misuse via verb-frame (ကို → မှ/မှာ/တွင်)

    # Context-aware detectors
    "_detect_homophone_left_context",       # Homophone left-context (ဖက် → ဖတ်)
    "_detect_collocation_errors",           # Collocation error (wrong word partner)
    "_detect_semantic_agent_implausibility",# Non-human subject implausibility
    "_detect_merged_classifier_mismatch",   # Merged NUM+classifier mismatch

    # Sentence-level detectors
    "_detect_dangling_particles",           # Dangling sentence-end particles
    "_detect_sentence_structure_issues",    # Dangling word, missing conjunction
    "_detect_tense_mismatch",              # Temporal adverb vs particle mismatch
    "_detect_formal_yi_in_colloquial_context", # Verb+၏ in colloquial context
    "_detect_negation_sfp_mismatch",        # Negation pattern mismatch
    "_detect_merged_sfp_conjunction",       # Merged SFP + conjunction
    "_detect_missing_visarga",             # Missing visarga (း) via frequency ratio

    # Register and style
    "_detect_register_mixing",              # Formal/colloquial register mixing
    "_detect_informal_with_honorific",      # Informal particle + honorific
    "_detect_informal_h_after_completive",  # Terse ဟ after completive

    # Post-processing detectors
    "_detect_vowel_after_asat",             # Vowel after asat (ကျွန်ုတော် → ကျွန်တော်)
    "_detect_missing_diacritic_in_compound",# Missing anusvara/dot-below
    "_detect_unknown_compound_segments",    # Unknown freq=0 compound segments
    "_detect_broken_compound_space",        # Space inside compound word
    "_detect_punctuation_errors",           # Punctuation errors (lowest priority)
)
Ordering is intentional. For example, broken_stacking must run before colloquial_contractions to prevent stacking errors from being claimed as colloquial variants.
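One plausible way such a registry is consumed is a simple `getattr` dispatch loop; the loop below is an assumption (only the tuple name comes from the listing), and the two-entry tuple is a shortened stand-in for the full sequence above.

```python
# Shortened stand-in for the full registry tuple shown above.
POST_NORM_DETECTOR_SEQUENCE = (
    "_detect_broken_stacking",
    "_detect_colloquial_contractions",
)


class Detectors:
    """Toy detector host; real detectors are inherited from the mixins."""

    def _detect_broken_stacking(self, text):
        return ["stacking-error"] if "x" in text else []

    def _detect_colloquial_contractions(self, text):
        return []


def run_detectors(obj, text):
    errors = []
    for name in POST_NORM_DETECTOR_SEQUENCE:  # tuple order = run order
        errors.extend(getattr(obj, name)(text))
    return errors
```

Keeping the order in one flat tuple makes the priority rule above auditable: earlier entries claim errors before later ones see the text.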

Data Layer Components

DictionaryProvider defines the abstract interface:
  • is_valid_syllable(syllable) → bool
  • is_valid_word(word) → bool
  • get_word_frequency(word) → int
  • get_bigram_probability(prev, curr) → float
DictionaryProvider (Abstract)
SQLiteProvider (disk-based, indexed, default)
MemoryProvider (RAM-based, fast, high mem)
JSONProvider (testing, simple)
CSVProvider (testing, simple)
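The interface and one concrete provider might look like the sketch below. The four method signatures match the list above; everything else (constructor, storage format, the zero bigram fallback) is an illustrative assumption.

```python
from abc import ABC, abstractmethod


class DictionaryProvider(ABC):
    """Abstract interface; backed by SQLite, RAM, JSON, or CSV in practice."""

    @abstractmethod
    def is_valid_syllable(self, syllable: str) -> bool: ...

    @abstractmethod
    def is_valid_word(self, word: str) -> bool: ...

    @abstractmethod
    def get_word_frequency(self, word: str) -> int: ...

    @abstractmethod
    def get_bigram_probability(self, prev: str, curr: str) -> float: ...


class MemoryProvider(DictionaryProvider):
    """Toy RAM-based provider: a dict of word -> frequency."""

    def __init__(self, words):
        self._words = dict(words)

    def is_valid_syllable(self, syllable):
        return syllable in self._words  # simplification for the sketch

    def is_valid_word(self, word):
        return word in self._words

    def get_word_frequency(self, word):
        return self._words.get(word, 0)

    def get_bigram_probability(self, prev, curr):
        return 0.0  # placeholder; real provider stores n-gram tables
```

Swapping SQLiteProvider for MemoryProvider then trades disk-backed indexes for speed at the cost of RAM, without changing any caller.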

Algorithm Components

  Algorithm Layer
  ===============

  SymSpell                        N-gram Model
  +--------------------------+    +--------------------------+
  | Input:  misspelled word  |    | Input:  word sequence    |
  | Output: suggestions      |    | Output: probability      |
  |                          |    |                          |
  | • Delete Dict            |    | • Bigram Probs           |
  |   (word -> deletes)      |    |   P(word2 | word1)       |
  | • Prefix Index           |    | • Trigram Probs          |
  |   (fast lookup)          |    |   P(word3 | word1,word2) |
  |                          |    |                          |
  | Complexity: O(1)         |    | Complexity: O(1)         |
  +--------------------------+    +--------------------------+

  Viterbi POS                     Edit Distance (Cython)
  +--------------------------+    +--------------------------+
  | Input:  word sequence    |    | • Levenshtein            |
  | Output: POS tags         |    | • Damerau-Levenshtein    |
  |                          |    | • Optimized C            |
  | • Transition Probs       |    +--------------------------+
  |   P(tag | prev_tag)      |
  | • Emission Probs         |    Semantic Model (ONNX)
  |   P(word | tag)          |    +--------------------------+
  |                          |    | • Word embeddings        |
  | Complexity: O(nT^2)      |    | • Cosine similarity      |
  +--------------------------+    | • Neural network         |
                                  +--------------------------+
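As a reference for what the Cython edit-distance module computes, here is a plain-Python Damerau-Levenshtein (restricted/OSA variant: insertions, deletions, substitutions, adjacent transpositions). This is a readable sketch, not the optimized C implementation.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance via dynamic programming."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all of a's prefix
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all of b's prefix
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

Counting an adjacent swap as one edit matters for spell checking, since transposed characters are among the most common typing errors.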

Data Pipeline Components

  +------------------+     +---------------------+     +------------------+     +------------------+
  | CorpusIngester   | --> | CorpusSegmenter     | --> | FrequencyBuilder | --> | DatabasePackager |
  |                  |     | (Cython)            |     |                  |     |                  |
  | • Read files     |     | • Normalize         |     | • Count tokens   |     | • Create SQLite  |
  | • Parse formats  |     | • Segment           |     | • N-gram stats   |     | • Build indexes  |
  | • Validate       |     | • Parallel          |     | • Build tables   |     | • Optimize       |
  | • Stream         |     |                     |     |                  |     |                  |
  +------------------+     +---------------------+     +------------------+     +------------------+
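The four stages above can be wired as a generator pipeline. The stage internals below are placeholders (whitespace splitting stands in for the real Cython normalizer/segmenter); only the stage order comes from the diagram.

```python
def ingest(lines):
    """Stand-in for CorpusIngester: stream raw lines from the corpus."""
    yield from lines


def segment(lines):
    """Stand-in for CorpusSegmenter: normalize and split into tokens."""
    for line in lines:
        yield line.split()


def build_frequencies(token_lists):
    """Stand-in for FrequencyBuilder: count token occurrences."""
    freqs = {}
    for tokens in token_lists:
        for t in tokens:
            freqs[t] = freqs.get(t, 0) + 1
    return freqs  # a DatabasePackager would then persist this to SQLite
```

Streaming through generators keeps memory flat even on large corpora, which is presumably why the ingester streams rather than loading files whole.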

Component Interactions

Check Operation Flow

See Data Flow for detailed check operation flow.

Suggestion Generation Flow

  +------------------+
  | Unknown word     |
  +--------+---------+
           |
           v
  +----------------------------------+
  | SymSpell                         |
  |                                  |
  | 1. Generate deletes from input   |
  |            |                     |
  |            v                     |
  | 2. Look up each delete in        |
  |    pre-computed dictionary       |
  |            |                     |
  |            v                     |
  | 3. Find candidate words within   |
  |    edit distance                 |
  |            |                     |
  |            v                     |
  | 4. Rank by (edit_distance,       |
  |    frequency)                    |
  +------------+---------------------+
               |
               v
  +----------------------------------+
  | Suggestions [word1, word2, ...]  |
  +----------------------------------+
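The four steps in the box above can be sketched end to end. This is a minimal symmetric-delete implementation for max edit distance 1; the function names, the plain-Levenshtein ranking helper, and the data shapes are illustrative, not the project's actual API.

```python
from itertools import combinations


def deletes(word, max_ed=1):
    """Step 1: every string reachable by deleting up to max_ed characters."""
    out = {word}
    for k in range(1, min(max_ed, len(word)) + 1):
        for idx in combinations(range(len(word)), k):
            out.add("".join(c for i, c in enumerate(word) if i not in idx))
    return out


def build_delete_index(words):
    """Offline: map each delete back to the dictionary words producing it."""
    index = {}
    for w in words:
        for d in deletes(w):
            index.setdefault(d, set()).add(w)
    return index


def levenshtein(a, b):
    """Plain edit distance, used only for the final ranking step."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def suggest(word, index, freqs):
    """Steps 2-4: look up each delete, collect candidates, rank."""
    candidates = set()
    for d in deletes(word):
        candidates |= index.get(d, set())
    return sorted(candidates, key=lambda c: (levenshtein(word, c), -freqs.get(c, 0)))
```

Because the delete dictionary is precomputed, the online cost per query is a handful of hash lookups, which is what makes the O(1)-per-lookup claim in the algorithm diagram possible.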

Dependency Graph

SpellChecker
  ├── SyllableValidator → DictionaryProvider
  ├── WordValidator → DictionaryProvider
  ├── SymSpell → DictionaryProvider
  └── ContextValidator → DictionaryProvider

DictionaryProvider implementations:
  ├── SQLiteProvider (default)
  └── MemoryProvider (alternative)
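The fan-in on DictionaryProvider suggests constructor injection of a single shared provider. The sketch below assumes that wiring; the constructor signatures and the stub provider are hypothetical, not the real API.

```python
class SQLiteProvider:
    """Stub standing in for the disk-based, indexed default provider."""

    def is_valid_word(self, word):
        return True  # stand-in for an indexed SQLite lookup


class WordValidator:
    def __init__(self, provider):
        self.provider = provider


class SymSpell:
    def __init__(self, provider):
        self.provider = provider


class SpellChecker:
    def __init__(self, provider=None):
        provider = provider or SQLiteProvider()  # SQLite is the default
        # One provider instance shared by every dependent component:
        self.word_validator = WordValidator(provider)
        self.symspell = SymSpell(provider)
```

Sharing one provider instance means one connection pool and one cache, and swapping in MemoryProvider for tests touches a single constructor argument.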

See Also