Pipeline Overview
Pre-Processing
Text Normalization
Before validation, text is normalized:
- Zero-width removal: Remove invisible characters
- Unicode normalization: NFC form for consistent comparison
- Zawgyi handling: Detect and optionally convert legacy encoding
- Whitespace normalization: Consistent spacing
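The first, second, and fourth steps above can be sketched in a few lines (a minimal illustration; the function and constant names are assumptions, not the project's actual API). Zawgyi detection is omitted here because it requires a trained detector such as Google's myanmar-tools:

```python
import re
import unicodedata

# ZWSP, ZWNJ, ZWJ, and BOM are invisible and break string comparison
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\ufeff]")

def normalize(text: str) -> str:
    text = ZERO_WIDTH.sub("", text)            # zero-width removal
    text = unicodedata.normalize("NFC", text)  # canonical composition (NFC)
    text = re.sub(r"\s+", " ", text).strip()   # whitespace normalization
    return text
```

NFC matters because a base letter plus a combining mark and its precomposed form must compare equal before dictionary lookup.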
Syllable Segmentation
Text is broken into syllables using Myanmar orthographic rules:
- Consonant boundaries: Detect syllable starts
- Combining character rules: Group marks correctly
- Stacking rules: Handle complex consonant clusters
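A deliberately simplified sketch of the consonant-boundary rule (illustrative only; the real segmenter also handles kinzi, digits, independent vowels, and other cases): break before a Myanmar consonant unless it is stacked (preceded by U+1039) or is a syllable-final consonant (followed by asat U+103A).

```python
import re

# Zero-width boundary: a consonant (U+1000-U+1021) that is neither
# stacked (after U+1039) nor killed by a following asat (U+103A)
BOUNDARY = re.compile(r"(?<!\u1039)(?=[\u1000-\u1021](?!\u103A))")

def syllables(text: str) -> list[str]:
    return [s for s in BOUNDARY.split(text) if s]
```

For example, this splits မြန်မာ ("Myanmar") into the two syllables မြန် and မာ, because the first န carries an asat and therefore does not start a new syllable.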
Layer 1: Syllable Validation
Purpose
Validate that each syllable is orthographically correct and exists in the dictionary.
Implementation
Rule-Based Validation
Syllable structure rules from SyllableRuleValidator:
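A structure check in this spirit can be sketched as a single pattern over the rough shape onset (+ optional stack) + medials + vowel signs + optional final + tone marks. This is a simplified assumption, not the validator's actual rule set:

```python
import re

# Hypothetical well-formedness pattern (the real SyllableRuleValidator
# applies many more rules and orderings than this sketch):
SYLLABLE = re.compile(
    r"^[\u1000-\u1021]"            # onset consonant
    r"(?:\u1039[\u1000-\u1021])?"  # optional stacked consonant
    r"[\u103B-\u103E]*"            # medials (ya-pin, ya-yit, wa, ha)
    r"[\u102B-\u1032]*"            # vowel signs
    r"(?:[\u1000-\u1021]\u103A)?"  # optional final consonant + asat
    r"[\u1036-\u1038]*$"           # anusvara, aukmyit, visarga
)

def is_well_formed(syllable: str) -> bool:
    return bool(SYLLABLE.match(syllable))
```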
Dictionary Lookup
Error Coverage
- Typos: ~90% caught at this layer
- Invalid characters: 100% caught
- Structural errors: 100% caught
Post-Normalization Detectors
Purpose
Between syllable validation and word validation, 38 ordered detectors run unconditionally on the normalized text. These catch character-level and particle-level errors that require normalized input but don’t depend on word segmentation.
Implementation
The detectors are defined in core/detection_registry.py as an ordered sequence (POST_NORM_DETECTOR_SEQUENCE). Each entry maps to a _detect_* method inherited from detector mixins (PostNormalizationDetectorsMixin, SentenceDetectorsMixin, CollocationDetectionMixin, etc.):
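The registry-plus-dispatch shape could look roughly like this (stub detectors; only _detect_broken_stacking and _detect_colloquial_contractions are named in this document, the middle entry is an illustrative placeholder):

```python
# Ordered registry of detector method names; order encodes priority
POST_NORM_DETECTOR_SEQUENCE = (
    "_detect_broken_stacking",
    "_detect_missing_asat",          # illustrative placeholder name
    "_detect_colloquial_contractions",
)

class Checker:
    # Stub detectors standing in for the mixin-provided methods
    def _detect_broken_stacking(self, text):
        return []

    def _detect_missing_asat(self, text):
        return []

    def _detect_colloquial_contractions(self, text):
        return []

    def run_post_norm_detectors(self, text):
        """Dispatch each registered _detect_* method by name, in order."""
        errors = []
        for name in POST_NORM_DETECTOR_SEQUENCE:
            errors.extend(getattr(self, name)(text))
        return errors
```

Keeping the sequence as data rather than hard-coded calls makes the ordering auditable and easy to reorder.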
Detector Categories
The 38 detectors are grouped by category:

| Category | Detectors | Examples |
|---|---|---|
| Stacking and structural | 5 | Broken stacking, missing asat, missing visarga |
| Medial and particle confusion | 5 | Medial ya-pin/ya-yit, particle confusion, compound confusion |
| Token repair and frequency | 4 | Invalid token repair, frequency-dominant variants, broken compound morpheme |
| Particle and diacritic | 5 | Ha-htoe particle typos, aukmyit confusion, particle misuse |
| Context-aware | 4 | Homophone left-context, collocation errors, semantic agent implausibility |
| Sentence-level | 7 | Dangling particles, tense mismatch, negation mismatch, missing visarga |
| Register and style | 3 | Register mixing, informal with honorific |
| Post-processing | 5 | Vowel after asat, missing diacritic, unknown compound segments, broken compound space, punctuation errors |
Ordering is intentional. For example, _detect_broken_stacking must run before _detect_colloquial_contractions to prevent stacking errors from being claimed as colloquial variants. See the Component Diagram for the full detector registry.
Layer 2: Word Validation
Purpose
Verify that valid syllables form valid words, and provide suggestions for unknown words.
Word Assembly
Valid syllables are assembled into words using a longest-match algorithm. There is no separate WordAssembler class; word assembly is integrated into the segmentation and validation pipeline within the word validation layer:
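A minimal sketch of greedy longest-match assembly over validated syllables (the function name, dictionary shape, and maximum span are assumptions):

```python
def assemble_words(syllables, dictionary, max_span=4):
    """Greedily join syllables into the longest dictionary match."""
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first, shrinking until a match
        for n in range(min(max_span, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if candidate in dictionary or n == 1:
                words.append(candidate)  # lone syllables fall through as-is
                i += n
                break
    return words

# e.g. assemble_words(["a", "b", "c", "d"], {"ab", "abc"}) -> ["abc", "d"]
```

Single syllables that match no dictionary entry are kept as one-syllable tokens so the next step can decide whether they are unknown words.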
Validation Steps
For unknown words, Layer 2 performs multiple checks before generating an error.
Error Coverage
- Unknown words: 100% detected
- Compound errors: ~80% with suggestions
- Near-misses: ~95% with correct suggestion
- Productive compounds: Accepted without error (N+N, V+V, etc.)
- Productive reduplications: Accepted without error (AA, AABB, ABAB)
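The acceptance rules for productive forms listed above could be sketched as a pre-error filter (every name, the part decomposition, and the POS map are illustrative assumptions):

```python
def classify_unknown(parts, dictionary, pos):
    """Decide whether an out-of-dictionary word is really an error.

    `parts` are the word's component sub-words; `pos` maps word -> tag.
    """
    # Productive reduplication (AA pattern): accept without error
    if len(parts) == 2 and parts[0] == parts[1]:
        return "ok-reduplication"
    # Productive compound (N+N or V+V of known words): accept without error
    if len(parts) == 2 and all(p in dictionary for p in parts):
        if pos.get(parts[0]) == pos.get(parts[1]) and pos.get(parts[0]) in {"N", "V"}:
            return "ok-productive-compound"
    return "error-unknown-word"
```

Only words that fail every acceptance rule proceed to suggestion generation.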
Layer 2.5: Grammar Checking
Purpose
Validate syntactic correctness using POS tags and grammar rules.
Implementation
Grammar checking is implemented through the SyntacticRuleChecker engine and validation strategies within ContextValidator. There is no separate GrammarChecker validator class; instead, grammar rules are applied as part of the context validation pipeline.
Grammar Rules
Grammar rules are defined in YAML files (src/myspellchecker/rules/) and include:
- Subject particle must follow noun
- Object particle must follow noun/pronoun
- Sentence should end with final particle
- Question should have question marker
- Aspect markers must follow verbs
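The first rule above might be expressed in YAML roughly as follows. This is a hypothetical fragment to show the flavor; the actual schema of the files in src/myspellchecker/rules/ may differ:

```yaml
- id: subject_particle_after_noun
  description: Subject particle must follow a noun
  match:
    pos: PARTICLE
    subtype: subject
  require:
    previous_pos: [NOUN, PRONOUN]
  severity: grammar
```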
Specialized Checkers
The grammar system includes specialized checkers in src/myspellchecker/grammar/checkers/:
- AspectChecker: Validates aspect marker usage
- ClassifierChecker: Validates classifier usage
- CompoundChecker: Validates compound words
- MergedWordChecker: Detects incorrectly merged particle+verb sequences
- NegationChecker: Validates negation patterns
- RegisterChecker: Validates formal/informal register consistency
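The checkers plausibly share a common interface; a sketch of that shape, using NegationChecker as the example (the base class, method signature, token representation, and rule are all assumptions):

```python
class BaseChecker:
    def check(self, tokens, pos_tags):
        """Return a list of (start, end, message) grammar errors."""
        raise NotImplementedError

class NegationChecker(BaseChecker):
    # Colloquial Burmese negation brackets the verb: "ma-" ... "bhu:"
    NEG_PREFIX = "\u1019"             # မ
    NEG_FINAL = "\u1018\u1030\u1038"  # ဘူး

    def check(self, tokens, pos_tags):
        errors = []
        for i, tok in enumerate(tokens):
            # A negation prefix with no closing particle later in the
            # sentence is flagged as an incomplete negation pattern
            if tok == self.NEG_PREFIX and self.NEG_FINAL not in tokens[i:]:
                errors.append((i, i + 1, "negation prefix without final particle"))
        return errors
```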
Error Coverage
- Particle errors: ~90% detected
- Verb agreement: ~85% detected
- Structure errors: ~80% detected
Layer 3: Context Validation
Purpose
Detect real-word errors where words are spelled correctly but used incorrectly.
N-gram Analysis
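An n-gram check of this kind scores how probable each word is in its local context and flags improbably rare transitions. A minimal add-alpha-smoothed sketch (count tables, threshold, and names are assumptions):

```python
import math

def bigram_logprob(prev, cur, bigrams, unigrams, vocab_size, alpha=1.0):
    """Add-alpha smoothed log P(cur | prev), illustrative only."""
    num = bigrams.get((prev, cur), 0) + alpha
    den = unigrams.get(prev, 0) + alpha * vocab_size
    return math.log(num / den)

def flag_real_word_errors(words, bigrams, unigrams, vocab_size, threshold=-10.0):
    # Flag positions whose incoming bigram is improbably rare
    flags = []
    for i, (prev, cur) in enumerate(zip(words, words[1:]), start=1):
        if bigram_logprob(prev, cur, bigrams, unigrams, vocab_size) < threshold:
            flags.append(i)
    return flags
```

Flagged positions are candidates only; confirming them is what the semantic verification step below is for.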
Semantic Verification
For ambiguous cases, semantic checking provides deeper analysis.
Error Coverage
- Real-word errors: ~85% detected
- Context misuse: ~80% detected
- Homograph disambiguation: ~90% with semantic
Pipeline Configuration
Validation Levels
Validation level is specified per-check, not in configuration.
Layer Enable/Disable
Performance Characteristics
| Layer | Speed | Coverage |
|---|---|---|
| Syllable | Fast | ~90% of errors |
| Word | Moderate | 5-8% additional |
| Grammar | Moderate | Grammar-only |
| Context | Moderate | 5-10% additional |
| Semantic | Slow | Verification |
For measured end-to-end performance (F1 96.2% without semantic, 98.3% with semantic v2.3), see the benchmarks page.
Error Aggregation
All layer errors are combined into the final Response object. There is no separate ResultAssembler class; error aggregation is handled directly by SpellChecker when assembling results from each validation layer.
The aggregation process:
- Collect errors from each layer (syllable, word, grammar, context)
- Run suggestion reconstruction (_reconstruct_compound_suggestions, etc.)
- Deduplicate errors at the same position via _dedup_errors_by_position
- Deduplicate overlapping error spans via _dedup_errors_by_span
- Apply suppression filters (low-value errors, NER entities)
- Return a Response containing the filtered error list
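The span-deduplication step can be sketched as dropping any error whose span is fully covered by an already-kept, longer overlapping error (the (start, end, message) tuple shape is an assumption; the real _dedup_errors_by_span may also weigh error priority):

```python
def dedup_by_span(errors):
    """Keep each position's widest error; drop spans nested inside a kept one."""
    kept = []
    # Sort by start, widest span first, so outer spans are considered first
    for e in sorted(errors, key=lambda e: (e[0], -(e[1] - e[0]))):
        if not any(k[0] <= e[0] and e[1] <= k[1] for k in kept):
            kept.append(e)
    return kept
```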
Next Steps
- Layer 1 Details - Syllable validation
- Layer 2 Details - SymSpell algorithm
- Layer 3 Details - N-gram context
- Performance Tuning - Optimization