Pipeline Overview
Pre-Processing
Text Normalization
Before validation, text is normalized:
- Zero-width removal: Remove invisible characters
- Unicode normalization: NFC form for consistent comparison
- Zawgyi handling: Detect and optionally convert legacy encoding
- Whitespace normalization: Consistent spacing
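The normalization steps above can be sketched with the Python standard library. This is a minimal illustration, not the library's own code: `normalize_text` and the exact set of invisible characters are assumptions, and Zawgyi detection is left out because it needs a dedicated detector.

```python
import re
import unicodedata

# Assumed set of invisible characters: ZERO WIDTH SPACE, ZWNJ, ZWJ,
# WORD JOINER, and the byte order mark.
_INVISIBLE = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
_WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Hypothetical helper applying the normalization steps in order."""
    text = _INVISIBLE.sub("", text)            # zero-width removal
    text = unicodedata.normalize("NFC", text)  # canonical composition (NFC)
    text = _WHITESPACE.sub(" ", text).strip()  # collapse whitespace runs
    return text
```

Removing zero-width characters before NFC keeps invisible code points from blocking composition of adjacent marks.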
Syllable Segmentation
Text is broken into syllables using Myanmar orthographic rules:
- Consonant boundaries: Detect syllable starts
- Combining character rules: Group marks correctly
- Stacking rules: Handle complex consonant clusters
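As a rough illustration of these rules, the sketch below uses a common regex heuristic: a consonant starts a new syllable unless it is stacked under the previous consonant or closes the current syllable. `segment_syllables` is a hypothetical name, and the pattern only covers the basic consonant range, so the real segmenter is likely more thorough.

```python
import re

# Split before a consonant (U+1000..U+1021) unless it is
#  - stacked: preceded by the virama U+1039, or
#  - syllable-final: followed by asat U+103A or virama U+1039.
_SYLLABLE_START = re.compile(
    r"(?<!\u1039)(?=[\u1000-\u1021](?![\u103a\u1039]))"
)


def segment_syllables(text: str) -> list[str]:
    """Hypothetical helper: split Myanmar text at syllable starts."""
    # Splitting on zero-width matches requires Python 3.7 or later.
    return [piece for piece in _SYLLABLE_START.split(text) if piece]
```

With this heuristic, မြန်မာ splits into မြန် and မာ, while the stacked cluster in တက္ကသိုလ် stays inside its first syllable.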
Layer 1: Syllable Validation
Purpose
Validate that each syllable is orthographically correct and exists in the dictionary.
Implementation
Rule-Based Validation
Syllable structure rules come from SyllableRuleValidator.
Dictionary Lookup
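A minimal sketch of how a structural check and a dictionary lookup might combine at this layer. The function names, error labels, and the single toy structure rule are all assumptions for illustration; they do not reflect the actual rules in SyllableRuleValidator.

```python
from typing import Optional


def has_valid_structure(syllable: str) -> bool:
    """Toy rule (assumed): only Myanmar-block characters, and the first
    character is a consonant or independent vowel (U+1000..U+102A)."""
    if not syllable:
        return False
    if not all("\u1000" <= ch <= "\u109f" for ch in syllable):
        return False
    return "\u1000" <= syllable[0] <= "\u102a"


def validate_syllable(syllable: str, dictionary: set[str]) -> Optional[str]:
    """Return a hypothetical error label, or None if both checks pass."""
    if not has_valid_structure(syllable):
        return "invalid_structure"   # rule-based check fails first
    if syllable not in dictionary:
        return "unknown_syllable"    # well-formed but not in the dictionary
    return None
```

Running the cheap structural check first means malformed input never reaches the dictionary lookup.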
Error Coverage
- Typos: ~90% caught at this layer
- Invalid characters: 100% caught
- Structural errors: 100% caught
Layer 2: Word Validation
Purpose
Verify that valid syllables form valid words, and provide suggestions for unknown words.
Word Assembly
Valid syllables are assembled into words using a longest-match algorithm within the word validation layer. There is no separate WordAssembler class;
word assembly logic is integrated into the segmentation and validation pipeline.
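Longest-match assembly can be sketched as a greedy scan over the syllable list. This is an illustration under assumptions: `assemble_words`, the `max_len` cap, and the single-syllable fallback are hypothetical, and the library's integrated logic may differ.

```python
def assemble_words(
    syllables: list[str], lexicon: set[str], max_len: int = 4
) -> list[str]:
    """Greedy longest-match: at each position take the longest run of
    syllables (up to max_len) that forms a lexicon entry."""
    words: list[str] = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first and shrink until there is a hit.
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + size])
            # A single syllable is always emitted so the scan makes progress.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words
```

For example, with the placeholder lexicon `{"ab", "abc", "d"}`, the syllables `["a", "b", "c", "d"]` assemble into `["abc", "d"]` rather than `["ab", "c", "d"]`.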
Validation Steps
For unknown words, Layer 2 performs multiple checks before generating an error.
Error Coverage
- Unknown words: 100% detected
- Compound errors: ~80% with suggestions
- Near-misses: ~95% with correct suggestion
- Productive compounds: Accepted without error (N+N, V+V, etc.)
- Productive reduplications: Accepted without error (AA, AABB, ABAB)
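The reduplication patterns listed above (AA, AABB, ABAB) can be recognized from the syllable sequence alone. The sketch below is a hypothetical check, assuming a pattern is accepted only when its base form (A, or AB) is itself in the lexicon; the library's actual acceptance criteria are not given here.

```python
def is_productive_reduplication(syllables: list[str], lexicon: set[str]) -> bool:
    """Hypothetical check for AA, AABB, and ABAB patterns."""
    if len(syllables) == 2:                       # AA
        a, b = syllables
        return a == b and a in lexicon
    if len(syllables) == 4:
        s1, s2, s3, s4 = syllables
        if s1 == s2 and s3 == s4 and s1 != s3:    # AABB, base word AB
            return (s1 + s3) in lexicon
        if s1 == s3 and s2 == s4 and s1 != s2:    # ABAB, base word AB
            return (s1 + s2) in lexicon
    return False
```

A word that passes this check would be accepted without an error even though the reduplicated form is not a dictionary entry.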
Layer 2.5: Grammar Checking
Purpose
Validate syntactic correctness using POS tags and grammar rules.
Implementation
Grammar checking is implemented through the SyntacticRuleChecker engine and
validation strategies within ContextValidator. There is no separate GrammarChecker
validator class; instead, grammar rules are applied as part of the context validation
pipeline.
Grammar Rules
Grammar rules are defined in YAML files (src/myspellchecker/rules/) and include:
- Subject particle must follow noun
- Object particle must follow noun/pronoun
- Sentence should end with final particle
- Question should have question marker
- Aspect markers must follow verbs
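Rules of the "X must follow Y" kind can be expressed as a table from a particle's POS tag to the tags allowed immediately before it. The sketch below is an assumption-laden illustration: the tag names (`PART_SUBJ`, `PART_OBJ`, `PART_ASPECT`), the `Token` type, and `check_particle_rules` are hypothetical and do not mirror the YAML schema or SyntacticRuleChecker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    pos: str  # hypothetical tag set: "N", "PRON", "V", "PART_SUBJ", ...


# Particle tag -> POS tags allowed directly before it (assumed names).
ALLOWED_BEFORE = {
    "PART_SUBJ": {"N"},
    "PART_OBJ": {"N", "PRON"},
    "PART_ASPECT": {"V"},
}


def check_particle_rules(tokens: list[Token]) -> list[tuple[int, str]]:
    """Return (token index, message) for each particle in a bad position."""
    errors = []
    for i, tok in enumerate(tokens):
        allowed = ALLOWED_BEFORE.get(tok.pos)
        if allowed is None:
            continue  # not a particle covered by these rules
        prev_pos = tokens[i - 1].pos if i > 0 else None
        if prev_pos not in allowed:
            expected = "/".join(sorted(allowed))
            errors.append((i, f"{tok.pos} must follow {expected}"))
    return errors
```

Keeping the rules in a data table rather than in code is what makes it practical to load them from external rule files.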
Specialized Checkers
The grammar system includes specialized checkers in src/myspellchecker/grammar/checkers/:
- AspectChecker: Validates aspect marker usage
- ClassifierChecker: Validates classifier usage
- CompoundChecker: Validates compound words
- NegationChecker: Validates negation patterns
- RegisterChecker: Validates formal/informal register consistency
Error Coverage
- Particle errors: ~90% detected
- Verb agreement: ~85% detected
- Structure errors: ~80% detected
Layer 3: Context Validation
Purpose
Detect real-word errors where words are spelled correctly but used incorrectly.
N-gram Analysis
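One way to picture n-gram analysis is a bigram model that flags word pairs with unusually low conditional probability. The sketch below is a toy version using add-one smoothing; the function names and the threshold are assumptions, and the library's model and scoring are not described in this section.

```python
from collections import Counter


def train_bigrams(sentences: list[list[str]]) -> tuple[Counter, Counter]:
    """Count unigrams and adjacent word pairs in tokenized sentences."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_probability(prev: str, word: str, unigrams: Counter,
                       bigrams: Counter) -> float:
    """P(word | prev) with add-one (Laplace) smoothing."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def flag_unlikely(words: list[str], unigrams: Counter, bigrams: Counter,
                  threshold: float) -> list[int]:
    """Return indices of words that are improbable after their predecessor."""
    return [
        i for i in range(1, len(words))
        if bigram_probability(words[i - 1], words[i], unigrams, bigrams) < threshold
    ]
```

A flagged index marks a correctly spelled word that rarely follows its left neighbour, which is the signature of a real-word error.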
Semantic Verification
For ambiguous cases, semantic checking provides deeper analysis.
Error Coverage
- Real-word errors: ~85% detected
- Context misuse: ~80% detected
- Homograph disambiguation: ~90% with semantic
Pipeline Configuration
Validation Levels
Validation level is specified per-check, not in configuration.
Layer Enable/Disable
Performance Characteristics
| Layer | Time | Accuracy | Coverage |
|---|---|---|---|
| Syllable | <10ms | 95% | 90% |
| Word | ~50ms | 98% | 5-8% |
| Grammar | ~50ms | 90% | Grammar-only |
| Context | ~100ms | 85% | 5-10% |
| Semantic | ~200ms | 95% | Verification |
Error Aggregation
All layer errors are combined into the final Response object. There is no
separate ResultAssembler class; error aggregation is handled directly by
SpellChecker when assembling results from each validation layer.
The aggregation process:
- Collect errors from each layer (syllable, word, grammar, context)
- Sort by position in the original text
- Deduplicate errors at the same position
- Return a Response containing the merged error list
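The aggregation steps can be sketched as follows. The `Error` type and `aggregate_errors` are hypothetical stand-ins, and the tie-breaking rule (the earliest layer wins when two errors share a position) is an assumption, since the deduplication policy is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Error:
    position: int   # offset in the original text
    layer: str      # "syllable", "word", "grammar", or "context"
    message: str


def aggregate_errors(*layer_errors: list[Error]) -> list[Error]:
    """Collect, sort by position, and keep one error per position."""
    # Collect in pipeline order so earlier layers come first.
    merged = [err for errors in layer_errors for err in errors]
    # list.sort is stable, so same-position errors keep layer order.
    merged.sort(key=lambda err: err.position)
    seen: set[int] = set()
    unique: list[Error] = []
    for err in merged:
        if err.position in seen:
            continue  # assumed policy: first (earliest-layer) error wins
        seen.add(err.position)
        unique.append(err)
    return unique
```

The returned list is what would be wrapped in the final Response object.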
Next Steps
- Layer 1 Details - Syllable validation
- Layer 2 Details - SymSpell algorithm
- Layer 3 Details - N-gram context
- Performance Tuning - Optimization