Overview
The validation pipeline runs text through multiple strategies, each checking for a different error type:

| Strategy | Priority | Purpose | Error Type |
|---|---|---|---|
| ToneValidationStrategy | 10 | Tone mark disambiguation | tone_ambiguity |
| OrthographyValidationStrategy | 15 | Medial order and compatibility | medial_order_error |
| SyntacticValidationStrategy | 20 | Grammar rule checking | syntax_error |
| POSSequenceValidationStrategy | 30 | POS sequence validation | pos_sequence_error |
| QuestionStructureValidationStrategy | 40 | Question structure | question_structure |
| HomophoneValidationStrategy | 45 | Homophone detection | homophone_error |
| NgramContextValidationStrategy | 50 | N-gram probability | context_probability |
| ErrorDetectionStrategy | 65 | AI token classification (opt-in) | ai_detected |
| SemanticValidationStrategy | 70 | AI-powered semantic (opt-in) | semantic_error |
ValidationContext
All strategies receive a shared ValidationContext containing sentence-level information:
Context Attributes
| Attribute | Type | Description |
|---|---|---|
| sentence | str | Full original sentence |
| words | List[str] | Tokenized words |
| word_positions | List[int] | Character position of each word |
| is_name_mask | List[bool] | True if word is a proper name |
| existing_errors | Set[int] | Word positions already flagged |
| sentence_type | str | Sentence type for context |
| pos_tags | List[str] | POS tags (if available) |
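As a rough illustration of the attribute table above, the shared context can be pictured as a plain dataclass. This is only a sketch: the class body, the default values, and the build_context helper are assumptions made for the example, and the whitespace tokenization stands in for a real Myanmar word segmenter.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ValidationContext:
    """Illustrative stand-in for the shared context, not the library's class."""

    sentence: str
    words: List[str]
    word_positions: List[int]
    is_name_mask: List[bool]
    existing_errors: Set[int] = field(default_factory=set)
    sentence_type: str = "statement"  # assumed default value
    pos_tags: List[str] = field(default_factory=list)  # empty when no tagger ran


def build_context(sentence: str) -> ValidationContext:
    """Hypothetical helper: split on whitespace and record character offsets."""
    words = sentence.split()
    positions: List[int] = []
    cursor = 0
    for word in words:
        cursor = sentence.index(word, cursor)  # offset of this occurrence
        positions.append(cursor)
        cursor += len(word)
    return ValidationContext(
        sentence=sentence,
        words=words,
        word_positions=positions,
        is_name_mask=[False] * len(words),
    )


ctx = build_context("သူ ကျောင်း သွား တယ်")
```

Every strategy would read from and write to the same instance, which is what lets a later strategy see positions flagged by an earlier one.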
Strategy Implementations
ToneValidationStrategy (Priority: 10)
Handles tone mark disambiguation using context.
- Missing tone marks (ငါ → ငါး in number context)
- Wrong tone marks based on context
- Ambiguous words resolved by surrounding words
OrthographyValidationStrategy (Priority: 15)
Validates medial consonant ordering and compatibility (UTN #11 rules) at the word level. Automatically included in the default pipeline.
Detection:
- Incorrect medial consonant order (e.g., ွ before ျ)
- Incompatible medial-consonant combinations
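The ordering check can be sketched as follows. The function name is hypothetical and this is not the strategy's actual code; it only relies on the canonical storage order of the four Myanmar medials (ya U+103B, ra U+103C, wa U+103D, ha U+103E) and does not attempt the consonant compatibility rules.

```python
# Canonical storage order of Myanmar medials: ya, ra, wa, ha.
MEDIAL_ORDER = {"\u103B": 0, "\u103C": 1, "\u103D": 2, "\u103E": 3}


def has_medial_order_error(word: str) -> bool:
    """Return True if any run of adjacent medials is out of canonical order."""
    previous_rank = -1
    for char in word:
        rank = MEDIAL_ORDER.get(char)
        if rank is None:
            previous_rank = -1  # a non-medial character ends the current run
            continue
        if rank <= previous_rank:
            return True  # out of order, or the same medial repeated
        previous_rank = rank
    return False
```

With this sketch, a consonant followed by wa and then ya is reported, while the same consonant followed by ya and then wa passes.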
SyntacticValidationStrategy (Priority: 20)
Validates grammar rules and particle usage.
- Particle errors (မှာ vs မှ)
- Medial confusion (ျ vs ြ)
- Missing particles
- Invalid word combinations
POSSequenceValidationStrategy (Priority: 30)
Validates POS tag sequences against expected patterns.
- P-P: Consecutive particles → error (always flagged)
- N-N: Consecutive nouns without particle → warning (logged, not surfaced as error)
- V-V: Consecutive verbs → info (serial verb constructions are usually valid)
- Auxiliary verbs: နေ (progressive), ထား (resultative), လိုက် (action manner)
- Modal verbs: နိုင် (ability), ချင် (desire), ရ (permission)
- Directional verbs: သွား (away), လာ (toward)
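The three pair rules (P-P, N-N, V-V) can be pictured as a lookup from adjacent tag pairs to a severity, with only the error level surfaced. The sketch below assumes single-letter tags "P", "N", and "V" as in the list above; the function names and return shapes are hypothetical, and the auxiliary, modal, and directional verb handling is left out.

```python
from typing import List, Tuple

# Severity per consecutive tag pair, mirroring the P-P / N-N / V-V rules.
PAIR_SEVERITY = {("P", "P"): "error", ("N", "N"): "warning", ("V", "V"): "info"}


def classify_pos_pairs(pos_tags: List[str]) -> List[Tuple[int, str]]:
    """Return (index_of_second_word, severity) for each notable adjacent pair."""
    findings = []
    for i in range(1, len(pos_tags)):
        severity = PAIR_SEVERITY.get((pos_tags[i - 1], pos_tags[i]))
        if severity is not None:
            findings.append((i, severity))
    return findings


def surfaced_errors(pos_tags: List[str]) -> List[int]:
    """Keep only 'error' findings; warnings and info would merely be logged."""
    return [i for i, severity in classify_pos_pairs(pos_tags) if severity == "error"]
```

Keeping the severity table separate from the scan makes it easy to see why a V-V sequence never reaches the user while a P-P sequence always does.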
QuestionStructureValidationStrategy (Priority: 40)
Validates question sentence structure.
- Missing question particles (လား, သလဲ)
- Wrong question particle for context
- Question word agreement
NgramContextValidationStrategy (Priority: 50)
Uses bigram/trigram probabilities to detect unlikely sequences.
- Low probability word pairs
- Unusual word combinations
- Real-word errors (correct spelling, wrong context)
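To make the idea of a low probability word pair concrete, here is a minimal bigram sketch. It is not the library's model: the BigramModel class, the maximum-likelihood estimate without smoothing, and the 0.1 threshold are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Iterable, List


class BigramModel:
    """Tiny maximum-likelihood bigram model, for illustration only."""

    def __init__(self, sentences: Iterable[List[str]]) -> None:
        self.contexts: Counter = Counter()  # how often a word precedes another
        self.bigrams: Counter = Counter()
        for words in sentences:
            self.contexts.update(words[:-1])
            self.bigrams.update(zip(words, words[1:]))

    def probability(self, previous: str, current: str) -> float:
        """P(current | previous); 0.0 when 'previous' was never seen as context."""
        seen = self.contexts[previous]
        return self.bigrams[(previous, current)] / seen if seen else 0.0


def low_probability_positions(
    model: BigramModel, words: List[str], threshold: float = 0.1
) -> List[int]:
    """Return indexes of words that are unlikely given the preceding word."""
    return [
        i
        for i in range(1, len(words))
        if model.probability(words[i - 1], words[i]) < threshold
    ]
```

A correctly spelled word in the wrong context (a real-word error) shows up here as a pair whose conditional probability falls below the threshold, even though each word is valid on its own.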
HomophoneValidationStrategy (Priority: 45)
Detects homophone confusion based on context.
- Homophone pairs (ကား/ကာ, သာ/သား)
- Context-based correct form selection
- Sound-alike word confusion
The min_probability parameter prevents false positives caused by infrequent n-gram occurrences. When the current word has zero probability (an unseen n-gram), a homophone is suggested only if its probability exceeds this threshold.
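The threshold rule can be sketched roughly as below. Only the zero-probability case described above is modelled; the function name, the callable used for probabilities, the 0.001 default, and the choice to leave seen n-grams untouched are assumptions for the example, not the strategy's real behavior.

```python
from typing import Callable, Dict, List, Optional

# Homophone pairs taken from the examples above.
HOMOPHONES: Dict[str, List[str]] = {
    "ကား": ["ကာ"],
    "ကာ": ["ကား"],
    "သာ": ["သား"],
    "သား": ["သာ"],
}


def suggest_homophone(
    previous: str,
    current: str,
    probability: Callable[[str, str], float],
    min_probability: float = 0.001,  # hypothetical default
) -> Optional[str]:
    """Suggest a homophone for an unseen n-gram, or None if nothing is convincing."""
    if probability(previous, current) > 0.0:
        return None  # seen n-gram: this sketch leaves the word alone (assumption)
    best, best_probability = None, min_probability
    for candidate in HOMOPHONES.get(current, []):
        candidate_probability = probability(previous, candidate)
        if candidate_probability > best_probability:  # must exceed the threshold
            best, best_probability = candidate, candidate_probability
    return best
```

Without the threshold, a candidate seen only once or twice after the previous word would still beat a probability of zero, which is exactly the kind of false positive the parameter is meant to suppress.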
ErrorDetectionStrategy (Priority: 65) — Opt-in Required
AI-powered error detection using token classification. Unlike the MLM-based SemanticChecker (which masks each word and requires N forward passes), this strategy classifies all tokens in a single forward pass (~10ms), making it practical for real-time use. This strategy is not active by default. You must train a detector model first, then configure ErrorDetectorConfig with the model path.
- Token-level error classification (CORRECT vs ERROR)
- Single forward pass for entire sentence
- Complements N-gram and semantic strategies
| Aspect | ErrorDetectionStrategy (65) | SemanticValidationStrategy (70) |
|---|---|---|
| Approach | Token classification | Masked Language Modeling |
| Speed | ~10ms (single pass) | ~50-150ms × N words |
| Output | Error flags only | Error flags + suggestions |
| Model | Fine-tuned XLM-RoBERTa | User-trained RoBERTa/BERT |
| Training | Requires train-detector | Requires train-model |
SemanticValidationStrategy (Priority: 70) — Opt-in Required
AI-powered validation using ONNX models. This strategy is not active by default. You must train a semantic model first, then configure SemanticConfig with the model path and set use_proactive_scanning=True.
- Semantic anomalies
- Word meaning in context
- Deep contextual errors
Creating Custom Strategies
To add a custom strategy, implement the ValidationStrategy abstract base class.
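A custom strategy might look roughly like the sketch below. The real base class signature is not shown in this document, so everything here is an assumption for illustration: the priority() and validate() method names, the ContextError record, the reduced ValidationContext, the example RepeatedWordStrategy with its "repeated_word" error type, and the choice to store values from word_positions in existing_errors.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ValidationContext:
    """Reduced stand-in holding only the fields this example needs."""

    words: List[str]
    word_positions: List[int]
    is_name_mask: List[bool]
    existing_errors: Set[int] = field(default_factory=set)


@dataclass
class ContextError:
    """Hypothetical error record returned by a strategy."""

    position: int
    text: str
    error_type: str
    confidence: float


class ValidationStrategy(ABC):
    """Assumed shape of the abstract base class."""

    @abstractmethod
    def priority(self) -> int: ...

    @abstractmethod
    def validate(self, context: ValidationContext) -> List[ContextError]: ...


class RepeatedWordStrategy(ValidationStrategy):
    """Example strategy: flags a word that immediately repeats the previous one."""

    def priority(self) -> int:
        return 25  # would run between the syntactic (20) and POS (30) strategies

    def validate(self, context: ValidationContext) -> List[ContextError]:
        errors: List[ContextError] = []
        for i in range(1, len(context.words)):
            position = context.word_positions[i]
            # Skip words that are already flagged or that are proper names.
            if position in context.existing_errors or context.is_name_mask[i]:
                continue
            if context.words[i] == context.words[i - 1]:
                errors.append(
                    ContextError(position, context.words[i], "repeated_word", 0.6)
                )
                context.existing_errors.add(position)
        return errors
```

The two guard conditions in validate are the important part of the pattern: they keep a custom strategy from duplicating earlier findings and from flagging names.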
Strategy Composition
In the default pipeline, SpellChecker coordinates validation directly through its validators:
- SyllableValidator — validates each syllable (layer 1)
- WordValidator — validates words via SymSpell (layer 2)
- ContextValidator — orchestrates validation strategies (layer 3)
ContextValidator receives a list of strategies built by SpellCheckerBuilder and executes them in priority order within each sentence.
Execution Order
- Strategies are sorted by priority (ascending)
- Each strategy receives the shared ValidationContext
- Strategies can check existing_errors to skip already-flagged words
- Strategies add their flagged positions to existing_errors
- Errors from all strategies are collected and returned
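The execution steps above can be sketched as a small loop. This is a simplified model rather than ContextValidator itself: the Strategy tuple, the run_strategies function, and the use of word indexes instead of character positions are assumptions made to keep the example short.

```python
from typing import Callable, List, NamedTuple, Set, Tuple


class Strategy(NamedTuple):
    """Hypothetical minimal strategy: a name, a priority, and a check function."""

    name: str
    priority: int
    check: Callable[[List[str], Set[int]], List[int]]


def run_strategies(
    strategies: List[Strategy], words: List[str]
) -> List[Tuple[int, str]]:
    """Run strategies in ascending priority order and collect (index, name) pairs."""
    existing_errors: Set[int] = set()  # shared across all strategies
    collected: List[Tuple[int, str]] = []
    for strategy in sorted(strategies, key=lambda s: s.priority):
        for index in strategy.check(words, existing_errors):
            if index in existing_errors:
                continue  # an earlier strategy already flagged this word
            existing_errors.add(index)
            collected.append((index, strategy.name))
    return collected
```

Because the shared set is consulted before each error is recorded, the strategy with the lower priority value wins when two strategies flag the same word.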
Configuration
Enable or disable strategies via configuration.
Error Types
Each strategy produces specific error types:

| Error Type | Strategy | Description |
|---|---|---|
| tone_ambiguity | Tone | Tone mark disambiguation |
| medial_order_error | Orthography | Medial order or compatibility issue |
| syntax_error | Syntactic | Grammar rule violation |
| pos_sequence_error | POS | Invalid POS sequence (P-P) |
| question_structure | Question | Question structure issue |
| homophone_error | Homophone | Sound-alike confusion |
| context_probability | N-gram | Low probability sequence |
| ai_detected | Error Detection | AI-flagged error (token classification) |
| semantic_error | Semantic | AI-detected anomaly (opt-in) |
Best Practices
- Priority Selection: Choose a priority that places your strategy correctly relative to the built-in ones (lower values run first)
- Skip Flagged Words: Always check existing_errors to avoid duplicate errors
- Skip Names: Respect the is_name_mask to avoid flagging proper names
- Confidence Scores: Use appropriate confidence levels for your error type
- Performance: Heavy validations (semantic) should run last
See Also
- Grammar Checkers - Rule-based grammar validation
- Context Checking - N-gram context validation
- POS Tagging - Part-of-speech tagging
- Error Detection - AI error detection feature
- Semantic Checking - MLM-based deep context analysis
- Training Custom Models - Train your own AI models