train-detector pipeline, and the library generates synthetic training errors automatically from its YAML rules.
How It Works
Token Classification
Unlike Semantic Checking (which masks each word and asks “what should go here?”), Error Detection classifies all tokens simultaneously.
Training with Synthetic Errors
The model is trained on clean data corrupted with rule-based patterns:
- Homophone swaps (rules/homophones.yaml)
- Medial confusion (ျ↔ြ, ွ↔ှ)
- Visually similar character swaps (phonetic_data.py)
- Character deletions and insertions
- Inverted typo patterns (rules/typo_corrections.yaml)
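As a rough illustration of how clean text can be turned into labelled training data, the sketch below applies only the medial-confusion rule and emits one binary label per token. The function names, the `error_rate` parameter, and the label encoding (1 = ERROR, 0 = CORRECT) are assumptions made for this example; they are not the library's actual generator, which also draws on the YAML rule files listed above.

```python
import random

# Medial pairs taken from the list above; the other rule types are omitted here.
MEDIAL_SWAPS = {"ျ": "ြ", "ြ": "ျ", "ွ": "ှ", "ှ": "ွ"}


def swap_first_medial(token):
    """Return the token with its first confusable medial swapped, or None if it has none."""
    for i, ch in enumerate(token):
        if ch in MEDIAL_SWAPS:
            return token[:i] + MEDIAL_SWAPS[ch] + token[i + 1:]
    return None


def corrupt(tokens, error_rate=0.3, seed=0):
    """Corrupt a clean token list.

    Returns (noisy_tokens, labels), where labels[i] is 1 (ERROR) if token i
    was corrupted and 0 (CORRECT) otherwise. All names are hypothetical.
    """
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    noisy, labels = [], []
    for token in tokens:
        swapped = swap_first_medial(token)
        if swapped is not None and rng.random() < error_rate:
            noisy.append(swapped)
            labels.append(1)
        else:
            noisy.append(token)
            labels.append(0)
    return noisy, labels
```

Because the labels are produced at the moment of corruption, no manual annotation is needed: every clean sentence yields a (noisy tokens, labels) pair that a token classifier can be fine-tuned on.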
Quick Start
Configuration
ErrorDetectorConfig
Confidence Threshold
The confidence_threshold controls sensitivity:
| Threshold | Behavior |
|---|---|
| 0.5 | Sensitive — catches more errors but more false positives |
| 0.7 | Balanced (default) |
| 0.9 | Conservative — fewer false positives but may miss errors |
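To make the trade-off in the table concrete, here is a minimal sketch of how a threshold might be applied to per-token error probabilities. The `flag_errors` helper and the probabilities are invented for illustration, and the sketch assumes a token is flagged when its probability is greater than or equal to the threshold; the library's exact comparison is not stated here.

```python
def flag_errors(tokens, error_probs, confidence_threshold=0.7):
    """Return (index, token, probability) for tokens at or above the threshold.

    Hypothetical helper: error_probs[i] is the model's ERROR probability for tokens[i].
    """
    if len(tokens) != len(error_probs):
        raise ValueError("tokens and error_probs must have the same length")
    return [
        (i, token, prob)
        for i, (token, prob) in enumerate(zip(tokens, error_probs))
        if prob >= confidence_threshold
    ]


# Made-up probabilities for a three-token sentence.
tokens = ["w1", "w2", "w3"]
probs = [0.95, 0.60, 0.75]

sensitive = flag_errors(tokens, probs, confidence_threshold=0.5)
balanced = flag_errors(tokens, probs)  # default 0.7
conservative = flag_errors(tokens, probs, confidence_threshold=0.9)
```

With these numbers, the 0.5 setting flags all three tokens, the default flags the first and third, and 0.9 flags only the first.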
Architecture
Validation Pipeline Position
Error Detection runs at priority 65, between N-gram (50) and Semantic (70).
Design Decisions
- Binary labels only (CORRECT/ERROR): a simpler model and faster training. Error type classification is left to the existing rule-based layers.
- Empty suggestions: the detector only flags errors and does not propose corrections. Downstream strategies (SymSpell, SemanticChecker) provide corrections.
- Priority 65: runs after all cheap rule-based strategies but before the expensive SemanticChecker, and skips positions that are already flagged.
- Everything optional: if no model is configured, the library works exactly as before. No breaking changes.
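The ordering and skip behavior described above can be sketched with a small priority-sorted pipeline. The `Strategy` dataclass and `run_pipeline` function are hypothetical stand-ins written for this example, not the library's classes; only the priorities (50, 65, 70) come from this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Strategy:
    """Hypothetical stand-in for a validation strategy."""
    name: str
    priority: int
    # Receives the tokens and the positions already flagged; returns positions to flag.
    check: Callable[[List[str], Set[int]], Set[int]]


def run_pipeline(tokens: List[str], strategies: List[Strategy]) -> Dict[int, str]:
    """Run strategies in ascending priority; the first strategy to flag a position keeps it."""
    flagged: Dict[int, str] = {}
    for strategy in sorted(strategies, key=lambda s: s.priority):
        for pos in strategy.check(tokens, set(flagged)):
            flagged.setdefault(pos, strategy.name)
    return flagged


# Stub checks: the detector would flag positions 1 and 2 but skips any already flagged.
ngram = Strategy("ngram", 50, lambda tokens, seen: {1})
detector = Strategy("error_detection", 65, lambda tokens, seen: {1, 2} - seen)
semantic = Strategy("semantic", 70, lambda tokens, seen: set())

with_detector = run_pipeline(["a", "b", "c"], [semantic, detector, ngram])
without_detector = run_pipeline(["a", "b", "c"], [semantic, ngram])
```

Leaving the detector out of the list changes nothing for the other strategies, which mirrors the "everything optional" decision.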
Comparison with Semantic Checking
| Aspect | Error Detection | Semantic Checking |
|---|---|---|
| Strategy | ErrorDetectionStrategy (65) | SemanticValidationStrategy (70) |
| Model type | Token classification | Masked Language Model |
| Inference | Single forward pass | N forward passes (one per word) |
| Speed | ~10ms per sentence | ~200ms per sentence |
| Training | Fine-tune on synthetic errors | Train from scratch on corpus |
| Output | Error flags only | Error flags + suggestions |
| Use case | Real-time checking | Thorough analysis |
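The inference row is the main reason for the speed gap. The sketch below counts forward passes for each approach using a stub model; `CountingModel`, `detect_single_pass`, and `check_by_masking` are illustrative names rather than library APIs, and the stub returns dummy scores instead of real predictions.

```python
class CountingModel:
    """Stub model that records how many forward passes were requested."""

    def __init__(self):
        self.calls = 0

    def forward(self, tokens):
        self.calls += 1
        return [0.0] * len(tokens)  # one dummy score per token


def detect_single_pass(model, tokens):
    """Token classification: score every token in one forward pass."""
    return model.forward(tokens)


def check_by_masking(model, tokens, mask="[MASK]"):
    """Masked-LM style check: mask each position in turn, one forward pass per word."""
    scores = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [mask] + tokens[i + 1:]
        scores.append(model.forward(masked)[i])
    return scores
```

For a sentence of N words the first function calls the model once and the second calls it N times, so the gap widens with sentence length.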
Direct API Access
See Also
- Training Guide - Training error detection models
- Semantic Checking - MLM-based semantic validation
- Validation Strategies - All validation strategies
- CLI Reference - train-detector command