Error detection fine-tunes an XLM-RoBERTa model to classify each token as CORRECT or ERROR in a single forward pass (~10ms), making it practical for real-time use. mySpellChecker does not ship with a pre-trained detector — you fine-tune your own on a clean Myanmar corpus using the train-detector pipeline, and the library generates synthetic training errors automatically from its YAML rules.

How It Works

Token Classification

Unlike Semantic Checking (which masks each word and asks “what should go here?”), Error Detection classifies all tokens simultaneously:
Input:  "ကျွန်တော် စာ ဖတ် တယ်"
Output: [CORRECT, CORRECT, ERROR, CORRECT]
                           ↑
                           flagged for review
The model processes the entire sentence in one forward pass, making it ~20x faster than MLM-based semantic checking.
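Independent of the library's internals, the per-token decision can be sketched in plain Python. The logits below are hard-coded for illustration; a real detector would take them from the fine-tuned XLM-RoBERTa model's token-classification head.

```python
import math

def classify_tokens(logits, threshold=0.7):
    """Map per-token (correct, error) logits to labels in one pass.

    A token is flagged as ERROR only when P(ERROR), computed with a
    two-class softmax over the logits, meets the confidence threshold.
    """
    labels = []
    for correct_logit, error_logit in logits:
        p_error = math.exp(error_logit) / (
            math.exp(correct_logit) + math.exp(error_logit)
        )
        labels.append("ERROR" if p_error >= threshold else "CORRECT")
    return labels

# Illustrative logits for "ကျွန်တော် စာ ဖတ် တယ်" (not real model output)
logits = [(4.0, -2.0), (3.5, -1.0), (-1.5, 3.0), (4.2, -2.3)]
print(classify_tokens(logits))  # ['CORRECT', 'CORRECT', 'ERROR', 'CORRECT']
```

Because every token is scored from the same forward pass, the cost is one inference per sentence rather than one per word.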

Training with Synthetic Errors

The model is trained on clean data corrupted with rule-based patterns:
Clean:     "ကျွန်တော် စာ ဖတ် တယ်"
Corrupted: "ကျွန်တော် စာ ဗတ် တယ်"  (ဖ→ဗ similar char swap)
Labels:    [CORRECT,  CORRECT, ERROR,  CORRECT]
Corruption types draw from the library’s existing YAML rules:
  • Homophone swaps (rules/homophones.yaml)
  • Medial confusion (ျ↔ြ, ွ↔ှ)
  • Visually similar character swaps (phonetic_data.py)
  • Character deletions and insertions
  • Inverted typo patterns (rules/typo_corrections.yaml)
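A toy version of the corruption step might look like the following. The one-entry swap table is a hypothetical stand-in for the library's YAML rules and phonetic data, which cover many more patterns.

```python
# Hypothetical stand-in for the YAML-driven similar-character rules
SIMILAR_CHARS = {"ဖ": "ဗ"}

def corrupt_sentence(words):
    """Corrupt the first word containing a swappable character and
    return (corrupted_words, labels) as a detector training example."""
    corrupted, labels = [], []
    done = False
    for word in words:
        new_word = word
        if not done:
            for src, dst in SIMILAR_CHARS.items():
                if src in word:
                    new_word = word.replace(src, dst, 1)
                    done = True
                    break
        labels.append("ERROR" if new_word != word else "CORRECT")
        corrupted.append(new_word)
    return corrupted, labels

words = ["ကျွန်တော်", "စာ", "ဖတ်", "တယ်"]
print(corrupt_sentence(words))
```

Each corrupted sentence comes with its labels for free, which is what makes training on a plain clean corpus possible: no hand-annotated error data is needed.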

Quick Start

Step 1: Train a Detector

# Prepare clean corpus (one sentence per line, UTF-8)
myspellchecker train-detector -i corpus.txt -o ./detector/
Step 2: Use for Spell Checking

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.algorithm_configs import ErrorDetectorConfig

config = SpellCheckerConfig(
    error_detector=ErrorDetectorConfig(
        model_path="./detector/onnx/model.onnx",
        tokenizer_path="./detector/onnx",
    )
)

checker = SpellChecker(config=config)
result = checker.check("မြန်မာစာ")

Configuration

ErrorDetectorConfig

from myspellchecker.core.config.algorithm_configs import ErrorDetectorConfig

config = ErrorDetectorConfig(
    # Model paths (provide paths OR instances, not both)
    model_path="/path/to/model.onnx",
    tokenizer_path="/path/to/tokenizer",

    # Inference settings
    num_threads=1,              # ONNX threads (CPU only)
    confidence_threshold=0.7,   # Minimum P(ERROR) to flag token
    use_pytorch=False,          # Force PyTorch backend

    # Feature toggle
    enabled=True,               # Enable when model is configured
)

Confidence Threshold

The confidence_threshold controls sensitivity:
Threshold   Behavior
0.5         Sensitive: catches more errors but also more false positives
0.7         Balanced (default)
0.9         Conservative: fewer false positives but may miss errors
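The trade-off is easy to see on a fixed set of P(ERROR) scores (the scores below are made up for illustration):

```python
def flag_errors(p_error_scores, threshold):
    """Return indices of tokens whose P(ERROR) meets the threshold."""
    return [i for i, p in enumerate(p_error_scores) if p >= threshold]

scores = [0.10, 0.55, 0.95, 0.72]

print(flag_errors(scores, 0.5))  # sensitive:    [1, 2, 3]
print(flag_errors(scores, 0.7))  # balanced:     [2, 3]
print(flag_errors(scores, 0.9))  # conservative: [2]
```

Lowering the threshold only ever adds flags; it never removes one, so tuning is a one-dimensional precision/recall knob.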

Architecture

Validation Pipeline Position

Error Detection runs at priority 65, between N-gram (50) and Semantic (70):
Priority 10: Tone Validation
Priority 15: Orthography Validation
Priority 20: Syntactic Validation
Priority 30: POS Sequence Validation
Priority 40: Question Structure
Priority 45: Homophone Detection
Priority 50: N-gram Context
Priority 65: Error Detection (AI) ← NEW
Priority 70: Semantic Validation (AI)

Design Decisions

  1. Binary labels only (CORRECT/ERROR) — simpler model, faster training. Error type classification is left to existing rule-based layers.
  2. Empty suggestions — the detector only flags errors, not corrections. Downstream strategies (SymSpell, SemanticChecker) provide corrections.
  3. Priority 65 — runs after all cheap rule-based strategies but before the expensive SemanticChecker. Skips positions already flagged.
  4. Everything optional — no model configured = library works exactly as before. No breaking changes.
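The ordering and skip behaviour described above can be sketched as a minimal priority-sorted dispatch loop. The strategy names and predicate functions here are illustrative stand-ins, not the library's actual classes:

```python
def run_pipeline(strategies, tokens):
    """Run (priority, name, check_fn) strategies in ascending priority
    order, skipping positions already flagged by a cheaper strategy."""
    flagged = {}  # position -> name of the strategy that flagged it
    for priority, name, check_fn in sorted(strategies):
        for pos in range(len(tokens)):
            if pos not in flagged and check_fn(tokens[pos]):
                flagged[pos] = name
    return flagged

strategies = [
    (50, "ngram", lambda tok: tok == "bad-ngram"),
    (65, "error-detector", lambda tok: tok.startswith("bad")),
    (70, "semantic", lambda tok: False),  # expensive layer; nothing left here
]
tokens = ["ok", "bad-ngram", "bad-char", "ok"]
print(run_pipeline(tokens=tokens, strategies=strategies))
# position 1 is claimed by the cheaper n-gram layer at priority 50;
# the detector at 65 only adds position 2
```

This is why priority 65 matters: each layer only pays for the positions the cheaper layers left behind.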

Comparison with Semantic Checking

Aspect       Error Detection                   Semantic Checking
Strategy     ErrorDetectionStrategy (65)       SemanticValidationStrategy (70)
Model type   Token classification              Masked Language Model
Inference    Single forward pass               N forward passes (one per word)
Speed        ~10ms per sentence                ~200ms per sentence
Training     Fine-tune on synthetic errors     Train from scratch on corpus
Output       Error flags only                  Error flags + suggestions
Use case     Real-time checking                Thorough analysis
Use both together: ErrorDetector quickly flags suspicious tokens, then SemanticChecker provides corrections for those positions.
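Assuming the detector is cheap and the semantic checker expensive, the combined flow can be sketched like this; the `detect` and `suggest` callables are hypothetical stand-ins for the two models:

```python
def two_stage_check(words, detect, suggest):
    """Stage 1: a cheap detector flags suspicious positions.
    Stage 2: an expensive suggester runs only on flagged positions."""
    flagged = [i for i, w in enumerate(words) if detect(w)]
    return {i: suggest(words, i) for i in flagged}

calls = []

def fake_detect(word):
    return word == "ဗတ်"   # toy rule: one known corruption

def fake_suggest(words, i):
    calls.append(i)        # record each expensive call
    return ["ဖတ်"]         # toy correction

result = two_stage_check(["ကျွန်တော်", "စာ", "ဗတ်", "တယ်"],
                         fake_detect, fake_suggest)
print(result)  # corrections only for the flagged position
print(calls)   # the expensive stage ran once, not once per word
```

For a four-word sentence the expensive stage runs once instead of four times; on mostly correct text the savings grow with sentence length.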

Direct API Access

from myspellchecker.algorithms.error_detector import ErrorDetector

detector = ErrorDetector(
    model_path="./detector/onnx/model.onnx",
    tokenizer_path="./detector/onnx",
    confidence_threshold=0.7,
)

# Detect errors in text
errors = detector.detect_errors(
    text="ကျွန်တော် စာ ဖတ် တယ်",
    words=["ကျွန်တော်", "စာ", "ဖတ်", "တယ်"],
)

for char_pos, word, confidence in errors:
    print(f"Error at position {char_pos}: '{word}' (confidence: {confidence:.2f})")

See Also