Error detection fine-tunes an XLM-RoBERTa model to classify each token as CORRECT or ERROR in a single forward pass (~10ms), making it practical for real-time use. mySpellChecker does not ship with a pre-trained detector — you fine-tune your own on a clean Myanmar corpus using the train-detector pipeline, and the library generates synthetic training errors automatically from its YAML rules.

How It Works

Token Classification

Unlike Semantic Checking (which masks each word and asks “what should go here?”), Error Detection classifies all tokens simultaneously:
Input:  "ကျွန်တော် စာ ဖတ် တယ်"
Output: [CORRECT, CORRECT, ERROR, CORRECT]
                           ↑
                           flagged for review
The model processes the entire sentence in one forward pass, making it ~20x faster than MLM-based semantic checking.
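Independent of the library's internals, the per-token decision can be sketched in plain Python. The logits below are hard-coded for illustration; a real detector would take them from the fine-tuned XLM-RoBERTa model's token-classification head.

```python
import math

def classify_tokens(logits, threshold=0.7):
    """Map per-token (correct, error) logits to labels in one pass.

    A token is flagged as ERROR only when P(ERROR), computed with a
    two-class softmax over the logits, meets the confidence threshold.
    """
    labels = []
    for correct_logit, error_logit in logits:
        p_error = math.exp(error_logit) / (
            math.exp(correct_logit) + math.exp(error_logit)
        )
        labels.append("ERROR" if p_error >= threshold else "CORRECT")
    return labels

# Illustrative logits for "ကျွန်တော် စာ ဖတ် တယ်" (not real model output)
logits = [(4.0, -2.0), (3.5, -1.0), (-1.5, 3.0), (4.2, -2.3)]
print(classify_tokens(logits))  # ['CORRECT', 'CORRECT', 'ERROR', 'CORRECT']
```

Because every token is scored from the same forward pass, the cost is one inference per sentence rather than one per word.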

Training with Synthetic Errors

The model is trained on clean data corrupted with rule-based patterns:
Clean:     "ကျွန်တော် စာ ဖတ် တယ်"
Corrupted: "ကျွန်တော် စာ ဗတ် တယ်"  (ဖ→ဗ similar char swap)
Labels:    [CORRECT,  CORRECT, ERROR,  CORRECT]
Corruption types draw from the library’s existing YAML rules:
  • Homophone swaps (rules/homophones.yaml)
  • Medial confusion (ျ↔ြ, ွ↔ှ)
  • Visually similar character swaps (phonetic_data.py)
  • Character deletions and insertions
  • Inverted typo patterns (rules/typo_corrections.yaml)
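A toy version of the corruption step might look like the following. The one-entry swap table is a hypothetical stand-in for the library's YAML rules and phonetic data, which cover many more patterns.

```python
# Hypothetical stand-in for the YAML-driven similar-character rules
SIMILAR_CHARS = {"ဖ": "ဗ"}

def corrupt_sentence(words):
    """Corrupt the first word containing a swappable character and
    return (corrupted_words, labels) as a detector training example."""
    corrupted, labels = [], []
    done = False
    for word in words:
        new_word = word
        if not done:
            for src, dst in SIMILAR_CHARS.items():
                if src in word:
                    new_word = word.replace(src, dst, 1)
                    done = True
                    break
        labels.append("ERROR" if new_word != word else "CORRECT")
        corrupted.append(new_word)
    return corrupted, labels

words = ["ကျွန်တော်", "စာ", "ဖတ်", "တယ်"]
print(corrupt_sentence(words))
```

Each corrupted sentence comes with its labels for free, which is what makes training on a plain clean corpus possible: no hand-annotated error data is needed.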

Quick Start

Step 1: Train a Detector

# Prepare clean corpus (one sentence per line, UTF-8)
myspellchecker train-detector -i corpus.txt -o ./detector/
Step 2: Use for Spell Checking

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.algorithm_configs import ErrorDetectorConfig

config = SpellCheckerConfig(
    error_detector=ErrorDetectorConfig(
        model_path="./detector/onnx/model.onnx",
        tokenizer_path="./detector/onnx",
    )
)

checker = SpellChecker(config=config)
result = checker.check("မြန်မာစာ")

Configuration

ErrorDetectorConfig

from myspellchecker.core.config.algorithm_configs import ErrorDetectorConfig

config = ErrorDetectorConfig(
    # Model paths (provide paths OR instances, not both)
    model_path="/path/to/model.onnx",
    tokenizer_path="/path/to/tokenizer",

    # Inference settings
    num_threads=1,              # ONNX threads (CPU only)
    confidence_threshold=0.7,   # Minimum P(ERROR) to flag token
    use_pytorch=False,          # Force PyTorch backend

    # Feature toggle
    enabled=True,               # Enable when model is configured
)

Confidence Threshold

The confidence_threshold controls sensitivity:
Threshold   Behavior
0.5         Sensitive: catches more errors but also more false positives
0.7         Balanced (default)
0.9         Conservative: fewer false positives but may miss errors
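The trade-off is easy to see on a fixed set of P(ERROR) scores (the scores below are made up for illustration):

```python
def flag_errors(p_error_scores, threshold):
    """Return indices of tokens whose P(ERROR) meets the threshold."""
    return [i for i, p in enumerate(p_error_scores) if p >= threshold]

scores = [0.10, 0.55, 0.95, 0.72]

print(flag_errors(scores, 0.5))  # sensitive:    [1, 2, 3]
print(flag_errors(scores, 0.7))  # balanced:     [2, 3]
print(flag_errors(scores, 0.9))  # conservative: [2]
```

Lowering the threshold only ever adds flags; it never removes one, so tuning is a one-dimensional precision/recall knob.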

Architecture

Validation Pipeline Position

Error Detection runs at priority 65, between N-gram (50) and Semantic (70):
Priority 10: Tone Validation
Priority 15: Orthography Validation
Priority 20: Syntactic Validation
Priority 30: POS Sequence Validation
Priority 40: Question Structure
Priority 45: Homophone Detection
Priority 50: N-gram Context
Priority 65: Error Detection (AI) ← NEW
Priority 70: Semantic Validation (AI)

Design Decisions

  1. Binary labels only (CORRECT/ERROR) — simpler model, faster training. Error type classification is left to existing rule-based layers.
  2. Empty suggestions — the detector only flags errors, not corrections. Downstream strategies (SymSpell, SemanticChecker) provide corrections.
  3. Priority 65 — runs after all cheap rule-based strategies but before the expensive SemanticChecker. Skips positions already flagged.
  4. Everything optional — no model configured = library works exactly as before. No breaking changes.
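The ordering and skip behaviour described above can be sketched as a minimal priority-sorted dispatch loop. The strategy names and predicate functions here are illustrative stand-ins, not the library's actual classes:

```python
def run_pipeline(strategies, tokens):
    """Run (priority, name, check_fn) strategies in ascending priority
    order, skipping positions already flagged by a cheaper strategy."""
    flagged = {}  # position -> name of the strategy that flagged it
    for priority, name, check_fn in sorted(strategies):
        for pos in range(len(tokens)):
            if pos not in flagged and check_fn(tokens[pos]):
                flagged[pos] = name
    return flagged

strategies = [
    (50, "ngram", lambda tok: tok == "bad-ngram"),
    (65, "error-detector", lambda tok: tok.startswith("bad")),
    (70, "semantic", lambda tok: False),  # expensive layer; nothing left here
]
tokens = ["ok", "bad-ngram", "bad-char", "ok"]
print(run_pipeline(tokens=tokens, strategies=strategies))
# position 1 is claimed by the cheaper n-gram layer at priority 50;
# the detector at 65 only adds position 2
```

This is why priority 65 matters: each layer only pays for the positions the cheaper layers left behind.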

Comparison with Semantic Checking

Aspect       Error Detection                   Semantic Checking
Strategy     ErrorDetectionStrategy (65)       SemanticValidationStrategy (70)
Model type   Token classification              Masked Language Model
Inference    Single forward pass               N forward passes (one per word)
Speed        ~10ms per sentence                ~200ms per sentence
Training     Fine-tune on synthetic errors     Train from scratch on corpus
Output       Error flags only                  Error flags + suggestions
Use case     Real-time checking                Thorough analysis
Use both together: ErrorDetector quickly flags suspicious tokens, then SemanticChecker provides corrections for those positions.
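Assuming the detector is cheap and the semantic checker expensive, the combined flow can be sketched like this; the `detect` and `suggest` callables are hypothetical stand-ins for the two models:

```python
def two_stage_check(words, detect, suggest):
    """Stage 1: a cheap detector flags suspicious positions.
    Stage 2: an expensive suggester runs only on flagged positions."""
    flagged = [i for i, w in enumerate(words) if detect(w)]
    return {i: suggest(words, i) for i in flagged}

calls = []

def fake_detect(word):
    return word == "ဗတ်"   # toy rule: one known corruption

def fake_suggest(words, i):
    calls.append(i)        # record each expensive call
    return ["ဖတ်"]         # toy correction

result = two_stage_check(["ကျွန်တော်", "စာ", "ဗတ်", "တယ်"],
                         fake_detect, fake_suggest)
print(result)  # corrections only for the flagged position
print(calls)   # the expensive stage ran once, not once per word
```

For a four-word sentence the expensive stage runs once instead of four times; on mostly correct text the savings grow with sentence length.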

Direct API Access

from myspellchecker.algorithms.error_detector import ErrorDetector

detector = ErrorDetector(
    model_path="./detector/onnx/model.onnx",
    tokenizer_path="./detector/onnx",
    confidence_threshold=0.7,
)

# Detect errors in text
errors = detector.detect_errors(
    text="ကျွန်တော် စာ ဖတ် တယ်",
    words=["ကျွန်တော်", "စာ", "ဖတ်", "တယ်"],
)

for char_pos, word, confidence in errors:
    print(f"Error at position {char_pos}: '{word}' (confidence: {confidence:.2f})")

See Also