Overview
Confusable detection operates through three validation strategies that run in priority order:

| Strategy | Priority | Method | Speed | Requires |
|---|---|---|---|---|
| StatisticalConfusable | 24 | Bigram ratio comparison | ~0.3ms | Dictionary DB |
| ConfusableCompoundClassifier | 47 | MLP binary classifier (ONNX) | ~1ms | ONNX model |
| ConfusableSemantic | 48 | MLM logit comparison | ~15ms | Semantic model |
Confusable Sources
Confusable pairs come from two sources:

- Database (`confusable_pairs` table): ~21K pairs mined from corpus data during the enrichment pipeline. Covers aspiration swaps, medial confusion, nasal endings, and tone mark variants.
- YAML (`rules/confusable_pairs.yaml`): Curated fallback pairs that corpus mining cannot discover.
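
As a rough illustration of how the two sources could be combined into one lookup, the sketch below merges rows from a database table with a list of curated pairs. The column names (`word`, `variant`), the `load_confusables` helper, and the ASCII placeholder tokens are assumptions made for this example and are not taken from the library; the curated list stands in for pairs already parsed from the YAML file.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def load_confusables(
    conn: sqlite3.Connection,
    yaml_pairs: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Merge mined database pairs with curated fallback pairs."""
    variants: Dict[str, Set[str]] = defaultdict(set)
    # Assumed schema: confusable_pairs(word TEXT, variant TEXT).
    for word, variant in conn.execute("SELECT word, variant FROM confusable_pairs"):
        variants[word].add(variant)
    # Curated pairs fill gaps; duplicates of mined pairs collapse in the set.
    for word, variant in yaml_pairs:
        variants[word].add(variant)
    return dict(variants)


# Placeholder ASCII tokens stand in for real words.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE confusable_pairs (word TEXT, variant TEXT)")
conn.executemany(
    "INSERT INTO confusable_pairs VALUES (?, ?)",
    [("ka", "kha"), ("ka", "ga")],
)
curated = [("ma", "mya"), ("ka", "kha")]  # as if parsed from the YAML file
pairs = load_confusables(conn, curated)
```

Storing variants in a set per word means a pair present in both sources is only kept once.
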
Statistical Confusable Strategy
The fastest confusable detector. Compares bidirectional bigram probabilities to determine if a confusable variant fits the context better than the current word.

How It Works
- For each word, look up known confusable variants from the database
- Compare bidirectional bigram ratios:
  - Left context: P(variant | previous_word) vs P(word | previous_word)
  - Right context: P(next_word | variant) vs P(next_word | word)
- If the combined ratio exceeds the threshold, flag as confusable error
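
The steps above can be sketched as follows. This is a minimal illustration and not the library's implementation: it assumes the combined ratio is the product of the left and right ratios, that a missing context contributes a neutral ratio of 1.0, and that unseen bigrams are floored at a small constant. The names `prob`, `combined_ratio`, `is_confusable_error`, and `FLOOR` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Conditional bigram table: (first, second) -> P(second | first).
Bigrams = Dict[Tuple[str, str], float]

FLOOR = 1e-9  # hypothetical smoothing floor to avoid division by zero


def prob(bigrams: Bigrams, first: str, second: str) -> float:
    """Look up P(second | first), floored for unseen bigrams."""
    return max(bigrams.get((first, second), 0.0), FLOOR)


def combined_ratio(
    bigrams: Bigrams,
    word: str,
    variant: str,
    prev_word: Optional[str] = None,
    next_word: Optional[str] = None,
) -> float:
    """Multiply the left and right variant/word ratios (assumed combination)."""
    ratio = 1.0
    if prev_word is not None:
        # Left context: P(variant | previous_word) vs P(word | previous_word)
        ratio *= prob(bigrams, prev_word, variant) / prob(bigrams, prev_word, word)
    if next_word is not None:
        # Right context: P(next_word | variant) vs P(next_word | word)
        ratio *= prob(bigrams, variant, next_word) / prob(bigrams, word, next_word)
    return ratio


def is_confusable_error(
    bigrams: Bigrams,
    word: str,
    variant: str,
    prev_word: Optional[str] = None,
    next_word: Optional[str] = None,
    threshold: float = 5.0,
) -> bool:
    """Flag the word when the variant fits the context much better."""
    return combined_ratio(bigrams, word, variant, prev_word, next_word) > threshold


# Toy probabilities: "y" fits between "a" and "b" far better than "x".
bigrams = {
    ("a", "x"): 0.001, ("a", "y"): 0.02,
    ("x", "b"): 0.002, ("y", "b"): 0.03,
}
flagged = is_confusable_error(bigrams, "x", "y", prev_word="a", next_word="b")
```

With the toy table, the left ratio is about 20 and the right ratio about 15, so the combined ratio is far above the default threshold of 5.0 and `flagged` is true. Swapping the roles of `x` and `y` gives a ratio well below 1, so nothing is flagged.
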
Configuration
Parameters
| Parameter | Default | Description |
|---|---|---|
| `threshold` | 5.0 | Minimum bidirectional bigram ratio to trigger detection |
| `confidence` | 0.85 | Confidence score assigned to detected errors |
Confusable Compound Classifier
An MLP binary classifier (ONNX) that detects confusable pairs and broken compounds using 22 extracted features including frequency, N-gram, PMI, POS tags, and morphological patterns.

Features Used
The classifier extracts features such as:

- Word and variant frequencies (log-scaled)
- Bigram probabilities in both directions
- PMI (Pointwise Mutual Information) with neighbors
- POS tag compatibility
- Morphological pattern indicators (title suffixes, compound markers)
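
To make two of the feature families concrete, the sketch below computes log-scaled frequencies and PMI from raw counts. It covers only a small subset of the 22 features, and the exact definitions used by the classifier are not documented here: the use of `log1p` for scaling, base-2 PMI, the zero fallback for unseen pairs, and the function names are all assumptions.

```python
import math
from typing import List


def log_freq(count: int) -> float:
    """Log-scale a raw frequency; log1p keeps a zero count at 0.0."""
    return math.log1p(count)


def pmi(pair_count: int, count_a: int, count_b: int, total: int) -> float:
    """PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), estimated from counts."""
    if min(pair_count, count_a, count_b, total) <= 0:
        return 0.0  # assumed fallback when the pair or a word is unseen
    return math.log2((pair_count * total) / (count_a * count_b))


def extract_features(
    word_count: int,
    variant_count: int,
    neighbor_count: int,
    word_neighbor_count: int,
    variant_neighbor_count: int,
    total: int,
) -> List[float]:
    """Build a partial feature vector for one (word, variant, neighbor) triple."""
    return [
        log_freq(word_count),
        log_freq(variant_count),
        pmi(word_neighbor_count, word_count, neighbor_count, total),
        pmi(variant_neighbor_count, variant_count, neighbor_count, total),
    ]


features = extract_features(
    word_count=100,
    variant_count=100,
    neighbor_count=100,
    word_neighbor_count=10,
    variant_neighbor_count=20,
    total=1000,
)
```

In this toy case the word co-occurs with its neighbor exactly as often as chance predicts (PMI 0), while the variant co-occurs twice as often (PMI 1), which is the kind of contrast a classifier can pick up.
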
Configuration
Confusable Semantic Strategy (MLM)
The most accurate but slowest confusable detector. Uses a masked language model to compare the contextual fit of the current word against its confusable variants.

How It Works
- For each valid word, generate confusable variants (character substitutions, medial swaps)
- Filter variants to valid dictionary words
- Use MLM `predict_mask()` to get logits for both the current word and the best variant
- If the logit difference exceeds the threshold, flag as confusable error
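
The flow above can be sketched without a real model by passing the masked prediction in as a callable. The signature used here for `predict_mask` (tokens and a mask index in, a token-to-logit mapping out) is an assumption for illustration and may not match the library's actual method; `best_confusable`, `fake_predict_mask`, and `MISSING_LOGIT` are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

# Assumed shape: (tokens, masked_index) -> {candidate token: logit}
PredictMask = Callable[[List[str], int], Dict[str, float]]

MISSING_LOGIT = -1e9  # hypothetical score for tokens the model did not return


def best_confusable(
    tokens: List[str],
    index: int,
    variants: Iterable[str],
    dictionary: Set[str],
    predict_mask: PredictMask,
    logit_threshold: float = 2.0,
) -> Optional[Tuple[str, float]]:
    """Return (variant, logit_diff) if a variant beats the word by the threshold."""
    word = tokens[index]
    # Keep only variants that are valid dictionary words.
    candidates = [v for v in variants if v in dictionary and v != word]
    if not candidates:
        return None
    logits = predict_mask(tokens, index)
    word_logit = logits.get(word, MISSING_LOGIT)
    best = max(candidates, key=lambda v: logits.get(v, MISSING_LOGIT))
    diff = logits.get(best, MISSING_LOGIT) - word_logit
    return (best, diff) if diff > logit_threshold else None


def fake_predict_mask(tokens: List[str], index: int) -> Dict[str, float]:
    """Stand-in for a real masked language model."""
    return {"x": 1.0, "y": 4.5, "z": 9.0}


result = best_confusable(
    tokens=["a", "x", "b"],
    index=1,
    variants=["y", "z"],
    dictionary={"x", "y"},  # "z" is not a valid word, so it is filtered out
    predict_mask=fake_predict_mask,
)
```

Here `z` has the highest logit but is dropped by the dictionary filter, so `y` becomes the best variant; its logit exceeds that of `x` by 3.5, which is above the default `logit_threshold` of 2.0.
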
Example
Configuration
Parameters
| Parameter | Default | Description |
|---|---|---|
| `logit_threshold` | 2.0 | Minimum logit difference to trigger detection |
| `confidence` | 0.85 | Confidence score for confusable errors |
Guards and Filters
The strategy applies several guards to reduce false positives:

- Exempt pairs: Known pairs that should not trigger (e.g., particles with overlapping usage)
- Variant blocklist: Specific variants excluded from detection
- Medial-only pairs: Pairs differing only in medial consonants use adjusted thresholds
- Tone-only pairs: Pairs differing only in tone marks use higher thresholds
- DB suppression: Pairs explicitly suppressed in the database
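
One way to picture how these guards interact is as a function that either rejects a pair outright or returns the threshold to apply. The sketch below is an assumption-laden illustration: the `Guards` container, the `effective_threshold` helper, and the specific medial and tone threshold values are hypothetical, and the `medial_only` and `tone_only` flags are taken as inputs because the character-level analysis that would set them is not shown here.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Optional, Set


@dataclass
class Guards:
    exempt_pairs: Set[FrozenSet[str]] = field(default_factory=set)
    variant_blocklist: Set[str] = field(default_factory=set)
    db_suppressed: Set[FrozenSet[str]] = field(default_factory=set)
    medial_threshold: float = 2.5  # hypothetical adjusted value
    tone_threshold: float = 3.0    # hypothetical higher value


def effective_threshold(
    word: str,
    variant: str,
    guards: Guards,
    base: float = 2.0,
    medial_only: bool = False,
    tone_only: bool = False,
) -> Optional[float]:
    """Return the logit threshold for this pair, or None to skip detection."""
    pair = frozenset((word, variant))
    # Exempt and database-suppressed pairs never trigger.
    if pair in guards.exempt_pairs or pair in guards.db_suppressed:
        return None
    # Blocklisted variants are never suggested.
    if variant in guards.variant_blocklist:
        return None
    if tone_only:
        # Tone-only pairs use a higher threshold than the base.
        return max(base, guards.tone_threshold)
    if medial_only:
        return guards.medial_threshold
    return base


guards = Guards(
    exempt_pairs={frozenset(("p1", "p2"))},
    variant_blocklist={"bad"},
)
```

Returning `None` for a guarded pair lets the caller skip the model call entirely, which matters when each MLM prediction costs roughly 15ms.
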
Integration with SpellChecker
All confusable strategies are configured automatically when using the `SpellCheckerBuilder`.

Performance
| Strategy | Latency | Memory | Accuracy |
|---|---|---|---|
| Statistical | ~0.3ms/word | Minimal | Good for high-frequency pairs |
| MLP Classifier | ~1ms/word | ~5MB model | Good for compound confusables |
| MLM Semantic | ~15ms/word | ~71MB model | Best for context-dependent pairs |
See Also
- Validation Strategies — Full strategy pipeline
- Homophones Detection — Sound-alike word detection
- Context Checking — N-gram context validation
- Semantic Checking — AI-powered MLM validation