Confusable Detection - mySpellChecker

Myanmar text frequently contains “real-word errors”, where one valid word is written in place of another that looks or sounds similar, for example ကြောင်း (cat) where ကျောင်း (school) was intended. Unlike ordinary misspellings, both the erroneous word and its correction exist in the dictionary, so a dictionary lookup alone cannot detect them. mySpellChecker uses three complementary strategies that catch these errors at different cost/accuracy trade-offs.

Overview

Confusable detection operates through three validation strategies that run in priority order:

Strategy	Priority	Method	Speed	Requires
StatisticalConfusable	24	Bigram ratio comparison	~0.3ms	Dictionary DB
ConfusableCompoundClassifier	47	MLP binary classifier (ONNX)	~1ms	ONNX model
ConfusableSemantic	48	MLM logit comparison	~15ms	Semantic model

Each strategy targets different confusable types and operates independently, so you can enable any combination based on your accuracy/speed requirements.

Confusable Sources

Confusable pairs come from two sources:

- Database (confusable_pairs table): ~21K pairs mined from corpus data during the enrichment pipeline. Covers aspiration swaps, medial confusion, nasal endings, and tone mark variants.
- YAML (rules/confusable_pairs.yaml): Curated fallback pairs that corpus mining cannot discover.

# Confusable pairs are loaded automatically when using SQLiteProvider
from myspellchecker.providers import SQLiteProvider

provider = SQLiteProvider(database_path="path/to/dictionary.db")
# provider.get_confusable_pairs("ကြောင်း") → {"ကျောင်း", ...}
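As a rough illustration of how two pair sources could feed one lookup, the sketch below merges pair lists into a symmetric index. It is not the library's loader: `build_confusable_index` is a hypothetical helper, the sample pairs are toy data, and both the symmetry of pairs and the simple union of sources are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]

def build_confusable_index(*sources: Iterable[Pair]) -> Dict[str, Set[str]]:
    """Merge pair lists into a word -> variants lookup (hypothetical helper).

    Assumption: a pair (a, b) is treated as symmetric, so each word is
    registered as a variant of the other.
    """
    index: Dict[str, Set[str]] = defaultdict(set)
    for pairs in sources:
        for first, second in pairs:
            if first == second:
                continue  # a word is never a confusable of itself
            index[first].add(second)
            index[second].add(first)
    return dict(index)

# Toy data standing in for the database table and the YAML file.
CAT, SCHOOL = "ကြောင်း", "ကျောင်း"
KA, KAA = "ကာ", "ကား"
db_pairs = [(CAT, SCHOOL)]
yaml_pairs = [(SCHOOL, CAT), (KA, KAA)]  # duplicates collapse in the sets

index = build_confusable_index(db_pairs, yaml_pairs)
```

Storing variants in sets means a pair present in both sources is kept only once.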

Statistical Confusable Strategy

The fastest confusable detector. Compares bidirectional bigram probabilities to determine if a confusable variant fits the context better than the current word.

How It Works

1. For each word, look up known confusable variants from the database
2. Compare bidirectional bigram ratios:
   - Left context: P(variant | previous_word) vs P(word | previous_word)
   - Right context: P(next_word | variant) vs P(next_word | word)
3. If the combined ratio exceeds the threshold, flag as confusable error
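The comparison above can be sketched with toy counts. This is a minimal illustration, not the library's implementation: the counts are invented, add-alpha smoothing is an assumed choice, and multiplying the left and right ratios is only one plausible way to form the "combined ratio", which the text does not define.

```python
from collections import Counter

PREV, CAT, SCHOOL, GO = "ကျွန်တော်", "ကြောင်း", "ကျောင်း", "သွား"

# Toy counts for illustration only; real counts would come from the dictionary DB.
UNIGRAMS = Counter({PREV: 100, CAT: 20, SCHOOL: 50, GO: 80})
BIGRAMS = Counter({
    (PREV, CAT): 1,
    (PREV, SCHOOL): 5,
    (SCHOOL, GO): 30,
    # (CAT, GO) is unseen, so Counter returns 0 for it
})
VOCAB_SIZE = len(UNIGRAMS)
THRESHOLD = 5.0  # default threshold from the Parameters table

def bigram_prob(first: str, second: str, alpha: float = 1.0) -> float:
    """Add-alpha smoothed estimate of P(second | first). Smoothing is an assumption."""
    return (BIGRAMS[(first, second)] + alpha) / (UNIGRAMS[first] + alpha * VOCAB_SIZE)

def combined_ratio(prev_word: str, word: str, next_word: str, variant: str) -> float:
    """Product of the left-context and right-context ratios (assumed combination)."""
    left = bigram_prob(prev_word, variant) / bigram_prob(prev_word, word)
    right = bigram_prob(variant, next_word) / bigram_prob(word, next_word)
    return left * right

ratio = combined_ratio(PREV, CAT, GO, SCHOOL)
is_confusable_error = ratio > THRESHOLD
```

With these toy counts the variant fits both contexts better than the current word, so the ratio clears the threshold. Swapping the roles of word and variant gives the reciprocal, which stays below 1.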

Configuration

from myspellchecker.core.config import SpellCheckerConfig, ValidationConfig

# Statistical confusable is enabled by default when use_context_checker=True
config = SpellCheckerConfig(
    use_context_checker=True
)

The strategy runs at priority 24 (within the structural phase), so it executes even when the fast-path is enabled. This ensures confusable errors on “structurally clean” text are not skipped.

Parameters

Parameter	Default	Description
`threshold`	5.0	Minimum bidirectional bigram ratio to trigger detection
`confidence`	0.85	Confidence score assigned to detected errors

Confusable Compound Classifier

An MLP binary classifier (ONNX) that detects confusable pairs and broken compounds using 22 extracted features including frequency, N-gram, PMI, POS tags, and morphological patterns.

Features Used

The classifier extracts features such as:

- Word and variant frequencies (log-scaled)
- Bigram probabilities in both directions
- PMI (Pointwise Mutual Information) with neighbors
- POS tag compatibility
- Morphological pattern indicators (title suffixes, compound markers)
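To make two of these feature families concrete, the sketch below computes log-scaled frequencies and PMI from raw counts, using the standard definition PMI(x, y) = log(P(x, y) / (P(x) P(y))). The function names, the choice of `log1p`, the zero fallback for unseen pairs, and the four-element vector are illustrative assumptions. The classifier's actual 22 features and their order are not specified here.

```python
import math
from typing import List

def log_frequency(count: int) -> float:
    """Log-scaled frequency; log1p keeps a zero count finite (assumed convention)."""
    return math.log1p(count)

def pmi(pair_count: int, x_count: int, y_count: int,
        total_pairs: int, total_words: int) -> float:
    """Pointwise mutual information: log( P(x, y) / (P(x) * P(y)) ).

    Returns 0.0 when any count is zero. This fallback is an assumption;
    other conventions exist.
    """
    if pair_count == 0 or x_count == 0 or y_count == 0:
        return 0.0
    p_xy = pair_count / total_pairs
    p_x = x_count / total_words
    p_y = y_count / total_words
    return math.log(p_xy / (p_x * p_y))

def context_features(word_count: int, variant_count: int, prev_count: int,
                     prev_word_pairs: int, prev_variant_pairs: int,
                     total_pairs: int, total_words: int) -> List[float]:
    """Illustrative subset of features for one (word, variant) candidate."""
    return [
        log_frequency(word_count),
        log_frequency(variant_count),
        pmi(prev_word_pairs, prev_count, word_count, total_pairs, total_words),
        pmi(prev_variant_pairs, prev_count, variant_count, total_pairs, total_words),
    ]

features = context_features(
    word_count=20, variant_count=50, prev_count=100,
    prev_word_pairs=1, prev_variant_pairs=5,
    total_pairs=1000, total_words=1000,
)
```

A positive PMI means the two words co-occur more often than their individual frequencies would predict, so a gap between the word's PMI and the variant's PMI is a useful signal for a classifier.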

Configuration

# Requires an ONNX classifier model
# The model path is configured via the builder or config
from myspellchecker.core import SpellCheckerBuilder

checker = (
    SpellCheckerBuilder()
    .with_confusable_classifier("path/to/classifier.onnx")
    .build()
)

Confusable Semantic Strategy (MLM)

The most accurate but slowest confusable detector. Uses a masked language model to compare the contextual fit of the current word against its confusable variants.

How It Works

1. For each valid word, generate confusable variants (character substitutions, medial swaps)
2. Filter variants to valid dictionary words
3. Use MLM predict_mask() to get logits for both the current word and the best variant
4. If the logit difference exceeds the threshold, flag as confusable error
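The decision in the last two steps can be sketched as follows. The scoring callable is a hypothetical stand-in for the masked language model: its signature is an assumption and is not the signature of `predict_mask()`. `find_better_variant`, the `[MASK]` token, and the fixed logits in `fake_scores` are also illustrative.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set, Tuple

MASK = "[MASK]"  # placeholder mask token (assumption)

# Hypothetical scorer: (masked tokens, candidate words) -> logit per candidate.
ScoreFn = Callable[[List[str], List[str]], Dict[str, float]]

def find_better_variant(tokens: Sequence[str], index: int, variants: Sequence[str],
                        dictionary: Set[str], score_fn: ScoreFn,
                        logit_threshold: float = 2.0) -> Optional[Tuple[str, float]]:
    """Return (variant, logit_diff) if a valid variant beats the current word."""
    current = tokens[index]
    valid = [v for v in variants if v in dictionary and v != current]
    if not valid:
        return None
    masked = list(tokens)
    masked[index] = MASK  # hide the word under test
    logits = score_fn(masked, [current, *valid])
    best = max(valid, key=lambda v: logits[v])
    diff = logits[best] - logits[current]
    return (best, diff) if diff > logit_threshold else None

CAT, SCHOOL = "ကြောင်း", "ကျောင်း"
tokens = ["ကျွန်တော်", CAT, "ကို", "သွားတယ်"]

def fake_scores(masked_tokens: List[str], candidates: List[str]) -> Dict[str, float]:
    """Stand-in for a real model; returns fixed, made-up logits."""
    table = {CAT: 3.0, SCHOOL: 7.5}
    return {c: table[c] for c in candidates}

result = find_better_variant(tokens, 1, [SCHOOL], {CAT, SCHOOL}, fake_scores)
```

Comparing logits for the same masked position keeps the test relative: the variant must beat the current word by a margin, not merely score well in absolute terms.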

Example

Input:  ကျွန်တော် ကြောင်းကို သွားတယ်။
                  ^^^^^^^^
- "ကြောင်း" (cat) is valid — passes all rule-based checks
- MLM predicts "ကျောင်း" (school) with much higher logit in "went to [X]" context
- logit_diff exceeds threshold → flagged as confusable_error
- Suggestion: ကျောင်း (school)

Configuration

from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig

# Requires semantic model to be loaded
config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model_path="path/to/semantic-model",
        enabled=True,
    )
)

Parameters

Parameter	Default	Description
`logit_threshold`	2.0	Minimum logit difference to trigger detection
`confidence`	0.85	Confidence score for confusable errors

Guards and Filters

The strategy applies several guards to reduce false positives:

- Exempt pairs: Known pairs that should not trigger (e.g., particles with overlapping usage)
- Variant blocklist: Specific variants excluded from detection
- Medial-only pairs: Pairs differing only in medial consonants use adjusted thresholds
- Tone-only pairs: Pairs differing only in tone marks use higher thresholds
- DB suppression: Pairs explicitly suppressed in the database
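The medial-only and tone-only guards depend on recognising how a pair differs. A minimal sketch of that classification is shown below, using the Unicode code points for the Myanmar medial signs (U+103B to U+103E) and for the dot-below and visarga tone marks (U+1037, U+1038). The function name, the category labels, and the order of the checks are assumptions. The threshold adjustments themselves are not shown because their values are not stated.

```python
MEDIALS = {"\u103b", "\u103c", "\u103d", "\u103e"}  # medial ya, ra, wa, ha
TONE_MARKS = {"\u1037", "\u1038"}                    # dot below, visarga

def strip_chars(word: str, chars: set) -> str:
    """Remove every character in `chars` from `word`."""
    return "".join(ch for ch in word if ch not in chars)

def pair_kind(first: str, second: str) -> str:
    """Classify how two words differ (hypothetical helper)."""
    if first == second:
        return "identical"
    if strip_chars(first, TONE_MARKS) == strip_chars(second, TONE_MARKS):
        return "tone_only"
    if strip_chars(first, MEDIALS) == strip_chars(second, MEDIALS):
        return "medial_only"
    return "other"

# Escapes are used so the code points are unambiguous.
CAT = "\u1000\u103c\u1031\u102c\u1004\u103a\u1038"     # ကြောင်း
SCHOOL = "\u1000\u103b\u1031\u102c\u1004\u103a\u1038"  # ကျောင်း
KA = "\u1000\u102c"                                    # ကာ
KAA = "\u1000\u102c\u1038"                             # ကား

kinds = [pair_kind(CAT, SCHOOL), pair_kind(KA, KAA), pair_kind("\u1000", "\u1001")]
```

A guard can then select a per-category threshold from the label, leaving the default threshold for pairs classified as `other`.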

Integration with SpellChecker

The SpellCheckerBuilder configures the confusable strategies automatically. Each strategy still depends on the resource listed in the Requires column of the Overview table:

from myspellchecker.core import SpellCheckerBuilder, ConfigPresets

checker = (
    SpellCheckerBuilder()
    .with_config(ConfigPresets.ACCURATE)
    .build()
)

result = checker.check("ကျွန်တော် ကြောင်းကို သွားတယ်။")

# Report confusable errors; skip any that carry no suggestion
for error in result.errors:
    if error.error_type == "confusable_error" and error.suggestions:
        print(f"{error.text} → {error.suggestions[0]}")

Performance

Strategy	Latency	Memory	Accuracy
Statistical	~0.3ms/word	Minimal	Good for high-frequency pairs
MLP Classifier	~1ms/word	~5MB model	Good for compound confusables
MLM Semantic	~15ms/word	~71MB model	Best for context-dependent pairs
