Myanmar text frequently contains “real-word errors” where one valid word is confused with another visually or phonetically similar word. Unlike misspellings, both the error and the correction exist in the dictionary, making detection much harder. mySpellChecker uses three complementary strategies to catch these errors at different cost/accuracy trade-offs.
Overview
Confusable detection operates through three validation strategies that run in priority order:
| Strategy | Priority | Method | Speed | Requires |
|---|---|---|---|---|
| StatisticalConfusable | 24 | Bigram ratio comparison | ~0.3ms | Dictionary DB |
| ConfusableCompoundClassifier | 47 | MLP binary classifier (ONNX) | ~1ms | ONNX model |
| ConfusableSemantic | 48 | MLM logit comparison | ~15ms | Semantic model |
Each strategy targets different confusable types and operates independently, so you can enable any combination based on your accuracy/speed requirements.
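As a rough illustration of independent strategies running in priority order, here is a minimal sketch. The `Strategy` dataclass and `run_in_priority_order` function are hypothetical stand-ins and not part of mySpellChecker, and the sketch assumes that a lower priority number runs first.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Strategy:
    """Hypothetical stand-in for a validation strategy; not the library's class."""
    name: str
    priority: int
    check: Callable[[List[str]], List[str]]  # returns the words it flags


def run_in_priority_order(
    strategies: List[Strategy], words: List[str]
) -> List[Tuple[str, str]]:
    """Run each enabled strategy, lowest priority number first (assumed ordering)."""
    findings = []
    for strategy in sorted(strategies, key=lambda s: s.priority):
        findings.extend((strategy.name, word) for word in strategy.check(words))
    return findings


# Enable only two of the three strategies; the checks are dummy lambdas.
enabled = [
    Strategy("ConfusableSemantic", 48, lambda ws: [w for w in ws if w == "b"]),
    Strategy("StatisticalConfusable", 24, lambda ws: [w for w in ws if w == "a"]),
]
print(run_in_priority_order(enabled, ["a", "b"]))
```

Because each strategy only receives the word list and returns its own findings, dropping one from `enabled` does not affect the others.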
Confusable Sources
Confusable pairs come from two sources:
- Database (confusable_pairs table): ~21K pairs mined from corpus data during the enrichment pipeline. Covers aspiration swaps, medial confusion, nasal endings, and tone mark variants.
- YAML (rules/confusable_pairs.yaml): Curated fallback pairs that corpus mining cannot discover.
# Confusable pairs are loaded automatically when using SQLiteProvider
from myspellchecker.providers import SQLiteProvider
provider = SQLiteProvider(database_path="path/to/dictionary.db")
# provider.get_confusable_pairs("ကြောင်း") → {"ကျောင်း", ...}
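To make the two-source idea concrete, the sketch below merges mined pairs and curated fallback pairs into one lookup. It is an assumption-laden illustration: `build_confusable_lookup` is a hypothetical helper, the inputs are plain tuples instead of a real database or YAML file, and treating every pair as symmetric is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def build_confusable_lookup(
    db_pairs: Iterable[Pair], yaml_pairs: Iterable[Pair]
) -> Dict[str, Set[str]]:
    """Merge both sources into a word -> variants map (hypothetical helper)."""
    lookup: Dict[str, Set[str]] = defaultdict(set)
    for source in (db_pairs, yaml_pairs):
        for first, second in source:
            if first == second:
                continue  # a word is never confusable with itself
            # Assumption: pairs are symmetric, so index both directions.
            lookup[first].add(second)
            lookup[second].add(first)
    return dict(lookup)


# Toy data: one mined pair and one curated fallback pair.
mined = [("ကြောင်း", "ကျောင်း")]
curated = [("ကြောင်း", "ကြောင်")]
lookup = build_confusable_lookup(mined, curated)
```

Here `lookup["ကြောင်း"]` would contain both variants, one from each source.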
Statistical Confusable Strategy
The fastest confusable detector. Compares bidirectional bigram probabilities to determine if a confusable variant fits the context better than the current word.
How It Works
- For each word, look up known confusable variants from the database
- Compare bidirectional bigram ratios:
- Left context: P(variant | previous_word) vs P(word | previous_word)
- Right context: P(next_word | variant) vs P(next_word | word)
- If the combined ratio exceeds the threshold, flag as confusable error
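The steps above can be sketched as follows. This is a minimal illustration and not the library's implementation: the probability table is made up, the floor for unseen bigrams is an assumption, and combining the left and right ratios with a geometric mean is one plausible choice that the documentation does not specify.

```python
import math

PREV, WORD, VARIANT, NEXT = "ကျွန်တော်", "ကြောင်း", "ကျောင်း", "ကို"

# Hypothetical bigram probabilities P(second | first); not real corpus values.
BIGRAM_PROB = {
    (PREV, VARIANT): 0.020,
    (PREV, WORD): 0.001,
    (VARIANT, NEXT): 0.010,
    (WORD, NEXT): 0.002,
}
FLOOR = 1e-6  # assumed floor for unseen bigrams, avoids division by zero


def bigram(first: str, second: str) -> float:
    return BIGRAM_PROB.get((first, second), FLOOR)


def combined_ratio(prev_word: str, word: str, next_word: str, variant: str) -> float:
    """Geometric mean of the left-context and right-context ratios (assumed)."""
    left = bigram(prev_word, variant) / bigram(prev_word, word)
    right = bigram(variant, next_word) / bigram(word, next_word)
    return math.sqrt(left * right)


def is_confusable_error(prev_word, word, next_word, variant, threshold=5.0) -> bool:
    """Flag the word when the variant fits the context better by the threshold."""
    return combined_ratio(prev_word, word, next_word, variant) > threshold
```

With these toy numbers the left ratio is 20 and the right ratio is 5, so the variant is flagged at the default threshold of 5.0, while the reverse direction is not.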
Configuration
from myspellchecker.core.config import SpellCheckerConfig, ValidationConfig
# Statistical confusable is enabled by default when use_context_checker=True
config = SpellCheckerConfig(
    use_context_checker=True,
)
The strategy runs at priority 24 (within the structural phase), so it executes even when the fast-path is enabled. This ensures confusable errors on “structurally clean” text are not skipped.
Parameters
| Parameter | Default | Description |
|---|---|---|
| threshold | 5.0 | Minimum bidirectional bigram ratio to trigger detection |
| confidence | 0.85 | Confidence score assigned to detected errors |
Confusable Compound Classifier
An MLP binary classifier (ONNX) that detects confusable pairs and broken compounds using 22 extracted features including frequency, N-gram, PMI, POS tags, and morphological patterns.
Features Used
The classifier extracts features such as:
- Word and variant frequencies (log-scaled)
- Bigram probabilities in both directions
- PMI (Pointwise Mutual Information) with neighbors
- POS tag compatibility
- Morphological pattern indicators (title suffixes, compound markers)
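Two of the listed features are simple enough to sketch directly. The code below is a hedged illustration with hypothetical function names; it covers only log-scaled frequency and PMI, not the full 22-feature vector, and uses the standard PMI definition computed from raw counts.

```python
import math
from typing import List


def log_scaled(count: int) -> float:
    """Log-scale a raw frequency; log1p keeps a zero count at 0.0."""
    return math.log1p(count)


def pmi(pair_count: int, left_count: int, right_count: int, total: int) -> float:
    """Pointwise mutual information in bits: log2(P(x, y) / (P(x) * P(y)))."""
    if pair_count == 0 or left_count == 0 or right_count == 0:
        return 0.0  # assumed fallback when the pair or a word is unseen
    return math.log2((pair_count * total) / (left_count * right_count))


def partial_features(
    word_count: int, variant_count: int,
    pair_count: int, neighbor_count: int, total: int,
) -> List[float]:
    """A three-feature slice of what a classifier input might look like."""
    return [
        log_scaled(word_count),
        log_scaled(variant_count),
        pmi(pair_count, word_count, neighbor_count, total),
    ]
```

For example, a pair seen 8 times, with word counts of 16 and 32 in a corpus of 1024 tokens, has a PMI of 4 bits.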
Configuration
# Requires an ONNX classifier model
# The model path is configured via the builder or config
from myspellchecker.core import SpellCheckerBuilder
checker = (
    SpellCheckerBuilder()
    .with_confusable_classifier("path/to/classifier.onnx")
    .build()
)
Confusable Semantic Strategy (MLM)
The most accurate but slowest confusable detector. Uses a masked language model to compare the contextual fit of the current word against its confusable variants.
How It Works
- For each valid word, generate confusable variants (character substitutions, medial swaps)
- Filter variants to valid dictionary words
- Use MLM predict_mask() to get logits for both the current word and the best variant
- If the logit difference exceeds the threshold, flag as confusable error
Example
Input: ကျွန်တော် ကြောင်းကို သွားတယ်။
^^^^^^^^
- "ကြောင်း" (cat) is valid — passes all rule-based checks
- MLM predicts "ကျောင်း" (school) with much higher logit in "went to [X]" context
- logit_diff exceeds threshold → flagged as confusable_error
- Suggestion: ကျောင်း (school)
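The decision step in this example can be sketched without loading a model. In the snippet below, the `logits` dictionary stands in for whatever the masked language model returns for the masked position; its values are invented, and `pick_confusable` is a hypothetical helper, not a mySpellChecker function.

```python
from typing import Dict, Iterable, Optional, Tuple


def pick_confusable(
    word: str,
    variants: Iterable[str],
    logits: Dict[str, float],
    logit_threshold: float = 2.0,
) -> Optional[Tuple[str, float]]:
    """Return (best_variant, logit_diff) when the variant beats the word by the threshold."""
    candidates = [v for v in variants if v != word and v in logits]
    if word not in logits or not candidates:
        return None  # nothing to compare against
    best = max(candidates, key=lambda v: logits[v])
    diff = logits[best] - logits[word]
    return (best, diff) if diff > logit_threshold else None


# Invented logits for the masked slot in the example sentence.
toy_logits = {"ကြောင်း": 3.1, "ကျောင်း": 7.4}
hit = pick_confusable("ကြောင်း", ["ကျောင်း"], toy_logits)
```

With these toy values the difference is about 4.3, which exceeds the default `logit_threshold` of 2.0, so the variant would be suggested.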
Configuration
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig
# Requires semantic model to be loaded
config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model_path="path/to/semantic-model",
        enabled=True,
    )
)
Parameters
| Parameter | Default | Description |
|---|---|---|
| logit_threshold | 2.0 | Minimum logit difference to trigger detection |
| confidence | 0.85 | Confidence score for confusable errors |
Guards and Filters
The strategy applies several guards to reduce false positives:
- Exempt pairs: Known pairs that should not trigger (e.g., particles with overlapping usage)
- Variant blocklist: Specific variants excluded from detection
- Medial-only pairs: Pairs differing only in medial consonants use adjusted thresholds
- Tone-only pairs: Pairs differing only in tone marks use higher thresholds
- DB suppression: Pairs explicitly suppressed in the database
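One way to picture how these guards might combine is shown below. Everything here is an assumption made for illustration: the function names, the multipliers, and the rule that a suppressed pair yields no threshold at all. Only the Unicode code points for the medials (U+103B to U+103E) and the tone marks (U+1037, U+1038) are standard.

```python
from typing import FrozenSet, Optional

MEDIALS = "\u103b\u103c\u103d\u103e"  # medial ya, ra, wa, ha
TONE_MARKS = "\u1037\u1038"           # dot below, visarga

# Hypothetical multipliers; the real adjustments are not documented here.
MEDIAL_ONLY_FACTOR = 1.25
TONE_ONLY_FACTOR = 1.5

WORD_RA = "\u1000\u103c\u1031\u102c\u1004\u103a\u1038"  # ကြောင်း
WORD_YA = "\u1000\u103b\u1031\u102c\u1004\u103a\u1038"  # ကျောင်း


def _strip(text: str, chars: str) -> str:
    return "".join(ch for ch in text if ch not in chars)


def pair_kind(word: str, variant: str) -> str:
    """Classify a pair as tone_only, medial_only, or other."""
    if _strip(word, TONE_MARKS) == _strip(variant, TONE_MARKS):
        return "tone_only"
    if _strip(word, MEDIALS) == _strip(variant, MEDIALS):
        return "medial_only"
    return "other"


def effective_threshold(
    word: str,
    variant: str,
    base: float = 2.0,
    exempt_pairs: FrozenSet[FrozenSet[str]] = frozenset(),
    variant_blocklist: FrozenSet[str] = frozenset(),
    db_suppressed: FrozenSet[FrozenSet[str]] = frozenset(),
) -> Optional[float]:
    """Return the threshold to apply, or None when a guard suppresses the pair."""
    pair = frozenset((word, variant))
    if pair in exempt_pairs or pair in db_suppressed or variant in variant_blocklist:
        return None
    kind = pair_kind(word, variant)
    if kind == "tone_only":
        return base * TONE_ONLY_FACTOR
    if kind == "medial_only":
        return base * MEDIAL_ONLY_FACTOR
    return base
```

Under these assumed factors, the ကြောင်း / ကျောင်း pair counts as medial-only and would need a logit difference above 2.5 instead of 2.0.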
Integration with SpellChecker
All confusable strategies are automatically configured when using the SpellCheckerBuilder:
from myspellchecker.core import SpellCheckerBuilder, ConfigPresets
checker = (
    SpellCheckerBuilder()
    .with_config(ConfigPresets.ACCURATE)
    .build()
)
result = checker.check("ကျွန်တော် ကြောင်းကို သွားတယ်။")
for error in result.errors:
    # Print only confusable errors that carry at least one suggestion
    if error.error_type == "confusable_error" and error.suggestions:
        print(f"{error.text} → {error.suggestions[0]}")
| Strategy | Latency | Memory | Accuracy |
|---|---|---|---|
| Statistical | ~0.3ms/word | Minimal | Good for high-frequency pairs |
| MLP Classifier | ~1ms/word | ~5MB model | Good for compound confusables |
| MLM Semantic | ~15ms/word | ~71MB model | Best for context-dependent pairs |
See Also