Myanmar text frequently contains “real-word errors”, in which a valid word is used in place of another valid word that looks or sounds similar. Unlike misspellings, both the erroneous word and its correction exist in the dictionary, so a dictionary lookup alone cannot detect them. mySpellChecker uses three complementary strategies to catch these errors, each with a different cost/accuracy trade-off.

Overview

Confusable detection operates through three validation strategies that run in priority order:
| Strategy | Priority | Method | Speed | Requires |
| --- | --- | --- | --- | --- |
| StatisticalConfusable | 24 | Bigram ratio comparison | ~0.3ms | Dictionary DB |
| ConfusableCompoundClassifier | 47 | MLP binary classifier (ONNX) | ~1ms | ONNX model |
| ConfusableSemantic | 48 | MLM logit comparison | ~15ms | Semantic model |
Each strategy targets different confusable types and operates independently, so you can enable any combination based on your accuracy/speed requirements.
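To make the ordering concrete, here is a minimal sketch that sorts the enabled strategies by the priority numbers listed above. The `Strategy` dataclass and `execution_order` helper are hypothetical names used only for illustration, and the sketch assumes that a lower priority number runs earlier; neither is taken from mySpellChecker's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    priority: int  # assumption: a lower number runs earlier


# Priorities copied from the overview; the container itself is hypothetical.
STRATEGIES = [
    Strategy("ConfusableSemantic", 48),
    Strategy("StatisticalConfusable", 24),
    Strategy("ConfusableCompoundClassifier", 47),
]


def execution_order(strategies, enabled):
    """Return the names of the enabled strategies, lowest priority number first."""
    active = [s for s in strategies if s.name in enabled]
    return [s.name for s in sorted(active, key=lambda s: s.priority)]


# Enabling only the statistical and semantic strategies:
order = execution_order(STRATEGIES, {"ConfusableSemantic", "StatisticalConfusable"})
```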

Confusable Sources

Confusable pairs come from two sources:
  1. Database (confusable_pairs table): ~21K pairs mined from corpus data during the enrichment pipeline. Covers aspiration swaps, medial confusion, nasal endings, and tone mark variants.
  2. YAML (rules/confusable_pairs.yaml): Curated fallback pairs that corpus mining cannot discover.
# Confusable pairs are loaded automatically when using SQLiteProvider
from myspellchecker.providers import SQLiteProvider

provider = SQLiteProvider(database_path="path/to/dictionary.db")
# provider.get_confusable_pairs("ကြောင်း") → {"ကျောင်း", ...}

Statistical Confusable Strategy

The fastest confusable detector. Compares bidirectional bigram probabilities to determine if a confusable variant fits the context better than the current word.

How It Works

  1. For each word, look up known confusable variants from the database
  2. Compare bidirectional bigram ratios:
    • Left context: P(variant | previous_word) vs P(word | previous_word)
    • Right context: P(next_word | variant) vs P(next_word | word)
  3. If the combined ratio exceeds the threshold, flag as confusable error
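The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: the toy probabilities, the smoothing floor, and the use of a geometric mean to combine the left and right ratios are all assumptions, since the text does not say how the two directions are combined. Only the default threshold of 5.0 comes from this page.

```python
# Toy bigram probabilities P(second | first); the numbers are illustrative only.
BIGRAM_PROB = {
    ("prev", "word"): 0.001,
    ("prev", "variant"): 0.02,
    ("word", "next"): 0.002,
    ("variant", "next"): 0.03,
}

FLOOR = 1e-6  # assumed smoothing floor so unseen bigrams never divide by zero


def bigram(first, second):
    """Look up P(second | first), falling back to the smoothing floor."""
    return max(BIGRAM_PROB.get((first, second), 0.0), FLOOR)


def combined_ratio(prev_word, word, variant, next_word):
    """Geometric mean of the left and right context ratios (assumed combination)."""
    left = bigram(prev_word, variant) / bigram(prev_word, word)
    right = bigram(variant, next_word) / bigram(word, next_word)
    return (left * right) ** 0.5


def is_confusable_error(prev_word, word, variant, next_word, threshold=5.0):
    """Flag the word when the variant fits the context better by the threshold."""
    return combined_ratio(prev_word, word, variant, next_word) > threshold
```

With these toy numbers the variant is 20 times more likely after the previous word and 15 times more likely before the next word, so the combined ratio is well above 5.0 and the word is flagged. Swapping the roles of word and variant yields a ratio far below the threshold.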

Configuration

from myspellchecker.core.config import SpellCheckerConfig, ValidationConfig

# Statistical confusable is enabled by default when use_context_checker=True
config = SpellCheckerConfig(
    use_context_checker=True
)
The strategy runs at priority 24 (within the structural phase), so it executes even when the fast-path is enabled. This ensures confusable errors on “structurally clean” text are not skipped.

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| threshold | 5.0 | Minimum bidirectional bigram ratio to trigger detection |
| confidence | 0.85 | Confidence score assigned to detected errors |

Confusable Compound Classifier

An MLP binary classifier (ONNX) that detects confusable pairs and broken compounds using 22 extracted features including frequency, N-gram, PMI, POS tags, and morphological patterns.

Features Used

The classifier extracts features such as:
  • Word and variant frequencies (log-scaled)
  • Bigram probabilities in both directions
  • PMI (Pointwise Mutual Information) with neighbors
  • POS tag compatibility
  • Morphological pattern indicators (title suffixes, compound markers)
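Two of the listed feature families, log-scaled frequency and PMI, can be sketched in a few lines. The function names, the zero-count fallback, and the three-element feature vector are assumptions made for illustration; the actual classifier uses 22 features whose exact definitions are not given here.

```python
import math


def log_frequency(count):
    """Log-scaled frequency; log1p keeps a zero count finite."""
    return math.log1p(count)


def pmi(pair_count, count_a, count_b, total):
    """Pointwise mutual information: log(P(a, b) / (P(a) * P(b)))."""
    if min(pair_count, count_a, count_b) <= 0:
        return 0.0  # assumed fallback for unseen events
    # Algebraically equal to log((pair/total) / ((a/total) * (b/total))).
    return math.log(pair_count * total / (count_a * count_b))


def extract_features(word_count, variant_count, pair_count, neighbor_count, total):
    """Build a tiny, hypothetical feature vector for one word/neighbor pair."""
    return [
        log_frequency(word_count),
        log_frequency(variant_count),
        pmi(pair_count, word_count, neighbor_count, total),
    ]
```

A PMI of zero means the word and its neighbor co-occur exactly as often as chance predicts; positive values indicate a stronger-than-chance association.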

Configuration

# Requires an ONNX classifier model
# The model path is configured via the builder or config
from myspellchecker.core import SpellCheckerBuilder

checker = (
    SpellCheckerBuilder()
    .with_confusable_classifier("path/to/classifier.onnx")
    .build()
)

Confusable Semantic Strategy (MLM)

The most accurate but slowest confusable detector. Uses a masked language model to compare the contextual fit of the current word against its confusable variants.

How It Works

  1. For each valid word, generate confusable variants (character substitutions, medial swaps)
  2. Filter variants to valid dictionary words
  3. Use MLM predict_mask() to get logits for both the current word and the best variant
  4. If the logit difference exceeds the threshold, flag as confusable error
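The comparison in steps 2 to 4 can be sketched as below. The real `predict_mask()` signature is not shown on this page, so `fake_predict_mask` is a hypothetical stand-in that reads logits from a fixed toy table; `suggest_variant` and the logit values are likewise assumptions. Only the default threshold of 2.0 is taken from this page.

```python
CURRENT_WORD = "ကြောင်း"  # the word as written in the input
VARIANT = "ကျောင်း"  # its confusable variant

# Made-up logits standing in for real model output.
TOY_LOGITS = {CURRENT_WORD: 3.0, VARIANT: 7.5}


def fake_predict_mask(tokens, mask_index, candidates):
    """Hypothetical stand-in for an MLM: return one logit per candidate."""
    return {c: TOY_LOGITS.get(c, 0.0) for c in candidates}


def suggest_variant(tokens, index, variants, dictionary, logit_threshold=2.0):
    """Return the best variant if it beats the current word by the threshold."""
    word = tokens[index]
    # Step 2: keep only variants that are valid dictionary words.
    valid = [v for v in variants if v in dictionary and v != word]
    if not valid:
        return None
    # Step 3: score the current word and every valid variant at the masked slot.
    logits = fake_predict_mask(tokens, index, [word] + valid)
    best = max(valid, key=lambda v: logits[v])
    # Step 4: flag only when the difference exceeds the threshold.
    if logits[best] - logits[word] > logit_threshold:
        return best
    return None
```

With the toy logits the difference is 4.5, which exceeds the default threshold, so the variant is returned as the suggestion.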

Example

Input:  ကျွန်တော် ကြောင်းကို သွားတယ်။
                  ^^^^^^^^
- "ကြောင်း" (cat) is valid — passes all rule-based checks
- MLM predicts "ကျောင်း" (school) with much higher logit in "went to [X]" context
- logit_diff exceeds threshold → flagged as confusable_error
- Suggestion: ကျောင်း (school)

Configuration

from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig

# Requires semantic model to be loaded
config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model_path="path/to/semantic-model",
        enabled=True,
    )
)

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| logit_threshold | 2.0 | Minimum logit difference to trigger detection |
| confidence | 0.85 | Confidence score for confusable errors |

Guards and Filters

The strategy applies several guards to reduce false positives:
  • Exempt pairs: Known pairs that should not trigger (e.g., particles with overlapping usage)
  • Variant blocklist: Specific variants excluded from detection
  • Medial-only pairs: Pairs differing only in medial consonants use adjusted thresholds
  • Tone-only pairs: Pairs differing only in tone marks use higher thresholds
  • DB suppression: Pairs explicitly suppressed in the database
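One way to picture how these guards interact is a function that either skips a pair or returns the threshold to apply to it. Everything in this sketch is hypothetical: the placeholder sets, the function name, and both multipliers. In particular, the text says only that medial-only pairs use "adjusted" thresholds, so the direction and size of that factor are placeholders.

```python
# All sets and factors below are hypothetical placeholders.
EXEMPT_PAIRS = {frozenset({"particle_a", "particle_b"})}
VARIANT_BLOCKLIST = {"blocked_variant"}
TONE_ONLY_FACTOR = 1.5  # assumed value; the page says only "higher thresholds"
MEDIAL_ONLY_FACTOR = 1.25  # assumed value; the page says only "adjusted"


def effective_threshold(word, variant, base=2.0, *, tone_only=False,
                        medial_only=False, db_suppressed=False):
    """Return the threshold to apply, or None when the pair must be skipped."""
    if frozenset({word, variant}) in EXEMPT_PAIRS:
        return None  # exempt pair
    if variant in VARIANT_BLOCKLIST or db_suppressed:
        return None  # blocklisted variant or pair suppressed in the database
    if tone_only:
        return base * TONE_ONLY_FACTOR
    if medial_only:
        return base * MEDIAL_ONLY_FACTOR
    return base
```

Returning `None` for skipped pairs keeps the hard filters separate from the threshold adjustments, so a caller can bail out before running the model at all.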

Integration with SpellChecker

All confusable strategies are automatically configured when using the SpellCheckerBuilder:
from myspellchecker.core import SpellCheckerBuilder, ConfigPresets

checker = (
    SpellCheckerBuilder()
    .with_config(ConfigPresets.ACCURATE)
    .build()
)

result = checker.check("ကျွန်တော် ကြောင်းကို သွားတယ်။")

for error in result.errors:
    if error.error_type == "confusable_error":
        print(f"{error.text} → {error.suggestions[0]}")

Performance

| Strategy | Latency | Memory | Accuracy |
| --- | --- | --- | --- |
| Statistical | ~0.3ms/word | Minimal | Good for high-frequency pairs |
| MLP Classifier | ~1ms/word | ~5MB model | Good for compound confusables |
| MLM Semantic | ~15ms/word | ~71MB model | Best for context-dependent pairs |
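For a rough latency budget, the per-word figures above can be multiplied out. The sketch below assumes every enabled strategy runs on every word, which gives an upper bound; in practice words without confusable variants are presumably cheaper. The dictionary keys and the helper name are hypothetical.

```python
# Approximate per-word latencies in milliseconds, taken from the table above.
LATENCY_MS = {"statistical": 0.3, "mlp": 1.0, "mlm": 15.0}


def estimated_latency_ms(word_count, enabled):
    """Upper-bound estimate: every enabled strategy runs on every word."""
    return word_count * sum(LATENCY_MS[name] for name in enabled)


# A 20-word sentence with all three strategies: 20 * (0.3 + 1.0 + 15.0) ms.
all_three = estimated_latency_ms(20, ["statistical", "mlp", "mlm"])
# The same sentence with only the statistical strategy: 20 * 0.3 ms.
statistical_only = estimated_latency_ms(20, ["statistical"])
```

Under these assumptions the MLM strategy dominates the total, so disabling it is the main lever when latency matters more than accuracy on context-dependent pairs.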

See Also