SpellChecker behavior is controlled by the SpellCheckerConfig object. You can adjust performance thresholds, toggle features, and fine-tune algorithm sensitivity through nested configuration classes.
Usage
The easiest way to configure the spell checker is using ConfigPresets with the SpellCheckerBuilder.
Configuration Presets
| Preset | Description |
|---|---|
| ConfigPresets.DEFAULT | Balanced configuration suitable for most use cases. |
| ConfigPresets.FAST | Optimized for speed. Disables context checking and reduces search depth. |
| ConfigPresets.ACCURATE | Optimized for quality. Max edit distance 3, strict thresholds. |
| ConfigPresets.MINIMAL | Only basic syllable validation. Lowest resource usage. |
| ConfigPresets.STRICT | Conservative thresholds to minimize false positives. |
Configuration Files
You can load configuration from a YAML file instead of code. This is useful for deploying the spell checker in different environments.
Load Order:
- Explicit path loaded via ConfigLoader().load(config_file="path/to/config.yml").
- myspellchecker.yaml, myspellchecker.yml, or myspellchecker.json in the current directory.
- ~/.config/myspellchecker/myspellchecker.yaml, myspellchecker.yml, or myspellchecker.json (user global config).
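The load order above can be sketched as a small path resolver. This is only an illustration of the search order, not the library's ConfigLoader implementation; the function name `find_config_file` and its `cwd` and `home` parameters are hypothetical and exist here so the lookup can be exercised in isolation.

```python
from pathlib import Path
from typing import Optional

# File names checked in each directory, in the order listed above.
CONFIG_NAMES = ("myspellchecker.yaml", "myspellchecker.yml", "myspellchecker.json")


def find_config_file(
    explicit_path: Optional[str] = None,
    cwd: Optional[Path] = None,
    home: Optional[Path] = None,
) -> Optional[Path]:
    """Return the config path chosen by the load order, or None (hypothetical helper)."""
    # 1. An explicit path always wins.
    if explicit_path is not None:
        return Path(explicit_path)

    # 2. The current directory, then 3. the user-global config directory.
    cwd = cwd or Path.cwd()
    home = home or Path.home()
    search_dirs = (cwd, home / ".config" / "myspellchecker")

    for directory in search_dirs:
        for name in CONFIG_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
    return None
```

A caller would fall back to built-in defaults when the function returns None.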
myspellchecker.yaml:
Configuration Parameters
General Settings
Maximum edit distance for suggestions (1-3). Higher values find more suggestions but are slower.
Maximum number of correction suggestions to return per error.
Enable phonetic matching (Myanmar Soundex-like) for finding sound-alike corrections.
Enable N-gram context validation for detecting real-word errors.
Enable Named Entity Recognition heuristics to skip proper names.
Enable algorithmic syllable structure checks.
Word segmentation engine: "myword", "crf", or "transformer".
Silently use an empty provider if the database is not found (instead of raising an error).
Nested Configuration Objects
SymSpell algorithm configuration (edit distance, prefix length, beam width).
N-gram context checker configuration (thresholds, smoothing, scoring weights).
Phonetic matching configuration (code length, suggestion thresholds).
Semantic model configuration (model path, tokenizer, inference settings).
POS tagger configuration (tagger type, model name, device).
Joint segmentation-tagging configuration (beam width, emission weight).
Validation behavior configuration (confidence thresholds, feature toggles).
Provider caching and query configuration (cache size, timeout).
Unified cache size configuration for all algorithm lookup caches.
Suggestion ranking weights and strategy selection.
NER model configuration. When provided with enabled=True, uses the specified NER model.
SymSpell Settings
Controlled via the symspell attribute (SymSpellConfig).
Context & N-gram Settings
Controlled via the ngram_context attribute (NgramContextConfig).
Phonetic Settings
Controlled via the phonetic attribute (PhoneticConfig).
Semantic Model Settings
Controlled via the semantic attribute (SemanticConfig). Requires a trained model.
Proactive Semantic Scanning
When enabled, the spell checker will proactively scan sentences for semantic errors using a language model (XLM-RoBERTa, mDeBERTa, etc.). This can detect errors that traditional dictionary-based methods miss.
POS Tagger Settings
Controlled via the pos_tagger attribute (POSTaggerConfig).
Validation & Error Detection Settings
Controlled via the validation attribute (ValidationConfig).
Provider Settings
Controlled via the provider_config attribute (ProviderConfig).
| Parameter | Default | Description |
|---|---|---|
| cache_size | 1024 | LRU cache size for database queries. |
| pool_min_size | 1 | Minimum connections in pool. |
| pool_max_size | 5 | Maximum connections in pool (smaller is better for SQLite). |
| pool_timeout | 5.0 | Connection checkout timeout in seconds. |
| pool_max_connection_age | 3600.0 | Max connection age before recreation (seconds). |
Connection Pooling
Connection pooling manages SQLite database connections for better resource control and production safety. The library uses connection pooling by default to ensure robust production behavior.
Benefits:
- Resource control with hard connection limits
- Connection health monitoring and automatic recreation
- Observability through pool statistics
- Graceful degradation under load

Sizing recommendations:
- Keep pool_max_size small (2-5) to reduce lock contention
- Set pool_min_size=1 for most cases
- Larger pools (>10) degrade performance due to lock contention

Performance notes:
- Pooling adds ~30-50% overhead compared to direct connections
- Overhead comes from queue operations, locking, and health checks
- Trade-off: performance vs. resource control and production safety

See tests/test_connection_pool.py for comprehensive test coverage.
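To make the pool parameters concrete, here is a minimal sketch of a bounded SQLite pool built on the standard library. It is not the library's pool: the class name `SimplePool` is hypothetical, and its constructor arguments only mirror pool_min_size, pool_max_size, pool_timeout, and pool_max_connection_age from the table above.

```python
import queue
import sqlite3
import time
from contextlib import contextmanager


class SimplePool:
    """Illustrative bounded SQLite pool (hypothetical, not the library's class)."""

    def __init__(self, database, min_size=1, max_size=5, timeout=5.0, max_age=3600.0):
        self._database = database
        self._timeout = timeout
        self._max_age = max_age
        # One queue slot per allowed connection; None marks a slot not yet opened.
        self._slots = queue.Queue(maxsize=max_size)
        for i in range(max_size):
            self._slots.put(self._open() if i < min_size else None)

    def _open(self):
        # check_same_thread=False lets a pooled connection be used by another thread.
        conn = sqlite3.connect(self._database, check_same_thread=False)
        return conn, time.monotonic()

    @contextmanager
    def connection(self):
        # Checkout blocks for at most `timeout` seconds: the hard connection limit.
        try:
            slot = self._slots.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("no connection available within the pool timeout") from None
        try:
            # Recreate connections that exceeded their maximum age.
            if slot is not None and time.monotonic() - slot[1] > self._max_age:
                slot[0].close()
                slot = None
            if slot is None:
                slot = self._open()
            yield slot[0]
        finally:
            # Always return the slot so the pool never shrinks.
            self._slots.put(slot)
```

Every checkout goes through a queue operation and an age check, which is where the overhead relative to a direct connection comes from.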
Joint Segmentation-Tagging Settings
The joint parameter accepts a JointConfig object for unified word segmentation and POS tagging.
| Parameter | Default | Description |
|---|---|---|
| enabled | False | Enable joint segmentation-tagging mode. |
| beam_width | 15 | Number of hypotheses to keep per position. |
| max_word_length | 20 | Maximum word length in characters. |
| emission_weight | 1.2 | Weight for P(tag\|word) emission probabilities. |
| word_score_weight | 1.0 | Weight for word n-gram language model scores. |
| min_prob | 1e-10 | Minimum probability for smoothing. |
| use_morphology_fallback | True | Use morphology for OOV word tag guessing. |
Integrated Features
The following features are automatically integrated into the validation pipeline. Most are enabled by default and work transparently.
Particle Typo Detection
Automatically detects common Myanmar particle typos using PARTICLE_TYPO_PATTERNS. Examples:
- ကို့ → ကို (object marker)
- နှင့် → နဲ့ (and/with)
- ပေး့ → ပေး (give/for)
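A pattern table of this kind can be applied as a direct token lookup. The sketch below assumes a plain dictionary built from the three examples above; the actual shape of PARTICLE_TYPO_PATTERNS is not shown in this document, and `fix_particle_typos` is a hypothetical helper.

```python
# Illustrative subset taken from the examples above; the real
# PARTICLE_TYPO_PATTERNS table may be structured differently.
PARTICLE_TYPOS = {
    "ကို့": "ကို",
    "နှင့်": "နဲ့",
    "ပေး့": "ပေး",
}


def fix_particle_typos(tokens):
    """Return (corrected tokens, list of (index, original, replacement))."""
    corrected, changes = [], []
    for i, token in enumerate(tokens):
        fixed = PARTICLE_TYPOS.get(token, token)
        if fixed != token:
            changes.append((i, token, fixed))
        corrected.append(fixed)
    return corrected, changes
```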
Medial Confusion Detection
Catches context-aware ျ vs ြ medial confusion using MEDIAL_CONFUSION_PATTERNS. For example:
- ကြီး vs ကျီး (big vs crow)
- ပြု vs ပျု (do vs -)
Morphology OOV Recovery
For out-of-vocabulary (OOV) words, the system attempts to recover the root by stripping common suffixes:
- Verb suffixes: သည်, ခဲ့, မည်, နေ, etc.
- Noun suffixes: များ, တို့, etc.
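Suffix stripping of this kind can be sketched as a loop that peels one known suffix at a time until the remainder is a dictionary word. This is an assumption about the general technique, not the library's code: `recover_root`, its `max_strips` limit, and the longest-suffix-first order are all hypothetical, and the suffix lists contain only the examples named above.

```python
# Suffixes taken from the lists above; the library's real inventory is larger.
VERB_SUFFIXES = ("သည်", "ခဲ့", "မည်", "နေ")
NOUN_SUFFIXES = ("များ", "တို့")


def recover_root(word, dictionary, max_strips=3):
    """Strip known suffixes from the end of `word` until a dictionary root appears.

    Returns (root, suffixes in word order), or None if no root is found.
    """
    # Try longer suffixes first so a short suffix cannot shadow a longer one.
    suffixes = sorted(VERB_SUFFIXES + NOUN_SUFFIXES, key=len, reverse=True)
    stripped = []
    current = word
    for _ in range(max_strips):
        for suffix in suffixes:
            if current.endswith(suffix) and len(current) > len(suffix):
                current = current[: -len(suffix)]
                stripped.append(suffix)
                break
        else:
            return None  # no known suffix at the end of the word
        if current in dictionary:
            return current, list(reversed(stripped))
    return None
```

For a word formed as root + ခဲ့ + သည်, the loop removes သည် first, then ခဲ့, and returns the root once it matches a dictionary entry.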
POS Sequence Validation
Uses ViterbiTagger output to detect invalid POS sequences:
- V-V (consecutive verbs without particles)
- P-P (consecutive particles)
- Invalid tag sequences defined in INVALID_POS_SEQUENCES
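Checking a tag sequence against a set of forbidden adjacent pairs can be sketched as follows. The set `INVALID_BIGRAMS` is a hypothetical stand-in for INVALID_POS_SEQUENCES, whose real contents and tag names are not given here; only the V-V and P-P cases come from the list above.

```python
# Hypothetical stand-in for INVALID_POS_SEQUENCES; tag names are assumed.
INVALID_BIGRAMS = {("V", "V"), ("P", "P")}


def find_invalid_pos_sequences(tags, invalid=INVALID_BIGRAMS):
    """Return (index, (tag, next_tag)) for every adjacent pair that is invalid."""
    return [
        (i, pair)
        for i, pair in enumerate(zip(tags, tags[1:]))
        if pair in invalid
    ]
```

The returned index points at the first tag of each offending pair, so a caller can map it back to the word position reported by the tagger.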
Question Detection
Identifies sentence types (question/statement) and validates question particle usage:
- Detects question words: ဘာ, ဘယ်, ဘယ်လို, etc.
- Validates question particles: လား, လဲ, သလဲ, etc.
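One way to picture this check is a classifier over already-segmented tokens. The rule used below, that a question word without a sentence-final question particle is worth flagging, is an assumption made for illustration; the function name, the return labels, and the punctuation set are hypothetical, and the word lists contain only the examples above.

```python
# Word lists limited to the examples above.
QUESTION_WORDS = {"ဘာ", "ဘယ်", "ဘယ်လို"}
QUESTION_PARTICLES = {"လား", "လဲ", "သလဲ"}
SENTENCE_PUNCTUATION = {"။", "?"}


def classify_sentence(tokens):
    """Label a token list as a question, a suspicious question, or a statement."""
    # Drop sentence punctuation so the final particle is the last token.
    words = [t for t in tokens if t not in SENTENCE_PUNCTUATION]
    has_question_word = any(t in QUESTION_WORDS for t in words)
    ends_with_particle = bool(words) and words[-1] in QUESTION_PARTICLES

    if ends_with_particle:
        return "question"
    if has_question_word:
        # Assumed rule: a question word normally needs a closing question particle.
        return "question_missing_particle"
    return "statement"
```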
Unified Suggestion Ranking
Suggestions from different sources are ranked using UnifiedRanker with source-specific weights:
| Source | Weight | Priority |
|---|---|---|
| particle_typo | 1.2 | Highest |
| semantic | 1.15 | High |
| medial_confusion | 1.1 | Medium-High |
| morphology_recovery | 0.9 | Medium |
| symspell | 1.0 | Base |
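The effect of these weights can be illustrated with a small ranking function. How UnifiedRanker actually combines scores is not described here, so this sketch assumes the simplest scheme: multiply each candidate's base score by its source weight and keep the best weighted score per term. The function name and the input tuple layout are hypothetical.

```python
# Weights copied from the table above.
SOURCE_WEIGHTS = {
    "particle_typo": 1.2,
    "semantic": 1.15,
    "medial_confusion": 1.1,
    "symspell": 1.0,
    "morphology_recovery": 0.9,
}


def rank_suggestions(candidates):
    """Rank terms from (term, source, base_score) tuples, best first.

    Assumes base_score is a similarity-style score where higher is better.
    """
    best = {}
    for term, source, score in candidates:
        # Unknown sources fall back to the base weight of 1.0.
        weighted = score * SOURCE_WEIGHTS.get(source, 1.0)
        if weighted > best.get(term, float("-inf")):
            best[term] = weighted
    # Sort by descending weighted score, then alphabetically for stable ties.
    return sorted(best, key=lambda term: (-best[term], term))
```

With this scheme a particle-typo suggestion scoring 0.7 (weighted 0.84) outranks a SymSpell suggestion scoring 0.8.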
Tone Disambiguation
The ToneDisambiguator provides context-aware correction for commonly confused Myanmar tone marks. Available via:
Examples:
- သား (son vs tiger)
- ငါ (I/me vs five)
- ပဲ (only vs bean)