SpellChecker behavior is controlled by the SpellCheckerConfig object. You can adjust performance thresholds, toggle features, and fine-tune algorithm sensitivity through nested configuration classes.
Usage
The easiest way to configure the spell checker is using ConfigPresets with the SpellCheckerBuilder.
Configuration Presets
| Preset | Description |
|---|---|
| ConfigPresets.DEFAULT | Balanced configuration suitable for most use cases. |
| ConfigPresets.FAST | Optimized for speed. Disables context checking and reduces search depth. |
| ConfigPresets.ACCURATE | Optimized for quality. Max edit distance 3, strict thresholds. |
| ConfigPresets.MINIMAL | Dictionary-only checking. Disables phonetic, context, NER, and rule-based validation. Lowest resource usage. |
| ConfigPresets.STRICT | Sensitive thresholds that catch more potential errors. May increase false positives. Suitable for formal documents. |
Configuration Profiles
For environment-specific configurations, use get_profile(), which returns fully configured SpellCheckerConfig objects tuned for specific use cases.
| Profile | POS Tagger | Context | NER | Semantic | Zawgyi | Key Tuning |
|---|---|---|---|---|---|---|
| development | rule_based | Off | Off | Off | On | Small caches, prefix_length=5 |
| production | viterbi | On | On | Refinement | On | Standard caches, prefix_length=7 |
| testing | rule_based | On | On | Off | On | Small caches for determinism |
| fast | rule_based | Off | Off | Off | Off | max_edit_distance=1, count_threshold=200 |
| accurate | viterbi | On | On | Proactive | On | max_edit_distance=3, beam_width=100, large caches |
ConfigPresets (from SpellCheckerBuilder) and get_profile() are separate configuration systems. Presets are simpler toggles; profiles provide fully tuned configurations including SymSpell, N-gram, POS tagger, and provider settings.
Configuration Files
You can load configuration from a YAML file instead of code. This is useful for deploying the spell checker in different environments.
Load Order:
- Explicit path loaded via ConfigLoader().load(config_file="path/to/config.yml").
- myspellchecker.yaml, myspellchecker.yml, or myspellchecker.json in the current directory.
- ~/.config/myspellchecker/myspellchecker.yaml, myspellchecker.yml, or myspellchecker.json (user global config).
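The lookup order above can be sketched as a small resolver. This is an illustration only: find_config_file and its parameters are hypothetical names, not part of the library, and the precedence among the three file names within one directory is assumed to follow the order in which they are listed.

```python
from pathlib import Path
from typing import Optional

# File names searched in each location, in the order listed above.
CONFIG_NAMES = ("myspellchecker.yaml", "myspellchecker.yml", "myspellchecker.json")


def find_config_file(
    explicit: Optional[str] = None,
    cwd: Optional[Path] = None,
    home: Optional[Path] = None,
) -> Optional[Path]:
    """Return the config file that would be used, or None (hypothetical helper)."""
    # 1. An explicit path always wins.
    if explicit is not None:
        return Path(explicit)
    cwd = cwd or Path.cwd()
    home = home or Path.home()
    # 2. The current directory, then 3. the user-global config directory.
    for directory in (cwd, home / ".config" / "myspellchecker"):
        for name in CONFIG_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
    return None
```

Passing cwd and home explicitly keeps the helper easy to test against temporary directories.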
myspellchecker.yaml:
Configuration Parameters
General Settings
Maximum edit distance for suggestions (1-3). Higher values find more suggestions but are slower.
Maximum number of correction suggestions to return per error.
Maximum input text length in characters. Prevents resource exhaustion on very large inputs.
Enable phonetic matching (Myanmar Soundex-like) for finding sound-alike corrections.
Enable N-gram context validation for detecting real-word errors.
Enable Named Entity Recognition heuristics to skip proper names.
Enable algorithmic syllable structure checks.
Word segmentation engine: "myword", "crf", or "transformer".
Custom model name or path for transformer word segmentation. Only used when word_engine="transformer". Defaults to chuuhtetnaing/myanmar-text-segmentation-model.
Device for transformer word segmentation inference. -1 for CPU, 0+ for GPU index. Only used when word_engine="transformer".
Silently use an empty provider if the database is not found (instead of raising an error).
Nested Configuration Objects
SymSpell algorithm configuration (edit distance, prefix length, beam width).
N-gram context checker configuration (thresholds, smoothing, scoring weights).
Phonetic matching configuration (code length, suggestion thresholds).
Semantic model configuration (model path, tokenizer, inference settings).
POS tagger configuration (tagger type, model name, device).
Joint segmentation-tagging configuration (beam width, emission weight).
Validation behavior configuration (confidence thresholds, feature toggles).
Provider caching and query configuration (cache size, timeout).
Unified cache size configuration for all algorithm lookup caches.
Suggestion ranking weights and strategy selection.
Centralized frequency thresholds that suppress false positives across validators (colloquial, homophone, N-gram, semantic).
Compound word synthesis and broken compound detection settings.
Reduplication validation settings for AABB/ABAB patterns.
Neural suggestion re-ranking model configuration (MLP with ONNX).
Broken compound detection strategy thresholds and confidence.
Token boundary refinement scoring (exposes hidden errors in merged tokens).
NER model configuration. When provided with enabled=True, uses the specified NER model.
SymSpell Settings
Controlled via the symspell attribute (SymSpellConfig).
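To see why a larger edit distance costs more, recall that SymSpell is a symmetric-delete algorithm: it indexes variants of each word with characters removed. The sketch below only counts those delete variants for a single word. It is not the library's implementation, delete_variants is a hypothetical helper, and it ignores prefix_length, which in practice limits how much of each word is expanded.

```python
from itertools import combinations


def delete_variants(word: str, max_edit_distance: int) -> set:
    """All strings reachable by deleting up to max_edit_distance characters."""
    variants = {word}
    for k in range(1, min(max_edit_distance, len(word)) + 1):
        # Choose which characters to keep; the remaining k are deleted.
        for keep in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in keep))
    return variants


if __name__ == "__main__":
    for distance in (1, 2, 3):
        print(distance, len(delete_variants("abcdefgh", distance)))
```

The number of variants grows combinatorially with the distance, which matches the guidance that higher values find more suggestions but are slower.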
Context & N-gram Settings
Controlled via the ngram_context attribute (NgramContextConfig).
Phonetic Settings
Controlled via the phonetic attribute (PhoneticConfig).
Semantic Model Settings
Controlled via the semantic attribute (SemanticConfig). Requires a trained model.
Proactive Semantic Scanning
When enabled, the spell checker proactively scans sentences for semantic errors using a language model (XLM-RoBERTa, mDeBERTa, etc.). This can detect errors that traditional dictionary-based methods miss.
POS Tagger Settings
Controlled via the pos_tagger attribute (POSTaggerConfig).
Validation & Error Detection Settings
Controlled via the validation attribute (ValidationConfig).
Provider Settings
Controlled via the provider_config attribute (ProviderConfig).
| Parameter | Default | Description |
|---|---|---|
| cache_size | 1024 | LRU cache size for database queries. |
| pool_min_size | 1 | Minimum connections in pool. |
| pool_max_size | 5 | Maximum connections in pool (smaller is better for SQLite). |
| pool_timeout | 5.0 | Connection checkout timeout in seconds. |
| pool_max_connection_age | 3600.0 | Max connection age before recreation (seconds). |
Connection Pooling
Connection pooling manages SQLite database connections for better resource control and production safety. The library uses connection pooling by default to ensure robust production behavior.
Benefits:
- Resource control with hard connection limits
- Connection health monitoring and automatic recreation
- Observability through pool statistics
- Graceful degradation under load
Sizing:
- Keep pool_max_size small (2-5) to reduce lock contention
- Set pool_min_size=1 for most cases
- Larger pools (>10) degrade performance due to lock contention
Performance:
- Pooling adds ~30-50% overhead compared to direct connections
- Overhead comes from queue operations, locking, and health checks
- Trade-off: performance vs. resource control and production safety
Testing:
- See tests/test_connection_pool.py for comprehensive test coverage
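To make the pool parameters concrete, here is a toy pool built on the standard library. It is a sketch of the general technique, not the library's pool: SimplePool is a hypothetical class, and only its constructor arguments are taken from the parameter table above.

```python
import queue
import sqlite3
import time


class SimplePool:
    """Toy SQLite pool; parameter names mirror the table, internals are assumed."""

    def __init__(self, path, pool_min_size=1, pool_max_size=5,
                 pool_timeout=5.0, pool_max_connection_age=3600.0):
        self._path = path
        self._timeout = pool_timeout
        self._max_age = pool_max_connection_age
        # One slot per allowed connection. None marks a slot not yet opened.
        # LIFO order means open connections are reused before new ones open.
        self._slots = queue.LifoQueue(maxsize=pool_max_size)
        lazy = pool_max_size - pool_min_size
        for i in range(pool_max_size):
            self._slots.put(None if i < lazy else self._open())

    def _open(self):
        conn = sqlite3.connect(self._path, check_same_thread=False)
        return (conn, time.monotonic())

    def acquire(self):
        """Check out a (connection, created_at) slot.

        Blocks for up to pool_timeout seconds and raises queue.Empty
        when every connection is already checked out.
        """
        slot = self._slots.get(timeout=self._timeout)
        try:
            if slot is not None and time.monotonic() - slot[1] > self._max_age:
                slot[0].close()  # too old: drop it and recreate below
                slot = None
            return slot if slot is not None else self._open()
        except Exception:
            self._slots.put(None)  # give the slot back so the limit still holds
            raise

    def release(self, slot):
        self._slots.put(slot)
```

The bounded queue is what enforces the hard connection limit, and the blocking get is where the queue and locking overhead mentioned above comes from.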
Joint Segmentation-Tagging Settings
The joint parameter accepts a JointConfig object for unified word segmentation and POS tagging.
| Parameter | Default | Description |
|---|---|---|
| enabled | False | Enable joint segmentation-tagging mode. |
| beam_width | 15 | Number of hypotheses to keep per position. |
| max_word_length | 20 | Maximum word length in characters. |
| emission_weight | 1.2 | Weight for P(tag\|word) emission probabilities. |
| word_score_weight | 1.0 | Weight for word N-gram language model scores. |
| min_prob | 1e-10 | Minimum probability for smoothing. |
| use_morphology_fallback | True | Use morphology for OOV word tag guessing. |
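One plausible way these weights interact is a weighted sum of log-probabilities followed by beam pruning. The sketch below assumes that form. The function names are hypothetical, and the library's actual scoring may differ.

```python
import math


def hypothesis_score(word_prob, tag_given_word_prob,
                     emission_weight=1.2, word_score_weight=1.0,
                     min_prob=1e-10):
    """Weighted log-probability of one (word, tag) step (assumed form)."""
    # Flooring at min_prob avoids log(0) for unseen words or tags.
    word_prob = max(word_prob, min_prob)
    tag_given_word_prob = max(tag_given_word_prob, min_prob)
    return (word_score_weight * math.log(word_prob)
            + emission_weight * math.log(tag_given_word_prob))


def prune(hypotheses, beam_width=15):
    """Keep the beam_width best (score, payload) hypotheses."""
    return sorted(hypotheses, key=lambda h: h[0], reverse=True)[:beam_width]
```

Under this assumption, raising emission_weight makes the tagger's opinion count for more than the language model's when the two disagree about a segmentation.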
Frequency Guard Settings
Controlled via the frequency_guards attribute (FrequencyGuardConfig). Centralized thresholds that suppress false positives across validators.
| Parameter | Default | Description |
|---|---|---|
| colloquial_high_freq_suppression | 100000 | Suppress colloquial info for words above this frequency. |
| homophone_high_freq | 1000 | Apply stricter homophone ratio above this frequency. |
| homophone_high_freq_ratio | 50.0 | Improvement ratio required for high-frequency homophone words. |
| ngram_high_freq_guard | 5000 | Suppress N-gram false positives for words above this frequency. |
| semantic_high_freq_protection | 50000 | Apply high-frequency logit diff threshold above this frequency. |
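The guards read as simple frequency gates. The sketch below shows two of them with the documented defaults. FrequencyGuards here is a stand-in dataclass, not the library's FrequencyGuardConfig, and treating the "improvement ratio" as candidate frequency divided by word frequency is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrequencyGuards:
    """Stand-in carrying the documented defaults; not the library class."""
    homophone_high_freq: int = 1000
    homophone_high_freq_ratio: float = 50.0
    ngram_high_freq_guard: int = 5000


def allow_homophone_flag(word_freq, candidate_freq, guards=FrequencyGuards()):
    """Allow the flag unless the word is frequent and the gain is small."""
    if word_freq <= guards.homophone_high_freq:
        return True  # the stricter ratio only applies above the threshold
    # Assumed definition of the improvement ratio.
    return candidate_freq / word_freq >= guards.homophone_high_freq_ratio


def allow_ngram_flag(word_freq, guards=FrequencyGuards()):
    """Suppress N-gram flags for words above the guard frequency."""
    return word_freq <= guards.ngram_high_freq_guard
```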
Compound Resolver Settings
Controlled via the compound_resolver attribute (CompoundResolverConfig). Handles compound word synthesis for OOV recovery.
| Parameter | Default | Description |
|---|---|---|
| min_morpheme_frequency | 10 | Minimum frequency per morpheme. |
| max_parts | 4 | Maximum compound splits (2-8). |
| cache_size | 1024 | LRU cache entries. |
| base_confidence | 0.85 | Base confidence score for compound matches. |
| high_freq_boost | 0.05 | Confidence boost if min morpheme frequency >= 100. |
| medium_freq_boost | 0.03 | Confidence boost if min morpheme frequency >= 50. |
| extra_parts_penalty | 0.05 | Confidence penalty per extra part beyond 2. |
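Read together, the confidence parameters suggest an additive score. The sketch below combines them that way using the documented defaults and thresholds. The additive combination, the clamp to [0, 1], and the name compound_confidence are assumptions rather than the library's code.

```python
def compound_confidence(morpheme_freqs, min_morpheme_frequency=10, max_parts=4,
                        base_confidence=0.85, high_freq_boost=0.05,
                        medium_freq_boost=0.03, extra_parts_penalty=0.05):
    """Return a confidence for a candidate split, or None if it is rejected."""
    parts = len(morpheme_freqs)
    if parts < 2 or parts > max_parts:
        return None
    weakest = min(morpheme_freqs)
    if weakest < min_morpheme_frequency:
        return None  # at least one morpheme is too rare to trust
    confidence = base_confidence
    if weakest >= 100:
        confidence += high_freq_boost
    elif weakest >= 50:
        confidence += medium_freq_boost
    # Each part beyond the second makes the split less certain.
    confidence -= extra_parts_penalty * (parts - 2)
    return max(0.0, min(1.0, confidence))
```

For example, a two-part split whose rarer morpheme has frequency 150 scores 0.85 + 0.05, while a four-part split with a weakest frequency of 60 scores 0.85 + 0.03 - 0.10.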
Reduplication Settings
Controlled via the reduplication attribute (ReduplicationConfig). Validates Myanmar reduplication patterns (AABB, ABAB, rhyme).
| Parameter | Default | Description |
|---|---|---|
| min_base_frequency | 5 | Minimum base word frequency for validation. |
| cache_size | 1024 | LRU cache entries. |
| pattern_confidence_ab | 0.90 | Confidence for AB simple doubling (e.g., ခဏခဏ). |
| pattern_confidence_aabb | 0.85 | Confidence for AABB syllable doubling (e.g., သေသေချာချာ). |
| pattern_confidence_abab | 0.85 | Confidence for ABAB word repeat. |
| pattern_confidence_rhyme | 0.95 | Confidence for rhyme reduplication patterns. |
Neural Reranker Settings
Controlled via the neural_reranker attribute (NeuralRerankerConfig). MLP-based suggestion re-ranking using ONNX.
| Parameter | Default | Description |
|---|---|---|
| enabled | False | Enable neural reranking (requires trained model). |
| model_path | None | Path to ONNX reranker model. |
| stats_path | None | Path to normalization statistics JSON. |
| confidence_gap_threshold | 0.15 | Skip reranking when top-2 confidence gap exceeds this. |
| max_candidates | 20 | Maximum candidates to score per error. |
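The gap threshold acts as a cheap gate in front of the model: when the best candidate is already well ahead, reranking is unlikely to change the outcome. A minimal sketch of that gate follows, with hypothetical function names and no ONNX dependency.

```python
def should_rerank(confidences, confidence_gap_threshold=0.15):
    """True when the top two candidates are close enough to be worth reranking."""
    if len(confidences) < 2:
        return False  # nothing to reorder
    top, runner_up = sorted(confidences, reverse=True)[:2]
    return (top - runner_up) <= confidence_gap_threshold


def candidates_to_score(candidates, max_candidates=20):
    """Cap how many candidates are passed to the model for one error."""
    return candidates[:max_candidates]
```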
Broken Compound Strategy Settings
Controlled via the broken_compound_strategy attribute (BrokenCompoundStrategyConfig). Tunes the validation strategy that detects incorrectly split compound words.
| Parameter | Default | Description |
|---|---|---|
| rare_threshold | 2000 | Frequency below which a word is considered rare. |
| compound_min_frequency | 5000 | Minimum compound frequency to flag broken compound. |
| compound_ratio | 5.0 | Minimum ratio of compound_freq / rare_word_freq. |
| confidence | 0.8 | Default confidence for broken compound errors. |
| both_high_freq | 5000 | Frequency guard for multi-syllable both-high compounds. |
| min_compound_len | 4 | Minimum compound length for both-high-freq guard. |
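The first three thresholds describe the core test: one part is rare, the joined compound is common, and the compound is much more frequent than the rare part. The sketch below assumes all three conditions must hold. It omits the both_high_freq and min_compound_len guard, and is_broken_compound is a hypothetical name.

```python
def is_broken_compound(left_freq, right_freq, compound_freq,
                       rare_threshold=2000, compound_min_frequency=5000,
                       compound_ratio=5.0):
    """Decide whether two adjacent words look like a wrongly split compound."""
    rare_freq = min(left_freq, right_freq)
    if rare_freq >= rare_threshold:
        return False  # neither part is rare
    if compound_freq < compound_min_frequency:
        return False  # the joined form is not common enough
    # max(..., 1) avoids division by zero for unseen parts.
    return compound_freq / max(rare_freq, 1) >= compound_ratio
```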
Token Refinement Settings
Controlled via the token_refinement attribute (TokenRefinementConfig). Tunes the validation-time token-lattice refinement pass that exposes hidden error spans in merged tokens (e.g., particle attachment, negation attachment).
| Parameter | Default | Description |
|---|---|---|
| suffix_score_boost | 0.85 | Score boost when suffix matches a known form. |
| known_part_score | 1.35 | Score for known dictionary parts. |
| unknown_long_part_penalty | 0.45 | Penalty for unknown long parts. |
| split_complexity_penalty | 0.30 | Penalty for complex multi-part splits. |
| bigram_scale | 120000.0 | Scaling factor for bigram probability contribution. |
| min_token_len | 3 | Minimum token length for refinement candidates. |
| keep_if_freq_at_least | 2000 | Keep token if frequency is at least this value. |
| min_score_gain | 0.55 | Minimum score improvement to accept a split. |
| lattice_max_paths | 2 | Maximum lattice paths to consider. |
| syllable_split_min_token_len | 4 | Minimum token length for syllable-level splitting. |
| syllable_split_max_syllables | 6 | Maximum syllables for syllable-level splitting. |
Integrated Features
The following features are automatically integrated into the validation pipeline. Most are enabled by default and work transparently.
Particle Typo Detection
Automatically detects common Myanmar particle typos using PARTICLE_TYPO_PATTERNS. Examples:
- တယ → တယ် (statement ending, missing asat)
- နဲ → နဲ့ (with, missing tone)
- သလာ → သလား (question, missing tone)
Medial Confusion Detection
Catches context-aware ျ vs ြ medial confusion using MEDIAL_CONFUSION_PATTERNS. For example:
- ကြီး vs ကျီး (big vs crow)
- ပြု vs ပျု (do vs -)
Morphology OOV Recovery
For out-of-vocabulary (OOV) words, the system attempts to recover the root by stripping common suffixes:
- Verb suffixes: သည်, ခဲ့, မည်, နေ, etc.
- Noun suffixes: များ, တို့, etc.
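Suffix stripping of this kind can be sketched in a few lines. The suffix tuples below contain only the examples listed above, not the library's full inventory, and recover_root is a hypothetical helper that strips a single suffix and checks the remainder against a dictionary.

```python
# Only the suffixes named in the text; the real lists are longer.
VERB_SUFFIXES = ("သည်", "ခဲ့", "မည်", "နေ")
NOUN_SUFFIXES = ("များ", "တို့")


def recover_root(word, dictionary):
    """Return (root, suffix) if stripping one suffix yields a known word."""
    # Try longer suffixes first so the longest match wins.
    for suffix in sorted(VERB_SUFFIXES + NOUN_SUFFIXES, key=len, reverse=True):
        if not word.endswith(suffix):
            continue
        root = word[: -len(suffix)]
        if root in dictionary:
            return root, suffix
    return None
```

A word that is itself in the dictionary never reaches this step, since recovery is only attempted for OOV words.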
POS Sequence Validation
Uses ViterbiTagger output to detect invalid POS sequences:
- V-V (consecutive verbs without particles)
- P-P (consecutive particles)
- Invalid tag sequences defined in INVALID_POS_SEQUENCES
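A check of this kind amounts to scanning adjacent tag pairs. In the sketch below, INVALID_BIGRAMS is a hypothetical stand-in that holds only the two patterns named above. The actual contents and shape of INVALID_POS_SEQUENCES are not shown in this document.

```python
# Hypothetical stand-in for INVALID_POS_SEQUENCES: just V-V and P-P.
INVALID_BIGRAMS = {("V", "V"), ("P", "P")}


def find_invalid_sequences(tags):
    """Return (index, (tag, next_tag)) for each invalid adjacent pair."""
    return [
        (i, pair)
        for i, pair in enumerate(zip(tags, tags[1:]))
        if pair in INVALID_BIGRAMS
    ]
```

The returned index points at the first tag of the offending pair, which is enough to map the problem back to a token position.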
Question Detection
Identifies sentence types (question/statement) and validates question particle usage:
- Detects question words: ဘာ, ဘယ်, ဘယ်လို, etc.
- Validates question particles: လား, လဲ, သလဲ, etc.
Unified Suggestion Ranking
Suggestions from different sources are ranked using UnifiedRanker with source-specific weights:
| Source | Weight | Priority |
|---|---|---|
| particle_typo | 1.2 | Highest |
| semantic | 1.15 | High |
| context | 1.15 | High |
| medial_confusion | 1.1 | Medium-High |
| medial_swap | 1.0 | Base |
| question_structure | 1.0 | Base |
| symspell | 1.0 | Base |
| compound | 0.95 | Medium |
| morphology | 0.9 | Medium |
| morpheme | 0.85 | Medium-Low |
| pos_sequence | 0.85 | Medium-Low |
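As an illustration of how source weights can reorder suggestions, the sketch below multiplies each suggestion's confidence by the weight of its source and sorts by the result. The weights come from the table above. The multiplicative scoring and the rank function are assumptions, not the UnifiedRanker implementation.

```python
SOURCE_WEIGHTS = {
    "particle_typo": 1.2, "semantic": 1.15, "context": 1.15,
    "medial_confusion": 1.1, "medial_swap": 1.0, "question_structure": 1.0,
    "symspell": 1.0, "compound": 0.95, "morphology": 0.9,
    "morpheme": 0.85, "pos_sequence": 0.85,
}


def rank(suggestions):
    """Sort (text, source, confidence) tuples by weighted confidence."""
    return sorted(
        suggestions,
        # Unknown sources fall back to the base weight of 1.0.
        key=lambda s: s[2] * SOURCE_WEIGHTS.get(s[1], 1.0),
        reverse=True,
    )
```

Under this scheme a particle_typo suggestion with confidence 0.7 outranks a symspell suggestion with confidence 0.8, because 0.7 * 1.2 exceeds 0.8 * 1.0.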
Tone Disambiguation
The ToneDisambiguator provides context-aware correction for commonly confused Myanmar tone marks. Examples:
- သား (son, disambiguated by family context patterns)
- ငါ (I/me) vs ငါး (fish; detects a missing visarga in numeral contexts)
- ပဲ (only vs bean)