The behavior of SpellChecker is controlled by the SpellCheckerConfig object. You can adjust performance thresholds, toggle features, and fine-tune algorithm sensitivity through nested configuration classes.

Usage

The easiest way to configure the spell checker is using ConfigPresets with the SpellCheckerBuilder.
from myspellchecker.core.builder import SpellCheckerBuilder, ConfigPresets

# Use a preset
checker = (
    SpellCheckerBuilder()
    .with_config(ConfigPresets.ACCURATE)
    .build()
)

# Customize specific options on top of a preset
config = ConfigPresets.DEFAULT
config.max_suggestions = 10

checker = (
    SpellCheckerBuilder()
    .with_config(config)
    .with_phonetic(False)  # Override phonetic setting
    .build()
)

Configuration Presets

| Preset | Description |
|---|---|
| `ConfigPresets.DEFAULT` | Balanced configuration suitable for most use cases. |
| `ConfigPresets.FAST` | Optimized for speed. Disables context checking and reduces search depth. |
| `ConfigPresets.ACCURATE` | Optimized for quality. Max edit distance 3, strict thresholds. |
| `ConfigPresets.MINIMAL` | Dictionary-only checking. Disables phonetic, context, NER, and rule-based validation. Lowest resource usage. |
| `ConfigPresets.STRICT` | Sensitive thresholds that catch more potential errors. May increase false positives. Suitable for formal documents. |

Configuration Profiles

For environment-specific configurations, use get_profile(), which returns a fully configured SpellCheckerConfig object tuned for the named use case.
from myspellchecker.core.config import get_profile

# Use a profile directly
config = get_profile("production")

# Available profiles
config = get_profile("development")  # Fast iteration, minimal validation
config = get_profile("production")   # Balanced accuracy and performance (default)
config = get_profile("testing")      # Deterministic, reproducible results
config = get_profile("fast")         # Maximum speed, reduced accuracy
config = get_profile("accurate")     # Maximum accuracy, slower performance
| Profile | POS Tagger | Context | NER | Semantic | Zawgyi | Key Tuning |
|---|---|---|---|---|---|---|
| development | rule_based | Off | Off | Off | On | Small caches, prefix_length=5 |
| production | viterbi | On | On | Refinement | On | Standard caches, prefix_length=7 |
| testing | rule_based | On | On | Off | On | Small caches for determinism |
| fast | rule_based | Off | Off | Off | Off | max_edit_distance=1, count_threshold=200 |
| accurate | viterbi | On | On | Proactive | On | max_edit_distance=3, beam_width=100, large caches |
ConfigPresets (from SpellCheckerBuilder) and get_profile() are separate configuration systems. Presets are simpler toggles; profiles provide fully-tuned configurations including SymSpell, N-gram, POS tagger, and provider settings.

Configuration Files

You can load configuration from a YAML file instead of code. This is useful for deploying the spell checker in different environments.

Load order:
  1. Explicit path loaded via ConfigLoader().load(config_file="path/to/config.yml").
  2. myspellchecker.yaml, myspellchecker.yml, or myspellchecker.json in the current directory.
  3. ~/.config/myspellchecker/myspellchecker.yaml, myspellchecker.yml, or myspellchecker.json (User global config).
Example myspellchecker.yaml:
preset: accurate
max_suggestions: 10
use_phonetic: true

symspell:
  prefix_length: 10

Configuration Parameters

General Settings

| Parameter | Type | Default | Description |
|---|---|---|---|
| `max_edit_distance` | int | 2 | Maximum edit distance for suggestions (1-3). Higher values find more suggestions but are slower. |
| `max_suggestions` | int | 5 | Maximum number of correction suggestions to return per error. |
| `max_text_length` | int | 100000 | Maximum input text length in characters. Prevents resource exhaustion on very large inputs. |
| `use_phonetic` | bool | True | Enable phonetic matching (Myanmar Soundex-like) for finding sound-alike corrections. |
| `use_context_checker` | bool | True | Enable N-gram context validation for detecting real-word errors. |
| `use_ner` | bool | True | Enable Named Entity Recognition heuristics to skip proper names. |
| `use_rule_based_validation` | bool | True | Enable algorithmic syllable structure checks. |
| `word_engine` | str | "myword" | Word segmentation engine: "myword", "crf", or "transformer". |
| `seg_model` | str \| None | None | Custom model name or path for transformer word segmentation; only used when word_engine="transformer". When None, falls back to chuuhtetnaing/myanmar-text-segmentation-model. |
| `seg_device` | int | -1 | Device for transformer word segmentation inference: -1 for CPU, 0+ for a GPU index. Only used when word_engine="transformer". |
| `fallback_to_empty_provider` | bool | False | Silently use an empty provider if the database is not found, instead of raising an error. |
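To make the defaults above concrete, here is a minimal mirror of the general settings as a plain dataclass, including a sketch of the `max_text_length` guard. This is illustrative only; it is not the real SpellCheckerConfig class.

```python
from dataclasses import dataclass

@dataclass
class GeneralSettings:
    """Documented general settings and their defaults (illustrative mirror)."""
    max_edit_distance: int = 2
    max_suggestions: int = 5
    max_text_length: int = 100_000
    use_phonetic: bool = True
    use_context_checker: bool = True
    use_ner: bool = True
    use_rule_based_validation: bool = True
    word_engine: str = "myword"
    seg_device: int = -1  # -1 = CPU, 0+ = GPU index
    fallback_to_empty_provider: bool = False

    def check_text_length(self, text: str) -> None:
        # max_text_length prevents resource exhaustion on very large inputs.
        if len(text) > self.max_text_length:
            raise ValueError(
                f"input of {len(text)} chars exceeds max_text_length={self.max_text_length}"
            )
```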

Nested Configuration Objects

| Attribute | Type | Description |
|---|---|---|
| `symspell` | SymSpellConfig | SymSpell algorithm configuration (edit distance, prefix length, beam width). |
| `ngram_context` | NgramContextConfig | N-gram context checker configuration (thresholds, smoothing, scoring weights). |
| `phonetic` | PhoneticConfig | Phonetic matching configuration (code length, suggestion thresholds). |
| `semantic` | SemanticConfig | Semantic model configuration (model path, tokenizer, inference settings). |
| `pos_tagger` | POSTaggerConfig | POS tagger configuration (tagger type, model name, device). |
| `joint` | JointConfig | Joint segmentation-tagging configuration (beam width, emission weight). |
| `validation` | ValidationConfig | Validation behavior configuration (confidence thresholds, feature toggles). |
| `provider_config` | ProviderConfig | Provider caching and query configuration (cache size, timeout). |
| `cache` | AlgorithmCacheConfig | Unified cache size configuration for all algorithm lookup caches. |
| `ranker` | RankerConfig | Suggestion ranking weights and strategy selection. |
| `frequency_guards` | FrequencyGuardConfig | Centralized frequency thresholds that suppress false positives across validators (colloquial, homophone, N-gram, semantic). |
| `compound_resolver` | CompoundResolverConfig | Compound word synthesis and broken compound detection settings. |
| `reduplication` | ReduplicationConfig | Reduplication validation settings for AABB/ABAB patterns. |
| `neural_reranker` | NeuralRerankerConfig | Neural suggestion re-ranking model configuration (MLP with ONNX). |
| `broken_compound_strategy` | BrokenCompoundStrategyConfig | Broken compound detection strategy thresholds and confidence. |
| `token_refinement` | TokenRefinementConfig | Token boundary refinement scoring (exposes hidden errors in merged tokens). |
| `ner` | NERConfig | NER model configuration (default None). When provided with enabled=True, uses the specified NER model. |

SymSpell Settings

Controlled via the symspell attribute (SymSpellConfig).

Context & N-gram Settings

Controlled via the ngram_context attribute (NgramContextConfig).

Phonetic Settings

Controlled via the phonetic attribute (PhoneticConfig).

Semantic Model Settings

Controlled via the semantic attribute (SemanticConfig). Requires a trained model.

Proactive Semantic Scanning

When enabled, the spell checker will proactively scan sentences for semantic errors using a language model (XLM-RoBERTa, mDeBERTa, etc.). This can detect errors that traditional dictionary-based methods miss.
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig

config = SpellCheckerConfig(
    semantic=SemanticConfig(
        use_proactive_scanning=True,
        proactive_confidence_threshold=0.7,  # Higher = fewer false positives
    )
)
Note: Proactive scanning requires a trained model and may increase processing time.

POS Tagger Settings

Controlled via the pos_tagger attribute (POSTaggerConfig).

Validation & Error Detection Settings

Controlled via the validation attribute (ValidationConfig).

Provider Settings

Controlled via the provider_config attribute (ProviderConfig).
| Parameter | Default | Description |
|---|---|---|
| `cache_size` | 1024 | LRU cache size for database queries. |
| `pool_min_size` | 1 | Minimum connections in pool. |
| `pool_max_size` | 5 | Maximum connections in pool (smaller is better for SQLite). |
| `pool_timeout` | 5.0 | Connection checkout timeout in seconds. |
| `pool_max_connection_age` | 3600.0 | Max connection age before recreation (seconds). |

Connection Pooling

Connection pooling manages SQLite database connections for better resource control and production safety. The library uses connection pooling by default to ensure robust production behavior.

Benefits:
  • Resource control with hard connection limits
  • Connection health monitoring and automatic recreation
  • Observability through pool statistics
  • Graceful degradation under load
Pool size recommendations:
  • Keep pool_max_size small (2-5) to reduce lock contention
  • Set pool_min_size=1 for most cases
  • Larger pools (>10) degrade performance due to lock contention
Example configurations:
from myspellchecker.core.config import SpellCheckerConfig, ProviderConfig

# Default configuration
config = SpellCheckerConfig(
    provider_config=ProviderConfig(
        pool_min_size=1,
        pool_max_size=5,
    )
)

# High-concurrency configuration
config = SpellCheckerConfig(
    provider_config=ProviderConfig(
        pool_min_size=2,
        pool_max_size=10,  # May impact performance
    )
)

# Custom timeout and connection age
config = SpellCheckerConfig(
    provider_config=ProviderConfig(
        pool_timeout=10.0,  # Wait up to 10s for connection
        pool_max_connection_age=7200.0,  # Recreate after 2 hours
    )
)
Performance characteristics:
  • Pooling adds ~30-50% overhead compared to direct connections
  • Overhead comes from queue operations, locking, and health checks
  • Trade-off: Performance vs. resource control and production safety
  • See tests/test_connection_pool.py for comprehensive test coverage

Joint Segmentation-Tagging Settings

The joint parameter accepts a JointConfig object for unified word segmentation and POS tagging.
from myspellchecker.core.config import SpellCheckerConfig, JointConfig

config = SpellCheckerConfig(
    joint=JointConfig(
        enabled=True,
        beam_width=15,
    )
)
| Parameter | Default | Description |
|---|---|---|
| `enabled` | False | Enable joint segmentation-tagging mode. |
| `beam_width` | 15 | Number of hypotheses to keep per position. |
| `max_word_length` | 20 | Maximum word length in characters. |
| `emission_weight` | 1.2 | Weight for P(tag\|word) emission probabilities. |
| `word_score_weight` | 1.0 | Weight for word N-gram language model scores. |
| `min_prob` | 1e-10 | Minimum probability for smoothing. |
| `use_morphology_fallback` | True | Use morphology for OOV word tag guessing. |
See Segmentation - Joint Mode for detailed usage.

Frequency Guard Settings

Controlled via the frequency_guards attribute (FrequencyGuardConfig). Centralized thresholds that suppress false positives across validators.
| Parameter | Default | Description |
|---|---|---|
| `colloquial_high_freq_suppression` | 100000 | Suppress colloquial info for words above this frequency. |
| `homophone_high_freq` | 1000 | Apply stricter homophone ratio above this frequency. |
| `homophone_high_freq_ratio` | 50.0 | Improvement ratio required for high-frequency homophone words. |
| `ngram_high_freq_guard` | 5000 | Suppress N-gram false positives for words above this frequency. |
| `semantic_high_freq_protection` | 50000 | Apply high-frequency logit diff threshold above this frequency. |
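The guard logic these thresholds drive can be sketched as a single frequency comparison applied per validator. This is inferred behavior for illustration, not the library's actual code.

```python
def should_suppress(word_frequency: int, guard_threshold: int) -> bool:
    """Suppress a validator's flag when the word is frequent enough that it is
    almost certainly legitimate (illustrative sketch of the frequency guards)."""
    return word_frequency >= guard_threshold

# Documented defaults for two of the guards:
NGRAM_HIGH_FREQ_GUARD = 5000
COLLOQUIAL_HIGH_FREQ_SUPPRESSION = 100_000
```

For example, a word seen 6,000 times in the corpus would pass the N-gram guard and not be flagged by the context checker, while a word seen 100 times would remain eligible for flagging.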

Compound Resolver Settings

Controlled via the compound_resolver attribute (CompoundResolverConfig). Handles compound word synthesis for OOV recovery.
| Parameter | Default | Description |
|---|---|---|
| `min_morpheme_frequency` | 10 | Minimum frequency per morpheme. |
| `max_parts` | 4 | Maximum compound splits (2-8). |
| `cache_size` | 1024 | LRU cache entries. |
| `base_confidence` | 0.85 | Base confidence score for compound matches. |
| `high_freq_boost` | 0.05 | Confidence boost if min morpheme frequency >= 100. |
| `medium_freq_boost` | 0.03 | Confidence boost if min morpheme frequency >= 50. |
| `extra_parts_penalty` | 0.05 | Confidence penalty per extra part beyond 2. |
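Putting the confidence parameters together, the score for a compound match can be worked out as base plus frequency boost minus a per-part penalty. This is a sketch of the documented arithmetic, not the library's exact implementation.

```python
def compound_confidence(
    min_morpheme_freq: int,
    num_parts: int,
    base_confidence: float = 0.85,
    high_freq_boost: float = 0.05,
    medium_freq_boost: float = 0.03,
    extra_parts_penalty: float = 0.05,
) -> float:
    """Confidence for a synthesized compound (illustrative sketch)."""
    confidence = base_confidence
    if min_morpheme_freq >= 100:
        confidence += high_freq_boost    # all morphemes are very common
    elif min_morpheme_freq >= 50:
        confidence += medium_freq_boost  # moderately common morphemes
    # Each part beyond the first two makes the split less trustworthy.
    confidence -= extra_parts_penalty * max(0, num_parts - 2)
    return confidence

# Two very common morphemes:        0.85 + 0.05        = 0.90
# Three morphemes, min freq 60:     0.85 + 0.03 - 0.05 = 0.83
```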

Reduplication Settings

Controlled via the reduplication attribute (ReduplicationConfig). Validates Myanmar reduplication patterns (AABB, ABAB, rhyme).
| Parameter | Default | Description |
|---|---|---|
| `min_base_frequency` | 5 | Minimum base word frequency for validation. |
| `cache_size` | 1024 | LRU cache entries. |
| `pattern_confidence_ab` | 0.90 | Confidence for AB simple doubling (e.g., ခဏခဏ). |
| `pattern_confidence_aabb` | 0.85 | Confidence for AABB syllable doubling (e.g., သေသေချာချာ). |
| `pattern_confidence_abab` | 0.85 | Confidence for ABAB word repeat. |
| `pattern_confidence_rhyme` | 0.95 | Confidence for rhyme reduplication patterns. |
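A minimal sketch of how the AB-doubling and ABAB patterns might be recognized; the library's detector is more involved (syllable segmentation, base-frequency checks), so treat this as illustrative only.

```python
def is_ab_doubling(word: str) -> bool:
    """AB simple doubling: the word is some base repeated twice, e.g. ခဏခဏ."""
    if len(word) < 2 or len(word) % 2:
        return False
    half = len(word) // 2
    return word[:half] == word[half:]

def is_abab(words: list[str]) -> bool:
    """ABAB word repeat: two consecutive words repeated, e.g. [A, B, A, B]."""
    return len(words) == 4 and words[0] == words[2] and words[1] == words[3]
```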

Neural Reranker Settings

Controlled via the neural_reranker attribute (NeuralRerankerConfig). MLP-based suggestion re-ranking using ONNX.
| Parameter | Default | Description |
|---|---|---|
| `enabled` | False | Enable neural reranking (requires trained model). |
| `model_path` | None | Path to ONNX reranker model. |
| `stats_path` | None | Path to normalization statistics JSON. |
| `confidence_gap_threshold` | 0.15 | Skip reranking when top-2 confidence gap exceeds this. |
| `max_candidates` | 20 | Maximum candidates to score per error. |
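The `confidence_gap_threshold` short-circuit can be sketched as below. This is inferred from the parameter description, not taken from the library's code.

```python
def should_rerank(
    confidences: list[float],
    confidence_gap_threshold: float = 0.15,
) -> bool:
    """Skip neural reranking when the top suggestion is already a clear winner
    (illustrative sketch of confidence_gap_threshold behavior)."""
    if len(confidences) < 2:
        return False  # nothing to reorder
    top, second = sorted(confidences, reverse=True)[:2]
    # Only spend inference time when the top two candidates are close.
    return (top - second) <= confidence_gap_threshold
```

Skipping the model when the gap is wide keeps latency low for easy errors while reserving the reranker for genuinely ambiguous candidate sets.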

Broken Compound Strategy Settings

Controlled via the broken_compound_strategy attribute (BrokenCompoundStrategyConfig). Tunes the validation strategy that detects incorrectly split compound words.
| Parameter | Default | Description |
|---|---|---|
| `rare_threshold` | 2000 | Frequency below which a word is considered rare. |
| `compound_min_frequency` | 5000 | Minimum compound frequency to flag broken compound. |
| `compound_ratio` | 5.0 | Minimum ratio of compound_freq / rare_word_freq. |
| `confidence` | 0.8 | Default confidence for broken compound errors. |
| `both_high_freq` | 5000 | Frequency guard for multi-syllable both-high compounds. |
| `min_compound_len` | 4 | Minimum compound length for both-high-freq guard. |
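Combining the three core thresholds, the decision to flag a split as a broken compound might look like this. The exact gating is inferred from the parameter descriptions and is illustrative only.

```python
def flags_broken_compound(
    rare_word_freq: int,
    compound_freq: int,
    rare_threshold: int = 2000,
    compound_min_frequency: int = 5000,
    compound_ratio: float = 5.0,
) -> bool:
    """Flag a split as a broken compound when the joined form is much more
    common than the rare piece (illustrative sketch of the thresholds)."""
    if rare_word_freq >= rare_threshold:
        return False  # the piece is not rare; leave it alone
    if compound_freq < compound_min_frequency:
        return False  # the joined form is not common enough to trust
    # Require the joined form to dominate the rare piece by compound_ratio.
    return compound_freq >= compound_ratio * max(rare_word_freq, 1)
```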

Token Refinement Settings

Controlled via the token_refinement attribute (TokenRefinementConfig). Tunes the validation-time token-lattice refinement pass that exposes hidden error spans in merged tokens (e.g., particle attachment, negation attachment).
| Parameter | Default | Description |
|---|---|---|
| `suffix_score_boost` | 0.85 | Score boost when suffix matches a known form. |
| `known_part_score` | 1.35 | Score for known dictionary parts. |
| `unknown_long_part_penalty` | 0.45 | Penalty for unknown long parts. |
| `split_complexity_penalty` | 0.30 | Penalty for complex multi-part splits. |
| `bigram_scale` | 120000.0 | Scaling factor for bigram probability contribution. |
| `min_token_len` | 3 | Minimum token length for refinement candidates. |
| `keep_if_freq_at_least` | 2000 | Keep token if frequency is at least this value. |
| `min_score_gain` | 0.55 | Minimum score improvement to accept a split. |
| `lattice_max_paths` | 2 | Maximum lattice paths to consider. |
| `syllable_split_min_token_len` | 4 | Minimum token length for syllable-level splitting. |
| `syllable_split_max_syllables` | 6 | Maximum syllables for syllable-level splitting. |
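The acceptance test implied by `keep_if_freq_at_least` and `min_score_gain` can be sketched as follows. Both the gating order and the scoring interface here are assumptions made for illustration.

```python
def accept_split(
    token_freq: int,
    split_score: float,
    original_score: float,
    keep_if_freq_at_least: int = 2000,
    min_score_gain: float = 0.55,
) -> bool:
    """Accept a token split only when the token is not already a frequent word
    and the split improves the lattice score enough (illustrative sketch)."""
    if token_freq >= keep_if_freq_at_least:
        return False  # frequent tokens are kept as-is
    return (split_score - original_score) >= min_score_gain
```

The `min_score_gain` margin keeps refinement conservative: a split must beat the unsplit reading decisively before the checker exposes new error spans.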

Integrated Features

The following features are automatically integrated into the validation pipeline. Most are enabled by default and work transparently.

Particle Typo Detection

Automatically detects common Myanmar particle typos using PARTICLE_TYPO_PATTERNS. Examples:
  • တယ → တယ် (statement ending, missing asat)
  • နဲ → နဲ့ (with, missing tone)
  • သလာ → သလား (question, missing tone)
These patterns have 0.90-0.95 confidence and are checked during context validation.

Medial Confusion Detection

Catches context-aware ျ vs ြ medial confusion using MEDIAL_CONFUSION_PATTERNS. For example:
  • ကြီး vs ကျီး (big vs crow)
  • ပြု vs ပျု (do vs -)

Morphology OOV Recovery

For out-of-vocabulary (OOV) words, the system attempts to recover the root by stripping common suffixes:
  • Verb suffixes: သည်, ခဲ့, မည်, နေ, etc.
  • Noun suffixes: များ, တို့, etc.
This improves suggestion quality for inflected forms.
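The suffix-stripping step can be sketched with the suffixes named above. The library's full suffix lists are longer; this is a minimal illustration, not the actual recovery code.

```python
# Suffixes named in the docs; the library's full lists are longer.
VERB_SUFFIXES = ["သည်", "ခဲ့", "မည်", "နေ"]
NOUN_SUFFIXES = ["များ", "တို့"]

def strip_suffix(word: str) -> str:
    """Return a candidate root by stripping one known suffix, or the word
    unchanged if no suffix matches (illustrative sketch)."""
    for suffix in VERB_SUFFIXES + NOUN_SUFFIXES:
        # Require a non-empty remainder so the suffix itself survives intact.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word
```

If the recovered root is in the dictionary, suggestions for the inflected OOV form can be built from the root's candidates plus the stripped suffix.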

POS Sequence Validation

Uses ViterbiTagger output to detect invalid POS sequences:
  • V-V (consecutive verbs without particles)
  • P-P (consecutive particles)
  • Invalid tag sequences defined in INVALID_POS_SEQUENCES

Question Detection

Identifies sentence types (question/statement) and validates question particle usage:
  • Detects question words: ဘာ, ဘယ်, ဘယ်လို, etc.
  • Validates question particles: လား, လဲ, သလဲ, etc.

Unified Suggestion Ranking

Suggestions from different sources are ranked using UnifiedRanker with source-specific weights:
| Source | Weight | Priority |
|---|---|---|
| particle_typo | 1.2 | Highest |
| semantic | 1.15 | High |
| context | 1.15 | High |
| medial_confusion | 1.1 | Medium-High |
| medial_swap | 1.0 | Base |
| question_structure | 1.0 | Base |
| symspell | 1.0 | Base |
| compound | 0.95 | Medium |
| morphology | 0.9 | Medium |
| morpheme | 0.85 | Medium-Low |
| pos_sequence | 0.85 | Medium-Low |
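Applying these weights reduces to scaling each suggestion's base score by its source weight and sorting. This sketch uses a subset of the table; the real UnifiedRanker also merges duplicate candidates and applies strategy selection.

```python
# Source weights from the table above (subset).
SOURCE_WEIGHTS = {
    "particle_typo": 1.2,
    "semantic": 1.15,
    "context": 1.15,
    "symspell": 1.0,
    "morphology": 0.9,
}

def rank_suggestions(
    suggestions: list[tuple[str, str, float]],
) -> list[tuple[str, float]]:
    """Rank (word, source, base_score) tuples by weight-adjusted score
    (illustrative sketch)."""
    scored = [
        (word, base_score * SOURCE_WEIGHTS.get(source, 1.0))
        for word, source, base_score in suggestions
    ]
    # Highest adjusted score first.
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

For example, a particle-typo suggestion with base score 0.7 (adjusted 0.84) outranks a SymSpell suggestion with base score 0.8, reflecting the higher trust placed in pattern-matched particle fixes.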

Tone Disambiguation

The ToneDisambiguator provides context-aware correction for commonly confused Myanmar tone marks. Available via:
from myspellchecker.text.tone import ToneDisambiguator

disambiguator = ToneDisambiguator()
corrections = disambiguator.check_sentence(["word1", "word2", ...])
Handles ambiguous words like:
  • သား (son, disambiguated by family context patterns)
  • ငါ (I/me vs ငါး fish, detects missing visarga in numeral contexts)
  • ပဲ (only vs bean)