The SpellChecker's behavior is controlled by the SpellCheckerConfig object. You can adjust performance thresholds, toggle features, and fine-tune algorithm sensitivity through nested configuration classes.

Usage

The easiest way to configure the spell checker is using ConfigPresets with the SpellCheckerBuilder.
```python
from myspellchecker.core.builder import SpellCheckerBuilder, ConfigPresets

# Use a preset
checker = (
    SpellCheckerBuilder()
    .with_config(ConfigPresets.ACCURATE)
    .build()
)

# Customize specific options on top of a preset
config = ConfigPresets.DEFAULT
config.max_suggestions = 10

checker = (
    SpellCheckerBuilder()
    .with_config(config)
    .with_phonetic(False)  # Override the phonetic setting
    .build()
)
```

Configuration Presets

| Preset | Description |
| --- | --- |
| `ConfigPresets.DEFAULT` | Balanced configuration suitable for most use cases. |
| `ConfigPresets.FAST` | Optimized for speed. Disables context checking and reduces search depth. |
| `ConfigPresets.ACCURATE` | Optimized for quality. Max edit distance 3, strict thresholds. |
| `ConfigPresets.MINIMAL` | Only basic syllable validation. Lowest resource usage. |
| `ConfigPresets.STRICT` | Conservative thresholds to minimize false positives. |

Configuration Files

You can load configuration from a YAML file instead of code. This is useful for deploying the spell checker in different environments.

Load order:
  1. Explicit path loaded via `ConfigLoader().load(config_file="path/to/config.yml")`.
  2. `myspellchecker.yaml`, `myspellchecker.yml`, or `myspellchecker.json` in the current directory.
  3. `~/.config/myspellchecker/myspellchecker.yaml`, `myspellchecker.yml`, or `myspellchecker.json` (user-global config).
Example `myspellchecker.yaml`:

```yaml
preset: accurate
max_suggestions: 10
use_phonetic: true

symspell:
  prefix_length: 10
```

Configuration Parameters

General Settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `max_edit_distance` | int | `2` | Maximum edit distance for suggestions (1-3). Higher values find more suggestions but are slower. |
| `max_suggestions` | int | `5` | Maximum number of correction suggestions to return per error. |
| `use_phonetic` | bool | `True` | Enable phonetic matching (Myanmar Soundex-like) for finding sound-alike corrections. |
| `use_context_checker` | bool | `True` | Enable N-gram context validation for detecting real-word errors. |
| `use_ner` | bool | `True` | Enable Named Entity Recognition heuristics to skip proper names. |
| `use_rule_based_validation` | bool | `True` | Enable algorithmic syllable structure checks. |
| `word_engine` | str | `"myword"` | Word segmentation engine: `"myword"`, `"crf"`, or `"transformer"`. |
| `fallback_to_empty_provider` | bool | `False` | Silently use an empty provider if the database is not found (instead of raising an error). |
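To make the effect of `max_edit_distance` and `max_suggestions` concrete, here is a brute-force Levenshtein filter. This is a sketch only; the library uses the SymSpell algorithm, not this linear vocabulary scan:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def suggest(word, vocabulary, max_edit_distance=2, max_suggestions=5):
    """Return up to max_suggestions vocabulary words within max_edit_distance."""
    scored = [(edit_distance(word, v), v) for v in vocabulary]
    within = sorted((d, v) for d, v in scored if d <= max_edit_distance)
    return [v for _, v in within[:max_suggestions]]
```

Raising `max_edit_distance` widens the `d <= max_edit_distance` filter, which is why larger values find more candidates but cost more time.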

Nested Configuration Objects

| Attribute | Type | Description |
| --- | --- | --- |
| `symspell` | SymSpellConfig | SymSpell algorithm configuration (edit distance, prefix length, beam width). |
| `ngram_context` | NgramContextConfig | N-gram context checker configuration (thresholds, smoothing, scoring weights). |
| `phonetic` | PhoneticConfig | Phonetic matching configuration (code length, suggestion thresholds). |
| `semantic` | SemanticConfig | Semantic model configuration (model path, tokenizer, inference settings). |
| `pos_tagger` | POSTaggerConfig | POS tagger configuration (tagger type, model name, device). |
| `joint` | JointConfig | Joint segmentation-tagging configuration (beam width, emission weight). |
| `validation` | ValidationConfig | Validation behavior configuration (confidence thresholds, feature toggles). |
| `provider_config` | ProviderConfig | Provider caching and query configuration (cache size, timeout). |
| `cache` | AlgorithmCacheConfig | Unified cache size configuration for all algorithm lookup caches. |
| `ranker` | RankerConfig | Suggestion ranking weights and strategy selection. |
| `ner` | NERConfig (default `None`) | NER model configuration. When provided with `enabled=True`, uses the specified NER model. |

SymSpell Settings

Controlled via the symspell attribute (SymSpellConfig).

Context & N-gram Settings

Controlled via the ngram_context attribute (NgramContextConfig).

Phonetic Settings

Controlled via the phonetic attribute (PhoneticConfig).

Semantic Model Settings

Controlled via the semantic attribute (SemanticConfig). Requires a trained model.

Proactive Semantic Scanning

When enabled, the spell checker will proactively scan sentences for semantic errors using a language model (XLM-RoBERTa, mDeBERTa, etc.). This can detect errors that traditional dictionary-based methods miss.
```python
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig

config = SpellCheckerConfig(
    semantic=SemanticConfig(
        use_proactive_scanning=True,
        proactive_confidence_threshold=0.7,  # Higher = fewer false positives
    )
)
```
Note: Proactive scanning requires a trained model and may increase processing time.

POS Tagger Settings

Controlled via the pos_tagger attribute (POSTaggerConfig).

Validation & Error Detection Settings

Controlled via the validation attribute (ValidationConfig).

Provider Settings

Controlled via the provider_config attribute (ProviderConfig).
| Parameter | Default | Description |
| --- | --- | --- |
| `cache_size` | 1024 | LRU cache size for database queries. |
| `pool_min_size` | 1 | Minimum connections in the pool. |
| `pool_max_size` | 5 | Maximum connections in the pool (smaller is better for SQLite). |
| `pool_timeout` | 5.0 | Connection checkout timeout in seconds. |
| `pool_max_connection_age` | 3600.0 | Maximum connection age before recreation (seconds). |

Connection Pooling

Connection pooling manages SQLite database connections for better resource control and production safety. The library uses connection pooling by default to ensure robust production behavior.

Benefits:
  • Resource control with hard connection limits
  • Connection health monitoring and automatic recreation
  • Observability through pool statistics
  • Graceful degradation under load
Pool size recommendations:
  • Keep pool_max_size small (2-5) to reduce lock contention
  • Set pool_min_size=1 for most cases
  • Larger pools (>10) degrade performance due to lock contention
Example configurations:

```python
from myspellchecker.core.config import SpellCheckerConfig, ProviderConfig

# Default configuration
config = SpellCheckerConfig(
    provider_config=ProviderConfig(
        pool_min_size=1,
        pool_max_size=5,
    )
)

# High-concurrency configuration
config = SpellCheckerConfig(
    provider_config=ProviderConfig(
        pool_min_size=2,
        pool_max_size=10,  # May impact performance
    )
)

# Custom timeout and connection age
config = SpellCheckerConfig(
    provider_config=ProviderConfig(
        pool_timeout=10.0,  # Wait up to 10 s for a connection
        pool_max_connection_age=7200.0,  # Recreate after 2 hours
    )
)
```
Performance characteristics:
  • Pooling adds ~30-50% overhead compared to direct connections
  • Overhead comes from queue operations, locking, and health checks
  • Trade-off: Performance vs. resource control and production safety
  • See tests/test_connection_pool.py for comprehensive test coverage
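The mechanics described above (checkout timeout, bounded size, age-based recreation) can be sketched with Python's standard library. This `SimpleConnectionPool` is a hypothetical simplification, not the library's implementation, which additionally tracks health and pool statistics:

```python
import queue
import sqlite3
import time

class SimpleConnectionPool:
    """Minimal fixed-size SQLite connection pool (illustrative sketch)."""

    def __init__(self, db_path, max_size=5, timeout=5.0, max_age=3600.0):
        self._db_path = db_path
        self._timeout = timeout      # pool_timeout: max wait for a free connection
        self._max_age = max_age      # pool_max_connection_age
        self._pool = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._pool.put(self._fresh())

    def _fresh(self):
        conn = sqlite3.connect(self._db_path, check_same_thread=False)
        return (conn, time.monotonic())  # remember creation time

    def acquire(self):
        # Blocks up to `timeout` seconds; raises queue.Empty on exhaustion.
        conn, born = self._pool.get(timeout=self._timeout)
        if time.monotonic() - born > self._max_age:
            conn.close()                 # stale connection: recreate it
            conn, born = self._fresh()
        return (conn, born)

    def release(self, entry):
        self._pool.put(entry)
```

The `Queue` gives the hard connection limit and checkout timeout; each `acquire` pays for a queue operation, which is the kind of overhead quantified above.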

Joint Segmentation-Tagging Settings

The joint parameter accepts a JointConfig object for unified word segmentation and POS tagging.
```python
from myspellchecker.core.config import SpellCheckerConfig, JointConfig

config = SpellCheckerConfig(
    joint=JointConfig(
        enabled=True,
        beam_width=15,
    )
)
```
| Parameter | Default | Description |
| --- | --- | --- |
| `enabled` | False | Enable joint segmentation-tagging mode. |
| `beam_width` | 15 | Number of hypotheses to keep per position. |
| `max_word_length` | 20 | Maximum word length in characters. |
| `emission_weight` | 1.2 | Weight for P(tag\|word) emission probabilities. |
| `word_score_weight` | 1.0 | Weight for word n-gram language model scores. |
| `min_prob` | 1e-10 | Minimum probability for smoothing. |
| `use_morphology_fallback` | True | Use morphology for OOV word tag guessing. |
See Segmentation - Joint Mode for detailed usage.

Integrated Features

The following features are automatically integrated into the validation pipeline. Most are enabled by default and work transparently.

Particle Typo Detection

Automatically detects common Myanmar particle typos using PARTICLE_TYPO_PATTERNS. Examples:
  • ကို့ → ကို (object marker)
  • နှင့် → နဲ့ (and/with)
  • ပေး့ → ပေး (give/for)
These patterns have 0.90-0.95 confidence and are checked during context validation.
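Conceptually, the pattern table is a lookup from typo to correction plus a confidence. The sketch below is an assumption about its shape (the real PARTICLE_TYPO_PATTERNS may store richer metadata, and only two of the documented pairs are shown):

```python
# typo -> (correction, confidence); directions here are illustrative assumptions.
PARTICLE_TYPO_PATTERNS = {
    "ကို့": ("ကို", 0.95),   # object marker with a spurious tone mark
    "ပေး့": ("ပေး", 0.90),  # "give/for" with a spurious tone mark
}

def check_particle_typos(tokens):
    """Return (index, typo, suggestion, confidence) for each known particle typo."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok in PARTICLE_TYPO_PATTERNS:
            fix, conf = PARTICLE_TYPO_PATTERNS[tok]
            hits.append((i, tok, fix, conf))
    return hits
```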

Medial Confusion Detection

Catches context-aware ျ vs ြ medial confusion using MEDIAL_CONFUSION_PATTERNS. For example:
  • ကြီး vs ကျီး (big vs crow)
  • ပြု vs ပျု (do vs -)

Morphology OOV Recovery

For out-of-vocabulary (OOV) words, the system attempts to recover the root by stripping common suffixes:
  • Verb suffixes: သည်, ခဲ့, မည်, နေ, etc.
  • Noun suffixes: များ, တို့, etc.
This improves suggestion quality for inflected forms.
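The suffix-stripping idea can be sketched as a greedy, longest-suffix-first loop. This is a simplified stand-in (the suffix lists below are just the examples from the docs, and `strip_suffixes` is a hypothetical name):

```python
# Common suffixes from the docs above; matched longest-first so stacked
# suffixes like "ခဲ့သည်" peel off correctly.
VERB_SUFFIXES = ["သည်", "ခဲ့", "မည်", "နေ"]
NOUN_SUFFIXES = ["များ", "တို့"]

def strip_suffixes(word: str) -> str:
    """Greedily strip known suffixes to recover a candidate root for an OOV word."""
    all_suffixes = sorted(VERB_SUFFIXES + NOUN_SUFFIXES, key=len, reverse=True)
    changed = True
    while changed:
        changed = False
        for suffix in all_suffixes:
            # Keep at least one character so we never strip the whole word.
            if word.endswith(suffix) and len(word) > len(suffix):
                word = word[: -len(suffix)]
                changed = True
    return word
```

The recovered root can then be looked up in the dictionary, so inflected forms inherit the root's suggestions.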

POS Sequence Validation

Uses ViterbiTagger output to detect invalid POS sequences:
  • V-V (consecutive verbs without particles)
  • P-P (consecutive particles)
  • Invalid tag sequences defined in INVALID_POS_SEQUENCES
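The check itself reduces to scanning adjacent tag pairs against a blocklist. A minimal sketch, assuming a set of invalid bigrams like the two documented above:

```python
# Invalid adjacent tag pairs (illustrative subset of INVALID_POS_SEQUENCES).
INVALID_POS_SEQUENCES = {("V", "V"), ("P", "P")}

def find_invalid_sequences(tags):
    """Return each position i where (tags[i], tags[i+1]) is an invalid sequence."""
    return [i for i in range(len(tags) - 1)
            if (tags[i], tags[i + 1]) in INVALID_POS_SEQUENCES]
```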

Question Detection

Identifies sentence types (question/statement) and validates question particle usage:
  • Detects question words: ဘာ, ဘယ်, ဘယ်လို, etc.
  • Validates question particles: လား, လဲ, သလဲ, etc.
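As a rough sketch of that validation (the real detector is richer; the word sets below are only the examples listed above, and `check_question` is a hypothetical name):

```python
QUESTION_WORDS = {"ဘာ", "ဘယ်", "ဘယ်လို"}
QUESTION_PARTICLES = {"လား", "လဲ", "သလဲ"}

def check_question(tokens):
    """Flag sentences containing a question word but ending without a question particle."""
    has_qword = any(t in QUESTION_WORDS for t in tokens)
    ends_with_qparticle = bool(tokens) and tokens[-1] in QUESTION_PARTICLES
    if has_qword and not ends_with_qparticle:
        return "missing question particle"
    return None
```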

Unified Suggestion Ranking

Suggestions from different sources are ranked using UnifiedRanker with source-specific weights:
| Source | Weight | Priority |
| --- | --- | --- |
| particle_typo | 1.2 | Highest |
| semantic | 1.15 | High |
| medial_confusion | 1.1 | Medium-High |
| morphology_recovery | 0.9 | Medium |
| symspell | 1.0 | Base |
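One plausible reading of source-weighted ranking is to scale each suggestion's base score by its source weight and sort. This is a sketch of that idea, not UnifiedRanker's actual scoring formula:

```python
# Source weights from the table above.
SOURCE_WEIGHTS = {
    "particle_typo": 1.2,
    "semantic": 1.15,
    "medial_confusion": 1.1,
    "symspell": 1.0,
    "morphology_recovery": 0.9,
}

def rank(suggestions):
    """suggestions: list of (word, base_score, source); best weighted score first."""
    return sorted(
        suggestions,
        key=lambda s: s[1] * SOURCE_WEIGHTS.get(s[2], 1.0),
        reverse=True,
    )
```

With such weights, a particle-typo suggestion can outrank a SymSpell suggestion that has a slightly higher raw score.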

Tone Disambiguation

The ToneDisambiguator provides context-aware correction for commonly confused Myanmar tone marks. Available via:
```python
from myspellchecker.text.tone import ToneDisambiguator

disambiguator = ToneDisambiguator()
corrections = disambiguator.check_sentence(["word1", "word2", ...])
```
Handles ambiguous words like:
  • သား (son vs tiger)
  • ငါ (I/me vs five)
  • ပဲ (only vs bean)