The Segmenter Interface
mySpellChecker uses a pluggable Segmenter abstract base class (src/myspellchecker/segmenters/base.py).
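As a rough sketch, such a base class typically looks like the following (the actual method names in base.py may differ; `segment` and the `WhitespaceSegmenter` example are illustrative assumptions):

```python
from abc import ABC, abstractmethod

class Segmenter(ABC):
    """Illustrative sketch of a pluggable segmenter interface."""

    @abstractmethod
    def segment(self, text: str) -> list[str]:
        """Split text into word tokens."""

class WhitespaceSegmenter(Segmenter):
    """Trivial implementation for demonstration: split on whitespace."""

    def segment(self, text: str) -> list[str]:
        return text.split()

print(WhitespaceSegmenter().segment("hello world"))  # -> ['hello', 'world']
```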
Default Implementation
The default implementation (DefaultSegmenter) uses a hybrid approach that combines rule-based speed with ML-powered accuracy.
1. Syllable Segmentation (Optimized)
- Algorithm: Uses a Cython-optimized C++ implementation of syllable breaking rules (adapted from Sylbreak) for maximum performance. Falls back to regex matching if the extension is unavailable.
- Noise Reduction:
- Unicode Normalization: Automatically applies NFC normalization and reorders diacritics (e.g., Medial Ra after Virama) using fast C++ routines.
- Validity Filtering: Checks against strict linguistic rules to discard “junk” syllables (e.g., floating diacritics, impossible stacks).
- Performance: Extremely fast (<1ms/sentence).
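To illustrate the rule-based idea, here is a minimal pure-Python sketch in the spirit of Sylbreak's regex approach (the real fallback pattern and character classes in the library may differ; this simplified version handles only basic consonant clusters):

```python
import re

CONSONANT = r"\u1000-\u1021"   # Myanmar consonants က-အ
VIRAMA = "\u1039"              # stacking symbol ္
ASAT = "\u103A"                # killer mark ်

# Break before any consonant that is not stacked onto the previous
# character (no preceding virama) and is not itself killed or stacked.
_SYL_BREAK = re.compile(f"((?<!{VIRAMA})[{CONSONANT}](?![{ASAT}{VIRAMA}]))")

def break_syllables(text: str) -> list[str]:
    """Insert a boundary before each syllable-initial consonant, then split."""
    return _SYL_BREAK.sub(r" \1", text).split()

print(break_syllables("မြန်မာ"))  # -> ['မြန်', 'မာ']
```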
2. Sentence & Word Segmentation (Pluggable Engines)
The segmenter supports multiple engines for word segmentation:
- myWord (Default & Recommended): Uses a Viterbi algorithm with unigram/bigram probabilities. It provides the best balance of speed and accuracy, especially for particle handling.
- CRF: Uses Conditional Random Fields (via python-crfsuite). Good for general text but slower than myWord.
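The Viterbi idea behind the myWord engine can be sketched with a toy unigram example (the probabilities and lexicon below are invented for illustration; the real engine uses Burmese unigram/bigram dictionaries):

```python
import math

# Toy unigram lexicon with made-up probabilities.
UNIGRAM = {"this": 0.3, "is": 0.3, "a": 0.2, "test": 0.2}

def viterbi_segment(text: str, probs: dict, max_len: int = 10) -> list[str]:
    """Find the most probable segmentation under a unigram model."""
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)  # (best log-prob, backpointer)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrace from the end of the string.
    words, i = [], n
    while i > 0:
        start = best[i][1]
        words.append(text[start:i])
        i = start
    return list(reversed(words))

print(viterbi_segment("thisisatest", UNIGRAM))  # -> ['this', 'is', 'a', 'test']
```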
Joint Segmentation and POS Tagging
For advanced use cases, mySpellChecker offers joint segmentation-tagging: a unified Viterbi decoder that simultaneously optimizes word boundaries AND POS tags in a single pass.
Benefits Over Sequential Processing
The traditional pipeline segments text first, then tags the resulting words. This can lead to suboptimal results when:
- Segmentation ambiguity depends on POS context (e.g., particle boundaries)
- Multiple valid segmentations exist with different tag sequences
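The core idea of scoring word boundaries and tags together can be sketched with a toy joint decoder (lexicon, tags, and all probabilities below are invented for illustration; the library's actual decoder is more sophisticated):

```python
import math

# word -> {tag: P(word|tag)} and tag-bigram transition scores (toy values).
LEXICON = {
    "work":  {"NOUN": 0.4, "VERB": 0.4},
    "works": {"VERB": 0.5},
    "s":     {"PART": 0.1},
}
TRANS = {("<S>", "NOUN"): 0.5, ("<S>", "VERB"): 0.5,
         ("NOUN", "PART"): 0.6, ("VERB", "PART"): 0.1}

def joint_decode(text: str, max_len: int = 6) -> list[tuple[str, str]]:
    """Jointly choose word boundaries and tags maximizing total log-prob."""
    # chart[i]: list of hypotheses (logp, tag, word_start, backpointer)
    chart = [[] for _ in range(len(text) + 1)]
    chart[0] = [(0.0, "<S>", 0, -1)]
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word not in LEXICON:
                continue
            for tag, emit in LEXICON[word].items():
                for k, (lp, ptag, _, _) in enumerate(chart[start]):
                    t = TRANS.get((ptag, tag))
                    if t:
                        chart[end].append(
                            (lp + math.log(t) + math.log(emit), tag, start, k))
    # Backtrace the best final hypothesis into (word, tag) pairs.
    out, i, hyp = [], len(text), max(chart[-1])
    while hyp[3] != -1:
        _, tag, start, k = hyp
        out.append((text[start:i], tag))
        hyp, i = chart[start][k], start
    return list(reversed(out))

print(joint_decode("works"))  # -> [('works', 'VERB')]
```

Because the decoder compares whole (segmentation, tag-sequence) hypotheses, the tag context can tip the balance between competing boundary choices, which a segment-then-tag pipeline cannot do.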
Enabling Joint Mode
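A plausible sketch of enabling joint mode (the `joint_tagging` constructor argument is an assumption, not a confirmed API; the keys mirror the Configuration Options table below):

```python
# Hypothetical wiring — argument name assumed; option keys match the
# Configuration Options table in this document.
from myspellchecker import SpellChecker

checker = SpellChecker(
    joint_tagging={
        "enabled": True,   # turn on the unified Viterbi decoder
        "beam_width": 15,  # hypotheses kept per position
    }
)
```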
Direct Usage
You can also use the JointSegmentTagger class directly:
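A hedged sketch of direct usage (the import path, constructor arguments, and method name are assumptions based on this document, not a confirmed API):

```python
# Hypothetical usage — module path and method name assumed.
from myspellchecker.joint import JointSegmentTagger

tagger = JointSegmentTagger(beam_width=15, max_word_length=20)
pairs = tagger.segment_and_tag("some input text")  # -> [(word, pos_tag), ...]
```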
Configuration Options
| Parameter | Default | Description |
|---|---|---|
| `enabled` | `False` | Enable joint segmentation-tagging mode |
| `beam_width` | `15` | Number of hypotheses to keep (higher = more accurate but slower) |
| `max_word_length` | `20` | Maximum word length in characters |
| `emission_weight` | `1.2` | Weight for tag emission probabilities |
| `word_score_weight` | `1.0` | Weight for word n-gram scores |
| `min_prob` | `1e-10` | Minimum probability for smoothing |
| `use_morphology_fallback` | `True` | Use morphology analysis for OOV words |
Custom Segmenters
You can bring your own segmentation logic by subclassing Segmenter and passing it to SpellChecker.
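A hedged sketch of the wiring (the method name `segment` and the `segmenter=` constructor argument are assumptions; check the base class in src/myspellchecker/segmenters/base.py for the real interface):

```python
# Hypothetical wiring — method and argument names assumed.
from myspellchecker import SpellChecker
from myspellchecker.segmenters.base import Segmenter

class MyCustomSegmenter(Segmenter):
    """Example custom segmenter: replace with your own logic."""

    def segment(self, text: str) -> list[str]:
        return text.split()  # placeholder segmentation

checker = SpellChecker(segmenter=MyCustomSegmenter())
```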