The Problem: Traditional Edit Distance
Traditional spell checking calculates the Levenshtein edit distance between an input word and every word in the dictionary:The Solution: Symmetric Delete
SymSpell’s key insight: Instead of comparing words directly, pre-compute all possible deletions.Why “Symmetric”?
If we delete characters from both the dictionary word AND the misspelled word, they’ll meet in the middle:How It Works
Indexing (Build Time)
For each word in the dictionary, generate all possible deletions up to Store in hash map:
max_edit_distance (typically 2):Visual Example: Syllable Correction
Visual Example: Word Correction
Myanmar-Specific Considerations
Character Clusters
Myanmar characters often form clusters (consonant + medials + vowels). SymSpell treats each Unicode code point as a unit:Common Myanmar Typos SymSpell Catches
| Error Type | Misspelled | Correct | Edit Distance |
|---|---|---|---|
| Missing medial | ”ကောင်" | "ကြောင်” | 1 |
| Wrong medial | ”ကျောင်" | "ကြောင်” | 2 (substitute) |
| Missing asat | ”မြနမာ" | "မြန်မာ” | 1 |
| Missing tone | ”ကြီ" | "ကြီး” | 1 |
| Extra medial | ”ကျောင့်" | "ကျောင်း” | 1 |
Syllable vs Word Level
mySpellChecker applies SymSpell at two levels: 1. Syllable Level (faster, catches 90% of errors):Performance Characteristics
Time Complexity
| Operation | Traditional | SymSpell |
|---|---|---|
| Single lookup | O(N × M) | O(1) average |
| Build index | - | O(V × L^d) |
- V = vocabulary size
- L = average term length
- d = max edit distance
Space Complexity
SymSpell trades memory for speed:Benchmark: Myanmar Dictionary
| Dictionary Size | Index Build Time | Lookup Time | Memory |
|---|---|---|---|
| 10,000 words | 0.5s | <1ms | ~15MB |
| 50,000 words | 2s | <1ms | ~50MB |
| 100,000 words | 5s | <1ms | ~50-100MB |
max_edit_distance — values above 2 cause exponential growth.
Configuration
SpellCheckerConfig Options
Edit Distance Guidelines
| Distance | Coverage | Speed | Use Case |
|---|---|---|---|
| 1 | ~85% | Fastest | Real-time typing |
| 2 | ~99% | Fast | Default, recommended |
| 3 | ~99.9% | Slower | Batch processing |
Prefix Length
Theprefix_length parameter optimizes memory by only indexing the first N characters:
Implementation Details
Index Structure
Lookup Algorithm
Suggestion Dataclass
Each suggestion returned bylookup() includes:
| Field | Type | Description |
|---|---|---|
term | str | The suggested correction |
edit_distance | int | Damerau-Levenshtein distance from input |
frequency | int | Corpus frequency of the term |
phonetic_score | float | Phonetic similarity score (0.0-1.0) |
syllable_distance | float | None | Myanmar syllable-aware weighted distance |
weighted_distance | float | None | Myanmar-weighted edit distance using substitution costs |
is_nasal_variant | bool | True if nasal ending difference only (န်↔ံ) |
has_same_nasal_ending | bool | True if same nasal consonant as input |
Additional Methods
| Method | Description |
|---|---|
build_index(levels) | Build delete index for specified levels (["syllable"], ["word"], or both) |
lookup_compound(term) | Compound word segmentation with suggestions |
compute_frequency_denominator(provider, level, percentile) | Static method to compute optimal frequency denominator from corpus statistics |
Comparison with Other Algorithms
| Algorithm | Lookup Speed | Index Size | Accuracy |
|---|---|---|---|
| Brute Force | O(N×M) | Small | 100% |
| BK-Tree | O(log N) | Medium | 100% |
| SymSpell | O(1) | Large | 100% |
| Norvig’s | O(26^d × M) | Small | 100% |
See Also
- Edit Distance - Levenshtein distance calculation
- Syllable Validation - Syllable-level checking
- Word Validation - Word-level checking
- Performance Tuning - Optimization strategies