Documentation Index
Fetch the complete documentation index at: https://docs.myspellchecker.com/llms.txt
Use this file to discover all available pages before exploring further.
Traditional spell checking compares an input word against every dictionary entry, which is far too slow for real-time use. SymSpell avoids this by pre-computing delete variants at build time, turning each lookup into a constant-time hash table hit. This page walks through the algorithm with Myanmar language examples.
The Problem: Traditional Edit Distance
Traditional spell checking calculates the Levenshtein edit distance between an input word and every word in the dictionary:
Input: "မနမာ" (misspelled)
Compare with dictionary:
"မြန်မာ" → edit distance = 2
"မြန်" → edit distance = 3
"မာန" → edit distance = 2
... (repeat for 100,000+ words)
Problem: This is O(N × M) where N = dictionary size, M = word length. For a 100,000-word dictionary, this means 100,000 comparisons per lookup.
The Solution: Symmetric Delete
SymSpell’s key insight: Instead of comparing words directly, pre-compute all possible deletions.
Why “Symmetric”?
If we delete characters from both the dictionary word AND the misspelled word, they’ll meet in the middle:
Dictionary word: "မြန်မာ" (Myanmar)
↓ delete "ြ"
"မန်မာ"
↑ delete "န"
Misspelled word: "မနမာ"
Both reach the same intermediate form through deletions!
How It Works
Indexing (Build Time)
For each word in the dictionary, generate all possible deletions up to max_edit_distance (typically 2):Dictionary word: "မြန်မာ" (Myanmar)
Generate deletes (distance 1):
Delete position 0: "ြန်မာ" (remove မ)
Delete position 1: "မန်မာ" (remove ြ)
Delete position 2: "မြမာ" (remove န်)
Delete position 3: "မြန်မ" (remove ာ)
Delete position 4: "မြန်မာ" (remove invisible char)
Generate deletes (distance 2):
"န်မာ", "ြမာ", "မမာ", "မြမ", ...
Store in hash map:Index = {
"မန်မာ": ["မြန်မာ"],
"ြန်မာ": ["မြန်မာ"],
"မြမာ": ["မြန်မာ"],
"န်မာ": ["မြန်မာ", "ကန်မာ"], // Multiple words can map here
...
}
Lookup (Query Time)
When checking a misspelled word, generate its deletions too:Input: "မနမာ" (misspelled - missing ြ and န်)
Generate deletes:
Distance 0: "မနမာ" (the word itself)
Distance 1: "နမာ", "မမာ", "မနာ", "မနမ"
Distance 2: "မာ", "နာ", "မမ", ...
Match
Look up each deletion in the pre-built index:Lookup "မနမာ" → Not found
Lookup "နမာ" → Not found
Lookup "မမာ" → Found! Points to ["မြန်မာ"]
Result: Found candidate “မြန်မာ” in O(1) hash lookup!
Visual Example: Syllable Correction
Misspelled syllable: "ကျေင်း" (wrong vowel: ေ instead of ော)
Correct syllable: "ကျောင်း" (school)
Index contains:
┌──────────────────────────────────────────────────────┐
│ Deletion → Original Words │
├──────────────────────────────────────────────────────┤
│ "ကျင်း" → ["ကျောင်း", "ကျင်း"] │
│ "ကျောင်" → ["ကျောင်း"] │
│ "ကျေင်း" → ["ကျောင်း"] ← Direct match! │
└──────────────────────────────────────────────────────┘
Lookup process:
1. Check "ကျေင်း" in index → Found! Candidate: "ကျောင်း"
2. Verify edit distance: "ကျေင်း" → "ကျောင်း" = 1 (vowel substitution)
3. Return suggestion: "ကျောင်း"
Visual Example: Word Correction
Misspelled word: "မယ်နမာ" (incorrect spelling of Myanmar)
Correct word: "မြန်မာ"
Step 1: Generate deletions from "မယ်နမာ"
┌──────────────────────────────────────┐
│ Distance 1 deletions: │
│ "ယ်နမာ", "မနမာ", "မယ်မာ", "မယ်နာ" │
├──────────────────────────────────────┤
│ Distance 2 deletions: │
│ "နမာ", "ယ်မာ", "မမာ", "မနာ", ... │
└──────────────────────────────────────┘
Step 2: Lookup in index
┌─────────────────────────────────────────────────┐
│ "မနမာ" → Not in index │
│ "နမာ" → Not in index │
│ "မမာ" → Found! → ["မြန်မာ"] │
└─────────────────────────────────────────────────┘
Step 3: Verify and rank
Candidate: "မြန်မာ"
Edit distance from "မယ်နမာ": 3
Frequency: 50,000 (very common)
→ High confidence suggestion
Myanmar-Specific Considerations
Character Clusters
Myanmar characters often form clusters (consonant + medials + vowels). SymSpell treats each Unicode code point as a unit:
Word: "ကြောင်" (cat)
Unicode breakdown:
က (U+1000) - Ka
ြ (U+103C) - Medial Ra
ေ (U+1031) - Vowel E
ာ (U+102C) - Vowel Aa
င (U+1004) - Nga
် (U+103A) - Asat
Deletions are per code point:
"ြောင်" (delete က)
"ကောင်" (delete ြ)
"ကြာင်" (delete ေ, distance 1)
"ကြေင်" (delete ာ, distance 1)
...
Common Myanmar Typos SymSpell Catches
| Error Type | Misspelled | Correct | Edit Distance |
|---|
| Missing medial | ”ကောင်" | "ကြောင်” | 1 |
| Wrong medial | ”ကျောင်" | "ကြောင်” | 2 (substitute) |
| Missing asat | ”မြနမာ" | "မြန်မာ” | 1 |
| Missing tone | ”ကြီ" | "ကြီး” | 1 |
| Extra medial | ”ကျောင့်" | "ကျောင်း” | 1 |
Syllable vs Word Level
mySpellChecker applies SymSpell at two levels:
1. Syllable Level (faster, catches 90% of errors):
# Syllable correction
"ကျေင်း" → "ကျောင်း" # Wrong vowel fix (school)
"မြန" → "မြန်" # Missing asat
2. Word Level (slower, for complex errors):
# Word correction
"မယ်နမာ" → "မြန်မာ" # Multiple character errors
"ကျေးဇူ" → "ကျေးဇူး" # Missing visarga
Time Complexity
| Operation | Traditional | SymSpell |
|---|
| Single lookup | O(N × M) | O(1) average |
| Build index | - | O(V × L^d) |
Where:
- V = vocabulary size
- L = average term length
- d = max edit distance
Space Complexity
SymSpell trades memory for speed:
Dictionary size: 100,000 words
Average deletions per word: ~50 (at distance 2)
Index size: ~5 million entries
Memory usage: ~50-300 MB (varies by dictionary size and max_edit_distance)
Benchmark: Myanmar Dictionary
| Dictionary Size | Index Build Time | Lookup Time | Memory |
|---|
| 10,000 words | 0.5s | <1ms | ~15MB |
| 50,000 words | 2s | <1ms | ~50MB |
| 100,000 words | 5s | <1ms | ~50-100MB |
Source docstring notes: “Typical Myanmar corpus (100K terms, d=2): ~50-100MB index.” Memory grows with max_edit_distance — values above 2 cause exponential growth.
Configuration
SpellCheckerConfig Options
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SymSpellConfig
config = SpellCheckerConfig(
# Maximum edit distance for suggestions
max_edit_distance=2, # Default: 2 (covers 99% of typos)
# SymSpell-specific settings
symspell=SymSpellConfig(
prefix_length=10, # Optimize index size (default)
)
)
checker = SpellChecker(config=config)
Edit Distance Guidelines
| Distance | Coverage | Speed | Use Case |
|---|
| 1 | ~85% | Fastest | Real-time typing |
| 2 | ~99% | Fast | Default, recommended |
| 3 | ~99.9% | Slower | Batch processing |
Prefix Length
The prefix_length parameter optimizes memory by only indexing the first N characters:
# prefix_length=10 (default in SymSpell class)
# Only generates deletions for first N characters
# Reduces index size by ~40% with minimal accuracy loss
# For Myanmar (shorter words), 5-7 is optimal
config = SpellCheckerConfig(
symspell=SymSpellConfig(prefix_length=5)
)
Implementation Details
Index Structure
# Internal index structure
# Maps delete variations to sets of (original_term, edit_distance) tuples
self._deletes: Dict[str, Set[Tuple[str, int]]] = {
"မန်မာ": {
("မြန်မာ", 1), # original term, deletion distance
},
"ကောင်": {
("ကြောင်", 1),
("ကောင်း", 1),
},
...
}
Lookup Algorithm
def lookup(word: str, max_distance: int = 2) -> list[Suggestion]:
candidates = set()
# Check the word itself
if word in dictionary:
candidates.add(word)
# Generate and check deletions
for deletion in generate_deletes(word, max_distance):
if deletion in index:
for original in index[deletion]:
# Verify actual edit distance
dist = edit_distance(word, original)
if dist <= max_distance:
candidates.add((original, dist))
# Rank using multi-factor scoring (edit distance, frequency,
# phonetic similarity, syllable match, etc.) via SuggestionRanker
return ranker.rank(candidates)
Suggestion Dataclass
Each suggestion returned by lookup() includes:
| Field | Type | Description |
|---|
term | str | The suggested correction |
edit_distance | int | Damerau-Levenshtein distance from input |
frequency | int | Corpus frequency of the term |
phonetic_score | float | Phonetic similarity score (0.0-1.0) |
syllable_distance | float | None | Myanmar syllable-aware weighted distance |
weighted_distance | float | None | Myanmar-weighted edit distance using substitution costs |
is_nasal_variant | bool | True if nasal ending difference only (န်↔ံ) |
has_same_nasal_ending | bool | True if same nasal consonant as input |
Additional Methods
| Method | Description |
|---|
build_index(levels) | Build delete index for specified levels (["syllable"], ["word"], or both) |
lookup_compound(term) | Compound word segmentation with suggestions |
compute_frequency_denominator(provider, level, percentile) | Static method to compute optimal frequency denominator from corpus statistics |
Comparison with Other Algorithms
| Algorithm | Lookup Speed | Index Size | Accuracy |
|---|
| Brute Force | O(N×M) | Small | 100% |
| BK-Tree | O(log N) | Medium | 100% |
| SymSpell | O(1) | Large | 100% |
| Norvig’s | O(26^d × M) | Small | 100% |
SymSpell is the fastest for dictionary-based spell checking, making it ideal for real-time applications.
See Also