SymSpell Algorithm - mySpellChecker

Traditional spell checking compares an input word against every dictionary entry, which is far too slow for real-time use. SymSpell avoids this by pre-computing delete variants at build time, turning each lookup into a constant-time hash table hit. This page walks through the algorithm with Myanmar language examples.

The Problem: Traditional Edit Distance

Traditional spell checking calculates the Levenshtein edit distance between an input word and every word in the dictionary:

Input: "မနမာ" (misspelled)

Compare with dictionary:
  "မြန်မာ" → edit distance = 2
  "မြန်" → edit distance = 3
  "မာန" → edit distance = 2
  ... (repeat for 100,000+ words)

Problem: This is O(N × M) where N = dictionary size, M = word length. For a 100,000-word dictionary, this means 100,000 comparisons per lookup.

The Solution: Symmetric Delete

SymSpell’s key insight: Instead of comparing words directly, pre-compute all possible deletions.

Why “Symmetric”?

If we delete characters from both the dictionary word AND the misspelled word, they’ll meet in the middle:

Dictionary word: "မြန်မာ" (Myanmar)
                      ↓ delete "ြ"
                  "မန်မာ"
                      ↑ delete "န"
Misspelled word:  "မနမာ"

Both reach the same intermediate form through deletions!

How It Works

Indexing (Build Time)

For each word in the dictionary, generate all possible deletions up to max_edit_distance (typically 2):

Dictionary word: "မြန်မာ" (Myanmar)

Generate deletes (distance 1):
  Delete position 0: "ြန်မာ"   (remove မ)
  Delete position 1: "မန်မာ"   (remove ြ)
  Delete position 2: "မြမာ"    (remove န်)
  Delete position 3: "မြန်မ"   (remove ာ)
  Delete position 4: "မြန်မာ"  (remove invisible char)

Generate deletes (distance 2):
  "န်မာ", "ြမာ", "မမာ", "မြမ", ...

Store in hash map:

Index = {
  "မန်မာ": ["မြန်မာ"],
  "ြန်မာ": ["မြန်မာ"],
  "မြမာ": ["မြန်မာ"],
  "န်မာ": ["မြန်မာ", "ကန်မာ"],  // Multiple words can map here
  ...
}

Lookup (Query Time)

When checking a misspelled word, generate its deletions too:

Input: "မနမာ" (misspelled - missing ြ and န်)

Generate deletes:
  Distance 0: "မနမာ" (the word itself)
  Distance 1: "နမာ", "မမာ", "မနာ", "မနမ"
  Distance 2: "မာ", "နာ", "မမ", ...

Match

Look up each deletion in the pre-built index:

Lookup "မနမာ" → Not found
Lookup "နမာ"  → Not found
Lookup "မမာ"  → Found! Points to ["မြန်မာ"]

Result: Found candidate “မြန်မာ” in O(1) hash lookup!

Visual Example: Syllable Correction

Misspelled syllable: "ကျေင်း" (wrong vowel: ေ instead of ော)
Correct syllable:    "ကျောင်း" (school)

Index contains:
┌──────────────────────────────────────────────────────┐
│  Deletion          →  Original Words                 │
├──────────────────────────────────────────────────────┤
│  "ကျင်း"          →  ["ကျောင်း", "ကျင်း"]          │
│  "ကျောင်"         →  ["ကျောင်း"]                    │
│  "ကျေင်း"         →  ["ကျောင်း"]  ← Direct match!   │
└──────────────────────────────────────────────────────┘

Lookup process:
1. Check "ကျေင်း" in index → Found! Candidate: "ကျောင်း"
2. Verify edit distance: "ကျေင်း" → "ကျောင်း" = 1 (vowel substitution)
3. Return suggestion: "ကျောင်း"

Visual Example: Word Correction

Misspelled word: "မယ်နမာ" (incorrect spelling of Myanmar)
Correct word:    "မြန်မာ"

Step 1: Generate deletions from "မယ်နမာ"
┌──────────────────────────────────────┐
│  Distance 1 deletions:               │
│  "ယ်နမာ", "မနမာ", "မယ်မာ", "မယ်နာ"  │
├──────────────────────────────────────┤
│  Distance 2 deletions:               │
│  "နမာ", "ယ်မာ", "မမာ", "မနာ", ...    │
└──────────────────────────────────────┘

Step 2: Lookup in index
┌─────────────────────────────────────────────────┐
│  "မနမာ" → Not in index                          │
│  "နမာ"  → Not in index                          │
│  "မမာ"  → Found! → ["မြန်မာ"]                   │
└─────────────────────────────────────────────────┘

Step 3: Verify and rank
  Candidate: "မြန်မာ"
  Edit distance from "မယ်နမာ": 3
  Frequency: 50,000 (very common)
  → High confidence suggestion

Myanmar-Specific Considerations

Character Clusters

Myanmar characters often form clusters (consonant + medials + vowels). SymSpell treats each Unicode code point as a unit:

Word: "ကြောင်" (cat)

Unicode breakdown:
  က (U+1000) - Ka
  ြ (U+103C) - Medial Ra
  ေ (U+1031) - Vowel E
  ာ (U+102C) - Vowel Aa
  င (U+1004) - Nga
  ် (U+103A) - Asat

Deletions are per code point:
  "ြောင်" (delete က)
  "ကောင်" (delete ြ)
  "ကြာင်"  (delete ေ, distance 1)
  "ကြေင်" (delete ာ, distance 1)
  ...

Common Myanmar Typos SymSpell Catches

Error Type	Misspelled	Correct	Edit Distance
Missing medial	”ကောင်"	"ကြောင်”	1
Wrong medial	”ကျောင်"	"ကြောင်”	2 (substitute)
Missing asat	”မြနမာ"	"မြန်မာ”	1
Missing tone	”ကြီ"	"ကြီး”	1
Extra medial	”ကျောင့်"	"ကျောင်း”	1

Syllable vs Word Level

mySpellChecker applies SymSpell at two levels: 1. Syllable Level (faster, catches 90% of errors):

# Syllable correction
"ကျေင်း" → "ကျောင်း"  # Wrong vowel fix (school)
"မြန" → "မြန်"          # Missing asat

2. Word Level (slower, for complex errors):

# Word correction
"မယ်နမာ" → "မြန်မာ"    # Multiple character errors
"ကျေးဇူ" → "ကျေးဇူး"    # Missing visarga

Performance Characteristics

Time Complexity

Operation	Traditional	SymSpell
Single lookup	O(N × M)	O(1) average
Build index	-	O(V × L^d)

Where:

V = vocabulary size
L = average term length
d = max edit distance

Space Complexity

SymSpell trades memory for speed:

Dictionary size: 100,000 words
Average deletions per word: ~50 (at distance 2)
Index size: ~5 million entries

Memory usage: ~50-300 MB (varies by dictionary size and max_edit_distance)

Benchmark: Myanmar Dictionary

Dictionary Size	Index Build Time	Lookup Time	Memory
10,000 words	0.5s	<1ms	~15MB
50,000 words	2s	<1ms	~50MB
100,000 words	5s	<1ms	~50-100MB

Source docstring notes: “Typical Myanmar corpus (100K terms, d=2): ~50-100MB index.” Memory grows with max_edit_distance — values above 2 cause exponential growth.

Configuration

SpellCheckerConfig Options

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SymSpellConfig

config = SpellCheckerConfig(
    # Maximum edit distance for suggestions
    max_edit_distance=2,  # Default: 2 (covers 99% of typos)

    # SymSpell-specific settings
    symspell=SymSpellConfig(
        prefix_length=10,  # Optimize index size (default)
    )
)

checker = SpellChecker(config=config)

Edit Distance Guidelines

Distance	Coverage	Speed	Use Case
1	~85%	Fastest	Real-time typing
2	~99%	Fast	Default, recommended
3	~99.9%	Slower	Batch processing

Prefix Length

The prefix_length parameter optimizes memory by only indexing the first N characters:

# prefix_length=10 (default in SymSpell class)
# Only generates deletions for first N characters
# Reduces index size by ~40% with minimal accuracy loss

# For Myanmar (shorter words), 5-7 is optimal
config = SpellCheckerConfig(
    symspell=SymSpellConfig(prefix_length=5)
)

Implementation Details

Index Structure

# Internal index structure
# Maps delete variations to sets of (original_term, edit_distance) tuples
self._deletes: Dict[str, Set[Tuple[str, int]]] = {
    "မန်မာ": {
        ("မြန်မာ", 1),  # original term, deletion distance
    },
    "ကောင်": {
        ("ကြောင်", 1),
        ("ကောင်း", 1),
    },
    ...
}

Lookup Algorithm

def lookup(word: str, max_distance: int = 2) -> list[Suggestion]:
    candidates = set()

    # Check the word itself
    if word in dictionary:
        candidates.add(word)

    # Generate and check deletions
    for deletion in generate_deletes(word, max_distance):
        if deletion in index:
            for original in index[deletion]:
                # Verify actual edit distance
                dist = edit_distance(word, original)
                if dist <= max_distance:
                    candidates.add((original, dist))

    # Rank using multi-factor scoring (edit distance, frequency,
    # phonetic similarity, syllable match, etc.) via SuggestionRanker
    return ranker.rank(candidates)

Suggestion Dataclass

Each suggestion returned by lookup() includes:

Field	Type	Description
`term`	str	The suggested correction
`edit_distance`	int	Damerau-Levenshtein distance from input
`frequency`	int	Corpus frequency of the term
`phonetic_score`	float	Phonetic similarity score (0.0-1.0)
`syllable_distance`	float \| None	Myanmar syllable-aware weighted distance
`weighted_distance`	float \| None	Myanmar-weighted edit distance using substitution costs
`is_nasal_variant`	bool	True if nasal ending difference only (န်↔ံ)
`has_same_nasal_ending`	bool	True if same nasal consonant as input

Additional Methods

Method	Description
`build_index(levels)`	Build delete index for specified levels (`["syllable"]`, `["word"]`, or both)
`lookup_compound(term)`	Compound word segmentation with suggestions
`compute_frequency_denominator(provider, level, percentile)`	Static method to compute optimal frequency denominator from corpus statistics

Comparison with Other Algorithms

Algorithm	Lookup Speed	Index Size	Accuracy
Brute Force	O(N×M)	Small	100%
BK-Tree	O(log N)	Medium	100%
SymSpell	O(1)	Large	100%
Norvig’s	O(26^d × M)	Small	100%

SymSpell is the fastest for dictionary-based spell checking, making it ideal for real-time applications.

​The Problem: Traditional Edit Distance

​The Solution: Symmetric Delete

​Why “Symmetric”?

​How It Works

​Visual Example: Syllable Correction

​Visual Example: Word Correction

​Myanmar-Specific Considerations

​Character Clusters

​Common Myanmar Typos SymSpell Catches

​Syllable vs Word Level

​Performance Characteristics

​Time Complexity

​Space Complexity

​Benchmark: Myanmar Dictionary

​Configuration

​SpellCheckerConfig Options

​Edit Distance Guidelines

​Prefix Length

​Implementation Details

​Index Structure

​Lookup Algorithm

​Suggestion Dataclass

​Additional Methods

​Comparison with Other Algorithms

​See Also