Overview
Edit distance measures how many operations are needed to transform one string into another. mySpellChecker implements several algorithms optimized for Myanmar text.Algorithms
Levenshtein Distance
The classic edit distance algorithm with three operations:- Insertion: Add a character
- Deletion: Remove a character
- Substitution: Replace a character
Note: Always import fromedit_distance(the Python wrapper), notedit_distance_c(the Cython module) directly. The wrapper automatically uses the Cython implementation when available and falls back to a pure Python implementation when Cython extensions are not compiled. This ensures your code works on all platforms.
Damerau-Levenshtein Distance
Extends Levenshtein with transposition (swap adjacent characters):Weighted Damerau-Levenshtein
Uses custom costs for Myanmar-specific character confusions:Myanmar-Specific Substitution Costs
The weighted algorithm uses reduced costs for commonly confused characters:| Confusion Type | Characters | Cost |
|---|---|---|
| NYA variants | ည ↔ ဉ | 0.1 |
| Vowel length | ိ ↔ ီ, ု ↔ ူ | 0.2 |
| Ra/Ya confusion | ရ ↔ ယ | 0.3 |
| Medial confusion | ျ ↔ ြ | 0.3 |
| Medial confusion | ွ ↔ ှ | 0.4 |
| Aspirated pairs | က ↔ ခ, စ ↔ ဆ, etc. | 0.4 |
Setting Custom Costs
Custom substitution costs are defined in theMYANMAR_SUBSTITUTION_COSTS dictionary within edit_distance.py and applied automatically at module import time via the Cython extension.
Implementation Details
UTF-8 Handling
The Cython implementation properly handles Myanmar Unicode:Memory Optimization
Uses row-based dynamic programming for O(min(m,n)) space:Performance
The Cython implementation provides significant speedup:| Implementation | Speed | Notes |
|---|---|---|
| Pure Python | ~100μs/pair | Reference implementation |
| Cython | ~10μs/pair | 10x faster, uses C++ unordered_map |
Benchmark
Integration with SymSpell
Edit distance is used by SymSpell for suggestion ranking:Pure Python Fallback
If Cython extensions aren’t available, pure Python is used:See Also
- SymSpell Algorithm - Uses edit distance for suggestions
- Cython Guide - Building optimized code
- Word Validation - Suggestion generation