Overview
Edit distance measures how many operations are needed to transform one string into another. mySpellChecker implements several algorithms optimized for Myanmar text.Algorithms
Levenshtein Distance
The classic edit distance algorithm with three operations:- Insertion: Add a character
- Deletion: Remove a character
- Substitution: Replace a character
Note: Always import fromedit_distance(the Python wrapper), notedit_distance_c(the Cython module) directly. The wrapper automatically uses the Cython implementation when available and falls back to a pure Python implementation when Cython extensions are not compiled. This ensures your code works on all platforms.
Damerau-Levenshtein Distance
Extends Levenshtein with transposition (swap adjacent characters):Weighted Damerau-Levenshtein
Uses custom costs for Myanmar-specific character confusions:Myanmar-Specific Substitution Costs
The weighted algorithm uses reduced costs for commonly confused characters:| Confusion Type | Characters | Cost |
|---|---|---|
| NYA variants | ည ↔ ဉ | 0.1 |
| Vowel length | ိ ↔ ီ, ု ↔ ူ | 0.2 |
| Ra/Ya confusion | ရ ↔ ယ | 0.3 |
| Medial confusion | ျ ↔ ြ | 0.3 |
| Medial confusion | ွ ↔ ှ | 0.4 |
| Aspirated pairs | က ↔ ခ, စ ↔ ဆ, etc. | 0.4 |
Setting Custom Costs
Substitution costs come from two sources that are merged at module import time:- Hardcoded defaults:
MYANMAR_SUBSTITUTION_COSTSintext/phonetic_data.py, which contains hand-crafted costs for well-known confusions. - Data-driven overrides:
rules/confusion_matrix.yaml, which provides corpus-derived substitution costs loaded viaload_confusion_matrix(). This YAML file can override hardcoded costs (e.g., refining ျ↔ြ from 0.3 to 0.2) and add new pairs not covered by the defaults (e.g., asat↔dot_below, ka↔ta).
confusion_matrix.yaml fails to load, the library falls back to the hardcoded costs only.
Implementation Details
UTF-8 Handling
The Cython implementation properly handles Myanmar Unicode:Memory Optimization
The basic Levenshtein implementation uses row-based dynamic programming for O(min(m,n)) space. The Damerau-Levenshtein variants require the full O(m×n) matrix to support transposition lookups:Performance
The Cython implementation provides significant speedup:| Implementation | Speed | Notes |
|---|---|---|
| Pure Python | ~100μs/pair | Reference implementation |
| Cython | ~10μs/pair | 10x faster, uses C++ unordered_map |
Benchmark
Integration with SymSpell
Edit distance is used by SymSpell for suggestion ranking:Pure Python Fallback
If Cython extensions aren’t available, pure Python is used:See Also
- SymSpell Algorithm - Uses edit distance for suggestions
- Cython Guide - Building optimized code
- Word Validation - Suggestion generation