Edit Distance Algorithms - mySpellChecker

When two Myanmar strings look similar but differ by a character or two, the library needs a way to quantify that difference. mySpellChecker implements three edit distance variants — classic Levenshtein, Damerau-Levenshtein with transpositions, and a weighted version tuned for common Myanmar character confusions — each backed by optional Cython acceleration.

Overview

Edit distance measures how many operations are needed to transform one string into another. mySpellChecker implements several algorithms optimized for Myanmar text.

Algorithms

Levenshtein Distance

The classic edit distance algorithm with three operations:

Insertion: Add a character
Deletion: Remove a character
Substitution: Replace a character

from myspellchecker.algorithms.distance.edit_distance import levenshtein_distance

distance = levenshtein_distance("မြန်", "မြမ်")
# Result: 1 (one substitution: န် → မ်)

Note: Always import from edit_distance (the Python wrapper), not edit_distance_c (the Cython module) directly. The wrapper automatically uses the Cython implementation when available and falls back to a pure Python implementation when Cython extensions are not compiled. This ensures your code works on all platforms.

Damerau-Levenshtein Distance

Extends Levenshtein with transposition (swap adjacent characters):

from myspellchecker.algorithms.distance.edit_distance import damerau_levenshtein_distance

# Transposition example
distance = damerau_levenshtein_distance("AB", "BA")
# Result: 1 (one transposition)

# Without transposition (Levenshtein)
distance = levenshtein_distance("AB", "BA")
# Result: 2 (substitute A→B, substitute B→A)

Weighted Damerau-Levenshtein

Uses custom costs for Myanmar-specific character confusions:

from myspellchecker.algorithms.distance.edit_distance import (
    weighted_damerau_levenshtein_distance,
)

# Myanmar-specific substitution costs are loaded automatically at import time.
# Common medial confusion: ြ (U+103C) vs ျ (U+103B)
# Cost is 0.3 instead of 1.0
distance = weighted_damerau_levenshtein_distance("ကြ", "ကျ")
# Result: 0.3

Myanmar-Specific Substitution Costs

The weighted algorithm uses reduced costs for commonly confused characters:

Confusion Type	Characters	Cost
NYA variants	ည ↔ ဉ	0.1
Vowel length	ိ ↔ ီ, ု ↔ ူ	0.2
Ra/Ya confusion	ရ ↔ ယ	0.3
Medial confusion	ျ ↔ ြ	0.3
Medial confusion	ွ ↔ ှ	0.4
Aspirated pairs	က ↔ ခ, စ ↔ ဆ, etc.	0.4

Setting Custom Costs

Substitution costs come from two sources that are merged at module import time:

Hardcoded defaults: MYANMAR_SUBSTITUTION_COSTS in text/phonetic_data.py, which contains hand-crafted costs for well-known confusions.
Data-driven overrides: rules/confusion_matrix.yaml, which provides corpus-derived substitution costs loaded via load_confusion_matrix(). This YAML file can override hardcoded costs (e.g., refining ျ↔ြ from 0.3 to 0.2) and add new pairs not covered by the defaults (e.g., asat↔dot_below, ka↔ta).

The merge uses YAML-wins semantics: if both sources define the same character pair, the YAML cost takes precedence. If confusion_matrix.yaml fails to load, the library falls back to the hardcoded costs only.

Implementation Details

UTF-8 Handling

The Cython implementation properly handles Myanmar Unicode:

# UTF-8 to codepoints conversion
cdef vector[int] utf8_to_codepoints(string s):
    # Handles 1-4 byte UTF-8 sequences
    # Myanmar characters are 3-byte sequences (U+1000-U+109F)

Memory Optimization

The basic Levenshtein implementation uses row-based dynamic programming for O(min(m,n)) space. The Damerau-Levenshtein variants require the full O(m×n) matrix to support transposition lookups:

# Levenshtein: Space O(n), Time O(m*n)
# Damerau-Levenshtein: Space O(m*n), Time O(m*n)

Performance

The Cython implementation provides significant speedup:

Implementation	Speed	Notes
Pure Python	~100μs/pair	Reference implementation
Cython	~10μs/pair	10x faster, uses C++ unordered_map

Benchmark

import time
from myspellchecker.algorithms.distance.edit_distance import levenshtein_distance

# Benchmark 10,000 comparisons
start = time.time()
for _ in range(10000):
    levenshtein_distance("မြန်မာနိုင်ငံ", "မြနမ်ာနိုင်ငံ")
elapsed = time.time() - start
print(f"10K comparisons: {elapsed:.3f}s")
# Typical result: ~0.1s (Cython) vs ~1.0s (Python)

Integration with SymSpell

Edit distance is used by SymSpell for suggestion ranking:

from myspellchecker.algorithms.symspell import SymSpell

symspell = SymSpell(provider, max_edit_distance=2)

# Suggestions are ranked by edit distance
suggestions = symspell.lookup("မြနမ်ာ", level='word')
for suggestion in suggestions:
    print(f"{suggestion.term}: distance={suggestion.edit_distance}")

Pure Python Fallback

If Cython extensions aren’t available, pure Python is used:

from myspellchecker.algorithms.distance.edit_distance import (
    levenshtein_distance,
    damerau_levenshtein_distance,
)

# Same API, slower performance
distance = levenshtein_distance("မြန်", "မြမ်")

​Overview

​Algorithms

​Levenshtein Distance

​Damerau-Levenshtein Distance

​Weighted Damerau-Levenshtein

​Myanmar-Specific Substitution Costs

​Setting Custom Costs

​Implementation Details

​UTF-8 Handling

​Memory Optimization

​Performance

​Benchmark

​Integration with SymSpell

​Pure Python Fallback

​See Also