Skip to main content
When two Myanmar strings look similar but differ by a character or two, the library needs a way to quantify that difference. mySpellChecker implements three edit distance variants — classic Levenshtein, Damerau-Levenshtein with transpositions, and a weighted version tuned for common Myanmar character confusions — each backed by optional Cython acceleration.

Overview

Edit distance measures how many operations are needed to transform one string into another. mySpellChecker implements several algorithms optimized for Myanmar text.

Algorithms

Levenshtein Distance

The classic edit distance algorithm with three operations:
  • Insertion: Add a character
  • Deletion: Remove a character
  • Substitution: Replace a character
from myspellchecker.algorithms.distance.edit_distance import levenshtein_distance

distance = levenshtein_distance("မြန်", "မြမ်")
# Result: 1 (one substitution: န် → မ်)
Note: Always import from edit_distance (the Python wrapper), not edit_distance_c (the Cython module) directly. The wrapper automatically uses the Cython implementation when available and falls back to a pure Python implementation when Cython extensions are not compiled. This ensures your code works on all platforms.

Damerau-Levenshtein Distance

Extends Levenshtein with transposition (swap adjacent characters):
from myspellchecker.algorithms.distance.edit_distance import damerau_levenshtein_distance

# Transposition example
distance = damerau_levenshtein_distance("AB", "BA")
# Result: 1 (one transposition)

# Without transposition (Levenshtein)
distance = levenshtein_distance("AB", "BA")
# Result: 2 (substitute A→B, substitute B→A)

Weighted Damerau-Levenshtein

Uses custom costs for Myanmar-specific character confusions:
from myspellchecker.algorithms.distance.edit_distance import (
    weighted_damerau_levenshtein_distance,
)

# Myanmar-specific substitution costs are loaded automatically at import time.
# Common medial confusion: ြ (U+103C) vs ျ (U+103B)
# Cost is 0.3 instead of 1.0
distance = weighted_damerau_levenshtein_distance("ကြ", "ကျ")
# Result: 0.3

Myanmar-Specific Substitution Costs

The weighted algorithm uses reduced costs for commonly confused characters:
Confusion TypeCharactersCost
NYA variantsည ↔ ဉ0.1
Vowel lengthိ ↔ ီ, ု ↔ ူ0.2
Ra/Ya confusionရ ↔ ယ0.3
Medial confusionျ ↔ ြ0.3
Medial confusionွ ↔ ှ0.4
Aspirated pairsက ↔ ခ, စ ↔ ဆ, etc.0.4

Setting Custom Costs

Custom substitution costs are defined in the MYANMAR_SUBSTITUTION_COSTS dictionary within edit_distance.py and applied automatically at module import time via the Cython extension.

Implementation Details

UTF-8 Handling

The Cython implementation properly handles Myanmar Unicode:
# UTF-8 to codepoints conversion
cdef vector[int] utf8_to_codepoints(string s):
    # Handles 1-4 byte UTF-8 sequences
    # Myanmar characters are 3-byte sequences (U+1000-U+109F)

Memory Optimization

Uses row-based dynamic programming for O(min(m,n)) space:
# Space: O(n) where n = len(shorter string)
# Time: O(m*n)

Performance

The Cython implementation provides significant speedup:
ImplementationSpeedNotes
Pure Python~100μs/pairReference implementation
Cython~10μs/pair10x faster, uses C++ unordered_map

Benchmark

import time
from myspellchecker.algorithms.distance.edit_distance import levenshtein_distance

# Benchmark 10,000 comparisons
start = time.time()
for _ in range(10000):
    levenshtein_distance("မြန်မာနိုင်ငံ", "မြနမ်ာနိုင်ငံ")
elapsed = time.time() - start
print(f"10K comparisons: {elapsed:.3f}s")
# Typical result: ~0.1s (Cython) vs ~1.0s (Python)

Integration with SymSpell

Edit distance is used by SymSpell for suggestion ranking:
from myspellchecker.algorithms.symspell import SymSpell

symspell = SymSpell(provider, max_edit_distance=2)

# Suggestions are ranked by edit distance
suggestions = symspell.lookup("မြနမ်ာ", level='word')
for suggestion in suggestions:
    print(f"{suggestion.term}: distance={suggestion.edit_distance}")

Pure Python Fallback

If Cython extensions aren’t available, pure Python is used:
from myspellchecker.algorithms.distance.edit_distance import (
    levenshtein_distance,
    damerau_levenshtein_distance,
)

# Same API, slower performance
distance = levenshtein_distance("မြန်", "မြမ်")

See Also