Specialized utilities for Myanmar text — rule-based stemming to extract root words, phonetic similarity matching for suggestion ranking, and encoding helpers for handling Myanmar-specific character patterns.

Stemmer

The Stemmer provides rule-based stemming to strip common suffixes from Myanmar words, identifying their root forms.

Use Cases

  • Identifying out-of-vocabulary (OOV) words that are conjugated forms of known words
  • Aggregating frequency statistics by root word
  • Improving POS tagging by mapping inflected forms to their root's part of speech
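
The suffix stripping can be sketched as longest-match-first removal against a suffix list. The rule set below is a tiny illustrative sample, not the library's actual rules:

```python
# Toy longest-match-first stemmer; the suffix list is a small illustrative
# sample, not the library's actual rule set.
SUFFIXES = sorted(["နေ", "သည်", "ခဲ့", "ပြီ", "တယ်", "မည်"], key=len, reverse=True)

def toy_stem(word):
    stripped = []
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix):
                word = word[: -len(suffix)]
                stripped.append(suffix)  # collected right-to-left
                changed = True
                break
    stripped.reverse()  # append + reverse restores left-to-right order
    return word, stripped
```

The append-then-reverse step here is the same O(n) suffix-collection pattern noted under Performance Features.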

Usage

from myspellchecker.text.stemmer import Stemmer

stemmer = Stemmer()

# Stem a word
root, suffixes = stemmer.stem("စားနေသည်")
print(root)      # "စား"
print(suffixes)  # ["နေ", "သည်"]

# More examples
root, suffixes = stemmer.stem("လာခဲ့ပြီ")
print(root)      # "လာ"
print(suffixes)  # ["ခဲ့", "ပြီ"]

Performance Features

  • LRU caching for frequently stemmed words
  • Pre-computed suffix list sorted by length for optimal matching
  • O(n) suffix collection using append + reverse pattern

Inspect or clear the stemmer's LRU cache at runtime:
# Check cache statistics
info = stemmer.cache_info()
print(f"Cache hits: {info['hits']}")
print(f"Cache misses: {info['misses']}")
print(f"Cache size: {info['currsize']}/{info['maxsize']}")

# Clear cache if needed
stemmer.clear_cache()

Configuration

from myspellchecker.core.config.text_configs import StemmerConfig

config = StemmerConfig(cache_size=2048)  # Default: 4096
stemmer = Stemmer(config=config)

Phonetic Hasher

The PhoneticHasher generates phonetic codes for Myanmar text, enabling fuzzy matching based on pronunciation.

Features

  • Groups phonetically similar characters
  • Normalizes tone markers and medials
  • Accounts for visually confusable characters
  • LRU caching for performance

Basic Usage

from myspellchecker.text.phonetic import PhoneticHasher

hasher = PhoneticHasher()

# Generate phonetic code
code = hasher.encode("မြန်မာ")
print(code)  # 'p-medial_r-vowel_a-n-p-vowel_a'

# Check similarity
code1 = hasher.encode("မြန်")
code2 = hasher.encode("မျန်")  # Wrong medial (ya instead of ra)
is_similar = hasher.similar(code1, code2, max_distance=2)  # default: 1
print(is_similar)  # True

Find Phonetically Similar Words

query = "မျန်"  # Wrong spelling
candidates = ["မြန်", "မန်", "ကျန်", "မျှ"]

results = hasher.find_phonetically_similar(query, candidates, max_results=3)
for word, score in results:
    print(f"{word}: {score:.3f}")
# Output:
# မြန်: 0.950  (very similar)
# မန်: 0.800   (similar)
# ကျန်: 0.600  (somewhat similar)
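
A score of this kind can be understood as normalized edit distance between the two phonetic codes. The following is an illustrative sketch; the library's actual scoring may weight operations differently:

```python
def edit_distance(a, b):
    """Levenshtein distance via a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def phonetic_score(code1, code2):
    """Similarity in [0, 1]: 1.0 means identical codes."""
    if not code1 and not code2:
        return 1.0
    return 1.0 - edit_distance(code1, code2) / max(len(code1), len(code2))
```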

Generate Phonetic Variants

# Get phonetically similar variants
variants = hasher.get_phonetic_variants("မြန်")
print(variants)  # {'မြန်', 'မျန်', 'ဗြန်', 'ပြန်', ...}

# Get tonal variants (critical for real-word errors)
tonal_variants = hasher.get_tonal_variants("ကား")
print(tonal_variants)  # {'ကား', 'ကာ', 'က'}
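
Conceptually, tonal variants come from toggling the tone/vowel marks on a syllable's base. A simplified sketch covering only the ာ/ား series (the real get_tonal_variants handles more marks), with a hypothetical vocabulary to show how variants flag real-word candidates:

```python
# Simplified tonal-variant generation for the ာ/ား series only;
# the real get_tonal_variants covers additional tone marks.
def toy_tonal_variants(word):
    base = word
    for mark in ("ား", "ာ"):  # strip a trailing vowel/tone mark if present
        if base.endswith(mark):
            base = base[: -len(mark)]
            break
    return {base, base + "ာ", base + "ား"}

# Real-word error detection: any variant that is also a dictionary word
# is a candidate correction (vocab here is a hypothetical word list).
vocab = {"ကာ", "ကား"}
candidates = toy_tonal_variants("ကာ") & vocab
```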

Configuration

hasher = PhoneticHasher(
    ignore_tones=True,       # Ignore tone marks (default: True)
    normalize_length=True,   # Treat short/long vowels as same
    max_code_length=10,      # Maximum code length
    adaptive_length=True,    # Extend for compound words
    cache_size=4096          # LRU cache size (0 to disable)
)

Batch Processing

texts = ["မြန်မာ", "ကျောင်း", "သားသမီး"]
codes = hasher.encode_batch(texts)
for text, code in zip(texts, codes):
    print(f"{text} -> {code}")
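
A common pattern built on batch encoding is a phonetic index that groups dictionary words by code, so a misspelling's code looks up all phonetically equivalent candidates at once. The `stub_encode` below is a placeholder standing in for `hasher.encode`, kept trivial so the example is self-contained:

```python
from collections import defaultdict

def build_phonetic_index(words, encode):
    """Group words by phonetic code for O(1) candidate lookup."""
    index = defaultdict(set)
    for word in words:
        index[encode(word)].add(word)
    return index

# stub_encode only collapses the confusable medials ja/ra; the real
# PhoneticHasher.encode produces full phonetic codes
stub_encode = lambda w: w.replace("ျ", "ြ")

index = build_phonetic_index(["မြန်", "မျန်", "မန်"], stub_encode)
# index[stub_encode("မျန်")] now holds both မြန် and မျန်
```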

Tone Disambiguator

The ToneDisambiguator uses context to resolve tone-ambiguous words in Myanmar text.

Myanmar Tone System

Tone    | Marker                  | Example
--------|-------------------------|--------------------
Low     | unmarked (short vowel)  | ငါ (I/me)
High    | း (visarga)             | ငါး (five/fish)
Creaky  | ့ (aukmyit/dot below)    | လေ့ (habit/practice)
Checked | final ် (asat)           | သပ် (sparse)

Common Ambiguities

Word        | Meanings
------------|------------------------------
ငါ / ငါး     | I/me vs. five/fish
တော / တော့  | forest vs. emphatic particle
တော / တော် | forest vs. royal/suitable
ပဲ          | only/just vs. bean

Usage

from myspellchecker.text.tone import ToneDisambiguator, create_disambiguator

disambiguator = ToneDisambiguator()
# Or use factory
disambiguator = create_disambiguator()

# Check if word is tone-ambiguous
disambiguator.is_tone_ambiguous("ငါ")  # True

# Get possible meanings
meanings = disambiguator.get_possible_meanings("ငါ")
for context_type, correct_form, meaning in meanings:
    print(f"{context_type}: {correct_form} ({meaning})")
# Output:
# pronoun: ငါ (I/me)
# fish_missing_tone: ငါး (fish)

Context-Based Disambiguation

# Disambiguate using context
words = ["ငါ", "ကောင်", "ကြော်", "စားတယ်"]  # fish context
result = disambiguator.disambiguate(words, 0)  # Check word at index 0

if result:
    correct_form, meaning, confidence = result
    print(f"Should be: {correct_form} ({meaning})")
    print(f"Confidence: {confidence:.2f}")
# Output: Should be: ငါး (fish), Confidence: 0.75

Check Full Sentence

# Check entire sentence for tone corrections
words = ["ငါ", "သုံ", "ကောင်", "စားတယ်"]
corrections = disambiguator.check_sentence(words)

for index, original, suggestion, confidence in corrections:
    print(f"Position {index}: {original} → {suggestion} ({confidence:.2f})")
# Output:
# Position 0: ငါ → ငါး (0.75)  # fish
# Position 1: သုံ → သုံး (0.85) # three
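
Context scoring of this kind can be approximated with keyword sets per reading: count context words that signal a given reading and convert hits to a confidence. The `FISH_CONTEXT` set and scoring below are hypothetical, not the library's rule set:

```python
# Hypothetical context rules: words that signal the 'fish' reading of ငါ.
# The real ToneDisambiguator ships its own, much larger rule set.
FISH_CONTEXT = {"ကောင်", "ကြော်", "စားတယ်", "ဖမ်း"}

def toy_disambiguate(words, index, window=3):
    """Return (correct_form, meaning, confidence) or None."""
    if words[index] != "ငါ":
        return None
    # words within `window` positions on either side of the target
    context = words[max(0, index - window):index] + words[index + 1:index + 1 + window]
    hits = sum(1 for w in context if w in FISH_CONTEXT)
    if hits:
        return ("ငါး", "fish", min(0.25 * hits, 1.0))
    return None
```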

Configuration

from myspellchecker.core.config.text_configs import ToneConfig

config = ToneConfig(
    context_window=3,    # Words to check on each side
    min_confidence=0.2   # Minimum confidence threshold (default: 0.2)
)
disambiguator = ToneDisambiguator(config=config)

Zawgyi Support

Zawgyi is a legacy encoding for Myanmar script. The library detects and handles Zawgyi-encoded text.

Detection

from myspellchecker.text.normalize import (
    is_likely_zawgyi,
    detect_encoding,
    convert_zawgyi_to_unicode,
)

# Quick check (returns Tuple[bool, float])
is_zawgyi, confidence = is_likely_zawgyi("ျမန္မာ")  # (True, 0.99)
is_zawgyi, confidence = is_likely_zawgyi("မြန်မာ")  # (False, 0.01)

# Get detailed detection
encoding, confidence = detect_encoding("ျမန္မာ")
print(encoding)  # "zawgyi" or "unicode"

Conversion

The library includes built-in Zawgyi to Unicode conversion:
# Detection and conversion workflow
# (is_likely_zawgyi returns a tuple, which is always truthy --
#  unpack it before branching)
is_zawgyi, confidence = is_likely_zawgyi(text)
if is_zawgyi:
    text = convert_zawgyi_to_unicode(text)

result = checker.check(text)
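
Detection exploits ordering differences between the encodings: Zawgyi places the vowel sign ေ (U+1031) and the medial ျ (U+103B) before the consonant, an ordering that is invalid in Unicode. A minimal heuristic sketch (the library's detector uses richer statistics than this):

```python
import re

# Simplified heuristic: a word beginning with ေ (U+1031) or medial ျ
# (U+103B) is characteristic of Zawgyi ordering; valid Unicode never
# starts a word with these marks.
ZAWGYI_HINT = re.compile(r"(?:^|\s)[\u1031\u103B]")

def toy_is_likely_zawgyi(text):
    return bool(ZAWGYI_HINT.search(text))
```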

Text Validation

Validate Myanmar text structure using module-level functions:
from myspellchecker.text.validator import validate_word, validate_text

# Quick boolean check for a single word
is_valid = validate_word("ကျောင်း")  # True
is_valid = validate_word("ေကာင္း")   # False (Zawgyi artifact)

# Detailed validation with issue descriptions
result = validate_text("မြန်မာ")
print(result.is_valid)   # True
print(result.issues)     # [] (empty if valid)

Normalization

Text normalization for consistent processing:
from myspellchecker.text.normalize import normalize
from myspellchecker.text.normalize_c import remove_zero_width_chars

# Basic normalization
normalized = normalize("မြန်​မာ")  # Removes zero-width spaces
print(normalized)  # "မြန်မာ"

# Remove zero-width characters (Cython function)
clean = remove_zero_width_chars("hello​world")  # Zero-width space removed

Cython Optimization

Normalization has a Cython-optimized version for performance:
# Automatic fallback pattern
try:
    from myspellchecker.text.normalize_c import remove_zero_width_chars
except ImportError:
    # Pure Python fallback (normalize.py handles this internally)
    pass
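
For reference, a pure-Python equivalent of the zero-width stripping might look like the sketch below; the actual fallback in normalize.py may differ:

```python
# Zero-width characters commonly found in pasted Myanmar text
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}  # ZWSP, ZWNJ, ZWJ, BOM

def strip_zero_width(text):
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)
```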

Integration

All text utilities integrate with the main spell checker:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig

config = SpellCheckerConfig(
    # These utilities are used internally
    use_context_checker=True,      # Uses tone disambiguation
    use_rule_based_validation=True,    # Uses syllable rule validation
)

checker = SpellChecker(config=config)

Performance Tips

  1. Enable caching: All utilities support LRU caching
  2. Batch operations: Use batch methods when processing many texts
  3. Adjust cache sizes: Increase for high-throughput scenarios
# High-performance configuration
stemmer = Stemmer(config=StemmerConfig(cache_size=4096))
hasher = PhoneticHasher(cache_size=8192)

See Also