Myanmar text processing faces challenges that don’t exist in Latin-script languages, including phonetically similar syllables with different spellings, complex suffix patterns for verb conjugation, and character encoding edge cases. These utilities address those challenges and are used internally throughout the validation pipeline.

Stemmer

The Stemmer provides rule-based stemming to strip common suffixes from Myanmar words, identifying their root forms.

Use Cases

  • Identifying OOV words that are conjugated forms of known words
  • Aggregating statistics for root words
  • Improving POS tagging by mapping to root POS

Usage

from myspellchecker.text.stemmer import Stemmer

stemmer = Stemmer()

# Stem a word
root, suffixes = stemmer.stem("စားနေသည်")
print(root)      # "စား"
print(suffixes)  # ["နေ", "သည်"]

# More examples
root, suffixes = stemmer.stem("လာခဲ့ပြီ")
print(root)      # "လာ"
print(suffixes)  # ["ခဲ့", "ပြီ"]

Performance Features

  • LRU caching for frequently stemmed words
  • Pre-computed suffix list sorted by length (longest first), so longer suffixes are matched before shorter ones
  • O(n) suffix collection using append + reverse pattern
# Check cache statistics
info = stemmer.cache_info()
print(f"Cache hits: {info['hits']}")
print(f"Cache misses: {info['misses']}")
print(f"Cache size: {info['currsize']}/{info['maxsize']}")

# Clear cache if needed
stemmer.clear_cache()
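The longest-suffix-first matching and the append + reverse pattern described above can be sketched in plain Python. This is an illustrative re-implementation, not the library's code, and the suffix list here is a made-up subset:

```python
from functools import lru_cache

# Illustrative suffix subset; the real Stemmer ships a much larger rule set.
SUFFIXES = sorted(["နေ", "သည်", "ခဲ့", "ပြီ", "တယ်"], key=len, reverse=True)

@lru_cache(maxsize=4096)
def stem(word: str):
    suffixes = []
    stripped = True
    while stripped:
        stripped = False
        for suf in SUFFIXES:  # longest suffixes tried first
            if word.endswith(suf) and len(word) > len(suf):
                word = word[: -len(suf)]
                suffixes.append(suf)  # O(n): append now, reverse once at the end
                stripped = True
                break
    suffixes.reverse()  # restore left-to-right order
    # Results are cached by lru_cache, so treat the returned list as read-only.
    return word, suffixes
```

Stripping from the right collects suffixes in reverse order, so a single `reverse()` at the end restores reading order without repeated list-prepends.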

Configuration

from myspellchecker.core.config.text_configs import StemmerConfig

config = StemmerConfig(cache_size=2048)  # Default: 4096
stemmer = Stemmer(config=config)

Phonetic Hasher

The PhoneticHasher generates phonetic codes for Myanmar text, enabling fuzzy matching based on pronunciation.

Features

  • Groups phonetically similar characters
  • Normalizes tone markers and medials
  • Handles visual confusability
  • LRU caching for performance

Basic Usage

from myspellchecker.text.phonetic import PhoneticHasher

hasher = PhoneticHasher()

# Generate phonetic code
code = hasher.encode("မြန်မာ")
print(code)  # 'p-medial_r-vowel_a-n-p-vowel_a'

# Check similarity
code1 = hasher.encode("မြန်")
code2 = hasher.encode("မျန်")  # Wrong medial (ya instead of ra)
is_similar = hasher.similar(code1, code2, max_distance=2)  # max_distance defaults to 1
print(is_similar)  # True
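A plausible way to implement such a similarity check is an edit distance computed over the `-`-separated code segments, so one wrong medial costs exactly one edit. This sketch is illustrative, not the library's actual `similar` implementation:

```python
def segment_edit_distance(a: list, b: list) -> int:
    # Standard Levenshtein DP, computed over whole code segments
    # rather than characters.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete
                            curr[j - 1] + 1,          # insert
                            prev[j - 1] + (x != y)))  # substitute
        prev = curr
    return prev[-1]

def similar(code1: str, code2: str, max_distance: int = 1) -> bool:
    return segment_edit_distance(code1.split("-"), code2.split("-")) <= max_distance
```

With this scheme, the codes for မြန် and မျန် differ in only the medial segment, so they compare as similar at the default distance of 1.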

Generate Phonetic Variants

# Get phonetically similar variants
variants = hasher.get_phonetic_variants("မြန်")
print(variants)  # {'မြန်', 'မျန်', 'ဗြန်', 'ပြန်', ...}

# Get tonal variants (critical for real-word errors)
tonal_variants = hasher.get_tonal_variants("ကား")
print(tonal_variants)  # {'ကား', 'ကာ', 'က'}
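One simple way to generate such tonal variants is to peel trailing tone marks one at a time, keeping every prefix. The mark subset below is a hand-picked assumption for illustration, not the library's actual inventory:

```python
TONE_MARKS = "\u1038\u102c\u1037"  # း (visarga), ာ (aa), ့ (dot below) — illustrative subset

def get_tonal_variants(word: str) -> set:
    # Strip trailing tone/length marks one at a time, keeping each prefix.
    variants = {word}
    while word and word[-1] in TONE_MARKS:
        word = word[:-1]
        variants.add(word)
    return variants
```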

Configuration

hasher = PhoneticHasher(
    ignore_tones=True,       # Ignore tone marks (default: True)
    normalize_length=True,   # Treat short/long vowels as same
    max_code_length=10,      # Maximum code length
    adaptive_length=True,    # Extend for compound words
    cache_size=4096          # LRU cache size (0 to disable)
)

Tone Disambiguator

The ToneDisambiguator uses context to resolve tone-ambiguous words in Myanmar text.

Myanmar Tone System

| Tone | Marker | Example |
| --- | --- | --- |
| Low | unmarked (short vowel) | ငါ (I/me) |
| High | း (visarga) | ငါး (five/fish) |
| Creaky | ့ (aukmyit, dot below) | လေ့ (habit/practice) |
| Checked | final ် (asat) | သပ် (sparse) |

Common Ambiguities

| Word | Meanings |
| --- | --- |
| ငါ / ငါး | I/me vs five/fish |
| တော / တော့ | forest vs (particle, emphasis) |
| တော / တော် | forest vs royal/suitable |
| ပဲ | only/just vs bean |

Usage

from myspellchecker.text.tone import ToneDisambiguator, create_disambiguator

disambiguator = ToneDisambiguator()
# Or use factory
disambiguator = create_disambiguator()

Context-Based Disambiguation

# Disambiguate using context
words = ["ငါ", "ကောင်", "ကြော်", "စားတယ်"]  # fish context
result = disambiguator.disambiguate(words, 0)  # Check word at index 0

if result:
    correct_form, meaning, confidence = result
    print(f"Should be: {correct_form} ({meaning})")
    print(f"Confidence: {confidence:.2f}")
# Output: Should be: ငါး (fish), Confidence: 0.75
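A minimal version of this kind of context scoring can be sketched as follows. The cue table and the hit-ratio confidence formula are invented for illustration; the real ToneDisambiguator uses its own data and confidence model:

```python
# Hypothetical cue table: ambiguous word -> (corrected form, gloss, context cues)
CONTEXT_CUES = {
    "ငါ": ("ငါး", "fish", {"ကောင်", "ကြော်", "စားတယ်"}),
}

def disambiguate(words, index, context_window=3, min_confidence=0.2):
    entry = CONTEXT_CUES.get(words[index])
    if entry is None:
        return None
    correct_form, meaning, cues = entry
    lo = max(0, index - context_window)
    context = words[lo:index] + words[index + 1 : index + 1 + context_window]
    if not context:
        return None
    # Confidence = fraction of in-window words that support the correction.
    confidence = sum(w in cues for w in context) / len(context)
    return (correct_form, meaning, confidence) if confidence >= min_confidence else None
```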

Check Full Sentence

# Check entire sentence for tone corrections
words = ["ငါ", "သုံ", "ကောင်", "စားတယ်"]
corrections = disambiguator.check_sentence(words)

for index, original, suggestion, confidence in corrections:
    print(f"Position {index}: {original} → {suggestion} ({confidence:.2f})")
# Output:
# Position 0: ငါ → ငါး (0.75)  # fish
# Position 1: သုံ → သုံး (0.85) # three

Configuration

from myspellchecker.core.config.text_configs import ToneConfig

config = ToneConfig(
    context_window=3,    # Words to check on each side
    min_confidence=0.2   # Minimum confidence threshold (default: 0.2)
)
disambiguator = ToneDisambiguator(config=config)

Zawgyi Support

Zawgyi is a legacy encoding for Myanmar script. The library detects and handles Zawgyi-encoded text.

Detection

from myspellchecker.text.normalize import (
    is_likely_zawgyi,
    detect_encoding,
    convert_zawgyi_to_unicode,
)

# Quick check (returns Tuple[bool, float])
is_zawgyi, confidence = is_likely_zawgyi("ျမန္မာ")  # (True, 0.99)
is_zawgyi, confidence = is_likely_zawgyi("မြန်မာ")  # (False, 0.01)

# Get detailed detection
encoding, confidence = detect_encoding("ျမန္မာ")
print(encoding)  # "zawgyi" or "unicode"
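Detection heuristics of this kind typically look for mark sequences that Zawgyi produces but well-formed Unicode Myanmar text never does, such as a medial or vowel sign with no base consonant before it. The patterns and confidence formula below are illustrative assumptions, not the library's actual detector:

```python
import re

# Sequences impossible in well-formed Unicode Myanmar text: a medial ya
# (U+103B) or vowel sign E (U+1031) not preceded by a base consonant.
ZAWGYI_PATTERNS = [
    re.compile(r"(?:^|[^\u1000-\u1021])[\u103b\u1031]"),
    re.compile(r"\u1031\u1031"),  # doubled vowel sign E
]

def is_likely_zawgyi_sketch(text: str):
    hits = sum(1 for pattern in ZAWGYI_PATTERNS if pattern.search(text))
    confidence = hits / len(ZAWGYI_PATTERNS)
    return hits > 0, confidence
```

The Zawgyi spelling ျမန္မာ opens with a bare medial, so it trips the first pattern, while the Unicode spelling မြန်မာ matches nothing.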

Conversion

The library includes built-in Zawgyi to Unicode conversion:
# Detection and conversion workflow
is_zawgyi, _confidence = is_likely_zawgyi(text)  # returns (bool, float); a bare tuple is always truthy
if is_zawgyi:
    text = convert_zawgyi_to_unicode(text)

result = checker.check(text)

Text Validation

Validate Myanmar text structure using module-level functions:
from myspellchecker.text.validator import validate_word, validate_text

# Quick boolean check for a single word
is_valid = validate_word("ကျောင်း")  # True
is_valid = validate_word("ေကာင္း")   # False (Zawgyi artifact)

# Detailed validation with issue descriptions
result = validate_text("မြန်မာ")
print(result.is_valid)   # True
print(result.issues)     # [] (empty if valid)
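A minimal structural check of this kind can be expressed with Unicode categories: a well-formed word must not begin with a combining mark, which is the most common artifact of Zawgyi-ordered text. This sketch is illustrative, not the library's validator:

```python
import unicodedata

def starts_with_combining_mark(word: str) -> bool:
    # Dependent signs (categories Mn/Mc) must follow a base consonant;
    # a word that opens with one is structurally invalid in Unicode.
    return bool(word) and unicodedata.category(word[0]) in {"Mn", "Mc"}

def validate_word_sketch(word: str) -> bool:
    return bool(word) and not starts_with_combining_mark(word)
```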

Normalization

Text normalization for consistent processing:
from myspellchecker.text.normalize import normalize
from myspellchecker.text.normalize_c import remove_zero_width_chars

# Basic normalization
normalized = normalize("မြန်​မာ")  # Removes zero-width spaces
print(normalized)  # "မြန်မာ"

# Remove zero-width characters (Cython function)
clean = remove_zero_width_chars("hello​world")  # Zero-width space removed
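A pure-Python equivalent of the zero-width stripping looks like the following; the exact character set is an assumption (ZWNJ and ZWJ are meaningful in some scripts, so the library may treat them differently):

```python
# Common zero-width code points: ZWSP, ZWNJ, ZWJ, BOM/ZWNBSP.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def remove_zero_width_chars_py(text: str) -> str:
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)
```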

Cython Optimization

Normalization has a Cython-optimized version for performance:
# Automatic fallback pattern
try:
    from myspellchecker.text.normalize_c import remove_zero_width_chars
except ImportError:
    # Pure Python fallback (normalize.py handles this internally)
    pass

Integration

All text utilities integrate with the main spell checker:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig

config = SpellCheckerConfig(
    # These utilities are used internally
    use_context_checker=True,      # Uses tone disambiguation
    use_rule_based_validation=True,    # Uses syllable rule validation
)

checker = SpellChecker(config=config)

Performance Tips

  1. Enable caching: All utilities support LRU caching
  2. Batch operations: Use batch methods when processing many texts
  3. Adjust cache sizes: Increase for high-throughput scenarios
# High-performance configuration
stemmer = Stemmer(config=StemmerConfig(cache_size=4096))
hasher = PhoneticHasher(cache_size=8192)

See Also