Myanmar text processing faces challenges that don’t exist in Latin-script languages, including phonetically similar syllables with different spellings, complex suffix patterns for verb conjugation, and character encoding edge cases. These utilities address those challenges and are used internally throughout the validation pipeline.
Stemmer
The Stemmer provides rule-based stemming to strip common suffixes from Myanmar words, identifying their root forms.
Use Cases
- Identifying OOV words that are conjugated forms of known words
- Aggregating statistics for root words
- Improving POS tagging by mapping to root POS
Usage
from myspellchecker.text.stemmer import Stemmer
stemmer = Stemmer()
# Stem a word
root, suffixes = stemmer.stem("စားနေသည်")
print(root) # "စား"
print(suffixes) # ["နေ", "သည်"]
# More examples
root, suffixes = stemmer.stem("လာခဲ့ပြီ")
print(root) # "လာ"
print(suffixes) # ["ခဲ့", "ပြီ"]
The stemmer includes several performance optimizations:
- LRU caching for frequently stemmed words
- Pre-computed suffix list sorted by length for optimal matching
- O(n) suffix collection using append + reverse pattern
# Check cache statistics
info = stemmer.cache_info()
print(f"Cache hits: {info['hits']}")
print(f"Cache misses: {info['misses']}")
print(f"Cache size: {info['currsize']}/{info['maxsize']}")
# Clear cache if needed
stemmer.clear_cache()
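The optimizations listed above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the suffix list `SUFFIXES` and the function `stem_sketch` are hypothetical names invented for this example, and the real suffix inventory is presumably much larger. The sketch shows longest-suffix-first matching, collecting suffixes with append followed by a single reverse, and caching through `functools.lru_cache`.

```python
from functools import lru_cache
from typing import Tuple

# Hypothetical suffix inventory, for illustration only.
# Sorting by length (longest first) means the longest match wins.
SUFFIXES = sorted(["နေ", "သည်", "ခဲ့", "ပြီ", "တယ်"], key=len, reverse=True)


@lru_cache(maxsize=4096)
def stem_sketch(word: str) -> Tuple[str, Tuple[str, ...]]:
    """Strip known suffixes from the end of `word`, one at a time."""
    stripped = []  # suffixes in the order they were removed (last suffix first)
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            # Never strip the whole word: a root must remain.
            if word.endswith(suffix) and len(word) > len(suffix):
                stripped.append(suffix)  # O(1) append ...
                word = word[: -len(suffix)]
                changed = True
                break
    stripped.reverse()  # ... then one reverse to restore reading order
    return word, tuple(stripped)


root, suffixes = stem_sketch("စားနေသည်")
print(root, suffixes)
print(stem_sketch.cache_info())  # same fields as the cache statistics shown above
```

Appending each stripped suffix and reversing once at the end avoids repeated insertion at the front of the list, which is what keeps suffix collection linear.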
Configuration
from myspellchecker.core.config.text_configs import StemmerConfig
config = StemmerConfig(cache_size=2048) # Default: 4096
stemmer = Stemmer(config=config)
Phonetic Hasher
The PhoneticHasher generates phonetic codes for Myanmar text, enabling fuzzy matching based on pronunciation.
Features
- Groups phonetically similar characters
- Normalizes tone markers and medials
- Handles visual confusability
- LRU caching for performance
Basic Usage
from myspellchecker.text.phonetic import PhoneticHasher
hasher = PhoneticHasher()
# Generate phonetic code
code = hasher.encode("မြန်မာ")
print(code) # 'p-medial_r-vowel_a-n-p-vowel_a'
# Check similarity
code1 = hasher.encode("မြန်")
code2 = hasher.encode("မျန်") # Wrong medial (ya instead of ra)
is_similar = hasher.similar(code1, code2, max_distance=2) # default: 1
print(is_similar) # True
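To make the idea of phonetic grouping plus distance-based similarity concrete, here is a rough sketch under stated assumptions. The class table `PHONETIC_CLASS`, the dropped-mark set, and the functions `encode_sketch`, `edit_distance`, and `similar_sketch` are all hypothetical; the library's real codes (for example, how it represents inherent vowels) differ from what this produces.

```python
from typing import List

# Hypothetical phonetic classes, for illustration only.
PHONETIC_CLASS = {
    "\u1015": "p", "\u1016": "p", "\u1017": "p",  # labials: pa, pha, ba
    "\u1018": "p", "\u1019": "p",                 # labials: bha, ma
    "\u1014": "n",                                # na
    "\u103b": "medial_y",                         # medial ya
    "\u103c": "medial_r",                         # medial ra
    "\u102c": "vowel_a",                          # vowel sign aa
}
# Marks ignored in this sketch: asat, dot below, visarga.
DROPPED = {"\u103a", "\u1037", "\u1038"}


def encode_sketch(text: str) -> List[str]:
    """Map each character to its phonetic class; unknown characters map to themselves."""
    return [PHONETIC_CLASS.get(ch, ch) for ch in text if ch not in DROPPED]


def edit_distance(a: List[str], b: List[str]) -> int:
    """Token-level Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similar_sketch(word1: str, word2: str, max_distance: int = 1) -> bool:
    return edit_distance(encode_sketch(word1), encode_sketch(word2)) <= max_distance


correct = "\u1019\u103c\u1014\u103a"       # ma + medial ra + na + asat
wrong_medial = "\u1019\u103b\u1014\u103a"  # same word with medial ya
print("-".join(encode_sketch(correct)))
print(similar_sketch(correct, wrong_medial))
```

Comparing class tokens rather than raw characters is what lets two spellings that differ only within a phonetic group come out at distance zero.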
Generate Phonetic Variants
# Get phonetically similar variants
variants = hasher.get_phonetic_variants("မြန်")
print(variants) # {'မြန်', 'မျန်', 'ဗြန်', 'ပြန်', ...}
# Get tonal variants (critical for real-word errors)
tonal_variants = hasher.get_tonal_variants("ကား")
print(tonal_variants) # {'ကား', 'ကာ', 'က'}
Configuration
hasher = PhoneticHasher(
    ignore_tones=True,      # Ignore tone marks (default: True)
    normalize_length=True,  # Treat short and long vowels as the same
    max_code_length=10,     # Maximum code length
    adaptive_length=True,   # Extend the code for compound words
    cache_size=4096,        # LRU cache size (0 to disable)
)
Tone Disambiguator
The ToneDisambiguator uses context to resolve tone-ambiguous words in Myanmar text.
Myanmar Tone System
| Tone | Marker | Example |
|---|---|---|
| Low | unmarked (short vowel) | ငါ (I/me) |
| High | း (visarga) | ငါး (five/fish) |
| Creaky | ့ (aukmyit/dot below) | လေ့ (habit/practice) |
| Checked | final ် (asat) | သပ် (sparse) |
Common Ambiguities
| Word | Meanings |
|---|---|
| ငါ / ငါး | I/me vs five/fish |
| တော / တော့ | forest vs (particle, emphasis) |
| တော / တော် | forest vs royal/suitable |
| ပဲ | only/just vs bean |
Usage
from myspellchecker.text.tone import ToneDisambiguator, create_disambiguator
disambiguator = ToneDisambiguator()
# Or use factory
disambiguator = create_disambiguator()
Context-Based Disambiguation
# Disambiguate using context
words = ["ငါ", "ကောင်", "ကြော်", "စားတယ်"] # fish context
result = disambiguator.disambiguate(words, 0) # Check word at index 0
if result:
    correct_form, meaning, confidence = result
    print(f"Should be: {correct_form} ({meaning})")
    print(f"Confidence: {confidence:.2f}")
# Output:
# Should be: ငါး (fish)
# Confidence: 0.75
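One way to picture context-based disambiguation is as a count of cue words inside a window around the ambiguous word. The sketch below is an assumption-laden illustration, not the library's algorithm: the cue table `CONTEXT_CUES`, the scoring formula (matching neighbours divided by neighbours in the window), and the function `disambiguate_sketch` are hypothetical. Only the `context_window` and `min_confidence` parameter names are borrowed from the configuration documented on this page.

```python
from typing import List, Optional, Tuple

# Hypothetical cue table: ambiguous form -> [(candidate, meaning, cue words)].
CONTEXT_CUES = {
    "ငါ": [("ငါး", "fish", {"ကောင်", "ကြော်"})],
}


def disambiguate_sketch(
    words: List[str],
    index: int,
    context_window: int = 3,
    min_confidence: float = 0.2,
) -> Optional[Tuple[str, str, float]]:
    """Return (candidate, meaning, confidence) or None if nothing clears the threshold."""
    target = words[index]
    start = max(0, index - context_window)
    end = min(len(words), index + context_window + 1)
    neighbours = words[start:index] + words[index + 1:end]
    if not neighbours:
        return None
    best = None
    for candidate, meaning, cues in CONTEXT_CUES.get(target, []):
        # Confidence here is simply the share of neighbours that are cue words.
        confidence = sum(w in cues for w in neighbours) / len(neighbours)
        if confidence >= min_confidence and (best is None or confidence > best[2]):
            best = (candidate, meaning, confidence)
    return best


example_words = ["ငါ", "ကောင်", "ကြော်", "စားတယ်"]
example = disambiguate_sketch(example_words, 0)
print(example)
```

The confidence values this toy formula yields are not comparable to the library's; it only shows how a window and a threshold interact.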
Check Full Sentence
# Check entire sentence for tone corrections
words = ["ငါ", "သုံ", "ကောင်", "စားတယ်"]
corrections = disambiguator.check_sentence(words)
for index, original, suggestion, confidence in corrections:
    print(f"Position {index}: {original} → {suggestion} ({confidence:.2f})")
# Output:
# Position 0: ငါ → ငါး (0.75) # fish
# Position 1: သုံ → သုံး (0.85) # three
Configuration
from myspellchecker.core.config.text_configs import ToneConfig
config = ToneConfig(
    context_window=3,    # Words to check on each side
    min_confidence=0.2,  # Minimum confidence threshold (default: 0.2)
)
disambiguator = ToneDisambiguator(config=config)
Zawgyi Support
Zawgyi is a legacy font encoding for Myanmar script that is not compatible with standard Unicode. The library detects Zawgyi-encoded text and converts it to Unicode.
Detection
from myspellchecker.text.normalize import (
    is_likely_zawgyi,
    detect_encoding,
    convert_zawgyi_to_unicode,
)
# Quick check (returns Tuple[bool, float])
is_zawgyi, confidence = is_likely_zawgyi("ျမန္မာ") # (True, 0.99)
is_zawgyi, confidence = is_likely_zawgyi("မြန်မာ") # (False, 0.01)
# Get detailed detection
encoding, confidence = detect_encoding("ျမန္မာ")
print(encoding) # "zawgyi" or "unicode"
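For intuition about why detection is possible at all, the sketch below checks a single ordering signal. In Unicode order, the vowel sign E (U+1031) is stored after its consonant, while Zawgyi stores it before; Zawgyi also uses U+103B as a prefix medial ra placed before the consonant. This is only a rough heuristic with false negatives, not how the library computes its confidence score, and the function name `looks_like_zawgyi_sketch` is hypothetical.

```python
VOWEL_SIGN_E = "\u1031"
PREFIX_MEDIAL = "\u103b"  # medial ya in Unicode; prefix medial ra in Zawgyi


def _is_consonant(ch: str) -> bool:
    # Myanmar consonants KA..A, plus GREAT SA.
    return "\u1000" <= ch <= "\u1021" or ch == "\u103f"


def _is_medial(ch: str) -> bool:
    return "\u103b" <= ch <= "\u103e"


def looks_like_zawgyi_sketch(text: str) -> bool:
    """True if a dependent sign appears where Unicode ordering does not allow it."""
    prev = ""
    for ch in text:
        # Unicode: vowel sign E must follow a consonant or a medial.
        if ch == VOWEL_SIGN_E and not (_is_consonant(prev) or _is_medial(prev)):
            return True
        # Unicode: U+103B must follow a consonant.
        if ch == PREFIX_MEDIAL and not _is_consonant(prev):
            return True
        prev = ch
    return False


zawgyi_sample = "\u103b\u1019\u1014\u1039\u1019\u102c"   # Zawgyi-ordered input
unicode_sample = "\u1019\u103c\u1014\u103a\u1019\u102c"  # Unicode-ordered input
print(looks_like_zawgyi_sketch(zawgyi_sample))
print(looks_like_zawgyi_sketch(unicode_sample))
```

A production detector would combine many such signals into a probability, which is presumably what the confidence value returned by `is_likely_zawgyi` reflects.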
Conversion
The library includes built-in Zawgyi to Unicode conversion:
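# Detection and conversion workflow
# `text` is the input string; `checker` is a SpellChecker instance.
# is_likely_zawgyi returns (bool, confidence), so unpack it before testing:
# a non-empty tuple is always truthy.
is_zawgyi, _confidence = is_likely_zawgyi(text)
if is_zawgyi:
    text = convert_zawgyi_to_unicode(text)
result = checker.check(text)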
Text Validation
Validate Myanmar text structure using module-level functions:
from myspellchecker.text.validator import validate_word, validate_text
# Quick boolean check for a single word
is_valid = validate_word("ကျောင်း") # True
is_valid = validate_word("ေကာင္း") # False (Zawgyi artifact)
# Detailed validation with issue descriptions
result = validate_text("မြန်မာ")
print(result.is_valid) # True
print(result.issues) # [] (empty if valid)
Normalization
Text normalization for consistent processing:
from myspellchecker.text.normalize import normalize
from myspellchecker.text.normalize_c import remove_zero_width_chars
# Basic normalization
normalized = normalize("မြန်\u200bမာ")  # Removes the zero-width space (U+200B)
print(normalized) # "မြန်မာ"
# Remove zero-width characters (Cython function)
clean = remove_zero_width_chars("hello\u200bworld")  # Zero-width space removed: "helloworld"
Cython Optimization
Normalization has a Cython-optimized version for performance:
# Automatic fallback pattern
try:
    from myspellchecker.text.normalize_c import remove_zero_width_chars
except ImportError:
    # Pure Python fallback (normalize.py handles this internally)
    pass
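For readers curious what a pure-Python fallback could look like, here is a minimal sketch. The function name `remove_zero_width_chars_py` and the exact set of code points are assumptions made for this example; the library's own fallback may cover a different set.

```python
# Assumed set of zero-width code points: ZWSP, ZWNJ, ZWJ, WORD JOINER, BOM.
# Mapping a code point to None tells str.translate to delete it.
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def remove_zero_width_chars_py(text: str) -> str:
    """Delete zero-width characters in a single pass over the string."""
    return text.translate(_ZERO_WIDTH)


print(remove_zero_width_chars_py("hello\u200bworld"))
```

Note that ZWNJ and ZWJ carry meaning in some scripts, so whether to strip them is a design decision; this sketch removes them unconditionally.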
Integration
All text utilities integrate with the main spell checker:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
config = SpellCheckerConfig(
    # These utilities are used internally
    use_context_checker=True,        # Uses tone disambiguation
    use_rule_based_validation=True,  # Uses syllable rule validation
)
checker = SpellChecker(config=config)
Performance tips:
- Enable caching: All utilities support LRU caching
- Batch operations: Use batch methods when processing many texts
- Adjust cache sizes: Increase for high-throughput scenarios
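# High-performance configuration
from myspellchecker.core.config.text_configs import StemmerConfig
from myspellchecker.text.phonetic import PhoneticHasher
from myspellchecker.text.stemmer import Stemmer
stemmer = Stemmer(config=StemmerConfig(cache_size=4096))  # 4096 is also the documented default
hasher = PhoneticHasher(cache_size=8192)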