Myanmar text processing faces challenges that don’t exist in Latin-script languages, including phonetically similar syllables with different spellings, complex suffix patterns for verb conjugation, and character encoding edge cases. These utilities address those challenges and are used internally throughout the validation pipeline.

Stemmer

The Stemmer provides rule-based stemming to strip common suffixes from Myanmar words, identifying their root forms.

Use Cases

  • Identifying OOV words that are conjugated forms of known words
  • Aggregating statistics for root words
  • Improving POS tagging by mapping to root POS

Usage

from myspellchecker.text.stemmer import Stemmer

stemmer = Stemmer()

# Stem a word
root, suffixes = stemmer.stem("စားနေသည်")
print(root)      # "စား"
print(suffixes)  # ["နေ", "သည်"]

# More examples
root, suffixes = stemmer.stem("လာခဲ့ပြီ")
print(root)      # "လာ"
print(suffixes)  # ["ခဲ့", "ပြီ"]

Performance Features

  • LRU caching for frequently stemmed words
  • Pre-computed suffix list sorted by length (longest first), so longer suffixes are matched before shorter ones
  • O(n) suffix collection using append + reverse pattern
# Check cache statistics
info = stemmer.cache_info()
print(f"Cache hits: {info['hits']}")
print(f"Cache misses: {info['misses']}")
print(f"Cache size: {info['currsize']}/{info['maxsize']}")

# Clear cache if needed
stemmer.clear_cache()
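The longest-suffix-first matching and the append + reverse pattern described above can be sketched in plain Python. This is an illustrative re-implementation, not the library's code, and the suffix list here is a made-up subset:

```python
from functools import lru_cache

# Illustrative suffix subset; the real Stemmer ships a much larger rule set.
SUFFIXES = sorted(["နေ", "သည်", "ခဲ့", "ပြီ", "တယ်"], key=len, reverse=True)

@lru_cache(maxsize=4096)
def stem(word: str):
    suffixes = []
    stripped = True
    while stripped:
        stripped = False
        for suf in SUFFIXES:  # longest suffixes tried first
            if word.endswith(suf) and len(word) > len(suf):
                word = word[: -len(suf)]
                suffixes.append(suf)  # O(n): append now, reverse once at the end
                stripped = True
                break
    suffixes.reverse()  # restore left-to-right order
    # Results are cached by lru_cache, so treat the returned list as read-only.
    return word, suffixes
```

Stripping from the right collects suffixes in reverse order, so a single `reverse()` at the end restores reading order without repeated list-prepends.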

Configuration

from myspellchecker.core.config.text_configs import StemmerConfig

config = StemmerConfig(cache_size=2048)  # Default: 4096
stemmer = Stemmer(config=config)

Phonetic Hasher

The PhoneticHasher generates phonetic codes for Myanmar text, enabling fuzzy matching based on pronunciation.

Features

  • Groups phonetically similar characters
  • Normalizes tone markers and medials
  • Handles visual confusability
  • LRU caching for performance

Basic Usage

from myspellchecker.text.phonetic import PhoneticHasher

hasher = PhoneticHasher()

# Generate phonetic code
code = hasher.encode("မြန်မာ")
print(code)  # 'p-medial_r-vowel_a-n-p-vowel_a'

# Check similarity
code1 = hasher.encode("မြန်")
code2 = hasher.encode("မျန်")  # Wrong medial (ya instead of ra)
is_similar = hasher.similar(code1, code2, max_distance=2)  # max_distance defaults to 1
print(is_similar)  # True
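A plausible way to implement such a similarity check is an edit distance computed over the `-`-separated code segments, so one wrong medial costs exactly one edit. This sketch is illustrative, not the library's actual `similar` implementation:

```python
def segment_edit_distance(a: list, b: list) -> int:
    # Standard Levenshtein DP, computed over whole code segments
    # rather than characters.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete
                            curr[j - 1] + 1,          # insert
                            prev[j - 1] + (x != y)))  # substitute
        prev = curr
    return prev[-1]

def similar(code1: str, code2: str, max_distance: int = 1) -> bool:
    return segment_edit_distance(code1.split("-"), code2.split("-")) <= max_distance
```

With this scheme, the codes for မြန် and မျန် differ in only the medial segment, so they compare as similar at the default distance of 1.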

Generate Phonetic Variants

# Get phonetically similar variants
variants = hasher.get_phonetic_variants("မြန်")
print(variants)  # {'မြန်', 'မျန်', 'ဗြန်', 'ပြန်', ...}

# Get tonal variants (critical for real-word errors)
tonal_variants = hasher.get_tonal_variants("ကား")
print(tonal_variants)  # {'ကား', 'ကာ', 'က'}
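One simple way to generate such tonal variants is to peel trailing tone marks one at a time, keeping every prefix. The mark subset below is a hand-picked assumption for illustration, not the library's actual inventory:

```python
TONE_MARKS = "\u1038\u102c\u1037"  # း (visarga), ာ (aa), ့ (dot below) — illustrative subset

def get_tonal_variants(word: str) -> set:
    # Strip trailing tone/length marks one at a time, keeping each prefix.
    variants = {word}
    while word and word[-1] in TONE_MARKS:
        word = word[:-1]
        variants.add(word)
    return variants
```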

Configuration

hasher = PhoneticHasher(
    ignore_tones=True,       # Ignore tone marks (default: True)
    normalize_length=True,   # Treat short/long vowels as same
    max_code_length=10,      # Maximum code length
    adaptive_length=True,    # Extend for compound words
    cache_size=4096          # LRU cache size (0 to disable)
)

Tone Disambiguator

The ToneDisambiguator uses context to resolve tone-ambiguous words in Myanmar text.

Myanmar Tone System

| Tone | Marker | Example |
| --- | --- | --- |
| Low | unmarked (short vowel) | ငါ (I/me) |
| High | း (visarga) | ငါး (five/fish) |
| Creaky | ့ (aukmyit, dot below) | လေ့ (habit/practice) |
| Checked | final ် (asat) | သပ် (sparse) |

Common Ambiguities

| Word | Meanings |
| --- | --- |
| ငါ / ငါး | I/me vs five/fish |
| တော / တော့ | forest vs (particle, emphasis) |
| တော / တော် | forest vs royal/suitable |
| ပဲ | only/just vs bean |

Usage

from myspellchecker.text.tone import ToneDisambiguator, create_disambiguator

disambiguator = ToneDisambiguator()
# Or use factory
disambiguator = create_disambiguator()

Context-Based Disambiguation

# Disambiguate using context
words = ["ငါ", "ကောင်", "ကြော်", "စားတယ်"]  # fish context
result = disambiguator.disambiguate(words, 0)  # Check word at index 0

if result:
    correct_form, meaning, confidence = result
    print(f"Should be: {correct_form} ({meaning})")
    print(f"Confidence: {confidence:.2f}")
# Output: Should be: ငါး (fish), Confidence: 0.75
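A minimal version of this kind of context scoring can be sketched as follows. The cue table and the hit-ratio confidence formula are invented for illustration; the real ToneDisambiguator uses its own data and confidence model:

```python
# Hypothetical cue table: ambiguous word -> (corrected form, gloss, context cues)
CONTEXT_CUES = {
    "ငါ": ("ငါး", "fish", {"ကောင်", "ကြော်", "စားတယ်"}),
}

def disambiguate(words, index, context_window=3, min_confidence=0.2):
    entry = CONTEXT_CUES.get(words[index])
    if entry is None:
        return None
    correct_form, meaning, cues = entry
    lo = max(0, index - context_window)
    context = words[lo:index] + words[index + 1 : index + 1 + context_window]
    if not context:
        return None
    # Confidence = fraction of in-window words that support the correction.
    confidence = sum(w in cues for w in context) / len(context)
    return (correct_form, meaning, confidence) if confidence >= min_confidence else None
```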

Check Full Sentence

# Check entire sentence for tone corrections
words = ["ငါ", "သုံ", "ကောင်", "စားတယ်"]
corrections = disambiguator.check_sentence(words)

for index, original, suggestion, confidence in corrections:
    print(f"Position {index}: {original} → {suggestion} ({confidence:.2f})")
# Output:
# Position 0: ငါ → ငါး (0.75)  # fish
# Position 1: သုံ → သုံး (0.85) # three

Configuration

from myspellchecker.core.config.text_configs import ToneConfig

config = ToneConfig(
    context_window=3,    # Words to check on each side
    min_confidence=0.2   # Minimum confidence threshold (default: 0.2)
)
disambiguator = ToneDisambiguator(config=config)

Zawgyi Support

Zawgyi is a legacy encoding for Myanmar script. The library detects and handles Zawgyi-encoded text.

Detection

from myspellchecker.text.normalize import (
    is_likely_zawgyi,
    detect_encoding,
    convert_zawgyi_to_unicode,
)

# Quick check (returns Tuple[bool, float])
is_zawgyi, confidence = is_likely_zawgyi("ျမန္မာ")  # (True, 0.99)
is_zawgyi, confidence = is_likely_zawgyi("မြန်မာ")  # (False, 0.01)

# Get detailed detection
encoding, confidence = detect_encoding("ျမန္မာ")
print(encoding)  # "zawgyi" or "unicode"
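Detection heuristics of this kind typically look for mark sequences that Zawgyi produces but well-formed Unicode Myanmar text never does, such as a medial or vowel sign with no base consonant before it. The patterns and confidence formula below are illustrative assumptions, not the library's actual detector:

```python
import re

# Sequences impossible in well-formed Unicode Myanmar text: a medial ya
# (U+103B) or vowel sign E (U+1031) not preceded by a base consonant.
ZAWGYI_PATTERNS = [
    re.compile(r"(?:^|[^\u1000-\u1021])[\u103b\u1031]"),
    re.compile(r"\u1031\u1031"),  # doubled vowel sign E
]

def is_likely_zawgyi_sketch(text: str):
    hits = sum(1 for pattern in ZAWGYI_PATTERNS if pattern.search(text))
    confidence = hits / len(ZAWGYI_PATTERNS)
    return hits > 0, confidence
```

The Zawgyi spelling ျမန္မာ opens with a bare medial, so it trips the first pattern, while the Unicode spelling မြန်မာ matches nothing.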

Conversion

The library includes built-in Zawgyi to Unicode conversion:
# Detection and conversion workflow
is_zawgyi, _confidence = is_likely_zawgyi(text)  # returns (bool, float); a bare tuple is always truthy
if is_zawgyi:
    text = convert_zawgyi_to_unicode(text)

result = checker.check(text)

Text Validation

Validate Myanmar text structure using module-level functions:
from myspellchecker.text.validator import validate_word, validate_text

# Quick boolean check for a single word
is_valid = validate_word("ကျောင်း")  # True
is_valid = validate_word("ေကာင္း")   # False (Zawgyi artifact)

# Detailed validation with issue descriptions
result = validate_text("မြန်မာ")
print(result.is_valid)   # True
print(result.issues)     # [] (empty if valid)
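A minimal structural check of this kind can be expressed with Unicode categories: a well-formed word must not begin with a combining mark, which is the most common artifact of Zawgyi-ordered text. This sketch is illustrative, not the library's validator:

```python
import unicodedata

def starts_with_combining_mark(word: str) -> bool:
    # Dependent signs (categories Mn/Mc) must follow a base consonant;
    # a word that opens with one is structurally invalid in Unicode.
    return bool(word) and unicodedata.category(word[0]) in {"Mn", "Mc"}

def validate_word_sketch(word: str) -> bool:
    return bool(word) and not starts_with_combining_mark(word)
```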

Normalization

Text normalization for consistent processing:
from myspellchecker.text.normalize import normalize
from myspellchecker.text.normalize_c import remove_zero_width_chars

# Basic normalization
normalized = normalize("မြန်​မာ")  # Removes zero-width spaces
print(normalized)  # "မြန်မာ"

# Remove zero-width characters (Cython function)
clean = remove_zero_width_chars("hello​world")  # Zero-width space removed
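A pure-Python equivalent of the zero-width stripping looks like the following; the exact character set is an assumption (ZWNJ and ZWJ are meaningful in some scripts, so the library may treat them differently):

```python
# Common zero-width code points: ZWSP, ZWNJ, ZWJ, BOM/ZWNBSP.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def remove_zero_width_chars_py(text: str) -> str:
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)
```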

Cython Optimization

Normalization has a Cython-optimized version for performance:
# Automatic fallback pattern
try:
    from myspellchecker.text.normalize_c import remove_zero_width_chars
except ImportError:
    # Pure Python fallback (normalize.py handles this internally)
    pass

Integration

All text utilities integrate with the main spell checker:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig

config = SpellCheckerConfig(
    # These utilities are used internally
    use_context_checker=True,      # Uses tone disambiguation
    use_rule_based_validation=True,    # Uses syllable rule validation
)

checker = SpellChecker(config=config)

Performance Tips

  1. Enable caching: All utilities support LRU caching
  2. Batch operations: Use batch methods when processing many texts
  3. Adjust cache sizes: Increase for high-throughput scenarios
# High-performance configuration
stemmer = Stemmer(config=StemmerConfig(cache_size=4096))
hasher = PhoneticHasher(cache_size=8192)

See Also