Myanmar has two competing text encodings: Unicode (international standard) and Zawgyi (legacy, still widely used). Zawgyi text fed to a Unicode spell checker will fail validation entirely. mySpellChecker detects Zawgyi automatically and converts it to Unicode before checking.
Overview
from myspellchecker.text.zawgyi_support import (
    get_zawgyi_detector,
    is_zawgyi_converter_available,
    convert_zawgyi_to_unicode,
)
# Check if text is Zawgyi
detector = get_zawgyi_detector()
if detector:
    prob = detector.get_zawgyi_probability("ျမန္မာ")
    print(f"Zawgyi probability: {prob:.2f}")  # 0.99
# Convert to Unicode
unicode_text = convert_zawgyi_to_unicode("ျမန္မာ")
print(unicode_text)  # "မြန်မာ"
Background: Zawgyi vs Unicode
The Problem
Myanmar has two competing text encodings:
- Unicode - International standard (recommended)
- Zawgyi - Legacy encoding still widely used
Zawgyi text appears garbled in Unicode systems and vice versa:
| Encoding | "Myanmar" | Codepoints |
|---|---|---|
| Unicode | မြန်မာ | U+1019 U+103C U+1014 U+103A U+1019 U+102C |
| Zawgyi | ျမန္မာ | U+103B U+1019 U+1014 U+1039 U+1019 U+102C |
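The difference in the table can be inspected directly by printing code points. The sketch below uses only the standard library; `codepoints` is a hypothetical helper name, not part of mySpellChecker, and the two strings are written with escapes so the stored order is explicit.

```python
def codepoints(text: str) -> str:
    """Return the code points of `text` in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# "Myanmar" as stored in Unicode encoding (values taken from the table above)
unicode_myanmar = "\u1019\u103c\u1014\u103a\u1019\u102c"
# The same word as stored in Zawgyi encoding
zawgyi_myanmar = "\u103b\u1019\u1014\u1039\u1019\u102c"

print(codepoints(unicode_myanmar))  # U+1019 U+103C U+1014 U+103A U+1019 U+102C
print(codepoints(zawgyi_myanmar))   # U+103B U+1019 U+1014 U+1039 U+1019 U+102C
```

Both strings have six code points, but the first and fourth differ, which is why a font or checker built for one encoding misreads the other.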
Why It Matters for Spell Checking
Zawgyi text fed to a Unicode spell checker will:
- Fail syllable validation
- Generate incorrect suggestions
- Miss actual spelling errors
The solution: detect and convert Zawgyi to Unicode before spell checking.
Detection Markers
Patterns that distinguish Zawgyi-encoded text from Unicode:
| Pattern | Unicode | Zawgyi |
|---|---|---|
| Medial Ra | ြ (U+103C) | ၾ, ႀ, ႂ, ႃ (various) |
| Kinzi | င် + ္ | ၎င္း (special) |
| Stacking | ္ + consonant | Multiple variants |
| Tall AA | ါ | ါ with different code |
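As a rough illustration of marker-based detection, the sketch below flags text containing any of the medial Ra variants listed in the table. It assumes those four glyphs correspond to U+107E, U+1080, U+1082, and U+1083, and `has_zawgyi_marker` is a hypothetical helper. This is not how the library detects Zawgyi (it relies on the statistical myanmartools detector); in standard Unicode the same code points are assigned to other Myanmar-script languages such as Shan, so a hit is only a hint.

```python
# Assumed code points of the Zawgyi medial Ra variants from the table above.
ZAWGYI_MEDIAL_RA_VARIANTS = frozenset("\u107e\u1080\u1082\u1083")


def has_zawgyi_marker(text: str) -> bool:
    """Return True if `text` contains any of the assumed marker code points."""
    return any(ch in ZAWGYI_MEDIAL_RA_VARIANTS for ch in text)


print(has_zawgyi_marker("\u107e\u1000"))                          # True
print(has_zawgyi_marker("\u1019\u103c\u1014\u103a\u1019\u102c"))  # False
```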
Conversion Rules
Key character mappings during Zawgyi-to-Unicode conversion:
| Unicode | Zawgyi | Description |
|---|---|---|
| ြ (U+103C) | ၾ/ႀ/ႂ/ႃ | Medial Ra variants |
| ု (U+102F) | ု (different code) | Below vowel U |
| ူ (U+1030) | ူ (different code) | Below vowel UU |
| C + ေ | ေ + C | Vowel E reordering (stored after the consonant in Unicode, before it in Zawgyi) |
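In Unicode, the vowel sign E (U+1031) is stored after its consonant, while Zawgyi stores it before the consonant, in visual order. A minimal sketch of that single reordering step follows; `reorder_vowel_e` is a hypothetical name, and the real converter handles many more cases (medials, stacked consonants, variant glyphs). Apply it only to text already known to be Zawgyi, since valid Unicode text can also contain U+1031 followed by a consonant.

```python
import re

VOWEL_E = "\u1031"
# Vowel E followed by a base consonant (U+1000 to U+1021), Zawgyi order.
_E_BEFORE_CONSONANT = re.compile(f"{VOWEL_E}([\u1000-\u1021])")


def reorder_vowel_e(zawgyi_text: str) -> str:
    """Move each leading vowel E to just after the consonant it precedes."""
    return _E_BEFORE_CONSONANT.sub(lambda m: m.group(1) + VOWEL_E, zawgyi_text)


# Zawgyi order (E, NA) becomes Unicode order (NA, E)
print(reorder_vowel_e("\u1031\u1014") == "\u1014\u1031")  # True
```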
Functions
get_zawgyi_detector
Get or create a ZawgyiDetector instance (thread-safe singleton):
from myspellchecker.text.zawgyi_support import get_zawgyi_detector
detector = get_zawgyi_detector()
if detector:
    # Detector available
    prob = detector.get_zawgyi_probability("ျမန္မာ")
    if prob > 0.95:
        print("Text is Zawgyi encoded")
else:
    # myanmartools not installed
    print("Zawgyi detection unavailable")
Returns: ZawgyiDetector instance or None if myanmartools is not installed.
is_zawgyi_converter_available
Check if Zawgyi conversion is available:
from myspellchecker.text.zawgyi_support import is_zawgyi_converter_available
if is_zawgyi_converter_available():
    # python-myanmar package is installed
    print("Conversion available")
else:
    print("Install python-myanmar for conversion")
Returns: True if python-myanmar converter is available.
convert_zawgyi_to_unicode
Convert Zawgyi text to Unicode:
from myspellchecker.text.zawgyi_support import convert_zawgyi_to_unicode
# Zawgyi input
zawgyi_text = "ျမန္မာ"
unicode_text = convert_zawgyi_to_unicode(zawgyi_text)
print(unicode_text) # "မြန်မာ"
# Already Unicode - unchanged
unicode_text = convert_zawgyi_to_unicode("မြန်မာ")
print(unicode_text) # "မြန်မာ"
# Custom threshold
unicode_text = convert_zawgyi_to_unicode(zawgyi_text, threshold=0.90)
Parameters:
text - Text to convert (may be Zawgyi or Unicode)
threshold - Minimum Zawgyi probability to trigger conversion (default: 0.95)
Returns: Converted Unicode text, or original if not Zawgyi or conversion unavailable.
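The role of `threshold` can be pictured with a small stand-alone sketch. It is not the library's source: `convert_if_zawgyi` and the two stand-in callables are hypothetical, and treating the threshold as inclusive (`>=`) is an assumption based on the word "minimum" in the parameter description.

```python
from typing import Callable


def convert_if_zawgyi(
    text: str,
    probability_fn: Callable[[str], float],
    convert_fn: Callable[[str], str],
    threshold: float = 0.95,
) -> str:
    """Convert only when the detected probability reaches the threshold."""
    if probability_fn(text) >= threshold:  # inclusive comparison is an assumption
        return convert_fn(text)
    return text


# Stand-ins for the detector and converter so the sketch runs anywhere.
def fake_probability(text: str) -> float:
    return 0.93


def fake_convert(text: str) -> str:
    return text.upper()


print(convert_if_zawgyi("abc", fake_probability, fake_convert))                  # abc
print(convert_if_zawgyi("abc", fake_probability, fake_convert, threshold=0.90))  # ABC
```

With a detected probability of 0.93, the default threshold leaves the text untouched, while lowering it to 0.90 triggers conversion.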
Dependencies
Both myanmartools and python-myanmar are core dependencies and are installed automatically with pip install myspellchecker. No additional installation is needed for Zawgyi support.
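To confirm that both packages are importable in a given environment, a standard-library check along these lines may help. `zawgyi_dependencies` is a hypothetical helper, and the import name `myanmar` for the python-myanmar distribution is an assumption worth verifying against your installation.

```python
from importlib.util import find_spec


def zawgyi_dependencies() -> dict:
    """Map each distribution name to whether its module can be imported."""
    # Distribution name -> import name ("myanmar" is an assumed import name).
    modules = {"myanmartools": "myanmartools", "python-myanmar": "myanmar"}
    return {dist: find_spec(mod) is not None for dist, mod in modules.items()}


for name, available in zawgyi_dependencies().items():
    print(f"{name}: {'installed' if available else 'missing'}")
```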
Integration
With Text Normalization
from myspellchecker.text.normalization_service import get_normalization_service
service = get_normalization_service()
# for_spell_checking does NOT convert Zawgyi — it's fast normalization only
normalized = service.for_spell_checking(text)
# Handles: NFC → zero-width removal → diacritic reordering
# (Zawgyi detection/conversion is handled separately by the validation pipeline)
# For Zawgyi conversion, use for_dictionary_lookup instead
normalized = service.for_dictionary_lookup(text)
# Handles: Zawgyi → Unicode → NFC → zero-width removal → diacritic reordering
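The first two stages named in the comments above (NFC, then zero-width removal) can be approximated with the standard library. This is only a sketch: `basic_normalize` is a hypothetical function, the exact set of zero-width characters the service strips is an assumption, and diacritic reordering is omitted.

```python
import unicodedata

# Zero-width characters to delete (assumed set): ZWSP, ZWNJ, ZWJ, BOM.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def basic_normalize(text: str) -> str:
    """Apply NFC composition, then strip zero-width characters."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(ZERO_WIDTH)


# A zero-width space hidden between MA and AA is removed.
print(basic_normalize("\u1019\u200b\u102c") == "\u1019\u102c")  # True
```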
With SpellChecker
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.validation_configs import ValidationConfig
config = SpellCheckerConfig(
    validation=ValidationConfig(
        use_zawgyi_detection=True,  # Enable detection
        use_zawgyi_conversion=True,  # Auto-convert if detected
    )
)
checker = SpellChecker(config=config)
result = checker.check("ျမန္မာ") # Zawgyi input
# Text is auto-converted before checking
Manual Preprocessing
from myspellchecker.text.zawgyi_support import (
get_zawgyi_detector,
convert_zawgyi_to_unicode,
)
def preprocess_text(text: str) -> str:
    """Preprocess text, converting Zawgyi if detected."""
    detector = get_zawgyi_detector()
    if detector:
        prob = detector.get_zawgyi_probability(text)
        if prob > 0.95:
            return convert_zawgyi_to_unicode(text)
    return text
# Use in pipeline
text = preprocess_text(user_input)
result = checker.check(text)
Thread Safety
Both functions use functools.lru_cache for thread-safe singleton patterns:
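# Safe for concurrent use
import threading
from myspellchecker.text.zawgyi_support import get_zawgyi_detector
texts = ["ျမန္မာ", "မြန်မာ"]  # example inputs
def process_in_thread(text):
    detector = get_zawgyi_detector()  # Same instance across threads
    # ... process text
threads = [threading.Thread(target=process_in_thread, args=(t,)) for t in texts]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()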
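The caching pattern described above can be demonstrated without the library. In this sketch `FakeDetector` and `get_detector` are hypothetical stand-ins; the cache is warmed once in the main thread so the result does not depend on the timing of the first call.

```python
import threading
from functools import lru_cache


class FakeDetector:
    """Stand-in for an object that is expensive to build (hypothetical)."""


@lru_cache(maxsize=1)
def get_detector() -> FakeDetector:
    return FakeDetector()


get_detector()  # warm the cache once before any worker starts

seen_ids = []
lock = threading.Lock()


def worker() -> None:
    detector = get_detector()  # served from the cache
    with lock:
        seen_ids.append(id(detector))


threads = [threading.Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(len(set(seen_ids)))  # 1
```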
Error Handling
The module handles errors gracefully:
# Detection errors
try:
    prob = detector.get_zawgyi_probability(text)
except Exception:
    prob = 0.0  # Same fallback the module applies: log the error, use 0.0
# Conversion errors
result = convert_zawgyi_to_unicode(problematic_text)
# If conversion fails, returns original text with warning log
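The "return the original text and log a warning" behavior can be sketched as a generic wrapper. The names here (`convert_or_keep`, `broken_converter`, the logger name) are hypothetical, and this illustrates the pattern rather than reproducing the module's code.

```python
import logging
from typing import Callable

logger = logging.getLogger("zawgyi_example")


def convert_or_keep(text: str, convert_fn: Callable[[str], str]) -> str:
    """Run the converter; on any failure, log a warning and return the input."""
    try:
        return convert_fn(text)
    except Exception:
        logger.warning("Zawgyi conversion failed; keeping original text", exc_info=True)
        return text


def broken_converter(text: str) -> str:
    raise ValueError("simulated failure")


print(convert_or_keep("abc", str.upper))         # ABC
print(convert_or_keep("abc", broken_converter))  # abc
```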
Acknowledgments
Zawgyi support relies on two open-source libraries:
| Library | Author | Purpose | License |
|---|---|---|---|
| myanmartools | Google | Statistical Zawgyi detection using a Markov model | Apache 2.0 |
| python-myanmar | trhura | Zawgyi-to-Unicode conversion | MIT |
We are grateful to Google and the open-source community for making these libraries publicly available.
Detection Accuracy
The myanmartools detector (by Google) uses a Markov model:
| Encoding | Detection Accuracy |
|---|---|
| Pure Zawgyi | >99% |
| Pure Unicode | >99% |
| Mixed (rare) | ~90% |
Threshold Recommendations
| Use Case | Threshold | Notes |
|---|---|---|
| General | 0.95 | Avoid false positives |
| Aggressive | 0.90 | Catch more Zawgyi |
| Conservative | 0.99 | Only clear Zawgyi |
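One way to keep these recommendations in a single place is a small preset table. The values come from the table above; the names `THRESHOLDS` and `should_convert` are hypothetical, and the inclusive comparison is an assumption.

```python
# Threshold presets taken from the recommendations table.
THRESHOLDS = {"general": 0.95, "aggressive": 0.90, "conservative": 0.99}


def should_convert(probability: float, use_case: str = "general") -> bool:
    """Return True if the probability reaches the preset for `use_case`."""
    return probability >= THRESHOLDS[use_case]


print(should_convert(0.93, "aggressive"))  # True
print(should_convert(0.93, "general"))     # False
```

The chosen value can then be passed on as the `threshold` argument of `convert_zawgyi_to_unicode`.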
Common Zawgyi Patterns
Visual differences between encodings:
| Feature | Unicode | Zawgyi |
|---|---|---|
| RA-YIT | မြ (U+1019 U+103C) | ျမ (U+103B U+1019) |
| Stacking | ဿ (U+103F) | သ္သ (U+101E U+1039 U+101E) |
| Medials (Ya+Wa) | ကျွန် | ကၽြန္ |
Mixed Content Handling
Sometimes text contains both Unicode and Zawgyi segments:
def normalize_mixed_content(text: str) -> str:
    """Convert all content to Unicode."""
    from myspellchecker.text.zawgyi_support import (
        get_zawgyi_detector,
        convert_zawgyi_to_unicode,
    )
    detector = get_zawgyi_detector()
    if not detector:
        return text
    prob = detector.get_zawgyi_probability(text)
    if prob > 0.95:
        return convert_zawgyi_to_unicode(text)
    return text
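The function above scores the text as a whole. When segments with different encodings sit on separate lines, one possible refinement is to decide per line. The sketch below shows that idea with injected stand-in callables; `convert_per_line` is hypothetical, and per-line detection on very short lines may be less reliable than whole-text detection.

```python
from typing import Callable


def convert_per_line(
    text: str,
    probability_fn: Callable[[str], float],
    convert_fn: Callable[[str], str],
    threshold: float = 0.95,
) -> str:
    """Detect and convert each line independently, leaving other lines as-is."""
    converted = [
        convert_fn(line) if line.strip() and probability_fn(line) >= threshold else line
        for line in text.split("\n")
    ]
    return "\n".join(converted)


# Stand-ins: lines tagged "zg:" play the role of Zawgyi segments.
def fake_probability(line: str) -> float:
    return 1.0 if line.startswith("zg:") else 0.0


def fake_convert(line: str) -> str:
    return line[3:]


print(convert_per_line("zg:one\ntwo", fake_probability, fake_convert))
```

In real use, `detector.get_zawgyi_probability` and `convert_zawgyi_to_unicode` would take the place of the stand-ins.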
Best Practices
- Always normalize first - Convert Zawgyi before any spell checking
- Preserve original for display - Keep the original text alongside converted version
- Log Zawgyi usage - Track Zawgyi input for migration monitoring
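The last two practices can be combined in one small record type. This is a sketch under stated assumptions: `NormalizedText`, `normalize_and_track`, and the logger name are hypothetical, and the detector and converter are passed in as callables so the example runs without the library.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("zawgyi_usage")


@dataclass(frozen=True)
class NormalizedText:
    original: str     # text exactly as received, kept for display
    text: str         # Unicode text to pass to the spell checker
    was_zawgyi: bool  # whether conversion was applied


def normalize_and_track(
    raw: str,
    probability_fn: Callable[[str], float],
    convert_fn: Callable[[str], str],
    threshold: float = 0.95,
) -> NormalizedText:
    """Convert when Zawgyi is detected, keep the original, and log the event."""
    probability = probability_fn(raw)
    if probability >= threshold:
        logger.info("Zawgyi input detected (p=%.2f)", probability)
        return NormalizedText(original=raw, text=convert_fn(raw), was_zawgyi=True)
    return NormalizedText(original=raw, text=raw, was_zawgyi=False)


result = normalize_and_track("abc", lambda s: 0.99, str.upper)
print(result.original, result.text, result.was_zawgyi)  # abc ABC True
```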
See Also