Overview
Background: Zawgyi vs Unicode
The Problem
Myanmar has two competing text encodings:

- Unicode - International standard (recommended)
- Zawgyi - Legacy encoding still widely used

The same word is stored as a different code point sequence in each encoding:

| Encoding | "Myanmar" | Codepoints |
|---|---|---|
| Unicode | မြန်မာ | U+1019 U+103C U+1014 U+103A U+1019 U+102C |
| Zawgyi | ျမန္မာ | U+103B U+1019 U+1014 U+1039 U+1019 U+102C |
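
To make the table concrete, the short sketch below prints the code points of both spellings. It uses only the standard library; the helper name `codepoints` is illustrative and not part of the module.

```python
def codepoints(text: str) -> str:
    """Format each character of `text` as U+XXXX, separated by spaces."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Sequences copied from the table above.
unicode_word = "\u1019\u103c\u1014\u103a\u1019\u102c"  # "Myanmar" in Unicode
zawgyi_word = "\u103b\u1019\u1014\u1039\u1019\u102c"   # "Myanmar" in Zawgyi

print(codepoints(unicode_word))     # U+1019 U+103C U+1014 U+103A U+1019 U+102C
print(codepoints(zawgyi_word))      # U+103B U+1019 U+1014 U+1039 U+1019 U+102C
print(unicode_word == zawgyi_word)  # False
```

The two strings can look alike on screen with the matching font, but they never compare equal, which is why dictionary lookups on unconverted Zawgyi text fail.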
Why It Matters for Spell Checking
Zawgyi text fed to a Unicode spell checker will:

- Fail syllable validation
- Generate incorrect suggestions
- Miss actual spelling errors
Detection Markers
Zawgyi-specific patterns that indicate the encoding:

| Pattern | Unicode | Zawgyi |
|---|---|---|
| Medial Ra | ြ (U+103C) | ၾ, ႀ, ႂ, ႃ (various) |
| Kinzi | င် + ္ | ၎င္း (special) |
| Stacking | ္ + consonant | Multiple variants |
| Tall AA | ါ | ါ with different code |
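
As a rough illustration of marker-based detection, the sketch below checks for the Zawgyi medial Ra variants listed in the table. It is a naive heuristic, not how the module detects Zawgyi (that is done statistically by myanmartools), and the names `ZAWGYI_MEDIAL_RA_VARIANTS` and `has_zawgyi_marker` are hypothetical.

```python
# Medial Ra variants copied from the table above. In a Zawgyi font these
# code points render as forms of medial Ra.
ZAWGYI_MEDIAL_RA_VARIANTS = frozenset("ၾႀႂႃ")


def has_zawgyi_marker(text: str) -> bool:
    """Return True if any character is one of the listed Zawgyi Ra variants."""
    return any(ch in ZAWGYI_MEDIAL_RA_VARIANTS for ch in text)


print(has_zawgyi_marker("\u1019\u103c\u1014\u103a\u1019\u102c"))  # False
```

A hit is only a hint: the same code points are also valid Unicode characters for other languages written in the Myanmar script, so a check like this can produce false positives on genuine Unicode text.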
Conversion Rules
Key character mappings during Zawgyi-to-Unicode conversion:

| Unicode | Zawgyi | Description |
|---|---|---|
| ြ (U+103C) | ၾ/ႀ/ႂ/ႃ | Medial Ra variants |
| ု (U+102F) | ု (different code) | Below vowel U |
| ူ (U+1030) | ူ (different code) | Below vowel UU |
| C + ေ | ေ + C | Vowel E reordering (Zawgyi stores ေ before the consonant, Unicode after it) |
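
To show what conversion involves at the code point level, here is a toy sketch that handles only the two remappings needed for the "Myanmar" example from the Background table: moving medial Ra after its consonant and swapping the asat code point. It is not a general converter (real conversion needs many more rules and is handled by python-myanmar), and `toy_zawgyi_to_unicode` is a hypothetical name.

```python
import re

# Basic Myanmar consonants, U+1000 through U+1021.
CONSONANT = "[\u1000-\u1021]"


def toy_zawgyi_to_unicode(text: str) -> str:
    """Apply two illustrative Zawgyi-to-Unicode rules; everything else is untouched."""
    # Zawgyi stores medial Ra (U+103B) before the consonant;
    # Unicode stores it as U+103C after the consonant.
    text = re.sub(f"\u103b({CONSONANT})", r"\g<1>" + "\u103c", text)
    # In this example Zawgyi uses U+1039 where Unicode uses asat U+103A.
    return text.replace("\u1039", "\u103a")


zawgyi_word = "\u103b\u1019\u1014\u1039\u1019\u102c"
unicode_word = "\u1019\u103c\u1014\u103a\u1019\u102c"
print(toy_zawgyi_to_unicode(zawgyi_word) == unicode_word)  # True
```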
Functions
get_zawgyi_detector
Get or create a ZawgyiDetector instance (thread-safe singleton).

Returns: ZawgyiDetector instance, or None if myanmartools is not installed.
is_zawgyi_converter_available
Check whether Zawgyi conversion is available.

Returns: True if the python-myanmar converter is available.
convert_zawgyi_to_unicode
Convert Zawgyi text to Unicode.

Parameters:

- text - Text to convert (may be Zawgyi or Unicode)
- threshold - Minimum Zawgyi probability to trigger conversion (default: 0.95)
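
The threshold semantics can be sketched as follows. This is a minimal illustration, not the module's implementation: `convert_if_zawgyi` and the two stand-in callables are hypothetical, and the inclusive comparison (`>=`) is an assumption based on the word "minimum" in the parameter description.

```python
from typing import Callable


def convert_if_zawgyi(
    text: str,
    zawgyi_probability: Callable[[str], float],
    to_unicode: Callable[[str], str],
    threshold: float = 0.95,
) -> str:
    """Convert only when the detector's score reaches the threshold."""
    if not text:
        return text
    # Assumption: the comparison is inclusive (score >= threshold).
    if zawgyi_probability(text) >= threshold:
        return to_unicode(text)
    return text


# Stand-in callables so the sketch runs without any third-party package.
def fake_probability(text: str) -> float:
    return 0.97


def fake_converter(text: str) -> str:
    return text.upper()


print(convert_if_zawgyi("abc", fake_probability, fake_converter))        # ABC
print(convert_if_zawgyi("abc", fake_probability, fake_converter, 0.99))  # abc
```

Text that scores below the threshold is returned unchanged, which is what makes it safe to pass input that may already be Unicode.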
Dependencies
Both myanmartools and python-myanmar are core dependencies and are installed automatically with pip install myspellchecker. No additional installation is needed for Zawgyi support.
Integration
With Text Normalization
With SpellChecker
Manual Preprocessing
Thread Safety
Both functions use functools.lru_cache to implement thread-safe singletons.
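
A minimal sketch of this pattern, assuming the detector is created lazily and the import is allowed to fail. The function name `get_detector_sketch` is hypothetical; `ZawgyiDetector` is the class exported by myanmartools.

```python
import functools


@functools.lru_cache(maxsize=None)
def get_detector_sketch():
    """Return a shared ZawgyiDetector, or None if myanmartools is missing."""
    try:
        from myanmartools import ZawgyiDetector
    except ImportError:
        return None
    return ZawgyiDetector()


first = get_detector_sketch()
second = get_detector_sketch()
print(first is second)  # True: the second call is served from the cache
```

The Python documentation describes the `lru_cache` cache as thread-safe, but also notes that the wrapped function may run more than once if a second thread calls it before the first call has completed and been cached. For a detector whose construction has no side effects, that costs at most a redundant initialization.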
Error Handling
The module handles errors gracefully.
Acknowledgments
Zawgyi support relies on two open-source libraries:

| Library | Author | Purpose | License |
|---|---|---|---|
| myanmartools | Google | Statistical Zawgyi detection using a Markov model | Apache 2.0 |
| python-myanmar | trhura | Zawgyi-to-Unicode conversion | MIT |
Detection Accuracy
The myanmartools detector (by Google) uses a Markov model. Detection accuracy by input type:

| Encoding | Detection Accuracy |
|---|---|
| Pure Zawgyi | >99% |
| Pure Unicode | >99% |
| Mixed (rare) | ~90% |
Threshold Recommendations
| Use Case | Threshold | Notes |
|---|---|---|
| General | 0.95 | Avoid false positives |
| Aggressive | 0.90 | Catch more Zawgyi |
| Conservative | 0.99 | Only clear Zawgyi |
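
To see how the three settings differ in practice, the sketch below lists which of them would trigger conversion for a given detector score. The dictionary and function names are illustrative only, and the inclusive comparison is an assumption.

```python
# Threshold values from the table above.
THRESHOLDS = {"general": 0.95, "aggressive": 0.90, "conservative": 0.99}


def profiles_that_convert(probability: float) -> list[str]:
    """Return the use cases whose threshold the given score meets."""
    return [name for name, limit in THRESHOLDS.items() if probability >= limit]


print(profiles_that_convert(0.93))   # ['aggressive']
print(profiles_that_convert(0.97))   # ['general', 'aggressive']
print(profiles_that_convert(0.995))  # ['general', 'aggressive', 'conservative']
```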
Common Zawgyi Patterns
Visual differences between encodings:

| Feature | Unicode | Zawgyi |
|---|---|---|
| RA-YIT | မြ (U+1019 U+103C) | ျမ (U+103B U+1019) |
| Stacking | ဿ (U+103F) | သ္သ (U+101E U+1039 U+101E) |
| Medials (Ya+Wa) | ကျွန် | ကၽြန္ |
Mixed Content Handling
Sometimes text contains both Unicode and Zawgyi segments.
Best Practices
- Always normalize first - Convert Zawgyi before any spell checking
- Preserve original for display - Keep the original text alongside converted version
- Log Zawgyi usage - Track Zawgyi input for migration monitoring
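
The three practices above can be combined in one small wrapper. This is a sketch under stated assumptions: `NormalizedText`, `normalize_for_checking`, and the logger name are hypothetical, and `convert` stands in for a converter such as convert_zawgyi_to_unicode that returns its input unchanged when the text is not Zawgyi.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("zawgyi_migration")  # hypothetical logger name


@dataclass(frozen=True)
class NormalizedText:
    original: str     # kept for display
    normalized: str   # passed to the spell checker
    was_zawgyi: bool  # True if conversion changed the text


def normalize_for_checking(text: str, convert: Callable[[str], str]) -> NormalizedText:
    """Convert first, keep the original, and log when conversion happened."""
    normalized = convert(text)
    # Assumption: an unchanged result means the input was not Zawgyi.
    was_zawgyi = normalized != text
    if was_zawgyi:
        logger.info("Zawgyi input converted (%d characters)", len(text))
    return NormalizedText(original=text, normalized=normalized, was_zawgyi=was_zawgyi)
```

Counting these log entries over time gives a simple measure of how much Zawgyi input is still arriving.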
See Also
- Text Normalization - Full normalization pipeline
- Text Validation - Zawgyi artifact detection
- Configuration Guide - Zawgyi options