Overview
Background: Zawgyi vs Unicode
The Problem
Myanmar has two competing text encodings:
- Unicode - International standard (recommended)
- Zawgyi - Legacy encoding still widely used
| Encoding | "Myanmar" | Codepoints |
|---|---|---|
| Unicode | မြန်မာ | U+1019 U+103C U+1014 U+103A U+1019 U+102C |
| Zawgyi | ျမန္မာ | U+103B U+1019 U+1014 U+1039 U+1019 U+102C |
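
The code points in the table can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `codepoints` is made up for illustration and is not part of the module.

```python
def codepoints(text: str) -> str:
    """Return the code points of `text` in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# The same word as stored in each encoding (values taken from the table above).
unicode_word = "\u1019\u103C\u1014\u103A\u1019\u102C"
zawgyi_word = "\u103B\u1019\u1014\u1039\u1019\u102C"

print(codepoints(unicode_word))
print(codepoints(zawgyi_word))
```

The two strings render similarly with the matching font, but they share neither order nor code points, which is why a Unicode-based checker cannot interpret Zawgyi input directly.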
Why It Matters for Spell Checking
Zawgyi text fed to a Unicode spell checker will:
- Fail syllable validation
- Generate incorrect suggestions
- Miss actual spelling errors
Detection Markers
Zawgyi-specific patterns that indicate encoding:
| Pattern | Unicode | Zawgyi |
|---|---|---|
| Medial Ra | ြ (U+103C) | ၾ, ႀ, ႂ, ႃ (various) |
| Kinzi | င် + ္ | ၎င္း (special) |
| Stacking | ္ + consonant | Multiple variants |
| Tall AA | ါ | ါ with different code |
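
Markers like these can be turned into a rough rule-based check. The sketch below is a hypothetical heuristic, not the statistical model that myanmartools uses, and the name `looks_like_zawgyi` is an assumption made for this example. It relies on three observations: in Unicode order, U+1031 and U+103B never start a word; Zawgyi uses U+1039 as a visible asat, so it can end a word; and Zawgyi reuses U+1060 to U+1097 for stacked and variant glyphs.

```python
import re

# Hypothetical heuristic; expect false negatives on short or ambiguous input.
_ZAWGYI_HINTS = re.compile(
    r"(?:^|\s)[\u1031\u103B]"  # pre-base sign stored before its consonant
    r"|\u1039(?=\s|$)"         # U+1039 at the end of a word (Zawgyi asat)
    r"|[\u1060-\u1097]"        # range Zawgyi reuses for stacked/variant glyphs
)


def looks_like_zawgyi(text: str) -> bool:
    """Return True if `text` contains a pattern typical of Zawgyi byte order."""
    return _ZAWGYI_HINTS.search(text) is not None
```

The U+1060 to U+1097 range is valid Unicode for Mon, Karen, and Shan letters, so this check is only reasonable for text known to be Burmese. A statistical detector remains the safer choice.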
Conversion Rules
Key character mappings during Zawgyi-to-Unicode conversion:
| Unicode | Zawgyi | Description |
|---|---|---|
| ြ (U+103C) | ၾ/ႀ/ႂ/ႃ | Medial Ra variants |
| ု (U+102F) | ု (different code) | Below vowel U |
| ူ (U+1030) | ူ (different code) | Below vowel UU |
| ေ + C | C + ေ | Vowel E reordering |
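
The reordering rows can be illustrated with a small regex pass. This is a simplified sketch under stated assumptions: it handles only the vowel E (U+1031) and the basic Zawgyi medial Ra (U+103B) placed before a consonant in U+1000 to U+1021, ignores the other Medial Ra variants and every other mapping, and must only be applied to text already known to be Zawgyi. The function name `reorder_prebase_signs` is hypothetical.

```python
import re

# Zawgyi stores these signs in visual order: optional vowel E, optional
# medial Ra, then the base consonant.
_PREFIXED = re.compile(r"(\u1031?)(\u103B?)([\u1000-\u1021])")


def reorder_prebase_signs(zawgyi_text: str) -> str:
    """Move pre-base signs after their consonant, as Unicode order requires."""

    def fix(match: re.Match) -> str:
        vowel_e, medial_ra, consonant = match.groups()
        out = consonant
        if medial_ra:
            out += "\u103C"  # Unicode medial Ra
        if vowel_e:
            out += "\u1031"  # vowel E follows the consonant and its medials
        return out

    return _PREFIXED.sub(fix, zawgyi_text)
```

A full converter needs many more rules (stacked consonants, asat, vowel variants), which is why the module delegates conversion to a dedicated library.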
Functions
get_zawgyi_detector
Get or create a ZawgyiDetector instance (thread-safe singleton).
Returns a ZawgyiDetector instance, or None if myanmartools is not installed.
is_zawgyi_converter_available
Check if Zawgyi conversion is available.
Returns True if the python-myanmar converter is available.
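
An availability check of this kind can be written without importing the optional package. The following is a sketch, not the module's actual implementation: the function name is hypothetical, and it assumes python-myanmar installs an importable top-level package named `myanmar`.

```python
from functools import lru_cache
from importlib.util import find_spec


@lru_cache(maxsize=1)
def is_converter_importable() -> bool:
    """Return True if the converter package can be located on sys.path."""
    # Assumption: the python-myanmar distribution exposes a "myanmar" package.
    return find_spec("myanmar") is not None
```

Caching the result avoids repeating the import-system lookup on every call.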
convert_zawgyi_to_unicode
Convert Zawgyi text to Unicode:
- text - Text to convert (may be Zawgyi or Unicode)
- threshold - Minimum Zawgyi probability to trigger conversion (default: 0.95)
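
The threshold logic described by these parameters can be sketched as follows. This is an illustration only: the function name, `probability_of`, and `to_unicode` are hypothetical. The detector and converter are passed in as callables so the example runs without either optional library, and it assumes that a probability equal to the threshold triggers conversion.

```python
from typing import Callable


def convert_if_zawgyi(
    text: str,
    probability_of: Callable[[str], float],
    to_unicode: Callable[[str], str],
    threshold: float = 0.95,
) -> str:
    """Convert `text` only when its Zawgyi probability reaches `threshold`."""
    if not text:
        return text  # nothing to score or convert
    if probability_of(text) >= threshold:
        return to_unicode(text)
    return text  # treated as Unicode and left untouched
```

In practice `probability_of` might be bound to the `get_zawgyi_probability` method of a myanmartools `ZawgyiDetector`. Input below the threshold is returned unchanged, so Unicode text passes through safely.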
Installation
Required Dependencies
Package Detection
The module handles missing packages gracefully.
Integration
With Text Normalization
With SpellChecker
Manual Preprocessing
Thread Safety
Both functions use functools.lru_cache to implement thread-safe singletons.
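
A minimal sketch of this pattern is shown below, assuming the detector comes from the optional myanmartools package. The name `get_detector` is hypothetical, and the module's real function may differ in detail.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_detector():
    """Return a shared ZawgyiDetector, or None when myanmartools is missing."""
    try:
        from myanmartools import ZawgyiDetector  # optional dependency
    except ImportError:
        return None
    return ZawgyiDetector()
```

The cache itself is safe to use from multiple threads. If several threads make the first call at the same time, however, the wrapped function may run more than once before a result is cached. That is harmless here because every run produces an equivalent detector.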
Error Handling
The module handles errors gracefully.
Acknowledgments
Zawgyi support relies on two open-source libraries:
| Library | Author | Purpose | License |
|---|---|---|---|
| myanmartools | Google | Statistical Zawgyi detection using a Markov model | Apache 2.0 |
| python-myanmar | Myanmar Tools community | Zawgyi-to-Unicode conversion | MIT |
Detection Accuracy
The myanmartools detector (by Google) uses a Markov model:
| Encoding | Detection Accuracy |
|---|---|
| Pure Zawgyi | >99% |
| Pure Unicode | >99% |
| Mixed (rare) | ~90% |
Threshold Recommendations
| Use Case | Threshold | Notes |
|---|---|---|
| General | 0.95 | Avoid false positives |
| Aggressive | 0.90 | Catch more Zawgyi |
| Conservative | 0.99 | Only clear Zawgyi |
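
The effect of each threshold is easiest to see on a borderline score. The sketch below uses the values from the table. The dictionary and function names are hypothetical, and it assumes a score equal to the threshold counts as Zawgyi.

```python
# Thresholds from the table above, keyed by use case.
THRESHOLDS = {"general": 0.95, "aggressive": 0.90, "conservative": 0.99}


def should_convert(probability: float, use_case: str = "general") -> bool:
    """Return True if `probability` meets the threshold for `use_case`."""
    return probability >= THRESHOLDS[use_case]


# A borderline score of 0.93 is converted only in the aggressive setting.
for use_case in THRESHOLDS:
    print(use_case, should_convert(0.93, use_case))
```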
Common Zawgyi Patterns
Visual differences between encodings:
| Feature | Unicode | Zawgyi |
|---|---|---|
| RA-YIT | မြ (U+1019 U+103C) | ျမ (U+103B U+1019) |
| Stacking | ဿ (U+103F) | သ္သ (U+101E U+1039 U+101E) |
| Medials (Ya+Wa) | ကျွန် | ကၽြန္ |
Mixed Content Handling
Sometimes text contains both Unicode and Zawgyi segments.
Best Practices
- Always normalize first - Convert Zawgyi before any spell checking
- Preserve original for display - Keep the original text alongside converted version
- Log Zawgyi usage - Track Zawgyi input for migration monitoring
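
The three practices above can be combined in one preprocessing step. This is a sketch under stated assumptions: `Preprocessed`, `preprocess`, and the logger name are hypothetical, and the detector and converter are injected as callables so the example does not depend on either optional library.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("zawgyi_preprocess")  # hypothetical logger name


@dataclass(frozen=True)
class Preprocessed:
    original: str    # kept unchanged for display
    normalized: str  # Unicode text to pass to the spell checker
    was_zawgyi: bool


def preprocess(
    text: str,
    probability_of: Callable[[str], float],
    to_unicode: Callable[[str], str],
    threshold: float = 0.95,
) -> Preprocessed:
    """Normalize before spell checking, keep the original, log conversions."""
    probability = probability_of(text)
    if probability >= threshold:
        logger.info("Zawgyi input converted (p=%.3f)", probability)
        return Preprocessed(text, to_unicode(text), True)
    return Preprocessed(text, text, False)
```

Returning both strings lets the caller run spell checking on `normalized` while still showing `original` to the user. The log line gives a simple count of how much Zawgyi input is still arriving.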
See Also
- Text Normalization - Full normalization pipeline
- Text Validation - Zawgyi artifact detection
- Configuration Guide - Zawgyi options