Myanmar text is written in two competing encodings: Unicode (the international standard) and Zawgyi (a legacy encoding that is still widely used). Zawgyi text fed to a Unicode spell checker fails validation entirely. mySpellChecker detects Zawgyi automatically and converts it to Unicode before checking.

Overview

from myspellchecker.text.zawgyi_support import (
    get_zawgyi_detector,
    is_zawgyi_converter_available,
    convert_zawgyi_to_unicode,
)

# Check if text is Zawgyi
detector = get_zawgyi_detector()
if detector:
    prob = detector.get_zawgyi_probability("ျမန္မာ")
    print(f"Zawgyi probability: {prob:.2f}")  # 0.99

# Convert to Unicode
unicode_text = convert_zawgyi_to_unicode("ျမန္မာ")
print(unicode_text)  # "မြန်မာ"

Background: Zawgyi vs Unicode

The Problem

Myanmar has two competing text encodings:
  • Unicode - International standard (recommended)
  • Zawgyi - Legacy encoding still widely used
Zawgyi text appears garbled in Unicode systems and vice versa:
| Encoding | "Myanmar" | Codepoints |
|----------|-----------|------------|
| Unicode | မြန်မာ | U+1019 U+103C U+1014 U+103A U+1019 U+102C |
| Zawgyi | ျမန္မာ | U+103B U+1019 U+1014 U+1039 U+1019 U+102C |
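
To inspect the difference directly, it can help to print the code points of a string. The sketch below uses only the standard library; `codepoints` is a hypothetical helper name, and the two sample words are written as escapes so they survive copy and paste.

```python
def codepoints(text: str) -> str:
    """Return the code points of text in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# The same word in each encoding, written as escapes.
unicode_word = "\u1019\u103c\u1014\u103a\u1019\u102c"
zawgyi_word = "\u103b\u1019\u1014\u1039\u1019\u102c"

print(codepoints(unicode_word))
print(codepoints(zawgyi_word))
```

Both strings have six code points, but the first and fourth differ and the medial sits on opposite sides of the consonant.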

Why It Matters for Spell Checking

Zawgyi text fed to a Unicode spell checker will:
  • Fail syllable validation
  • Generate incorrect suggestions
  • Miss actual spelling errors
The solution: detect and convert Zawgyi to Unicode before spell checking.

Detection Markers

Zawgyi-specific patterns that indicate encoding:
| Pattern | Unicode | Zawgyi |
|---------|---------|--------|
| Medial Ra | ြ (U+103C) | ၾ, ႀ, ႂ, ႃ (various) |
| Kinzi | င် + ္ | ၎င္း (special) |
| Stacking | ္ + consonant | Multiple variants |
| Tall AA | ါ | with different code |
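
As a rough illustration of how such markers can be checked, here is a minimal sketch of a rule-based hint detector. It is an assumption for illustration only and is not how myanmartools works (that library uses a statistical model); `has_zawgyi_hints` is a hypothetical name, and the rule set is deliberately tiny.

```python
import re

# Assumption: a crude hint list, not the statistical model used by myanmartools.
# U+107E-U+1084 hold Zawgyi medial-ra shapes, but Unicode assigns the same
# code points to Shan letters, so Shan text will produce false positives.
_ZAWGYI_HINTS = re.compile(
    r"[\u107e-\u1084]"          # Zawgyi medial-ra shapes
    r"|(?:^|\s)[\u1031\u103b]"  # dependent sign stored before its consonant
)


def has_zawgyi_hints(text: str) -> bool:
    """Return True if text contains a pattern that is typical of Zawgyi."""
    return _ZAWGYI_HINTS.search(text) is not None


print(has_zawgyi_hints("\u103b\u1019\u1014\u1039\u1019\u102c"))  # Zawgyi sample
print(has_zawgyi_hints("\u1019\u103c\u1014\u103a\u1019\u102c"))  # Unicode sample
```

A heuristic like this is only useful as a quick sanity check; for real input, the probability returned by the detector is the more reliable signal.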

Conversion Rules

Key character mappings during Zawgyi-to-Unicode conversion:
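| Unicode | Zawgyi | Description |
|---------|--------|-------------|
| ြ (U+103C) | ၾ/ႀ/ႂ/ႃ | Medial Ra variants |
| ု (U+102F) | ု (different code) | Below vowel U |
| ူ (U+1030) | ူ (different code) | Below vowel UU |
| C + ေ | ေ + C | Vowel E reordering (Unicode stores the consonant first; Zawgyi stores ေ first) |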
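
The vowel E rule is the easiest one to show in code. The sketch below is a simplified assumption, not the python-myanmar implementation: it handles only a bare consonant preceded by U+1031, and `reorder_vowel_e` is a hypothetical name. It should be applied only to text already identified as Zawgyi, because the same pattern occurs legitimately across syllable boundaries in Unicode text.

```python
import re

# Assumption: covers only a bare consonant (U+1000-U+1021) preceded by U+1031.
# Real converters also handle medials, stacked consonants, and kinzi.
_E_BEFORE_CONSONANT = re.compile(r"\u1031([\u1000-\u1021])")


def reorder_vowel_e(zawgyi_text: str) -> str:
    """Move a leading vowel sign E (U+1031) to after its consonant."""
    return _E_BEFORE_CONSONANT.sub(lambda m: m.group(1) + "\u1031", zawgyi_text)


# Zawgyi stores E + NA; Unicode expects NA + E.
reordered = reorder_vowel_e("\u1031\u1014")
print([f"U+{ord(ch):04X}" for ch in reordered])
```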

Functions

get_zawgyi_detector

Get or create a ZawgyiDetector instance (thread-safe singleton):
from myspellchecker.text.zawgyi_support import get_zawgyi_detector

detector = get_zawgyi_detector()

if detector:
    # Detector available
    prob = detector.get_zawgyi_probability("ျမန္မာ")
    if prob > 0.95:
        print("Text is Zawgyi encoded")
else:
    # myanmartools not installed
    print("Zawgyi detection unavailable")
Returns: ZawgyiDetector instance or None if myanmartools is not installed.

is_zawgyi_converter_available

Check if Zawgyi conversion is available:
from myspellchecker.text.zawgyi_support import is_zawgyi_converter_available

if is_zawgyi_converter_available():
    # python-myanmar package is installed
    print("Conversion available")
else:
    print("Install python-myanmar for conversion")
Returns: True if python-myanmar converter is available.

convert_zawgyi_to_unicode

Convert Zawgyi text to Unicode:
from myspellchecker.text.zawgyi_support import convert_zawgyi_to_unicode

# Zawgyi input
zawgyi_text = "ျမန္မာ"
unicode_text = convert_zawgyi_to_unicode(zawgyi_text)
print(unicode_text)  # "မြန်မာ"

# Already Unicode - unchanged
unicode_text = convert_zawgyi_to_unicode("မြန်မာ")
print(unicode_text)  # "မြန်မာ"

# Custom threshold
unicode_text = convert_zawgyi_to_unicode(zawgyi_text, threshold=0.90)
Parameters:
  • text - Text to convert (may be Zawgyi or Unicode)
  • threshold - Minimum Zawgyi probability to trigger conversion (default: 0.95)
Returns: Converted Unicode text, or original if not Zawgyi or conversion unavailable.
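
To make the documented contract concrete, here is a minimal sketch of threshold-gated conversion. It is not the library's implementation: `convert_if_zawgyi` is a hypothetical name, the detector and converter are passed in as plain callables, and treating the threshold as inclusive is an assumption based on the word "minimum" above.

```python
from typing import Callable, Optional


def convert_if_zawgyi(
    text: str,
    probability: Optional[Callable[[str], float]],
    converter: Optional[Callable[[str], str]],
    threshold: float = 0.95,
) -> str:
    """Convert text only when it is confidently Zawgyi; otherwise return it as is."""
    if probability is None or converter is None:
        return text  # detection or conversion unavailable
    if probability(text) < threshold:
        return text  # not confidently Zawgyi
    try:
        return converter(text)
    except Exception:
        return text  # fall back to the original on conversion failure


# Stand-in callables so the sketch runs without optional packages.
print(convert_if_zawgyi("abc", lambda t: 0.99, str.upper))
print(convert_if_zawgyi("abc", lambda t: 0.50, str.upper))
```

Every failure path returns the input unchanged, which matches the "original if not Zawgyi or conversion unavailable" behavior described above.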

Installation

Required Dependencies

# For detection only
pip install "myanmartools>=1.2.1"

# For detection and conversion
pip install "myanmartools>=1.2.1" "python-myanmar>=1.0.0"

Package Detection

The module gracefully handles missing packages:
detector = get_zawgyi_detector()
# Returns None if myanmartools not installed
# Logs warning: "myanmartools not available for Zawgyi detection"

result = convert_zawgyi_to_unicode("ျမန္မာ")
# Returns original text if conversion unavailable
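
If you want to check for the optional packages yourself before calling the module, one option is `importlib.util.find_spec`, which looks a module up without importing it. This is a sketch, not what the library does internally; `is_installed` is a hypothetical helper, and the import names `myanmartools` and `myanmar` (for python-myanmar) are assumptions.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return find_spec(module_name) is not None


# Assumed import names: "myanmartools" and "myanmar" (python-myanmar).
print("detection available:", is_installed("myanmartools"))
print("conversion available:", is_installed("myanmar"))
```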

Integration

With Text Normalization

from myspellchecker.text.normalization_service import get_normalization_service

service = get_normalization_service()

text = "ျမန္မာ"  # Zawgyi input

# Zawgyi conversion is included in normalization
normalized = service.for_spell_checking(text)
# Handles: Zawgyi → Unicode → NFC → reordering

With SpellChecker

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.validation_configs import ValidationConfig

config = SpellCheckerConfig(
    validation=ValidationConfig(
        use_zawgyi_detection=True,   # Enable detection
        use_zawgyi_conversion=True,  # Auto-convert if detected
    )
)

checker = SpellChecker(config=config)
result = checker.check("ျမန္မာ")  # Zawgyi input
# Text is auto-converted before checking

Manual Preprocessing

from myspellchecker.text.zawgyi_support import (
    get_zawgyi_detector,
    convert_zawgyi_to_unicode,
)

def preprocess_text(text: str) -> str:
    """Preprocess text, converting Zawgyi if detected."""
    detector = get_zawgyi_detector()

    if detector:
        prob = detector.get_zawgyi_probability(text)
        if prob > 0.95:
            return convert_zawgyi_to_unicode(text)

    return text

# Use in pipeline
text = preprocess_text(user_input)
result = checker.check(text)

Thread Safety

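Both functions use functools.lru_cache to cache a single shared instance. The cache itself is thread-safe, although the wrapped function may run more than once if several threads make the first call at the same time: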
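# Safe for concurrent use
import threading

from myspellchecker.text.zawgyi_support import get_zawgyi_detector

texts = ["ျမန္မာ", "မြန်မာ"]  # Example inputs

def process_in_thread(text):
    detector = get_zawgyi_detector()  # Same instance across threads
    # ... process text

threads = [threading.Thread(target=process_in_thread, args=(t,)) for t in texts]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()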
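
The caching behavior can be tried out with a stand-in class and no optional packages. This sketch only demonstrates the general `functools.lru_cache` singleton pattern; `FakeDetector` and `get_detector` are hypothetical names, and the cache is warmed once up front so that no two threads race on the first call.

```python
import threading
from functools import lru_cache


class FakeDetector:
    """Stand-in for an expensive-to-build detector (hypothetical)."""


@lru_cache(maxsize=None)
def get_detector() -> FakeDetector:
    return FakeDetector()


get_detector()  # warm the cache before starting threads

seen_ids = []
lock = threading.Lock()


def worker() -> None:
    detector = get_detector()
    with lock:
        seen_ids.append(id(detector))


threads = [threading.Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(len(set(seen_ids)))  # number of distinct instances seen by the workers
```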

Error Handling

The module handles errors gracefully:
# Detection errors
try:
    prob = detector.get_zawgyi_probability(text)
except Exception:
    # The error is logged and the probability is treated as 0.0
    prob = 0.0

# Conversion errors
result = convert_zawgyi_to_unicode(problematic_text)
# If conversion fails, returns original text with warning log

Acknowledgments

Zawgyi support relies on two open-source libraries:
| Library | Author | Purpose | License |
|---------|--------|---------|---------|
| myanmartools | Google | Statistical Zawgyi detection using a Markov model | Apache 2.0 |
| python-myanmar | Myanmar Tools community | Zawgyi-to-Unicode conversion | MIT |
We are grateful to Google and the Myanmar Tools community for making these libraries publicly available.

Detection Accuracy

The myanmartools detector (by Google) uses a Markov model:
| Encoding | Detection Accuracy |
|----------|--------------------|
| Pure Zawgyi | >99% |
| Pure Unicode | >99% |
| Mixed (rare) | ~90% |

Threshold Recommendations

| Use Case | Threshold | Notes |
|----------|-----------|-------|
| General | 0.95 | Avoid false positives |
| Aggressive | 0.90 | Catch more Zawgyi |
| Conservative | 0.99 | Only clear Zawgyi |
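
One way to keep these values in a single place is a small lookup, sketched below. The names `ZAWGYI_THRESHOLDS` and `should_convert` are hypothetical and not part of mySpellChecker; treating the threshold as inclusive is also an assumption.

```python
# Thresholds from the table above, keyed by use case.
ZAWGYI_THRESHOLDS = {
    "general": 0.95,
    "aggressive": 0.90,
    "conservative": 0.99,
}


def should_convert(probability: float, use_case: str = "general") -> bool:
    """Return True when probability reaches the threshold for use_case."""
    return probability >= ZAWGYI_THRESHOLDS[use_case]


print(should_convert(0.92, "aggressive"))
print(should_convert(0.92, "general"))
```

The selected value can then be passed as the threshold argument of convert_zawgyi_to_unicode.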

Common Zawgyi Patterns

Visual differences between encodings:
| Feature | Unicode | Zawgyi |
|---------|---------|--------|
| RA-YIT | မြ (U+1019 U+103C) | ျမ (U+103B U+1019) |
| Stacking | ဿ (U+103F) | သ္သ (U+101E U+1039 U+101E) |
| Medials (Ya+Wa) | ကျွန် | ကၽြန္ |

Mixed Content Handling

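Sometimes text contains both Unicode and Zawgyi segments. The helper below scores the text as a whole, so it converts the input only when Zawgyi dominates:
def normalize_mixed_content(text: str) -> str:
    """Convert the text to Unicode if it is detected as Zawgyi overall."""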
    from myspellchecker.text.zawgyi_support import (
        get_zawgyi_detector,
        convert_zawgyi_to_unicode,
    )

    detector = get_zawgyi_detector()
    if not detector:
        return text

    prob = detector.get_zawgyi_probability(text)
    if prob > 0.95:
        return convert_zawgyi_to_unicode(text)

    return text
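
Whole-text detection can miss a Zawgyi segment inside mostly-Unicode input. One possible refinement, sketched here under the assumption that encodings change at line boundaries, is to score each line separately. `convert_mixed_lines` is a hypothetical helper that takes the detector and converter as plain callables, so it runs without the optional packages.

```python
from typing import Callable


def convert_mixed_lines(
    text: str,
    probability: Callable[[str], float],
    converter: Callable[[str], str],
    threshold: float = 0.95,
) -> str:
    """Detect and convert each line separately so encodings can differ per line."""
    converted = []
    for line in text.split("\n"):
        # Skip blank lines; convert only lines that score as Zawgyi.
        if line.strip() and probability(line) >= threshold:
            line = converter(line)
        converted.append(line)
    return "\n".join(converted)


# Stand-in callables: lines starting with "z" count as Zawgyi here.
fake_probability = lambda s: 0.99 if s.startswith("z") else 0.01
print(convert_mixed_lines("zaw\nuni\n\nzed", fake_probability, str.upper))
```

With the real module, `detector.get_zawgyi_probability` and `convert_zawgyi_to_unicode` could be passed in as the two callables. Short lines give a statistical detector less evidence, so per-line scores may be less reliable than a whole-text score.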

Best Practices

  1. Always normalize first - Convert Zawgyi before any spell checking
  2. Preserve original for display - Keep the original text alongside converted version
  3. Log Zawgyi usage - Track Zawgyi input for migration monitoring
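
The three practices can be combined in one small wrapper. This is a sketch under stated assumptions rather than part of mySpellChecker: `NormalizedText`, `normalize_for_checking`, and the logger name are hypothetical, and the detector and converter are injected as callables.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("zawgyi_migration")  # hypothetical logger name


@dataclass(frozen=True)
class NormalizedText:
    original: str    # keep for display
    normalized: str  # use for spell checking
    was_zawgyi: bool


def normalize_for_checking(
    text: str,
    probability: Callable[[str], float],
    converter: Callable[[str], str],
    threshold: float = 0.95,
) -> NormalizedText:
    """Convert Zawgyi input, keep the original, and log that Zawgyi was seen."""
    prob = probability(text)
    if prob >= threshold:
        logger.info("Zawgyi input detected (probability %.2f)", prob)
        return NormalizedText(text, converter(text), True)
    return NormalizedText(text, text, False)


# Stand-in callables so the sketch runs without optional packages.
result = normalize_for_checking("abc", lambda t: 0.99, str.upper)
print(result)
```

Spell checking would run on `result.normalized`, while `result.original` stays available for display.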

See Also