Zawgyi Support - mySpellChecker

Myanmar has two competing text encodings: Unicode (international standard) and Zawgyi (legacy, still widely used). Zawgyi text fed to a Unicode spell checker will fail validation entirely. mySpellChecker detects Zawgyi automatically and converts it to Unicode before checking.

Overview

from myspellchecker.text.zawgyi_support import (
    get_zawgyi_detector,
    is_zawgyi_converter_available,
    convert_zawgyi_to_unicode,
)

# Check if text is Zawgyi
detector = get_zawgyi_detector()
if detector:
    prob = detector.get_zawgyi_probability("ျမန္မာ")
    print(f"Zawgyi probability: {prob:.2f}")  # 0.99

# Convert to Unicode
unicode_text = convert_zawgyi_to_unicode("ျမန္မာ")
print(unicode_text)  # "မြန်မာ"

Background: Zawgyi vs Unicode

The Problem

Myanmar has two competing text encodings:

Unicode - International standard (recommended)
Zawgyi - Legacy encoding still widely used

Zawgyi text appears garbled in Unicode systems and vice versa:

Encoding	"Myanmar"	Codepoints
Unicode	မြန်မာ	U+1019 U+103C U+1014 U+103A U+1019 U+102C
Zawgyi	ျမန္မာ	U+103B U+1019 U+1014 U+1039 U+1019 U+102C
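When two strings render similarly but behave differently, dumping their codepoints shows which encoding is in play. The following is a minimal standard-library sketch; `codepoints` is a helper name introduced here for illustration and is not part of mySpellChecker.

```python
def codepoints(text: str) -> str:
    """Return the codepoints of `text` in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Escapes are used so the example does not depend on font rendering.
unicode_myanmar = "\u1019\u103c\u1014\u103a\u1019\u102c"
zawgyi_myanmar = "\u103b\u1019\u1014\u1039\u1019\u102c"

print(codepoints(unicode_myanmar))  # U+1019 U+103C U+1014 U+103A U+1019 U+102C
print(codepoints(zawgyi_myanmar))   # U+103B U+1019 U+1014 U+1039 U+1019 U+102C
```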

Why It Matters for Spell Checking

Zawgyi text fed to a Unicode spell checker will:

Fail syllable validation
Generate incorrect suggestions
Miss actual spelling errors

The solution: detect and convert Zawgyi to Unicode before spell checking.

Detection Markers

Zawgyi-specific patterns that indicate encoding:

Pattern	Unicode	Zawgyi
Medial Ra	ြ (U+103C)	ၾ, ႀ, ႂ, ႃ (various)
Kinzi	င် + ္	၎င္း (special)
Stacking	္ + consonant	Multiple variants
Tall AA	ါ	ါ with different code

Conversion Rules

Key character mappings during Zawgyi-to-Unicode conversion:

Unicode	Zawgyi	Description
ြ (U+103C)	ၾ/ႀ/ႂ/ႃ	Medial Ra variants
ု (U+102F)	ု (different code)	Below vowel U
ူ (U+1030)	ူ (different code)	Below vowel UU
C + ေ	ေ + C	Vowel E reordering (Unicode stores the vowel after the consonant; Zawgyi stores it before)
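Vowel E reordering is the easiest of these rules to show in code: Zawgyi stores U+1031 in visual order, before the consonant, while Unicode stores it after. The sketch below applies only that single rule to a bare consonant; `reorder_vowel_e` is a hypothetical helper, and a real converter such as python-myanmar also has to handle medials, stacking, and the remapped codepoints.

```python
import re

# Vowel E (U+1031) immediately followed by a base consonant (U+1000-U+1021).
VISUAL_ORDER_E = re.compile(r"\u1031([\u1000-\u1021])")


def reorder_vowel_e(text: str) -> str:
    """Move a leading vowel E to just after the consonant it precedes."""
    return VISUAL_ORDER_E.sub("\\1\u1031", text)


zawgyi_order = "\u1031\u1014"   # vowel E, then consonant NA
unicode_order = "\u1014\u1031"  # consonant NA, then vowel E
print(reorder_vowel_e(zawgyi_order) == unicode_order)  # True
```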

Functions

get_zawgyi_detector

Get or create a ZawgyiDetector instance (thread-safe singleton):

from myspellchecker.text.zawgyi_support import get_zawgyi_detector

detector = get_zawgyi_detector()

if detector:
    # Detector available
    prob = detector.get_zawgyi_probability("ျမန္မာ")
    if prob > 0.95:
        print("Text is Zawgyi encoded")
else:
    # myanmartools not installed
    print("Zawgyi detection unavailable")

Returns: ZawgyiDetector instance or None if myanmartools is not installed.

is_zawgyi_converter_available

Check if Zawgyi conversion is available:

from myspellchecker.text.zawgyi_support import is_zawgyi_converter_available

if is_zawgyi_converter_available():
    # python-myanmar package is installed
    print("Conversion available")
else:
    print("Install python-myanmar for conversion")

Returns: True if python-myanmar converter is available.

convert_zawgyi_to_unicode

Convert Zawgyi text to Unicode:

from myspellchecker.text.zawgyi_support import convert_zawgyi_to_unicode

# Zawgyi input
zawgyi_text = "ျမန္မာ"
unicode_text = convert_zawgyi_to_unicode(zawgyi_text)
print(unicode_text)  # "မြန်မာ"

# Already Unicode - unchanged
unicode_text = convert_zawgyi_to_unicode("မြန်မာ")
print(unicode_text)  # "မြန်မာ"

# Custom threshold
unicode_text = convert_zawgyi_to_unicode(zawgyi_text, threshold=0.90)

Parameters:

text - Text to convert (may be Zawgyi or Unicode)
threshold - Minimum Zawgyi probability to trigger conversion (default: 0.95)

Returns: Converted Unicode text, or original if not Zawgyi or conversion unavailable.
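The documented behavior (convert only above the threshold, otherwise return the input unchanged) can be sketched as a small gate. This is not the library's implementation: `convert_if_zawgyi` is a hypothetical name, the detector and converter are passed in as plain callables so the example runs without any dependencies, and treating a probability exactly equal to the threshold as a match is an assumption.

```python
from typing import Callable, Optional


def convert_if_zawgyi(
    text: str,
    probability: Callable[[str], float],
    converter: Optional[Callable[[str], str]],
    threshold: float = 0.95,
) -> str:
    """Convert only when the Zawgyi probability reaches the threshold."""
    if converter is None:              # conversion unavailable
        return text
    if probability(text) < threshold:  # not confidently Zawgyi
        return text
    return converter(text)


# Stand-in callables for demonstration only.
print(convert_if_zawgyi("abc", lambda t: 0.99, str.upper))  # ABC
print(convert_if_zawgyi("abc", lambda t: 0.50, str.upper))  # abc
print(convert_if_zawgyi("abc", lambda t: 0.99, None))       # abc
```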

Dependencies

Both myanmartools and python-myanmar are core dependencies and are installed automatically with pip install myspellchecker. No additional installation is needed for Zawgyi support.
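If Zawgyi support appears to be missing in a given environment, a quick check of whether the two packages are importable can help. The sketch below assumes the import names are `myanmartools` and `myanmar` (the latter for python-myanmar); `zawgyi_dependencies` is a hypothetical helper, not part of mySpellChecker.

```python
from importlib.util import find_spec

# Assumed top-level import names for the two dependencies.
MODULES = ("myanmartools", "myanmar")


def zawgyi_dependencies() -> dict:
    """Map each assumed module name to whether it can be imported."""
    return {name: find_spec(name) is not None for name in MODULES}


print(zawgyi_dependencies())
```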

Integration

With Text Normalization

from myspellchecker.text.normalization_service import get_normalization_service

service = get_normalization_service()

# for_spell_checking does NOT convert Zawgyi — it's fast normalization only
normalized = service.for_spell_checking(text)
# Handles: NFC → zero-width removal → diacritic reordering
# (Zawgyi detection/conversion is handled separately by the validation pipeline)

# For Zawgyi conversion, use for_dictionary_lookup instead
normalized = service.for_dictionary_lookup(text)
# Handles: Zawgyi → Unicode → NFC → zero-width removal → diacritic reordering
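To make the first two normalization steps concrete, here is a standard-library sketch of NFC normalization followed by zero-width removal. It is not the service's code: `basic_normalize` is a hypothetical name, the set of zero-width characters is an assumption, and diacritic reordering is left out.

```python
import unicodedata

# Assumption: these are the zero-width characters to strip; the library's
# actual list may differ.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def basic_normalize(text: str) -> str:
    """Apply NFC, then delete zero-width characters."""
    return unicodedata.normalize("NFC", text).translate(ZERO_WIDTH)


cleaned = basic_normalize("\u1019\u200b\u103c")
print(cleaned == "\u1019\u103c")  # True
```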

With SpellChecker

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.validation_configs import ValidationConfig

config = SpellCheckerConfig(
    validation=ValidationConfig(
        use_zawgyi_detection=True,   # Enable detection
        use_zawgyi_conversion=True,  # Auto-convert if detected
    )
)

checker = SpellChecker(config=config)
result = checker.check("ျမန္မာ")  # Zawgyi input
# Text is auto-converted before checking

Manual Preprocessing

from myspellchecker.text.zawgyi_support import (
    get_zawgyi_detector,
    convert_zawgyi_to_unicode,
)

def preprocess_text(text: str) -> str:
    """Preprocess text, converting Zawgyi if detected."""
    detector = get_zawgyi_detector()

    if detector:
        prob = detector.get_zawgyi_probability(text)
        if prob > 0.95:
            return convert_zawgyi_to_unicode(text)

    return text

# Use in pipeline (assumes `checker` is a SpellChecker and `user_input` is a str)
text = preprocess_text(user_input)
result = checker.check(text)

Thread Safety

Both functions use functools.lru_cache for thread-safe singleton patterns:

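# Safe for concurrent use
import threading

from myspellchecker.text.zawgyi_support import get_zawgyi_detector

texts = ["ျမန္မာ", "မြန်မာ"]  # example inputs

def process_in_thread(text):
    detector = get_zawgyi_detector()  # Same instance across threads
    # ... process text

threads = [threading.Thread(target=process_in_thread, args=(t,)) for t in texts]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()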
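The caching pattern itself can be sketched without the library. `FakeDetector` and `get_detector` below are stand-ins, not mySpellChecker names. One caveat worth knowing: `functools.lru_cache` keeps its cache consistent across threads, but the wrapped function can run more than once if several threads make the first call at the same time, so this sketch warms the cache before starting any threads.

```python
import threading
from functools import lru_cache


class FakeDetector:
    """Stand-in for an expensive-to-build detector (hypothetical)."""


@lru_cache(maxsize=1)
def get_detector() -> FakeDetector:
    return FakeDetector()


get_detector()  # warm the cache once before starting threads

seen = []


def worker() -> None:
    seen.append(id(get_detector()))


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(set(seen)))  # 1
```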

Error Handling

The module handles errors gracefully:

# Detection errors
try:
    prob = detector.get_zawgyi_probability(text)
except Exception:
    prob = 0.0  # The module logs the error and falls back to 0.0

# Conversion errors
result = convert_zawgyi_to_unicode(problematic_text)
# If conversion fails, returns original text with warning log
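The same log-and-fall-back behavior can be written as a small wrapper when calling a detector directly. This is a sketch rather than the module's code: `safe_probability` and the logger name are hypothetical, and `BrokenDetector` exists only to trigger the error path.

```python
import logging

logger = logging.getLogger("zawgyi_example")


def safe_probability(detector, text: str) -> float:
    """Return the Zawgyi probability, or 0.0 if detection raises."""
    try:
        return float(detector.get_zawgyi_probability(text))
    except Exception:
        logger.exception("Zawgyi detection failed; treating text as Unicode")
        return 0.0


class BrokenDetector:
    """Stand-in detector that always fails."""

    def get_zawgyi_probability(self, text: str) -> float:
        raise RuntimeError("model not loaded")


print(safe_probability(BrokenDetector(), "abc"))  # 0.0
```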

Acknowledgments

Zawgyi support relies on two open-source libraries:

Library	Author	Purpose	License
`myanmartools`	Google	Statistical Zawgyi detection using a Markov model	Apache 2.0
`python-myanmar`	trhura	Zawgyi-to-Unicode conversion	MIT

We are grateful to Google and the open-source community for making these libraries publicly available.

Detection Accuracy

The myanmartools detector (by Google) uses a Markov model. Approximate accuracy by input type:

Encoding	Detection Accuracy
Pure Zawgyi	>99%
Pure Unicode	>99%
Mixed (rare)	~90%

Threshold Recommendations

Use Case	Threshold	Notes
General	0.95	Avoid false positives
Aggressive	0.90	Catch more Zawgyi
Conservative	0.99	Only clear Zawgyi
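A short worked example shows how the three thresholds treat the same detector score. The preset names mirror the table, `presets_that_convert` is a hypothetical helper, and treating a score equal to the threshold as a match is an assumption.

```python
THRESHOLDS = {"aggressive": 0.90, "general": 0.95, "conservative": 0.99}


def presets_that_convert(probability: float) -> list:
    """List the presets under which this score would trigger conversion."""
    return [name for name, limit in THRESHOLDS.items() if probability >= limit]


print(presets_that_convert(0.93))  # ['aggressive']
print(presets_that_convert(0.97))  # ['aggressive', 'general']
```

A borderline score such as 0.93 is converted only under the aggressive setting, which is where false positives on genuine Unicode text are most likely.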

Common Zawgyi Patterns

Visual differences between encodings:

Feature	Unicode	Zawgyi
RA-YIT	မြ (U+1019 U+103C)	ျမ (U+103B U+1019)
Stacking	ဿ (U+103F)	သ္သ (U+101E U+1039 U+101E)
Medials (Ya+Wa)	ကျွန်	ကၽြန္

Mixed Content Handling

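Sometimes text contains both Unicode and Zawgyi segments. The function below scores the string as a whole and converts it only when the overall Zawgyi probability is high: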

def normalize_mixed_content(text: str) -> str:
    """Convert all content to Unicode."""
    from myspellchecker.text.zawgyi_support import (
        get_zawgyi_detector,
        convert_zawgyi_to_unicode,
    )

    detector = get_zawgyi_detector()
    if not detector:
        return text

    prob = detector.get_zawgyi_probability(text)
    if prob > 0.95:
        return convert_zawgyi_to_unicode(text)

    return text
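When the two encodings are mixed line by line (for example, pasted chat logs), scoring each line separately may work better than scoring the whole string. The sketch below is one possible approach, not something the library provides: `convert_per_line` is a hypothetical name, and the detector and converter are passed in as callables so the example runs standalone. Short lines give a statistical detector less evidence, so results on very short segments should be treated with caution.

```python
from typing import Callable


def convert_per_line(
    text: str,
    probability: Callable[[str], float],
    converter: Callable[[str], str],
    threshold: float = 0.95,
) -> str:
    """Convert each line independently, leaving low-scoring lines untouched."""
    lines = text.split("\n")
    return "\n".join(
        converter(line) if line.strip() and probability(line) >= threshold else line
        for line in lines
    )


# Stand-in callables: lines starting with "Z" count as Zawgyi here.
fake_probability = lambda line: 1.0 if line.startswith("Z") else 0.0
print(convert_per_line("Zaw\nuni", fake_probability, str.lower))  # zaw, then uni
```

With the library installed, `probability` could be `detector.get_zawgyi_probability` and `converter` could be `convert_zawgyi_to_unicode`.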

Best Practices

Always normalize first - Convert Zawgyi before any spell checking
Preserve original for display - Keep the original text alongside converted version
Log Zawgyi usage - Track Zawgyi input for migration monitoring

​Overview

​Background: Zawgyi vs Unicode

​The Problem

​Why It Matters for Spell Checking

​Detection Markers

​Conversion Rules

​Functions

​get_zawgyi_detector

​is_zawgyi_converter_available

​convert_zawgyi_to_unicode

​Dependencies

​Integration

​With Text Normalization

​With SpellChecker

​Manual Preprocessing

​Thread Safety

​Error Handling

​Acknowledgments

​Detection Accuracy

​Threshold Recommendations

​Common Zawgyi Patterns

​Mixed Content Handling

​Best Practices

​See Also