Myanmar text processing is uniquely challenging due to its complex script rules, multiple encoding standards (Unicode vs. Zawgyi), and flexible character ordering. The myspellchecker library includes a robust normalization pipeline to handle these issues.

Overview

The normalization process ensures that text is in a consistent, canonical form before it reaches the spell checker. This is critical for matching dictionary entries correctly. The pipeline performs these steps:
  1. Zawgyi Detection & Conversion: Converts legacy Zawgyi encoding to standard Unicode.
  2. Unicode Normalization: Applies standard NFC (Normalization Form Canonical Composition).
  3. Zero-Width Removal: Strips invisible characters (ZWSP, ZWNJ, etc.) that confuse algorithms.
  4. Diacritic Reordering: Enforces canonical ordering of Myanmar diacritics (e.g., medial positions).
  5. Nasal Normalization: Standardizes variable nasal endings (e.g., န် vs ံ).
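As a rough illustration, steps 2–3 can be sketched with the standard library alone (`basic_normalize` is a hypothetical helper, not part of the library; the real pipeline also performs Zawgyi conversion, diacritic reordering, and nasal normalization):

```python
import unicodedata

# Common invisible characters: ZWSP, ZWNJ, ZWJ, BOM/ZWNBSP
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def basic_normalize(text: str) -> str:
    """Apply NFC, then strip zero-width characters (simplified sketch)."""
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)
```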

Usage

You can use the normalizer directly via the normalize module:
from myspellchecker.text.normalize import normalize_with_zawgyi_conversion

raw_text = "..." # Could be Zawgyi or messy Unicode
clean_text = normalize_with_zawgyi_conversion(raw_text)

Direct Function Access

For fine-grained control, you can access specific normalization functions:
from myspellchecker.text.normalize import (
    normalize,
    convert_zawgyi_to_unicode,
    is_likely_zawgyi,
)

# 1. Check for Zawgyi
is_zawgyi, confidence = is_likely_zawgyi(text)

# 2. Convert if needed
if is_zawgyi:
    text = convert_zawgyi_to_unicode(text)

# 3. Standard Normalize
text = normalize(text, form="NFC")

Features in Detail

1. Zawgyi Support

Legacy Zawgyi-One encoding is still prevalent. We use Google’s myanmartools (a machine-learning detector) for high-accuracy detection (>95%) and python-myanmar for conversion.
  • Detection: Statistical analysis of character sequences.
  • Conversion: Rule-based mapping to Myanmar3 (Unicode).
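For intuition only, here is a naive heuristic that is not the myanmartools model: Zawgyi repurposes code points in the U+1060–U+1097 range (Mon and Karen letters in standard Unicode) for Burmese glyph variants, so their frequency in ostensibly Burmese text is one crude signal a detector can use. `naive_zawgyi_score` is a hypothetical illustration:

```python
def naive_zawgyi_score(text: str) -> float:
    """Fraction of Myanmar-block characters falling in ranges Zawgyi
    repurposes. Very rough; real detectors use trained statistical models."""
    myanmar = [ch for ch in text if "\u1000" <= ch <= "\u109f"]
    if not myanmar:
        return 0.0
    suspicious = sum(1 for ch in myanmar if "\u1060" <= ch <= "\u1097")
    return suspicious / len(myanmar)
```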

2. Unicode Normalization (NFC)

Myanmar characters can often be represented in multiple ways (e.g., pre-composed vs. decomposed). We strictly enforce NFC (Normalization Form C) to ensure:
  • လုံး is one unit, not လ + ုံး.
  • Consistent hashing for dictionary lookups.
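The hashing point is easiest to demonstrate with a Latin example, since Myanmar has very few precomposed forms; the same principle applies to canonical ordering of Myanmar combining marks:

```python
import unicodedata

composed = "\u00e9"     # "é" as one precomposed code point
decomposed = "e\u0301"  # "e" + combining acute accent

# Without normalization the two spellings compare (and hash) unequal,
# so a dictionary lookup keyed on one form would miss the other.
print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```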

3. Diacritic Reordering

In Myanmar Unicode, diacritics must follow a specific order (Storage Order). However, typing often results in visual-order storage.
  • Example: medial-ra (ြ) vs medial-ya (ျ).
  • Action: Reorders diacritics to the canonical sequence defined by the Unicode standard.
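One well-known visual-order artifact is the vowel sign E (ေ, U+1031), which is often typed before the consonant even though Unicode stores it after. A minimal single-rule sketch (hypothetical; the library's full reordering covers many more marks and cluster shapes):

```python
import re

# Match a vowel sign E (U+1031) typed before a base consonant (U+1000-U+1021).
VISUAL_E = re.compile("\u1031([\u1000-\u1021])")

def fix_vowel_e(text: str) -> str:
    """Move a visually-ordered vowel sign E after its consonant."""
    return VISUAL_E.sub(lambda m: m.group(1) + "\u1031", text)
```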

4. Nasal Ending Normalization

Myanmar phonology allows the /n/ sound to be written as န် (Na + Asat) or ံ (Anusvara). These are often used interchangeably or incorrectly. Nasal normalization is handled through the PhoneticHasher with normalize_nasals=True:
from myspellchecker.text.phonetic import PhoneticHasher

hasher = PhoneticHasher(normalize_nasals=True)
# Nasal normalization is applied as part of phonetic encoding
code = hasher.encode("နိုင်ငံ")
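Conceptually (a hypothetical sketch, not the PhoneticHasher internals), folding the two spellings before hashing can be as simple as mapping Na + Asat to Anusvara:

```python
def fold_nasal(text: str) -> str:
    """Map Na + Asat (\u1014\u103a) to Anusvara (\u1036) so both nasal
    spellings produce the same phonetic key. Illustrative only."""
    return text.replace("\u1014\u103a", "\u1036")
```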

Performance

Core normalization routines (reordering, zero-width removal) are implemented in Cython (.pyx files compiled to C++ extensions) for maximum performance. This adds negligible overhead (<1ms) to the pipeline.