Text Normalization - mySpellChecker

Myanmar text processing is uniquely challenging due to its complex script rules, multiple encoding standards (Unicode vs. Zawgyi), and flexible character ordering. The myspellchecker library includes a robust normalization pipeline to handle these issues.

Overview

The normalization process ensures that text is in a consistent, canonical form before it reaches the spell checker. This is critical for matching dictionary entries correctly. The pipeline performs these steps:

Zawgyi Detection & Conversion: Converts legacy Zawgyi encoding to standard Unicode.
Unicode Normalization: Applies standard NFC (Normalization Form Canonical Composition).
Zero-Width Removal: Strips invisible characters (ZWSP, ZWNJ, etc.) that confuse algorithms.
Diacritic Reordering: Enforces canonical ordering of Myanmar diacritics (e.g., medial positions).
Nasal Normalization: Standardizes variable nasal endings (e.g., န် vs ံ).
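
The NFC and zero-width steps can be approximated with the Python standard library alone. The following is a minimal sketch for illustration, not the library's implementation: the helper name `basic_normalize` and the exact set of zero-width code points are assumptions.

```python
import unicodedata

# Assumed set of invisible characters: ZWSP, ZWNJ, ZWJ, word joiner, BOM.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def basic_normalize(text: str) -> str:
    """Apply NFC, then drop zero-width characters (hypothetical helper)."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in composed if ch not in ZERO_WIDTH)


# Two syllables separated by an invisible ZWSP (U+200B).
with_zwsp = "\u1019\u103c\u1014\u103a\u200b\u1019\u102c"
without_zwsp = "\u1019\u103c\u1014\u103a\u1019\u102c"

print(basic_normalize(with_zwsp) == without_zwsp)  # True
```

Without this step the two strings look identical on screen but compare as different, which is enough to make a dictionary lookup fail.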

Usage

You can use the normalizer directly via the normalize module:

from myspellchecker.text.normalize import normalize_with_zawgyi_conversion

raw_text = "..."  # Could be Zawgyi or messy Unicode
clean_text = normalize_with_zawgyi_conversion(raw_text)

Direct Function Access

For fine-grained control, you can access specific normalization functions:

from myspellchecker.text.normalize import (
    normalize,
    convert_zawgyi_to_unicode,
    is_likely_zawgyi,
)

text = "..."  # Input to normalize; may be Zawgyi or Unicode

# 1. Check for Zawgyi
is_zawgyi, confidence = is_likely_zawgyi(text)

# 2. Convert if needed
if is_zawgyi:
    text = convert_zawgyi_to_unicode(text)

# 3. Standard Normalize (all options shown with defaults)
text = normalize(
    text,
    form="NFC",
    remove_zero_width=True,
    reorder_diacritics=True,
    normalize_variants=False,
    normalize_tall_aa=True,
    normalize_u_asat=True,
)

Features in Detail

1. Zawgyi Support

The legacy Zawgyi-One encoding is still prevalent. The library uses Google’s myanmartools (a machine-learning detector) for high-accuracy detection (>95%) and python-myanmar for conversion.

Detection: Statistical analysis of character sequences.
Conversion: Rule-based mapping to Myanmar3 (Unicode).
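
Because detection is probabilistic, it can help to convert only when the detector is confident. The sketch below shows one way to structure that guard. It takes the detector and converter as plain callables, where the detector returns an `(is_zawgyi, confidence)` pair like `is_likely_zawgyi` in the usage section. The name `ensure_unicode` and the `min_confidence` threshold are hypothetical and not part of the library.

```python
from typing import Callable, Tuple

Detector = Callable[[str], Tuple[bool, float]]
Converter = Callable[[str], str]


def ensure_unicode(
    text: str,
    detect: Detector,
    convert: Converter,
    min_confidence: float = 0.95,  # assumed threshold, tune for your data
) -> str:
    """Convert text only when the detector is confident it is Zawgyi."""
    is_zawgyi, confidence = detect(text)
    if is_zawgyi and confidence >= min_confidence:
        return convert(text)
    # Leave the text untouched when detection is negative or uncertain.
    return text
```

In real use, `detect` and `convert` would be `is_likely_zawgyi` and `convert_zawgyi_to_unicode`. Passing them in as arguments keeps the guard easy to test with stubs.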

2. Unicode Normalization (NFC)

Myanmar characters can often be represented in multiple ways (e.g., pre-composed vs. decomposed). We strictly enforce NFC (Normalization Form C) to ensure:

Canonically equivalent spellings of a word such as လုံး resolve to a single code point sequence.
Consistent hashing for dictionary lookups.
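
The effect on lookups can be shown with the standard `unicodedata` module. This is an illustrative sketch: the `lexicon` set and the sample strings are made up for the example and do not come from the library's dictionary.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


# U+1026 (MYANMAR LETTER UU) has the canonical decomposition U+1025 + U+102E.
precomposed = "\u1026"
decomposed = "\u1025\u102e"

# Dot below (U+1037) and asat (U+103A) have different combining classes,
# so NFC puts them in one fixed order regardless of how they were typed.
asat_first = "\u1004\u103a\u1037"
dot_first = "\u1004\u1037\u103a"

# Hypothetical dictionary whose entries were stored in NFC.
lexicon = {nfc(precomposed), nfc(dot_first)}

print(decomposed in lexicon)       # False: the raw code points differ
print(nfc(decomposed) in lexicon)  # True
print(nfc(asat_first) in lexicon)  # True
```

NFC only reorders marks that have a non-zero combining class. Most Myanmar medials and vowel signs have class 0, which is why a separate reordering step is still needed.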

3. Diacritic Reordering

In Myanmar Unicode, diacritics must follow a specific order (Storage Order). However, typing often results in visual-order storage.

Example: when both medials occur, medial-ya (ျ, U+103B) is stored before medial-ra (ြ, U+103C).
Action: Reorders diacritics to the canonical sequence defined by the Unicode standard.
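
A simplified version of the idea, restricted to the four medials, can be sketched as follows. This is not the library's Cython routine, and `reorder_medials` is a hypothetical name. It relies on the fact that the canonical medial order (ya, ra, wa, ha) matches code point order (U+103B to U+103E), so sorting each run of medials is sufficient.

```python
import re

# Canonical medial order: ya (U+103B), ra (U+103C), wa (U+103D), ha (U+103E).
# It coincides with code point order, so sorting each run is enough.
MEDIAL_RUN = re.compile("[\u103b-\u103e]{2,}")


def reorder_medials(text: str) -> str:
    """Sort every run of two or more medials into canonical order."""
    return MEDIAL_RUN.sub(lambda m: "".join(sorted(m.group())), text)


typed = "\u101c\u103e\u103d"      # la + medial-ha + medial-wa (typed order)
canonical = "\u101c\u103d\u103e"  # la + medial-wa + medial-ha

print(reorder_medials(typed) == canonical)  # True
```

A full implementation also has to place vowel signs, anusvara, dot below, and visarga, which do not all follow code point order, so it needs an explicit rank table rather than a plain sort.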

4. Nasal Ending Normalization

Myanmar orthography allows the /n/ sound to be written as န် (Na + Asat) or ံ (Anusvara), and the two are often used interchangeably or incorrectly. Nasal normalization is not one of the normalize() options; it is handled by the PhoneticHasher with normalize_nasals=True:

from myspellchecker.text.phonetic import PhoneticHasher

hasher = PhoneticHasher(normalize_nasals=True)
# Nasal normalization is applied as part of phonetic encoding
code = hasher.encode("နိုင်ငံ")
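
To make the idea concrete without depending on the hasher, the folding can be sketched as a plain string replacement. This is an assumption-laden illustration, not the algorithm PhoneticHasher uses: the helper `nasal_key` is hypothetical, and folding toward Anusvara rather than toward Na + Asat is an arbitrary choice.

```python
NA_ASAT = "\u1014\u103a"  # Na + Asat
ANUSVARA = "\u1036"       # Anusvara


def nasal_key(word: str) -> str:
    """Fold Na + Asat into Anusvara so both spellings share one key."""
    return word.replace(NA_ASAT, ANUSVARA)


with_na_asat = "\u1000\u1014\u103a"  # ka + Na + Asat
with_anusvara = "\u1000\u1036"       # ka + Anusvara

print(nasal_key(with_na_asat) == nasal_key(with_anusvara))  # True
```

A key like this is only suitable for comparing candidates. The two spellings are not interchangeable in correct orthography, so the original text should never be overwritten with the folded form.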

Performance

Core normalization routines (reordering, zero-width removal) are implemented in Cython (.pyx files compiled to C++ extensions), so normalization adds negligible overhead (<1 ms) to the pipeline.
