Myanmar text processing is uniquely challenging due to its complex script rules, multiple encoding standards (Unicode vs. Zawgyi), and flexible character ordering. The myspellchecker library includes a robust normalization pipeline to handle these issues.

Overview

The normalization process ensures that text is in a consistent, canonical form before it reaches the spell checker. This is critical for matching dictionary entries correctly. The pipeline performs these steps:
  1. Zawgyi Detection & Conversion: Converts legacy Zawgyi encoding to standard Unicode.
  2. Unicode Normalization: Applies standard NFC (Normalization Form Canonical Composition).
  3. Zero-Width Removal: Strips invisible characters (ZWSP, ZWNJ, etc.) that confuse algorithms.
  4. Diacritic Reordering: Enforces canonical ordering of Myanmar diacritics (e.g., medial positions).
  5. Nasal Normalization: Standardizes variable nasal endings (e.g., န် vs ံ).
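
To make step 3 concrete, here is a minimal standard-library sketch of zero-width removal. It is not the library's implementation: the helper name `strip_zero_width` and the exact set of characters stripped are assumptions chosen for illustration.

```python
import re

# Assumed set of invisible characters: ZWSP, ZWNJ, ZWJ, word joiner, BOM/ZWNBSP.
# The list the library actually strips is not documented here and may differ.
_ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")


def strip_zero_width(text: str) -> str:
    """Remove zero-width characters that are invisible but break string equality."""
    return _ZERO_WIDTH.sub("", text)


word = "\u1019\u103c\u1014\u103a\u1019\u102c"         # မြန်မာ
dirty = "\u1019\u103c\u1014\u103a\u200b\u1019\u102c"  # same word with a ZWSP inside

assert dirty != word                    # looks identical on screen, compares unequal
assert strip_zero_width(dirty) == word  # equal again once the ZWSP is gone
```

Without this step, a dictionary lookup for the second string would miss even though both strings render the same.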

Usage

You can use the normalizer directly via the normalize module:
from myspellchecker.text.normalize import normalize_with_zawgyi_conversion

raw_text = "..." # Could be Zawgyi or messy Unicode
clean_text = normalize_with_zawgyi_conversion(raw_text)

Direct Function Access

For fine-grained control, you can access specific normalization functions:
from myspellchecker.text.normalize import (
    normalize,
    convert_zawgyi_to_unicode,
    is_likely_zawgyi,
)

text = "..."  # Input that may be Zawgyi or Unicode

# 1. Check for Zawgyi
is_zawgyi, confidence = is_likely_zawgyi(text)

# 2. Convert if needed
if is_zawgyi:
    text = convert_zawgyi_to_unicode(text)

# 3. Standard Normalize (all options shown with defaults)
text = normalize(
    text,
    form="NFC",
    remove_zero_width=True,
    reorder_diacritics=True,
    normalize_variants=False,
    normalize_tall_aa=True,
    normalize_u_asat=True,
)

Features in Detail

1. Zawgyi Support

Legacy Zawgyi-One encoding is still prevalent. We use Google’s myanmartools (machine learning model) for high-accuracy detection (>95%) and python-myanmar for conversion.
  • Detection: Statistical analysis of character sequences.
  • Conversion: Rule-based mapping to Myanmar3 (Unicode).

2. Unicode Normalization (NFC)

Myanmar characters can often be represented in multiple ways (e.g., pre-composed vs. decomposed). We strictly enforce NFC (Normalization Form C) to ensure:
  • လုံး is one unit, not လ + ုံး.
  • Consistent hashing for dictionary lookups.
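
The effect of NFC can be seen with Python's built-in `unicodedata` module alone. The sketch below uses the letter ဦ, which Unicode encodes both as a single code point (U+1026) and as the sequence U+1025 + U+102E. The helper name `nfc_key` is hypothetical and only illustrates why normalized keys matter for lookups.

```python
import unicodedata

decomposed = "\u1025\u102e"  # LETTER U followed by VOWEL SIGN II
composed = "\u1026"          # LETTER UU, the precomposed form


def nfc_key(text: str) -> str:
    """Return the NFC form of text, suitable as a dictionary key."""
    return unicodedata.normalize("NFC", text)


# The raw strings differ, so a naive lookup would miss.
assert decomposed != composed

# After NFC both spellings collapse to the same key.
assert nfc_key(decomposed) == nfc_key(composed) == composed

dictionary = {nfc_key(composed): "entry"}
assert nfc_key(decomposed) in dictionary
```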

3. Diacritic Reordering

In Myanmar Unicode, diacritics must follow a specific order (Storage Order). However, typing often results in visual-order storage.
  • Example: medial-ra (ြ) vs medial-ya (ျ).
  • Action: Reorders diacritics to the canonical sequence defined by the Unicode standard.
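
As a simplified illustration, the sketch below reorders only the four medial signs, which the Unicode encoding model stores in the order ya, ra, wa, ha (U+103B to U+103E). The function name `reorder_medials` is hypothetical, and the library's own reordering presumably covers more character classes than this.

```python
import re

# Runs of two or more medial signs: ya (U+103B), ra (U+103C), wa (U+103D), ha (U+103E).
_MEDIAL_RUN = re.compile("[\u103b-\u103e]{2,}")


def reorder_medials(text: str) -> str:
    """Sort each run of medials into ya, ra, wa, ha order (ascending code point)."""
    return _MEDIAL_RUN.sub(lambda m: "".join(sorted(m.group())), text)


typed = "\u1000\u103d\u103c"      # ka + medial wa + medial ra (non-canonical)
canonical = "\u1000\u103c\u103d"  # ka + medial ra + medial wa (ကြွ)

assert typed != canonical
assert reorder_medials(typed) == canonical
assert reorder_medials(canonical) == canonical  # already-canonical text is unchanged
```

Because the canonical medial order happens to match ascending code point order, a plain sort is enough for this subset.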

4. Nasal Ending Normalization

Myanmar phonology allows the /n/ sound to be written as န် (Na + Asat) or ံ (Anusvara). These are often used interchangeably or incorrectly. Nasal normalization is handled through the PhoneticHasher with normalize_nasals=True:
from myspellchecker.text.phonetic import PhoneticHasher

hasher = PhoneticHasher(normalize_nasals=True)
# Nasal normalization is applied as part of phonetic encoding
code = hasher.encode("နိုင်ငံ")
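
The idea behind nasal folding can be sketched without the library. The toy function below is not how PhoneticHasher works internally (that is not documented here). It only assumes that န် and ံ should map to the same comparison key, and the name `fold_nasals` is hypothetical.

```python
NA_ASAT = "\u1014\u103a"  # န် (Na + Asat)
ANUSVARA = "\u1036"       # ံ (Anusvara)


def fold_nasals(text: str) -> str:
    """Build a comparison key in which both nasal spellings look the same."""
    return text.replace(NA_ASAT, ANUSVARA)


with_na_asat = "\u1000\u1014\u103a"  # ကန်
with_anusvara = "\u1000\u1036"       # ကံ

assert with_na_asat != with_anusvara
assert fold_nasals(with_na_asat) == fold_nasals(with_anusvara)
```

A key like this is only suitable for finding candidate matches. The two spellings can be different words, so the folded form should not replace the user's text.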

Performance

Core normalization routines (reordering, zero-width removal) are implemented in Cython (.pyx files compiled to C++ extensions) for maximum performance. This adds negligible overhead (<1ms) to the pipeline.