The myspellchecker library includes a robust normalization pipeline to handle these issues.
Overview
The normalization process ensures that text is in a consistent, canonical form before it reaches the spell checker. This is critical for matching dictionary entries correctly. The pipeline performs these steps:
- Zawgyi Detection & Conversion: Converts legacy Zawgyi encoding to standard Unicode.
- Unicode Normalization: Applies standard NFC (Normalization Form Canonical Composition).
- Zero-Width Removal: Strips invisible characters (ZWSP, ZWNJ, etc.) that confuse algorithms.
- Diacritic Reordering: Enforces canonical ordering of Myanmar diacritics (e.g., medial positions).
- Nasal Normalization: Standardizes variable nasal endings (e.g., န် vs. ံ).
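As a rough illustration of two of these steps (NFC and zero-width removal), the following sketch uses only the Python standard library. It is not the library's implementation: the function name `basic_normalize` and the exact set of stripped characters are assumptions made for this example.

```python
import unicodedata

# Zero-width characters to strip: ZWSP, ZWNJ, ZWJ, word joiner, BOM/ZWNBSP.
# (Assumed set for illustration; the library's actual list may differ.)
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def basic_normalize(text: str) -> str:
    """Apply NFC, then delete zero-width characters (hypothetical helper)."""
    text = unicodedata.normalize("NFC", text)
    # str.translate deletes every code point that maps to None.
    return text.translate(ZERO_WIDTH)


# A word with an invisible ZWSP in the middle becomes a plain word.
cleaned = basic_normalize("\u1019\u103c\u1014\u103a\u200b\u1019\u102c")
print(len(cleaned))
```

Running NFC first means the zero-width pass always sees composed text; the remaining steps (Zawgyi conversion, reordering, nasal handling) would slot in around these two.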
Usage
You can use the normalizer directly via the normalize module:
Direct Function Access
For fine-grained control, you can access specific normalization functions:
Features in Detail
1. Zawgyi Support
Legacy Zawgyi-One encoding is still prevalent. We use Google's myanmartools (machine learning model) for high-accuracy detection (>95%) and python-myanmar for conversion.
- Detection: Statistical analysis of character sequences.
- Conversion: Rule-based mapping to Myanmar3 (Unicode).
2. Unicode Normalization (NFC)
Myanmar characters can often be represented in multiple ways (e.g., pre-composed vs. decomposed). We strictly enforce NFC (Normalization Form C) to ensure:
- လုံး is one unit, not လ + ုံး.
- Consistent hashing for dictionary lookups.
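To see what NFC does in practice, the short example below uses Python's standard `unicodedata` module on a Myanmar pair for which Unicode defines a canonical composition (U+1025 followed by U+102E composes to U+1026). It illustrates the general mechanism only and does not go through the library's own normalizer.

```python
import unicodedata

# MYANMAR LETTER U followed by MYANMAR VOWEL SIGN II (two code points).
decomposed = "\u1025\u102e"
# NFC composes the pair into the single code point MYANMAR LETTER UU.
composed = unicodedata.normalize("NFC", decomposed)

# A dictionary keyed on NFC strings finds the entry either way,
# as long as every lookup key is normalized first.
dictionary = {unicodedata.normalize("NFC", "\u1026")}
found = unicodedata.normalize("NFC", decomposed) in dictionary

print(len(decomposed), len(composed), found)
```

Without the normalization step, the two spellings would be different strings with different hashes, and the decomposed form would miss the dictionary entry.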
3. Diacritic Reordering
In Myanmar Unicode, diacritics must follow a specific order (Storage Order). However, typing often results in visual-order storage.
- Example: medial-ra (ြ) vs. medial-ya (ျ).
- Action: Reorders diacritics to the canonical sequence defined by the Unicode standard.
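The reordering idea can be sketched for the medials alone. The example below assumes the storage order ya, ra, wa, ha (U+103B to U+103E), which coincides with code point order, and sorts each run of consecutive medials accordingly. The helper `reorder_medials` is hypothetical and far simpler than a full reordering pass, which also has to handle vowels, tone marks, and other signs.

```python
import re

# Runs of two or more medials: ya (U+103B), ra (U+103C), wa (U+103D), ha (U+103E).
_MEDIAL_RUN = re.compile("[\u103b-\u103e]{2,}")


def reorder_medials(text: str) -> str:
    """Sort each run of consecutive medials into code point order (sketch only)."""
    return _MEDIAL_RUN.sub(lambda m: "".join(sorted(m.group())), text)


# Ka + medial-ra + medial-ya (typed out of order) becomes Ka + ya + ra.
fixed = reorder_medials("\u1000\u103c\u103b")
print(fixed == "\u1000\u103b\u103c")
```

Because the function only touches runs of medials, text that is already in storage order passes through unchanged.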
4. Nasal Ending Normalization
Myanmar phonology allows the /n/ sound to be written as န် (Na + Asat) or ံ (Anusvara). These are often used interchangeably or incorrectly.
Nasal normalization is handled through the PhoneticHasher with normalize_nasals=True:
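Independent of that class, the underlying idea can be sketched in a few lines: fold both spellings of the nasal ending to one representative before computing a lookup key. The function `fold_nasals` and the choice of ံ as the target form are assumptions for illustration, not the behavior of PhoneticHasher.

```python
NA_ASAT = "\u1014\u103a"  # န် (Na + Asat)
ANUSVARA = "\u1036"       # ံ (Anusvara)


def fold_nasals(text: str) -> str:
    """Map Na + Asat to Anusvara so both spellings share one key (hypothetical)."""
    return text.replace(NA_ASAT, ANUSVARA)


# Two spellings of the same nasal ending collapse to the same key.
key_a = fold_nasals("\u1000\u1014\u103a")
key_b = fold_nasals("\u1000\u1036")
print(key_a == key_b)
```

Folding like this is suited to building phonetic keys for candidate lookup; it should not be applied to the text that is shown back to the user, since the two spellings are not always interchangeable in correct orthography.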
Performance
Core normalization routines (reordering, zero-width removal) are implemented in Cython (.pyx files compiled to C++ extensions) for maximum performance. This adds negligible overhead (<1 ms) to the pipeline.