Before any validation runs, input text must be normalized: Zawgyi converted, zero-width characters stripped, and character variants unified. The NormalizationService exposes purpose-specific methods so each component gets exactly the normalization it needs.
Overview
Normalization gives every component the same representation of a given text. Each NormalizationService method applies a different subset of steps, depending on its use case:
| Method | Use Case | Zawgyi Conversion | Zero-Width Removal | Char Variants |
|---|---|---|---|---|
| for_spell_checking() | Validation pipeline | No | Yes | No |
| for_dictionary_lookup() | Database queries | Yes | Yes | Yes |
| for_comparison() | Text comparison | Yes | Yes | No |
| for_display() | User output | No | No | No |
| for_ingestion() | Corpus building | Yes | Yes | Yes |
NormalizationService
Basic Usage
from myspellchecker.text.normalization_service import (
    NormalizationService,
    get_normalization_service,
)
# Get singleton service
service = get_normalization_service()
# Or create new instance
service = NormalizationService()
Spell Checking Normalization
Fast normalization for the validation pipeline (no Zawgyi conversion):
normalized = service.for_spell_checking(" မြန်မာ ")
print(normalized) # "မြန်မာ"
Pipeline:
- Strip whitespace
- Unicode NFC normalization
- Remove zero-width characters
- Myanmar diacritic reordering
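The first three steps of this pipeline can be approximated with the standard library alone. The sketch below is illustrative only: `sketch_spell_check_normalize` is a hypothetical helper, not part of myspellchecker, and it leaves out the Myanmar diacritic reordering step entirely.

```python
import unicodedata

# Zero-width characters named in this document: U+200B, U+200C, U+200D.
# Mapping a code point to None makes str.translate delete it.
_ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D])


def sketch_spell_check_normalize(text: str) -> str:
    """Hypothetical stand-in for the first three pipeline steps."""
    text = text.strip()                        # 1. strip whitespace
    text = unicodedata.normalize("NFC", text)  # 2. Unicode NFC
    return text.translate(_ZERO_WIDTH)         # 3. remove zero-width characters


# Leading/trailing spaces and an embedded zero-width space are both removed.
print(repr(sketch_spell_check_normalize("  မြန်\u200bမာ  ")))
```

The order matters in this sketch: stripping first keeps the later steps from doing work on padding, and zero-width removal runs last so that nothing reintroduces invisible characters afterwards.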
Dictionary Lookup Normalization
Complete normalization for database queries:
normalized = service.for_dictionary_lookup(user_input)
Pipeline:
- Strip whitespace
- Zawgyi to Unicode conversion (if detected)
- Unicode NFC normalization
- Remove zero-width characters
- Myanmar diacritic reordering
Comparison Normalization
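Normalization for text comparison. It converts Zawgyi and removes zero-width characters but does not unify character variants. Normalize both sides with the same method before comparing: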
a = service.for_comparison(user_input)
b = service.for_comparison(dictionary_entry)
if a == b:
    print("Match!")
Display Normalization
Minimal normalization preserving user formatting:
normalized = service.for_display(text)
Pipeline:
- Unicode NFC normalization
- Myanmar diacritic reordering
- Preserves whitespace and zero-width characters
Corpus Ingestion
Full normalization for building dictionaries:
normalized = service.for_ingestion(corpus_line)
NormalizationOptions
Customize normalization with options:
from myspellchecker.text.normalization_service import (
    NormalizationService,
    NormalizationOptions,
)
options = NormalizationOptions(
    unicode_form="NFC", # NFC, NFD, NFKC, NFKD
    remove_zero_width=True, # Remove zero-width characters
    reorder_diacritics=True, # Myanmar-specific reordering
    convert_zawgyi=False, # Detect and convert Zawgyi
    strip_whitespace=True, # Strip leading/trailing whitespace
    lowercase=False, # Lowercase (for non-Myanmar text)
    character_variants=False, # Normalize Myanmar character variants (Tall-AA, etc.)
)
service = NormalizationService()
normalized = service.normalize(text, options)
Presets
Pre-defined presets for common use cases:
from myspellchecker.text.normalization_service import (
    PRESET_SPELL_CHECK,
    PRESET_DICTIONARY_LOOKUP,
    PRESET_COMPARISON,
    PRESET_DISPLAY,
    PRESET_INGESTION,
)
service = NormalizationService()
normalized = service.normalize(text, PRESET_COMPARISON)
Preset Configuration
| Preset | Unicode | Zero-Width | Diacritics | Zawgyi | Whitespace | Char Variants |
|---|---|---|---|---|---|---|
| SPELL_CHECK | NFC | Remove | Reorder | No | Strip | No |
| DICTIONARY_LOOKUP | NFC | Remove | Reorder | Convert | Strip | Yes |
| COMPARISON | NFC | Remove | Reorder | Convert | Strip | No |
| DISPLAY | NFC | Keep | Reorder | No | Keep | No |
| INGESTION | NFC | Remove | Reorder | Convert | Strip | Yes |
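One way to read the table is as a set of immutable option records. The sketch below transcribes it into a hypothetical `SketchOptions` dataclass. The field names follow the `NormalizationOptions` parameters documented above, but the class itself and the `SKETCH_PRESETS` mapping are illustrative assumptions, not the library's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchOptions:
    """Hypothetical mirror of the documented option fields."""
    unicode_form: str = "NFC"
    remove_zero_width: bool = True
    reorder_diacritics: bool = True
    convert_zawgyi: bool = False
    strip_whitespace: bool = True
    character_variants: bool = False


# Values transcribed from the preset table; the dictionary is illustrative.
SKETCH_PRESETS = {
    "SPELL_CHECK": SketchOptions(),
    "DICTIONARY_LOOKUP": SketchOptions(convert_zawgyi=True, character_variants=True),
    "COMPARISON": SketchOptions(convert_zawgyi=True),
    "DISPLAY": SketchOptions(remove_zero_width=False, strip_whitespace=False),
    "INGESTION": SketchOptions(convert_zawgyi=True, character_variants=True),
}

for name, opts in SKETCH_PRESETS.items():
    print(name, opts)
```

Laid out this way, it is easy to see that DICTIONARY_LOOKUP and INGESTION carry identical settings in the table, so a word is stored and queried in the same normalized form.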
Myanmar Text Detection
Check if text is primarily Myanmar script:
service = NormalizationService()
is_myanmar = service.is_myanmar_text("မြန်မာ") # True
is_myanmar = service.is_myanmar_text("Hello") # False
is_myanmar = service.is_myanmar_text("Hello မြန်မာ") # Depends on threshold
# Include Extended Myanmar blocks (Shan, Mon, etc.)
is_myanmar = service.is_myanmar_text(text, allow_extended=True)
| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str | required | Text to check |
| allow_extended | bool | False | If False, only core Burmese characters (U+1000-U+109F) count. If True, Extended Myanmar blocks also count as Myanmar. |
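A ratio-based check of this kind can be sketched with the standard library. Everything here is an assumption made for illustration: the function names are hypothetical, the 0.5 threshold is not the library's documented default, and the extended ranges used are the Unicode Myanmar Extended-A (U+AA60-U+AA7F) and Extended-B (U+A9E0-U+A9FF) blocks.

```python
def sketch_myanmar_ratio(text: str, allow_extended: bool = False) -> float:
    """Fraction of non-whitespace characters that fall in Myanmar blocks."""
    ranges = [(0x1000, 0x109F)]  # core Myanmar block
    if allow_extended:
        ranges += [(0xAA60, 0xAA7F), (0xA9E0, 0xA9FF)]  # Extended-A, Extended-B
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(any(lo <= ord(c) <= hi for lo, hi in ranges) for c in chars)
    return hits / len(chars)


def sketch_is_myanmar_text(
    text: str, allow_extended: bool = False, threshold: float = 0.5
) -> bool:
    # threshold=0.5 is an assumed value chosen for this sketch.
    return sketch_myanmar_ratio(text, allow_extended) >= threshold


print(sketch_is_myanmar_text("မြန်မာ"))
print(sketch_is_myanmar_text("Hello"))
print(sketch_myanmar_ratio("Hello မြန်မာ"))
```

Mixed-script input is where the threshold decides the outcome, which is why the result for such text depends on configuration rather than being fixed.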
Zawgyi Handling
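Methods that enable Zawgyi conversion (for_dictionary_lookup(), for_comparison(), for_ingestion()) detect legacy Zawgyi encoding and convert it to Unicode. The detection thresholds are configurable through ZawgyiConfig: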
from myspellchecker.core.config.text_configs import ZawgyiConfig
from myspellchecker.text.normalization_service import NormalizationService
# Custom Zawgyi configuration
zawgyi_config = ZawgyiConfig(
    conversion_threshold=0.9, # Probability threshold for conversion
    myanmar_text_threshold=0.3, # Min Myanmar character ratio
)
service = NormalizationService(zawgyi_config=zawgyi_config)
# Will convert Zawgyi if probability >= 0.9
normalized = service.for_dictionary_lookup(potentially_zawgyi_text)
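How the two thresholds interact is not spelled out here. One plausible reading is that the Myanmar character ratio gates detection and the Zawgyi probability then decides conversion. The sketch below encodes that reading as a hypothetical `sketch_should_convert` function. It is an assumption about the logic, not the library's implementation, and the probability would in practice come from a Zawgyi detector rather than being passed in by hand.

```python
def sketch_should_convert(
    zawgyi_probability: float,
    myanmar_ratio: float,
    conversion_threshold: float = 0.9,
    myanmar_text_threshold: float = 0.3,
) -> bool:
    """Assumed gating logic; the library's actual rule may differ."""
    if myanmar_ratio < myanmar_text_threshold:
        return False  # too little Myanmar text to judge the encoding
    return zawgyi_probability >= conversion_threshold


print(sketch_should_convert(zawgyi_probability=0.95, myanmar_ratio=0.8))
print(sketch_should_convert(zawgyi_probability=0.95, myanmar_ratio=0.1))
```

Under this reading, a high conversion threshold makes the service leave ambiguous text untouched, which avoids converting text that is already valid Unicode.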
Convenience Functions
Module-level functions for quick access:
from myspellchecker.text.normalization_service import (
    normalize_for_spell_checking,
    normalize_for_lookup,
    normalize_for_comparison,
)
# These use the default singleton service
normalized = normalize_for_spell_checking(text)
normalized = normalize_for_lookup(text)
normalized = normalize_for_comparison(text)
Cython Optimization
Core normalization functions are Cython-optimized:
# These are used internally by NormalizationService
from myspellchecker.text.normalize_c import (
    remove_zero_width_chars, # Fast zero-width removal
    reorder_myanmar_diacritics, # Diacritic reordering
    get_myanmar_ratio, # Myanmar character ratio
)
Thread Safety
The NormalizationService is thread-safe:
from concurrent.futures import ThreadPoolExecutor
from myspellchecker.text.normalization_service import get_normalization_service
service = get_normalization_service() # Thread-safe singleton
texts = [" မြန်မာ ", "မြန်မာ"] # Sample inputs
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [
        executor.submit(service.for_spell_checking, text)
        for text in texts
    ]
    results = [f.result() for f in futures]
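A thread-safe singleton accessor is commonly built by guarding first-time construction with a lock. The sketch below shows that general pattern using hypothetical names (`_SketchService`, `get_sketch_service`). It illustrates the idea only and makes no claim about how `get_normalization_service()` is implemented.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class _SketchService:
    """Hypothetical service with no mutable state, so sharing it is safe."""

    def for_spell_checking(self, text: str) -> str:
        return text.strip()


_instance = None
_lock = threading.Lock()


def get_sketch_service() -> _SketchService:
    global _instance
    if _instance is None:          # fast path: no locking once created
        with _lock:
            if _instance is None:  # re-check while holding the lock
                _instance = _SketchService()
    return _instance


# Sixteen concurrent requests should all receive the same object.
with ThreadPoolExecutor(max_workers=4) as executor:
    services = list(executor.map(lambda _: get_sketch_service(), range(16)))

print(all(s is services[0] for s in services))
```

Keeping the service free of per-call mutable state is what makes sharing one instance across threads straightforward; the lock only protects construction.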
Integration
The normalization service is used throughout mySpellChecker:
In SpellChecker
from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(provider=provider)
# Internally uses NormalizationService for consistent normalization
result = checker.check(text)
In Data Pipeline
from myspellchecker.data_pipeline import Pipeline
pipeline = Pipeline()
# Uses for_ingestion() when processing corpus files
pipeline.build_database(input_files=["corpus.txt"], database_path="output.db")
In Providers
from myspellchecker.providers import SQLiteProvider
provider = SQLiteProvider(database_path="path/to/dictionary.db")
# Uses normalized form before database queries
is_valid = provider.is_valid_word("မြန်မာ")
Normalization Steps
1. Unicode Normalization
Converts text to consistent Unicode form (NFC by default):
import unicodedata
# Composed form (NFC)
text = unicodedata.normalize("NFC", text)
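NFC matters for Myanmar text because at least one letter has two canonically equivalent spellings: U+1026 (ဦ) can also be written as U+1025 followed by U+102E. The following check uses only the standard library and shows the two-code-point form collapsing to the single composed letter.

```python
import unicodedata

decomposed = "\u1025\u102E"  # MYANMAR LETTER U + MYANMAR VOWEL SIGN II
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(composed == "\u1026")  # compare against MYANMAR LETTER UU
```

Without this step, two strings that look identical on screen would fail an equality check or a dictionary lookup.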
2. Zero-Width Character Removal
Removes invisible characters that can cause matching issues:
- Zero-width space (U+200B)
- Zero-width non-joiner (U+200C)
- Zero-width joiner (U+200D)
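To see why these characters cause matching issues, consider a word pasted with an embedded zero-width space. The sketch below uses a hypothetical `sketch_remove_zero_width` helper built on `str.translate`; the library's own Cython routine is not shown here.

```python
# Mapping a code point to None tells str.translate to delete it.
ZERO_WIDTH = {0x200B: None, 0x200C: None, 0x200D: None}


def sketch_remove_zero_width(text: str) -> str:
    return text.translate(ZERO_WIDTH)


word = "မြန်မာ"
dictionary = {word}

# Simulate pasted text carrying an invisible U+200B in the middle.
pasted = word[:4] + "\u200b" + word[4:]

print(pasted in dictionary)                            # lookup on raw input
print(sketch_remove_zero_width(pasted) in dictionary)  # lookup after removal
```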
3. Myanmar Diacritic Reordering
Ensures consistent ordering of Myanmar diacritics:
# Example: the syllable ကော ("kaw")
# Before: U+1031 U+1000 U+102C (ေ stored before က — non-canonical)
# After: U+1000 U+1031 U+102C (က before ေ — canonical Unicode order)
# Both render visually as: ကော
# The ေ vowel always appears to the left visually, regardless of codepoint order.
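The single case in this example can be sketched with a regular expression. This is a deliberately narrow illustration, not the library's reordering algorithm: the hypothetical `sketch_reorder_e_vowel` only moves a U+1031 that is not directly preceded by a consonant or medial, and it does not handle stacked consonants, other diacritics, or runs of several misordered syllables.

```python
import re

# Consonants U+1000-U+1021, medials U+103B-U+103E, vowel sign E U+1031.
# Match a U+1031 that does not already follow a consonant or medial,
# and capture the consonant (plus any medials) that comes after it.
_MISPLACED_E = re.compile(
    "(?<![\u1000-\u1021\u103B-\u103E])\u1031([\u1000-\u1021][\u103B-\u103E]*)"
)


def sketch_reorder_e_vowel(text: str) -> str:
    """Move a misplaced U+1031 to after the consonant and medials it follows."""
    return _MISPLACED_E.sub("\\1\u1031", text)


non_canonical = "\u1031\u1000\u102C"  # U+1031 stored before the consonant
canonical = "\u1000\u1031\u102C"      # consonant first

print(sketch_reorder_e_vowel(non_canonical) == canonical)
print(sketch_reorder_e_vowel(canonical) == canonical)
```

The negative lookbehind is what keeps already-canonical text unchanged: a U+1031 that follows a consonant is assumed to belong to that consonant and is left in place.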
4. Zawgyi Detection and Conversion
Detects legacy Zawgyi encoding and converts to Unicode:
# Requires myanmar-tools package
# pip install myanmar-tools
Best Practices
- Use purpose-specific methods: Choose the right method for your use case
- Normalize at boundaries: Normalize input at system entry points
- Be consistent: Use the same normalization for related operations
- Handle Zawgyi: Enable Zawgyi conversion for user-facing input
- Reuse the singleton: Call get_normalization_service() to share one instance instead of constructing a new NormalizationService for every call