Before any validation runs, input text must be normalized: Zawgyi converted, zero-width characters stripped, and character variants unified. The NormalizationService exposes purpose-specific methods so each component gets exactly the normalization it needs.

Overview

Text normalization ensures consistent text representation across all components. The NormalizationService provides purpose-specific methods for different use cases:
| Method                  | Use Case            | Zawgyi Conversion | Zero-Width Removal | Char Variants |
|-------------------------|---------------------|-------------------|--------------------|---------------|
| for_spell_checking()    | Validation pipeline | No                | Yes                | No            |
| for_dictionary_lookup() | Database queries    | Yes               | Yes                | Yes           |
| for_comparison()        | Text comparison     | Yes               | Yes                | No            |
| for_display()           | User output         | No                | No                 | No            |
| for_ingestion()         | Corpus building     | Yes               | Yes                | Yes           |

NormalizationService

Basic Usage

from myspellchecker.text.normalization_service import (
    NormalizationService,
    get_normalization_service
)

# Get singleton service
service = get_normalization_service()

# Or create new instance
service = NormalizationService()

Spell Checking Normalization

Fast normalization for the validation pipeline (no Zawgyi conversion):
normalized = service.for_spell_checking("  မြန်မာ  ")
print(normalized)  # "မြန်မာ"
Pipeline:
  1. Strip whitespace
  2. Unicode NFC normalization
  3. Remove zero-width characters
  4. Myanmar diacritic reordering
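Steps 1-3 can be sketched with the standard library alone; step 4 is handled internally by the optimized reorder_myanmar_diacritics (this is an illustrative sketch, not the library's code):

```python
import unicodedata

# Zero-width space, zero-width non-joiner, zero-width joiner
ZERO_WIDTH = dict.fromkeys((0x200B, 0x200C, 0x200D))

def spell_check_normalize(text: str) -> str:
    text = text.strip()                        # 1. strip whitespace
    text = unicodedata.normalize("NFC", text)  # 2. Unicode NFC normalization
    return text.translate(ZERO_WIDTH)          # 3. drop zero-width characters

spell_check_normalize("  မြန်မာ\u200b  ")  # "မြန်မာ"
```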

Dictionary Lookup Normalization

Complete normalization for database queries:
normalized = service.for_dictionary_lookup(user_input)
Pipeline:
  1. Strip whitespace
  2. Zawgyi to Unicode conversion (if detected)
  3. Unicode NFC normalization
  4. Remove zero-width characters
  5. Myanmar diacritic reordering

Comparison Normalization

Aggressive normalization for text comparison:
a = service.for_comparison(user_input)
b = service.for_comparison(dictionary_entry)
if a == b:
    print("Match!")

Display Normalization

Minimal normalization preserving user formatting:
normalized = service.for_display(text)
Pipeline:
  1. Unicode NFC normalization
  2. Myanmar diacritic reordering
Whitespace and zero-width characters are preserved.

Corpus Ingestion

Full normalization for building dictionaries:
normalized = service.for_ingestion(corpus_line)

NormalizationOptions

Customize normalization with options:
from myspellchecker.text.normalization_service import (
    NormalizationService,
    NormalizationOptions
)

options = NormalizationOptions(
    unicode_form="NFC",         # NFC, NFD, NFKC, NFKD
    remove_zero_width=True,     # Remove zero-width characters
    reorder_diacritics=True,    # Myanmar-specific reordering
    convert_zawgyi=False,       # Detect and convert Zawgyi
    strip_whitespace=True,      # Strip leading/trailing whitespace
    lowercase=False,            # Lowercase (for non-Myanmar text)
    character_variants=False    # Normalize Myanmar character variants (Tall-AA, etc.)
)

service = NormalizationService()
normalized = service.normalize(text, options)

Presets

Pre-defined presets for common use cases:
from myspellchecker.text.normalization_service import (
    PRESET_SPELL_CHECK,
    PRESET_DICTIONARY_LOOKUP,
    PRESET_COMPARISON,
    PRESET_DISPLAY,
    PRESET_INGESTION
)

service = NormalizationService()
normalized = service.normalize(text, PRESET_COMPARISON)

Preset Configuration

| Preset            | Unicode | Zero-Width | Diacritics | Zawgyi  | Whitespace | Char Variants |
|-------------------|---------|------------|------------|---------|------------|---------------|
| SPELL_CHECK       | NFC     | Remove     | Reorder    | No      | Strip      | No            |
| DICTIONARY_LOOKUP | NFC     | Remove     | Reorder    | Convert | Strip      | Yes           |
| COMPARISON        | NFC     | Remove     | Reorder    | Convert | Strip      | No            |
| DISPLAY           | NFC     | Keep       | Reorder    | No      | Keep       | No            |
| INGESTION         | NFC     | Remove     | Reorder    | Convert | Strip      | Yes           |
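If each preset is simply a NormalizationOptions instance, the table corresponds to values along these lines (a sketch read off the table; key names mirror the NormalizationOptions fields):

```python
# All presets share NFC normalization and diacritic reordering.
COMMON = dict(unicode_form="NFC", reorder_diacritics=True)

PRESETS = {
    "SPELL_CHECK":       {**COMMON, "remove_zero_width": True,  "convert_zawgyi": False, "strip_whitespace": True,  "character_variants": False},
    "DICTIONARY_LOOKUP": {**COMMON, "remove_zero_width": True,  "convert_zawgyi": True,  "strip_whitespace": True,  "character_variants": True},
    "COMPARISON":        {**COMMON, "remove_zero_width": True,  "convert_zawgyi": True,  "strip_whitespace": True,  "character_variants": False},
    "DISPLAY":           {**COMMON, "remove_zero_width": False, "convert_zawgyi": False, "strip_whitespace": False, "character_variants": False},
    "INGESTION":         {**COMMON, "remove_zero_width": True,  "convert_zawgyi": True,  "strip_whitespace": True,  "character_variants": True},
}
```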

Myanmar Text Detection

Check if text is primarily Myanmar script:
service = NormalizationService()

is_myanmar = service.is_myanmar_text("မြန်မာ")  # True
is_myanmar = service.is_myanmar_text("Hello")  # False
is_myanmar = service.is_myanmar_text("Hello မြန်မာ")  # Depends on threshold

# Include Extended Myanmar blocks (Shan, Mon, etc.)
is_myanmar = service.is_myanmar_text(text, allow_extended=True)
| Parameter      | Type | Default  | Description |
|----------------|------|----------|-------------|
| text           | str  | required | Text to check |
| allow_extended | bool | False    | If False, only core Burmese characters (U+1000-U+109F) count. If True, Extended Myanmar blocks also count as Myanmar. |
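A plausible pure-Python sketch of this check (the 0.5 threshold and the exact extended-block ranges are assumptions; internally the service uses the optimized get_myanmar_ratio):

```python
def myanmar_ratio(text: str, allow_extended: bool = False) -> float:
    """Fraction of non-whitespace characters in the Myanmar block(s)."""
    ranges = [(0x1000, 0x109F)]           # core Myanmar block
    if allow_extended:
        ranges += [(0xAA60, 0xAA7F),      # Myanmar Extended-A
                   (0xA9E0, 0xA9FF)]      # Myanmar Extended-B
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars
               if any(lo <= ord(c) <= hi for lo, hi in ranges))
    return hits / len(chars)

def is_myanmar_text(text: str, allow_extended: bool = False,
                    threshold: float = 0.5) -> bool:
    return myanmar_ratio(text, allow_extended) >= threshold
```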

Zawgyi Handling

The service automatically detects and converts Zawgyi encoding:
from myspellchecker.core.config.text_configs import ZawgyiConfig

# Custom Zawgyi configuration
zawgyi_config = ZawgyiConfig(
    conversion_threshold=0.9,     # Probability threshold for conversion
    myanmar_text_threshold=0.3    # Min Myanmar character ratio
)

service = NormalizationService(zawgyi_config=zawgyi_config)

# Will convert Zawgyi if probability >= 0.9
normalized = service.for_dictionary_lookup(potentially_zawgyi_text)
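How the two thresholds plausibly interact, with the detector's probability supplied externally (the decision logic here is an illustration, not the library's code):

```python
def should_convert(myanmar_ratio: float, zawgyi_probability: float,
                   myanmar_text_threshold: float = 0.3,
                   conversion_threshold: float = 0.9) -> bool:
    # Text with too few Myanmar characters is never converted,
    # regardless of what the Zawgyi detector reports.
    if myanmar_ratio < myanmar_text_threshold:
        return False
    # Otherwise convert only when the detector is confident enough.
    return zawgyi_probability >= conversion_threshold
```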

Convenience Functions

Module-level functions for quick access:
from myspellchecker.text.normalization_service import (
    normalize_for_spell_checking,
    normalize_for_lookup,
    normalize_for_comparison
)

# These use the default singleton service
normalized = normalize_for_spell_checking(text)
normalized = normalize_for_lookup(text)
normalized = normalize_for_comparison(text)

Cython Optimization

Core normalization functions are Cython-optimized:
# These are used internally by NormalizationService
from myspellchecker.text.normalize_c import (
    remove_zero_width_chars,      # Fast zero-width removal
    reorder_myanmar_diacritics,   # Diacritic reordering
    get_myanmar_ratio             # Myanmar character ratio
)

Thread Safety

The NormalizationService is thread-safe:
from concurrent.futures import ThreadPoolExecutor

service = get_normalization_service()  # Thread-safe singleton

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [
        executor.submit(service.for_spell_checking, text)
        for text in texts
    ]
    results = [f.result() for f in futures]

Integration

The normalization service is used throughout mySpellChecker:

In SpellChecker

from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider

provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(provider=provider)
# Internally uses NormalizationService for consistent normalization
result = checker.check(text)

In Data Pipeline

from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()
# Uses for_ingestion() when processing corpus files
pipeline.build_database(input_files=["corpus.txt"], database_path="output.db")

In Providers

from myspellchecker.providers import SQLiteProvider

provider = SQLiteProvider()
# Uses normalized form before database queries
is_valid = provider.is_valid_word("မြန်မာ")

Normalization Steps

1. Unicode Normalization

Converts text to consistent Unicode form (NFC by default):
import unicodedata

# Composed form (NFC)
text = unicodedata.normalize("NFC", text)

2. Zero-Width Character Removal

Removes invisible characters that can cause matching issues:
  • Zero-width space (U+200B)
  • Zero-width non-joiner (U+200C)
  • Zero-width joiner (U+200D)
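A pure-Python equivalent of this step (the service itself uses the Cython-optimized remove_zero_width_chars):

```python
# Maps each zero-width codepoint to None, which str.translate deletes.
ZERO_WIDTH_TABLE = dict.fromkeys((0x200B, 0x200C, 0x200D))

def remove_zero_width_chars(text: str) -> str:
    return text.translate(ZERO_WIDTH_TABLE)
```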

3. Myanmar Diacritic Reordering

Ensures consistent ordering of Myanmar diacritics:
# Example: the syllable ကော ("kaw")
# Before: U+1031 U+1000 U+102C  (ေ stored before က — non-canonical)
# After:  U+1000 U+1031 U+102C  (က before ေ — canonical Unicode order)
# Both render visually as: ကော
# The ေ vowel always appears to the left visually, regardless of codepoint order.
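A minimal sketch covering just the case shown above (the real reorder_myanmar_diacritics handles the full range of medials, vowels, and tone marks):

```python
import re

# Vowel sign E (U+1031) stored before a consonant (U+1000-U+1021)
# is moved after it, matching the canonical order shown above.
E_BEFORE_CONSONANT = re.compile("\u1031([\u1000-\u1021])")

def reorder_e_vowel(text: str) -> str:
    return E_BEFORE_CONSONANT.sub("\\g<1>\u1031", text)

reorder_e_vowel("\u1031\u1000\u102c") == "\u1000\u1031\u102c"  # True
```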

4. Zawgyi Detection and Conversion

Detects legacy Zawgyi encoding and converts to Unicode:
# Requires myanmar-tools package
# pip install myanmar-tools

Best Practices

  1. Use purpose-specific methods: Choose the right method for your use case
  2. Normalize at boundaries: Normalize input at system entry points
  3. Be consistent: Use the same normalization for related operations
  4. Handle Zawgyi: Enable Zawgyi conversion for user-facing input
  5. Reuse the service: get_normalization_service() returns a cached singleton, avoiding repeated initialization

See Also