Before any validation runs, input text must be normalized: Zawgyi converted, zero-width characters stripped, and character variants unified. The NormalizationService exposes purpose-specific methods so each component gets exactly the normalization it needs.

Overview

Text normalization ensures consistent text representation across all components. The NormalizationService provides purpose-specific methods for different use cases:
| Method                  | Use Case            | Zawgyi Conversion | Zero-Width Removal | Char Variants |
|-------------------------|---------------------|-------------------|--------------------|---------------|
| for_spell_checking()    | Validation pipeline | No                | Yes                | No            |
| for_dictionary_lookup() | Database queries    | Yes               | Yes                | Yes           |
| for_comparison()        | Text comparison     | Yes               | Yes                | No            |
| for_display()           | User output         | No                | No                 | No            |
| for_ingestion()         | Corpus building     | Yes               | Yes                | Yes           |

NormalizationService

Basic Usage

from myspellchecker.text.normalization_service import (
    NormalizationService,
    get_normalization_service
)

# Get singleton service
service = get_normalization_service()

# Or create new instance
service = NormalizationService()

Spell Checking Normalization

Fast normalization for the validation pipeline (no Zawgyi conversion):
normalized = service.for_spell_checking("  မြန်မာ  ")
print(normalized)  # "မြန်မာ"
Pipeline:
  1. Strip whitespace
  2. Unicode NFC normalization
  3. Remove zero-width characters
  4. Myanmar diacritic reordering
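Steps 1-3 can be sketched with the standard library alone; step 4 is handled internally by the optimized reorder_myanmar_diacritics (this is an illustrative sketch, not the library's code):

```python
import unicodedata

# Zero-width space, zero-width non-joiner, zero-width joiner
ZERO_WIDTH = dict.fromkeys((0x200B, 0x200C, 0x200D))

def spell_check_normalize(text: str) -> str:
    text = text.strip()                        # 1. strip whitespace
    text = unicodedata.normalize("NFC", text)  # 2. Unicode NFC normalization
    return text.translate(ZERO_WIDTH)          # 3. drop zero-width characters

spell_check_normalize("  မြန်မာ\u200b  ")  # "မြန်မာ"
```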

Dictionary Lookup Normalization

Complete normalization for database queries:
normalized = service.for_dictionary_lookup(user_input)
Pipeline:
  1. Strip whitespace
  2. Zawgyi to Unicode conversion (if detected)
  3. Unicode NFC normalization
  4. Remove zero-width characters
  5. Myanmar diacritic reordering

Comparison Normalization

Aggressive normalization for text comparison:
a = service.for_comparison(user_input)
b = service.for_comparison(dictionary_entry)
if a == b:
    print("Match!")

Display Normalization

Minimal normalization preserving user formatting:
normalized = service.for_display(text)
Pipeline:
  1. Unicode NFC normalization
  2. Myanmar diacritic reordering
Whitespace and zero-width characters are preserved.

Corpus Ingestion

Full normalization for building dictionaries:
normalized = service.for_ingestion(corpus_line)

NormalizationOptions

Customize normalization with options:
from myspellchecker.text.normalization_service import (
    NormalizationService,
    NormalizationOptions
)

options = NormalizationOptions(
    unicode_form="NFC",         # NFC, NFD, NFKC, NFKD
    remove_zero_width=True,     # Remove zero-width characters
    reorder_diacritics=True,    # Myanmar-specific reordering
    convert_zawgyi=False,       # Detect and convert Zawgyi
    strip_whitespace=True,      # Strip leading/trailing whitespace
    lowercase=False,            # Lowercase (for non-Myanmar text)
    character_variants=False    # Normalize Myanmar character variants (Tall-AA, etc.)
)

service = NormalizationService()
normalized = service.normalize(text, options)

Presets

Pre-defined presets for common use cases:
from myspellchecker.text.normalization_service import (
    PRESET_SPELL_CHECK,
    PRESET_DICTIONARY_LOOKUP,
    PRESET_COMPARISON,
    PRESET_DISPLAY,
    PRESET_INGESTION
)

service = NormalizationService()
normalized = service.normalize(text, PRESET_COMPARISON)

Preset Configuration

| Preset            | Unicode | Zero-Width | Diacritics | Zawgyi  | Whitespace | Char Variants |
|-------------------|---------|------------|------------|---------|------------|---------------|
| SPELL_CHECK       | NFC     | Remove     | Reorder    | No      | Strip      | No            |
| DICTIONARY_LOOKUP | NFC     | Remove     | Reorder    | Convert | Strip      | Yes           |
| COMPARISON        | NFC     | Remove     | Reorder    | Convert | Strip      | No            |
| DISPLAY           | NFC     | Keep       | Reorder    | No      | Keep       | No            |
| INGESTION         | NFC     | Remove     | Reorder    | Convert | Strip      | Yes           |
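If each preset is simply a NormalizationOptions instance, the table corresponds to values along these lines (a sketch read off the table; key names mirror the NormalizationOptions fields):

```python
# All presets share NFC normalization and diacritic reordering.
COMMON = dict(unicode_form="NFC", reorder_diacritics=True)

PRESETS = {
    "SPELL_CHECK":       {**COMMON, "remove_zero_width": True,  "convert_zawgyi": False, "strip_whitespace": True,  "character_variants": False},
    "DICTIONARY_LOOKUP": {**COMMON, "remove_zero_width": True,  "convert_zawgyi": True,  "strip_whitespace": True,  "character_variants": True},
    "COMPARISON":        {**COMMON, "remove_zero_width": True,  "convert_zawgyi": True,  "strip_whitespace": True,  "character_variants": False},
    "DISPLAY":           {**COMMON, "remove_zero_width": False, "convert_zawgyi": False, "strip_whitespace": False, "character_variants": False},
    "INGESTION":         {**COMMON, "remove_zero_width": True,  "convert_zawgyi": True,  "strip_whitespace": True,  "character_variants": True},
}
```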

Myanmar Text Detection

Check if text is primarily Myanmar script:
service = NormalizationService()

is_myanmar = service.is_myanmar_text("မြန်မာ")  # True
is_myanmar = service.is_myanmar_text("Hello")  # False
is_myanmar = service.is_myanmar_text("Hello မြန်မာ")  # Depends on threshold

# Include Extended Myanmar blocks (Shan, Mon, etc.)
is_myanmar = service.is_myanmar_text(text, allow_extended=True)
| Parameter      | Type | Default  | Description |
|----------------|------|----------|-------------|
| text           | str  | required | Text to check |
| allow_extended | bool | False    | If False, only core Burmese characters (U+1000-U+109F) count. If True, Extended Myanmar blocks also count as Myanmar. |
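A plausible pure-Python sketch of this check (the 0.5 threshold and the exact extended-block ranges are assumptions; internally the service uses the optimized get_myanmar_ratio):

```python
def myanmar_ratio(text: str, allow_extended: bool = False) -> float:
    """Fraction of non-whitespace characters in the Myanmar block(s)."""
    ranges = [(0x1000, 0x109F)]           # core Myanmar block
    if allow_extended:
        ranges += [(0xAA60, 0xAA7F),      # Myanmar Extended-A
                   (0xA9E0, 0xA9FF)]      # Myanmar Extended-B
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars
               if any(lo <= ord(c) <= hi for lo, hi in ranges))
    return hits / len(chars)

def is_myanmar_text(text: str, allow_extended: bool = False,
                    threshold: float = 0.5) -> bool:
    return myanmar_ratio(text, allow_extended) >= threshold
```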

Zawgyi Handling

The service automatically detects and converts Zawgyi encoding:
from myspellchecker.core.config.text_configs import ZawgyiConfig

# Custom Zawgyi configuration
zawgyi_config = ZawgyiConfig(
    conversion_threshold=0.9,     # Probability threshold for conversion
    myanmar_text_threshold=0.3    # Min Myanmar character ratio
)

service = NormalizationService(zawgyi_config=zawgyi_config)

# Will convert Zawgyi if probability >= 0.9
normalized = service.for_dictionary_lookup(potentially_zawgyi_text)
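How the two thresholds plausibly interact, with the detector's probability supplied externally (the decision logic here is an illustration, not the library's code):

```python
def should_convert(myanmar_ratio: float, zawgyi_probability: float,
                   myanmar_text_threshold: float = 0.3,
                   conversion_threshold: float = 0.9) -> bool:
    # Text with too few Myanmar characters is never converted,
    # regardless of what the Zawgyi detector reports.
    if myanmar_ratio < myanmar_text_threshold:
        return False
    # Otherwise convert only when the detector is confident enough.
    return zawgyi_probability >= conversion_threshold
```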

Convenience Functions

Module-level functions for quick access:
from myspellchecker.text.normalization_service import (
    normalize_for_spell_checking,
    normalize_for_lookup,
    normalize_for_comparison
)

# These use the default singleton service
normalized = normalize_for_spell_checking(text)
normalized = normalize_for_lookup(text)
normalized = normalize_for_comparison(text)

Cython Optimization

Core normalization functions are Cython-optimized:
# These are used internally by NormalizationService
from myspellchecker.text.normalize_c import (
    remove_zero_width_chars,      # Fast zero-width removal
    reorder_myanmar_diacritics,   # Diacritic reordering
    get_myanmar_ratio             # Myanmar character ratio
)

Thread Safety

The NormalizationService is thread-safe:
from concurrent.futures import ThreadPoolExecutor

service = get_normalization_service()  # Thread-safe singleton

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [
        executor.submit(service.for_spell_checking, text)
        for text in texts
    ]
    results = [f.result() for f in futures]

Integration

The normalization service is used throughout mySpellChecker:

In SpellChecker

from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider

provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(provider=provider)
# Internally uses NormalizationService for consistent normalization
result = checker.check(text)

In Data Pipeline

from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()
# Uses for_ingestion() when processing corpus files
pipeline.build_database(input_files=["corpus.txt"], database_path="output.db")

In Providers

from myspellchecker.providers import SQLiteProvider

provider = SQLiteProvider()
# Uses normalized form before database queries
is_valid = provider.is_valid_word("မြန်မာ")

Normalization Steps

1. Unicode Normalization

Converts text to consistent Unicode form (NFC by default):
import unicodedata

# Composed form (NFC)
text = unicodedata.normalize("NFC", text)

2. Zero-Width Character Removal

Removes invisible characters that can cause matching issues:
  • Zero-width space (U+200B)
  • Zero-width non-joiner (U+200C)
  • Zero-width joiner (U+200D)
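A pure-Python equivalent of this step (the service itself uses the Cython-optimized remove_zero_width_chars):

```python
# Maps each zero-width codepoint to None, which str.translate deletes.
ZERO_WIDTH_TABLE = dict.fromkeys((0x200B, 0x200C, 0x200D))

def remove_zero_width_chars(text: str) -> str:
    return text.translate(ZERO_WIDTH_TABLE)
```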

3. Myanmar Diacritic Reordering

Ensures consistent ordering of Myanmar diacritics:
# Example: the syllable ကော ("kaw")
# Before: U+1031 U+1000 U+102C  (ေ stored before က — non-canonical)
# After:  U+1000 U+1031 U+102C  (က before ေ — canonical Unicode order)
# Both render visually as: ကော
# The ေ vowel always appears to the left visually, regardless of codepoint order.
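A minimal sketch covering just the case shown above (the real reorder_myanmar_diacritics handles the full range of medials, vowels, and tone marks):

```python
import re

# Vowel sign E (U+1031) stored before a consonant (U+1000-U+1021)
# is moved after it, matching the canonical order shown above.
E_BEFORE_CONSONANT = re.compile("\u1031([\u1000-\u1021])")

def reorder_e_vowel(text: str) -> str:
    return E_BEFORE_CONSONANT.sub("\\g<1>\u1031", text)

reorder_e_vowel("\u1031\u1000\u102c") == "\u1000\u1031\u102c"  # True
```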

4. Zawgyi Detection and Conversion

Detects legacy Zawgyi encoding and converts to Unicode:
# Requires myanmar-tools package
# pip install myanmar-tools

Best Practices

  1. Use purpose-specific methods: Choose the right method for your use case
  2. Normalize at boundaries: Normalize input at system entry points
  3. Be consistent: Use the same normalization for related operations
  4. Handle Zawgyi: Enable Zawgyi conversion for user-facing input
  5. Reuse the service: get_normalization_service() returns a cached singleton, avoiding repeated initialization

See Also