Text Validation - mySpellChecker

The text validator checks Myanmar text for structural correctness, catching invalid character ordering, encoding artifacts (Zawgyi remnants), doubled diacritics, and other issues that indicate malformed input rather than spelling errors.

Overview

from myspellchecker.text.validator import validate_text, validate_word, ValidationIssue

# Validate a word (returns bool)
is_valid = validate_word("ကျောင်း")
if is_valid:
    print("Word is valid")

# Validate full text (returns ValidationResult)
result = validate_text("မြန်မာနိုင်ငံသည် အရှေ့တောင်အာရှတွင် တည်ရှိသည်။")
if not result.is_valid:
    for issue, description in result.issues:
        print(f"{issue.value}: {description}")

ValidationIssue Enum

All validation issues are categorized using the ValidationIssue enum:

Issue	Description	Category
`EXTENDED_MYANMAR`	Contains Extended Myanmar/Shan/Mon/Karen characters (U+1050-U+109F, Extended-A/B)	Encoding
`ZAWGYI_YA_ASAT`	Zawgyi ya-medial used as pseudo-asat (e.g., ငျး)	Encoding
`ZAWGYI_YA_TERMINAL`	Zawgyi ya-medial at word-final position	Encoding
`ZAWGYI_YA_RA`	Zawgyi ya+ra medial combination	Encoding
`ASAT_BEFORE_VOWEL`	Asat (်) appears before a vowel sign (invalid ordering)	Structural
`INCOMPLETE_VOWEL`	Incomplete vowel pattern (e.g., vowel before asat, missing u-vowel in O-vowel)	Structural
`DIGIT_TONE`	Myanmar digit followed by tone mark	Structural
`SCRAMBLED_ORDER`	Scrambled character sequence (e.g., vowel-asat-vowel)	Structural
`INVALID_START`	Word starts with invalid character (not consonant, independent vowel, or digit)	Structural
`DOUBLED_DIACRITIC`	Doubled vowel, medial, or invalid tone sequence	Structural
`VIRAMA_AT_END`	Virama (္) at end of word (incomplete stacking)	Structural
`EMPTY_OR_WHITESPACE`	Empty or whitespace-only input	Structural
`KNOWN_INVALID`	Word is in the curated known-invalid words list	Quality
`FRAGMENT_PATTERN`	Segmentation fragment (consonant + asat/tone only)	Segmentation
`DOUBLE_ENDING`	Double-ending artifact (e.g., valid word + fragment merged)	Segmentation
`INCOMPLETE_WORD`	Incomplete word (ends with medial, incomplete stacking, or bare consonant after medial)	Segmentation
`MIXED_LETTER_NUMERAL`	Mixed Myanmar letter and numeral (should be split)	Quality
`ASAT_INITIAL`	Asat-initial fragment (consonant+asat at word start)	Segmentation
`COMPOUND_TRUNCATED`	Compound word with truncated ending	Quality
`MISSING_E_VOWEL`	Missing ေ in ောင pattern (common typo)	Quality
`PURE_NUMERAL`	Pure Myanmar numeral sequence (not a word)	Quality
`DOUBLED_CONSONANT`	Two identical consonants only (segmentation artifact)	Quality
`INVALID_VOWEL_SEQUENCE_SYLLABLE`	Invalid vowel sequence (e.g., doubled i-vowels, ာု)	Structural
`BARE_CONSONANT_END`	Word ends with bare consonant without asat	Segmentation
`STACKED_CONSONANT_START`	Word starts with stacked consonant marker (္)	Segmentation
`MEDIAL_START`	Word starts with a medial (ျ ြ ွ ှ)	Segmentation
`DEPENDENT_VOWEL_START`	Word starts with a dependent vowel sign	Segmentation
`GREAT_SA_START`	Word starts with Great Sa (ဿ)	Segmentation
`ASAT_ANUSVARA_SEQUENCE`	Contains phonetically impossible ်ံ sequence	Segmentation
`DOUBLED_INDEPENDENT_VOWEL`	Two identical independent vowels (OCR error)	Segmentation
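The Category column can help with triage, for example when encoding problems should be handled differently from segmentation artifacts. The following is a minimal sketch, assuming you keep your own name-to-category lookup transcribed from the table above. The `ISSUE_CATEGORY` dict and the `summarize_categories` helper are hypothetical and not part of the library, and only a subset of rows is shown.

```python
from collections import Counter

# Hypothetical lookup transcribed from the table above (subset only).
ISSUE_CATEGORY = {
    "EXTENDED_MYANMAR": "Encoding",
    "ZAWGYI_YA_ASAT": "Encoding",
    "ASAT_BEFORE_VOWEL": "Structural",
    "DOUBLED_DIACRITIC": "Structural",
    "VIRAMA_AT_END": "Structural",
    "KNOWN_INVALID": "Quality",
    "PURE_NUMERAL": "Quality",
    "FRAGMENT_PATTERN": "Segmentation",
    "MEDIAL_START": "Segmentation",
}


def summarize_categories(issue_names):
    """Count issues per category; names missing from the lookup go to 'Unknown'."""
    return Counter(ISSUE_CATEGORY.get(name, "Unknown") for name in issue_names)


summary = summarize_categories(["VIRAMA_AT_END", "MEDIAL_START", "ASAT_BEFORE_VOWEL"])
print(dict(summary))
```

With real results, the names would come from `issue.name` for each `(issue, description)` pair in `result.issues`.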

Core Functions

validate_word

Quick boolean validation check for a single word:

from myspellchecker.text.validator import validate_word

# Check a valid word (returns bool)
is_valid = validate_word("မြန်မာ")
print(is_valid)  # True

# Check word with Zawgyi artifacts
is_valid = validate_word("ေကာင္း")  # Zawgyi encoding
print(is_valid)  # False

# Check invalid syllable
is_valid = validate_word("ျက")  # Invalid start
print(is_valid)  # False

validate_text

Validates text and returns detailed issue information:

from myspellchecker.text.validator import validate_text

text = "မြန်မာနိုင်ငံသည် အရှေ့တောင်အာရှတွင် တည်ရှိသည်။"
result = validate_text(text)

# ValidationResult has: is_valid, issues, cleaned_text
if not result.is_valid:
    for issue, description in result.issues:
        print(f"{issue.name}: {description}")

Validation Categories

Structural Validation

Checks Myanmar character-structure rules. Use validate_text to see which rule a word violates:

from myspellchecker.text.validator import validate_text

# Asat before vowel check
result = validate_text("ကျွန်ုပ်")  # Asat before vowel sign
# result.issues may contain (ValidationIssue.ASAT_BEFORE_VOWEL, "Asat before vowel: ်ု")

# Doubled diacritic check
result = validate_text("ကာါ")  # Doubled vowel signs
# result.issues may contain (ValidationIssue.DOUBLED_DIACRITIC, "Doubled vowel: ...")

# Virama at end of word
result = validate_text("က္")  # Incomplete stacking
# result.issues may contain (ValidationIssue.VIRAMA_AT_END, "Virama at word end")
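To make the three rules above concrete, here is a standalone sketch that approximates them with code point checks. It is not the library's implementation: the regular expressions, the assumed vowel-sign range (U+102B to U+1032), and the `structural_issues` helper are illustrative only, and the real validator may apply more rules and exceptions.

```python
import re

ASAT = "\u103A"    # asat sign
VIRAMA = "\u1039"  # virama (stacking) sign

# Asat directly followed by a dependent vowel sign (assumed range U+102B..U+1032).
ASAT_BEFORE_VOWEL_RE = re.compile(ASAT + "[\u102B-\u1032]")
# The same vowel sign twice in a row, or two aa-type signs (U+102B/U+102C) together.
DOUBLED_VOWEL_RE = re.compile("([\u102B-\u1032])\\1|[\u102B\u102C]{2}")


def structural_issues(word):
    """Return the names of the rules this sketch considers violated."""
    found = []
    if ASAT_BEFORE_VOWEL_RE.search(word):
        found.append("ASAT_BEFORE_VOWEL")
    if DOUBLED_VOWEL_RE.search(word):
        found.append("DOUBLED_DIACRITIC")
    if word.endswith(VIRAMA):
        found.append("VIRAMA_AT_END")
    return found


print(structural_issues("\u1000\u1039"))        # ka + virama
print(structural_issues("\u1000\u102C\u102B"))  # ka + aa + tall aa
```

Writing the inputs as escape sequences keeps the code point order unambiguous, which matters because these checks depend on storage order rather than on how the text renders.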

Encoding Detection

Detects legacy Zawgyi encoding:

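import re
from myspellchecker.text.validator import validate_word

# Illustrative Zawgyi indicators (a sketch, not the library's actual pattern list)
CONSONANT = "[\u1000-\u1021]"
ZAWGYI_PATTERNS = [
    re.compile("(?:^|\\s)\u1031" + CONSONANT),  # Zawgyi vowel-first: ေ stored before its consonant
    re.compile("\u1039(?!" + CONSONANT + ")"),  # Invalid stacking: ္ not followed by a consonant
    re.compile("\u1033"),  # Zawgyi-specific codepoint
]

is_valid = validate_word("ေကာင္း")
# Returns: False (contains Zawgyi artifacts)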

Quality Filters

Detects low-quality or incomplete words:

from myspellchecker.text.validator import (
    is_fragment_pattern,
    is_incomplete_word,
    is_truncated_word,
    is_quality_word,
)

# Fragment detection (returns Tuple[bool, Optional[str]])
is_frag, reason = is_fragment_pattern("င်း")  # (True, "description")

# Incomplete word detection (returns Tuple[bool, Optional[str]])
is_inc, reason = is_incomplete_word("ကျော")  # (True, "description")

# Truncation detection (frequency-based, second arg is a callable)
freq_dict = {"ချိန်": 1200}  # hypothetical frequency table; real counts come from your corpus
is_truncated_word("ချိန", lambda word: freq_dict.get(word, 0))  # (True, 'ချိန်')

# Overall quality check
is_quality_word("ကျောင်း")  # True - high quality

Known Invalid Words

A curated list of ~50 verified invalid words that commonly appear in corpora:

from myspellchecker.text.validator import KNOWN_INVALID_WORDS, ValidationIssue

# Example entries from the set (~50 total):
#   "သည်မ"      Truncated
#   "သည်င်း"    Invalid merge
#   "ကို့"       Invalid tone
#   "တွင့်"      Invalid ending

# Check if word is known invalid
word = "သည်မ"
issues = []
if word in KNOWN_INVALID_WORDS:
    issues.append(ValidationIssue.KNOWN_INVALID)

Valid Pali/Sanskrit Endings

Whitelist of ~80 words with valid bare consonant endings (Pali/Sanskrit loanwords):

from myspellchecker.text.validator import VALID_PALI_BARE_ENDINGS

# Example entries from the set (~80 total):
#   "ဗုဒ္ဓ"     Buddha
#   "သံဃာ"     Sangha
#   "ဓမ္မ"      Dhamma

# Used to avoid false positives on religious/formal terms
is_whitelisted = "ဗုဒ္ဓ" in VALID_PALI_BARE_ENDINGS

Extended Myanmar Detection

Detects Myanmar Extended-A and Extended-B characters:

from myspellchecker.text.validator import validate_word, validate_text

# Extended ranges (range() excludes its end value)
EXTENDED_A = range(0xAA60, 0xAA80)  # U+AA60-AA7F
EXTENDED_B = range(0xA9E0, 0xAA00)  # U+A9E0-A9FF

# These are used in minority languages (Shan, Mon, etc.)
is_valid = validate_word("ꩮꩯꩰ")
# Returns: False (contains Extended Myanmar characters)

# Use validate_text for detailed issue information
result = validate_text("ꩮꩯꩰ")
# result.issues contains (ValidationIssue.EXTENDED_MYANMAR, "Extended Myanmar char: ...")
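When it helps to see exactly which characters trigger this issue, a range scan can locate them. This is a standalone sketch rather than the library's detection code: the `find_extended_myanmar` helper is hypothetical, and the three ranges are the ones quoted on this page (U+1050 to U+109F from the enum table, plus the Extended-A and Extended-B blocks).

```python
# Code point ranges, inclusive on both ends.
EXTENDED_RANGES = [
    (0x1050, 0x109F),  # extended letters in the main Myanmar block (per the enum table)
    (0xAA60, 0xAA7F),  # Myanmar Extended-A
    (0xA9E0, 0xA9FF),  # Myanmar Extended-B
]


def find_extended_myanmar(text):
    """Return (index, character) pairs for every character inside the ranges above."""
    return [
        (i, ch)
        for i, ch in enumerate(text)
        if any(low <= ord(ch) <= high for low, high in EXTENDED_RANGES)
    ]


hits = find_extended_myanmar("\u1000\uAA6E\uA9E0")
print([f"U+{ord(ch):04X}" for _, ch in hits])
```

Using inclusive bounds avoids the off-by-one that is easy to introduce with `range()`, whose end value is excluded.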

Integration with SpellChecker

The validation module integrates with the main spell checker:

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.validation_configs import ValidationConfig

config = SpellCheckerConfig(
    use_rule_based_validation=True,  # Enable structural validation
    validation=ValidationConfig(
        use_zawgyi_detection=True,  # Enable Zawgyi detection
        strict_validation=True,     # Enable strict validation
    )
)

checker = SpellChecker(config=config)
result = checker.check("မြန်မာစာ")

# Structural issues are reported in result.errors
for error in result.errors:
    if "structural" in str(error.error_type):
        print(f"Structural issue: {error.text}")

Data Pipeline Integration

Used in the data pipeline to filter corpus words:

from myspellchecker.text.validator import validate_word, is_quality_word

def filter_corpus(words):
    """Filter corpus to only include quality words."""
    quality_words = []
    for word in words:
        # validate_word returns bool (True if valid)
        is_valid = validate_word(word)

        if is_valid and is_quality_word(word):
            quality_words.append(word)

    return quality_words

Performance

Operation	Time	Notes
`validate_word`	<1ms	Single word validation
`validate_text`	~10ms/1K words	Batch validation
Pattern matching	<0.1ms	Compiled regex
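Timings like these will vary with hardware and input, so it can be worth measuring on your own data. The sketch below uses the standard library's `timeit`. The `time_per_call_ms` helper and the `stand_in_validator` function are hypothetical: the stand-in exists only so the snippet runs without the library installed, and its timing says nothing about `validate_word` itself.

```python
import timeit


def time_per_call_ms(func, sample, repeat=5, number=1000):
    """Best-of-`repeat` average time per call, in milliseconds."""
    timer = timeit.Timer(lambda: func(sample))
    best_total = min(timer.repeat(repeat=repeat, number=number))
    return best_total / number * 1000.0


# Stand-in validator (hypothetical). Replace it with validate_word from
# myspellchecker.text.validator to measure the real function.
def stand_in_validator(word):
    return bool(word) and not word.endswith("\u1039")


elapsed = time_per_call_ms(stand_in_validator, "\u1000\u103B\u1031\u102C\u1004\u103A\u1038")
print(f"{elapsed:.4f} ms per call")
```

Taking the minimum over several repeats reduces noise from other processes, which is the approach the `timeit` documentation recommends.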

Use Cases

Corpus Cleaning

# Clean corpus before building dictionary
from myspellchecker.text.validator import validate_word, validate_text, ValidationIssue

def clean_corpus(words):
    cleaned = []
    for word in words:
        # validate_word returns bool; use validate_text for detailed issues
        if validate_word(word):
            cleaned.append(word)
        else:
            # For finer control, use validate_text to inspect specific issues
            result = validate_text(word)
            low_severity_only = all(
                issue in {ValidationIssue.EXTENDED_MYANMAR, ValidationIssue.PURE_NUMERAL}
                for issue, _ in result.issues
            )
            if low_severity_only:
                cleaned.append(word)

    return cleaned

Quality Reporting

from collections import Counter
from myspellchecker.text.validator import validate_text

def quality_report(text):
    # validate_text returns a ValidationResult with is_valid and issues
    result = validate_text(text)

    issue_counts = Counter()
    for issue, description in result.issues:
        issue_counts[issue.name] += 1

    print("Quality Report:")
    for issue_name, count in issue_counts.most_common():
        print(f"  {issue_name}: {count}")
