The text validator checks Myanmar text for structural correctness, catching invalid character ordering, encoding artifacts (Zawgyi remnants), doubled diacritics, and other issues that indicate malformed input rather than spelling errors.

Overview

from myspellchecker.text.validator import validate_text, validate_word, ValidationIssue

# Validate a word (returns bool)
is_valid = validate_word("ကျောင်း")
if is_valid:
    print("Word is valid")

# Validate full text (returns ValidationResult)
result = validate_text("မြန်မာနိုင်ငံသည် အရှေ့တောင်အာရှတွင် တည်ရှိသည်။")
if not result.is_valid:
    for issue, description in result.issues:
        print(f"{issue.value}: {description}")

ValidationIssue Enum

All validation issues are categorized using the ValidationIssue enum:
| Issue | Description | Category |
| --- | --- | --- |
| EXTENDED_MYANMAR | Contains Extended Myanmar/Shan/Mon/Karen characters (U+1050-U+109F, Extended-A/B) | Encoding |
| ZAWGYI_YA_ASAT | Zawgyi ya-medial used as pseudo-asat (e.g., ငျး) | Encoding |
| ZAWGYI_YA_TERMINAL | Zawgyi ya-medial at word-final position | Encoding |
| ZAWGYI_YA_RA | Zawgyi ya+ra medial combination | Encoding |
| ASAT_BEFORE_VOWEL | Asat (်) appears before a vowel sign (invalid ordering) | Structural |
| INCOMPLETE_VOWEL | Incomplete vowel pattern (e.g., vowel before asat, missing u-vowel in O-vowel) | Structural |
| DIGIT_TONE | Myanmar digit followed by tone mark | Structural |
| SCRAMBLED_ORDER | Scrambled character sequence (e.g., vowel-asat-vowel) | Structural |
| INVALID_START | Word starts with invalid character (not consonant, independent vowel, or digit) | Structural |
| DOUBLED_DIACRITIC | Doubled vowel, medial, or invalid tone sequence | Structural |
| VIRAMA_AT_END | Virama (္) at end of word (incomplete stacking) | Structural |
| EMPTY_OR_WHITESPACE | Empty or whitespace-only input | Structural |
| KNOWN_INVALID | Word is in the curated known-invalid words list | Quality |
| FRAGMENT_PATTERN | Segmentation fragment (consonant + asat/tone only) | Segmentation |
| DOUBLE_ENDING | Double-ending artifact (e.g., valid word + fragment merged) | Segmentation |
| INCOMPLETE_WORD | Incomplete word (ends with medial, incomplete stacking, or bare consonant after medial) | Segmentation |
| MIXED_LETTER_NUMERAL | Mixed Myanmar letter and numeral (should be split) | Quality |
| ASAT_INITIAL | Asat-initial fragment (consonant+asat at word start) | Segmentation |
| COMPOUND_TRUNCATED | Compound word with truncated ending | Quality |
| MISSING_E_VOWEL | Missing ေ in ောင pattern (common typo) | Quality |
| PURE_NUMERAL | Pure Myanmar numeral sequence (not a word) | Quality |
| DOUBLED_CONSONANT | Two identical consonants only (segmentation artifact) | Quality |
| INVALID_VOWEL_SEQUENCE_SYLLABLE | Invalid vowel sequence (e.g., doubled i-vowels, ာု) | Structural |
| BARE_CONSONANT_END | Word ends with bare consonant without asat | Segmentation |
| STACKED_CONSONANT_START | Word starts with stacked consonant marker (္) | Segmentation |
| MEDIAL_START | Word starts with a medial (ျ ြ ွ ှ) | Segmentation |
| DEPENDENT_VOWEL_START | Word starts with a dependent vowel sign | Segmentation |
| GREAT_SA_START | Word starts with Great Sa (ဿ) | Segmentation |
| ASAT_ANUSVARA_SEQUENCE | Contains phonetically impossible ်ံ sequence | Segmentation |
| DOUBLED_INDEPENDENT_VOWEL | Two identical independent vowels (OCR error) | Segmentation |
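
The categories above can be used to triage issues, for example to handle encoding problems separately from segmentation artifacts. The following is a minimal sketch that assumes the category mapping is maintained by hand from the table; `ISSUE_CATEGORY` and `group_by_category` are hypothetical names, not part of myspellchecker, and only a few rows are included.

```python
from collections import defaultdict

# Hypothetical hand-written mapping copied from a few rows of the table.
# The library itself may not expose categories in this form.
ISSUE_CATEGORY = {
    "EXTENDED_MYANMAR": "Encoding",
    "ZAWGYI_YA_ASAT": "Encoding",
    "ASAT_BEFORE_VOWEL": "Structural",
    "VIRAMA_AT_END": "Structural",
    "KNOWN_INVALID": "Quality",
    "PURE_NUMERAL": "Quality",
    "FRAGMENT_PATTERN": "Segmentation",
    "MEDIAL_START": "Segmentation",
}


def group_by_category(issue_names):
    """Group issue names by category; unmapped names go under 'Unknown'."""
    grouped = defaultdict(list)
    for name in issue_names:
        grouped[ISSUE_CATEGORY.get(name, "Unknown")].append(name)
    return dict(grouped)
```

With real results, the names could be taken from `issue.name` for each `(issue, description)` pair in `result.issues`.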

Core Functions

validate_word

Quick boolean validation check for a single word:
from myspellchecker.text.validator import validate_word

# Check a valid word (returns bool)
is_valid = validate_word("မြန်မာ")
print(is_valid)  # True

# Check word with Zawgyi artifacts
is_valid = validate_word("ေကာင္း")  # Zawgyi encoding
print(is_valid)  # False

# Check invalid syllable
is_valid = validate_word("ျက")  # Invalid start
print(is_valid)  # False

validate_text

Validates text and returns detailed issue information:
from myspellchecker.text.validator import validate_text

text = "မြန်မာနိုင်ငံသည် အရှေ့တောင်အာရှတွင် တည်ရှိသည်။"
result = validate_text(text)

# ValidationResult has: is_valid, issues, cleaned_text
if not result.is_valid:
    for issue, description in result.issues:
        print(f"{issue.name}: {description}")

Validation Categories

Structural Validation

Checks Myanmar character structure rules using validate_text for detailed issues:
from myspellchecker.text.validator import validate_text

# Asat before vowel check
result = validate_text("ကျွန်ုပ်")  # Asat before vowel sign
# result.issues may contain (ValidationIssue.ASAT_BEFORE_VOWEL, "Asat before vowel: ်ု")

# Doubled diacritic check
result = validate_text("ကာါ")  # Doubled vowel signs
# result.issues may contain (ValidationIssue.DOUBLED_DIACRITIC, "Doubled vowel: ...")

# Virama at end of word
result = validate_text("က္")  # Incomplete stacking
# result.issues may contain (ValidationIssue.VIRAMA_AT_END, "Virama at word end")
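
To make the three checks above concrete, here is a standalone sketch of how such rules could be expressed as regular expressions over Unicode codepoints. It is an assumption about one possible approach, not the library's implementation; `CHECKS` and `structural_issues` are hypothetical names, and the real validator covers many more cases.

```python
import re

# Simplified rules, keyed by the issue name they approximate.
CHECKS = [
    # Asat (U+103A) directly followed by a dependent vowel sign (U+102B-U+1032)
    ("ASAT_BEFORE_VOWEL", re.compile(r"\u103A[\u102B-\u1032]")),
    # The same vowel sign or medial twice in a row, or two aa-vowels (ါ / ာ) together
    ("DOUBLED_DIACRITIC",
     re.compile(r"([\u102B-\u1032\u103B-\u103E])\1|[\u102B\u102C]{2}")),
    # Virama (U+1039) as the last character: stacking was never completed
    ("VIRAMA_AT_END", re.compile(r"\u1039$")),
]


def structural_issues(word):
    """Return the names of all simplified checks that match the word."""
    return [name for name, pattern in CHECKS if pattern.search(word)]
```

Returning every matching name, rather than stopping at the first, mirrors the way `validate_text` reports a list of issues.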

Encoding Detection

Detects legacy Zawgyi encoding:
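import re
from myspellchecker.text.validator import validate_word

# Illustrative Zawgyi indicators (simplified; the library's actual patterns may differ)
ZAWGYI_PATTERNS = [
    re.compile(r"^\u1031[\u1000-\u1021]"),     # Zawgyi vowel-first: ေ stored before the consonant
    re.compile(r"\u1039(?![\u1000-\u1021])"),  # ္ not followed by a consonant (invalid stacking)
    re.compile(r"\u1033"),                     # Zawgyi-specific codepoint
]

is_valid = validate_word("ေကာင္း")
# Returns: False (contains Zawgyi artifacts)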

Quality Filters

Detects low-quality or incomplete words:
from myspellchecker.text.validator import (
    is_fragment_pattern,
    is_incomplete_word,
    is_truncated_word,
    is_quality_word,
)

# Fragment detection (returns Tuple[bool, Optional[str]])
is_frag, reason = is_fragment_pattern("င်း")  # (True, "description")

# Incomplete word detection (returns Tuple[bool, Optional[str]])
is_inc, reason = is_incomplete_word("ကျော")  # (True, "description")

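# Truncation detection (frequency-based; the second argument is a callable
# that returns a word's frequency)
freq_dict = {"ချိန်": 1200}  # word -> corpus frequency (illustrative values)
is_truncated_word("ချိန", lambda word: freq_dict.get(word, 0))  # (True, 'ချိန်')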

# Overall quality check
is_quality_word("ကျောင်း")  # True - high quality

Known Invalid Words

A curated list of ~50 verified invalid words that commonly appear in corpora:
from myspellchecker.text.validator import KNOWN_INVALID_WORDS, ValidationIssue

# Example entries from the set (~50 total):
#   "သည်မ"      Truncated
#   "သည်င်း"    Invalid merge
#   "ကို့"       Invalid tone
#   "တွင့်"      Invalid ending

# Check if word is known invalid
word = "သည်မ"
issues = []
if word in KNOWN_INVALID_WORDS:
    issues.append(ValidationIssue.KNOWN_INVALID)

Valid Pali/Sanskrit Endings

Whitelist of ~80 words with valid bare consonant endings (Pali/Sanskrit loanwords):
from myspellchecker.text.validator import VALID_PALI_BARE_ENDINGS

# Example entries from the set (~80 total):
#   "ဗုဒ္ဓ"     Buddha
#   "သံဃာ"     Sangha
#   "ဓမ္မ"      Dhamma

# Used to avoid false positives on religious/formal terms
is_whitelisted = "ဗုဒ္ဓ" in VALID_PALI_BARE_ENDINGS

Extended Myanmar Detection

Detects Myanmar Extended-A and Extended-B characters:
from myspellchecker.text.validator import validate_word, validate_text

# Extended ranges (range() excludes its end value)
EXTENDED_A = range(0xAA60, 0xAA80)  # U+AA60-AA7F
EXTENDED_B = range(0xA9E0, 0xAA00)  # U+A9E0-A9FF

# These are used in minority languages (Shan, Mon, etc.)
is_valid = validate_word("ꩮꩯꩰ")
# Returns: False (contains Extended Myanmar characters)

# Use validate_text for detailed issue information
result = validate_text("ꩮꩯꩰ")
# result.issues contains (ValidationIssue.EXTENDED_MYANMAR, "Extended Myanmar char: ...")
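
A codepoint range test is enough to locate extended characters without calling the validator. The sketch below is a standalone illustration, not library code; `EXTENDED_RANGES` and `find_extended_chars` are hypothetical names, and the three ranges are taken from the EXTENDED_MYANMAR description in the issue table.

```python
# Inclusive (low, high) codepoint bounds.
EXTENDED_RANGES = (
    (0x1050, 0x109F),  # extended letters inside the main Myanmar block
    (0xAA60, 0xAA7F),  # Myanmar Extended-A
    (0xA9E0, 0xA9FF),  # Myanmar Extended-B
)


def find_extended_chars(text):
    """Return (index, 'U+XXXX') for every character in an extended range."""
    hits = []
    for index, char in enumerate(text):
        code = ord(char)
        if any(low <= code <= high for low, high in EXTENDED_RANGES):
            hits.append((index, f"U+{code:04X}"))
    return hits
```

Reporting positions and codepoints, rather than a single boolean, makes it easier to decide whether a word should be dropped or routed to a language-specific pipeline.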

Integration with SpellChecker

The validation module integrates with the main spell checker:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.validation_configs import ValidationConfig

config = SpellCheckerConfig(
    use_rule_based_validation=True,  # Enable structural validation
    validation=ValidationConfig(
        use_zawgyi_detection=True,   # Enable Zawgyi detection
        strict_validation=True,      # Enable strict validation
    )
)

checker = SpellChecker(config=config)
result = checker.check("မြန်မာစာ")

# Structural issues are reported in result.errors
for error in result.errors:
    if "structural" in str(error.error_type):
        print(f"Structural issue: {error.text}")

Data Pipeline Integration

Used in the data pipeline to filter corpus words:
from myspellchecker.text.validator import validate_word, is_quality_word

def filter_corpus(words):
    """Filter corpus to only include quality words."""
    quality_words = []
    for word in words:
        # validate_word returns bool (True if valid)
        is_valid = validate_word(word)

        if is_valid and is_quality_word(word):
            quality_words.append(word)

    return quality_words

Performance

| Operation | Time | Notes |
| --- | --- | --- |
| validate_word | <1ms | Single word validation |
| validate_text | ~10ms/1K words | Batch validation |
| Pattern matching | <0.1ms | Compiled regex |
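
These figures depend on hardware and input, so it can be worth measuring them locally. The helper below is a small timing sketch using only the standard library; `mean_ms_per_call` is a hypothetical name, and it accepts any single-argument callable, so `validate_word` could be passed in place of the stand-in used here.

```python
import time


def mean_ms_per_call(func, inputs, repeats=5):
    """Return the best-of-`repeats` mean time per call, in milliseconds."""
    if not inputs:
        raise ValueError("inputs must not be empty")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for item in inputs:
            func(item)
        elapsed = time.perf_counter() - start
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, elapsed / len(inputs))
    return best * 1000.0


# Stand-in callable; swap in validate_word to time the real validator.
sample_ms = mean_ms_per_call(len, ["\u1000\u102C", "\u1019\u102C"])
```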

Use Cases

Corpus Cleaning

# Clean corpus before building dictionary
from myspellchecker.text.validator import validate_word, validate_text, ValidationIssue

def clean_corpus(words):
    cleaned = []
    for word in words:
        # validate_word returns bool; use validate_text for detailed issues
        if validate_word(word):
            cleaned.append(word)
        else:
            # For finer control, use validate_text to inspect specific issues
            result = validate_text(word)
            # all() is True for an empty list, so require at least one issue
            low_severity_only = bool(result.issues) and all(
                issue in {ValidationIssue.EXTENDED_MYANMAR, ValidationIssue.PURE_NUMERAL}
                for issue, _ in result.issues
            )
            if low_severity_only:
                cleaned.append(word)

    return cleaned

Quality Reporting

from collections import Counter
from myspellchecker.text.validator import validate_text

def quality_report(text):
    # validate_text returns a ValidationResult with is_valid and issues
    result = validate_text(text)

    issue_counts = Counter()
    for issue, description in result.issues:
        issue_counts[issue.name] += 1

    print("Quality Report:")
    for issue_name, count in issue_counts.most_common():
        print(f"  {issue_name}: {count}")
