The text validator checks Myanmar text for structural correctness, catching invalid character ordering, encoding artifacts (Zawgyi remnants), doubled diacritics, and other issues that indicate malformed input rather than spelling errors.
Overview
from myspellchecker.text.validator import validate_text, validate_word, ValidationIssue
# Validate a word (returns bool)
is_valid = validate_word("ကျောင်း")
if is_valid:
    print("Word is valid")
# Validate full text (returns ValidationResult)
result = validate_text("မြန်မာနိုင်ငံသည် အရှေ့တောင်အာရှတွင် တည်ရှိသည်။")
if not result.is_valid:
    for issue, description in result.issues:
        print(f"{issue.value}: {description}")
ValidationIssue Enum
All validation issues are categorized using the ValidationIssue enum:
| Issue | Description | Category |
|---|---|---|
| EXTENDED_MYANMAR | Contains Extended Myanmar/Shan/Mon/Karen characters (U+1050-U+109F, Extended-A/B) | Encoding |
| ZAWGYI_YA_ASAT | Zawgyi ya-medial used as pseudo-asat (e.g., ငျး) | Encoding |
| ZAWGYI_YA_TERMINAL | Zawgyi ya-medial at word-final position | Encoding |
| ZAWGYI_YA_RA | Zawgyi ya+ra medial combination | Encoding |
| ASAT_BEFORE_VOWEL | Asat (်) appears before a vowel sign (invalid ordering) | Structural |
| INCOMPLETE_VOWEL | Incomplete vowel pattern (e.g., vowel before asat, missing u-vowel in O-vowel) | Structural |
| DIGIT_TONE | Myanmar digit followed by tone mark | Structural |
| SCRAMBLED_ORDER | Scrambled character sequence (e.g., vowel-asat-vowel) | Structural |
| INVALID_START | Word starts with invalid character (not consonant, independent vowel, or digit) | Structural |
| DOUBLED_DIACRITIC | Doubled vowel, medial, or invalid tone sequence | Structural |
| VIRAMA_AT_END | Virama (္) at end of word (incomplete stacking) | Structural |
| EMPTY_OR_WHITESPACE | Empty or whitespace-only input | Structural |
| KNOWN_INVALID | Word is in the curated known-invalid words list | Quality |
| FRAGMENT_PATTERN | Segmentation fragment (consonant + asat/tone only) | Segmentation |
| DOUBLE_ENDING | Double-ending artifact (e.g., valid word + fragment merged) | Segmentation |
| INCOMPLETE_WORD | Incomplete word (ends with medial, incomplete stacking, or bare consonant after medial) | Segmentation |
| MIXED_LETTER_NUMERAL | Mixed Myanmar letter and numeral (should be split) | Quality |
| ASAT_INITIAL | Asat-initial fragment (consonant+asat at word start) | Segmentation |
| COMPOUND_TRUNCATED | Compound word with truncated ending | Quality |
| MISSING_E_VOWEL | Missing ေ in ောင pattern (common typo) | Quality |
| PURE_NUMERAL | Pure Myanmar numeral sequence (not a word) | Quality |
| DOUBLED_CONSONANT | Two identical consonants only (segmentation artifact) | Quality |
| INVALID_VOWEL_SEQUENCE_SYLLABLE | Invalid vowel sequence (e.g., doubled i-vowels, ာု) | Structural |
| BARE_CONSONANT_END | Word ends with bare consonant without asat | Segmentation |
| STACKED_CONSONANT_START | Word starts with stacked consonant marker (္) | Segmentation |
| MEDIAL_START | Word starts with a medial (ျ ြ ွ ှ) | Segmentation |
| DEPENDENT_VOWEL_START | Word starts with a dependent vowel sign | Segmentation |
| GREAT_SA_START | Word starts with Great Sa (ဿ) | Segmentation |
| ASAT_ANUSVARA_SEQUENCE | Contains phonetically impossible ်ံ sequence | Segmentation |
| DOUBLED_INDEPENDENT_VOWEL | Two identical independent vowels (OCR error) | Segmentation |
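Since every issue belongs to one of four categories, it can be convenient to triage results by category rather than by individual issue. The following is a minimal, library-independent sketch: `ISSUE_CATEGORY` and `count_by_category` are hypothetical names (not library exports), the mapping is transcribed from only a few rows of the table, and issue names are assumed to be plain strings such as those produced by `issue.name`.

```python
from collections import Counter

# Hypothetical lookup transcribed from a few rows of the table above.
# It is NOT exported by the library; extend it with the remaining rows as needed.
ISSUE_CATEGORY = {
    "EXTENDED_MYANMAR": "Encoding",
    "ZAWGYI_YA_ASAT": "Encoding",
    "ASAT_BEFORE_VOWEL": "Structural",
    "DOUBLED_DIACRITIC": "Structural",
    "VIRAMA_AT_END": "Structural",
    "KNOWN_INVALID": "Quality",
    "PURE_NUMERAL": "Quality",
    "FRAGMENT_PATTERN": "Segmentation",
    "MEDIAL_START": "Segmentation",
}


def count_by_category(issue_names):
    """Count issues per category; names missing from the lookup go under 'Unknown'."""
    return Counter(ISSUE_CATEGORY.get(name, "Unknown") for name in issue_names)


counts = count_by_category(["VIRAMA_AT_END", "MEDIAL_START", "DOUBLED_DIACRITIC"])
print(counts["Structural"], counts["Segmentation"])
```

With a real result, the input could be built as `[issue.name for issue, _ in result.issues]`.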
Core Functions
validate_word
Quick boolean validation check for a single word:
from myspellchecker.text.validator import validate_word
# Check a valid word (returns bool)
is_valid = validate_word("မြန်မာ")
print(is_valid) # True
# Check word with Zawgyi artifacts
is_valid = validate_word("ေကာင္း") # Zawgyi encoding
print(is_valid) # False
# Check invalid syllable
is_valid = validate_word("ျက") # Invalid start
print(is_valid) # False
validate_text
Validates text and returns detailed issue information:
from myspellchecker.text.validator import validate_text
text = "မြန်မာနိုင်ငံသည် အရှေ့တောင်အာရှတွင် တည်ရှိသည်။"
result = validate_text(text)
# ValidationResult has: is_valid, issues, cleaned_text
if not result.is_valid:
    for issue, description in result.issues:
        print(f"{issue.name}: {description}")
Validation Categories
Structural Validation
Checks Myanmar character-structure rules. Use validate_text to get detailed issue information:
from myspellchecker.text.validator import validate_text
# Asat before vowel check
result = validate_text("ကျွန်ုပ်") # Asat before vowel sign
# result.issues may contain (ValidationIssue.ASAT_BEFORE_VOWEL, "Asat before vowel: ်ု")
# Doubled diacritic check
result = validate_text("ကာါ") # Doubled vowel signs
# result.issues may contain (ValidationIssue.DOUBLED_DIACRITIC, "Doubled vowel: ...")
# Virama at end of word
result = validate_text("က္") # Incomplete stacking
# result.issues may contain (ValidationIssue.VIRAMA_AT_END, "Virama at word end")
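To make the three checks above concrete, here is a deliberately naive sketch of how such rules could be expressed with regular expressions over Unicode code points. It is not the library's implementation: `structural_issues` is a hypothetical helper, it only catches an identical mark repeated twice, and it has no exception lists of any kind.

```python
import re

ASAT = "\u103A"
VIRAMA = "\u1039"
VOWEL_SIGNS = "\u102B-\u1032"  # dependent vowel signs
MEDIALS = "\u103B-\u103E"      # ya, ra, wa, ha medials

# Asat immediately followed by a dependent vowel sign.
ASAT_BEFORE_VOWEL = re.compile(f"{ASAT}[{VOWEL_SIGNS}]")
# The same vowel sign or medial written twice in a row.
DOUBLED_MARK = re.compile(f"([{VOWEL_SIGNS}{MEDIALS}])\\1")


def structural_issues(word):
    """Return the names of the naive structural checks that `word` fails."""
    issues = []
    if ASAT_BEFORE_VOWEL.search(word):
        issues.append("ASAT_BEFORE_VOWEL")
    if DOUBLED_MARK.search(word):
        issues.append("DOUBLED_DIACRITIC")
    if word.endswith(VIRAMA):
        issues.append("VIRAMA_AT_END")
    return issues


print(structural_issues("\u1000\u1039"))        # consonant + virama
print(structural_issues("\u1000\u102C\u102C"))  # consonant + doubled aa-vowel
```

The inputs use explicit escapes so the code points being tested are unambiguous.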
Encoding Detection
Detects legacy Zawgyi encoding:
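from myspellchecker.text.validator import validate_word

# Zawgyi detection patterns (described informally, not the library's literal definitions):
#   - "ေ" placed before its consonant (Zawgyi stores the vowel first)
#   - "္" followed by a character that cannot be stacked (invalid stacking)
#   - "\u1033" (Zawgyi-specific codepoint)
is_valid = validate_word("ေကာင္း")
# Returns: False (contains Zawgyi artifacts)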
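The three patterns listed above can be approximated with a few lines of standard-library Python. This is a rough sketch under stated assumptions, not the detector the library ships: `zawgyi_hints` is a hypothetical helper, it only looks at a word-initial ေ (in Unicode order the vowel sign is stored after its consonant), and it treats any virama not followed by a base consonant (U+1000 to U+1021) as suspicious.

```python
import re

E_VOWEL = "\u1031"
VIRAMA = "\u1039"
CONSONANTS = "\u1000-\u1021"

# Virama is meant to stack the consonant that follows it.
VIRAMA_NOT_STACKING = re.compile(f"{VIRAMA}(?![{CONSONANTS}])")


def zawgyi_hints(word):
    """Return human-readable hints that `word` may be Zawgyi-encoded."""
    hints = []
    if word.startswith(E_VOWEL):
        hints.append("e-vowel before consonant")
    if VIRAMA_NOT_STACKING.search(word):
        hints.append("virama not followed by a consonant")
    if "\u1033" in word:
        hints.append("contains U+1033")
    return hints


# Visual-order (Zawgyi-style) sequence versus the Unicode-order spelling.
print(zawgyi_hints("\u1031\u1000\u102C\u1004\u1039\u1038"))
print(zawgyi_hints("\u1000\u1031\u102C\u1004\u103A\u1038"))
```

Heuristics like these produce hints rather than verdicts, which is why the sketch returns a list of reasons instead of a boolean.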
Quality Filters
Detects low-quality or incomplete words:
from myspellchecker.text.validator import (
is_fragment_pattern,
is_incomplete_word,
is_truncated_word,
is_quality_word,
)
# Fragment detection (returns Tuple[bool, Optional[str]])
is_frag, reason = is_fragment_pattern("င်း") # (True, "description")
# Incomplete word detection (returns Tuple[bool, Optional[str]])
is_inc, reason = is_incomplete_word("ကျော") # (True, "description")
# Truncation detection (frequency-based, second arg is a callable)
freq_dict = {"ချိန်": 1200}  # hypothetical word-frequency table, for illustration only
is_truncated_word("ချိန", lambda word: freq_dict.get(word, 0)) # (True, 'ချိန်')
# Overall quality check
is_quality_word("ကျောင်း") # True - high quality
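The source does not spell out how the frequency-based truncation check works, so the following is only one plausible reading of the `is_truncated_word` example: a word is treated as truncated when the same word with a trailing asat is far more frequent. `looks_truncated`, the `ratio` threshold, and the frequency figures are all hypothetical.

```python
ASAT = "\u103A"


def looks_truncated(word, get_freq, ratio=10):
    """Return (True, completed) if word + asat is at least `ratio` times more frequent.

    `get_freq` is any callable mapping a word to its corpus frequency.
    """
    candidate = word + ASAT
    # max(..., 1) keeps unseen words from matching on a zero threshold.
    if get_freq(candidate) >= ratio * max(get_freq(word), 1):
        return True, candidate
    return False, None


# Hypothetical frequencies: the asat-final form dominates the bare form.
freq = {
    "\u1001\u103B\u102D\u1014\u103A": 500,
    "\u1001\u103B\u102D\u1014": 2,
}
print(looks_truncated("\u1001\u103B\u102D\u1014", lambda w: freq.get(w, 0)))
```

Passing the frequency source as a callable, as the library's signature also does, keeps the check independent of how frequencies are stored (a dict, a database, or a trie).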
Known Invalid Words
A curated list of ~50 verified invalid words that commonly appear in corpora:
from myspellchecker.text.validator import KNOWN_INVALID_WORDS, ValidationIssue

# Example invalid words (an illustrative excerpt of the set's contents)
KNOWN_INVALID_WORDS = {
    "သည်မ",  # Truncated
    "သည်င်း",  # Invalid merge
    "ကို့",  # Invalid tone
    "တွင့်",  # Invalid ending
    # ... ~50 total
}

# Check if word is known invalid
word = "သည်မ"
issues = []
if word in KNOWN_INVALID_WORDS:
    issues.append(ValidationIssue.KNOWN_INVALID)
Valid Pali/Sanskrit Endings
Whitelist of ~80 words with valid bare consonant endings (Pali/Sanskrit loanwords):
from myspellchecker.text.validator import VALID_PALI_BARE_ENDINGS
# Example valid Pali endings
VALID_PALI_BARE_ENDINGS = {
    "ဗုဒ္ဓ",  # Buddha
    "သံဃာ",  # Sangha
    "ဓမ္မ",  # Dhamma
    # ... ~80 total
}
# Used to avoid false positives on religious/formal terms
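As an illustration of how a whitelist suppresses false positives, the sketch below flags words whose last character is a base consonant unless the word is whitelisted. It is a simplified stand-in, not the library's logic: `bare_consonant_end` is a hypothetical helper, and `PALI_WHITELIST` is a two-entry placeholder for `VALID_PALI_BARE_ENDINGS`.

```python
CONSONANTS = range(0x1000, 0x1022)  # U+1000 through U+1021

# Hypothetical two-entry stand-in for the library's whitelist.
PALI_WHITELIST = {
    "\u1017\u102F\u1012\u1039\u1013",  # ဗုဒ္ဓ (Buddha)
    "\u1013\u1019\u1039\u1019",        # ဓမ္မ (Dhamma)
}


def bare_consonant_end(word, whitelist=PALI_WHITELIST):
    """Return True if `word` ends in a base consonant and is not whitelisted."""
    if not word or word in whitelist:
        return False
    return ord(word[-1]) in CONSONANTS


print(bare_consonant_end("\u1013\u1019\u1039\u1019"))  # whitelisted loanword
print(bare_consonant_end("\u1000\u1019"))              # not whitelisted
```

Checking the whitelist first means a loanword never reaches the structural test at all.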
Extended Myanmar Detection
Detects Myanmar Extended-A and Extended-B characters:
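from myspellchecker.text.validator import validate_word, validate_text

# Extended ranges (range() excludes its end value, so each end is one past the block)
EXTENDED_A = range(0xAA60, 0xAA80) # U+AA60-AA7F
EXTENDED_B = range(0xA9E0, 0xAA00) # U+A9E0-A9FF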
# These are used in minority languages (Shan, Mon, etc.)
is_valid = validate_word("ꩮꩯꩰ")
# Returns: False (contains Extended Myanmar characters)
# Use validate_text for detailed issue information
result = validate_text("ꩮꩯꩰ")
# result.issues contains (ValidationIssue.EXTENDED_MYANMAR, "Extended Myanmar char: ...")
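A range-based membership test is enough to locate the offending characters. The sketch below is independent of the library: `EXTENDED_BLOCKS` and `extended_chars` are hypothetical names, and only the Extended-A and Extended-B blocks are covered.

```python
# Unicode block boundaries; range() excludes its end value.
EXTENDED_BLOCKS = {
    "Myanmar Extended-A": range(0xAA60, 0xAA80),  # U+AA60-U+AA7F
    "Myanmar Extended-B": range(0xA9E0, 0xAA00),  # U+A9E0-U+A9FF
}


def extended_chars(text):
    """Return (char, block name) pairs for every Extended-A/B character in `text`."""
    found = []
    for ch in text:
        for name, block in EXTENDED_BLOCKS.items():
            if ord(ch) in block:
                found.append((ch, name))
    return found


for ch, block in extended_chars("\uAA6E\uAA6F\uAA70"):
    print(f"U+{ord(ch):04X} {block}")
```

Membership tests on a `range` are constant-time, so this stays cheap even for long inputs.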
Integration with SpellChecker
The validation module integrates with the main spell checker:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.validation_configs import ValidationConfig
config = SpellCheckerConfig(
    use_rule_based_validation=True,  # Enable structural validation
    validation=ValidationConfig(
        use_zawgyi_detection=True,  # Enable Zawgyi detection
        strict_validation=True,  # Enable strict validation
    )
)
checker = SpellChecker(config=config)
result = checker.check("မြန်မာစာ")
# Structural issues are reported in result.errors
for error in result.errors:
    if "structural" in str(error.error_type):
        print(f"Structural issue: {error.text}")
Data Pipeline Integration
The validator is used in the data pipeline to filter corpus words:
from myspellchecker.text.validator import validate_word, validate_text, is_quality_word
def filter_corpus(words):
    """Filter corpus to only include quality words."""
    quality_words = []
    for word in words:
        # validate_word returns bool (True if valid)
        is_valid = validate_word(word)
        if is_valid and is_quality_word(word):
            quality_words.append(word)
    return quality_words
| Operation | Time | Notes |
|---|---|---|
| validate_word | <1ms | Single word validation |
| validate_text | ~10ms/1K words | Batch validation |
| Pattern matching | <0.1ms | Compiled regex |
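Timings like these depend on hardware and input, so it can be worth measuring them locally. The helper below is a generic sketch: `time_per_word` is a hypothetical function, and `stand_in_validator` is a trivial placeholder so the snippet runs without the library installed. Swap in `validate_word` to measure the real thing.

```python
import time


def time_per_word(validate, words, repeats=5):
    """Return the best observed average seconds per word over `repeats` passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for word in words:
            validate(word)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / len(words))
    return best


def stand_in_validator(word):
    """Hypothetical placeholder; replace with validate_word for real measurements."""
    return bool(word.strip())


sample = ["\u1019\u103C\u1014\u103A\u1019\u102C"] * 1000
per_word = time_per_word(stand_in_validator, sample)
print(f"{per_word * 1e6:.2f} microseconds per word")
```

Taking the minimum over several passes reduces noise from other processes on the machine.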
Use Cases
Corpus Cleaning
# Clean corpus before building dictionary
from myspellchecker.text.validator import validate_word, validate_text, ValidationIssue
def clean_corpus(words):
    cleaned = []
    for word in words:
        # validate_word returns bool; use validate_text for detailed issues
        if validate_word(word):
            cleaned.append(word)
        else:
            # For finer control, use validate_text to inspect specific issues
            result = validate_text(word)
            low_severity_only = all(
                issue in {ValidationIssue.EXTENDED_MYANMAR, ValidationIssue.PURE_NUMERAL}
                for issue, _ in result.issues
            )
            if low_severity_only:
                cleaned.append(word)
    return cleaned
Quality Reporting
from collections import Counter
from myspellchecker.text.validator import validate_text
def quality_report(text):
    # validate_text returns a ValidationResult with is_valid and issues
    result = validate_text(text)
    issue_counts = Counter()
    for issue, description in result.issues:
        issue_counts[issue.name] += 1
    print("Quality Report:")
    for issue_name, count in issue_counts.most_common():
        print(f"  {issue_name}: {count}")
See Also