Text flows through four validation layers in sequence — syllable structure, word lookup, grammar rules, and context probability — with each layer catching a different class of errors before passing clean tokens downstream. This page details every stage, from pre-processing through error aggregation.
Pipeline Overview
+-------------------+
| Input Text        |
+---------+---------+
          |
          v
+-------------------+
| Text Normalization|
+---------+---------+
          |
          v
+-------------------+
| Syllable          |
| Segmentation      |
+---------+---------+
          |
          v
+-------------------+     +-------------------+
| Layer 1: Syllable |---->| Syllable Errors   |
| Validation        |     +-------------------+
+---------+---------+
          |
          v
+-------------------+     +-------------------+
| Post-Normalization|---->| Detector Errors   |
| Detectors (38)    |     +-------------------+
+---------+---------+
          |
          v
+-------------------+
| Word Assembly     |
+---------+---------+
          |
          v
+-------------------+     +-------------------+
| Layer 2: Word     |---->| Word Errors       |
| Validation        |     +-------------------+
+---------+---------+
          |
          v
+-------------------+     +-------------------+
| Layer 2.5:        |---->| Grammar Errors    |
| Grammar Checking  |     +-------------------+
+---------+---------+
          |
          v
+-------------------+     +-------------------+
| Layer 3: Context  |---->| Context Errors    |
| Validation        |     +-------------------+
+---------+---------+
          |
          v
+-------------------+
| Result Assembly   |
+---------+---------+
          |
          v
+-------------------+
| Response          |
+-------------------+
Pre-Processing
Text Normalization
Before validation, text is normalized:
from myspellchecker.text.normalize import normalize
normalized = normalize(text)
# - Remove zero-width characters
# - Normalize Unicode (NFC)
# - Handle Zawgyi detection/conversion
# - Normalize whitespace
Normalization steps:
- Zero-width removal: Remove invisible characters
- Unicode normalization: NFC form for consistent comparison
- Zawgyi handling: Detect and optionally convert legacy encoding
- Whitespace normalization: Consistent spacing
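As a rough illustration of the zero-width, NFC, and whitespace steps, here is a minimal standard-library sketch. The name `normalize_sketch` and the chosen set of zero-width code points are assumptions made for this example; the library's `normalize` may cover more cases, and Zawgyi detection is deliberately left out.

```python
import re
import unicodedata

# Assumed set of invisible code points: ZWSP, ZWNJ, ZWJ, word joiner, BOM.
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize_sketch(text: str) -> str:
    """Hypothetical stand-in for three of the four normalization steps."""
    # 1. Zero-width removal: str.translate drops code points mapped to None.
    text = text.translate(_ZERO_WIDTH)
    # 2. Unicode normalization to NFC so equal strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # 3. Whitespace normalization: collapse runs and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

Running NFC after stripping invisible characters means a stray zero-width code point cannot block composition of the characters on either side of it.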
Syllable Segmentation
Text is broken into syllables using Myanmar orthographic rules:
from myspellchecker.segmenters import DefaultSegmenter
segmenter = DefaultSegmenter()
syllables = segmenter.segment_syllables("မြန်မာနိုင်ငံ")
# ["မြန်", "မာ", "နိုင်", "ငံ"]
Segmentation uses:
- Consonant boundaries: Detect syllable starts
- Combining character rules: Group marks correctly
- Stacking rules: Handle complex consonant clusters
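The consonant-boundary and stacking rules can be approximated with a single regular expression. The sketch below is not `DefaultSegmenter`; it is a simplified, assumed rule: a new syllable starts at a consonant (U+1000 to U+1021) that is neither stacked under a virama (U+1039) nor immediately followed by asat (U+103A) or virama. Real text needs more cases (kinzi, independent vowels, digits, punctuation).

```python
import re

# A consonant starts a syllable unless it is the lower half of a stack
# (preceded by virama) or a syllable-final consonant (followed by asat/virama).
_SYLLABLE_START = re.compile("(?<!\u1039)[\u1000-\u1021](?![\u103a\u1039])")


def segment_syllables_sketch(text: str) -> list[str]:
    """Split text at assumed syllable-start positions (illustrative only)."""
    starts = [m.start() for m in _SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any leading material as its own chunk
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a != b]
```

For the example word above in its standard Unicode encoding, this rule yields the same four syllables, because the two consonants followed by asat are treated as syllable-final rather than as new starts.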
Layer 1: Syllable Validation
Purpose
Validate that each syllable is orthographically correct and exists in the dictionary.
Implementation
# Simplified pseudocode. Actual constructor:
# SyllableValidator(config, segmenter, repository, symspell, syllable_rule_validator)
class SyllableValidator:
    def __init__(self, config, segmenter, repository, symspell, syllable_rule_validator):
        self.segmenter = segmenter  # used by validate() below
        self.repository = repository
        self.rule_validator = syllable_rule_validator

    def validate(self, text: str) -> list[Error]:
        errors = []
        syllables = self.segmenter.segment_syllables(text)
        for syllable in syllables:
            # Check structure rules
            if self.rule_validator and not self.rule_validator.validate(syllable):
                errors.append(SyllableError(
                    text=syllable,
                    position=0,  # placeholder offset in this simplified listing
                    suggestions=[],
                ))
                continue
            # Check dictionary
            if not self.repository.is_valid_syllable(syllable):
                errors.append(SyllableError(
                    text=syllable,
                    position=0,  # placeholder offset in this simplified listing
                    suggestions=[],
                ))
        return errors
Rule-Based Validation
Syllable structure rules from SyllableRuleValidator:
class SyllableRuleValidator:
    """Validates Myanmar syllable structure."""
    # Valid consonants
    CONSONANTS = set("ကခဂဃငစဆဇဈညဋဌဍဎဏတထဒဓနပဖဗဘမယရလဝသဟဠအ")
    # Valid medials
    MEDIALS = {"ျ", "ြ", "ွ", "ှ"}
    # Valid vowels (the last two entries are multi-character sequences)
    VOWELS = {"ါ", "ာ", "ိ", "ီ", "ု", "ူ", "ေ", "ဲ", "ော", "ော်"}

    def validate(self, syllable: str) -> bool:
        """Check if syllable follows valid structure."""
        # Must start with a consonant (this simplified check does not
        # cover syllables that begin with an independent vowel)
        if not syllable or syllable[0] not in self.CONSONANTS:
            return False
        # Check medial order
        if not self._check_medial_order(syllable):
            return False
        # Check for invalid combinations
        if self._has_invalid_combination(syllable):
            return False
        return True
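The listing above calls `_check_medial_order` without showing it. One plausible way to write such a check, assuming the rule is the canonical Unicode storage order for Myanmar medials (ya U+103B, ra U+103C, wa U+103D, ha U+103E), is sketched below. The function name and the exact rule are assumptions; the library's helper may apply further constraints.

```python
# Canonical storage order of the four medials: ya, ra, wa, ha.
MEDIAL_ORDER = "\u103b\u103c\u103d\u103e"


def check_medial_order_sketch(syllable: str) -> bool:
    """Return True if the medials appear in strictly ascending canonical order."""
    ranks = [MEDIAL_ORDER.index(ch) for ch in syllable if ch in MEDIAL_ORDER]
    # Strictly increasing ranks rule out both reordering and repetition.
    return all(a < b for a, b in zip(ranks, ranks[1:]))
```

A syllable with no medials, or with a single medial, passes trivially, so this check only rejects syllables whose medials are swapped or duplicated.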
Dictionary Lookup
# Indexed lookup against the SQLite-backed provider
is_valid = provider.is_valid_syllable("မြန်") # True
is_valid = provider.is_valid_syllable("xyz") # False
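To make the lookup concrete, here is a self-contained sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `syllables`, its single-column schema, and the helper name are hypothetical and chosen only for illustration; the library's actual schema is not shown on this page.

```python
import sqlite3


def build_toy_db(syllables: list[str]) -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `syllables` table."""
    conn = sqlite3.connect(":memory:")
    # PRIMARY KEY gives the column an index, so membership tests avoid a scan.
    conn.execute("CREATE TABLE syllables (syllable TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO syllables (syllable) VALUES (?)",
        [(s,) for s in syllables],
    )
    return conn


def is_valid_syllable_sketch(conn: sqlite3.Connection, syllable: str) -> bool:
    """Return True if the syllable is present in the toy table."""
    row = conn.execute(
        "SELECT 1 FROM syllables WHERE syllable = ? LIMIT 1", (syllable,)
    ).fetchone()
    return row is not None
```

The parameterized query keeps arbitrary input text out of the SQL string, which matters when the values being checked come straight from user documents.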
Error Coverage
- Typos: ~90% caught at this layer
- Invalid characters: 100% caught
- Structural errors: 100% caught
Post-Normalization Detectors
Purpose
Between syllable validation and word validation, 38 ordered detectors run unconditionally on the normalized text. These catch character-level and particle-level errors that require normalized input but don’t depend on word segmentation.
Implementation
The detectors are defined in core/detection_registry.py as an ordered sequence (POST_NORM_DETECTOR_SEQUENCE). Each entry maps to a _detect_* method inherited from detector mixins (PostNormalizationDetectorsMixin, SentenceDetectorsMixin, CollocationDetectionMixin, etc.):
from myspellchecker.core.detection_registry import POST_NORM_DETECTOR_SEQUENCE

# In SpellChecker._run_validation_layers():
for entry in POST_NORM_DETECTOR_SEQUENCE:
    getattr(self, entry.method_name)(normalized_text, errors)
Detector Categories
The 38 detectors are grouped by category:
| Category | Detectors | Examples |
|---|---|---|
| Stacking and structural | 5 | Broken stacking, missing asat, missing visarga |
| Medial and particle confusion | 5 | Medial ya-pin/ya-yit, particle confusion, compound confusion |
| Token repair and frequency | 4 | Invalid token repair, frequency-dominant variants, broken compound morpheme |
| Particle and diacritic | 5 | Ha-htoe particle typos, aukmyit confusion, particle misuse |
| Context-aware | 4 | Homophone left-context, collocation errors, semantic agent implausibility |
| Sentence-level | 7 | Dangling particles, tense mismatch, negation mismatch, missing visarga |
| Register and style | 3 | Register mixing, informal with honorific |
| Post-processing | 5 | Vowel after asat, missing diacritic, unknown compound segments, broken compound space, punctuation errors |
Ordering is intentional. For example, _detect_broken_stacking must run before _detect_colloquial_contractions to prevent stacking errors from being claimed as colloquial variants. See the Component Diagram for the full detector registry.
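The effect of ordering can be shown with a toy registry. Everything in this sketch is hypothetical: the two detectors, the `(start, end, label)` error tuples, and the idea that a later detector skips spans already claimed by an earlier one are assumptions used to illustrate why sequence matters, not a description of the real detectors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorEntry:
    method_name: str  # mirrors the `entry.method_name` lookup shown above


# Order matters: the more specific detector is listed first.
TOY_SEQUENCE = (DetectorEntry("_detect_double"), DetectorEntry("_detect_single"))


class ToyChecker:
    @staticmethod
    def _claimed(errors, start, end):
        # True if any recorded error overlaps the half-open span [start, end).
        return any(s < end and start < e for s, e, _ in errors)

    def _detect_double(self, text, errors):
        idx = text.find("xx")
        if idx != -1:
            errors.append((idx, idx + 2, "double"))

    def _detect_single(self, text, errors):
        idx = text.find("x")
        if idx != -1 and not self._claimed(errors, idx, idx + 1):
            errors.append((idx, idx + 1, "single"))

    def run(self, text):
        errors = []
        for entry in TOY_SEQUENCE:
            getattr(self, entry.method_name)(text, errors)
        return errors
```

If the two entries were swapped, the general detector would record its error first and the same text would carry two overlapping reports instead of one.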
Layer 2: Word Validation
Purpose
Verify that valid syllables form valid words, and provide suggestions for unknown words.
Word Assembly
Valid syllables are assembled into words using a longest-match algorithm
within the word validation layer. There is no separate WordAssembler class;
the assembly logic is integrated into the segmentation and validation pipeline:
# Conceptual algorithm (embedded in WordValidator and segmenters):
# 1. Start with all syllables
# 2. Find longest dictionary match
# 3. Record as word
# 4. Continue with remaining syllables
words = []
i = 0
while i < len(syllables):
    # Try the longest candidate first, then shrink one syllable at a time
    for length in range(len(syllables) - i, 0, -1):
        candidate = "".join(syllables[i:i+length])
        if provider.is_valid_word(candidate):
            words.append(candidate)
            i += length
            break
    else:
        # No dictionary match at any length: emit the single syllable
        words.append(syllables[i])
        i += 1
Validation Steps
For unknown words, Layer 2 performs multiple checks before generating an error:
# Simplified pseudocode; the actual constructor also takes
# reduplication_engine and compound_resolver
class WordValidator:
    def validate(self, text: str) -> list[Error]:
        errors = []
        words = self.segmenter.segment_words(text)
        for word in words:
            # Step 1: Dictionary lookup
            if self.word_repository.is_valid_word(word):
                continue
            # Step 2: SymSpell compound check (edit distance 0)
            if self._is_valid_compound(word):
                continue
            # Step 3: Reduplication validation
            # Checks AA, AABB, ABAB, RHYME patterns
            if self._is_valid_reduplication(word):
                continue
            # Step 4: Compound synthesis via DP
            # Splits into N+N, V+V, N+V, V+N, ADJ+N patterns
            if self._is_valid_compound_synthesis(word):
                continue
            # Step 5: Generate suggestions (incl. morpheme-level correction)
            # `context` stands for the surrounding-sentence context
            # (its construction is omitted in this simplified listing)
            suggestions = self.suggestion_strategy.suggest(word, context)
            errors.append(WordError(text=word, suggestions=suggestions))
        return errors
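Step 4 accepts unknown words that decompose into known parts with an allowed part-of-speech pattern. The sketch below is a deliberately reduced version that only tries two-part splits rather than a full dynamic-programming search; the function name, the `pos_lexicon` mapping, and the tag strings are assumptions for illustration.

```python
# POS pairs listed in the Step 4 comment above.
ALLOWED_PATTERNS = {("N", "N"), ("V", "V"), ("N", "V"), ("V", "N"), ("ADJ", "N")}


def find_two_part_compound(syllables, pos_lexicon):
    """Return (left, right) for the first split whose POS pair is allowed.

    `pos_lexicon` is a hypothetical dict mapping known words to a POS tag.
    Returns None when no split point yields an allowed pattern.
    """
    for cut in range(1, len(syllables)):
        left = "".join(syllables[:cut])
        right = "".join(syllables[cut:])
        pair = (pos_lexicon.get(left), pos_lexicon.get(right))
        if pair in ALLOWED_PATTERNS:
            return left, right
    return None
```

Because an unknown half maps to `None`, a pair such as `("N", None)` can never match an allowed pattern, so both halves must be dictionary words.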
Error Coverage
- Unknown words: 100% detected
- Compound errors: ~80% with suggestions
- Near-misses: ~95% with correct suggestion
- Productive compounds: Accepted without error (N+N, V+V, etc.)
- Productive reduplications: Accepted without error (AA, AABB, ABAB)
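The reduplication patterns named above can be recognized with a few equality checks. This sketch assumes the patterns are defined over syllables and covers only AA, AABB, and ABAB; the RHYME pattern is omitted because its definition is not given here, and the library's reduplication engine may work differently.

```python
def reduplication_pattern(syllables):
    """Return "AA", "AABB", "ABAB", or None for a list of syllables."""
    if len(syllables) == 2 and syllables[0] == syllables[1]:
        return "AA"
    if len(syllables) == 4:
        a, b, c, d = syllables
        if a == b and c == d and a != c:
            return "AABB"
        if a == c and b == d and a != b:
            return "ABAB"
    return None
```

In a validator, a non-`None` result would typically be combined with a dictionary check on the base syllables, so that a repeated misspelling is not accepted merely for being repeated.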
Layer 2.5: Grammar Checking
Purpose
Validate syntactic correctness using POS tags and grammar rules.
Implementation
Grammar checking is implemented through the SyntacticRuleChecker engine and
validation strategies within ContextValidator. There is no separate GrammarChecker
validator class; grammar rules are applied as part of the context validation
pipeline.
from myspellchecker.grammar.engine import SyntacticRuleChecker
# Grammar checking via SyntacticRuleChecker
rule_checker = SyntacticRuleChecker(provider)
# Check word sequence (POS tags are looked up internally from provider)
words = ["ကျောင်း", "သွား", "မှာ"]
errors = rule_checker.check_sequence(words)
for position, error_word, suggestion in errors:
    print(f"Position {position}: {error_word} -> {suggestion}")
Grammar Rules
Grammar rules are defined in YAML files (src/myspellchecker/rules/) and include:
- Subject particle must follow noun
- Object particle must follow noun/pronoun
- Sentence should end with final particle
- Question should have question marker
- Aspect markers must follow verbs
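Three of these rules share one shape: a given tag is only valid after certain preceding tags. The sketch below encodes that shape as a lookup table. The tag names (`SUBJ_PART`, `OBJ_PART`, `ASPECT`, `N`, `PRON`, `V`) and the function are hypothetical; the real rules live in the YAML files and are not reproduced here.

```python
# Hypothetical tag -> set of tags that may legally precede it.
REQUIRED_PREVIOUS = {
    "SUBJ_PART": {"N"},
    "OBJ_PART": {"N", "PRON"},
    "ASPECT": {"V"},
}


def check_preceding_tag_rules(tagged_words):
    """Return (position, word, message) for each rule violation.

    `tagged_words` is a list of (word, pos_tag) pairs.
    """
    violations = []
    for i, (word, tag) in enumerate(tagged_words):
        allowed = REQUIRED_PREVIOUS.get(tag)
        if allowed is None:
            continue  # no rule constrains this tag
        previous = tagged_words[i - 1][1] if i > 0 else None
        if previous not in allowed:
            expected = "/".join(sorted(allowed))
            violations.append((i, word, f"{tag} must follow {expected}"))
    return violations
```

Keeping the rules in a table rather than in branching code mirrors the idea of defining them in data files: adding a rule means adding an entry, not editing the checker.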
Specialized Checkers
The grammar system includes specialized checkers in src/myspellchecker/grammar/checkers/:
- AspectChecker: Validates aspect marker usage
- ClassifierChecker: Validates classifier usage
- CompoundChecker: Validates compound words
- MergedWordChecker: Detects incorrectly merged particle+verb sequences
- NegationChecker: Validates negation patterns
- RegisterChecker: Validates formal/informal register consistency
Error Coverage
- Particle errors: ~90% detected
- Verb agreement: ~85% detected
- Structure errors: ~80% detected
Layer 3: Context Validation
Purpose
Detect real-word errors where words are spelled correctly but used incorrectly.
N-gram Analysis
class ContextValidator(Validator):
    # Strategy-based orchestrator: coordinates validation strategies
    def __init__(self, config, segmenter, strategies=None, name_heuristic=None):
        super().__init__(config)
        self.segmenter = segmenter
        self.strategies = strategies or []  # List[ValidationStrategy]

    def validate(self, text: str) -> list[Error]:
        errors = []
        # `context` is the validation context built from `text`
        # (its construction is omitted in this simplified listing).
        # Execute strategies in priority order:
        # - ToneValidationStrategy (10) - Tone mark disambiguation
        # - OrthographyValidationStrategy (15) - Medial order checks
        # - SyntacticValidationStrategy (20) - Grammar rules
        # - BrokenCompoundStrategy (25) - Wrongly split compounds
        # - POSSequenceValidationStrategy (30) - POS patterns
        # - QuestionStructureValidationStrategy (40) - Question structure
        # - HomophoneValidationStrategy (45) - Homophone detection
        # - ConfusableSemanticStrategy (48) - MLM-enhanced confusables (opt-in)
        # - NgramContextValidationStrategy (50) - Statistical context
        # - SemanticValidationStrategy (70) - AI-powered (optional)
        for strategy in sorted(self.strategies, key=lambda s: s.priority()):
            strategy_errors = strategy.validate(context)
            errors.extend(strategy_errors)
        return errors
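The priority-ordered dispatch can be tried in isolation with stand-in strategies. `ToyStrategy` and `run_strategies` below are hypothetical and only mimic the two methods used in the listing above (`priority()` and `validate()`); the real strategy classes take richer context objects.

```python
class ToyStrategy:
    """Stand-in strategy that flags one fixed word."""

    def __init__(self, name, priority, flagged_word):
        self.name = name
        self._priority = priority
        self.flagged_word = flagged_word

    def priority(self):
        return self._priority  # lower numbers run first

    def validate(self, words):
        return [(i, w, self.name) for i, w in enumerate(words) if w == self.flagged_word]


def run_strategies(strategies, words):
    """Run strategies in ascending priority and concatenate their errors."""
    errors = []
    for strategy in sorted(strategies, key=lambda s: s.priority()):
        errors.extend(strategy.validate(words))
    return errors
```

Since `sorted` is stable, strategies that share a priority keep their registration order, which makes the overall sequence predictable.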
Semantic Verification
For ambiguous cases, semantic checking provides deeper analysis:
# N-gram says "သွား" is unlikely after "ထမင်း"
# Semantic checker confirms: "စား" (0.85) >> "သွား" (0.03)
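The statistical half of this comparison comes down to a conditional bigram probability, P(word | previous) = count(previous, word) / count(previous). The sketch below uses made-up toy counts and English placeholder tokens purely for illustration; it has no smoothing and is not the library's n-gram model.

```python
from collections import Counter


def bigram_probability(bigrams, unigrams, previous, word):
    """Maximum-likelihood estimate of P(word | previous), without smoothing."""
    if unigrams[previous] == 0:
        return 0.0  # unseen history: no estimate available
    return bigrams[(previous, word)] / unigrams[previous]


def rank_candidates(bigrams, unigrams, previous, candidates):
    """Sort candidate words from most to least likely after `previous`."""
    return sorted(
        candidates,
        key=lambda w: bigram_probability(bigrams, unigrams, previous, w),
        reverse=True,
    )


# Toy counts, invented for this example only.
TOY_UNIGRAMS = Counter({"rice": 20})
TOY_BIGRAMS = Counter({("rice", "eat"): 17, ("rice", "go"): 1})
```

A context layer would flag the observed word when some candidate scores far higher in the same position, then hand ambiguous cases to the semantic checker.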
Error Coverage
- Real-word errors: ~85% detected
- Context misuse: ~80% detected
- Homograph disambiguation: ~90% with semantic
Pipeline Configuration
Validation Levels
Validation level is specified per-check, not in configuration:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.constants import ValidationLevel
checker = SpellChecker()
# Fast: Layer 1 only (syllable validation)
result = checker.check(text, level=ValidationLevel.SYLLABLE)
# Standard: Layers 1-2 (syllable + word validation)
result = checker.check(text, level=ValidationLevel.WORD)
# Full: All layers (with context checking enabled in config)
config = SpellCheckerConfig(
    use_context_checker=True,  # Enable Layer 3 context validation
)
checker = SpellChecker(config=config)
result = checker.check(text, level=ValidationLevel.WORD)
Layer Enable/Disable
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.constants import ValidationLevel
config = SpellCheckerConfig(
    use_context_checker=True,        # Enable Layer 3 context checking
    use_rule_based_validation=True,  # Enable grammar rules
    # Semantic checking is disabled by default (no model_path configured)
)
checker = SpellChecker(config=config)
result = checker.check(text, level=ValidationLevel.WORD)
| Layer | Speed | Coverage |
|---|---|---|
| Syllable | Fast | ~90% of errors |
| Word | Moderate | 5-8% additional |
| Grammar | Moderate | Grammar-only |
| Context | Moderate | 5-10% additional |
| Semantic | Slow | Verification |
For measured end-to-end performance (F1 96.2% without semantic, 98.3% with semantic v2.3), see the benchmarks page.
Error Aggregation
All layer errors are combined into the final Response object. There is no
separate ResultAssembler class — error aggregation is handled directly by
SpellChecker when assembling results from each validation layer.
The aggregation process:
- Collect errors from each layer (syllable, word, grammar, context)
- Run suggestion reconstruction (_reconstruct_compound_suggestions, etc.)
- Deduplicate errors at the same position via _dedup_errors_by_position
- Deduplicate overlapping error spans via _dedup_errors_by_span
- Apply suppression filters (low-value errors, NER entities)
- Return a Response containing the filtered error list
# Error aggregation logic (embedded in SpellChecker._run_validation_layers):
# After all validation layers have appended errors to the shared list:
# Suggestion reconstruction + dedup pipeline
self._reconstruct_compound_suggestions(normalized_text, errors)
self._reconstruct_particle_compound_suggestions(normalized_text, errors)
self._inject_asat_visarga_candidates(normalized_text, errors)
self._reconstruct_morpheme_in_compound(normalized_text, errors)
# Remove duplicates — two complementary passes
self._dedup_errors_by_position(errors) # Same position → keep highest confidence
self._dedup_errors_by_span(errors) # Overlapping spans → keep most specific
# Suppress low-value errors and filter NER entities
self._suppress_low_value_syllable_errors(errors, text=normalized_text)
self._suppress_low_value_syntax_errors(errors, text=normalized_text)
self._filter_ner_entities(errors, normalized_text)
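The two deduplication passes can be sketched on a toy error type. Several things here are assumptions: the `ToyError` fields, the reading of "most specific" as "shortest span", and the choice to return new lists (the methods above appear to modify the shared list in place).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyError:
    start: int
    end: int  # exclusive
    confidence: float
    kind: str


def dedup_by_position(errors):
    """Keep the highest-confidence error for each start position."""
    best = {}
    for err in errors:
        current = best.get(err.start)
        if current is None or err.confidence > current.confidence:
            best[err.start] = err
    return sorted(best.values(), key=lambda e: e.start)


def dedup_by_span(errors):
    """Drop errors that overlap a shorter (assumed more specific) error."""
    kept = []
    for err in sorted(errors, key=lambda e: (e.end - e.start, e.start)):
        # Keep err only if it is disjoint from everything already kept.
        if all(err.end <= k.start or k.end <= err.start for k in kept):
            kept.append(err)
    return sorted(kept, key=lambda e: e.start)
```

Running the position pass first shrinks the list before the span pass compares every error with those already kept, which is why the two passes complement each other rather than duplicate work.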
Next Steps