Text flows through four validation layers in sequence — syllable structure, word lookup, grammar rules, and context probability — with each layer catching a different class of errors before passing clean tokens downstream. This page details every stage, from pre-processing through error aggregation.

Pipeline Overview

  +-------------------+
  | Input Text        |
  +---------+---------+
            |
            v
  +-------------------+
  | Text Normalization|
  +---------+---------+
            |
            v
  +-------------------+
  | Syllable          |
  | Segmentation      |
  +---------+---------+
            |
            v
  +-------------------+     +-------------------+
  | Layer 1: Syllable |---->| Syllable Errors   |
  | Validation        |     +-------------------+
  +---------+---------+
            |
            v
  +-------------------+     +-------------------+
  | Post-Normalization|---->| Detector Errors   |
  | Detectors (38)    |     +-------------------+
  +---------+---------+
            |
            v
  +-------------------+
  | Word Assembly     |
  +---------+---------+
            |
            v
  +-------------------+     +-------------------+
  | Layer 2: Word     |---->| Word Errors       |
  | Validation        |     +-------------------+
  +---------+---------+
            |
            v
  +-------------------+     +-------------------+
  | Layer 2.5:        |---->| Grammar Errors    |
  | Grammar Checking  |     +-------------------+
  +---------+---------+
            |
            v
  +-------------------+     +-------------------+
  | Layer 3: Context  |---->| Context Errors    |
  | Validation        |     +-------------------+
  +---------+---------+
            |
            v
  +-------------------+
  | Result Assembly   |
  +---------+---------+
            |
            v
  +-------------------+
  | Response          |
  +-------------------+

Pre-Processing

Text Normalization

Before validation, text is normalized:
from myspellchecker.text.normalize import normalize

normalized = normalize(text)
# - Remove zero-width characters
# - Normalize Unicode (NFC)
# - Handle Zawgyi detection/conversion
# - Normalize whitespace
Normalization steps:
  1. Zero-width removal: Remove invisible characters
  2. Unicode normalization: NFC form for consistent comparison
  3. Zawgyi handling: Detect and optionally convert legacy encoding
  4. Whitespace normalization: Consistent spacing
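These four steps can be approximated with the standard library. This is an illustrative sketch, not the actual normalize implementation; the regex, character list, and helper name are assumptions, and Zawgyi handling is omitted because it requires a dedicated detector:

```python
import re
import unicodedata

# Common zero-width characters (illustrative list, not exhaustive)
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")

def normalize_sketch(text: str) -> str:
    text = ZERO_WIDTH.sub("", text)            # 1. strip zero-width characters
    text = unicodedata.normalize("NFC", text)  # 2. canonical composition
    # 3. Zawgyi detection/conversion omitted (needs a detector model)
    text = re.sub(r"\s+", " ", text).strip()   # 4. collapse whitespace
    return text
```

Running the steps in this order matters: zero-width removal must precede NFC so invisible characters cannot block composition.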

Syllable Segmentation

Text is broken into syllables using Myanmar orthographic rules:
from myspellchecker.segmenters import DefaultSegmenter

segmenter = DefaultSegmenter()
syllables = segmenter.segment_syllables("မြန်မာနိုင်ငံ")
# ["မြန်", "မာ", "နိုင်", "ငံ"]
Segmentation uses:
  • Consonant boundaries: Detect syllable starts
  • Combining character rules: Group marks correctly
  • Stacking rules: Handle complex consonant clusters
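A heavily simplified, regex-based approximation of these rules is shown below. It is illustrative only, not the actual DefaultSegmenter, which also handles independent vowels, kinzi, digits, and punctuation. The core idea: break before a base consonant (U+1000–U+1021) unless it is stacked (preceded by virama U+1039) or killed (followed by asat U+103A):

```python
import re

def segment_syllables(text: str) -> list[str]:
    # Insert a break marker before each syllable-initial consonant
    marked = re.sub(
        r"(?<!\u1039)([\u1000-\u1021])(?![\u103a\u1039])", r"|\1", text
    )
    # Drop the empty leading segment produced by the first consonant
    return [s for s in marked.split("|") if s]
```

On the example from above, this already yields the expected boundaries, because the asat in မြန် and နိုင် suppresses a break after the killed consonant.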

Layer 1: Syllable Validation

Purpose

Validate that each syllable is orthographically correct and exists in the dictionary.

Implementation

# Simplified pseudocode — actual constructor:
# SyllableValidator(config, segmenter, repository, symspell, syllable_rule_validator)
class SyllableValidator:
    def __init__(self, config, segmenter, repository, symspell, syllable_rule_validator):
        self.config = config
        self.segmenter = segmenter
        self.repository = repository
        self.symspell = symspell
        self.rule_validator = syllable_rule_validator

    def validate(self, text: str) -> list[Error]:
        errors = []
        syllables = self.segmenter.segment_syllables(text)

        # Track character offset (simplified: assumes syllables tile the text)
        offset = 0
        for syllable in syllables:
            # Check structure rules
            if self.rule_validator and not self.rule_validator.validate(syllable):
                errors.append(SyllableError(
                    text=syllable,
                    position=offset,
                    suggestions=[],
                ))
                offset += len(syllable)
                continue

            # Check dictionary
            if not self.repository.is_valid_syllable(syllable):
                errors.append(SyllableError(
                    text=syllable,
                    position=offset,
                    suggestions=[],
                ))
            offset += len(syllable)

        return errors

Rule-Based Validation

Syllable structure rules from SyllableRuleValidator:
class SyllableRuleValidator:
    """Validates Myanmar syllable structure."""

    # Valid consonants
    CONSONANTS = set("ကခဂဃငစဆဇဈညဋဌဍဎဏတထဒဓနပဖဗဘမယရလဝသဟဠအ")

    # Valid medials
    MEDIALS = {"ျ", "ြ", "ွ", "ှ"}

    # Valid vowels
    VOWELS = {"ါ", "ာ", "ိ", "ီ", "ု", "ူ", "ေ", "ဲ", "ော", "ော်"}

    def validate(self, syllable: str) -> bool:
        """Check if syllable follows valid structure."""
        # Must start with consonant or independent vowel
        if not syllable or syllable[0] not in self.CONSONANTS:
            return False

        # Check medial order
        if not self._check_medial_order(syllable):
            return False

        # Check for invalid combinations
        if self._has_invalid_combination(syllable):
            return False

        return True
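The medial-order check referenced above can be sketched as follows. This is an illustration, not the actual _check_medial_order; it enforces the canonical ya-pin < ya-yit < wa-hswe < ha-htoe ordering and rejects repeated medials:

```python
# Canonical medial order: ya-pin (ျ) < ya-yit (ြ) < wa-hswe (ွ) < ha-htoe (ှ)
MEDIAL_ORDER = {"ျ": 0, "ြ": 1, "ွ": 2, "ှ": 3}

def check_medial_order(syllable: str) -> bool:
    """Return True when medials appear at most once each, in canonical order."""
    ranks = [MEDIAL_ORDER[ch] for ch in syllable if ch in MEDIAL_ORDER]
    return ranks == sorted(ranks) and len(ranks) == len(set(ranks))
```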

Dictionary Lookup

# O(1) lookup in SQLite
is_valid = provider.is_valid_syllable("မြန်")  # True
is_valid = provider.is_valid_syllable("xyz")   # False
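A minimal sketch of what such a repository could look like. The schema, class, and method bodies here are hypothetical; an indexed primary-key lookup is effectively constant-time for this workload:

```python
import sqlite3

class SyllableRepository:
    """Hypothetical minimal syllable store; the real schema may differ."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS syllables (text TEXT PRIMARY KEY)"
        )

    def add(self, syllable: str) -> None:
        self.db.execute(
            "INSERT OR IGNORE INTO syllables VALUES (?)", (syllable,)
        )

    def is_valid_syllable(self, syllable: str) -> bool:
        # Primary-key index makes this lookup effectively constant-time
        row = self.db.execute(
            "SELECT 1 FROM syllables WHERE text = ?", (syllable,)
        ).fetchone()
        return row is not None
```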

Error Coverage

  • Typos: ~90% caught at this layer
  • Invalid characters: 100% caught
  • Structural errors: 100% caught

Post-Normalization Detectors

Purpose

Between syllable validation and word validation, 38 ordered detectors run unconditionally on the normalized text. These catch character-level and particle-level errors that require normalized input but don’t depend on word segmentation.

Implementation

The detectors are defined in core/detection_registry.py as an ordered sequence (POST_NORM_DETECTOR_SEQUENCE). Each entry maps to a _detect_* method inherited from detector mixins (PostNormalizationDetectorsMixin, SentenceDetectorsMixin, CollocationDetectionMixin, etc.):
from myspellchecker.core.detection_registry import POST_NORM_DETECTOR_SEQUENCE

# In SpellChecker._run_validation_layers():
for entry in POST_NORM_DETECTOR_SEQUENCE:
    getattr(self, entry.method_name)(normalized_text, errors)

Detector Categories

The 38 detectors are grouped by category:
  Category                       Detectors  Examples
  Stacking and structural        5          Broken stacking, missing asat, missing visarga
  Medial and particle confusion  5          Medial ya-pin/ya-yit, particle confusion, compound confusion
  Token repair and frequency     4          Invalid token repair, frequency-dominant variants, broken compound morpheme
  Particle and diacritic         5          Ha-htoe particle typos, aukmyit confusion, particle misuse
  Context-aware                  4          Homophone left-context, collocation errors, semantic agent implausibility
  Sentence-level                 7          Dangling particles, tense mismatch, negation mismatch, missing visarga
  Register and style             3          Register mixing, informal with honorific
  Post-processing                5          Vowel after asat, missing diacritic, unknown compound segments, broken compound space, punctuation errors
Ordering is intentional. For example, _detect_broken_stacking must run before _detect_colloquial_contractions to prevent stacking errors from being claimed as colloquial variants. See the Component Diagram for the full detector registry.
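The claiming behavior this ordering relies on can be illustrated with a toy runner. Everything here is hypothetical (entry type, detector callables, span claiming); the real pipeline dispatches via getattr as shown above:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class DetectorEntry:
    name: str
    # detect(text) -> list of (start, end, message) findings
    detect: Callable[[str], list[tuple[int, int, str]]]

def run_detectors(sequence, text):
    errors, claimed = [], set()
    for entry in sequence:
        for start, end, message in entry.detect(text):
            # First detector to claim a span wins; later overlaps are dropped
            if claimed.isdisjoint(range(start, end)):
                errors.append((entry.name, start, end, message))
                claimed.update(range(start, end))
    return errors

# Both toy detectors flag span (0, 3); the earlier one claims it first.
stacking = DetectorEntry("broken_stacking", lambda t: [(0, 3, "broken stacking")])
colloquial = DetectorEntry("colloquial", lambda t: [(0, 3, "colloquial variant")])
result = run_detectors([stacking, colloquial], "abc")
# result == [("broken_stacking", 0, 3, "broken stacking")]
```

Reordering the sequence would flip the outcome, which is why the stacking detector must precede the colloquial one.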

Layer 2: Word Validation

Purpose

Verify that valid syllables form valid words, and provide suggestions for unknown words.

Word Assembly

Valid syllables are assembled into words using a longest-match algorithm within the word validation layer. There is no separate WordAssembler class; word assembly logic is integrated into the segmentation and validation pipeline:
# Conceptual algorithm (embedded in WordValidator and segmenters):
# 1. Start with all syllables
# 2. Find the longest dictionary match
# 3. Record it as a word
# 4. Continue with the remaining syllables
def assemble_words(syllables, is_valid_word):
    words = []
    i = 0
    while i < len(syllables):
        for length in range(len(syllables) - i, 0, -1):
            candidate = "".join(syllables[i:i + length])
            if is_valid_word(candidate):
                words.append(candidate)
                i += length
                break
        else:
            # No dictionary match at any length: keep the single syllable
            words.append(syllables[i])
            i += 1
    return words

# e.g. assemble_words(["မြန်", "မာ"], {"မြန်မာ"}.__contains__) -> ["မြန်မာ"]

Validation Steps

For unknown words, Layer 2 performs multiple checks before generating an error:
# Simplified pseudocode — actual constructor includes reduplication_engine, compound_resolver
class WordValidator:
    def validate(self, text: str) -> list[Error]:
        errors = []
        words = self.segmenter.segment_words(text)

        for word in words:
            # Step 1: Dictionary lookup
            if self.word_repository.is_valid_word(word):
                continue

            # Step 2: SymSpell compound check (edit distance 0)
            if self._is_valid_compound(word):
                continue

            # Step 3: Reduplication validation
            # (checks AA, AABB, ABAB, RHYME patterns)
            if self._is_valid_reduplication(word):
                continue

            # Step 4: Compound synthesis via DP
            # (splits into N+N, V+V, N+V, V+N, ADJ+N patterns)
            if self._is_valid_compound_synthesis(word):
                continue

            # Step 5: Generate suggestions (incl. morpheme-level correction)
            suggestions = self.suggestion_strategy.suggest(word, context)
            errors.append(WordError(text=word, suggestions=suggestions))

        return errors

Error Coverage

  • Unknown words: 100% detected
  • Compound errors: ~80% with suggestions
  • Near-misses: ~95% with correct suggestion
  • Productive compounds: Accepted without error (N+N, V+V, etc.)
  • Productive reduplications: Accepted without error (AA, AABB, ABAB)

Layer 2.5: Grammar Checking

Purpose

Validate syntactic correctness using POS tags and grammar rules.

Implementation

Grammar checking is implemented through the SyntacticRuleChecker engine and validation strategies within ContextValidator. There is no separate GrammarChecker validator class; instead, grammar rules are applied as part of the context validation pipeline.
from myspellchecker.grammar.engine import SyntacticRuleChecker

# Grammar checking via SyntacticRuleChecker
rule_checker = SyntacticRuleChecker(provider)

# Check word sequence (POS tags are looked up internally from provider)
words = ["ကျောင်း", "သွား", "မှာ"]
errors = rule_checker.check_sequence(words)

for position, error_word, suggestion in errors:
    print(f"Position {position}: {error_word} -> {suggestion}")

Grammar Rules

Grammar rules are defined in YAML files (src/myspellchecker/rules/) and include:
  • Subject particle must follow noun
  • Object particle must follow noun/pronoun
  • Sentence should end with final particle
  • Question should have question marker
  • Aspect markers must follow verbs
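One such rule, subject-particle placement, can be sketched in plain Python. The rule table, tag names, and function here are hypothetical illustrations, not the project's YAML schema or the real check_sequence:

```python
# Hypothetical POS-bigram rule: a subject particle must follow a noun/pronoun.
RULES = [
    # (pos_of_current_token, allowed_pos_of_previous_token, message)
    ("PART_SUBJ", {"NOUN", "PRON"}, "subject particle must follow a noun"),
]

def check_sequence(tagged):
    """tagged: list of (word, pos) pairs; returns (index, word, message) errors."""
    errors = []
    for i, (word, pos) in enumerate(tagged):
        for rule_pos, allowed_prev, message in RULES:
            if pos == rule_pos:
                prev = tagged[i - 1][1] if i > 0 else None
                if prev not in allowed_prev:
                    errors.append((i, word, message))
    return errors
```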

Specialized Checkers

The grammar system includes specialized checkers in src/myspellchecker/grammar/checkers/:
  • AspectChecker: Validates aspect marker usage
  • ClassifierChecker: Validates classifier usage
  • CompoundChecker: Validates compound words
  • MergedWordChecker: Detects incorrectly merged particle+verb sequences
  • NegationChecker: Validates negation patterns
  • RegisterChecker: Validates formal/informal register consistency

Error Coverage

  • Particle errors: ~90% detected
  • Verb agreement: ~85% detected
  • Structure errors: ~80% detected

Layer 3: Context Validation

Purpose

Detect real-word errors where words are spelled correctly but used incorrectly.

N-gram Analysis

class ContextValidator(Validator):
    # Strategy-based orchestrator — coordinates validation strategies
    def __init__(self, config, segmenter, strategies=None, name_heuristic=None):
        super().__init__(config)
        self.segmenter = segmenter
        self.strategies = strategies or []  # List[ValidationStrategy]

    def validate(self, text: str) -> list[Error]:
        errors = []
        context = self._build_context(text)  # context construction elided in this sketch
        # Execute strategies in priority order:
        # - ToneValidationStrategy (10) - Tone mark disambiguation
        # - OrthographyValidationStrategy (15) - Medial order checks
        # - SyntacticValidationStrategy (20) - Grammar rules
        # - BrokenCompoundStrategy (25) - Wrongly split compounds
        # - POSSequenceValidationStrategy (30) - POS patterns
        # - QuestionStructureValidationStrategy (40) - Question structure
        # - HomophoneValidationStrategy (45) - Homophone detection
        # - ConfusableSemanticStrategy (48) - MLM-enhanced confusables (opt-in)
        # - NgramContextValidationStrategy (50) - Statistical context
        # - SemanticValidationStrategy (70) - AI-powered (optional)
        for strategy in sorted(self.strategies, key=lambda s: s.priority()):
            strategy_errors = strategy.validate(context)
            errors.extend(strategy_errors)

        return errors

Semantic Verification

For ambiguous cases, semantic checking provides deeper analysis:
# N-gram says "သွား" is unlikely after "ထမင်း"
# Semantic checker confirms: "စား" (0.85) >> "သွား" (0.03)
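The comparison above can be made concrete with a toy bigram model. All counts below are invented for illustration; the real NgramContextValidationStrategy uses corpus statistics:

```python
# Toy bigram model; counts are invented for illustration only.
BIGRAMS = {("ထမင်း", "စား"): 850, ("ထမင်း", "သွား"): 3}
UNIGRAMS = {"ထမင်း": 1000}

def context_probability(prev: str, word: str) -> float:
    """P(word | prev) under the toy model."""
    return BIGRAMS.get((prev, word), 0) / UNIGRAMS.get(prev, 1)

def is_suspicious(prev: str, word: str, threshold: float = 0.01) -> bool:
    # Flag a real-word error candidate when the bigram probability is tiny
    return context_probability(prev, word) < threshold
```

Under this model, "စား" after "ထမင်း" scores 0.85 while "သွား" scores 0.003, so only the latter is flagged for semantic verification.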

Error Coverage

  • Real-word errors: ~85% detected
  • Context misuse: ~80% detected
  • Homograph disambiguation: ~90% with semantic

Pipeline Configuration

Validation Levels

Validation level is specified per-check, not in configuration:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.constants import ValidationLevel

checker = SpellChecker()

# Fast: Layer 1 only (syllable validation)
result = checker.check(text, level=ValidationLevel.SYLLABLE)

# Standard: Layers 1-2 (syllable + word validation)
result = checker.check(text, level=ValidationLevel.WORD)

# Full: All layers (with context checking enabled in config)
config = SpellCheckerConfig(
    use_context_checker=True,  # Enable Layer 3 context validation
)
checker = SpellChecker(config=config)
result = checker.check(text, level=ValidationLevel.WORD)

Layer Enable/Disable

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.constants import ValidationLevel

config = SpellCheckerConfig(
    use_context_checker=True,      # Enable Layer 3 context checking
    use_rule_based_validation=True, # Enable grammar rules
    # Semantic checking is disabled by default (no model_path configured)
)
checker = SpellChecker(config=config)
result = checker.check(text, level=ValidationLevel.WORD)

Performance Characteristics

  Layer     Speed     Coverage
  Syllable  Fast      ~90% of errors
  Word      Moderate  5-8% additional
  Grammar   Moderate  Grammar-only
  Context   Moderate  5-10% additional
  Semantic  Slow      Verification
For measured end-to-end performance (F1 96.2% without semantic, 98.3% with semantic v2.3), see the benchmarks page.

Error Aggregation

All layer errors are combined into the final Response object. There is no separate ResultAssembler class — error aggregation is handled directly by SpellChecker when assembling results from each validation layer. The aggregation process:
  1. Collect errors from each layer (syllable, word, grammar, context)
  2. Run suggestion reconstruction (_reconstruct_compound_suggestions, etc.)
  3. Deduplicate errors at the same position via _dedup_errors_by_position
  4. Deduplicate overlapping error spans via _dedup_errors_by_span
  5. Apply suppression filters (low-value errors, NER entities)
  6. Return a Response containing the filtered error list
# Error aggregation logic (embedded in SpellChecker._run_validation_layers):
# After all validation layers have appended errors to the shared list:

# Suggestion reconstruction + dedup pipeline
self._reconstruct_compound_suggestions(normalized_text, errors)
self._reconstruct_particle_compound_suggestions(normalized_text, errors)
self._inject_asat_visarga_candidates(normalized_text, errors)
self._reconstruct_morpheme_in_compound(normalized_text, errors)

# Remove duplicates — two complementary passes
self._dedup_errors_by_position(errors)   # Same position → keep highest confidence
self._dedup_errors_by_span(errors)       # Overlapping spans → keep most specific

# Suppress low-value errors and filter NER entities
self._suppress_low_value_syllable_errors(errors, text=normalized_text)
self._suppress_low_value_syntax_errors(errors, text=normalized_text)
self._filter_ner_entities(errors, normalized_text)
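The position-based dedup pass can be sketched as follows. Field names and the in-place mutation convention are illustrative assumptions; the actual _dedup_errors_by_position may differ:

```python
# Illustrative sketch: keep only the highest-confidence error at each
# start position, mutating the list in place like the real dedup passes.
def dedup_by_position(errors: list[dict]) -> None:
    best = {}
    for error in errors:
        current = best.get(error["position"])
        if current is None or error["confidence"] > current["confidence"]:
            best[error["position"]] = error
    errors[:] = sorted(best.values(), key=lambda e: e["position"])
```

The span-based pass is analogous but compares overlapping (start, end) ranges rather than exact positions, preferring the more specific span.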

Next Steps