Text flows through four validation layers in sequence — syllable structure, word lookup, grammar rules, and context probability — with each layer catching a different class of errors before passing clean tokens downstream. This page details every stage, from pre-processing through error aggregation.

Pipeline Overview

  +-------------------+
  | Input Text        |
  +---------+---------+
            |
            v
  +-------------------+
  | Text Normalization|
  +---------+---------+
            |
            v
  +-------------------+
  | Syllable          |
  | Segmentation      |
  +---------+---------+
            |
            v
  +-------------------+     +-------------------+
  | Layer 1: Syllable |---->| Syllable Errors   |
  | Validation        |     +-------------------+
  +---------+---------+
            |
            v
  +-------------------+
  | Word Assembly     |
  +---------+---------+
            |
            v
  +-------------------+     +-------------------+
  | Layer 2: Word     |---->| Word Errors       |
  | Validation        |     +-------------------+
  +---------+---------+
            |
            v
  +-------------------+     +-------------------+
  | Layer 2.5:        |---->| Grammar Errors    |
  | Grammar Checking  |     +-------------------+
  +---------+---------+
            |
            v
  +-------------------+     +-------------------+
  | Layer 3: Context  |---->| Context Errors    |
  | Validation        |     +-------------------+
  +---------+---------+
            |
            v
  +-------------------+
  | Result Assembly   |
  +---------+---------+
            |
            v
  +-------------------+
  | Response          |
  +-------------------+

Pre-Processing

Text Normalization

Before validation, text is normalized:
from myspellchecker.text.normalize import normalize

normalized = normalize(text)
# - Remove zero-width characters
# - Normalize Unicode (NFC)
# - Handle Zawgyi detection/conversion
# - Normalize whitespace
Normalization steps:
  1. Zero-width removal: Remove invisible characters
  2. Unicode normalization: NFC form for consistent comparison
  3. Zawgyi handling: Detect and optionally convert legacy encoding
  4. Whitespace normalization: Consistent spacing
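The four steps above can be sketched as a small standalone helper. This is illustrative only: normalize_sketch is not the library's normalize, and Zawgyi detection/conversion is omitted because it requires a trained detector.

```python
import re
import unicodedata

# Common zero-width characters found in copied Myanmar text
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def normalize_sketch(text: str) -> str:
    """Illustrative normalization: zero-width removal, NFC, whitespace."""
    # 1. Zero-width removal: drop invisible characters
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    # 2. Unicode normalization: NFC form for consistent comparison
    text = unicodedata.normalize("NFC", text)
    # 3. Zawgyi detection/conversion would run here (omitted in this sketch)
    # 4. Whitespace normalization: collapse runs, trim the ends
    return re.sub(r"\s+", " ", text).strip()
```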

Syllable Segmentation

Text is broken into syllables using Myanmar orthographic rules:
from myspellchecker.segmenters import DefaultSegmenter

segmenter = DefaultSegmenter()
syllables = segmenter.segment_syllables("မြန်မာနိုင်ငံ")
# ["မြန်", "မာ", "နိုင်", "ငံ"]
Segmentation uses:
  • Consonant boundaries: Detect syllable starts
  • Combining character rules: Group marks correctly
  • Stacking rules: Handle complex consonant clusters
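A minimal break-point heuristic illustrates the idea. This is a sketch, not DefaultSegmenter: it covers consonant boundaries, the killer mark, and stacking, but not every orthographic rule.

```python
# Heuristic syllable breaker (sketch): start a new syllable at each
# consonant unless it is stacked (preceded by virama U+1039) or killed
# (followed by asat U+103A, i.e. it is a syllable-final consonant).
CONSONANTS = set("ကခဂဃငစဆဇဈညဋဌဍဎဏတထဒဓနပဖဗဘမယရလဝသဟဠအ")
ASAT = "\u103a"   # ် killer mark
STACK = "\u1039"  # virama used for stacked consonants

def segment_syllables_sketch(text: str) -> list[str]:
    syllables, current = [], ""
    for i, ch in enumerate(text):
        killed = i + 1 < len(text) and text[i + 1] == ASAT
        stacked = i > 0 and text[i - 1] == STACK
        if ch in CONSONANTS and current and not killed and not stacked:
            syllables.append(current)
            current = ch
        else:
            current += ch
    if current:
        syllables.append(current)
    return syllables
```

On the example above, this heuristic produces the same four syllables as the library's segmenter.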

Layer 1: Syllable Validation

Purpose

Validate that each syllable is orthographically correct and exists in the dictionary.

Implementation

# Simplified pseudocode — actual constructor:
# SyllableValidator(config, segmenter, repository, symspell, syllable_rule_validator)
class SyllableValidator:
    def __init__(self, config, segmenter, repository, symspell, syllable_rule_validator):
        self.segmenter = segmenter
        self.repository = repository
        self.rule_validator = syllable_rule_validator

    def validate(self, text: str) -> list[Error]:
        errors = []
        syllables = self.segmenter.segment_syllables(text)

        for syllable in syllables:
            # Check structure rules
            if self.rule_validator and not self.rule_validator.validate(syllable):
                errors.append(SyllableError(
                    text=syllable,
                    position=0,
                    suggestions=[],
                ))
                continue

            # Check dictionary
            if not self.repository.is_valid_syllable(syllable):
                errors.append(SyllableError(
                    text=syllable,
                    position=0,
                    suggestions=[],
                ))

        return errors

Rule-Based Validation

Syllable structure rules from SyllableRuleValidator:
class SyllableRuleValidator:
    """Validates Myanmar syllable structure."""

    # Valid consonants
    CONSONANTS = set("ကခဂဃငစဆဇဈညဋဌဍဎဏတထဒဓနပဖဗဘမယရလဝသဟဠအ")

    # Valid medials
    MEDIALS = {"ျ", "ြ", "ွ", "ှ"}

    # Valid vowels
    VOWELS = {"ါ", "ာ", "ိ", "ီ", "ု", "ူ", "ေ", "ဲ", "ော", "ော်"}

    def validate(self, syllable: str) -> bool:
        """Check if syllable follows valid structure."""
        # Must start with a consonant (အ in CONSONANTS covers the
        # independent-vowel case)
        if not syllable or syllable[0] not in self.CONSONANTS:
            return False

        # Check medial order
        if not self._check_medial_order(syllable):
            return False

        # Check for invalid combinations
        if self._has_invalid_combination(syllable):
            return False

        return True

Dictionary Lookup

# O(1) lookup in SQLite
is_valid = provider.is_valid_syllable("မြန်")  # True
is_valid = provider.is_valid_syllable("xyz")   # False
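A minimal sketch of an index-backed lookup. The schema below is illustrative; the library's actual SQLite layout is not documented here.

```python
import sqlite3

# Toy in-memory dictionary; a PRIMARY KEY gives an index-backed,
# effectively constant-time lookup
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE syllables (text TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO syllables VALUES (?)", [("မြန်",), ("မာ",)])

def is_valid_syllable(syllable: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM syllables WHERE text = ?", (syllable,)
    ).fetchone()
    return row is not None
```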

Error Coverage

  • Typos: ~90% caught at this layer
  • Invalid characters: 100% caught
  • Structural errors: 100% caught

Layer 2: Word Validation

Purpose

Verify that valid syllables form valid words, and provide suggestions for unknown words.

Word Assembly

Valid syllables are assembled into words with a longest-match algorithm. There is no standalone WordAssembler class; the assembly logic is embedded in the segmentation and word-validation pipeline.
# Conceptual algorithm (embedded in WordValidator and segmenters):
# 1. Start with all syllables
# 2. Find longest dictionary match
# 3. Record as word
# 4. Continue with remaining syllables
words = []
i = 0
while i < len(syllables):
    for length in range(len(syllables) - i, 0, -1):
        candidate = "".join(syllables[i:i+length])
        if provider.is_valid_word(candidate):
            words.append(candidate)
            i += length
            break
    else:
        words.append(syllables[i])
        i += 1
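Wrapped into a runnable function with a two-word toy lexicon (illustrative; the real pipeline consults the SQLite dictionary via provider.is_valid_word):

```python
def assemble_words(syllables: list[str], lexicon: set[str]) -> list[str]:
    """Greedy longest-match: take the longest syllable run that forms a
    known word; fall back to the bare syllable when nothing matches."""
    words, i = [], 0
    while i < len(syllables):
        for length in range(len(syllables) - i, 0, -1):
            candidate = "".join(syllables[i:i + length])
            if candidate in lexicon:
                words.append(candidate)
                i += length
                break
        else:
            words.append(syllables[i])  # no match at any length
            i += 1
    return words

# Toy lexicon built from the example syllables ("Myanmar", "country")
lexicon = {"မြန်" + "မာ", "နိုင်" + "ငံ"}
```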

Validation Steps

For unknown words, Layer 2 performs multiple checks before generating an error:
# Simplified pseudocode — actual constructor includes reduplication_engine, compound_resolver
class WordValidator:
    def validate(self, text: str) -> list[Error]:
        errors = []
        words = self.segmenter.segment_words(text)

        for word in words:
            # Step 1: Dictionary lookup
            if self.word_repository.is_valid_word(word):
                continue

            # Step 2: SymSpell compound check (edit distance 0)
            if self._is_valid_compound(word):
                continue

            # Step 3: Reduplication validation (NEW)
            # Checks AA, AABB, ABAB, RHYME patterns
            if self._is_valid_reduplication(word):
                continue

            # Step 4: Compound synthesis via DP (NEW)
            # Splits into N+N, V+V, N+V, V+N, ADJ+N patterns
            if self._is_valid_compound_synthesis(word):
                continue

            # Step 5: Generate suggestions (incl. morpheme-level correction)
            suggestions = self.suggestion_strategy.suggest(word, context)
            errors.append(WordError(text=word, suggestions=suggestions))

        return errors

Error Coverage

  • Unknown words: 100% detected
  • Compound errors: ~80% with suggestions
  • Near-misses: ~95% with correct suggestion
  • Productive compounds: Accepted without error (N+N, V+V, etc.)
  • Productive reduplications: Accepted without error (AA, AABB, ABAB)

Layer 2.5: Grammar Checking

Purpose

Validate syntactic correctness using POS tags and grammar rules.

Implementation

Grammar checking is implemented through the SyntacticRuleChecker engine and validation strategies within ContextValidator. There is no separate GrammarChecker validator class; instead, grammar rules are applied as part of the context validation pipeline.
from myspellchecker.grammar.engine import SyntacticRuleChecker

# Grammar checking via SyntacticRuleChecker
rule_checker = SyntacticRuleChecker(provider)

# Check word sequence (POS tags are looked up internally from provider)
words = ["ကျောင်း", "သွား", "မှာ"]
errors = rule_checker.check_sequence(words)

for position, error_word, suggestion in errors:
    print(f"Position {position}: {error_word} -> {suggestion}")

Grammar Rules

Grammar rules are defined in YAML files (src/myspellchecker/rules/) and include:
  • Subject particle must follow noun
  • Object particle must follow noun/pronoun
  • Sentence should end with final particle
  • Question should have question marker
  • Aspect markers must follow verbs
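The first rule above, sketched as code. The toy POS lexicon and function are illustrative; the real rules are data-driven YAML entries, not hard-coded Python.

```python
# Toy POS lexicon: "school" (noun), "go" (verb), "သည်" (subject particle)
POS = {"ကျောင်း": "NOUN", "သွား": "VERB", "သည်": "SUBJ_PART"}

def check_subject_particle(words: list[str]) -> list[tuple[int, str]]:
    """Flag subject particles that do not follow a noun."""
    errors = []
    for i, word in enumerate(words):
        if POS.get(word) == "SUBJ_PART":
            prev_pos = POS.get(words[i - 1]) if i > 0 else None
            if prev_pos != "NOUN":
                errors.append((i, word))
    return errors
```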

Specialized Checkers

The grammar system includes specialized checkers in src/myspellchecker/grammar/checkers/:
  • AspectChecker: Validates aspect marker usage
  • ClassifierChecker: Validates classifier usage
  • CompoundChecker: Validates compound words
  • NegationChecker: Validates negation patterns
  • RegisterChecker: Validates formal/informal register consistency

Error Coverage

  • Particle errors: ~90% detected
  • Verb agreement: ~85% detected
  • Structure errors: ~80% detected

Layer 3: Context Validation

Purpose

Detect real-word errors where words are spelled correctly but used incorrectly.

N-gram Analysis

class ContextValidator(Validator):
    # Strategy-based orchestrator — coordinates validation strategies
    def __init__(self, config, segmenter, strategies=None, name_heuristic=None):
        super().__init__(config)
        self.segmenter = segmenter
        self.strategies = strategies or []  # List[ValidationStrategy]

    def validate(self, text: str) -> list[Error]:
        errors = []
        # Build the shared context handed to each strategy
        # (construction details omitted in this simplified sketch)
        context = self._build_context(text)
        # Execute strategies in priority order:
        # - ToneValidationStrategy (10) - Tone mark disambiguation
        # - OrthographyValidationStrategy (15) - Medial order checks
        # - SyntacticValidationStrategy (20) - Grammar rules
        # - POSSequenceValidationStrategy (30) - POS patterns
        # - QuestionStructureValidationStrategy (40) - Question structure
        # - HomophoneValidationStrategy (45) - Homophone detection
        # - NgramContextValidationStrategy (50) - Statistical context
        # - ErrorDetectionStrategy (65) - AI token classification (ONNX)
        # - SemanticValidationStrategy (70) - AI-powered (optional)
        for strategy in sorted(self.strategies, key=lambda s: s.priority()):
            strategy_errors = strategy.validate(context)
            errors.extend(strategy_errors)

        return errors

Semantic Verification

For ambiguous cases, semantic checking provides deeper analysis:
# N-gram says "သွား" is unlikely after "ထမင်း"
# Semantic checker confirms: "စား" (0.85) >> "သွား" (0.03)

Error Coverage

  • Real-word errors: ~85% detected
  • Context misuse: ~80% detected
  • Homograph disambiguation: ~90% with semantic

Pipeline Configuration

Validation Levels

The validation level is specified per check call rather than in the configuration:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.constants import ValidationLevel

checker = SpellChecker()

# Fast: Layer 1 only (syllable validation)
result = checker.check(text, level=ValidationLevel.SYLLABLE)

# Standard: Layers 1-2 (syllable + word validation)
result = checker.check(text, level=ValidationLevel.WORD)

# Full: All layers (with context checking enabled in config)
config = SpellCheckerConfig(
    use_context_checker=True,  # Enable Layer 3 context validation
)
checker = SpellChecker(config=config)
result = checker.check(text, level=ValidationLevel.WORD)

Layer Enable/Disable

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.constants import ValidationLevel

config = SpellCheckerConfig(
    use_context_checker=True,      # Enable Layer 3 context checking
    use_rule_based_validation=True, # Enable grammar rules
    # Semantic checking is disabled by default (no model_path configured)
)
checker = SpellChecker(config=config)
result = checker.check(text, level=ValidationLevel.WORD)

Performance Characteristics

  Layer      Time     Accuracy   Coverage
  ---------  -------  ---------  ------------
  Syllable   <10ms    95%        90%
  Word       ~50ms    98%        5-8%
  Grammar    ~50ms    90%        Grammar-only
  Context    ~100ms   85%        5-10%
  Semantic   ~200ms   95%        Verification

Error Aggregation

All layer errors are combined into the final Response object. There is no separate ResultAssembler class — error aggregation is handled directly by SpellChecker when assembling results from each validation layer. The aggregation process:
  1. Collect errors from each layer (syllable, word, grammar, context)
  2. Sort by position in the original text
  3. Deduplicate errors at the same position
  4. Return a Response containing the merged error list
# Error aggregation logic (embedded in SpellChecker, not a standalone class):
def assemble(
    self,
    syllable_errors: list[SyllableError],
    word_errors: list[WordError],
    grammar_errors: list[GrammarError],
    context_errors: list[ContextError],
) -> Response:
    all_errors = []

    # Add errors from every layer (all inherit from the Error base class)
    all_errors.extend(syllable_errors)
    all_errors.extend(word_errors)
    all_errors.extend(grammar_errors)
    all_errors.extend(context_errors)

    # Sort by position in the original text
    all_errors.sort(key=lambda e: e.position)

    # Remove duplicates at the same position
    unique_errors = self.deduplicate(all_errors)

    return Response(errors=unique_errors)
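The deduplication step keeps one error per text position; a minimal version is shown below. The Err namedtuple is a stand-in for the Error base class, assuming only that errors expose a position attribute.

```python
from collections import namedtuple

Err = namedtuple("Err", "position text")  # stand-in for the Error base class

def deduplicate(errors):
    """Keep the first error reported at each text position."""
    seen, unique = set(), []
    for error in errors:
        if error.position not in seen:
            seen.add(error.position)
            unique.append(error)
    return unique
```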

Next Steps