Pipeline Overview
Pre-Processing
Text Normalization
Before validation, text is normalized:
- Zero-width removal: Remove invisible characters
- Unicode normalization: NFC form for consistent comparison
- Zawgyi handling: Detect and optionally convert legacy encoding
- Whitespace normalization: Consistent spacing
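The first, second, and fourth steps above can be sketched in a few lines (a minimal illustration; the function and constant names are assumptions, not the project's actual API). Zawgyi detection is omitted here because it requires a trained detector such as Google's myanmar-tools:

```python
import re
import unicodedata

# ZWSP, ZWNJ, ZWJ, and BOM are invisible and break string comparison
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\ufeff]")

def normalize(text: str) -> str:
    text = ZERO_WIDTH.sub("", text)            # zero-width removal
    text = unicodedata.normalize("NFC", text)  # canonical composition (NFC)
    text = re.sub(r"\s+", " ", text).strip()   # whitespace normalization
    return text
```

NFC matters because a base letter plus a combining mark and its precomposed form must compare equal before dictionary lookup.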
Syllable Segmentation
Text is broken into syllables using Myanmar orthographic rules:
- Consonant boundaries: Detect syllable starts
- Combining character rules: Group marks correctly
- Stacking rules: Handle complex consonant clusters
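A deliberately simplified sketch of the consonant-boundary rule (illustrative only; the real segmenter also handles kinzi, digits, independent vowels, and other cases): break before a Myanmar consonant unless it is stacked (preceded by U+1039) or is a syllable-final consonant (followed by asat U+103A).

```python
import re

# Zero-width boundary: a consonant (U+1000-U+1021) that is neither
# stacked (after U+1039) nor killed by a following asat (U+103A)
BOUNDARY = re.compile(r"(?<!\u1039)(?=[\u1000-\u1021](?!\u103A))")

def syllables(text: str) -> list[str]:
    return [s for s in BOUNDARY.split(text) if s]
```

For example, this splits မြန်မာ ("Myanmar") into the two syllables မြန် and မာ, because the first န carries an asat and therefore does not start a new syllable.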
Layer 1: Syllable Validation
Purpose
Validate that each syllable is orthographically correct and exists in the dictionary.
Implementation
Rule-Based Validation
Syllable structure rules from SyllableRuleValidator:
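A structure check in this spirit can be sketched as a single pattern over the rough shape onset (+ optional stack) + medials + vowel signs + optional final + tone marks. This is a simplified assumption, not the validator's actual rule set:

```python
import re

# Hypothetical well-formedness pattern (the real SyllableRuleValidator
# applies many more rules and orderings than this sketch):
SYLLABLE = re.compile(
    r"^[\u1000-\u1021]"            # onset consonant
    r"(?:\u1039[\u1000-\u1021])?"  # optional stacked consonant
    r"[\u103B-\u103E]*"            # medials (ya-pin, ya-yit, wa, ha)
    r"[\u102B-\u1032]*"            # vowel signs
    r"(?:[\u1000-\u1021]\u103A)?"  # optional final consonant + asat
    r"[\u1036-\u1038]*$"           # anusvara, aukmyit, visarga
)

def is_well_formed(syllable: str) -> bool:
    return bool(SYLLABLE.match(syllable))
```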
Dictionary Lookup
Error Coverage
- Typos: ~90% caught at this layer
- Invalid characters: 100% caught
- Structural errors: 100% caught
Post-Normalization Detectors
Purpose
Between syllable validation and word validation, 38 ordered detectors run unconditionally on the normalized text. These catch character-level and particle-level errors that require normalized input but don’t depend on word segmentation.
Implementation
The detectors are defined in core/detection_registry.py as an ordered sequence (POST_NORM_DETECTOR_SEQUENCE). Each entry maps to a _detect_* method inherited from detector mixins (PostNormalizationDetectorsMixin, SentenceDetectorsMixin, CollocationDetectionMixin, etc.):
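The registry-plus-dispatch shape could look roughly like this (stub detectors; only _detect_broken_stacking and _detect_colloquial_contractions are named in this document, the middle entry is an illustrative placeholder):

```python
# Ordered registry of detector method names; order encodes priority
POST_NORM_DETECTOR_SEQUENCE = (
    "_detect_broken_stacking",
    "_detect_missing_asat",          # illustrative placeholder name
    "_detect_colloquial_contractions",
)

class Checker:
    # Stub detectors standing in for the mixin-provided methods
    def _detect_broken_stacking(self, text):
        return []

    def _detect_missing_asat(self, text):
        return []

    def _detect_colloquial_contractions(self, text):
        return []

    def run_post_norm_detectors(self, text):
        """Dispatch each registered _detect_* method by name, in order."""
        errors = []
        for name in POST_NORM_DETECTOR_SEQUENCE:
            errors.extend(getattr(self, name)(text))
        return errors
```

Keeping the sequence as data rather than hard-coded calls makes the ordering auditable and easy to reorder.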
Detector Categories
The 38 detectors are grouped by category:

| Category | Detectors | Examples |
|---|---|---|
| Stacking and structural | 5 | Broken stacking, missing asat, missing visarga |
| Medial and particle confusion | 5 | Medial ya-pin/ya-yit, particle confusion, compound confusion |
| Token repair and frequency | 4 | Invalid token repair, frequency-dominant variants, broken compound morpheme |
| Particle and diacritic | 5 | Ha-htoe particle typos, aukmyit confusion, particle misuse |
| Context-aware | 4 | Homophone left-context, collocation errors, semantic agent implausibility |
| Sentence-level | 7 | Dangling particles, tense mismatch, negation mismatch, missing visarga |
| Register and style | 3 | Register mixing, informal with honorific |
| Post-processing | 5 | Vowel after asat, missing diacritic, unknown compound segments, broken compound space, punctuation errors |
Ordering is intentional. For example, _detect_broken_stacking must run before _detect_colloquial_contractions to prevent stacking errors from being claimed as colloquial variants. See the Component Diagram for the full detector registry.
Layer 2: Word Validation
Purpose
Verify that valid syllables form valid words, and provide suggestions for unknown words.
Word Assembly
Valid syllables are assembled into words using a longest-match algorithm. There is no separate WordAssembler class; word assembly is integrated into the segmentation and validation pipeline within the word validation layer:
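A minimal sketch of greedy longest-match assembly over validated syllables (the function name, dictionary shape, and maximum span are assumptions):

```python
def assemble_words(syllables, dictionary, max_span=4):
    """Greedily join syllables into the longest dictionary match."""
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first, shrinking until a match
        for n in range(min(max_span, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if candidate in dictionary or n == 1:
                words.append(candidate)  # lone syllables fall through as-is
                i += n
                break
    return words

# e.g. assemble_words(["a", "b", "c", "d"], {"ab", "abc"}) -> ["abc", "d"]
```

Single syllables that match no dictionary entry are kept as one-syllable tokens so the next step can decide whether they are unknown words.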
Validation Steps
For unknown words, Layer 2 performs multiple checks before generating an error.
Error Coverage
- Unknown words: 100% detected
- Compound errors: ~80% with suggestions
- Near-misses: ~95% with correct suggestion
- Productive compounds: Accepted without error (N+N, V+V, etc.)
- Productive reduplications: Accepted without error (AA, AABB, ABAB)
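The acceptance rules for productive forms listed above could be sketched as a pre-error filter (every name, the part decomposition, and the POS map are illustrative assumptions):

```python
def classify_unknown(parts, dictionary, pos):
    """Decide whether an out-of-dictionary word is really an error.

    `parts` are the word's component sub-words; `pos` maps word -> tag.
    """
    # Productive reduplication (AA pattern): accept without error
    if len(parts) == 2 and parts[0] == parts[1]:
        return "ok-reduplication"
    # Productive compound (N+N or V+V of known words): accept without error
    if len(parts) == 2 and all(p in dictionary for p in parts):
        if pos.get(parts[0]) == pos.get(parts[1]) and pos.get(parts[0]) in {"N", "V"}:
            return "ok-productive-compound"
    return "error-unknown-word"
```

Only words that fail every acceptance rule proceed to suggestion generation.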
Layer 2.5: Grammar Checking
Purpose
Validate syntactic correctness using POS tags and grammar rules.
Implementation
Grammar checking is implemented through the SyntacticRuleChecker engine and validation strategies within ContextValidator. There is no separate GrammarChecker validator class; instead, grammar rules are applied as part of the context validation pipeline.
Grammar Rules
Grammar rules are defined in YAML files (src/myspellchecker/rules/) and include:
- Subject particle must follow noun
- Object particle must follow noun/pronoun
- Sentence should end with final particle
- Question should have question marker
- Aspect markers must follow verbs
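The first rule above might be expressed in YAML roughly as follows. This is a hypothetical fragment to show the flavor; the actual schema of the files in src/myspellchecker/rules/ may differ:

```yaml
- id: subject_particle_after_noun
  description: Subject particle must follow a noun
  match:
    pos: PARTICLE
    subtype: subject
  require:
    previous_pos: [NOUN, PRONOUN]
  severity: grammar
```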
Specialized Checkers
The grammar system includes specialized checkers in src/myspellchecker/grammar/checkers/:
- AspectChecker: Validates aspect marker usage
- ClassifierChecker: Validates classifier usage
- CompoundChecker: Validates compound words
- MergedWordChecker: Detects incorrectly merged particle+verb sequences
- NegationChecker: Validates negation patterns
- RegisterChecker: Validates formal/informal register consistency
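The checkers plausibly share a common interface; a sketch of that shape, using NegationChecker as the example (the base class, method signature, token representation, and rule are all assumptions):

```python
class BaseChecker:
    def check(self, tokens, pos_tags):
        """Return a list of (start, end, message) grammar errors."""
        raise NotImplementedError

class NegationChecker(BaseChecker):
    # Colloquial Burmese negation brackets the verb: "ma-" ... "bhu:"
    NEG_PREFIX = "\u1019"             # မ
    NEG_FINAL = "\u1018\u1030\u1038"  # ဘူး

    def check(self, tokens, pos_tags):
        errors = []
        for i, tok in enumerate(tokens):
            # A negation prefix with no closing particle later in the
            # sentence is flagged as an incomplete negation pattern
            if tok == self.NEG_PREFIX and self.NEG_FINAL not in tokens[i:]:
                errors.append((i, i + 1, "negation prefix without final particle"))
        return errors
```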
Error Coverage
- Particle errors: ~90% detected
- Verb agreement: ~85% detected
- Structure errors: ~80% detected
Layer 3: Context Validation
Purpose
Detect real-word errors where words are spelled correctly but used incorrectly.
N-gram Analysis
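An n-gram check of this kind scores how probable each word is in its local context and flags improbably rare transitions. A minimal add-alpha-smoothed sketch (count tables, threshold, and names are assumptions):

```python
import math

def bigram_logprob(prev, cur, bigrams, unigrams, vocab_size, alpha=1.0):
    """Add-alpha smoothed log P(cur | prev), illustrative only."""
    num = bigrams.get((prev, cur), 0) + alpha
    den = unigrams.get(prev, 0) + alpha * vocab_size
    return math.log(num / den)

def flag_real_word_errors(words, bigrams, unigrams, vocab_size, threshold=-10.0):
    # Flag positions whose incoming bigram is improbably rare
    flags = []
    for i, (prev, cur) in enumerate(zip(words, words[1:]), start=1):
        if bigram_logprob(prev, cur, bigrams, unigrams, vocab_size) < threshold:
            flags.append(i)
    return flags
```

Flagged positions are candidates only; confirming them is what the semantic verification step below is for.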
Semantic Verification
For ambiguous cases, semantic checking provides deeper analysis.
Error Coverage
- Real-word errors: ~85% detected
- Context misuse: ~80% detected
- Homograph disambiguation: ~90% with semantic
Pipeline Configuration
Validation Levels
Validation level is specified per-check, not in configuration.
Layer Enable/Disable
Performance Characteristics
| Layer | Speed | Coverage |
|---|---|---|
| Syllable | Fast | ~90% of errors |
| Word | Moderate | 5-8% additional |
| Grammar | Moderate | Grammar-only |
| Context | Moderate | 5-10% additional |
| Semantic | Slow | Verification |
For measured end-to-end performance (F1 96.2% without semantic, 98.3% with semantic v2.3), see the benchmarks page.
Error Aggregation
All layer errors are combined into the final Response object. There is no separate ResultAssembler class; error aggregation is handled directly by SpellChecker when assembling results from each validation layer.
The aggregation process:
- Collect errors from each layer (syllable, word, grammar, context)
- Run suggestion reconstruction (_reconstruct_compound_suggestions, etc.)
- Deduplicate errors at the same position via _dedup_errors_by_position
- Deduplicate overlapping error spans via _dedup_errors_by_span
- Apply suppression filters (low-value errors, NER entities)
- Return a Response containing the filtered error list
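The span-deduplication step can be sketched as dropping any error whose span is fully covered by an already-kept, longer overlapping error (the (start, end, message) tuple shape is an assumption; the real _dedup_errors_by_span may also weigh error priority):

```python
def dedup_by_span(errors):
    """Keep each position's widest error; drop spans nested inside a kept one."""
    kept = []
    # Sort by start, widest span first, so outer spans are considered first
    for e in sorted(errors, key=lambda e: (e[0], -(e[1] - e[0]))):
        if not any(k[0] <= e[0] and e[1] <= k[1] for k in kept):
            kept.append(e)
    return kept
```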
Next Steps
- Layer 1 Details - Syllable validation
- Layer 2 Details - SymSpell algorithm
- Layer 3 Details - N-gram context
- Performance Tuning - Optimization