The validation pipeline is composed of independent strategies, each targeting a specific error type, from tone mark disambiguation to AI-powered semantic analysis. Strategies execute in priority order and share context so later strategies can skip positions already flagged by earlier ones.

Overview

The validation pipeline processes text through multiple strategies, each checking for different error types:
Strategy | Priority | Purpose | Error Type
ToneValidationStrategy | 10 | Tone mark disambiguation | tone_ambiguity
OrthographyValidationStrategy | 15 | Medial order and compatibility | medial_order_error
SyntacticValidationStrategy | 20 | Grammar rule checking | syntax_error
StatisticalConfusableStrategy | 24 | Bigram-based confusable detection | confusable_error
BrokenCompoundStrategy | 25 | Wrongly split compound words | broken_compound
POSSequenceValidationStrategy | 30 | POS sequence validation | pos_sequence_error
QuestionStructureValidationStrategy | 40 | Question structure | question_structure
HomophoneValidationStrategy | 45 | Homophone detection | homophone_error
ConfusableCompoundClassifierStrategy | 47 | MLP-based confusable/compound detection (opt-in) | broken_compound
ConfusableSemanticStrategy | 48 | MLM-enhanced confusable detection (opt-in) | confusable_error
NgramContextValidationStrategy | 50 | N-gram probability | context_probability
SemanticValidationStrategy | 70 | AI-powered semantic (opt-in) | semantic_error
Lower priority values run first.

Fast-Path Exit

When enable_fast_path is True (the default), the pipeline uses a two-phase execution model:
  1. Structural phase (priority ≤ 25): Tone, Orthography, Syntactic, StatisticalConfusable, and BrokenCompound strategies always run.
  2. Contextual phase (priority > 25): POS sequence, Question, Homophone, Confusable, N-gram, and Semantic strategies only run if the structural phase found at least one error.
This dramatically reduces false positives on clean text — most sentences have no structural errors, and the contextual strategies are the primary source of false positives.
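The two-phase model above can be sketched as follows (a minimal illustration, not the library's actual implementation; the strategy objects here are stand-ins that only expose priority() and validate()):

```python
# Minimal sketch of the fast-path gate (illustrative, not the real pipeline).
FAST_PATH_CUTOFF = 25  # structural phase: priority <= 25

def run_pipeline(strategies, context, enable_fast_path=True):
    ordered = sorted(strategies, key=lambda s: s.priority())
    structural = [s for s in ordered if s.priority() <= FAST_PATH_CUTOFF]
    contextual = [s for s in ordered if s.priority() > FAST_PATH_CUTOFF]

    errors = []
    for strategy in structural:
        errors.extend(strategy.validate(context))

    # Contextual strategies run only when the structural phase found
    # at least one error, or when the fast path is disabled.
    if errors or not enable_fast_path:
        for strategy in contextual:
            errors.extend(strategy.validate(context))
    return errors
```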

Configuration

from myspellchecker.core.config import SpellCheckerConfig, ValidationConfig

# Default: fast-path enabled (lower FPR, may miss context-only errors)
config = SpellCheckerConfig()

# Disable fast-path for maximum recall
config = SpellCheckerConfig(
    validation=ValidationConfig(enable_fast_path=False)
)

Trade-offs

Setting | FPR | Recall | Use Case
enable_fast_path=True | Lower (~43% on clean text) | May miss context-only errors | Production, real-time checking
enable_fast_path=False | Higher | Full recall | Research, maximum accuracy
The fast-path cutoff is at priority 25 (after BrokenCompoundStrategy). Strategies at priority 30+ (POS sequence, homophone, confusable, n-gram, semantic) are skipped on structurally clean sentences. If you need full contextual validation on all input, set enable_fast_path=False.

ValidationContext

All strategies receive a shared ValidationContext containing sentence-level information:
from myspellchecker.core.validation_strategies.base import ValidationContext

context = ValidationContext(
    sentence="သူ သွား ကျောင်း",
    words=["သူ", "သွား", "ကျောင်း"],
    word_positions=[0, 6, 15],
    is_name_mask=[False, False, False],
    existing_errors={},  # Maps position -> error_type from previous strategies
    sentence_type="statement",  # statement, question, command
    pos_tags=["PRON", "V", "N"]  # POS tags if available
)

Context Attributes

Attribute | Type | Description
sentence | str | Full original sentence
words | List[str] | Tokenized words
word_positions | List[int] | Character position of each word
is_name_mask | List[bool] | True if word is a proper name
existing_errors | dict[int, str] | Maps word position to error_type from previous strategies
existing_suggestions | dict[int, list[str]] | Suggestions from the strategy that first flagged each position
existing_confidences | dict[int, float] | Confidence scores of first-flagged errors
sentence_type | str | Sentence type for context
pos_tags | List[str] | POS tags (if available)
full_text | str | The full text being checked (not just the sentence)
global_error_count | int | Tracks error count globally across sentences

Strategy Implementations

ToneValidationStrategy (Priority: 10)

Handles tone mark disambiguation using context. Accepts an optional provider for word frequency lookup to suppress ambiguous high-frequency forms.
from myspellchecker.core.validation_strategies.tone_strategy import ToneValidationStrategy
from myspellchecker.text.tone import ToneDisambiguator

disambiguator = ToneDisambiguator()
strategy = ToneValidationStrategy(
    tone_disambiguator=disambiguator,
    provider=provider,             # Optional: for frequency-based suppression
    confidence_threshold=0.5,      # Minimum confidence to report error
)

errors = strategy.validate(context)
Detection:
  • Missing tone marks (ငါ → ငါး in number context)
  • Wrong tone marks based on context
  • Ambiguous words resolved by surrounding words
Frequency-based suppression: When both the original word and the correction are high-frequency (above high_freq_threshold), the error is suppressed. This prevents false positives on grammatically ambiguous forms like သူ့ (possessive) vs သူ (subject) where both are valid.
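The suppression rule can be sketched like this (the helper name and the threshold value are illustrative assumptions; the real strategy reads frequencies from its provider and uses its configured high_freq_threshold):

```python
# Hypothetical sketch of frequency-based suppression; the threshold value
# here is illustrative, not the strategy's actual default.

def should_suppress(original_freq: int, correction_freq: int,
                    high_freq_threshold: int = 10000) -> bool:
    """Suppress a tone-ambiguity report when BOTH forms are high-frequency,
    i.e. both are plausibly valid grammatical forms."""
    return (original_freq >= high_freq_threshold
            and correction_freq >= high_freq_threshold)
```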

OrthographyValidationStrategy (Priority: 15)

Validates medial consonant ordering and compatibility (UTN #11 rules) at the word level. Uses a two-step check: medial order first, then compatibility. Accepts an optional provider for sorting suggestions by dictionary validity.
from myspellchecker.core.validation_strategies.orthography_strategy import OrthographyValidationStrategy

strategy = OrthographyValidationStrategy(
    provider=provider,    # Optional: sort suggestions by validity
    confidence=0.9,       # Default confidence for orthography errors
)
Detection:
  • Medial order errors: Incorrect medial consonant order (e.g., ွ before ျ), which generates stripped variant suggestions
  • Compatibility errors: Incompatible medial-consonant combinations with no suggestions, because the combination is invalid
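The medial-order half of the check can be sketched as a word-level scan (illustrative only; per UTN #11 the canonical order of Myanmar medials is ျ U+103B, then ြ U+103C, then ွ U+103D, then ှ U+103E):

```python
# Sketch of a medial-order check (illustrative, word-level only).
# Canonical order: ျ (U+103B) < ြ (U+103C) < ွ (U+103D) < ှ (U+103E).
MEDIAL_RANK = {"\u103B": 0, "\u103C": 1, "\u103D": 2, "\u103E": 3}

def medials_in_order(word: str) -> bool:
    ranks = [MEDIAL_RANK[c] for c in word if c in MEDIAL_RANK]
    return ranks == sorted(ranks)
```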

SyntacticValidationStrategy (Priority: 20)

Validates grammar rules and particle usage.
from myspellchecker.core.validation_strategies.syntactic_strategy import SyntacticValidationStrategy

strategy = SyntacticValidationStrategy(
    syntactic_rule_checker=syntactic_checker,
    confidence=0.80
)

errors = strategy.validate(context)
Detection:
  • Particle errors (မှာ vs မှ)
  • Medial confusion (ျ vs ြ)
  • Missing particles
  • Invalid word combinations
  • Duplicated sentence endings (e.g., သည်သည်), detected via fast-path before full syntactic check
  • Split polite forms (ပါတယ် → ပါ + တယ်), automatically skipped to avoid false positives
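The duplicated-ending fast path can be sketched like this (the endings tuple lists illustrative examples, not the strategy's actual inventory):

```python
# Sketch of the duplicated sentence-ending check (illustrative).
SENTENCE_ENDINGS = ("သည်", "တယ်", "ပါ")

def has_duplicated_ending(sentence: str) -> bool:
    """Flag a sentence whose final particle appears twice back-to-back,
    e.g. a sentence ending in သည်သည်."""
    return any(sentence.endswith(e + e) for e in SENTENCE_ENDINGS)
```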

BrokenCompoundStrategy (Priority: 25)

Detects compound words that were incorrectly split by a space. This is the inverse of merged word detection — instead of finding words that were wrongly joined, it finds words that were wrongly separated.
from myspellchecker.core.validation_strategies.broken_compound_strategy import BrokenCompoundStrategy

strategy = BrokenCompoundStrategy(
    provider=provider,
    rare_threshold=2000,          # Max frequency for a word to be "rare"
    compound_min_frequency=5000,  # Min frequency for the compound form
    compound_ratio=5.0,           # Min ratio of compound_freq / rare_word_freq
    confidence=0.8
)

errors = strategy.validate(context)
Parameters:
Parameter | Type | Default | Description
provider | WordRepository | required | Word repository with is_valid_word and get_word_frequency
rare_threshold | int | 2000 | Maximum frequency for a word to be considered "rare"
compound_min_frequency | int | 5000 | Minimum frequency for the compound to be flagged
compound_ratio | float | 5.0 | Minimum ratio of compound frequency to rare word frequency
confidence | float | 0.8 | Confidence score for broken compound errors
Detection:
  • Adjacent word pairs whose concatenation forms a valid, common dictionary word
  • At least one component must be a rare word (below rare_threshold)
  • The compound form must be significantly more common than the rarer component
  • Skips Pali/Sanskrit stacking fragments (virama U+1039) to avoid false positives
Example: “မနက် ဖြန်” (wrongly split) is flagged because “မနက်ဖြန်” (tomorrow) is a valid compound that is much more common than the rare component “ဖြန်”.
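The heuristic can be condensed into a short sketch (illustrative; the real strategy calls provider.is_valid_word and provider.get_word_frequency rather than reading a plain dict):

```python
# Sketch of the broken-compound heuristic using a plain frequency dict
# as a stand-in for the provider (illustrative).

def is_broken_compound(freqs: dict, w1: str, w2: str,
                       rare_threshold: int = 2000,
                       compound_min_frequency: int = 5000,
                       compound_ratio: float = 5.0) -> bool:
    compound_freq = freqs.get(w1 + w2, 0)
    if compound_freq < compound_min_frequency:
        return False              # the joined form must itself be common
    rarest = min(freqs.get(w1, 0), freqs.get(w2, 0))
    if rarest >= rare_threshold:
        return False              # at least one component must be rare
    # the compound must dominate the rarer component by the configured ratio
    return compound_freq >= compound_ratio * max(rarest, 1)
```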

POSSequenceValidationStrategy (Priority: 30)

Validates POS tag sequences against expected patterns.
from myspellchecker.core.validation_strategies.pos_sequence_strategy import POSSequenceValidationStrategy

strategy = POSSequenceValidationStrategy(
    viterbi_tagger=pos_tagger,
    pos_disambiguator=disambiguator,  # Optional: resolves multi-POS tags using R1-R5 rules
    confidence=0.70,
)

errors = strategy.validate(context)
Detection:
  • P-P: Consecutive particles → error (always flagged)
  • N-N: Consecutive nouns without particle → warning (logged, not surfaced as error)
  • V-V: Consecutive verbs → info (serial verb constructions are usually valid)
  • V+N / N+V multi-POS check: When a noun also has V in its dictionary POS, validates context
  • Sentence-final predicate check: Flags sentences with structural particles but no verb, suggests ဖြစ်သည် or ဖြစ်ပါသည်
POS disambiguation: When tags contain | (multi-POS), the optional pos_disambiguator resolves them using context-based R1-R5 rules before validation. Disambiguated tags are stored back in the context for downstream strategies.

Serial Verb Support: Myanmar is a serial-verb language, so verb-verb (V-V) sequences are often valid. The strategy recognizes valid serial verb constructions:
  • Auxiliary verbs: နေ (progressive), ထား (resultative), လိုက် (action manner)
  • Modal verbs: နိုင် (ability), ချင် (desire), ရ (permission)
  • Directional verbs: သွား (away), လာ (toward)
# "စားသွား" (eat+go = go eat) is a valid V-V sequence
# Strategy checks is_valid_verb_sequence() before flagging V-V as error
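A minimal sketch of this allowlist idea (the sets below mirror only the examples listed above — the strategy's real inventory is larger):

```python
# Sketch: accept a V-V pair when the second verb is a known auxiliary,
# modal, or directional verb (sets are illustrative, not exhaustive).
AUXILIARY = {"နေ", "ထား", "လိုက်"}
MODAL = {"နိုင်", "ချင်", "ရ"}
DIRECTIONAL = {"သွား", "လာ"}
SERIAL_SECOND_VERBS = AUXILIARY | MODAL | DIRECTIONAL

def is_valid_verb_sequence(v1: str, v2: str) -> bool:
    return v2 in SERIAL_SECOND_VERBS
```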

QuestionStructureValidationStrategy (Priority: 40)

Validates question sentence structure.
from myspellchecker.core.validation_strategies.question_strategy import QuestionStructureValidationStrategy

strategy = QuestionStructureValidationStrategy(
    confidence=0.75
)

errors = strategy.validate(context)
Detection:
  • Missing question particles (လား, သလဲ)
  • Wrong question particle for context
  • Question word agreement
  • Implicit questions: 2nd-person pronouns + completive endings detected as implicit questions (lower confidence ~0.55)
  • Malformed question endings: Split ရဲ့ လဲ tokens merged and corrected
  • Segmentation fragment filtering: Question words adjacent to previous word (no space gap) are masked to prevent false positives
Enclitic Question Particles: The strategy detects question particles attached directly to verbs (enclitics):
# "သွားလား" (go+question = did you go?) is recognized as a proper question
# No error generated for verb+particle combinations
Negative Indefinite Handling: The strategy correctly identifies negative indefinite constructions as statements, not questions:
# "ဘယ်သူမှ မလာဘူး" = "Nobody came" (statement, NOT question)
# Question word + "မှ" suffix + negative verb = statement pattern
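The statement pattern above can be sketched as a token scan (illustrative; the question-word set and helper name are assumptions, not the strategy's API):

```python
# Sketch of the negative-indefinite statement pattern (illustrative):
# question word + မှ suffix followed by a negative verb (မ...ဘူး).
QUESTION_WORDS = {"ဘယ်သူ", "ဘာ", "ဘယ်"}
SUFFIX = "မှ"

def is_negative_indefinite(words: list) -> bool:
    for i, w in enumerate(words[:-1]):
        if w.endswith(SUFFIX) and w[:-len(SUFFIX)] in QUESTION_WORDS:
            nxt = words[i + 1]
            if nxt.startswith("မ") and nxt.endswith("ဘူး"):
                return True
    return False
```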

ConfusableSemanticStrategy (Priority: 48), Opt-in Required

MLM-enhanced confusable detection that uses masked language modeling to catch valid-word confusables. Dynamically generates confusable variants using phonetic rules (aspiration swaps, medial swaps, tone mark changes, nasal endings) and uses MLM logits to determine if a variant is more likely in context. Requires a trained ONNX model.
from myspellchecker.core.validation_strategies.confusable_semantic_strategy import ConfusableSemanticStrategy

strategy = ConfusableSemanticStrategy(
    semantic_checker=semantic_checker,
    provider=provider,
    confidence=0.80,
    top_k=50,
    logit_diff_threshold=3.0,
    logit_diff_threshold_medial=2.0,
    logit_diff_threshold_current_in_topk=5.0,
    high_freq_threshold=50000,
    high_freq_logit_diff=6.0,
    min_word_length=2
)

errors = strategy.validate(context)
Parameters:
Parameter | Type | Default | Description
semantic_checker | SemanticChecker | required | SemanticChecker with loaded ONNX model
provider | NgramRepository | required | Provider with word lookup and frequency data
confidence | float | 0.80 | Confidence score for confusable errors
top_k | int | 50 | Number of top MLM predictions to consider
logit_diff_threshold | float | 3.0 | Default logit difference threshold
logit_diff_threshold_medial | float | 2.0 | Lower threshold for ျ↔ြ medial swaps
logit_diff_threshold_current_in_topk | float | 5.0 | Stricter threshold when current word is in top-K
high_freq_threshold | int | 50000 | Frequency above which stricter thresholds apply
high_freq_logit_diff | float | 6.0 | Logit diff threshold for high-frequency words
min_word_length | int | 2 | Minimum word length to check
freq_ratio_penalty_high | float | 2.0 | Additive penalty when variant/word frequency ratio exceeds 5x
freq_ratio_penalty_mid | float | 1.0 | Additive penalty when ratio exceeds 2x
visarga_penalty | float | 2.0 | Additive penalty for visarga-only pairs
sentence_final_penalty | float | 0.5 | Additive penalty for sentence-final position
Asymmetric thresholds protect against false positives with stacking penalties:
  • Base threshold (highest wins): high-frequency word (6.0), current in top-K (5.0), medial ျ↔ြ swap (2.0), default (3.0)
  • Additive penalties: frequency-ratio (+2.0 or +1.0), visarga-pair (+2.0), sentence-final (+0.5)
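The stacking can be sketched numerically (the function name is hypothetical; the values are the documented defaults from the parameter table above):

```python
# Sketch of the asymmetric threshold computation (illustrative).

def required_logit_diff(word_freq, current_in_topk, is_medial_swap,
                        freq_ratio=1.0, is_visarga_pair=False,
                        is_sentence_final=False):
    # Base threshold: the highest applicable value wins.
    base = 3.0                    # default
    if is_medial_swap:
        base = 2.0                # lower bar for medial swaps
    if current_in_topk:
        base = max(base, 5.0)
    if word_freq > 50000:
        base = max(base, 6.0)

    # Additive penalties stack on top of the base.
    penalty = 0.0
    if freq_ratio > 5:
        penalty += 2.0            # freq_ratio_penalty_high
    elif freq_ratio > 2:
        penalty += 1.0            # freq_ratio_penalty_mid
    if is_visarga_pair:
        penalty += 2.0            # visarga_penalty
    if is_sentence_final:
        penalty += 0.5            # sentence_final_penalty
    return base + penalty
```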
Detection:
  • Generates confusable variants dynamically from phonetic rules (aspiration swaps, medial swaps, tone marks, nasal endings)
  • Uses a single predict_mask() call per candidate word to compare MLM logits
  • Skips positions already flagged by earlier strategies
  • High-frequency visarga pairs (both words above threshold) are hard-blocked to prevent false positives

NgramContextValidationStrategy (Priority: 50)

Uses bigram/trigram probabilities to detect unlikely sequences.
from myspellchecker.core.validation_strategies.ngram_strategy import NgramContextValidationStrategy

strategy = NgramContextValidationStrategy(
    context_checker=ngram_checker,
    provider=provider,
    confidence_high=0.75,
    confidence_low=0.6,
    max_suggestions=5,
    edit_distance=2
)

errors = strategy.validate(context)
Detection:
  • Low probability word pairs
  • Unusual word combinations
  • Real-word errors (correct spelling, wrong context)

HomophoneValidationStrategy (Priority: 45)

Detects homophone confusion based on context.
from myspellchecker.core.validation_strategies.homophone_strategy import HomophoneValidationStrategy

strategy = HomophoneValidationStrategy(
    homophone_checker=homophone_checker,
    provider=ngram_provider,
    context_checker=context_checker,  # NgramContextChecker instance
    confidence=0.80,
)

errors = strategy.validate(context)
Parameters:
Parameter | Type | Default | Description
homophone_checker | HomophoneChecker or None | required | HomophoneChecker instance; if None, strategy is disabled
provider | NgramRepository | required | Provider for N-gram probability lookups
context_checker | NgramContextChecker or None | None | Performs the N-gram comparison via check_word_in_context()
confidence | float | 0.8 | Confidence score for homophone errors
Legacy kwargs (improvement_ratio, min_probability, high_freq_threshold, high_freq_improvement_ratio) are accepted but ignored for backward compatibility. These thresholds are managed internally by NgramContextChecker.compute_required_ratio().
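A toy sketch of N-gram-based homophone selection (the function and the fixed ratio here are hypothetical; the real decision is made by NgramContextChecker.check_word_in_context() with ratios from compute_required_ratio()):

```python
# Illustrative sketch: pick the homophone whose bigram probability with
# the previous word dominates the alternatives by a required ratio.

def pick_homophone(bigram_prob, prev_word, candidates, required_ratio=2.0):
    scored = sorted(candidates, key=lambda w: bigram_prob(prev_word, w),
                    reverse=True)
    best, runner_up = scored[0], scored[1]
    best_p = bigram_prob(prev_word, best)
    runner_p = bigram_prob(prev_word, runner_up)
    if best_p >= required_ratio * max(runner_p, 1e-12):
        return best               # clear winner in this context
    return None                   # ambiguous: do not flag
```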
Detection:
  • Homophone pairs (ကား/ကာ, သာ/သား)
  • Context-based correct form selection
  • Sound-alike word confusion

SemanticValidationStrategy (Priority: 70), Opt-in Required

AI-powered validation using ONNX models. This strategy is not active by default. You must train a semantic model first, then configure SemanticConfig with the model path and set use_proactive_scanning=True.
from myspellchecker.core.validation_strategies.semantic_strategy import SemanticValidationStrategy

strategy = SemanticValidationStrategy(
    semantic_checker=semantic_checker,
    provider=provider,                 # DictionaryProvider for word lookups
    use_proactive_scanning=True,       # Must be True to enable — False by default
    proactive_confidence_threshold=0.85,
    min_word_length=2,
)

errors = strategy.validate(context)
use_proactive_scanning defaults to False. Without setting it to True, this strategy produces no errors even if a semantic model is loaded. Both a trained model and use_proactive_scanning=True are required.
Parameters:
Parameter | Type | Default | Description
semantic_checker | SemanticChecker or None | required | SemanticChecker with loaded ONNX model; if None, strategy is disabled
provider | DictionaryProvider | required | Provider for word frequency and validity lookups
use_proactive_scanning | bool | False | Enable proactive semantic scanning; must be True for this strategy to do anything
proactive_confidence_threshold | float | 0.85 | Minimum confidence to report semantic errors
min_word_length | int | 2 | Minimum word length for semantic analysis
Detection (two independent sub-checks):
  1. Proactive semantic scan: Masks each word and checks if MLM predictions disagree strongly with the original, limited to 8 predict_mask() calls per sentence
  2. Animacy detection: Flags inanimate subjects before subject/topic particles (က, ကို, သည်, မှာ, တွင်) and always runs even when proactive scanning is skipped
Error budget optimization: Proactive scanning is automatically skipped when there are already errors in the context (from earlier strategies), preventing cascade false positives from corrupted MLM context. Animacy detection is unaffected and always runs.

Skipped words: Common function words (particles and conjunctions, 22 words total) are excluded from proactive scanning, as MLM disagreement on these is noise.
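The gating between the two sub-checks can be sketched like this (illustrative; the sub-checks are passed in as callables rather than the strategy's real methods):

```python
# Sketch of the error-budget gate (illustrative): the proactive scan is
# skipped when earlier strategies already flagged errors, while the
# animacy sub-check always runs.

def semantic_subchecks(context_has_errors, proactive_scan, animacy_check):
    errors = []
    if not context_has_errors:
        errors.extend(proactive_scan())   # budgeted MLM scan
    errors.extend(animacy_check())        # unconditional
    return errors
```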

Creating Custom Strategies

Implement the ValidationStrategy abstract base class:
from myspellchecker.core.validation_strategies.base import (
    ValidationStrategy,
    ValidationContext
)
from myspellchecker.core.response import Error, ContextError

class CustomValidationStrategy(ValidationStrategy):
    """Custom validation strategy."""

    def __init__(self, config: dict):
        self.config = config

    def validate(self, context: ValidationContext) -> list[Error]:
        """Validate and return errors."""
        errors = []

        for i, word in enumerate(context.words):
            # Skip if already has an error
            if context.word_positions[i] in context.existing_errors:
                continue

            # Skip proper names
            if i < len(context.is_name_mask) and context.is_name_mask[i]:
                continue

            # Your validation logic
            if self._is_invalid(word, context):
                errors.append(ContextError(
                    text=word,
                    position=context.word_positions[i],
                    error_type="custom_error",
                    suggestions=self._get_suggestions(word),
                    confidence=0.80,
                    probability=0.0,
                    prev_word=context.words[i-1] if i > 0 else ""
                ))

                # Mark as having error (existing_errors is a dict[int, str])
                context.existing_errors[context.word_positions[i]] = "custom_error"

        return errors

    def priority(self) -> int:
        """Return priority (lower runs first)."""
        return 35  # Between POSSequence (30) and Question (40); 45 would collide with Homophone

    def _is_invalid(self, word: str, context: ValidationContext) -> bool:
        # Implement validation logic
        return False

    def _get_suggestions(self, word: str) -> list[str]:
        # Generate suggestions
        return []

Strategy Composition

In the default pipeline, SpellChecker coordinates validation directly through its validators:
  1. SyllableValidator: validates each syllable (layer 1)
  2. WordValidator: validates words via SymSpell (layer 2)
  3. ContextValidator: orchestrates validation strategies (layer 3)
The ContextValidator receives a list of strategies built by SpellCheckerBuilder and executes them in priority order within each sentence.
from myspellchecker.core.builder import SpellCheckerBuilder

# Builder wires strategies automatically based on config
checker = SpellCheckerBuilder().with_config(config).with_provider(provider).build()
result = checker.check("မြန်မာ စာ")

Execution Order

  1. Strategies are sorted by priority (ascending)
  2. Each strategy receives the shared ValidationContext
  3. Strategies can check existing_errors to skip already-flagged words
  4. Strategies add their flagged positions to existing_errors
  5. Errors from all strategies are collected and returned
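Steps 1-5 can be sketched as a single loop (illustrative; errors are plain dicts here rather than the library's Error objects):

```python
# Sketch of the execution loop (illustrative): ascending priority order,
# with flagged positions recorded in the shared context.

def run_strategies(strategies, context):
    all_errors = []
    for strategy in sorted(strategies, key=lambda s: s.priority()):
        for err in strategy.validate(context):
            # record the position so later strategies can skip it
            context.existing_errors[err["position"]] = err["error_type"]
            all_errors.append(err)
    return all_errors
```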

Configuration

Enable/disable strategies via configuration:
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig
from myspellchecker.core.config.validation_configs import ValidationConfig

config = SpellCheckerConfig(
    use_context_checker=True,  # Enable N-gram strategy
    use_phonetic=True,         # Enable homophone detection
    validation=ValidationConfig(
        use_homophone_detection=True,      # Toggle homophone strategy (default: True)
        use_orthography_validation=True,   # Toggle orthography strategy (default: True)
        enable_strategy_timing=False,      # Per-strategy timing at DEBUG level (default: False)
    ),
    # Semantic config enables semantic strategy (opt-in, requires trained model)
    semantic=SemanticConfig(
        model_path="./my-model/model.onnx",       # Your trained model
        use_proactive_scanning=True,
    ),
)

Error Types

Each strategy produces specific error types:
Error Type | Strategy | Description
tone_ambiguity | Tone | Tone mark disambiguation
medial_order_error | Orthography | Medial consonant order/compatibility
syntax_error | Syntactic | Grammar rule violation
broken_compound | BrokenCompound | Wrongly split compound word
pos_sequence_error | POS | Invalid POS sequence (P-P)
question_structure | Question | Question structure issue
homophone_error | Homophone | Sound-alike confusion
confusable_error | ConfusableSemantic | Phonetic/visual confusable (MLM-based)
context_probability | N-gram | Low probability sequence
semantic_error | Semantic | AI-detected anomaly (opt-in)

Best Practices

  1. Priority Selection: Choose priorities that make sense for your validation order
  2. Skip Flagged Words: Always check existing_errors to avoid duplicate errors
  3. Skip Names: Respect the is_name_mask to avoid flagging proper names
  4. Confidence Scores: Use appropriate confidence levels for your error type
  5. Performance: Heavy validations (semantic) should run last

See Also