The validation pipeline is composed of independent strategies, each targeting a specific error type, from tone mark disambiguation to AI-powered semantic analysis. Strategies execute in priority order and share context so later strategies can skip positions already flagged by earlier ones.
Overview
The validation pipeline processes text through multiple strategies, each checking for different error types:
| Strategy | Priority | Purpose | Error Type |
|---|---|---|---|
| ToneValidationStrategy | 10 | Tone mark disambiguation | tone_ambiguity |
| OrthographyValidationStrategy | 15 | Medial order and compatibility | medial_order_error |
| SyntacticValidationStrategy | 20 | Grammar rule checking | syntax_error |
| StatisticalConfusableStrategy | 24 | Bigram-based confusable detection | confusable_error |
| BrokenCompoundStrategy | 25 | Wrongly split compound words | broken_compound |
| POSSequenceValidationStrategy | 30 | POS sequence validation | pos_sequence_error |
| QuestionStructureValidationStrategy | 40 | Question structure | question_structure |
| HomophoneValidationStrategy | 45 | Homophone detection | homophone_error |
| ConfusableCompoundClassifierStrategy | 47 | MLP-based confusable/compound detection (opt-in) | broken_compound |
| ConfusableSemanticStrategy | 48 | MLM-enhanced confusable detection (opt-in) | confusable_error |
| NgramContextValidationStrategy | 50 | N-gram probability | context_probability |
| SemanticValidationStrategy | 70 | AI-powered semantic (opt-in) | semantic_error |
Lower priority values run first.
Fast-Path Exit
When enable_fast_path is True (the default), the pipeline uses a two-phase execution model:
- Structural phase (priority ≤ 25): Tone, Orthography, Syntactic, StatisticalConfusable, and BrokenCompound strategies always run.
- Contextual phase (priority > 25): POS sequence, Question, Homophone, Confusable, N-gram, and Semantic strategies only run if the structural phase found at least one error.
This dramatically reduces false positives on clean text — most sentences have no structural errors, and the contextual strategies are the primary source of false positives.
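The two-phase gate can be sketched as follows. This is an illustrative sketch, not the library's internals: `run_pipeline` and `STRUCTURAL_CUTOFF` are hypothetical names, assuming only that each strategy exposes the `priority()` and `validate(context)` methods described below.

```python
# Illustrative sketch of the fast-path gate; run_pipeline and
# STRUCTURAL_CUTOFF are hypothetical names, not the real API.
STRUCTURAL_CUTOFF = 25  # BrokenCompoundStrategy and below

def run_pipeline(strategies, context, enable_fast_path=True):
    strategies = sorted(strategies, key=lambda s: s.priority())
    errors = []
    # Phase 1: structural strategies (priority <= 25) always run.
    for s in strategies:
        if s.priority() <= STRUCTURAL_CUTOFF:
            errors.extend(s.validate(context))
    # Phase 2: contextual strategies run only when phase 1 found
    # at least one error, or when the fast path is disabled.
    if not enable_fast_path or errors:
        for s in strategies:
            if s.priority() > STRUCTURAL_CUTOFF:
                errors.extend(s.validate(context))
    return errors
```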
Configuration
```python
from myspellchecker.core.config import SpellCheckerConfig, ValidationConfig

# Default: fast-path enabled (lower FPR, may miss context-only errors)
config = SpellCheckerConfig()

# Disable fast-path for maximum recall
config = SpellCheckerConfig(
    validation=ValidationConfig(enable_fast_path=False)
)
```
Trade-offs
| Setting | FPR | Recall | Use Case |
|---|---|---|---|
| enable_fast_path=True | Lower (~43% on clean text) | May miss context-only errors | Production, real-time checking |
| enable_fast_path=False | Higher | Full recall | Research, maximum accuracy |
The fast-path cutoff is at priority 25 (after BrokenCompoundStrategy). Strategies at priority 30+ (POS sequence, homophone, confusable, n-gram, semantic) are skipped on structurally clean sentences. If you need full contextual validation on all input, set enable_fast_path=False.
ValidationContext
All strategies receive a shared ValidationContext containing sentence-level information:
```python
from myspellchecker.core.validation_strategies.base import ValidationContext

context = ValidationContext(
    sentence="သူ သွား ကျောင်း",
    words=["သူ", "သွား", "ကျောင်း"],
    word_positions=[0, 6, 15],
    is_name_mask=[False, False, False],
    existing_errors={},          # Maps position -> error_type from previous strategies
    sentence_type="statement",   # statement, question, command
    pos_tags=["PRON", "V", "N"]  # POS tags if available
)
```
Context Attributes
| Attribute | Type | Description |
|---|---|---|
| sentence | str | Full original sentence |
| words | List[str] | Tokenized words |
| word_positions | List[int] | Character position of each word |
| is_name_mask | List[bool] | True if word is a proper name |
| existing_errors | dict[int, str] | Maps word position to error_type from previous strategies |
| existing_suggestions | dict[int, list[str]] | Suggestions from the strategy that first flagged each position |
| existing_confidences | dict[int, float] | Confidence scores of first-flagged errors |
| sentence_type | str | Sentence type for context |
| pos_tags | List[str] | POS tags (if available) |
| full_text | str | The full text being checked (not just the sentence) |
| global_error_count | int | Tracks error count globally across sentences |
Strategy Implementations
ToneValidationStrategy (Priority: 10)
Handles tone mark disambiguation using context. Accepts an optional provider for word frequency lookup to suppress ambiguous high-frequency forms.
```python
from myspellchecker.core.validation_strategies.tone_strategy import ToneValidationStrategy
from myspellchecker.text.tone import ToneDisambiguator

disambiguator = ToneDisambiguator()
strategy = ToneValidationStrategy(
    tone_disambiguator=disambiguator,
    provider=provider,         # Optional: for frequency-based suppression
    confidence_threshold=0.5,  # Minimum confidence to report error
)
errors = strategy.validate(context)
```
Detection:
- Missing tone marks (ငါ → ငါး in number context)
- Wrong tone marks based on context
- Ambiguous words resolved by surrounding words
Frequency-based suppression: When both the original word and the correction are high-frequency (above high_freq_threshold), the error is suppressed. This prevents false positives on grammatically ambiguous forms like သူ့ (possessive) vs သူ (subject) where both are valid.
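A minimal sketch of that suppression rule, assuming `high_freq_threshold` is a plain frequency cutoff; the helper name and the default value shown are illustrative, not the strategy's actual signature:

```python
# Hypothetical helper illustrating frequency-based suppression: when both
# the flagged word and its proposed correction are high-frequency, the
# tone error is treated as a valid ambiguous form and dropped.
def should_suppress(word_freq: int, correction_freq: int,
                    high_freq_threshold: int = 50_000) -> bool:
    return (word_freq >= high_freq_threshold
            and correction_freq >= high_freq_threshold)
```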
OrthographyValidationStrategy (Priority: 15)
Validates medial consonant ordering and compatibility (UTN #11 rules) at the word level. Uses a two-step check: medial order first, then compatibility. Accepts an optional provider for sorting suggestions by dictionary validity.
```python
from myspellchecker.core.validation_strategies.orthography_strategy import OrthographyValidationStrategy

strategy = OrthographyValidationStrategy(
    provider=provider,  # Optional: sort suggestions by validity
    confidence=0.9,     # Default confidence for orthography errors
)
```
Detection:
- Medial order errors: Incorrect medial consonant order (e.g., ွ before ျ), which generates stripped variant suggestions
- Compatibility errors: Incompatible medial-consonant combinations with no suggestions, because the combination is invalid
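A simplified sketch of the two-step check. Canonical medial order per UTN #11 is ja (U+103B), ra (U+103C), wa (U+103D), ha (U+103E); the real strategy's compatibility rules are richer than the single pair shown here, and the function names are illustrative.

```python
# Canonical medial order per UTN #11: ja, ra, wa, ha.
MEDIAL_ORDER = ["\u103B", "\u103C", "\u103D", "\u103E"]

def medials_in_order(cluster: str) -> bool:
    # Step 1: medials must appear in canonical order
    ranks = [MEDIAL_ORDER.index(c) for c in cluster if c in MEDIAL_ORDER]
    return ranks == sorted(ranks)

def medials_compatible(cluster: str) -> bool:
    # Step 2: ja and ra medials never combine on one consonant
    present = {c for c in cluster if c in MEDIAL_ORDER}
    return not {"\u103B", "\u103C"}.issubset(present)
```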
SyntacticValidationStrategy (Priority: 20)
Validates grammar rules and particle usage.
```python
from myspellchecker.core.validation_strategies.syntactic_strategy import SyntacticValidationStrategy

strategy = SyntacticValidationStrategy(
    syntactic_rule_checker=syntactic_checker,
    confidence=0.80
)
errors = strategy.validate(context)
```
Detection:
- Particle errors (မှာ vs မှ)
- Medial confusion (ျ vs ြ)
- Missing particles
- Invalid word combinations
- Duplicated sentence endings (e.g., သည်သည်), detected via fast-path before full syntactic check
- Split polite forms (ပါတယ် → ပါ + တယ်), automatically skipped to avoid false positives
BrokenCompoundStrategy (Priority: 25)
Detects compound words that were incorrectly split by a space. This is the inverse of merged word detection — instead of finding words that were wrongly joined, it finds words that were wrongly separated.
```python
from myspellchecker.core.validation_strategies.broken_compound_strategy import BrokenCompoundStrategy

strategy = BrokenCompoundStrategy(
    provider=provider,
    rare_threshold=2000,          # Max frequency for a word to be "rare"
    compound_min_frequency=5000,  # Min frequency for the compound form
    compound_ratio=5.0,           # Min ratio of compound_freq / rare_word_freq
    confidence=0.8
)
errors = strategy.validate(context)
```
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| provider | WordRepository | required | Word repository with is_valid_word and get_word_frequency |
| rare_threshold | int | 2000 | Maximum frequency for a word to be considered "rare" |
| compound_min_frequency | int | 5000 | Minimum frequency for the compound to be flagged |
| compound_ratio | float | 5.0 | Minimum ratio of compound frequency to rare word frequency |
| confidence | float | 0.8 | Confidence score for broken compound errors |
Detection:
- Adjacent word pairs whose concatenation forms a valid, common dictionary word
- At least one component must be a rare word (below rare_threshold)
- The compound form must be significantly more common than the rarer component
- Skips Pali/Sanskrit stacking fragments (virama U+1039) to avoid false positives
Example: “မနက် ဖြန်” (wrongly split) is flagged because “မနက်ဖြန်” (tomorrow) is a valid compound that is much more common than the rare component “ဖြန်”.
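The three conditions above can be sketched as a single predicate; `freq` and `is_word` stand in for the provider's `get_word_frequency` and `is_valid_word`, and the function name is illustrative:

```python
# Sketch of the broken-compound heuristic with the documented defaults.
def looks_broken(left, right, freq, is_word,
                 rare_threshold=2000, compound_min_frequency=5000,
                 compound_ratio=5.0):
    compound = left + right
    # The concatenation must be a valid, common dictionary word
    if not is_word(compound) or freq(compound) < compound_min_frequency:
        return False
    # At least one component must be rare
    rare_freq = min(freq(left), freq(right))
    if rare_freq >= rare_threshold:
        return False  # both components common; the split is plausible
    # The compound must be much more common than the rarer component
    return freq(compound) >= compound_ratio * max(rare_freq, 1)
```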
POSSequenceValidationStrategy (Priority: 30)
Validates POS tag sequences against expected patterns.
```python
from myspellchecker.core.validation_strategies.pos_sequence_strategy import POSSequenceValidationStrategy

strategy = POSSequenceValidationStrategy(
    viterbi_tagger=pos_tagger,
    pos_disambiguator=disambiguator,  # Optional: resolves multi-POS tags using R1-R5 rules
    confidence=0.70,
)
errors = strategy.validate(context)
```
Detection:
- P-P: Consecutive particles → error (always flagged)
- N-N: Consecutive nouns without particle → warning (logged, not surfaced as error)
- V-V: Consecutive verbs → info (serial verb constructions are usually valid)
- V+N / N+V multi-POS check: When a noun also has V in its dictionary POS, validates context
- Sentence-final predicate check: Flags sentences with structural particles but no verb, suggests ဖြစ်သည် or ဖြစ်ပါသည်
POS disambiguation: When tags contain | (multi-POS), the optional pos_disambiguator resolves them using context-based R1-R5 rules before validation. Disambiguated tags are stored back in context for downstream strategies.
Serial Verb Support:
Myanmar is a serial verb language where verb-verb (V-V) sequences are often valid. The strategy recognizes valid serial verb constructions:
- Auxiliary verbs: နေ (progressive), ထား (resultative), လိုက် (action manner)
- Modal verbs: နိုင် (ability), ချင် (desire), ရ (permission)
- Directional verbs: သွား (away), လာ (toward)
```python
# "စားသွား" (eat+go = go eat) is a valid V-V sequence
# Strategy checks is_valid_verb_sequence() before flagging V-V as error
```
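A whitelist-style check of the kind described above can be sketched as follows; the verb sets contain only the examples listed on this page and are not the strategy's full inventory:

```python
# Sketch of a serial-verb whitelist; sets are illustrative, not exhaustive.
AUXILIARY = {"နေ", "ထား", "လိုက်"}        # progressive, resultative, manner
MODAL = {"နိုင်", "ချင်", "ရ"}             # ability, desire, permission
DIRECTIONAL = {"သွား", "လာ"}             # away, toward
SERIAL_SECOND_VERBS = AUXILIARY | MODAL | DIRECTIONAL

def is_valid_verb_sequence(v1: str, v2: str) -> bool:
    # A V-V pair is accepted when the second verb is a known
    # auxiliary, modal, or directional verb.
    return v2 in SERIAL_SECOND_VERBS
```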
QuestionStructureValidationStrategy (Priority: 40)
Validates question sentence structure.
```python
from myspellchecker.core.validation_strategies.question_strategy import QuestionStructureValidationStrategy

strategy = QuestionStructureValidationStrategy(
    confidence=0.75
)
errors = strategy.validate(context)
```
Detection:
- Missing question particles (လား, သလဲ)
- Wrong question particle for context
- Question word agreement
- Implicit questions: 2nd-person pronouns + completive endings detected as implicit questions (lower confidence ~0.55)
- Malformed question endings: Split ရဲ့ လဲ tokens merged and corrected
- Segmentation fragment filtering: Question words adjacent to previous word (no space gap) are masked to prevent false positives
Enclitic Question Particles:
The strategy detects question particles attached directly to verbs (enclitics):
```python
# "သွားလား" (go+question = did you go?) is recognized as a proper question
# No error generated for verb+particle combinations
```
Negative Indefinite Handling:
The strategy correctly identifies negative indefinite constructions as statements, not questions:
```python
# "ဘယ်သူမှ မလာဘူး" = "Nobody came" (statement, NOT question)
# Question word + "မှ" suffix + negative verb = statement pattern
```
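The statement-vs-question pattern described in those comments can be sketched like this; the question-word list is a small illustrative sample, and the function name is hypothetical:

```python
# Sketch of negative-indefinite detection: a question word carrying the
# မှ suffix followed by a မ-negated verb is a statement, not a question.
QUESTION_WORDS = {"ဘယ်သူ", "ဘာ", "ဘယ်"}  # who, what, which (sample only)

def is_negative_indefinite(words: list[str]) -> bool:
    for i, w in enumerate(words[:-1]):
        if any(w == q + "မှ" for q in QUESTION_WORDS):
            if words[i + 1].startswith("မ"):  # negative verb prefix
                return True
    return False
```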
ConfusableSemanticStrategy (Priority: 48), Opt-in Required
MLM-enhanced confusable detection that uses masked language modeling to catch valid-word confusables. Dynamically generates confusable variants using phonetic rules (aspiration swaps, medial swaps, tone mark changes, nasal endings) and uses MLM logits to determine if a variant is more likely in context. Requires a trained ONNX model.
```python
from myspellchecker.core.validation_strategies.confusable_semantic_strategy import ConfusableSemanticStrategy

strategy = ConfusableSemanticStrategy(
    semantic_checker=semantic_checker,
    provider=provider,
    confidence=0.80,
    top_k=50,
    logit_diff_threshold=3.0,
    logit_diff_threshold_medial=2.0,
    logit_diff_threshold_current_in_topk=5.0,
    high_freq_threshold=50000,
    high_freq_logit_diff=6.0,
    min_word_length=2
)
errors = strategy.validate(context)
```
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| semantic_checker | SemanticChecker | required | SemanticChecker with loaded ONNX model |
| provider | NgramRepository | required | Provider with word lookup and frequency data |
| confidence | float | 0.80 | Confidence score for confusable errors |
| top_k | int | 50 | Number of top MLM predictions to consider |
| logit_diff_threshold | float | 3.0 | Default logit difference threshold |
| logit_diff_threshold_medial | float | 2.0 | Lower threshold for ျ↔ြ medial swaps |
| logit_diff_threshold_current_in_topk | float | 5.0 | Stricter threshold when current word is in top-K |
| high_freq_threshold | int | 50000 | Frequency above which stricter thresholds apply |
| high_freq_logit_diff | float | 6.0 | Logit diff threshold for high-frequency words |
| min_word_length | int | 2 | Minimum word length to check |
| freq_ratio_penalty_high | float | 2.0 | Additive penalty when variant/word frequency ratio exceeds 5x |
| freq_ratio_penalty_mid | float | 1.0 | Additive penalty when ratio exceeds 2x |
| visarga_penalty | float | 2.0 | Additive penalty for visarga-only pairs |
| sentence_final_penalty | float | 0.5 | Additive penalty for sentence-final position |
Asymmetric thresholds protect against false positives with stacking penalties:
- Base threshold (highest wins): high-frequency word (6.0), current in top-K (5.0), medial ျ↔ြ swap (2.0), default (3.0)
- Additive penalties: frequency-ratio (+2.0 or +1.0), visarga-pair (+2.0), sentence-final (+0.5)
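One way to read that combination, as a sketch: the base threshold is picked first (a medial swap lowers it, high frequency or top-K membership raises it), then the penalties stack additively. The function and argument names are illustrative, with the documented defaults hard-coded.

```python
# Illustrative combination of base thresholds and stacking penalties.
def required_logit_diff(word_freq, is_medial_swap, current_in_topk,
                        freq_ratio, is_visarga_pair, is_sentence_final):
    # Base threshold: medial swaps get the lower 2.0 base; high-frequency
    # words and current-in-top-K raise it (highest applicable wins).
    base = 2.0 if is_medial_swap else 3.0
    if current_in_topk:
        base = max(base, 5.0)
    if word_freq >= 50_000:
        base = max(base, 6.0)
    # Additive penalties stack on top of the base.
    if freq_ratio > 5:
        base += 2.0
    elif freq_ratio > 2:
        base += 1.0
    if is_visarga_pair:
        base += 2.0
    if is_sentence_final:
        base += 0.5
    return base
```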
Detection:
- Generates confusable variants dynamically from phonetic rules (aspiration swaps, medial swaps, tone marks, nasal endings)
- Uses a single predict_mask() call per candidate word to compare MLM logits
- Skips positions already flagged by earlier strategies
- High-frequency visarga pairs (both words above threshold) are hard-blocked to prevent false positives
NgramContextValidationStrategy (Priority: 50)
Uses bigram/trigram probabilities to detect unlikely sequences.
```python
from myspellchecker.core.validation_strategies.ngram_strategy import NgramContextValidationStrategy

strategy = NgramContextValidationStrategy(
    context_checker=ngram_checker,
    provider=provider,
    confidence_high=0.75,
    confidence_low=0.6,
    max_suggestions=5,
    edit_distance=2
)
errors = strategy.validate(context)
```
Detection:
- Low probability word pairs
- Unusual word combinations
- Real-word errors (correct spelling, wrong context)
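For intuition, here is a generic Laplace-smoothed bigram score of the kind such a checker might threshold. This is textbook n-gram smoothing, not the library's internal formula; an unseen pair falls back to a low floor probability.

```python
import math

def bigram_logprob(prev: str, word: str, bigram_counts: dict,
                   unigram_counts: dict, vocab_size: int) -> float:
    # Laplace-smoothed conditional log P(word | prev).
    num = bigram_counts.get((prev, word), 0) + 1
    den = unigram_counts.get(prev, 0) + vocab_size
    return math.log(num / den)
```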
HomophoneValidationStrategy (Priority: 45)
Detects homophone confusion based on context.
```python
from myspellchecker.core.validation_strategies.homophone_strategy import HomophoneValidationStrategy

strategy = HomophoneValidationStrategy(
    homophone_checker=homophone_checker,
    provider=ngram_provider,
    context_checker=context_checker,  # NgramContextChecker instance
    confidence=0.80,
)
errors = strategy.validate(context)
```
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| homophone_checker | HomophoneChecker or None | required | HomophoneChecker instance; if None, strategy is disabled |
| provider | NgramRepository | required | Provider for N-gram probability lookups |
| context_checker | NgramContextChecker or None | None | NgramContextChecker that performs N-gram comparison via check_word_in_context() |
| confidence | float | 0.8 | Confidence score for homophone errors |
Legacy kwargs (improvement_ratio, min_probability, high_freq_threshold, high_freq_improvement_ratio) are accepted but ignored for backward compatibility. These thresholds are managed internally by NgramContextChecker.compute_required_ratio().
Detection:
- Homophone pairs (ကား/ကာ, သာ/သား)
- Context-based correct form selection
- Sound-alike word confusion
SemanticValidationStrategy (Priority: 70), Opt-in Required
AI-powered validation using ONNX models. This strategy is not active by default. You must train a semantic model first, then configure SemanticConfig with the model path and set use_proactive_scanning=True.
```python
from myspellchecker.core.validation_strategies.semantic_strategy import SemanticValidationStrategy

strategy = SemanticValidationStrategy(
    semantic_checker=semantic_checker,
    provider=provider,             # DictionaryProvider for word lookups
    use_proactive_scanning=True,   # Must be True to enable (False by default)
    proactive_confidence_threshold=0.85,
    min_word_length=2,
)
errors = strategy.validate(context)
```
use_proactive_scanning defaults to False. Without setting it to True, this strategy produces no errors even if a semantic model is loaded. Both a trained model and use_proactive_scanning=True are required.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| semantic_checker | SemanticChecker or None | required | SemanticChecker with loaded ONNX model; if None, strategy is disabled |
| provider | DictionaryProvider | required | Provider for word frequency and validity lookups |
| use_proactive_scanning | bool | False | Enable proactive semantic scanning. Must be True for this strategy to do anything |
| proactive_confidence_threshold | float | 0.85 | Minimum confidence to report semantic errors |
| min_word_length | int | 2 | Minimum word length for semantic analysis |
Detection (two independent sub-checks):
- Proactive semantic scan: Masks each word and checks if MLM predictions disagree strongly with the original, limited to 8 predict_mask() calls per sentence
- Animacy detection: Flags inanimate subjects before subject/topic particles (က, ကို, သည်, မှာ, တွင်) and always runs even when proactive scanning is skipped
Error budget optimization: Proactive scanning is automatically skipped when there are already errors in the context (from earlier strategies), preventing cascade false positives from corrupted MLM context. Animacy detection is unaffected and always runs.
Skipped words: Common function words (particles and conjunctions, 22 words total) are excluded from proactive scanning as MLM disagreement on these is noise.
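The gating rules above reduce to a small decision. This sketch returns check names for clarity; the real strategy returns errors, and the function name is hypothetical.

```python
# Sketch of the semantic strategy's gating: animacy detection always
# runs; the proactive scan is skipped when earlier strategies already
# flagged errors (error budget) or when scanning is not enabled.
def semantic_checks_to_run(existing_errors: dict,
                           use_proactive_scanning: bool) -> list[str]:
    checks = ["animacy"]  # always runs
    if use_proactive_scanning and not existing_errors:
        checks.append("proactive_scan")
    return checks
```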
Creating Custom Strategies
Implement the ValidationStrategy abstract base class:
```python
from myspellchecker.core.validation_strategies.base import (
    ValidationStrategy,
    ValidationContext,
)
from myspellchecker.core.response import Error, ContextError


class CustomValidationStrategy(ValidationStrategy):
    """Custom validation strategy."""

    def __init__(self, config: dict):
        self.config = config

    def validate(self, context: ValidationContext) -> list[Error]:
        """Validate and return errors."""
        errors = []
        for i, word in enumerate(context.words):
            # Skip if already has an error
            if context.word_positions[i] in context.existing_errors:
                continue
            # Skip proper names
            if i < len(context.is_name_mask) and context.is_name_mask[i]:
                continue
            # Your validation logic
            if self._is_invalid(word, context):
                errors.append(ContextError(
                    text=word,
                    position=context.word_positions[i],
                    error_type="custom_error",
                    suggestions=self._get_suggestions(word),
                    confidence=0.80,
                    probability=0.0,
                    prev_word=context.words[i - 1] if i > 0 else "",
                ))
                # Mark as having error (existing_errors is a dict[int, str])
                context.existing_errors[context.word_positions[i]] = "custom_error"
        return errors

    def priority(self) -> int:
        """Return priority (lower runs first)."""
        return 45  # Between POS and N-gram

    def _is_invalid(self, word: str, context: ValidationContext) -> bool:
        # Implement validation logic
        return False

    def _get_suggestions(self, word: str) -> list[str]:
        # Generate suggestions
        return []
```
Strategy Composition
In the default pipeline, SpellChecker coordinates validation directly through its validators:
- SyllableValidator: validates each syllable (layer 1)
- WordValidator: validates words via SymSpell (layer 2)
- ContextValidator: orchestrates validation strategies (layer 3)
The ContextValidator receives a list of strategies built by SpellCheckerBuilder and executes them in priority order within each sentence.
```python
from myspellchecker.core.builder import SpellCheckerBuilder

# Builder wires strategies automatically based on config
checker = SpellCheckerBuilder().with_config(config).with_provider(provider).build()
result = checker.check("မြန်မာ စာ")
```
Execution Order
- Strategies are sorted by priority (ascending)
- Each strategy receives the shared ValidationContext
- Strategies can check existing_errors to skip already-flagged words
- Strategies add their flagged positions to existing_errors
- Errors from all strategies are collected and returned
Configuration
Enable/disable strategies via configuration:
```python
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig  # SemanticConfig import path assumed
from myspellchecker.core.config.validation_configs import ValidationConfig

config = SpellCheckerConfig(
    use_context_checker=True,  # Enable N-gram strategy
    use_phonetic=True,         # Enable homophone detection
    validation=ValidationConfig(
        use_homophone_detection=True,     # Toggle homophone strategy (default: True)
        use_orthography_validation=True,  # Toggle orthography strategy (default: True)
        enable_strategy_timing=False,     # Per-strategy timing at DEBUG level (default: False)
    ),
    # Semantic config enables semantic strategy (opt-in, requires trained model)
    semantic=SemanticConfig(
        model_path="./my-model/model.onnx",  # Your trained model
        use_proactive_scanning=True,
    ),
)
```
Error Types
Each strategy produces specific error types:
| Error Type | Strategy | Description |
|---|---|---|
| tone_ambiguity | Tone | Tone mark disambiguation |
| medial_order_error | Orthography | Medial consonant order/compatibility |
| syntax_error | Syntactic | Grammar rule violation |
| broken_compound | BrokenCompound | Wrongly split compound word |
| pos_sequence_error | POS | Invalid POS sequence (P-P) |
| question_structure | Question | Question structure issue |
| homophone_error | Homophone | Sound-alike confusion |
| confusable_error | ConfusableSemantic | Phonetic/visual confusable (MLM-based) |
| context_probability | N-gram | Low probability sequence |
| semantic_error | Semantic | AI-detected anomaly (opt-in) |
Best Practices
- Priority Selection: Choose priorities that make sense for your validation order
- Skip Flagged Words: Always check existing_errors to avoid duplicate errors
- Skip Names: Respect the is_name_mask to avoid flagging proper names
- Confidence Scores: Use appropriate confidence levels for your error type
- Performance: Heavy validations (semantic) should run last
See Also