The validation pipeline is composed of independent strategies, each targeting a specific error type, from tone mark disambiguation to AI-powered semantic analysis. Strategies execute in priority order and share context so later strategies can skip positions already flagged by earlier ones.
Overview
The validation pipeline processes text through multiple strategies, each checking for different error types:
| Strategy | Priority | Purpose | Error Type |
|---|---|---|---|
| ToneValidationStrategy | 10 | Tone mark disambiguation | tone_ambiguity |
| OrthographyValidationStrategy | 15 | Medial order and compatibility | medial_order_error |
| SyntacticValidationStrategy | 20 | Grammar rule checking | syntax_error |
| StatisticalConfusableStrategy | 24 | Bigram-based confusable detection | confusable_error |
| BrokenCompoundStrategy | 25 | Wrongly split compound words | broken_compound |
| POSSequenceValidationStrategy | 30 | POS sequence validation | pos_sequence_error |
| QuestionStructureValidationStrategy | 40 | Question structure | question_structure |
| HomophoneValidationStrategy | 45 | Homophone detection | homophone_error |
| ConfusableCompoundClassifierStrategy | 47 | MLP-based confusable/compound detection (opt-in) | broken_compound |
| ConfusableSemanticStrategy | 48 | MLM-enhanced confusable detection (opt-in) | confusable_error |
| NgramContextValidationStrategy | 50 | N-gram probability | context_probability |
| SemanticValidationStrategy | 70 | AI-powered semantic (opt-in) | semantic_error |
Lower priority values run first.
Fast-Path Exit
When enable_fast_path is True (the default), the pipeline uses a two-phase execution model:
- Structural phase (priority ≤ 25): Tone, Orthography, Syntactic, StatisticalConfusable, and BrokenCompound strategies always run.
- Contextual phase (priority > 25): POS sequence, Question, Homophone, Confusable, N-gram, and Semantic strategies only run if the structural phase found at least one error.
This dramatically reduces false positives on clean text — most sentences have no structural errors, and the contextual strategies are the primary source of false positives.
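The two-phase gate can be sketched as follows. This is an illustrative sketch, not the library's internals: `run_pipeline` and `STRUCTURAL_CUTOFF` are hypothetical names, assuming only that each strategy exposes the `priority()` and `validate(context)` methods described below.

```python
# Illustrative sketch of the fast-path gate; run_pipeline and
# STRUCTURAL_CUTOFF are hypothetical names, not the real API.
STRUCTURAL_CUTOFF = 25  # BrokenCompoundStrategy and below

def run_pipeline(strategies, context, enable_fast_path=True):
    strategies = sorted(strategies, key=lambda s: s.priority())
    errors = []
    # Phase 1: structural strategies (priority <= 25) always run.
    for s in strategies:
        if s.priority() <= STRUCTURAL_CUTOFF:
            errors.extend(s.validate(context))
    # Phase 2: contextual strategies run only when phase 1 found
    # at least one error, or when the fast path is disabled.
    if not enable_fast_path or errors:
        for s in strategies:
            if s.priority() > STRUCTURAL_CUTOFF:
                errors.extend(s.validate(context))
    return errors
```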
Configuration
```python
from myspellchecker.core.config import SpellCheckerConfig, ValidationConfig

# Default: fast-path enabled (lower FPR, may miss context-only errors)
config = SpellCheckerConfig()

# Disable fast-path for maximum recall
config = SpellCheckerConfig(
    validation=ValidationConfig(enable_fast_path=False)
)
```
Trade-offs
| Setting | FPR | Recall | Use Case |
|---|---|---|---|
| enable_fast_path=True | Lower (~43% on clean text) | May miss context-only errors | Production, real-time checking |
| enable_fast_path=False | Higher | Full recall | Research, maximum accuracy |
The fast-path cutoff is at priority 25 (after BrokenCompoundStrategy). Strategies at priority 30+ (POS sequence, homophone, confusable, n-gram, semantic) are skipped on structurally clean sentences. If you need full contextual validation on all input, set enable_fast_path=False.
ValidationContext
All strategies receive a shared ValidationContext containing sentence-level information:
```python
from myspellchecker.core.validation_strategies.base import ValidationContext

context = ValidationContext(
    sentence="သူ သွား ကျောင်း",
    words=["သူ", "သွား", "ကျောင်း"],
    word_positions=[0, 6, 15],
    is_name_mask=[False, False, False],
    existing_errors={},          # Maps position -> error_type from previous strategies
    sentence_type="statement",   # statement, question, command
    pos_tags=["PRON", "V", "N"]  # POS tags if available
)
```
Context Attributes
| Attribute | Type | Description |
|---|---|---|
| sentence | str | Full original sentence |
| words | List[str] | Tokenized words |
| word_positions | List[int] | Character position of each word |
| is_name_mask | List[bool] | True if word is a proper name |
| existing_errors | dict[int, str] | Maps word position to error_type from previous strategies |
| existing_suggestions | dict[int, list[str]] | Suggestions from the strategy that first flagged each position |
| existing_confidences | dict[int, float] | Confidence scores of first-flagged errors |
| sentence_type | str | Sentence type for context |
| pos_tags | List[str] | POS tags (if available) |
| full_text | str | The full text being checked (not just the sentence) |
| global_error_count | int | Tracks error count globally across sentences |
Strategy Implementations
ToneValidationStrategy (Priority: 10)
Handles tone mark disambiguation using context. Accepts an optional provider for word frequency lookup to suppress ambiguous high-frequency forms.
```python
from myspellchecker.core.validation_strategies.tone_strategy import ToneValidationStrategy
from myspellchecker.text.tone import ToneDisambiguator

disambiguator = ToneDisambiguator()
strategy = ToneValidationStrategy(
    tone_disambiguator=disambiguator,
    provider=provider,         # Optional: for frequency-based suppression
    confidence_threshold=0.5,  # Minimum confidence to report error
)
errors = strategy.validate(context)
```
Detection:
- Missing tone marks (ငါ → ငါး in number context)
- Wrong tone marks based on context
- Ambiguous words resolved by surrounding words
Frequency-based suppression: When both the original word and the correction are high-frequency (above high_freq_threshold), the error is suppressed. This prevents false positives on grammatically ambiguous forms like သူ့ (possessive) vs သူ (subject) where both are valid.
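A minimal sketch of that suppression rule, assuming `high_freq_threshold` is a plain frequency cutoff; the helper name and the default value shown are illustrative, not the strategy's actual signature:

```python
# Hypothetical helper illustrating frequency-based suppression: when both
# the flagged word and its proposed correction are high-frequency, the
# tone error is treated as a valid ambiguous form and dropped.
def should_suppress(word_freq: int, correction_freq: int,
                    high_freq_threshold: int = 50_000) -> bool:
    return (word_freq >= high_freq_threshold
            and correction_freq >= high_freq_threshold)
```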
OrthographyValidationStrategy (Priority: 15)
Validates medial consonant ordering and compatibility (UTN #11 rules) at the word level. Uses a two-step check: medial order first, then compatibility. Accepts an optional provider for sorting suggestions by dictionary validity.
```python
from myspellchecker.core.validation_strategies.orthography_strategy import OrthographyValidationStrategy

strategy = OrthographyValidationStrategy(
    provider=provider,  # Optional: sort suggestions by validity
    confidence=0.9,     # Default confidence for orthography errors
)
```
Detection:
- Medial order errors: Incorrect medial consonant order (e.g., ွ before ျ), which generates stripped variant suggestions
- Compatibility errors: Incompatible medial-consonant combinations with no suggestions, because the combination is invalid
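A simplified sketch of the two-step check. Canonical medial order per UTN #11 is ja (U+103B), ra (U+103C), wa (U+103D), ha (U+103E); the real strategy's compatibility rules are richer than the single pair shown here, and the function names are illustrative.

```python
# Canonical medial order per UTN #11: ja, ra, wa, ha.
MEDIAL_ORDER = ["\u103B", "\u103C", "\u103D", "\u103E"]

def medials_in_order(cluster: str) -> bool:
    # Step 1: medials must appear in canonical order
    ranks = [MEDIAL_ORDER.index(c) for c in cluster if c in MEDIAL_ORDER]
    return ranks == sorted(ranks)

def medials_compatible(cluster: str) -> bool:
    # Step 2: ja and ra medials never combine on one consonant
    present = {c for c in cluster if c in MEDIAL_ORDER}
    return not {"\u103B", "\u103C"}.issubset(present)
```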
SyntacticValidationStrategy (Priority: 20)
Validates grammar rules and particle usage.
```python
from myspellchecker.core.validation_strategies.syntactic_strategy import SyntacticValidationStrategy

strategy = SyntacticValidationStrategy(
    syntactic_rule_checker=syntactic_checker,
    confidence=0.80
)
errors = strategy.validate(context)
```
Detection:
- Particle errors (မှာ vs မှ)
- Medial confusion (ျ vs ြ)
- Missing particles
- Invalid word combinations
- Duplicated sentence endings (e.g., သည်သည်), detected via fast-path before full syntactic check
- Split polite forms (ပါတယ် → ပါ + တယ်), automatically skipped to avoid false positives
BrokenCompoundStrategy (Priority: 25)
Detects compound words that were incorrectly split by a space. This is the inverse of merged word detection — instead of finding words that were wrongly joined, it finds words that were wrongly separated.
```python
from myspellchecker.core.validation_strategies.broken_compound_strategy import BrokenCompoundStrategy

strategy = BrokenCompoundStrategy(
    provider=provider,
    rare_threshold=2000,          # Max frequency for a word to be "rare"
    compound_min_frequency=5000,  # Min frequency for the compound form
    compound_ratio=5.0,           # Min ratio of compound_freq / rare_word_freq
    confidence=0.8
)
errors = strategy.validate(context)
```
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| provider | WordRepository | required | Word repository with is_valid_word and get_word_frequency |
| rare_threshold | int | 2000 | Maximum frequency for a word to be considered "rare" |
| compound_min_frequency | int | 5000 | Minimum frequency for the compound to be flagged |
| compound_ratio | float | 5.0 | Minimum ratio of compound frequency to rare word frequency |
| confidence | float | 0.8 | Confidence score for broken compound errors |
Detection:
- Adjacent word pairs whose concatenation forms a valid, common dictionary word
- At least one component must be a rare word (below rare_threshold)
- The compound form must be significantly more common than the rarer component
- Skips Pali/Sanskrit stacking fragments (virama U+1039) to avoid false positives
Example: “မနက် ဖြန်” (wrongly split) is flagged because “မနက်ဖြန်” (tomorrow) is a valid compound that is much more common than the rare component “ဖြန်”.
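The three conditions above can be sketched as a single predicate; `freq` and `is_word` stand in for the provider's `get_word_frequency` and `is_valid_word`, and the function name is illustrative:

```python
# Sketch of the broken-compound heuristic with the documented defaults.
def looks_broken(left, right, freq, is_word,
                 rare_threshold=2000, compound_min_frequency=5000,
                 compound_ratio=5.0):
    compound = left + right
    # The concatenation must be a valid, common dictionary word
    if not is_word(compound) or freq(compound) < compound_min_frequency:
        return False
    # At least one component must be rare
    rare_freq = min(freq(left), freq(right))
    if rare_freq >= rare_threshold:
        return False  # both components common; the split is plausible
    # The compound must be much more common than the rarer component
    return freq(compound) >= compound_ratio * max(rare_freq, 1)
```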
POSSequenceValidationStrategy (Priority: 30)
Validates POS tag sequences against expected patterns.
```python
from myspellchecker.core.validation_strategies.pos_sequence_strategy import POSSequenceValidationStrategy

strategy = POSSequenceValidationStrategy(
    viterbi_tagger=pos_tagger,
    pos_disambiguator=disambiguator,  # Optional: resolves multi-POS tags using R1-R5 rules
    confidence=0.70,
)
errors = strategy.validate(context)
```
Detection:
- P-P: Consecutive particles → error (always flagged)
- N-N: Consecutive nouns without particle → warning (logged, not surfaced as error)
- V-V: Consecutive verbs → info (serial verb constructions are usually valid)
- V+N / N+V multi-POS check: When a noun also has V in its dictionary POS, validates context
- Sentence-final predicate check: Flags sentences with structural particles but no verb, suggests ဖြစ်သည် or ဖြစ်ပါသည်
POS disambiguation: When tags contain | (multi-POS), the optional pos_disambiguator resolves them using context-based R1-R5 rules before validation. Disambiguated tags are stored back in context for downstream strategies.
Serial Verb Support:
Myanmar is a serial verb language where verb-verb (V-V) sequences are often valid. The strategy recognizes valid serial verb constructions:
- Auxiliary verbs: နေ (progressive), ထား (resultative), လိုက် (action manner)
- Modal verbs: နိုင် (ability), ချင် (desire), ရ (permission)
- Directional verbs: သွား (away), လာ (toward)
```python
# "စားသွား" (eat+go = go eat) is a valid V-V sequence
# Strategy checks is_valid_verb_sequence() before flagging V-V as error
```
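A whitelist-style check of the kind described above can be sketched as follows; the verb sets contain only the examples listed on this page and are not the strategy's full inventory:

```python
# Sketch of a serial-verb whitelist; sets are illustrative, not exhaustive.
AUXILIARY = {"နေ", "ထား", "လိုက်"}        # progressive, resultative, manner
MODAL = {"နိုင်", "ချင်", "ရ"}             # ability, desire, permission
DIRECTIONAL = {"သွား", "လာ"}             # away, toward
SERIAL_SECOND_VERBS = AUXILIARY | MODAL | DIRECTIONAL

def is_valid_verb_sequence(v1: str, v2: str) -> bool:
    # A V-V pair is accepted when the second verb is a known
    # auxiliary, modal, or directional verb.
    return v2 in SERIAL_SECOND_VERBS
```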
QuestionStructureValidationStrategy (Priority: 40)
Validates question sentence structure.
```python
from myspellchecker.core.validation_strategies.question_strategy import QuestionStructureValidationStrategy

strategy = QuestionStructureValidationStrategy(
    confidence=0.75
)
errors = strategy.validate(context)
```
Detection:
- Missing question particles (လား, သလဲ)
- Wrong question particle for context
- Question word agreement
- Implicit questions: 2nd-person pronouns + completive endings detected as implicit questions (lower confidence ~0.55)
- Malformed question endings: Split ရဲ့ လဲ tokens merged and corrected
- Segmentation fragment filtering: Question words adjacent to previous word (no space gap) are masked to prevent false positives
Enclitic Question Particles:
The strategy detects question particles attached directly to verbs (enclitics):
```python
# "သွားလား" (go+question = did you go?) is recognized as a proper question
# No error generated for verb+particle combinations
```
Negative Indefinite Handling:
The strategy correctly identifies negative indefinite constructions as statements, not questions:
```python
# "ဘယ်သူမှ မလာဘူး" = "Nobody came" (statement, NOT question)
# Question word + "မှ" suffix + negative verb = statement pattern
```
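The statement-vs-question pattern described in those comments can be sketched like this; the question-word list is a small illustrative sample, and the function name is hypothetical:

```python
# Sketch of negative-indefinite detection: a question word carrying the
# မှ suffix followed by a မ-negated verb is a statement, not a question.
QUESTION_WORDS = {"ဘယ်သူ", "ဘာ", "ဘယ်"}  # who, what, which (sample only)

def is_negative_indefinite(words: list[str]) -> bool:
    for i, w in enumerate(words[:-1]):
        if any(w == q + "မှ" for q in QUESTION_WORDS):
            if words[i + 1].startswith("မ"):  # negative verb prefix
                return True
    return False
```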
ConfusableSemanticStrategy (Priority: 48), Opt-in Required
MLM-enhanced confusable detection that uses masked language modeling to catch valid-word confusables. Dynamically generates confusable variants using phonetic rules (aspiration swaps, medial swaps, tone mark changes, nasal endings) and uses MLM logits to determine if a variant is more likely in context. Requires a trained ONNX model.
```python
from myspellchecker.core.validation_strategies.confusable_semantic_strategy import ConfusableSemanticStrategy

strategy = ConfusableSemanticStrategy(
    semantic_checker=semantic_checker,
    provider=provider,
    confidence=0.80,
    top_k=50,
    logit_diff_threshold=3.0,
    logit_diff_threshold_medial=2.0,
    logit_diff_threshold_current_in_topk=5.0,
    high_freq_threshold=50000,
    high_freq_logit_diff=6.0,
    min_word_length=2
)
errors = strategy.validate(context)
```
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| semantic_checker | SemanticChecker | required | SemanticChecker with loaded ONNX model |
| provider | NgramRepository | required | Provider with word lookup and frequency data |
| confidence | float | 0.80 | Confidence score for confusable errors |
| top_k | int | 50 | Number of top MLM predictions to consider |
| logit_diff_threshold | float | 3.0 | Default logit difference threshold |
| logit_diff_threshold_medial | float | 2.0 | Lower threshold for ျ↔ြ medial swaps |
| logit_diff_threshold_current_in_topk | float | 5.0 | Stricter threshold when current word is in top-K |
| high_freq_threshold | int | 50000 | Frequency above which stricter thresholds apply |
| high_freq_logit_diff | float | 6.0 | Logit diff threshold for high-frequency words |
| min_word_length | int | 2 | Minimum word length to check |
| freq_ratio_penalty_high | float | 2.0 | Additive penalty when variant/word frequency ratio exceeds 5x |
| freq_ratio_penalty_mid | float | 1.0 | Additive penalty when ratio exceeds 2x |
| visarga_penalty | float | 2.0 | Additive penalty for visarga-only pairs |
| sentence_final_penalty | float | 0.5 | Additive penalty for sentence-final position |
Asymmetric thresholds protect against false positives with stacking penalties:
- Base threshold (highest wins): high-frequency word (6.0), current in top-K (5.0), medial ျ↔ြ swap (2.0), default (3.0)
- Additive penalties: frequency-ratio (+2.0 or +1.0), visarga-pair (+2.0), sentence-final (+0.5)
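One way to read that combination, as a sketch: the base threshold is picked first (a medial swap lowers it, high frequency or top-K membership raises it), then the penalties stack additively. The function and argument names are illustrative, with the documented defaults hard-coded.

```python
# Illustrative combination of base thresholds and stacking penalties.
def required_logit_diff(word_freq, is_medial_swap, current_in_topk,
                        freq_ratio, is_visarga_pair, is_sentence_final):
    # Base threshold: medial swaps get the lower 2.0 base; high-frequency
    # words and current-in-top-K raise it (highest applicable wins).
    base = 2.0 if is_medial_swap else 3.0
    if current_in_topk:
        base = max(base, 5.0)
    if word_freq >= 50_000:
        base = max(base, 6.0)
    # Additive penalties stack on top of the base.
    if freq_ratio > 5:
        base += 2.0
    elif freq_ratio > 2:
        base += 1.0
    if is_visarga_pair:
        base += 2.0
    if is_sentence_final:
        base += 0.5
    return base
```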
Detection:
- Generates confusable variants dynamically from phonetic rules (aspiration swaps, medial swaps, tone marks, nasal endings)
- Uses a single predict_mask() call per candidate word to compare MLM logits
- Skips positions already flagged by earlier strategies
- High-frequency visarga pairs (both words above threshold) are hard-blocked to prevent false positives
NgramContextValidationStrategy (Priority: 50)
Uses bigram/trigram probabilities to detect unlikely sequences.
```python
from myspellchecker.core.validation_strategies.ngram_strategy import NgramContextValidationStrategy

strategy = NgramContextValidationStrategy(
    context_checker=ngram_checker,
    provider=provider,
    confidence_high=0.75,
    confidence_low=0.6,
    max_suggestions=5,
    edit_distance=2
)
errors = strategy.validate(context)
```
Detection:
- Low probability word pairs
- Unusual word combinations
- Real-word errors (correct spelling, wrong context)
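For intuition, here is a generic Laplace-smoothed bigram score of the kind such a checker might threshold. This is textbook n-gram smoothing, not the library's internal formula; an unseen pair falls back to a low floor probability.

```python
import math

def bigram_logprob(prev: str, word: str, bigram_counts: dict,
                   unigram_counts: dict, vocab_size: int) -> float:
    # Laplace-smoothed conditional log P(word | prev).
    num = bigram_counts.get((prev, word), 0) + 1
    den = unigram_counts.get(prev, 0) + vocab_size
    return math.log(num / den)
```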
HomophoneValidationStrategy (Priority: 45)
Detects homophone confusion based on context.
```python
from myspellchecker.core.validation_strategies.homophone_strategy import HomophoneValidationStrategy

strategy = HomophoneValidationStrategy(
    homophone_checker=homophone_checker,
    provider=ngram_provider,
    context_checker=context_checker,  # NgramContextChecker instance
    confidence=0.80,
)
errors = strategy.validate(context)
```
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| homophone_checker | HomophoneChecker or None | required | HomophoneChecker instance; if None, strategy is disabled |
| provider | NgramRepository | required | Provider for N-gram probability lookups |
| context_checker | NgramContextChecker or None | None | NgramContextChecker that performs N-gram comparison via check_word_in_context() |
| confidence | float | 0.8 | Confidence score for homophone errors |
Legacy kwargs (improvement_ratio, min_probability, high_freq_threshold, high_freq_improvement_ratio) are accepted but ignored for backward compatibility. These thresholds are managed internally by NgramContextChecker.compute_required_ratio().
Detection:
- Homophone pairs (ကား/ကာ, သာ/သား)
- Context-based correct form selection
- Sound-alike word confusion
SemanticValidationStrategy (Priority: 70), Opt-in Required
AI-powered validation using ONNX models. This strategy is not active by default. You must train a semantic model first, then configure SemanticConfig with the model path and set use_proactive_scanning=True.
```python
from myspellchecker.core.validation_strategies.semantic_strategy import SemanticValidationStrategy

strategy = SemanticValidationStrategy(
    semantic_checker=semantic_checker,
    provider=provider,             # DictionaryProvider for word lookups
    use_proactive_scanning=True,   # Must be True to enable (False by default)
    proactive_confidence_threshold=0.85,
    min_word_length=2,
)
errors = strategy.validate(context)
```
use_proactive_scanning defaults to False. Without setting it to True, this strategy produces no errors even if a semantic model is loaded. Both a trained model and use_proactive_scanning=True are required.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| semantic_checker | SemanticChecker or None | required | SemanticChecker with loaded ONNX model; if None, strategy is disabled |
| provider | DictionaryProvider | required | Provider for word frequency and validity lookups |
| use_proactive_scanning | bool | False | Enable proactive semantic scanning. Must be True for this strategy to do anything |
| proactive_confidence_threshold | float | 0.85 | Minimum confidence to report semantic errors |
| min_word_length | int | 2 | Minimum word length for semantic analysis |
Detection (two independent sub-checks):
- Proactive semantic scan: Masks each word and checks if MLM predictions disagree strongly with the original, limited to 8 predict_mask() calls per sentence
- Animacy detection: Flags inanimate subjects before subject/topic particles (က, ကို, သည်, မှာ, တွင်) and always runs even when proactive scanning is skipped
Error budget optimization: Proactive scanning is automatically skipped when there are already errors in the context (from earlier strategies), preventing cascade false positives from corrupted MLM context. Animacy detection is unaffected and always runs.
Skipped words: Common function words (particles and conjunctions, 22 words total) are excluded from proactive scanning as MLM disagreement on these is noise.
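The gating rules above reduce to a small decision. This sketch returns check names for clarity; the real strategy returns errors, and the function name is hypothetical.

```python
# Sketch of the semantic strategy's gating: animacy detection always
# runs; the proactive scan is skipped when earlier strategies already
# flagged errors (error budget) or when scanning is not enabled.
def semantic_checks_to_run(existing_errors: dict,
                           use_proactive_scanning: bool) -> list[str]:
    checks = ["animacy"]  # always runs
    if use_proactive_scanning and not existing_errors:
        checks.append("proactive_scan")
    return checks
```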
Creating Custom Strategies
Implement the ValidationStrategy abstract base class:
```python
from myspellchecker.core.validation_strategies.base import (
    ValidationStrategy,
    ValidationContext,
)
from myspellchecker.core.response import Error, ContextError


class CustomValidationStrategy(ValidationStrategy):
    """Custom validation strategy."""

    def __init__(self, config: dict):
        self.config = config

    def validate(self, context: ValidationContext) -> list[Error]:
        """Validate and return errors."""
        errors = []
        for i, word in enumerate(context.words):
            # Skip if already has an error
            if context.word_positions[i] in context.existing_errors:
                continue
            # Skip proper names
            if i < len(context.is_name_mask) and context.is_name_mask[i]:
                continue
            # Your validation logic
            if self._is_invalid(word, context):
                errors.append(ContextError(
                    text=word,
                    position=context.word_positions[i],
                    error_type="custom_error",
                    suggestions=self._get_suggestions(word),
                    confidence=0.80,
                    probability=0.0,
                    prev_word=context.words[i - 1] if i > 0 else "",
                ))
                # Mark as having error (existing_errors is a dict[int, str])
                context.existing_errors[context.word_positions[i]] = "custom_error"
        return errors

    def priority(self) -> int:
        """Return priority (lower runs first)."""
        return 45  # Between POS and N-gram

    def _is_invalid(self, word: str, context: ValidationContext) -> bool:
        # Implement validation logic
        return False

    def _get_suggestions(self, word: str) -> list[str]:
        # Generate suggestions
        return []
```
Strategy Composition
In the default pipeline, SpellChecker coordinates validation directly through its validators:
- SyllableValidator: validates each syllable (layer 1)
- WordValidator: validates words via SymSpell (layer 2)
- ContextValidator: orchestrates validation strategies (layer 3)
The ContextValidator receives a list of strategies built by SpellCheckerBuilder and executes them in priority order within each sentence.
```python
from myspellchecker.core.builder import SpellCheckerBuilder

# Builder wires strategies automatically based on config
checker = SpellCheckerBuilder().with_config(config).with_provider(provider).build()
result = checker.check("မြန်မာ စာ")
```
Execution Order
- Strategies are sorted by priority (ascending)
- Each strategy receives the shared ValidationContext
- Strategies can check existing_errors to skip already-flagged words
- Strategies add their flagged positions to existing_errors
- Errors from all strategies are collected and returned
Configuration
Enable/disable strategies via configuration:
```python
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig  # SemanticConfig import path assumed
from myspellchecker.core.config.validation_configs import ValidationConfig

config = SpellCheckerConfig(
    use_context_checker=True,  # Enable N-gram strategy
    use_phonetic=True,         # Enable homophone detection
    validation=ValidationConfig(
        use_homophone_detection=True,     # Toggle homophone strategy (default: True)
        use_orthography_validation=True,  # Toggle orthography strategy (default: True)
        enable_strategy_timing=False,     # Per-strategy timing at DEBUG level (default: False)
    ),
    # Semantic config enables semantic strategy (opt-in, requires trained model)
    semantic=SemanticConfig(
        model_path="./my-model/model.onnx",  # Your trained model
        use_proactive_scanning=True,
    ),
)
```
Error Types
Each strategy produces specific error types:
| Error Type | Strategy | Description |
|---|---|---|
| tone_ambiguity | Tone | Tone mark disambiguation |
| medial_order_error | Orthography | Medial consonant order/compatibility |
| syntax_error | Syntactic | Grammar rule violation |
| broken_compound | BrokenCompound | Wrongly split compound word |
| pos_sequence_error | POS | Invalid POS sequence (P-P) |
| question_structure | Question | Question structure issue |
| homophone_error | Homophone | Sound-alike confusion |
| confusable_error | ConfusableSemantic | Phonetic/visual confusable (MLM-based) |
| context_probability | N-gram | Low probability sequence |
| semantic_error | Semantic | AI-detected anomaly (opt-in) |
Best Practices
- Priority Selection: Choose priorities that make sense for your validation order
- Skip Flagged Words: Always check existing_errors to avoid duplicate errors
- Skip Names: Respect the is_name_mask to avoid flagging proper names
- Confidence Scores: Use appropriate confidence levels for your error type
- Performance: Heavy validations (semantic) should run last
See Also