N-gram models excel at scoring how “natural” a word sequence sounds, but they can’t enforce hard constraints. A double-verb sequence without an intervening particle is always wrong in Myanmar, regardless of how common the individual words are. The Syntactic Rule Checker runs as Layer 2.5, between word validation and N-gram scoring, to enforce these non-negotiable grammatical patterns.
Overview
The Syntactic Rule Checker targets four kinds of error that statistical scoring alone tends to miss. It is designed to catch:
- Grammatically Impossible Sequences: e.g., two verbs without a particle between them.
- Particle Errors: Using a noun-particle after a verb, or vice versa.
- Medial Consonant Confusions: ျ vs ြ based on linguistic roots (handled via lookup).
- Sentence Structure: Missing sentence-final particles.
Architecture
The grammar system consists of:
Core Components
| Component | Location | Description |
|---|---|---|
| SyntacticRuleChecker | src/myspellchecker/grammar/engine.py | Main grammar engine |
| GrammarEngineConfig | src/myspellchecker/core/config/grammar_configs.py | Engine configuration |
| Grammar Checkers | src/myspellchecker/grammar/checkers/ | Specialized checkers |
Grammar Checkers
| Checker | File | Purpose |
|---|---|---|
| AspectChecker | grammar/checkers/aspect.py | Aspect marker validation |
| ClassifierChecker | grammar/checkers/classifier.py | Classifier usage |
| CompoundChecker | grammar/checkers/compound.py | Compound word validation |
| MergedWordChecker | grammar/checkers/merged_word.py | Segmenter mis-merge detection |
| NegationChecker | grammar/checkers/negation.py | Negation patterns |
| RegisterChecker | grammar/checkers/register.py | Formal/informal register |
YAML Rule Files
Grammar rules are defined in YAML files located in src/myspellchecker/rules/:
Core Grammar
| File | Purpose |
|---|---|
| grammar_rules.yaml | Core grammar rules (POS sequences, particle agreement) |
| particles.yaml | Particle definitions and rules |
| negation.yaml | Negation patterns |
| register.yaml | Register (formal/informal) rules |
| tone_rules.yaml | Tone mark rules |
| tense_markers.yaml | Tense marker definitions |
| morphology.yaml | Morphological rules |
| morphotactics.yaml | Morphotactic constraints |
| pronouns.yaml | Pronoun definitions |
Checkers and Patterns
| File | Purpose |
|---|---|
| aspects.yaml | Aspect marker rules |
| classifiers.yaml | Classifier patterns |
| compounds.yaml | Compound word rules |
| collocations.yaml | Collocation patterns |
| semantic_rules.yaml | Semantic validation rules |
Confusion and Error Correction
| File | Purpose |
|---|---|
| homophones.yaml | Homophone confusion patterns |
| homophone_confusion.yaml | Homophone confusion matrix for detection |
| confusable_pairs.yaml | Visually/phonetically confusable word pairs |
| confusion_matrix.yaml | Character-level confusion matrix |
| compound_confusion.yaml | Compound word confusion patterns |
| medial_confusion.yaml | Medial consonant (ျ/ြ) confusion lookup |
| stacking_pairs.yaml | Stacking consonant pair rules |
| medial_swap_pairs.yaml | Medial swap pair corrections |
| typo_corrections.yaml | Common typo corrections |
| orthographic_corrections.yaml | Orthographic normalization rules |
Detection and Scoring
| File | Purpose |
|---|---|
| detector_confidences.yaml | Per-detector confidence thresholds |
| corruption_weights.yaml | Weights for synthetic error corruption |
| ambiguous_words.yaml | Ambiguous word definitions |
| pos_inference.yaml | POS inference rules |
| rerank_rules.yaml | Suggestion reranking rules |
Key Features
1. Particle Agreement
Myanmar particles are highly specific to the part of speech they modify.
- Verb Particles: မယ်, ခဲ့, နေ must follow verbs.
- Noun Particles: မှာ, က, ကို must follow nouns.
Example Error:
- Input: ကျောင်း သွား မှာ (“School go at”)
- Analysis: သွား is a Verb. မှာ is usually a location marker (Noun particle).
- Correction: Suggest မယ် (Future tense) or မလား (Question) depending on context, or flag as suspicious.
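The agreement rule above can be sketched in a few lines. This is a minimal illustration rather than the library's implementation: the particle sets mirror the examples in this section, while `check_particle_agreement` and the hand-supplied POS tags (`N`, `V`, `P`) are hypothetical.

```python
# Minimal sketch of particle agreement (hypothetical helper, not the library API).
VERB_PARTICLES = {"မယ်", "ခဲ့", "နေ"}
NOUN_PARTICLES = {"မှာ", "က", "ကို"}


def check_particle_agreement(words, pos_tags):
    """Return (index, particle, reason) for each particle whose host has the wrong POS."""
    issues = []
    for i in range(1, len(words)):
        prev_tag, word = pos_tags[i - 1], words[i]
        if word in NOUN_PARTICLES and prev_tag == "V":
            issues.append((i, word, "noun particle after verb"))
        elif word in VERB_PARTICLES and prev_tag == "N":
            issues.append((i, word, "verb particle after noun"))
    return issues


# Example from this section: the verb သွား is followed by the noun particle မှာ.
issues = check_particle_agreement(["ကျောင်း", "သွား", "မှာ"], ["N", "V", "P"])
print(issues)
```

The sketch only flags the mismatch. Choosing between မယ် and မလား as a suggestion would need the wider context described above.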
2. Medial Consonant Confusion
Many words sound similar but use different medials (Ya-pin vs Ya-yit).
- Rule: ကျောင်း (School) vs ကြောင်း (Cause/Fact).
- Logic:
  - If preceded by a Verb (e.g., ဖြစ်), it implies “Cause”. Suggest ကြောင်း.
  - If preceded by a Noun or at start, it implies “School”. Keep ကျောင်း.
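As a rough sketch, the context rule for this one pair could look like the following. The function name `suggest_medial` and the single-letter POS tags are assumptions made for illustration; the library itself resolves these cases through its lookup files.

```python
# Hypothetical context rule for the ကျောင်း / ကြောင်း pair (illustration only).
SCHOOL = "ကျောင်း"  # Ya-pin medial
CAUSE = "ကြောင်း"  # Ya-yit medial


def suggest_medial(word, prev_tag):
    """Return a suggested replacement for `word`, or None to keep it.

    prev_tag is the POS tag of the preceding word, or None at sentence start.
    """
    if word == SCHOOL and prev_tag == "V":
        return CAUSE  # after a verb, the "cause/fact" reading is expected
    return None  # after a noun or at sentence start, keep the word as written


print(suggest_medial(SCHOOL, "V"))   # suggests the Ya-yit form
print(suggest_medial(SCHOOL, None))  # sentence start: no suggestion
```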
3. POS Sequence Validation
We maintain a matrix of invalid POS transitions.
- Verb -> Verb (Directly): Usually invalid. Needs a particle like ပြီး or ၍.
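One simple way to represent such a matrix is a lookup keyed by adjacent tag pairs. The sketch below assumes single-letter tags and a hypothetical helper name; it is not the engine's actual data structure.

```python
# Sketch of an invalid-transition lookup (tags and structure are assumptions).
INVALID_TRANSITIONS = {
    ("V", "V"): "verbs usually need a linking particle such as ပြီး or ၍",
}


def find_invalid_transitions(pos_tags):
    """Return (index, reason) for each adjacent tag pair listed as invalid.

    The index points at the second tag of the offending pair.
    """
    hits = []
    for i in range(len(pos_tags) - 1):
        reason = INVALID_TRANSITIONS.get((pos_tags[i], pos_tags[i + 1]))
        if reason is not None:
            hits.append((i + 1, reason))
    return hits


print(find_invalid_transitions(["N", "V", "V"]))       # one hit at index 2
print(find_invalid_transitions(["N", "V", "P", "V"]))  # particle in between: no hits
```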
Configuration
Enable Grammar Checking
```python
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig

# Enable rule-based validation
config = SpellCheckerConfig(
    use_rule_based_validation=True,  # Enable grammar checking
)
checker = SpellChecker(config=config)
```
Grammar Engine Configuration
```python
from myspellchecker.core.config.grammar_configs import GrammarEngineConfig

# GrammarEngineConfig controls confidence thresholds for grammar checks
grammar_config = GrammarEngineConfig(
    default_confidence_threshold=0.80,
    exact_match_confidence=0.95,
    high_confidence=0.90,
    medium_confidence=0.85,
    pos_sequence_confidence=0.80,
)
```
Note: GrammarRuleConfig in myspellchecker.grammar.config is a separate class for loading YAML-based grammar rule definitions.
Using SyntacticRuleChecker Directly
```python
from myspellchecker.grammar.engine import SyntacticRuleChecker

# Create checker with provider.
# `provider` must already exist; its construction is not shown in this section.
checker = SyntacticRuleChecker(provider=provider)

# Check a sentence
words = ["ကျောင်း", "သွား", "မှာ"]
errors = checker.check_sequence(words)

for error in errors:
    # Each error is a tuple of (index, error_word, suggestion)
    index, word, suggestion = error
    print(f"Position: {index}")
    print(f"Word: {word}")
    print(f"Suggestion: {suggestion}")
```
Internally, the grammar engine uses the RuleMatch dataclass (exported from myspellchecker.grammar.engine) for priority-based conflict resolution. RuleMatch has the fields position, word, suggestion, priority, rule_name, and confidence. The check_sequence method resolves conflicts and returns simplified (index, word, suggestion) tuples.
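To make the idea concrete, here is a sketch of priority-based resolution using a local stand-in dataclass with the same field names. It assumes that a larger `priority` wins and that `confidence` breaks ties; the engine's real ordering rules, and the rule names used in the example, are not confirmed by this page.

```python
from dataclasses import dataclass


@dataclass
class RuleMatchSketch:
    """Stand-in with the fields listed above; not the library's RuleMatch."""
    position: int
    word: str
    suggestion: str
    priority: int
    rule_name: str
    confidence: float


def resolve_conflicts(matches):
    """Keep one match per position, then reduce to (index, word, suggestion)."""
    best = {}
    for m in matches:
        current = best.get(m.position)
        # Assumption: a larger priority wins; confidence breaks ties.
        if current is None or (m.priority, m.confidence) > (
            current.priority,
            current.confidence,
        ):
            best[m.position] = m
    return [(m.position, m.word, m.suggestion) for _, m in sorted(best.items())]


# Two hypothetical rules fire on the same word; only the higher priority survives.
matches = [
    RuleMatchSketch(2, "မှာ", "မလား", 5, "question_particle", 0.80),
    RuleMatchSketch(2, "မှာ", "မယ်", 10, "particle_agreement", 0.90),
]
resolved = resolve_conflicts(matches)
print(resolved)
```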
Grammar rules are defined in YAML format with JSON Schema validation.
Example: grammar_rules.yaml
```yaml
# Grammar rules for Myanmar language
version: "1.0.0"
pos_sequences:
  invalid:
    - ["V", "V"]  # Verb-Verb without particle
    - ["N", "N", "N", "N"]  # Too many consecutive nouns
  valid:
    - ["N", "P", "V", "P"]  # Subject-particle-verb-particle
    - ["N", "V", "P"]  # Subject-verb-particle
particle_rules:
  verb_particles:
    - "မယ်"
    - "ခဲ့"
    - "နေ"
    - "ပြီ"
  noun_particles:
    - "က"
    - "ကို"
    - "မှာ"
    - "တွင်"
```
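Because the invalid patterns have different lengths, one plausible way to apply them is as sliding windows over a tagged sentence. The sketch below keeps the rules in a plain Python dict that mirrors the `pos_sequences` block, so it runs without a YAML parser; `match_invalid_patterns` is a hypothetical helper, and how the library actually loads and applies the file is not shown here.

```python
# The dict mirrors the pos_sequences block of the YAML example (assumption:
# the loaded file has this shape once parsed).
RULES = {
    "pos_sequences": {
        "invalid": [["V", "V"], ["N", "N", "N", "N"]],
    }
}


def match_invalid_patterns(pos_tags, rules=RULES):
    """Return (start_index, pattern) for every invalid pattern found in pos_tags."""
    found = []
    for pattern in rules["pos_sequences"]["invalid"]:
        width = len(pattern)
        # Slide a window of the pattern's width across the tag sequence.
        for start in range(len(pos_tags) - width + 1):
            if pos_tags[start:start + width] == pattern:
                found.append((start, tuple(pattern)))
    return found


print(match_invalid_patterns(["N", "V", "V", "P"]))
print(match_invalid_patterns(["N", "P", "V", "P"]))
```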
Schema Validation
Rules are validated against JSON Schema:
src/myspellchecker/schemas/grammar_rules.schema.json
Error Types
Grammar checking produces errors with specific types:
| Error Type | Description |
|---|
grammar_error | General grammar violation |
particle_typo | Particle usage error |
pos_sequence_error | Invalid POS sequence |
aspect_typo | Aspect marker error |
classifier_typo | Classifier error |
mixed_register | Mixed formal/informal |
incomplete_reduplication | Incomplete reduplication |
Integration with SpellChecker
Grammar checking is integrated into the validation pipeline:
```python
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig

config = SpellCheckerConfig(
    use_rule_based_validation=True,
    use_context_checker=True,
)
checker = SpellChecker(config=config)
result = checker.check("ကျောင်း သွား မှာ")

# Filter grammar errors
grammar_errors = [
    e for e in result.errors
    if e.error_type in ("grammar_error", "particle_typo", "pos_sequence_error")
]
```
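The filter above covers three of the error types listed in the Error Types table. If all seven should count as grammar errors, a small helper along these lines may be convenient. It is a sketch only: `split_errors` is a hypothetical name, and the sample objects are stand-ins for real `result.errors` entries, assumed to expose an `error_type` attribute as in the snippet above.

```python
from types import SimpleNamespace

# Error types listed in the "Error Types" table of this page.
GRAMMAR_ERROR_TYPES = frozenset({
    "grammar_error", "particle_typo", "pos_sequence_error", "aspect_typo",
    "classifier_typo", "mixed_register", "incomplete_reduplication",
})


def split_errors(errors):
    """Partition errors into (grammar, other) by their error_type attribute."""
    grammar = [e for e in errors if e.error_type in GRAMMAR_ERROR_TYPES]
    other = [e for e in errors if e.error_type not in GRAMMAR_ERROR_TYPES]
    return grammar, other


# Stand-in objects; "some_other_type" is a placeholder, not a real error type.
sample = [
    SimpleNamespace(error_type="particle_typo"),
    SimpleNamespace(error_type="some_other_type"),
]
grammar, other = split_errors(sample)
print(len(grammar), len(other))
```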
Performance
| Operation | Complexity | Typical Time |
|---|---|---|
| POS Sequence Check | O(n) | ~1ms |
| Particle Validation | O(1) | ~0.5ms |
| Full Grammar Check | O(n) | ~5ms |
Next Steps