Syntactic Grammar Validation

N-gram models excel at scoring how “natural” a word sequence sounds, but they can’t enforce hard constraints. A double-verb sequence without an intervening particle is always wrong in Myanmar, regardless of how common the individual words are. The Syntactic Rule Checker runs as Layer 2.5, between word validation and N-gram scoring, to enforce these non-negotiable grammatical patterns.

Overview

The Syntactic Rule Checker runs after word validation but before the statistical N-gram check. It is designed to catch:

Grammatically Impossible Sequences: e.g., two verbs without a particle between them.
Particle Errors: Using a noun-particle after a verb, or vice versa.
Medial Consonant Confusions: ျ vs ြ based on linguistic roots (handled via lookup).
Sentence Structure: Missing sentence-final particles.

Architecture

The grammar system consists of:

Core Components

Component	Location	Description
`SyntacticRuleChecker`	`src/myspellchecker/grammar/engine.py`	Main grammar engine
`GrammarEngineConfig`	`src/myspellchecker/core/config/grammar_configs.py`	Engine configuration
Grammar Checkers	`src/myspellchecker/grammar/checkers/`	Specialized checkers

Grammar Checkers

Checker	File	Purpose
`AspectChecker`	`grammar/checkers/aspect.py`	Aspect marker validation
`ClassifierChecker`	`grammar/checkers/classifier.py`	Classifier usage
`CompoundChecker`	`grammar/checkers/compound.py`	Compound word validation
`MergedWordChecker`	`grammar/checkers/merged_word.py`	Segmenter mis-merge detection
`NegationChecker`	`grammar/checkers/negation.py`	Negation patterns
`RegisterChecker`	`grammar/checkers/register.py`	Formal/informal register

YAML Rule Files

Grammar rules are defined in YAML files located in src/myspellchecker/rules/:

Core Grammar

File	Purpose
`grammar_rules.yaml`	Core grammar rules (POS sequences, particle agreement)
`particles.yaml`	Particle definitions and rules
`negation.yaml`	Negation patterns
`register.yaml`	Register (formal/informal) rules
`tone_rules.yaml`	Tone mark rules
`tense_markers.yaml`	Tense marker definitions
`morphology.yaml`	Morphological rules
`morphotactics.yaml`	Morphotactic constraints
`pronouns.yaml`	Pronoun definitions

Checkers and Patterns

File	Purpose
`aspects.yaml`	Aspect marker rules
`classifiers.yaml`	Classifier patterns
`compounds.yaml`	Compound word rules
`collocations.yaml`	Collocation patterns
`semantic_rules.yaml`	Semantic validation rules

Confusion and Error Correction

File	Purpose
`homophones.yaml`	Homophone confusion patterns
`homophone_confusion.yaml`	Homophone confusion matrix for detection
`confusable_pairs.yaml`	Visually/phonetically confusable word pairs
`confusion_matrix.yaml`	Character-level confusion matrix
`compound_confusion.yaml`	Compound word confusion patterns
`medial_confusion.yaml`	Medial consonant (`ျ`/`ြ`) confusion lookup
`stacking_pairs.yaml`	Stacking consonant pair rules
`medial_swap_pairs.yaml`	Medial swap pair corrections
`typo_corrections.yaml`	Common typo corrections
`orthographic_corrections.yaml`	Orthographic normalization rules

Detection and Scoring

File	Purpose
`detector_confidences.yaml`	Per-detector confidence thresholds
`corruption_weights.yaml`	Weights for synthetic error corruption
`ambiguous_words.yaml`	Ambiguous word definitions
`pos_inference.yaml`	POS inference rules
`rerank_rules.yaml`	Suggestion reranking rules

Key Features

1. Particle Agreement

Myanmar particles are highly specific to the part of speech they modify.

Verb Particles: မယ်, ခဲ့, နေ must follow verbs.
Noun Particles: မှာ, က, ကို must follow nouns.

Example Error:

Input: ကျောင်း သွား မှာ (“School go at”)
Analysis: သွား is a Verb. မှာ is usually a location marker (Noun particle).
Correction: Suggest မယ် (Future tense) or မလား (Question) depending on context, or flag as suspicious.
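
The agreement check described above can be sketched roughly as follows. This is an illustrative stand-in, not the library's implementation: `VERB_PARTICLES`, `NOUN_PARTICLES`, and `check_particle_agreement` are hypothetical names, and the POS tag of each word is assumed to be known already.

```python
# Illustrative sketch only; the names below are hypothetical, not the library's API.
VERB_PARTICLES = {"မယ်", "ခဲ့", "နေ"}   # expected to follow verbs
NOUN_PARTICLES = {"မှာ", "က", "ကို"}    # expected to follow nouns


def check_particle_agreement(tagged_words):
    """Return (index, particle, reason) for each particle that follows the wrong POS.

    tagged_words is a list of (word, pos) pairs where pos is "N", "V" or "P".
    """
    issues = []
    for i in range(1, len(tagged_words)):
        word, _ = tagged_words[i]
        _, prev_pos = tagged_words[i - 1]
        if word in NOUN_PARTICLES and prev_pos == "V":
            issues.append((i, word, "noun particle after verb"))
        elif word in VERB_PARTICLES and prev_pos == "N":
            issues.append((i, word, "verb particle after noun"))
    return issues


sentence = [("ကျောင်း", "N"), ("သွား", "V"), ("မှာ", "P")]
issues = check_particle_agreement(sentence)
print(issues)
```

The sketch only flags the mismatch. Choosing between မယ် and မလား as a replacement would need the wider context mentioned above.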

2. Medial Confusion (`ျ` vs `ြ`)

Many words sound similar but use different medials (Ya-pin vs Ya-yit).

Rule: ကျောင်း (School) vs ကြောင်း (Cause/Fact).
Logic:
- If preceded by a Verb (e.g., ဖြစ်), it implies “Cause”. Suggest ကြောင်း.
- If preceded by a Noun or at start, it implies “School”. Keep ကျောင်း.
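
A minimal sketch of this context rule, assuming POS tags are available for each word. The lookup table and `suggest_medial` are hypothetical names used for illustration; the actual lookup lives in `medial_confusion.yaml` and may be structured differently.

```python
# Illustrative sketch; the table and function names are hypothetical.
SCHOOL = "ကျောင်း"   # ya-pin form: "school"
CAUSE = "ကြောင်း"    # ya-yit form: "cause/fact"

# Maps a word to the form preferred when the previous word is a verb.
MEDIAL_AFTER_VERB = {SCHOOL: CAUSE}


def suggest_medial(words, pos_tags):
    """Return (index, word, suggestion) for words whose medial conflicts with context."""
    suggestions = []
    for i, word in enumerate(words):
        # The first word has no predecessor, so it is always kept as-is.
        prev_pos = pos_tags[i - 1] if i > 0 else None
        if prev_pos == "V" and word in MEDIAL_AFTER_VERB:
            suggestions.append((i, word, MEDIAL_AFTER_VERB[word]))
    return suggestions


after_verb = suggest_medial(["ဖြစ်", SCHOOL], ["V", "N"])
at_start = suggest_medial([SCHOOL, "သွား"], ["N", "V"])
print(after_verb, at_start)
```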

3. POS Sequence Validation

The checker maintains a matrix of invalid POS transitions.

Verb -> Verb (directly adjacent): Usually invalid. A linking particle such as ပြီး or ၍ is needed between the two verbs.
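
As a rough illustration, the transition matrix can be thought of as a set of forbidden adjacent tag pairs. The tag names (`N`, `V`, `P`) and `find_invalid_transitions` are assumptions made for this sketch, not the engine's internal representation.

```python
# Illustrative sketch; tag names and the set below are assumptions.
INVALID_TRANSITIONS = {("V", "V")}  # a verb directly followed by another verb


def find_invalid_transitions(pos_tags):
    """Return the index of the second tag in every invalid adjacent pair."""
    return [
        i
        for i in range(1, len(pos_tags))
        if (pos_tags[i - 1], pos_tags[i]) in INVALID_TRANSITIONS
    ]


direct = find_invalid_transitions(["N", "V", "V", "P"])
with_particle = find_invalid_transitions(["N", "V", "P", "V", "P"])
print(direct, with_particle)
```

In the first call the two verbs are adjacent and the second one is flagged. In the second call a particle separates them, so nothing is reported.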

Configuration

Enable Grammar Checking

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig

# Enable rule-based validation
config = SpellCheckerConfig(
    use_rule_based_validation=True,  # Enable grammar checking
)
checker = SpellChecker(config=config)

Grammar Engine Configuration

from myspellchecker.core.config.grammar_configs import GrammarEngineConfig

# GrammarEngineConfig controls confidence thresholds for grammar checks
grammar_config = GrammarEngineConfig(
    default_confidence_threshold=0.80,
    exact_match_confidence=0.95,
    high_confidence=0.90,
    medium_confidence=0.85,
    pos_sequence_confidence=0.80,
)

Note: GrammarRuleConfig in myspellchecker.grammar.config is a separate class for loading YAML-based grammar rule definitions.

Using SyntacticRuleChecker Directly

from myspellchecker.grammar.engine import SyntacticRuleChecker

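# Create checker with a provider.
# `provider` is assumed to be an already-constructed provider instance
# (its creation is not shown here).
checker = SyntacticRuleChecker(provider=provider)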

# Check a sentence
words = ["ကျောင်း", "သွား", "မှာ"]
errors = checker.check_sequence(words)

for error in errors:
    # Each error is a tuple of (index, error_word, suggestion)
    index, word, suggestion = error
    print(f"Position: {index}")
    print(f"Word: {word}")
    print(f"Suggestion: {suggestion}")

Internally, the grammar engine uses the RuleMatch dataclass (exported from myspellchecker.grammar.engine) for priority-based conflict resolution. RuleMatch has the fields position, word, suggestion, priority, rule_name, and confidence. The check_sequence method resolves conflicts between overlapping matches and returns simplified (index, word, suggestion) tuples.
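
To make the idea of priority-based resolution concrete, here is a small sketch. `RuleMatchSketch` is a stand-in that only mirrors the field names listed above; it is not the library's `RuleMatch`. The resolution policy (larger priority wins, confidence breaks ties) is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass
class RuleMatchSketch:
    """Stand-in with the fields the text lists; not the library's RuleMatch."""
    position: int
    word: str
    suggestion: str
    priority: int
    rule_name: str
    confidence: float


def resolve_conflicts(matches):
    """Keep one match per position and return (index, word, suggestion) tuples.

    Assumption: a larger priority wins, with confidence as the tie-breaker.
    """
    best = {}
    for m in matches:
        current = best.get(m.position)
        if current is None or (m.priority, m.confidence) > (
            current.priority,
            current.confidence,
        ):
            best[m.position] = m
    ordered = sorted(best.values(), key=lambda m: m.position)
    return [(m.position, m.word, m.suggestion) for m in ordered]


matches = [
    RuleMatchSketch(2, "w2", "fix-a", priority=1, rule_name="pos_sequence", confidence=0.80),
    RuleMatchSketch(2, "w2", "fix-b", priority=5, rule_name="particle", confidence=0.90),
    RuleMatchSketch(0, "w0", "fix-c", priority=1, rule_name="medial", confidence=0.85),
]
resolved = resolve_conflicts(matches)
print(resolved)
```

Two rules fire at position 2, and only the higher-priority suggestion survives in the simplified output.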

YAML Rule Format

Grammar rules are defined in YAML format with JSON Schema validation.

Example: grammar_rules.yaml

# Grammar rules for Myanmar language
version: "1.0.0"

pos_sequences:
  invalid:
    - ["V", "V"]  # Verb-Verb without particle
    - ["N", "N", "N", "N"]  # Too many consecutive nouns

  valid:
    - ["N", "P", "V", "P"]  # Subject-particle-verb-particle
    - ["N", "V", "P"]  # Subject-verb-particle

particle_rules:
  verb_particles:
    - "မယ်"
    - "ခဲ့"
    - "နေ"
    - "ပြီ"

  noun_particles:
    - "က"
    - "ကို"
    - "မှာ"
    - "တွင်"
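
The sketch below shows one way the `invalid` patterns could be applied once the file is parsed. The `RULES` dict mirrors what a YAML parser would return for the `pos_sequences` block above. Treating each pattern as a contiguous window is an assumption, and `find_invalid_sequences` is a hypothetical helper, not part of the library.

```python
# Mirrors the parsed form of the pos_sequences block shown above.
RULES = {
    "pos_sequences": {
        "invalid": [["V", "V"], ["N", "N", "N", "N"]],
    }
}


def find_invalid_sequences(pos_tags, rules):
    """Return (start_index, pattern) for every invalid pattern found in pos_tags."""
    hits = []
    for pattern in rules["pos_sequences"]["invalid"]:
        width = len(pattern)
        # Slide a window of the pattern's length across the tag sequence.
        for start in range(len(pos_tags) - width + 1):
            if pos_tags[start:start + width] == pattern:
                hits.append((start, tuple(pattern)))
    return sorted(hits)


hits = find_invalid_sequences(["N", "N", "N", "N", "V", "V", "P"], RULES)
print(hits)
```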

Schema Validation

Rules are validated against JSON Schema:

src/myspellchecker/schemas/grammar_rules.schema.json

Error Types

Grammar checking produces errors with specific types:

Error Type	Description
`grammar_error`	General grammar violation
`particle_typo`	Particle usage error
`pos_sequence_error`	Invalid POS sequence
`aspect_typo`	Aspect marker error
`classifier_typo`	Classifier error
`mixed_register`	Mixed formal/informal
`incomplete_reduplication`	Incomplete reduplication

Integration with SpellChecker

Grammar checking is integrated into the validation pipeline:

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig

config = SpellCheckerConfig(
    use_rule_based_validation=True,
    use_context_checker=True,
)

checker = SpellChecker(config=config)
result = checker.check("ကျောင်း သွား မှာ")

# Keep a subset of the grammar-related error types (see Error Types above)
grammar_errors = [
    e for e in result.errors
    if e.error_type in ("grammar_error", "particle_typo", "pos_sequence_error")
]
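
When more than a few error types are of interest, grouping by `error_type` may be easier than a fixed filter. The sketch below uses a stand-in `ErrorSketch` class so that it runs on its own. Only the `error_type` attribute comes from the example above; the `text` field and `group_by_type` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ErrorSketch:
    """Stand-in for an error object; only error_type is taken from the text."""
    text: str
    error_type: str


def group_by_type(errors):
    """Group error objects into a dict keyed by their error_type."""
    groups = defaultdict(list)
    for err in errors:
        groups[err.error_type].append(err)
    return dict(groups)


sample = [
    ErrorSketch("a", "particle_typo"),
    ErrorSketch("b", "mixed_register"),
    ErrorSketch("c", "particle_typo"),
]
grouped = group_by_type(sample)
print({name: len(items) for name, items in grouped.items()})
```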

Performance

Operation	Complexity	Typical Time
POS Sequence Check	O(n)	~1ms
Particle Validation	O(1)	~0.5ms
Full Grammar Check	O(n)	~5ms
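
These figures depend on hardware and input, so it may be worth measuring on your own data. A minimal timing harness is sketched below. `stand_in_check` is a placeholder so that the snippet runs without the library; in practice it would be replaced by a real call such as `checker.check_sequence(words)`.

```python
import time


def time_per_call_ms(func, arg, repeats=1000):
    """Average wall-clock milliseconds per call of func(arg)."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(arg)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / repeats


def stand_in_check(pos_tags):
    # Placeholder workload: flags a verb that directly follows another verb.
    return [
        i
        for i in range(1, len(pos_tags))
        if pos_tags[i - 1] == pos_tags[i] == "V"
    ]


avg_ms = time_per_call_ms(stand_in_check, ["N", "V", "V", "P"] * 5)
print(f"{avg_ms:.4f} ms per call")
```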

Next Steps

POS Tagging - How POS tags are assigned
Context Checking - N-gram context validation
Validation Strategies - Strategy pattern overview
