N-gram models excel at scoring how “natural” a word sequence sounds, but they can’t enforce hard constraints. A double-verb sequence without an intervening particle is always wrong in Myanmar, regardless of how common the individual words are. The Syntactic Rule Checker runs as Layer 2.5, between word validation and N-gram scoring, to enforce these non-negotiable grammatical patterns.

Overview

The Syntactic Rule Checker runs after word validation but before the statistical N-gram check. It is designed to catch:
  1. Grammatically Impossible Sequences: e.g., two verbs without a particle between them.
  2. Particle Errors: Using a noun-particle after a verb, or vice versa.
  3. Medial Consonant Confusions: ျ (Ya-pin) vs ြ (Ya-yit), resolved based on linguistic roots (handled via lookup).
  4. Sentence Structure: Missing sentence-final particles.

Architecture

The grammar system consists of:

Core Components

| Component | Location | Description |
|---|---|---|
| SyntacticRuleChecker | src/myspellchecker/grammar/engine.py | Main grammar engine |
| GrammarEngineConfig | src/myspellchecker/core/config/grammar_configs.py | Engine configuration |
| Grammar Checkers | src/myspellchecker/grammar/checkers/ | Specialized checkers |

Grammar Checkers

| Checker | File | Purpose |
|---|---|---|
| AspectChecker | grammar/checkers/aspect.py | Aspect marker validation |
| ClassifierChecker | grammar/checkers/classifier.py | Classifier usage |
| CompoundChecker | grammar/checkers/compound.py | Compound word validation |
| MergedWordChecker | grammar/checkers/merged_word.py | Segmenter mis-merge detection |
| NegationChecker | grammar/checkers/negation.py | Negation patterns |
| RegisterChecker | grammar/checkers/register.py | Formal/informal register |

YAML Rule Files

Grammar rules are defined in YAML files located in src/myspellchecker/rules/:

Core Grammar

| File | Purpose |
|---|---|
| grammar_rules.yaml | Core grammar rules (POS sequences, particle agreement) |
| particles.yaml | Particle definitions and rules |
| negation.yaml | Negation patterns |
| register.yaml | Register (formal/informal) rules |
| tone_rules.yaml | Tone mark rules |
| tense_markers.yaml | Tense marker definitions |
| morphology.yaml | Morphological rules |
| morphotactics.yaml | Morphotactic constraints |
| pronouns.yaml | Pronoun definitions |

Checkers and Patterns

| File | Purpose |
|---|---|
| aspects.yaml | Aspect marker rules |
| classifiers.yaml | Classifier patterns |
| compounds.yaml | Compound word rules |
| collocations.yaml | Collocation patterns |
| semantic_rules.yaml | Semantic validation rules |

Confusion and Error Correction

| File | Purpose |
|---|---|
| homophones.yaml | Homophone confusion patterns |
| homophone_confusion.yaml | Homophone confusion matrix for detection |
| confusable_pairs.yaml | Visually/phonetically confusable word pairs |
| confusion_matrix.yaml | Character-level confusion matrix |
| compound_confusion.yaml | Compound word confusion patterns |
| medial_confusion.yaml | Medial consonant (ျ/ြ) confusion lookup |
| stacking_pairs.yaml | Stacking consonant pair rules |
| medial_swap_pairs.yaml | Medial swap pair corrections |
| typo_corrections.yaml | Common typo corrections |
| orthographic_corrections.yaml | Orthographic normalization rules |

Detection and Scoring

| File | Purpose |
|---|---|
| detector_confidences.yaml | Per-detector confidence thresholds |
| corruption_weights.yaml | Weights for synthetic error corruption |
| ambiguous_words.yaml | Ambiguous word definitions |
| pos_inference.yaml | POS inference rules |
| rerank_rules.yaml | Suggestion reranking rules |

Key Features

1. Particle Agreement

Myanmar particles are highly specific to the part of speech they modify.
  • Verb Particles: မယ်, ခဲ့, နေ must follow verbs.
  • Noun Particles: မှာ, က, ကို must follow nouns.
Example Error:
  • Input: ကျောင်း သွား မှာ (“School go at”)
  • Analysis: သွား is a Verb. မှာ is usually a location marker (Noun particle).
  • Correction: Suggest မယ် (Future tense) or မလား (Question) depending on context, or flag as suspicious.
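
The agreement check described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the function name, the `(word, pos)` input shape, and the "N"/"V"/"P" tag labels are assumptions, and the particle sets contain only the examples listed in this section.

```python
# Illustrative sketch only; names and tag labels are assumptions,
# not the myspellchecker API.
VERB_PARTICLES = {"မယ်", "ခဲ့", "နေ"}   # must follow verbs
NOUN_PARTICLES = {"မှာ", "က", "ကို"}    # must follow nouns


def find_particle_mismatches(tagged_words):
    """Return (index, particle, reason) for particles that follow the wrong POS.

    tagged_words: list of (word, pos) pairs, where pos is "N", "V", or "P".
    """
    mismatches = []
    for i in range(1, len(tagged_words)):
        word, _ = tagged_words[i]
        _, prev_pos = tagged_words[i - 1]
        if word in NOUN_PARTICLES and prev_pos == "V":
            mismatches.append((i, word, "noun particle after verb"))
        elif word in VERB_PARTICLES and prev_pos == "N":
            mismatches.append((i, word, "verb particle after noun"))
    return mismatches


# The example from above: a noun particle directly after a verb.
sentence = [("ကျောင်း", "N"), ("သွား", "V"), ("မှာ", "P")]
mismatches = find_particle_mismatches(sentence)
```

A real checker would also need to pick a replacement (for example မယ်) and attach a confidence, which this sketch leaves out.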

2. Medial Confusion (ျ vs ြ)

Many words sound similar but use different medials (Ya-pin vs Ya-yit).
  • Rule: ကျောင်း (School) vs ကြောင်း (Cause/Fact).
  • Logic:
    • If preceded by a Verb (e.g., ဖြစ်), it implies “Cause”. Suggest ကြောင်း.
    • If preceded by a Noun or at start, it implies “School”. Keep ကျောင်း.
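
As a rough sketch, the context rule above reduces to a single lookup on the preceding word's POS tag. The function name and tag labels below are hypothetical. The actual engine is described as using a lookup file (medial_confusion.yaml), whose format is not shown here.

```python
# Hypothetical sketch of the context rule; not the library's implementation.
SCHOOL = "ကျောင်း"   # Ya-pin form: "school"
CAUSE = "ကြောင်း"    # Ya-yit form: "cause/fact"


def suggest_medial(word, prev_pos):
    """Return a suggested replacement, or None to keep the word as written.

    prev_pos is the POS tag of the preceding word ("V", "N", ...),
    or None when the word starts the sentence.
    """
    if word == SCHOOL and prev_pos == "V":
        # After a verb such as ဖြစ်, the "cause" reading is implied.
        return CAUSE
    # After a noun, or at the start of the sentence, keep the word.
    return None
```

For instance, `suggest_medial(SCHOOL, "V")` yields the Ya-yit form, while `suggest_medial(SCHOOL, None)` leaves the word unchanged.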

3. POS Sequence Validation

We maintain a matrix of invalid POS transitions.
  • Verb -> Verb (directly adjacent): Usually invalid. An intervening particle such as ပြීး is required.
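
One way to picture the transition matrix is as a set of forbidden tag pairs scanned over adjacent positions. The sketch below assumes single-letter tags ("N", "V", "P") and a hypothetical function name. It contains only the one transition named above.

```python
# Minimal sketch, assuming POS tags are already available as a list of strings.
INVALID_TRANSITIONS = {("V", "V")}  # verb directly followed by verb


def find_invalid_transitions(pos_tags):
    """Return the index of the second tag in every forbidden adjacent pair."""
    return [
        i + 1
        for i, pair in enumerate(zip(pos_tags, pos_tags[1:]))
        if pair in INVALID_TRANSITIONS
    ]


flagged = find_invalid_transitions(["N", "V", "V", "P"])
```

Because each adjacent pair is inspected once, the scan is linear in the number of tags.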

Configuration

Enable Grammar Checking

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig

# Enable rule-based validation
config = SpellCheckerConfig(
    use_rule_based_validation=True,  # Enable grammar checking
)
checker = SpellChecker(config=config)

Grammar Engine Configuration

from myspellchecker.core.config.grammar_configs import GrammarEngineConfig

# GrammarEngineConfig controls confidence thresholds for grammar checks
grammar_config = GrammarEngineConfig(
    default_confidence_threshold=0.80,
    exact_match_confidence=0.95,
    high_confidence=0.90,
    medium_confidence=0.85,
    pos_sequence_confidence=0.80,
)
Note: GrammarRuleConfig in myspellchecker.grammar.config is a separate class for loading YAML-based grammar rule definitions.

Using SyntacticRuleChecker Directly

from myspellchecker.grammar.engine import SyntacticRuleChecker

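# Create checker with provider.
# `provider` must be an already-initialized provider instance;
# its construction is not shown in this section.
checker = SyntacticRuleChecker(provider=provider)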

# Check a sentence
words = ["ကျောင်း", "သွား", "မှာ"]
errors = checker.check_sequence(words)

for error in errors:
    # Each error is a tuple of (index, error_word, suggestion)
    index, word, suggestion = error
    print(f"Position: {index}")
    print(f"Word: {word}")
    print(f"Suggestion: {suggestion}")
Internally, the grammar engine uses the RuleMatch dataclass (exported from myspellchecker.grammar.engine) for priority-based conflict resolution. RuleMatch has the fields position, word, suggestion, priority, rule_name, and confidence. The check_sequence method resolves conflicts between matches and returns simplified (index, word, suggestion) tuples.
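
To make the idea of priority-based conflict resolution concrete, here is a sketch using a stand-in dataclass with the same field names. The resolution policy shown (the higher priority number wins, with ties broken by confidence) is an assumption made for illustration. The library's actual ordering, and the rule names used here, are not confirmed by this page.

```python
from dataclasses import dataclass


@dataclass
class RuleMatchSketch:
    """Stand-in mirroring the documented RuleMatch fields (not the real class)."""
    position: int
    word: str
    suggestion: str
    priority: int
    rule_name: str
    confidence: float


def resolve_conflicts(matches):
    """Keep one match per position and return (index, word, suggestion) tuples.

    Assumed policy: higher priority wins; confidence breaks ties.
    """
    best = {}
    for m in matches:
        current = best.get(m.position)
        if current is None or (m.priority, m.confidence) > (
            current.priority,
            current.confidence,
        ):
            best[m.position] = m
    ordered = sorted(best.values(), key=lambda m: m.position)
    return [(m.position, m.word, m.suggestion) for m in ordered]


# Two hypothetical rules fire on the same word.
matches = [
    RuleMatchSketch(2, "မှာ", "မလား", 5, "question_particle", 0.85),
    RuleMatchSketch(2, "မှာ", "မယ်", 10, "particle_agreement", 0.90),
]
resolved = resolve_conflicts(matches)
```

Collapsing to one match per position keeps the caller from receiving contradictory suggestions for the same word.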

YAML Rule Format

Grammar rules are defined in YAML format with JSON Schema validation.

Example: grammar_rules.yaml

# Grammar rules for Myanmar language
version: "1.0.0"

pos_sequences:
  invalid:
    - ["V", "V"]  # Verb-Verb without particle
    - ["N", "N", "N", "N"]  # Too many consecutive nouns

  valid:
    - ["N", "P", "V", "P"]  # Subject-particle-verb-particle
    - ["N", "V", "P"]  # Subject-verb-particle

particle_rules:
  verb_particles:
    - "မယ်"
    - "ခဲ့"
    - "နေ"
    - "ပြီ"

  noun_particles:
    - "က"
    - "ကို"
    - "မှာ"
    - "တွင်"

Schema Validation

Rules are validated against JSON Schema:
  • src/myspellchecker/schemas/grammar_rules.schema.json

Error Types

Grammar checking produces errors with specific types:
| Error Type | Description |
|---|---|
| grammar_error | General grammar violation |
| particle_typo | Particle usage error |
| pos_sequence_error | Invalid POS sequence |
| aspect_typo | Aspect marker error |
| classifier_typo | Classifier error |
| mixed_register | Mixed formal/informal |
| incomplete_reduplication | Incomplete reduplication |

Integration with SpellChecker

Grammar checking is integrated into the validation pipeline:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig

config = SpellCheckerConfig(
    use_rule_based_validation=True,
    use_context_checker=True,
)

checker = SpellChecker(config=config)
result = checker.check("ကျောင်း သွား မှာ")

# Filter grammar errors
grammar_errors = [
    e for e in result.errors
    if e.error_type in ("grammar_error", "particle_typo", "pos_sequence_error")
]
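
When more than a few error types are of interest, grouping can be easier to work with than filtering. The helper below is a small sketch that relies only on the `error_type` attribute used above. The helper name is hypothetical, and the sample objects are stand-ins so the snippet runs without the library.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_error_type(errors):
    """Group error objects by their error_type attribute."""
    grouped = defaultdict(list)
    for error in errors:
        grouped[error.error_type].append(error)
    return dict(grouped)


# Stand-in objects for demonstration; real errors would come from result.errors.
sample_errors = [
    SimpleNamespace(error_type="particle_typo"),
    SimpleNamespace(error_type="pos_sequence_error"),
    SimpleNamespace(error_type="particle_typo"),
]
grouped = group_by_error_type(sample_errors)
```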

Performance

| Operation | Complexity | Typical Time |
|---|---|---|
| POS Sequence Check | O(n) | ~1ms |
| Particle Validation | O(1) | ~0.5ms |
| Full Grammar Check | O(n) | ~5ms |
