Layer 1 of the validation pipeline catches approximately 90% of spelling errors by validating Myanmar syllable structure against orthographic rules before dictionary lookup.

Overview

The SyllableRuleValidator performs 22 phonotactic checks entirely without a dictionary, ranging from zero-width character rejection through medial compatibility to tone mark rules. Syllables that pass structural validation then proceed to SyllableValidator for dictionary lookup and suggestion generation.

Architecture

The system uses a dual-implementation pattern. SyllableRuleValidator (syllable_rules.py) resolves to one of two implementations:
  • _SyllableRuleValidatorPython: pure Python fallback
  • SyllableRuleValidator (from syllable_rules_c): Cython-optimized, selected automatically
At runtime, the Cython version is used if available, providing a roughly 8x speedup.

Validation Pipeline

Layer Flow

        +-------------------+
        |    Input Text     |
        +---------+---------+
                  |
                  v
+------------------------------------------+
|  Layer 1: SyllableRuleValidator          |
|                                          |
|  Structural validation (syllable_rules)  |
|   - 22+ phonotactic rules                |
|   - No dictionary needed                 |
|   - Returns True / False                 |
+--------------------+---------------------+
                     |
                     | if valid
                     v
+------------------------------------------+
|  Layer 1b: SyllableValidator             |
|                                          |
|  Dictionary lookup (validators.py)       |
|   - Uses SyllableRepository              |
|   - Frequency threshold check            |
|   - Generates suggestions                |
+------------------------------------------+
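
The flow in the diagram can be sketched as a two-step function. This is an illustration under stated assumptions: check_syllable, the status strings, and the frequency mapping are hypothetical and stand in for the real SyllableRuleValidator, SyllableRepository, and threshold handling.

```python
from typing import Callable, Mapping


def check_syllable(
    syllable: str,
    is_structurally_valid: Callable[[str], bool],
    frequencies: Mapping[str, int],
    min_frequency: int = 1,
) -> str:
    """Classify a syllable as 'invalid_structure', 'unknown', or 'ok'."""
    # Layer 1: structural rules, no dictionary involved.
    if not is_structurally_valid(syllable):
        return "invalid_structure"
    # Layer 1b: dictionary lookup with a frequency threshold.
    if frequencies.get(syllable, 0) < min_frequency:
        return "unknown"
    return "ok"
```

Because the structural check runs first, malformed input never reaches the dictionary lookup.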

SyllableRuleValidator

Initialization

from myspellchecker.core.syllable_rules import SyllableRuleValidator

# Default settings
validator = SyllableRuleValidator()

# Custom settings
validator = SyllableRuleValidator(
    max_syllable_length=15,      # Maximum valid syllable length
    corruption_threshold=3,       # Max consecutive identical chars
    strict=True,                  # Enable strict mode
    allow_extended_myanmar=False  # Only standard Burmese
)

Basic Usage

from myspellchecker.core.syllable_rules import SyllableRuleValidator

validator = SyllableRuleValidator()

# Valid syllables
validator.validate("မြန်")  # True
validator.validate("မာ")    # True
validator.validate("ကျွန်") # True

# Invalid syllables
validator.validate("ြမန်")  # False - medial without consonant
validator.validate("")      # False - empty
validator.validate("ကကကက") # False - corruption (4 identical)

Validation Rules (22 Checks)

The validate() method performs 22 checks in order:

Phase 1: Basic Checks

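| #  | Check                          | Purpose                                        |
|----|--------------------------------|------------------------------------------------|
| 1  | Zero-width character rejection | Detect encoding issues                         |
| 2  | Corruption check               | Detect data corruption (length, repetition)    |
| 3  | Start character check          | Must start with consonant or independent vowel |
| 4  | Base character validation      | Verify valid Myanmar base character            |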
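
As an illustration of checks 1 and 2, here is a minimal sketch. The set of zero-width code points and the function names are assumptions, and the exact meaning of the thresholds in the library may differ; the defaults mirror the max_syllable_length=15 and corruption_threshold=3 values shown in the initialization example, read as "more than 3 identical characters in a row is corruption".

```python
import itertools

# Assumption: a typical set of zero-width code points; the library's list may differ.
ZERO_WIDTH_CHARS = frozenset("\u200b\u200c\u200d\u2060\ufeff")


def has_zero_width(syllable: str) -> bool:
    """True if the syllable contains any zero-width character."""
    return any(ch in ZERO_WIDTH_CHARS for ch in syllable)


def looks_corrupted(syllable: str, max_length: int = 15, max_run: int = 3) -> bool:
    """True if the syllable is too long or repeats one character more than max_run times in a row."""
    if len(syllable) > max_length:
        return True
    longest_run = max(
        (len(list(group)) for _, group in itertools.groupby(syllable)),
        default=0,
    )
    return longest_run > max_run
```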

Phase 2: Structure Rules

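| #  | Check                    | Purpose                                      |
|----|--------------------------|----------------------------------------------|
| 5  | Independent vowel rules  | Independent vowels can’t take medials/vowels |
| 6  | Structure sanity         | Medial sequences, ordering, Visarga position |
| 7  | Kinzi pattern validation | Validate င်္ sequences                        |
| 8  | Asat predecessor check   | Asat must follow consonant                   |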

Phase 3: Compatibility Rules

| #  | Check                          | Purpose                                  |
|----|--------------------------------|------------------------------------------|
| 9  | Unexpected consonant detection | Multiple unconnected consonants          |
| 10 | Medial compatibility           | Consonant-medial phonotactics            |
| 11 | Medial-vowel compatibility     | Medial+vowel combination validity        |
| 12 | Tone rules                     | Stop finals, tone conflicts              |
| 13 | Virama usage check             | Stacking must not end syllable           |
| 14 | Vowel combinations (digraphs)  | Valid multi-vowel patterns               |
| 15 | Vowel exclusivity              | Upper vs lower vowel slots               |
| 16 | E vowel combinations           | ေ combination restrictions and position  |
| 17 | Great Sa rules                 | ဿ usage restrictions                     |
| 18 | Anusvara compatibility         | ံ vowel restrictions                     |
| 19 | Asat count                     | Maximum asat characters per syllable     |
| 20 | Double diacritics              | No duplicate diacritics                  |
| 21 | Tall A / Aa exclusivity        | ါ and ာ are mutually exclusive           |
| 22 | Dot below position             | Dot below must follow valid base         |
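
Two of the simpler rules, checks 20 and 21, can be sketched in a few lines. This is an assumption-laden illustration: the function names are hypothetical, and which characters the library counts as "diacritics" is not stated in this document, so the set below (vowel signs, tone marks, and medials) is a guess.

```python
TALL_AA = "\u102b"  # ါ
AA = "\u102c"       # ာ

# Assumption: vowel signs, tone marks, and medials count as diacritics here.
DIACRITICS = frozenset(
    "\u102b\u102c\u102d\u102e\u102f\u1030\u1031\u1032"  # vowel signs
    "\u1036\u1037\u1038"                                # tone marks
    "\u103b\u103c\u103d\u103e"                          # medials
)


def mixes_tall_aa_and_aa(syllable: str) -> bool:
    """True if both ါ and ာ appear in the same syllable (check 21)."""
    return TALL_AA in syllable and AA in syllable


def has_duplicate_diacritic(syllable: str) -> bool:
    """True if any diacritic occurs more than once (check 20)."""
    seen = set()
    for ch in syllable:
        if ch in DIACRITICS:
            if ch in seen:
                return True
            seen.add(ch)
    return False
```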

Strict Mode Additional Checks

When strict=True, these additional checks are applied:
| Check                    | Purpose                      |
|--------------------------|------------------------------|
| Virama count             | Max 1 virama (2 with Kinzi)  |
| Anusvara + Asat conflict | Incompatible combination     |
| Asat before vowel        | Invalid sequence             |
| Tone strictness          | Max 1 tone mark per syllable |
| Tone position            | Tone marks must be at end    |
| Character scope          | Only core Myanmar characters |
| Diacritic uniqueness     | No duplicate medials/vowels  |
| One final rule           | Max 1 final element          |
| Strict Kinzi             | Nga + Virama needs Asat      |
| Virama ordering          | Virama before medials        |
| Pat Sint validity        | Stacking rules (Vagga logic) |

Myanmar Syllable Structure

Valid Myanmar syllables follow this pattern:
Consonant + [Medial(s)] + [Vowel] + [Tone] + [Final]

Character Categories

| Component  | Unicode Range | Examples |
|------------|---------------|----------|
| Consonants | U+1000-U+1021 | က ခ ဂ ဃ င စ ဆ ဇ ဈ ည ဋ ဌ ဍ ဎ ဏ တ ထ ဒ ဓ န ပ ဖ ဗ ဘ မ ယ ရ လ ဝ သ ဟ ဠ အ |
| Medials    | U+103B-U+103E | ျ (Ya) ြ (Ra) ွ (Wa) ှ (Ha) |
| Vowels     | U+102B-U+1032 | ါ ာ ိ ီ ု ူ ေ ဲ |
| Tone marks | U+1036-U+1038 | ံ ့ း |
| Asat       | U+103A        | ် |
| Virama     | U+1039        | ္ |
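
The ranges in the table translate directly into a small classifier. The sketch below is illustrative only; classify_char and its category labels are hypothetical names, not part of the library's API.

```python
def classify_char(ch: str) -> str:
    """Map one character to a category using the ranges from the table above."""
    cp = ord(ch)
    if 0x1000 <= cp <= 0x1021:
        return "consonant"
    if 0x103B <= cp <= 0x103E:
        return "medial"
    if 0x102B <= cp <= 0x1032:
        return "vowel"
    if 0x1036 <= cp <= 0x1038:
        return "tone"
    if cp == 0x103A:
        return "asat"
    if cp == 0x1039:
        return "virama"
    return "other"


# မြန် is Ma + medial Ra + Na + Asat
print([classify_char(ch) for ch in "\u1019\u103c\u1014\u103a"])
```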

Valid Medial Sequences

VALID_MEDIAL_SEQUENCES = {
    # Four-medial (Ya+Ra+Wa+Ha)
    "ျြွှ",
    # Three-medial combinations
    "ျြွ", "ျြှ", "ျွှ", "ြွှ",
    # Two-medial combinations (canonical order: Ya > Ra > Wa > Ha)
    "ျြ", "ျွ", "ျှ", "ြွ", "ြှ", "ွှ",
    # Single medials
    "ျ", "ြ", "ွ", "ှ",
}
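
The fifteen entries above are exactly the non-empty subsets of the four medials written in canonical order, so the set can also be generated rather than listed by hand. The sketch below shows this; GENERATED_SEQUENCES, medial_run, and has_valid_medial_order are illustrative names, and collecting medials from anywhere in the syllable is a simplification.

```python
from itertools import combinations

CANONICAL_MEDIALS = "\u103b\u103c\u103d\u103e"  # Ya, Ra, Wa, Ha in canonical order

# combinations() preserves input order, so every result is canonically ordered.
GENERATED_SEQUENCES = {
    "".join(combo)
    for size in range(1, len(CANONICAL_MEDIALS) + 1)
    for combo in combinations(CANONICAL_MEDIALS, size)
}


def medial_run(syllable: str) -> str:
    """Collect the medial characters of a syllable in the order they appear."""
    return "".join(ch for ch in syllable if ch in CANONICAL_MEDIALS)


def has_valid_medial_order(syllable: str) -> bool:
    """True if the syllable has no medials or its medials are in canonical order."""
    run = medial_run(syllable)
    return run == "" or run in GENERATED_SEQUENCES
```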

Medial Compatibility

Not all consonants can take all medials:
# Medial Ya (ျ) compatible consonants
COMPATIBLE_YA = {"က", "ခ", "ဂ", "ဃ", "င", "စ", "ဆ", "ဇ", "ည", ...}

# Medial Ra (ြ) compatible consonants
COMPATIBLE_RA = {"က", "ခ", "ဂ", "ဃ", "င", "စ", "ဆ", "ဇ", ...}

# Medial Wa (ွ) - broadly compatible
COMPATIBLE_WA = {"က", "ခ", "ဂ", "ဃ", "င", "စ", "ဆ", ...}

# Medial Ha (ှ) - only sonorants
COMPATIBLE_HA = {"မ", "န", "ည", "ဏ", "လ", "ရ", "ဝ", "ယ"}
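
A compatibility lookup for medial Ha, the only set listed in full above, might look like the following sketch. The function name is hypothetical, and treating the first character as the base consonant is a simplifying assumption.

```python
MEDIAL_HA = "\u103e"  # ှ

# Same members as COMPATIBLE_HA above, written as escapes:
# Ma, Na, Nya, Nna, La, Ra, Wa, Ya
COMPATIBLE_HA = {
    "\u1019", "\u1014", "\u100a", "\u100f",
    "\u101c", "\u101b", "\u101d", "\u101a",
}


def ha_is_compatible(syllable: str) -> bool:
    """True if the syllable has no medial Ha or its first character may take one."""
    if MEDIAL_HA not in syllable:
        return True
    # Assumption: the base consonant is the first character of the syllable.
    return syllable[0] in COMPATIBLE_HA
```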

Special Patterns

Kinzi (င်္)

Kinzi is a nasalization marker in Pali/Sanskrit loanwords:
# Valid Kinzi pattern: Nga + Asat + Virama + Consonant
kinzi_seq = "င" + "်" + "္"  # U+1004 + U+103A + U+1039

# Example: သင်္ဘော (ship)
validator.validate("သင်္ဘော")  # True

# Invalid: Kinzi without following consonant
validator.validate("သင်္")  # False
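
The "Kinzi must be followed by a consonant" rule can be sketched without the library. The helper names below are hypothetical, and the consonant range is taken from the Character Categories table.

```python
KINZI = "\u1004\u103a\u1039"  # Nga + Asat + Virama


def is_consonant(ch: str) -> bool:
    """True for characters in the consonant range U+1000-U+1021."""
    return 0x1000 <= ord(ch) <= 0x1021


def kinzi_is_well_formed(text: str) -> bool:
    """True if every Kinzi sequence is immediately followed by a consonant."""
    start = text.find(KINZI)
    while start != -1:
        following = start + len(KINZI)
        if following >= len(text) or not is_consonant(text[following]):
            return False
        start = text.find(KINZI, following)
    return True
```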

Stacking (Pat Sint)

Consonant stacking follows Vagga (row) rules:
# Valid: Same-row stacking
validator.validate("က္က")  # True - Ka row
validator.validate("မ္မ")  # True - Pa row

# Pali/Sanskrit exceptions
validator.validate("က္ခ")  # True - Exception for loanwords
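
A same-row test over the five traditional Vagga rows can be sketched as below. This is only an illustration of the row idea: the table and the function names are assumptions, consonants outside the five rows return None, and the library's actual stacking table and loanword exceptions are not shown in this document.

```python
from typing import Optional

# The five Vagga rows; both Nya forms (U+1009, U+100A) are placed in the Ca row.
VAGGA_ROWS = {
    "ka": "\u1000\u1001\u1002\u1003\u1004",
    "ca": "\u1005\u1006\u1007\u1008\u1009\u100a",
    "tta": "\u100b\u100c\u100d\u100e\u100f",
    "ta": "\u1010\u1011\u1012\u1013\u1014",
    "pa": "\u1015\u1016\u1017\u1018\u1019",
}


def vagga_of(consonant: str) -> Optional[str]:
    """Return the row name for a consonant, or None if it is outside the five rows."""
    for name, members in VAGGA_ROWS.items():
        if consonant in members:
            return name
    return None


def same_vagga(upper: str, lower: str) -> bool:
    """True if both consonants belong to the same Vagga row."""
    row = vagga_of(upper)
    return row is not None and row == vagga_of(lower)
```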

Great Sa (ဿ)

The doubled Sa conjunct has special rules:
# Great Sa cannot take medials or stack
validator.validate("ဿ")    # True
validator.validate("ဿွ")   # False - no medials
validator.validate("ဿ္က")  # False - no stacking

Integration with SyllableValidator

The rule validator integrates with the full validation pipeline:
from myspellchecker.core.validators import SyllableValidator
from myspellchecker.core.syllable_rules import SyllableRuleValidator

# SyllableValidator uses SyllableRuleValidator internally.
# provider, segmenter, symspell, and config are assumed to have been created elsewhere.
validator = SyllableValidator.create(
    repository=provider,
    segmenter=segmenter,
    symspell=symspell,
    config=config,
    syllable_rule_validator=SyllableRuleValidator(strict=True),
)

# Validate returns errors with suggestions
errors = validator.validate("invalid text here")

Performance

| Implementation | Speed           | Notes     |
|----------------|-----------------|-----------|
| Pure Python    | ~80μs/syllable  | Fallback  |
| Cython         | ~10μs/syllable  | 8x faster |
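
To measure per-syllable cost on your own machine, a small timing helper built on the standard timeit module is enough. The sketch below is generic: microseconds_per_call is a hypothetical helper, and str.strip stands in for a validator's validate method.

```python
import timeit
from typing import Callable, Sequence


def microseconds_per_call(
    func: Callable[[str], object],
    samples: Sequence[str],
    repeat: int = 5,
) -> float:
    """Return the best observed average time per call, in microseconds."""
    if not samples:
        raise ValueError("samples must not be empty")

    def run_all() -> None:
        for sample in samples:
            func(sample)

    # Take the fastest of several runs to reduce noise from other processes.
    best_seconds = min(timeit.repeat(run_all, number=1, repeat=repeat))
    return best_seconds / len(samples) * 1_000_000


print(f"{microseconds_per_call(str.strip, ['abc'] * 1000):.3f} microseconds per call")
```

Passing validator.validate and a list of real syllables in place of the stand-ins gives a figure comparable to the table above.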

Check Implementation

from myspellchecker.core.syllable_rules import _USING_CYTHON

print(f"Using Cython: {_USING_CYTHON}")

Configuration Options

Strict vs Lenient Mode

# Strict mode (default) - for formal documents
validator = SyllableRuleValidator(strict=True)

# Lenient mode - for informal text, transliterations
validator = SyllableRuleValidator(strict=False)
Strict mode enforces:
  • Pali/Sanskrit stacking rules (Vagga logic)
  • Canonical character ordering
  • Stricter tone mark rules
  • Core Myanmar characters only

Extended Myanmar

# Allow Extended Myanmar blocks (Shan, Mon, etc.)
validator = SyllableRuleValidator(allow_extended_myanmar=True)

# Standard Burmese only (default)
validator = SyllableRuleValidator(allow_extended_myanmar=False)
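
The effect of such a flag can be illustrated with Unicode block ranges. This sketch assumes that "standard" means the main Myanmar block (U+1000-U+109F) and "extended" means Myanmar Extended-A and Extended-B; the library's actual character scope may be narrower or wider, and in_scope is a hypothetical helper.

```python
CORE_MYANMAR = range(0x1000, 0x10A0)  # Myanmar block
EXTENDED_MYANMAR = (
    range(0xAA60, 0xAA80),  # Myanmar Extended-A
    range(0xA9E0, 0xAA00),  # Myanmar Extended-B
)


def in_scope(text: str, allow_extended: bool = False) -> bool:
    """True if every character falls inside the permitted Unicode blocks."""
    for ch in text:
        cp = ord(ch)
        if cp in CORE_MYANMAR:
            continue
        if allow_extended and any(cp in block for block in EXTENDED_MYANMAR):
            continue
        return False
    return True
```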

Error Messages

When validation fails, the check that failed can be identified for debugging:
# For debugging, check individual rules
validator = SyllableRuleValidator()

syllable = "ြမန်"  # Invalid: starts with medial

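# Calling individual checks directly can help identify the issue.
# Underscore-prefixed methods are private by convention.
validator._check_start_char(syllable)            # False - validate() fails here
validator._check_medial_compatibility(syllable)  # Not reached by validate()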
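
One way to find the first failing rule without relying on private methods is to run named checks in order and stop at the first failure. The sketch below is self-contained; first_failing_check and the two stand-in checks are hypothetical and far simpler than the library's rules (for instance, the start check here accepts only consonants, not independent vowels).

```python
from typing import Callable, Optional, Sequence, Tuple

Check = Tuple[str, Callable[[str], bool]]


def first_failing_check(syllable: str, checks: Sequence[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails."""
    for name, check in checks:
        if not check(syllable):
            return name
    return None


def starts_with_consonant(syllable: str) -> bool:
    # Simplified stand-in: consonant range only (U+1000-U+1021).
    return bool(syllable) and 0x1000 <= ord(syllable[0]) <= 0x1021


CHECKS: Sequence[Check] = [
    ("non_empty", lambda s: bool(s)),
    ("start_char", starts_with_consonant),
]

# ြမန် starts with medial Ra, so the start character check is the first to fail.
print(first_failing_check("\u103c\u1019\u1014\u103a", CHECKS))
```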
