Every validation layer depends on accurate syllable boundaries. The RegexSegmenter uses three complementary regex patterns to identify where one syllable ends and the next begins, handling stacked consonants, Kinzi formations, and mixed-script text.
Overview
Myanmar text has no whitespace between words, making segmentation challenging. The syllable segmenter breaks continuous Myanmar text into individual syllables using regex-based pattern matching combined with syllable rule validation.
Algorithm Design
The RegexSegmenter uses three regex patterns to identify syllable boundaries:
Pattern 1: Myanmar Consonant Syllable Start
- Negative lookbehind: (?<!(?<!\u103a)\u1039) - the consonant is NOT preceded by a stacking Virama (unless that Virama is itself preceded by Asat, as in Kinzi)
- Negative lookahead: (?!\u103a) - the consonant is NOT followed by Asat (the Virama case is handled by the lookbehind, so the lookahead only checks Asat)
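The two assertions above can be tried out with a short sketch. It is an illustration rather than the library's source: the consonant class U+1000-U+1021 is taken from the syllable structure table, and the names CONSONANT_START and consonant_start_offsets are hypothetical.

```python
import re
from typing import List

# Hypothetical reconstruction of Pattern 1 from the description above.
CONSONANT_START = re.compile(
    "(?<!(?<!\u103a)\u1039)"  # not preceded by a stacking Virama (Kinzi excepted)
    "[\u1000-\u1021]"         # a Myanmar consonant (range assumed from the table)
    "(?!\u103a)"              # not followed by Asat
)


def consonant_start_offsets(text: str) -> List[int]:
    """Return the offsets of consonants that would open a new syllable."""
    return [m.start() for m in CONSONANT_START.finditer(text)]


# Plain word: the consonant carrying Asat (offset 2) does not open a syllable.
plain = "\u1019\u103c\u1014\u103a\u1019\u102c"
print(consonant_start_offsets(plain))   # [0, 4]

# Kinzi: the consonant after Asat + Virama (offset 4) does open a syllable.
kinzi = "\u1021\u1004\u103a\u1039\u1002\u101c\u102d\u1015\u103a"
print(consonant_start_offsets(kinzi))   # [0, 4, 5]
```

The nested lookbehind stays fixed-width (one character), which is why Python's re module accepts it.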
Pattern 2: Other Syllable Starters
- Independent vowels (U+1022-U+102A)
- Great Sa (U+103F)
- Symbols (U+104C-U+104F)
- Digits (U+1040-U+1049)
- Punctuation (U+104A-U+104B)
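The ranges listed above translate directly into a character class. The following is a minimal sketch under that assumption; the names OTHER_STARTERS and other_starter_offsets are hypothetical and not taken from the library.

```python
import re
from typing import List

# Character class assembled from the ranges listed for Pattern 2.
OTHER_STARTERS = re.compile(
    "["
    "\u1022-\u102a"  # independent vowels
    "\u103f"         # Great Sa
    "\u104c-\u104f"  # symbols
    "\u1040-\u1049"  # digits
    "\u104a-\u104b"  # punctuation
    "]"
)


def other_starter_offsets(text: str) -> List[int]:
    """Return offsets of characters that start a syllable on their own."""
    return [m.start() for m in OTHER_STARTERS.finditer(text)]


# Two digits followed by a section mark: each character is its own starter.
print(other_starter_offsets("\u1041\u1042\u104b"))  # [0, 1, 2]
```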
Pattern 3: Non-Myanmar Characters
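The pattern for this case is not reproduced in this section. As a rough sketch, assuming that "non-Myanmar" means any code point outside the main Myanmar block (U+1000-U+109F) and that consecutive such characters form a single run, it could look like the following. NON_MYANMAR_RUN and non_myanmar_runs are hypothetical names.

```python
import re
from typing import List

# Assumption: everything outside U+1000-U+109F counts as non-Myanmar,
# and a whole run of such characters is matched as one unit.
NON_MYANMAR_RUN = re.compile("[^\u1000-\u109f]+")


def non_myanmar_runs(text: str) -> List[str]:
    """Return each maximal run of non-Myanmar characters."""
    return NON_MYANMAR_RUN.findall(text)


print(non_myanmar_runs("abc\u1019\u102cxyz 12"))  # ['abc', 'xyz 12']
```

Whether whitespace should be grouped with surrounding Latin text, as it is here, is a design choice this sketch does not settle.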
Implementation
RegexSegmenter Class
Configuration Options
Syllable Validation
After segmentation, each syllable is validated using SyllableRuleValidator.
Myanmar Syllable Structure
A valid Myanmar syllable is built from the following components:
| Component | Unicode Range | Examples |
|---|---|---|
| Consonant | U+1000-U+1021 | က, ခ, ဂ, မ, န |
| Medials | U+103B-U+103E | ျ, ြ, ွ, ှ |
| Vowels | U+102B-U+1032 | ါ, ာ, ိ, ု, ေ |
| Tone marks | U+1036-U+1038 | ံ, ့, း |
| Asat | U+103A | ် (syllable killer) |
| Virama | U+1039 | ္ (consonant stacker) |
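To make the table concrete, the sketch below labels each character of a syllable with its component name. It only mirrors the ranges in the table and does not reproduce the rules applied by SyllableRuleValidator; COMPONENT_RANGES, classify, and components are hypothetical names.

```python
from typing import List, Tuple

# Ranges copied from the component table above.
COMPONENT_RANGES = [
    ("consonant", 0x1000, 0x1021),
    ("medial", 0x103B, 0x103E),
    ("vowel", 0x102B, 0x1032),
    ("tone", 0x1036, 0x1038),
    ("asat", 0x103A, 0x103A),
    ("virama", 0x1039, 0x1039),
]


def classify(ch: str) -> str:
    """Return the component name for one character, or 'other'."""
    cp = ord(ch)
    for name, lo, hi in COMPONENT_RANGES:
        if lo <= cp <= hi:
            return name
    return "other"


def components(syllable: str) -> List[Tuple[str, str]]:
    """Pair every character of a syllable with its component name."""
    return [(ch, classify(ch)) for ch in syllable]


# Consonant + medial + consonant + Asat
print([name for _, name in components("\u1019\u103c\u1014\u103a")])
# ['consonant', 'medial', 'consonant', 'asat']
```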
Performance
The segmenter has two implementations:
| Implementation | Speed | Use Case |
|---|---|---|
| Pure Python (RegexSegmenter) | ~10ms/1K chars | Default, portable |
| Cython (normalize_c.pyx) | ~1ms/1K chars | Production, auto-enabled |
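Figures like these depend on hardware and input, so it can be useful to measure them locally. The helper below is a generic timing sketch using the standard timeit module; ms_per_1k_chars is a hypothetical name, and the regex passed to it is only a stand-in for whichever segmenter callable is being measured.

```python
import re
import timeit
from typing import Callable


def ms_per_1k_chars(segment: Callable[[str], object], text: str, repeat: int = 20) -> float:
    """Average milliseconds spent per 1,000 input characters."""
    if not text:
        raise ValueError("text must not be empty")
    seconds = timeit.timeit(lambda: segment(text), number=repeat) / repeat
    return seconds * 1000.0 * (1000.0 / len(text))


# Stand-in workload: a simple regex scan, not the library's segmenter.
stand_in = re.compile("[\u1000-\u1021]").findall
sample = "\u1019\u103c\u1014\u103a\u1019\u102c" * 500
elapsed = ms_per_1k_chars(stand_in, sample)
```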
Edge Cases
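The cases in this section can be reproduced with a minimal sketch that combines the three patterns described under Algorithm Design and cuts the text before every match. It is an illustration, not the library's source: the name segment_sketch, the combined pattern, and the treatment of non-Myanmar text as any run outside U+1000-U+109F are assumptions.

```python
import re
from typing import List

# Hypothetical combined pattern; each match marks the start of a new chunk.
SYLLABLE_START = re.compile(
    "(?<!(?<!\u103a)\u1039)[\u1000-\u1021](?!\u103a)"          # Pattern 1
    "|[\u1022-\u102a\u103f\u104c-\u104f\u1040-\u1049\u104a-\u104b]"  # Pattern 2
    "|[^\u1000-\u109f]+"                                        # Pattern 3 (assumed)
)


def segment_sketch(text: str) -> List[str]:
    """Split text before every position where a pattern matches."""
    starts = [m.start() for m in SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any leading characters that matched nothing
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


stacked = "\u1000\u1019\u1039\u1018\u102c"
kinzi = "\u1021\u1004\u103a\u1039\u1002\u101c\u102d\u1015\u103a"
mixed = "hello \u1019\u103c\u1014\u103a\u1019\u102c"

print(segment_sketch(stacked))  # 2 chunks; the Virama pair is in the second
print(segment_sketch(kinzi))    # 3 chunks; the Kinzi stays with the first
print(segment_sketch(mixed))    # 3 chunks; 'hello ' is the first
```

In this sketch the stacked pair joined by Virama is never split, the consonant that follows a Kinzi opens a new syllable, and the Latin run comes back as one chunk.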
Stacked Consonants
Stacked consonants (using Virama U+1039) stay together.
Kinzi Formation
Kinzi (Nga + Asat + Virama, followed by a consonant) is handled correctly.
Mixed Script
Non-Myanmar text is grouped together.
See Also
- Syllable Validation - Validation rules
- Word Segmentation - Word-level assembly
- Cython Guide - Performance optimization