Myanmar Script & Unicode
Characters & Components
| Term | Definition |
|---|---|
| Consonant | One of 35 consonant characters in Myanmar script: 34 base consonants (U+1000–U+1021) plus Great Sa (ဿ, U+103F). |
| Dependent Vowel | Vowel signs that attach to consonants (U+102B–U+1032). Cannot stand alone. |
| Independent Vowel | Vowel characters that can stand alone without a consonant (U+1023–U+102A). |
| Medial | Consonant modifiers that appear between the base consonant and the vowel. Four medials exist: ျ (ya-pin, U+103B), ြ (ya-yit, U+103C), ွ (wa-hswe, U+103D), ှ (ha-htoe, U+103E). |
| Syllable | The fundamental unit of Myanmar text. Consists of consonant + optional medials + optional vowels + optional finals. |
| Tone Mark | Characters that indicate tone, nasalization, or emphasis: ံ (anusvara, U+1036), ့ (dot below, U+1037), and း (visarga, U+1038). |
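
The ranges in this table can be turned into a small classifier. The following is a minimal sketch, not mySpellChecker's own code: the function name `classify` and the category labels are made up for illustration, and the medial range U+103B–U+103E comes from the Unicode Myanmar block.

```python
def classify(ch: str) -> str:
    """Return a coarse category for one character (illustrative labels)."""
    cp = ord(ch)
    if 0x1000 <= cp <= 0x1021 or cp == 0x103F:  # base consonants + Great Sa
        return "consonant"
    if 0x1023 <= cp <= 0x102A:
        return "independent_vowel"
    if 0x102B <= cp <= 0x1032:
        return "dependent_vowel"
    if 0x103B <= cp <= 0x103E:  # ya-pin, ya-yit, wa-hswe, ha-htoe
        return "medial"
    if cp in (0x1036, 0x1037, 0x1038):  # anusvara, dot below, visarga
        return "tone_mark"
    return "other"  # asat, virama, digits, punctuation, non-Myanmar text


# မြန်မာ written with escapes: ma, medial ra, na, asat, ma, vowel sign aa
word = "\u1019\u103C\u1014\u103A\u1019\u102C"
labels = [classify(ch) for ch in word]
```

Asat (U+103A) falls into `"other"` here because the table above does not assign it to any of the listed categories.
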
Special Characters
| Term | Definition |
|---|---|
| Asat (်) | Myanmar Unicode character (U+103A) that “kills” the inherent vowel of a consonant, creating a final consonant sound. Also called “killer” or “vowel killer”. |
| Anusvara (ံ) | Myanmar Unicode character (U+1036) indicating nasalization of the preceding vowel. |
| Kinzi | A special stacked form in which င် (nga + asat) is rendered above the following consonant. Encoded as င (U+1004) + ် (U+103A) + ္ (virama, U+1039) before that consonant. |
| Virama (္) | Myanmar Unicode character (U+1039) used for consonant stacking. |
| Visarga (း) | Myanmar Unicode character (U+1038) that marks the high tone of a syllable in Burmese. |
| Zero-Width Characters | Invisible Unicode characters (ZWSP U+200B, ZWNJ U+200C, ZWJ U+200D, BOM U+FEFF) that should typically be removed during normalization. |
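
As an illustration of how kinzi appears in encoded text, the sketch below searches for the sequence nga + asat + virama followed by a base consonant. The names `KINZI` and `has_kinzi` are hypothetical helpers, not part of mySpellChecker.

```python
import re

# Kinzi: nga (U+1004) + asat (U+103A) + virama (U+1039),
# followed by a base consonant (U+1000-U+1021).
KINZI = re.compile("\u1004\u103A\u1039(?=[\u1000-\u1021])")


def has_kinzi(text: str) -> bool:
    """Return True if the text contains at least one kinzi sequence."""
    return KINZI.search(text) is not None


# အင်္ဂလိပ် ("English"): a, nga, asat, virama, ga, la, vowel sign i, pa, asat
with_kinzi = "\u1021\u1004\u103A\u1039\u1002\u101C\u102D\u1015\u103A"
# မြန်မာ ("Myanmar"): no stacking at all
without_kinzi = "\u1019\u103C\u1014\u103A\u1019\u102C"
```
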
Encoding & Normalization
| Term | Definition |
|---|---|
| Unicode | International standard for text encoding. The main Myanmar block is U+1000–U+109F; further characters live in the Myanmar Extended-A and Extended-B blocks. |
| Myanmar Extended-A | Unicode block U+AA60–U+AA7F containing additional characters for Shan and other languages. |
| Myanmar Extended-B | Unicode block U+A9E0–U+A9FF containing additional characters for Shan Pali and Tai Laing. |
| NFC (Normalization Form C) | Unicode normalization form that applies canonical decomposition followed by canonical composition, so that equivalent sequences end up with the same code points. Recommended for Myanmar text. |
| Normalization | Process of converting text to a standard form, including removing zero-width characters and applying Unicode normalization. |
| Zawgyi | Legacy font/encoding for Myanmar script that differs from Unicode. mySpellChecker can detect and convert Zawgyi text. |
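
The two normalization steps named above (dropping zero-width characters, then applying NFC) can be sketched with the standard library alone. This is an assumption-laden illustration: the `normalize` function and the `ZERO_WIDTH` set are hypothetical names, and mySpellChecker's real normalizer may do more, for example Zawgyi conversion.

```python
import unicodedata

# ZWSP, ZWNJ, ZWJ, and BOM / zero-width no-break space
ZERO_WIDTH = {"\u200B", "\u200C", "\u200D", "\uFEFF"}


def normalize(text: str) -> str:
    """Strip zero-width characters, then apply Unicode NFC."""
    stripped = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return unicodedata.normalize("NFC", stripped)


# "မာ" with a stray ZWSP in the middle and a trailing BOM
cleaned = normalize("\u1019\u200B\u102C\uFEFF")
```

Stripping before composing means an invisible character sitting between a base character and a combining mark cannot block composition.
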
Validation Pipeline
Pipeline Layers
| Term | Definition |
|---|---|
| Syllable-First Approach | Validate at the syllable level first, since syllables can be identified without a dictionary, then move to word and context levels. |
| Validation Level | Configuration option specifying depth of checking: SYLLABLE or WORD (defined in ValidationLevel enum). |
| Word Validation | Layer 2 of the validation pipeline that checks words against the dictionary and generates suggestions. |
| Grammar Checking | Layer 2.5 validation that applies syntactic rules to detect grammatical errors. |
| Context Validation | Layer 3 of the validation pipeline that uses N-gram probabilities to detect real-word errors. |
| Semantic Validation | Optional deep validation using neural network models to understand meaning. |
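
To show why syllables can be identified without a dictionary, here is a deliberately simplified rule-based splitter. It is a sketch of the general idea only, not the segmenter mySpellChecker uses: it starts a new syllable before any base consonant unless that consonant is stacked (preceded by virama) or acts as a final (followed by asat or virama). Independent vowels, digits, and punctuation are not handled, and `split_syllables` is a hypothetical name.

```python
import re
from typing import List

# Zero-width break point: not preceded by virama (U+1039), and followed by a
# base consonant that is not itself followed by asat (U+103A) or virama.
_BREAK = re.compile(r"(?<!\u1039)(?=[\u1000-\u1021](?![\u103A\u1039]))")


def split_syllables(text: str) -> List[str]:
    """Split Myanmar text into syllables using character rules only."""
    # re.split accepts empty matches on Python 3.7+; drop the empty leading piece.
    return [part for part in _BREAK.split(text) if part]


# မြန်မာ splits into မြန် + မာ
myanmar = split_syllables("\u1019\u103C\u1014\u103A\u1019\u102C")
# အင်္ဂလိပ် keeps the kinzi stack with the first syllable: အင်္ဂ + လိပ်
english = split_syllables("\u1021\u1004\u103A\u1039\u1002\u101C\u102D\u1015\u103A")
```
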
Error Types
| Term | Definition |
|---|---|
| Real-Word Error | A spelling error where the misspelled word is itself a valid word but wrong in context. |
Algorithms
Spelling Correction
| Term | Definition |
|---|---|
| SymSpell | Algorithm for extremely fast spelling correction using symmetric delete operations. |
| Edit Distance | The minimum number of single-character operations needed to transform one string into another. Which operations count depends on the metric (see Levenshtein and Damerau-Levenshtein below). |
| Levenshtein Distance | Edit distance metric that counts single-character insertions, deletions, and substitutions. |
| Damerau-Levenshtein Distance | Edit distance metric that extends Levenshtein distance by counting a transposition of two adjacent characters as a single operation. Used for generating spelling suggestions. |
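
The sketch below illustrates two of these terms. `osa_distance` computes the optimal string alignment variant of Damerau-Levenshtein distance (the restricted form, in which no substring is edited more than once), and `deletes` generates the delete-only variants that a SymSpell-style index is built from. Both names are illustrative and neither is the API of the SymSpell library or of mySpellChecker.

```python
from typing import Set


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def deletes(word: str, max_distance: int = 1) -> Set[str]:
    """Every string reachable by deleting up to max_distance characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results
```

The "symmetric delete" idea is that the same `deletes` function is applied to dictionary words ahead of time and to the input word at lookup time. Candidates are then found by set intersection instead of by enumerating insertions and substitutions.
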
Context & Tagging
| Term | Definition |
|---|---|
| N-gram | A contiguous sequence of N items (syllables or words). Used in context validation. |
| Bigram | A sequence of two consecutive tokens (syllables or words) used for context analysis. |
| Trigram | A sequence of three consecutive tokens used for context analysis. |
| Part-of-Speech (POS) Tagging | Process of marking words with their grammatical category (noun, verb, particle, etc.). |
| Viterbi Algorithm | Dynamic programming algorithm used for POS tagging to find the most likely sequence of tags. |
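
The following sketch makes two of these terms concrete: extracting n-grams from a token list, and running Viterbi decoding over a toy hidden Markov model. The tag set, probabilities, and function names are invented for illustration and do not reflect mySpellChecker's models or API.

```python
from typing import Dict, List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-token windows (n=2 gives bigrams, n=3 trigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Most likely state sequence for the observations (obs must be non-empty)."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(obs[0], 0.0), None) for s in states}]
    for t in range(1, len(obs)):
        layer = {}
        for s in states:
            layer[s] = max(
                (best[t - 1][p][0] * trans_p[p][s] * emit_p[s].get(obs[t], 0.0), p)
                for p in states
            )
        best.append(layer)
    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]


# Toy model with two tags; every number here is made up.
tags = viterbi(
    obs=["dog", "runs"],
    states=["N", "V"],
    start_p={"N": 0.8, "V": 0.2},
    trans_p={"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}},
    emit_p={"N": {"dog": 0.6, "runs": 0.1}, "V": {"dog": 0.1, "runs": 0.7}},
)
```

Practical taggers usually work with log probabilities to avoid underflow on long sentences. Plain products are used here to keep the recurrence easy to read.
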
Models & Inference
| Term | Definition |
|---|---|
| ONNX | Open Neural Network Exchange format used for semantic model deployment. |
Data & Infrastructure
Storage & Processing
| Term | Definition |
|---|---|
| Dictionary Provider | Pluggable storage backend for dictionary data. Implementations include SQLite, Memory, JSON. |
| Frequency | The count of how often a word or syllable appears in a corpus. |
| Segmentation | Process of breaking text into meaningful units (syllables or words). |
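
A pluggable provider can be pictured as a small interface with interchangeable backends. The sketch below is hypothetical throughout: the `DictionaryProvider` protocol, its method names, and `MemoryProvider` are assumptions made for illustration and are not mySpellChecker's actual classes.

```python
from collections import Counter
from typing import Iterable, Protocol


class DictionaryProvider(Protocol):
    """Hypothetical interface that any storage backend would satisfy."""

    def contains(self, word: str) -> bool: ...

    def frequency(self, word: str) -> int: ...


class MemoryProvider:
    """In-memory backend that derives frequencies from a token stream."""

    def __init__(self, corpus_tokens: Iterable[str]) -> None:
        self._counts = Counter(corpus_tokens)

    def contains(self, word: str) -> bool:
        return word in self._counts

    def frequency(self, word: str) -> int:
        # Counter returns 0 for unseen keys without storing them.
        return self._counts[word]


def is_known(provider: DictionaryProvider, word: str) -> bool:
    """Callers depend on the interface only, so backends can be swapped."""
    return provider.contains(word)


provider = MemoryProvider(["\u1019\u102C", "\u1019\u102C", "\u1000"])
```
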
Development Tools
| Term | Definition |
|---|---|
| Cython | A programming language that makes writing C extensions for Python easy. Used in mySpellChecker for performance-critical paths. |
| OpenMP | API for shared-memory parallel programming in C, C++, and Fortran. Used in Cython extensions for batch processing. |