Overview
Problem Statement
Myanmar word segmentation can produce invalid splits. Segmentation repair addresses them by:
- Identifying invalid syllable fragments
- Merging fragments with preceding words
- Validating merged results
SegmentationRepair Class
Repair Algorithm
Detection Criteria
A token is identified as a fragment if:
- Invalid Start: Starts with a dependent character (not consonant/vowel/numeral)
- Suspicious Fragment: In the known problematic fragments list
- Invalid Syllable: Fails syllable structure validation
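The criteria above can be sketched as a single predicate. This is a minimal illustration, not the library's implementation: the names `is_fragment`, `starts_validly`, and `crude_syllable_check` are hypothetical, the fragment set holds only the three entries listed under Suspicious Fragments, and the sketch assumes that meeting any one criterion is enough to flag a token.

```python
from typing import Callable

ASAT = "\u103A"

# The three fragments listed under "Suspicious Fragments", written as escapes
# so the combining marks stay visible in source code.
SUSPICIOUS_FRAGMENTS = frozenset({
    "\u1004\u103A\u1038",  # nga + asat + visarga
    "\u1004\u1037\u103A",  # nga + dot below + asat
    "\u1014\u1037\u103A",  # na + dot below + asat
})


def starts_validly(token: str) -> bool:
    """True if the first character is a letter or digit, not a dependent sign."""
    if not token:
        return False
    cp = ord(token[0])
    return (
        0x1000 <= cp <= 0x102A     # consonants and independent vowels
        or cp == 0x103F            # great sa
        or 0x1040 <= cp <= 0x1049  # Myanmar digits
    )


def crude_syllable_check(token: str) -> bool:
    """Stand-in validator (an assumption, far simpler than real syllable rules):
    rejects a token whose first letter is immediately followed by asat."""
    return len(token) < 2 or token[1] != ASAT


def is_fragment(
    token: str,
    is_valid_syllable: Callable[[str], bool] = crude_syllable_check,
) -> bool:
    if not starts_validly(token):          # Invalid Start
        return True
    if token in SUSPICIOUS_FRAGMENTS:      # Suspicious Fragment
        return True
    return not is_valid_syllable(token)    # Invalid Syllable
```

A real validator would be passed in through `is_valid_syllable`; the default here exists only so the sketch runs on its own.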
Merge Rules
Fragments are merged with the previous token if all of these conditions are met (this logic is inline in repair(), not a separate method):
- Merge wouldn’t create a double-ending pattern (checked via _would_create_double_ending())
- Previous token is not “closed” (doesn’t end with း, ့, ်)
- Merged result passes syllable validation
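The three merge conditions can be sketched as a single loop. This is a hedged illustration rather than the actual body of repair(): the function names, the callable parameters, and especially the regex used to approximate a "double ending" (a killed final consonant at the end of the previous token followed by another at the start of the fragment) are assumptions made for this example.

```python
import re
from typing import Callable, List

CLOSING_CHARS = ("\u1038", "\u1037", "\u103A")  # visarga, dot below, asat

# Assumed shape of a syllable-final: consonant, optional dot below, asat,
# then optional tone marks.
_FINAL = "[\u1000-\u1021]\u1037?\u103A[\u1037\u1038]*"
_ENDS_WITH_FINAL = re.compile(_FINAL + r"\Z")
_STARTS_WITH_FINAL = re.compile(_FINAL)


def would_create_double_ending(prev: str, fragment: str) -> bool:
    """True if the merge would put two killed finals back to back."""
    return bool(_ENDS_WITH_FINAL.search(prev)) and bool(
        _STARTS_WITH_FINAL.match(fragment)
    )


def repair(
    tokens: List[str],
    is_fragment: Callable[[str], bool],
    is_valid_syllable: Callable[[str], bool] = lambda s: True,
) -> List[str]:
    repaired: List[str] = []
    for tok in tokens:
        if repaired and is_fragment(tok):
            prev = repaired[-1]
            merged = prev + tok
            if (
                not would_create_double_ending(prev, tok)
                and not prev.endswith(CLOSING_CHARS)  # previous token not closed
                and is_valid_syllable(merged)
            ):
                repaired[-1] = merged
                continue
        # Not a fragment, or a merge condition failed: keep the token as-is.
        repaired.append(tok)
    return repaired


# Usage with a toy fragment detector (hypothetical).
fragments = {"\u1004\u103A\u1038"}
tokens = ["\u1000\u103B\u1031\u102C", "\u1004\u103A\u1038"]
print(repair(tokens, is_fragment=lambda t: t in fragments))
```

Passing the detector and validator in as callables keeps the merge logic independent of how fragments are recognized, which makes each condition easy to test in isolation.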
Double-Ending Prevention
Prevents invalid merges like “တွင်” + “င်း” → “တွင်င်း”.
Examples
Basic Repair
Preserving Valid Tokens
Numeral Handling
Closed Syllable Protection
Integration
With Data Pipeline
With SpellChecker
Segmentation repair is handled internally by the data pipeline’s CorpusSegmenter.
It is not a user-configurable option in SpellCheckerConfig. The repair logic runs
automatically during dictionary building to fix common segmentation errors.
Suspicious Fragments
Commonly misidentified fragments:
| Fragment | Source | Correct Form |
|---|---|---|
| င်း | Part of ောင်း, ိုင်း | ကျောင်း, တိုင်း |
| င့် | Part of ောင့် | ကြောင့် |
| န့် | Part of various syllables | Context-dependent |
Closing Characters
Characters that indicate a syllable is complete:
| Character | Name | Unicode |
|---|---|---|
| း | Visarga | U+1038 |
| ့ | Dot below | U+1037 |
| ် | Asat | U+103A |
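Because these are combining marks that are hard to tell apart in source code, it can help to define them by code point and confirm them against the Unicode database. The snippet below is a small sketch using only the standard library; the `is_closed` helper is a hypothetical name, not part of the documented API.

```python
import unicodedata

# Closing characters from the table above, keyed by escape sequence.
CLOSING_CHARS = {
    "\u1038": "Visarga",
    "\u1037": "Dot below",
    "\u103A": "Asat",
}


def is_closed(token: str) -> bool:
    """True if the token ends with one of the closing characters."""
    return bool(token) and token[-1] in CLOSING_CHARS


# Print each code point with its official Unicode character name.
for ch in CLOSING_CHARS:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```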
Performance
| Operation | Time | Notes |
|---|---|---|
| Repair (10 tokens) | <1ms | Typical sentence |
| Repair (100 tokens) | ~5ms | Long paragraph |
| Validation per token | <0.1ms | Cached regex |
Cython Optimization
For high-performance scenarios, use the Cython version.
See Also
- Segmenters - Text segmentation
- Syllable Validation - Syllable rules
- Batch Processing - Parallel processing
- Data Pipeline - Full pipeline guide