Overview
Problem Statement
Myanmar word segmentation can produce invalid splits. Repairing them involves:
- Identifying invalid syllable fragments
- Merging fragments with preceding words
- Validating merged results
SegmentationRepair Class
Repair Algorithm
Detection Criteria
A token is identified as a fragment if:
- Invalid Start: Starts with a dependent character (not consonant/vowel/numeral)
- Suspicious Fragment: In the known problematic fragments list
- Invalid Syllable: Fails syllable structure validation
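The three checks above can be sketched in Python as follows. This is a minimal illustration, not the library's code: `is_fragment`, `DEPENDENT_RANGE`, and the `is_valid_syllable` callback are hypothetical names, the dependent-character range (U+102B to U+103E) is an assumption about what "dependent character" covers, and treating the criteria as "any one is enough" is also an assumption.

```python
from typing import Callable

# Assumption: a "dependent character" is a Myanmar dependent vowel sign,
# tone mark, virama/asat, or medial, i.e. U+102B through U+103E.
DEPENDENT_RANGE = range(0x102B, 0x103F)

# The fragments listed under "Suspicious Fragments", written as escapes
# (the code point order within each fragment is assumed).
SUSPICIOUS_FRAGMENTS = frozenset({
    "\u1004\u103a\u1038",  # nga + asat + visarga
    "\u1004\u1037\u103a",  # nga + dot below + asat
    "\u1014\u1037\u103a",  # na + dot below + asat
})


def is_fragment(token: str, is_valid_syllable: Callable[[str], bool]) -> bool:
    """Return True if any of the three detection criteria applies."""
    if not token:
        return False
    if ord(token[0]) in DEPENDENT_RANGE:  # invalid start
        return True
    if token in SUSPICIOUS_FRAGMENTS:     # suspicious fragment
        return True
    return not is_valid_syllable(token)   # invalid syllable
```

The syllable validator is passed in as a callback because its rules live elsewhere (see Syllable Validation); any function that takes a string and returns a bool fits.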
Merge Rules
Fragments are merged with the previous token if all of these conditions are met (this logic is inline in repair(), not a separate method):
- Merge wouldn’t create a double-ending pattern (checked via _would_create_double_ending())
- Previous token is not “closed” (doesn’t end with း, ့, ်)
- Merged result passes syllable validation
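As a rough sketch of how these three conditions might combine, the Python below walks a token list and merges each fragment into its predecessor only when every condition holds. The names `repair_tokens` and `would_create_double_ending` are hypothetical stand-ins, not the library's `repair()` or `_would_create_double_ending()`. The definition of a double ending used here (the previous token ends in consonant + asat and the fragment starts with consonant + asat) is an assumption, and fragment detection and syllable validation are supplied as callbacks.

```python
import re
from typing import Callable, List

CLOSING_CHARS = ("\u1038", "\u1037", "\u103a")  # visarga, dot below, asat

# Assumption: a syllable "final" is consonant (+ optional dot below) + asat.
_FINAL = "[\u1000-\u1021]\u1037?\u103a"
_ENDS_WITH_FINAL = re.compile(_FINAL + "[\u1037\u1038]*$")
_STARTS_WITH_FINAL = re.compile(_FINAL)


def would_create_double_ending(prev: str, fragment: str) -> bool:
    """True if prev already ends in a final and fragment begins with one."""
    return bool(_ENDS_WITH_FINAL.search(prev)
                and _STARTS_WITH_FINAL.match(fragment))


def repair_tokens(tokens: List[str],
                  is_fragment: Callable[[str], bool],
                  is_valid_syllable: Callable[[str], bool]) -> List[str]:
    """Merge each fragment into the previous token when all rules allow it."""
    repaired: List[str] = []
    for token in tokens:
        if repaired and is_fragment(token):
            prev = repaired[-1]
            merged = prev + token
            if (not would_create_double_ending(prev, token)
                    and not prev.endswith(CLOSING_CHARS)
                    and is_valid_syllable(merged)):
                repaired[-1] = merged
                continue
        # Not a fragment, or a rule blocked the merge: keep the token as is.
        repaired.append(token)
    return repaired
```

A fragment that cannot be merged is kept as its own token in this sketch; what the real implementation does in that case is not stated here.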
Double-Ending Prevention
Prevents invalid merges like “တွင်” + “င်း” → “တွင်င်း”.
Examples
Basic Repair
Preserving Valid Tokens
Numeral Handling
Closed Syllable Protection
Integration
With Worker Context
With Data Pipeline
With SpellChecker
Segmentation repair is handled internally by the data pipeline’s PipelineSegmenter.
It is not a user-configurable option in SpellCheckerConfig. The repair logic runs
automatically during dictionary building to fix common segmentation errors.
Suspicious Fragments
Commonly misidentified fragments:
| Fragment | Source | Correct Form |
|---|---|---|
| င်း | Part of ောင်း, ိုင်း | ကျောင်း, တိုင်း |
| င့် | Part of ောင့် | ကြောင့် |
| န့် | Part of various syllables | Context-dependent |
Closing Characters
Characters that indicate a syllable is complete:
| Character | Name | Unicode |
|---|---|---|
| း | Visarga | U+1038 |
| ့ | Dot below | U+1037 |
| ် | Asat | U+103A |
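The code points in this table can be checked against Python's standard `unicodedata` module, and the "closed" test reduces to a suffix check. The snippet below is an illustrative sketch; `CLOSING_CHARS` and `is_closed` are hypothetical names, not part of the library.

```python
import unicodedata

# Visarga, dot below, asat, as listed in the table above.
CLOSING_CHARS = ("\u1038", "\u1037", "\u103a")


def is_closed(token: str) -> bool:
    """True if the token ends with one of the closing characters."""
    return token.endswith(CLOSING_CHARS)


# Print each code point with its official Unicode character name.
for ch in CLOSING_CHARS:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

`str.endswith` accepts a tuple of suffixes, so no loop or regex is needed for the check itself.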
Performance
| Operation | Time | Notes |
|---|---|---|
| Repair (10 tokens) | <1ms | Typical sentence |
| Repair (100 tokens) | ~5ms | Long paragraph |
| Validation per token | <0.1ms | Cached regex |
Cython Optimization
For high-performance scenarios, use the Cython version.
See Also
- Segmenters - Text segmentation
- Syllable Validation - Syllable rules
- Worker Context - Multiprocessing context
- Data Pipeline - Full pipeline guide