Word segmenters occasionally split a syllable across two tokens — for example, breaking “ကျောင်း” into [“ကျော”, “င်း”]. This module detects such fragments by checking syllable validity, then merges them back with the preceding token while guarding against double-ending artifacts.
Overview
from myspellchecker.data_pipeline.repair import SegmentationRepair
repairer = SegmentationRepair()
# Fix broken segmentation
tokens = ["ကျော", "င်း", "သည်"]
repaired = repairer.repair(tokens)
print(repaired) # ["ကျောင်း", "သည်"]
Problem Statement
Myanmar word segmentation can produce invalid splits:
Input text: "ကျောင်းသည်"
Bad segmentation: ["ကျော", "င်း", "သည်"]
(the first two tokens are fragments of the single syllable "ကျောင်း")
Good segmentation: ["ကျောင်း", "သည်"]
The repair module detects and fixes these splits by:
- Identifying invalid syllable fragments
- Merging fragments with preceding words
- Validating merged results
SegmentationRepair Class
class SegmentationRepair:
    """Repairs invalid word segmentations by merging broken syllables."""

    def __init__(self, allow_extended_myanmar: bool = False):
        # NOTE: the parameter is needed by the call below; its default value
        # is assumed here, since the excerpt does not show it.
        self.validator = SyllableRuleValidator(allow_extended_myanmar=allow_extended_myanmar)
        # Pattern for valid syllable starters
        self.valid_start_pattern = re.compile(
            rf"^[{''.join(CONSONANTS)}{''.join(INDEPENDENT_VOWELS)}{''.join(MYANMAR_NUMERALS)}]"
        )
        # Pattern for Myanmar numerals (skip validation)
        self.numeral_start_pattern = re.compile(
            rf"^[{''.join(MYANMAR_NUMERALS)}]"
        )
        # Known problematic fragments
        self.suspicious_fragments = {
            "င်း",  # Part of ောင်း, ိုင်း
            "င့်",  # Part of ောင့်
            "န့်",  # Part of various syllables
        }
        # Characters indicating closed syllable
        self.closing_chars = {"း", "့", "်"}
        # Fragments causing double-ending
        self.double_ending_fragments = {"င်း", "င့်", "န့်"}
Repair Algorithm
Detection Criteria
A token is identified as a fragment if:
- Invalid Start: Starts with a dependent character (anything other than a consonant, independent vowel, or numeral)
- Suspicious Fragment: In the known problematic fragments list
- Invalid Syllable: Fails syllable structure validation
def repair(self, tokens: List[str]) -> List[str]:
    """Repair a list of tokens by merging invalid fragments."""
    if not tokens:
        return []
    repaired = [tokens[0]]
    for i in range(1, len(tokens)):
        current_token = tokens[i]
        prev_token = repaired[-1]
        # Check if current token is a fragment
        is_invalid_start = not self.valid_start_pattern.match(current_token)
        is_suspicious = current_token in self.suspicious_fragments
        is_numeral = self.numeral_start_pattern.match(current_token)
        # Skip syllable validation for numerals
        is_invalid_syllable = False
        if not is_invalid_start and not is_suspicious and not is_numeral:
            is_invalid_syllable = not self.validator.validate(current_token)
        if is_invalid_start or is_suspicious or is_invalid_syllable:
            # Attempt merge with previous token
            # ... (merge logic)
        else:
            # Valid token, keep separate
            repaired.append(current_token)
    return repaired
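The first two detection criteria can be tried in isolation. The sketch below is a minimal approximation, not the library's code: the Unicode ranges are assumed stand-ins for the `CONSONANTS`, `INDEPENDENT_VOWELS` and `MYANMAR_NUMERALS` constants (the real constants may differ), `start_checks` is a hypothetical helper name, and the syllable validator is left out entirely.

```python
import re

# Assumed stand-ins for CONSONANTS (U+1000-U+1021), INDEPENDENT_VOWELS
# (U+1023-U+102A) and MYANMAR_NUMERALS (U+1040-U+1049); the library's own
# constants may cover a different set of characters.
VALID_START = re.compile(r"^[\u1000-\u1021\u1023-\u102A\u1040-\u1049]")

# Same three fragments as the suspicious_fragments set shown above.
SUSPICIOUS_FRAGMENTS = {"င်း", "င့်", "န့်"}


def start_checks(token: str) -> dict:
    """Report the regex-based and set-based detection criteria for one token."""
    return {
        "invalid_start": VALID_START.match(token) is None,
        "suspicious": token in SUSPICIOUS_FRAGMENTS,
    }


# A bare anusvara is a dependent sign, so it fails the start pattern.
print(start_checks("\u1036"))
# Every suspicious fragment begins with a consonant, so only the set catches it.
for fragment in sorted(SUSPICIOUS_FRAGMENTS):
    print(fragment, start_checks(fragment))
# A numeral is a valid start and is not in the set.
print(start_checks("\u1041"))
```

This illustrates why the explicit fragment set exists: a token such as "င်း" begins with a consonant and therefore passes the start pattern, so a second check is needed to flag it.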
Merge Rules
Fragments are merged with the previous token if all of these conditions are met
(this logic is inline in repair(), not a separate method):
- Merge wouldn’t create a double-ending pattern (checked via
_would_create_double_ending())
- Previous token is not “closed” (doesn’t end with း, ့, ်)
- Merged result passes syllable validation
# Inline merge logic in repair() (simplified):
if self._would_create_double_ending(prev_token, current_token):
    repaired.append(current_token)  # Reject merge
    continue
prev_is_closed = prev_token and prev_token[-1] in self.closing_chars
if prev_is_closed:
    repaired.append(current_token)  # Reject merge
    continue
candidate = prev_token + current_token
if self.validator.validate(candidate):
    repaired[-1] = candidate  # Accept merge
else:
    repaired.append(current_token)  # Reject merge
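To see detection and merging work end to end without installing the library, the loop can be approximated as a standalone function. This is a sketch under stated assumptions: the Unicode ranges stand in for the library's character constants, the validator is passed in as a callable, and `toy_is_valid` with its `KNOWN_VALID` lookup is a hypothetical substitute for `SyllableRuleValidator`. The `_would_create_double_ending()` check is folded into the closed-token test, because as documented it only fires when the previous token already ends in a closing character.

```python
import re
from typing import Callable, List

# Assumed stand-ins for the library's character constants.
VALID_START = re.compile(r"^[\u1000-\u1021\u1023-\u102A\u1040-\u1049]")
NUMERAL_START = re.compile(r"^[\u1040-\u1049]")
CLOSING_CHARS = {"\u1038", "\u1037", "\u103A"}  # visarga, dot below, asat
SUSPICIOUS = {"င်း", "င့်", "န့်"}


def repair(tokens: List[str], is_valid: Callable[[str], bool]) -> List[str]:
    """Merge fragment tokens into the preceding token when the rules allow it."""
    if not tokens:
        return []
    repaired = [tokens[0]]
    for token in tokens[1:]:
        prev = repaired[-1]
        invalid_start = VALID_START.match(token) is None
        suspicious = token in SUSPICIOUS
        numeral = NUMERAL_START.match(token) is not None
        # The validator runs only when the cheaper checks did not decide.
        invalid_syllable = (
            not (invalid_start or suspicious or numeral) and not is_valid(token)
        )
        if not (invalid_start or suspicious or invalid_syllable):
            repaired.append(token)  # not a fragment, keep separate
            continue
        # Fragment: merge only into an open previous token, and only if the
        # merged string validates.
        prev_closed = bool(prev) and prev[-1] in CLOSING_CHARS
        candidate = prev + token
        if not prev_closed and is_valid(candidate):
            repaired[-1] = candidate
        else:
            repaired.append(token)
    return repaired


# Hypothetical validator: a lookup of strings assumed valid for this demo.
tokens = ["ကျော", "င်း", "သည်"]
closed_case = ["ပြီး", "င်း"]
KNOWN_VALID = {tokens[0], tokens[0] + tokens[1], tokens[2], closed_case[0]}


def toy_is_valid(text: str) -> bool:
    return text in KNOWN_VALID


print(repair(tokens, toy_is_valid))       # fragment merges into the open token
print(repair(closed_case, toy_is_valid))  # previous token is closed, no merge
```

Passing the validator as a parameter keeps the sketch independent of the real syllable rules; swapping in the library's validator should only require wrapping its `validate` method.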
Double-Ending Prevention
Prevents invalid merges like “တွင်” + “င်း” → “တွင်င်း”:
def _would_create_double_ending(self, prev_token: str, current_token: str) -> bool:
    """Check if merging would create invalid double-ending."""
    if not prev_token or not current_token:
        return False
    # Closed syllable + double-ending fragment = invalid
    if prev_token[-1] in self.closing_chars:
        if current_token in self.double_ending_fragments:
            return True
    return False
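The check is easiest to follow as a small case table. The sketch below re-implements it as a free function (the name `would_create_double_ending` and the `CASES` list are illustrative, not part of the library) and runs it over the combinations of open or closed previous token and fragment or non-fragment current token.

```python
CLOSING_CHARS = {"\u1038", "\u1037", "\u103A"}  # visarga, dot below, asat
FRAGMENT = "င်း"
DOUBLE_ENDING_FRAGMENTS = {FRAGMENT, "င့်", "န့်"}


def would_create_double_ending(prev_token: str, current_token: str) -> bool:
    """True only for a closed previous token followed by a known fragment."""
    if not prev_token or not current_token:
        return False
    return (
        prev_token[-1] in CLOSING_CHARS
        and current_token in DOUBLE_ENDING_FRAGMENTS
    )


CLOSED_PREV = "တွင်"  # ends in asat
OPEN_PREV = "ကျော"    # ends in a dependent vowel sign

CASES = [
    (CLOSED_PREV, FRAGMENT),  # closed + fragment: would double the ending
    (OPEN_PREV, FRAGMENT),    # open + fragment: a merge may proceed
    (CLOSED_PREV, "သည်"),      # closed + ordinary token: not this check's concern
    ("", FRAGMENT),           # empty previous token: nothing to merge into
]

for prev, cur in CASES:
    print(repr(prev), repr(cur), would_create_double_ending(prev, cur))
```

Only the first case is rejected by this check; the remaining cases are left to the closed-syllable test and the validator.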
Examples
Basic Repair
repairer = SegmentationRepair()
# Fix broken syllable
tokens = ["ကျော", "င်း"]
result = repairer.repair(tokens)
# Result: ["ကျောင်း"]
# Multiple fragments
tokens = ["မြန်", "မာ", "နိုင်", "င", "ံ"]
result = repairer.repair(tokens)
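# Result under the rules above: ["မြန်", "မာ", "နိုင်", "ငံ"]
# ("ံ" merges into the open "င"; "နိုင်" is closed, so nothing merges into it)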
Preserving Valid Tokens
# Valid tokens are preserved
tokens = ["ကျောင်း", "သည်", "ကောင်းသည်"]
result = repairer.repair(tokens)
# Result: ["ကျောင်း", "သည်", "ကောင်းသည်"] (unchanged)
Numeral Handling
# Numerals are not merged
tokens = ["၁", "၂", "၃"]
result = repairer.repair(tokens)
# Result: ["၁", "၂", "၃"] (unchanged)
# Mixed text and numerals
tokens = ["အမှတ်", "၁", "၀", "၀"]
result = repairer.repair(tokens)
# Result: ["အမှတ်", "၁", "၀", "၀"]
Closed Syllable Protection
# Don't merge with closed syllables
tokens = ["ပြီး", "င်း"]
result = repairer.repair(tokens)
# Result: ["ပြီး", "င်း"] (not merged - "ပြီး" is closed)
# Valid merge with open syllable
tokens = ["ကျော", "င်း"]
result = repairer.repair(tokens)
# Result: ["ကျောင်း"] (merged - "ကျော" is open)
Integration
With Data Pipeline
from typing import List

from myspellchecker.data_pipeline.repair import SegmentationRepair
from myspellchecker.segmenters import DefaultSegmenter

def process_corpus(lines: List[str]) -> List[List[str]]:
    """Process corpus with segmentation repair."""
    segmenter = DefaultSegmenter()
    repairer = SegmentationRepair()
    results = []
    for line in lines:
        # Segment each line, then merge any broken syllable fragments
        words = segmenter.segment_words(line)
        repaired = repairer.repair(words)
        results.append(repaired)
    return results
With SpellChecker
Segmentation repair is handled internally by the data pipeline’s CorpusSegmenter.
It is not a user-configurable option in SpellCheckerConfig. The repair logic runs
automatically during dictionary building to fix common segmentation errors.
# Repair is used internally during pipeline builds:
from myspellchecker.data_pipeline.segmenter import CorpusSegmenter
segmenter = CorpusSegmenter(word_engine="myword")
# CorpusSegmenter applies repair logic internally during segmentation
Suspicious Fragments
Commonly misidentified fragments:
| Fragment | Source | Correct Form |
|---|---|---|
| င်း | Part of ောင်း, ိုင်း | ကျောင်း, တိုင်း |
| င့် | Part of ောင့် | ကြောင့် |
| န့် | Part of various | dependent |
Closing Characters
Characters that indicate a syllable is complete:
| Character | Name | Unicode |
|---|---|---|
| း | Visarga | U+1038 |
| ့ | Dot below | U+1037 |
| ် | Asat | U+103A |
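The code points in the table can be confirmed with Python's standard `unicodedata` module. The `is_closed` helper below is a hypothetical name that mirrors the closed-syllable test described earlier; it is a sketch, not the library's own function.

```python
import unicodedata

# Visarga, dot below, asat, written as escapes so the code points are explicit.
CLOSING_CHARS = ("\u1038", "\u1037", "\u103A")

for ch in CLOSING_CHARS:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))


def is_closed(token: str) -> bool:
    """True if the token's last character is one of the closing characters."""
    return bool(token) and token[-1] in CLOSING_CHARS


print(is_closed("\u1015\u103c\u102e\u1038"))  # "ပြီး" ends in visarga
print(is_closed("\u1000\u103b\u1031\u102c"))  # "ကျော" ends in a vowel sign
```

Only the final character is inspected, so a token counts as closed regardless of what precedes that character.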
Performance
| Operation | Time | Notes |
|---|---|---|
| Repair (10 tokens) | <1ms | Typical sentence |
| Repair (100 tokens) | ~5ms | Long paragraph |
| Validation per token | <0.1ms | Cached regex |
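Timings like these depend on hardware and on the tokens involved, so it may be worth measuring locally. The harness below is a rough sketch: `time_repair` is a hypothetical helper, and when the library cannot be imported it falls back to `list` as a placeholder callable so the script still runs (in that case the numbers say nothing about the real repairer).

```python
import time
from typing import Callable, List


def time_repair(
    repair_fn: Callable[[List[str]], List[str]],
    tokens: List[str],
    repeats: int = 1000,
) -> float:
    """Return the mean wall-clock time per call, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        repair_fn(tokens)
    return (time.perf_counter() - start) * 1000 / repeats


try:
    from myspellchecker.data_pipeline.repair import SegmentationRepair

    repair_fn = SegmentationRepair().repair
except ImportError:
    repair_fn = list  # placeholder only; not a real repair implementation

sentence = ["ကျော", "င်း", "သည်"]
for size in (10, 100):
    sample = (sentence * size)[:size]  # build a token list of the given length
    print(f"{size} tokens: {time_repair(repair_fn, sample):.4f} ms per call")
```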
Cython Optimization
For high-performance scenarios, use the Cython version:
try:
    from myspellchecker.data_pipeline.repair_c import CythonSegmentationRepair as SegmentationRepair
except ImportError:
    from myspellchecker.data_pipeline.repair import SegmentationRepair
# Same API, faster execution
repairer = SegmentationRepair()
repaired = repairer.repair(tokens)
See Also