Word segmenters occasionally split a syllable across two tokens — for example, breaking “ကျောင်း” into [“ကျော”, “င်း”]. This module detects such fragments by checking syllable validity, then merges them back with the preceding token while guarding against double-ending artifacts.

Overview

from myspellchecker.data_pipeline.repair import SegmentationRepair

repairer = SegmentationRepair()

# Fix broken segmentation
tokens = ["ကျော", "င်း", "သည်"]
repaired = repairer.repair(tokens)
print(repaired)  # ["ကျောင်း", "သည်"]

Problem Statement

Myanmar word segmentation can produce invalid splits:
Input text: "ကျောင်းသည်"
Bad segmentation: ["ကျော", "င်း", "သည်"]
                    ↑       ↑
                    Broken syllable "ကျောင်း"

Good segmentation: ["ကျောင်း", "သည်"]
The repair module detects and fixes these splits by:
  1. Identifying invalid syllable fragments
  2. Merging fragments with preceding words
  3. Validating merged results
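The three steps above can be sketched as a minimal loop. This is a toy illustration with a hypothetical `naive_repair` helper and a lexicon standing in for real syllable validation; the actual logic lives in `SegmentationRepair` below.

```python
# Minimal sketch of detect -> merge -> validate, using a toy validity
# check instead of the real SyllableRuleValidator.
def naive_repair(tokens, is_valid_syllable):
    repaired = [tokens[0]]
    for tok in tokens[1:]:
        # 1. Identify invalid syllable fragments
        if not is_valid_syllable(tok):
            candidate = repaired[-1] + tok
            # 2. Merge with the preceding word; 3. keep only if valid
            if is_valid_syllable(candidate):
                repaired[-1] = candidate
                continue
        repaired.append(tok)
    return repaired

# Toy lexicon standing in for real syllable validation
valid = {"ကျော", "ကျောင်း", "သည်"}
print(naive_repair(["ကျော", "င်း", "သည်"], lambda t: t in valid))
# → ['ကျောင်း', 'သည်']
```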

SegmentationRepair Class

import re
from typing import List

# Note: CONSONANTS, INDEPENDENT_VOWELS, MYANMAR_NUMERALS and
# SyllableRuleValidator are assumed to come from the package's
# internal character-data and validation modules.
class SegmentationRepair:
    """Repairs invalid word segmentations by merging broken syllables."""

    def __init__(self):
        self.validator = SyllableRuleValidator()

        # Pattern for valid syllable starters
        self.valid_start_pattern = re.compile(
            rf"^[{CONSONANTS}{INDEPENDENT_VOWELS}{MYANMAR_NUMERALS}]"
        )

        # Pattern for Myanmar numerals (skip validation)
        self.numeral_start_pattern = re.compile(
            rf"^[{MYANMAR_NUMERALS}]"
        )

        # Known problematic fragments
        self.suspicious_fragments = {
            "င်း",  # Part of ောင်း, ိုင်း
            "င့်",  # Part of ောင့်
            "န့်",  # Part of various endings
        }

        # Characters indicating closed syllable
        self.closing_chars = {"း", "့", "်"}

        # Fragments causing double-ending
        self.double_ending_fragments = {"င်း", "င့်", "န့်"}

Repair Algorithm

Detection Criteria

A token is identified as a fragment if:
  1. Invalid Start: Starts with a dependent character (not consonant/vowel/numeral)
  2. Suspicious Fragment: In the known problematic fragments list
  3. Invalid Syllable: Fails syllable structure validation
def repair(self, tokens: List[str]) -> List[str]:
    """Repair a list of tokens by merging invalid fragments."""
    if not tokens:
        return []

    repaired = [tokens[0]]

    for i in range(1, len(tokens)):
        current_token = tokens[i]
        prev_token = repaired[-1]

        # Check if current token is a fragment
        is_invalid_start = not self.valid_start_pattern.match(current_token)
        is_suspicious = current_token in self.suspicious_fragments
        is_numeral = self.numeral_start_pattern.match(current_token)

        # Skip syllable validation for numerals
        is_invalid_syllable = False
        if not is_invalid_start and not is_suspicious and not is_numeral:
            is_invalid_syllable = not self.validator.validate(current_token)

        if is_invalid_start or is_suspicious or is_invalid_syllable:
            # Attempt merge with previous token; the full merge
            # logic is shown under "Merge Rules" below.
            ...
        else:
            # Valid token, keep separate
            repaired.append(current_token)

    return repaired

Merge Rules

Fragments are merged with the previous token if all of these conditions are met (this logic is inline in repair(), not a separate method):
  1. Merge wouldn’t create a double-ending pattern (checked via _would_create_double_ending())
  2. Previous token is not “closed” (doesn’t end with း, ့, ်)
  3. Merged result passes syllable validation
# Inline merge logic in repair() (simplified):
if self._would_create_double_ending(prev_token, current_token):
    repaired.append(current_token)  # Reject merge
    continue

prev_is_closed = prev_token and prev_token[-1] in self.closing_chars
if prev_is_closed:
    repaired.append(current_token)  # Reject merge
    continue

candidate = prev_token + current_token
if self.validator.validate(candidate):
    repaired[-1] = candidate        # Accept merge
else:
    repaired.append(current_token)  # Reject merge

Double-Ending Prevention

Prevents invalid merges like “တွင်” + “င်း” → “တွင်င်း”:
def _would_create_double_ending(self, prev_token: str, current_token: str) -> bool:
    """Check if merging would create invalid double-ending."""
    if not prev_token or not current_token:
        return False

    # Closed syllable + double-ending fragment = invalid
    if prev_token[-1] in self.closing_chars:
        if current_token in self.double_ending_fragments:
            return True

    return False
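A standalone version of this check (constants copied from the class `__init__` above) reproduces the rejected merge:

```python
# Constants copied from SegmentationRepair.__init__ above.
CLOSING_CHARS = {"း", "့", "်"}
DOUBLE_ENDING_FRAGMENTS = {"င်း", "င့်", "န့်"}

def would_create_double_ending(prev_token: str, current_token: str) -> bool:
    """True if merging would stack two syllable endings."""
    if not prev_token or not current_token:
        return False
    return (prev_token[-1] in CLOSING_CHARS
            and current_token in DOUBLE_ENDING_FRAGMENTS)

print(would_create_double_ending("တွင်", "င်း"))   # True: "တွင်" ends with asat
print(would_create_double_ending("ကျော", "င်း"))  # False: "ကျော" is open
```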

Examples

Basic Repair

repairer = SegmentationRepair()

# Fix broken syllable
tokens = ["ကျော", "င်း"]
result = repairer.repair(tokens)
# Result: ["ကျောင်း"]

# Multiple fragments
tokens = ["မြန်", "မာ", "နိုင်", "င", "ံ"]
result = repairer.repair(tokens)
# Result: ["မြန်", "မာ", "နိုင်ငံ"]

Preserving Valid Tokens

# Valid tokens are preserved
tokens = ["ကျောင်း", "သည်", "ကောင်းသည်"]
result = repairer.repair(tokens)
# Result: ["ကျောင်း", "သည်", "ကောင်းသည်"] (unchanged)

Numeral Handling

# Numerals are not merged
tokens = ["၁", "၂", "၃"]
result = repairer.repair(tokens)
# Result: ["၁", "၂", "၃"] (unchanged)

# Mixed text and numerals
tokens = ["အမှတ်", "၁", "၀", "၀"]
result = repairer.repair(tokens)
# Result: ["အမှတ်", "၁", "၀", "၀"]

Closed Syllable Protection

# Don't merge with closed syllables
tokens = ["ပြီး", "င်း"]
result = repairer.repair(tokens)
# Result: ["ပြီး", "င်း"] (not merged - "ပြီး" is closed)

# Valid merge with open syllable
tokens = ["ကျော", "င်း"]
result = repairer.repair(tokens)
# Result: ["ကျောင်း"] (merged - "ကျော" is open)

Integration

With Worker Context

from myspellchecker.data_pipeline.worker_context import WorkerContext, set_worker_context
from myspellchecker.data_pipeline.repair import SegmentationRepair

# Set up in worker
context = WorkerContext(
    segmenter=segmenter,
    repairer=SegmentationRepair(),
)
set_worker_context(context)

# Use in processing
def process_line(line: str) -> List[str]:
    ctx = get_worker_context()
    words = ctx.segmenter.segment_words(line)
    return ctx.repairer.repair(words)

With Data Pipeline

from myspellchecker.data_pipeline.repair import SegmentationRepair

def process_corpus(lines: List[str]) -> List[List[str]]:
    """Process corpus with segmentation repair."""
    segmenter = DefaultSegmenter()
    repairer = SegmentationRepair()

    results = []
    for line in lines:
        words = segmenter.segment_words(line)
        repaired = repairer.repair(words)
        results.append(repaired)

    return results

With SpellChecker

Segmentation repair is handled internally by the data pipeline’s PipelineSegmenter. It is not a user-configurable option in SpellCheckerConfig. The repair logic runs automatically during dictionary building to fix common segmentation errors.
# Repair is used internally during pipeline builds:
from myspellchecker.data_pipeline.segmenter import PipelineSegmenter

segmenter = PipelineSegmenter(provider)
repaired = segmenter.repair_segmentation(["ကျောင်း", "သည်"])

Suspicious Fragments

Commonly misidentified fragments:
| Fragment | Source | Correct Form |
|---|---|---|
| င်း | Part of ောင်း, ိုင်း | ကျောင်း, တိုင်း |
| င့် | Part of ောင့် | ကြောင့် |
| န့် | Part of various endings | — |
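A quick membership check against this list (a sketch mirroring the `suspicious_fragments` set above, with a hypothetical `flag_suspicious` helper) flags candidate fragments before the more expensive syllable validation runs:

```python
# Mirrors SegmentationRepair.suspicious_fragments.
SUSPICIOUS_FRAGMENTS = {"င်း", "င့်", "န့်"}

def flag_suspicious(tokens):
    """Return indices of tokens matching known problematic fragments."""
    return [i for i, tok in enumerate(tokens) if tok in SUSPICIOUS_FRAGMENTS]

print(flag_suspicious(["ကျော", "င်း", "သည်"]))  # → [1]
```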

Closing Characters

Characters that indicate a syllable is complete:
| Character | Name | Unicode |
|---|---|---|
| း | Visarga | U+1038 |
| ့ | Dot below | U+1037 |
| ် | Asat | U+103A |
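These are the same characters as `self.closing_chars`; their names can be confirmed with Python's unicodedata, and a small helper (hypothetical `is_closed`, mirroring the inline `prev_is_closed` check) decides whether a token is complete:

```python
import unicodedata

CLOSING_CHARS = {"\u1038", "\u1037", "\u103A"}  # း ့ ် (same as self.closing_chars)

def is_closed(token: str) -> bool:
    """A syllable is treated as closed when it ends in one of these marks."""
    return bool(token) and token[-1] in CLOSING_CHARS

for ch in sorted(CLOSING_CHARS):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+1037 MYANMAR SIGN DOT BELOW
# U+1038 MYANMAR SIGN VISARGA
# U+103A MYANMAR SIGN ASAT

print(is_closed("ပြီး"))  # True: ends with visarga
```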

Performance

| Operation | Time | Notes |
|---|---|---|
| Repair (10 tokens) | <1ms | Typical sentence |
| Repair (100 tokens) | ~5ms | Long paragraph |
| Validation per token | <0.1ms | Cached regex |
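The figures above are indicative only; a micro-benchmark along these lines can reproduce them on your hardware. The `StubRepairer` below is a stand-in (so this run measures only loop overhead); substitute the real `SegmentationRepair` to measure actual repair cost.

```python
import timeit

class StubRepairer:
    """Stand-in for SegmentationRepair; replace with the real class."""
    def repair(self, tokens):
        return list(tokens)  # no-op in place of the real merge logic

repairer = StubRepairer()
tokens = ["ကျော", "င်း", "သည်"] * 4  # roughly a 10-token sentence

# Average wall-clock time per repair() call over 10,000 iterations.
per_call = timeit.timeit(lambda: repairer.repair(tokens), number=10_000) / 10_000
print(f"{per_call * 1e3:.4f} ms per repair() call")
```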

Cython Optimization

For high-performance scenarios, use the Cython version:
try:
    from myspellchecker.data_pipeline.repair_c import SegmentationRepair
except ImportError:
    from myspellchecker.data_pipeline.repair import SegmentationRepair

# Same API, faster execution
repairer = SegmentationRepair()
repaired = repairer.repair(tokens)

See Also