Word segmenters occasionally split a syllable across two tokens — for example, breaking “ကျောင်း” into [“ကျော”, “င်း”]. This module detects such fragments by checking syllable validity, then merges them back with the preceding token while guarding against double-ending artifacts.

Overview

from myspellchecker.data_pipeline.repair import SegmentationRepair

repairer = SegmentationRepair()

# Fix broken segmentation
tokens = ["ကျော", "င်း", "သည်"]
repaired = repairer.repair(tokens)
print(repaired)  # ["ကျောင်း", "သည်"]

Problem Statement

Myanmar word segmentation can produce invalid splits:
Input text: "ကျောင်းသည်"
Bad segmentation: ["ကျော", "င်း", "သည်"]
                    ↑       ↑
                    Broken syllable "ကျောင်း"

Good segmentation: ["ကျောင်း", "သည်"]
The repair module detects and fixes these splits by:
  1. Identifying invalid syllable fragments
  2. Merging fragments with preceding words
  3. Validating merged results
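The three steps above can be sketched as a minimal loop. This is a toy illustration with a hypothetical `naive_repair` helper and a lexicon standing in for real syllable validation; the actual logic lives in `SegmentationRepair` below.

```python
# Minimal sketch of detect -> merge -> validate, using a toy validity
# check instead of the real SyllableRuleValidator.
def naive_repair(tokens, is_valid_syllable):
    repaired = [tokens[0]]
    for tok in tokens[1:]:
        # 1. Identify invalid syllable fragments
        if not is_valid_syllable(tok):
            candidate = repaired[-1] + tok
            # 2. Merge with the preceding word; 3. keep only if valid
            if is_valid_syllable(candidate):
                repaired[-1] = candidate
                continue
        repaired.append(tok)
    return repaired

# Toy lexicon standing in for real syllable validation
valid = {"ကျော", "ကျောင်း", "သည်"}
print(naive_repair(["ကျော", "င်း", "သည်"], lambda t: t in valid))
# → ['ကျောင်း', 'သည်']
```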

SegmentationRepair Class

import re
from typing import List

# Note: CONSONANTS, INDEPENDENT_VOWELS, MYANMAR_NUMERALS and
# SyllableRuleValidator are assumed to come from the package's
# internal character-data and validation modules.
class SegmentationRepair:
    """Repairs invalid word segmentations by merging broken syllables."""

    def __init__(self):
        self.validator = SyllableRuleValidator()

        # Pattern for valid syllable starters
        self.valid_start_pattern = re.compile(
            rf"^[{CONSONANTS}{INDEPENDENT_VOWELS}{MYANMAR_NUMERALS}]"
        )

        # Pattern for Myanmar numerals (skip validation)
        self.numeral_start_pattern = re.compile(
            rf"^[{MYANMAR_NUMERALS}]"
        )

        # Known problematic fragments
        self.suspicious_fragments = {
            "င်း",  # Part of ောင်း, ိုင်း
            "င့်",  # Part of ောင့်
            "န့်",  # Part of various endings
        }

        # Characters indicating closed syllable
        self.closing_chars = {"း", "့", "်"}

        # Fragments causing double-ending
        self.double_ending_fragments = {"င်း", "င့်", "န့်"}

Repair Algorithm

Detection Criteria

A token is identified as a fragment if:
  1. Invalid Start: Starts with a dependent character (not consonant/vowel/numeral)
  2. Suspicious Fragment: In the known problematic fragments list
  3. Invalid Syllable: Fails syllable structure validation
def repair(self, tokens: List[str]) -> List[str]:
    """Repair a list of tokens by merging invalid fragments."""
    if not tokens:
        return []

    repaired = [tokens[0]]

    for i in range(1, len(tokens)):
        current_token = tokens[i]
        prev_token = repaired[-1]

        # Check if current token is a fragment
        is_invalid_start = not self.valid_start_pattern.match(current_token)
        is_suspicious = current_token in self.suspicious_fragments
        is_numeral = self.numeral_start_pattern.match(current_token)

        # Skip syllable validation for numerals
        is_invalid_syllable = False
        if not is_invalid_start and not is_suspicious and not is_numeral:
            is_invalid_syllable = not self.validator.validate(current_token)

        if is_invalid_start or is_suspicious or is_invalid_syllable:
            # Attempt merge with previous token; the full merge
            # logic is shown under "Merge Rules" below.
            ...
        else:
            # Valid token, keep separate
            repaired.append(current_token)

    return repaired

Merge Rules

Fragments are merged with the previous token if all of these conditions are met (this logic is inline in repair(), not a separate method):
  1. Merge wouldn’t create a double-ending pattern (checked via _would_create_double_ending())
  2. Previous token is not “closed” (doesn’t end with း, ့, ်)
  3. Merged result passes syllable validation
# Inline merge logic in repair() (simplified):
if self._would_create_double_ending(prev_token, current_token):
    repaired.append(current_token)  # Reject merge
    continue

prev_is_closed = prev_token and prev_token[-1] in self.closing_chars
if prev_is_closed:
    repaired.append(current_token)  # Reject merge
    continue

candidate = prev_token + current_token
if self.validator.validate(candidate):
    repaired[-1] = candidate        # Accept merge
else:
    repaired.append(current_token)  # Reject merge

Double-Ending Prevention

Prevents invalid merges like “တွင်” + “င်း” → “တွင်င်း”:
def _would_create_double_ending(self, prev_token: str, current_token: str) -> bool:
    """Check if merging would create invalid double-ending."""
    if not prev_token or not current_token:
        return False

    # Closed syllable + double-ending fragment = invalid
    if prev_token[-1] in self.closing_chars:
        if current_token in self.double_ending_fragments:
            return True

    return False
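A standalone version of this check (constants copied from the class `__init__` above) reproduces the rejected merge:

```python
# Constants copied from SegmentationRepair.__init__ above.
CLOSING_CHARS = {"း", "့", "်"}
DOUBLE_ENDING_FRAGMENTS = {"င်း", "င့်", "န့်"}

def would_create_double_ending(prev_token: str, current_token: str) -> bool:
    """True if merging would stack two syllable endings."""
    if not prev_token or not current_token:
        return False
    return (prev_token[-1] in CLOSING_CHARS
            and current_token in DOUBLE_ENDING_FRAGMENTS)

print(would_create_double_ending("တွင်", "င်း"))   # True: "တွင်" ends with asat
print(would_create_double_ending("ကျော", "င်း"))  # False: "ကျော" is open
```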

Examples

Basic Repair

repairer = SegmentationRepair()

# Fix broken syllable
tokens = ["ကျော", "င်း"]
result = repairer.repair(tokens)
# Result: ["ကျောင်း"]

# Multiple fragments
tokens = ["မြန်", "မာ", "နိုင်", "င", "ံ"]
result = repairer.repair(tokens)
# Result: ["မြန်", "မာ", "နိုင်ငံ"]

Preserving Valid Tokens

# Valid tokens are preserved
tokens = ["ကျောင်း", "သည်", "ကောင်းသည်"]
result = repairer.repair(tokens)
# Result: ["ကျောင်း", "သည်", "ကောင်းသည်"] (unchanged)

Numeral Handling

# Numerals are not merged
tokens = ["၁", "၂", "၃"]
result = repairer.repair(tokens)
# Result: ["၁", "၂", "၃"] (unchanged)

# Mixed text and numerals
tokens = ["အမှတ်", "၁", "၀", "၀"]
result = repairer.repair(tokens)
# Result: ["အမှတ်", "၁", "၀", "၀"]

Closed Syllable Protection

# Don't merge with closed syllables
tokens = ["ပြီး", "င်း"]
result = repairer.repair(tokens)
# Result: ["ပြီး", "င်း"] (not merged - "ပြီး" is closed)

# Valid merge with open syllable
tokens = ["ကျော", "င်း"]
result = repairer.repair(tokens)
# Result: ["ကျောင်း"] (merged - "ကျော" is open)

Integration

With Worker Context

from myspellchecker.data_pipeline.worker_context import WorkerContext, set_worker_context
from myspellchecker.data_pipeline.repair import SegmentationRepair

# Set up in worker
context = WorkerContext(
    segmenter=segmenter,
    repairer=SegmentationRepair(),
)
set_worker_context(context)

# Use in processing
def process_line(line: str) -> List[str]:
    ctx = get_worker_context()
    words = ctx.segmenter.segment_words(line)
    return ctx.repairer.repair(words)

With Data Pipeline

from myspellchecker.data_pipeline.repair import SegmentationRepair

def process_corpus(lines: List[str]) -> List[List[str]]:
    """Process corpus with segmentation repair."""
    segmenter = DefaultSegmenter()
    repairer = SegmentationRepair()

    results = []
    for line in lines:
        words = segmenter.segment_words(line)
        repaired = repairer.repair(words)
        results.append(repaired)

    return results

With SpellChecker

Segmentation repair is handled internally by the data pipeline’s PipelineSegmenter. It is not a user-configurable option in SpellCheckerConfig. The repair logic runs automatically during dictionary building to fix common segmentation errors.
# Repair is used internally during pipeline builds:
from myspellchecker.data_pipeline.segmenter import PipelineSegmenter

segmenter = PipelineSegmenter(provider)
repaired = segmenter.repair_segmentation(["ကျောင်း", "သည်"])

Suspicious Fragments

Commonly misidentified fragments:
| Fragment | Source | Correct Form |
|---|---|---|
| င်း | Part of ောင်း, ိုင်း | ကျောင်း, တိုင်း |
| င့် | Part of ောင့် | ကြောင့် |
| န့် | Part of various endings | — |
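A quick membership check against this list (a sketch mirroring the `suspicious_fragments` set above, with a hypothetical `flag_suspicious` helper) flags candidate fragments before the more expensive syllable validation runs:

```python
# Mirrors SegmentationRepair.suspicious_fragments.
SUSPICIOUS_FRAGMENTS = {"င်း", "င့်", "န့်"}

def flag_suspicious(tokens):
    """Return indices of tokens matching known problematic fragments."""
    return [i for i, tok in enumerate(tokens) if tok in SUSPICIOUS_FRAGMENTS]

print(flag_suspicious(["ကျော", "င်း", "သည်"]))  # → [1]
```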

Closing Characters

Characters that indicate a syllable is complete:
| Character | Name | Unicode |
|---|---|---|
| း | Visarga | U+1038 |
| ့ | Dot below | U+1037 |
| ် | Asat | U+103A |
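These are the same characters as `self.closing_chars`; their names can be confirmed with Python's unicodedata, and a small helper (hypothetical `is_closed`, mirroring the inline `prev_is_closed` check) decides whether a token is complete:

```python
import unicodedata

CLOSING_CHARS = {"\u1038", "\u1037", "\u103A"}  # း ့ ် (same as self.closing_chars)

def is_closed(token: str) -> bool:
    """A syllable is treated as closed when it ends in one of these marks."""
    return bool(token) and token[-1] in CLOSING_CHARS

for ch in sorted(CLOSING_CHARS):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+1037 MYANMAR SIGN DOT BELOW
# U+1038 MYANMAR SIGN VISARGA
# U+103A MYANMAR SIGN ASAT

print(is_closed("ပြီး"))  # True: ends with visarga
```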

Performance

| Operation | Time | Notes |
|---|---|---|
| Repair (10 tokens) | <1ms | Typical sentence |
| Repair (100 tokens) | ~5ms | Long paragraph |
| Validation per token | <0.1ms | Cached regex |
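The figures above are indicative only; a micro-benchmark along these lines can reproduce them on your hardware. The `StubRepairer` below is a stand-in (so this run measures only loop overhead); substitute the real `SegmentationRepair` to measure actual repair cost.

```python
import timeit

class StubRepairer:
    """Stand-in for SegmentationRepair; replace with the real class."""
    def repair(self, tokens):
        return list(tokens)  # no-op in place of the real merge logic

repairer = StubRepairer()
tokens = ["ကျော", "င်း", "သည်"] * 4  # roughly a 10-token sentence

# Average wall-clock time per repair() call over 10,000 iterations.
per_call = timeit.timeit(lambda: repairer.repair(tokens), number=10_000) / 10_000
print(f"{per_call * 1e3:.4f} ms per repair() call")
```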

Cython Optimization

For high-performance scenarios, use the Cython version:
try:
    from myspellchecker.data_pipeline.repair_c import SegmentationRepair
except ImportError:
    from myspellchecker.data_pipeline.repair import SegmentationRepair

# Same API, faster execution
repairer = SegmentationRepair()
repaired = repairer.repair(tokens)

See Also