The Segmenter Interface
mySpellChecker uses a pluggable Segmenter abstract base class (src/myspellchecker/segmenters/base.py).
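As a rough sketch, such a base class typically looks like the following (the actual method names in base.py may differ; `segment` and the `WhitespaceSegmenter` example are illustrative assumptions):

```python
from abc import ABC, abstractmethod

class Segmenter(ABC):
    """Illustrative sketch of a pluggable segmenter interface."""

    @abstractmethod
    def segment(self, text: str) -> list[str]:
        """Split text into word tokens."""

class WhitespaceSegmenter(Segmenter):
    """Trivial implementation for demonstration: split on whitespace."""

    def segment(self, text: str) -> list[str]:
        return text.split()

print(WhitespaceSegmenter().segment("hello world"))  # -> ['hello', 'world']
```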
Default Implementation
The default implementation (DefaultSegmenter) uses a hybrid approach that combines rule-based speed with ML-powered accuracy.
1. Syllable Segmentation (Optimized)
- Algorithm: Uses a Cython-optimized C++ implementation of syllable breaking rules (adapted from Sylbreak) for maximum performance. Falls back to regex matching if the extension is unavailable.
- Noise Reduction:
- Unicode Normalization: Automatically applies NFC normalization and reorders diacritics (e.g., Medial Ra after Virama) using fast C++ routines.
- Validity Filtering: Checks against strict linguistic rules to discard “junk” syllables (e.g., floating diacritics, impossible stacks).
- Performance: Extremely fast (<1ms/sentence).
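To illustrate the rule-based idea, here is a minimal pure-Python sketch in the spirit of Sylbreak's regex approach (the real fallback pattern and character classes in the library may differ; this simplified version handles only basic consonant clusters):

```python
import re

CONSONANT = r"\u1000-\u1021"   # Myanmar consonants က-အ
VIRAMA = "\u1039"              # stacking symbol ္
ASAT = "\u103A"                # killer mark ်

# Break before any consonant that is not stacked onto the previous
# character (no preceding virama) and is not itself killed or stacked.
_SYL_BREAK = re.compile(f"((?<!{VIRAMA})[{CONSONANT}](?![{ASAT}{VIRAMA}]))")

def break_syllables(text: str) -> list[str]:
    """Insert a boundary before each syllable-initial consonant, then split."""
    return _SYL_BREAK.sub(r" \1", text).split()

print(break_syllables("မြန်မာ"))  # -> ['မြန်', 'မာ']
```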
2. Sentence & Word Segmentation (Pluggable Engines)
The segmenter supports multiple engines for word segmentation:
- myWord (Default & Recommended): Uses a Viterbi algorithm with unigram/bigram probabilities. It provides the best balance of speed and accuracy, especially for particle handling.
- CRF: Uses Conditional Random Fields (via python-crfsuite). Good for general text but slower than myWord.
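The Viterbi idea behind the myWord engine can be sketched with a toy unigram example (the probabilities and lexicon below are invented for illustration; the real engine uses Burmese unigram/bigram dictionaries):

```python
import math

# Toy unigram lexicon with made-up probabilities.
UNIGRAM = {"this": 0.3, "is": 0.3, "a": 0.2, "test": 0.2}

def viterbi_segment(text: str, probs: dict, max_len: int = 10) -> list[str]:
    """Find the most probable segmentation under a unigram model."""
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)  # (best log-prob, backpointer)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrace from the end of the string.
    words, i = [], n
    while i > 0:
        start = best[i][1]
        words.append(text[start:i])
        i = start
    return list(reversed(words))

print(viterbi_segment("thisisatest", UNIGRAM))  # -> ['this', 'is', 'a', 'test']
```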
Joint Segmentation and POS Tagging
For advanced use cases, mySpellChecker offers joint segmentation-tagging: a unified Viterbi decoder that simultaneously optimizes word boundaries AND POS tags in a single pass.
Benefits Over Sequential Processing
The traditional pipeline segments text first, then tags the resulting words. This can lead to suboptimal results when:
- Segmentation ambiguity depends on POS context (e.g., particle boundaries)
- Multiple valid segmentations exist with different tag sequences
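The core idea of scoring word boundaries and tags together can be sketched with a toy joint decoder (lexicon, tags, and all probabilities below are invented for illustration; the library's actual decoder is more sophisticated):

```python
import math

# word -> {tag: P(word|tag)} and tag-bigram transition scores (toy values).
LEXICON = {
    "work":  {"NOUN": 0.4, "VERB": 0.4},
    "works": {"VERB": 0.5},
    "s":     {"PART": 0.1},
}
TRANS = {("<S>", "NOUN"): 0.5, ("<S>", "VERB"): 0.5,
         ("NOUN", "PART"): 0.6, ("VERB", "PART"): 0.1}

def joint_decode(text: str, max_len: int = 6) -> list[tuple[str, str]]:
    """Jointly choose word boundaries and tags maximizing total log-prob."""
    # chart[i]: list of hypotheses (logp, tag, word_start, backpointer)
    chart = [[] for _ in range(len(text) + 1)]
    chart[0] = [(0.0, "<S>", 0, -1)]
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word not in LEXICON:
                continue
            for tag, emit in LEXICON[word].items():
                for k, (lp, ptag, _, _) in enumerate(chart[start]):
                    t = TRANS.get((ptag, tag))
                    if t:
                        chart[end].append(
                            (lp + math.log(t) + math.log(emit), tag, start, k))
    # Backtrace the best final hypothesis into (word, tag) pairs.
    out, i, hyp = [], len(text), max(chart[-1])
    while hyp[3] != -1:
        _, tag, start, k = hyp
        out.append((text[start:i], tag))
        hyp, i = chart[start][k], start
    return list(reversed(out))

print(joint_decode("works"))  # -> [('works', 'VERB')]
```

Because the decoder compares whole (segmentation, tag-sequence) hypotheses, the tag context can tip the balance between competing boundary choices, which a segment-then-tag pipeline cannot do.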
Enabling Joint Mode
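A plausible sketch of enabling joint mode (the `joint_tagging` constructor argument is an assumption, not a confirmed API; the keys mirror the Configuration Options table below):

```python
# Hypothetical wiring — argument name assumed; option keys match the
# Configuration Options table in this document.
from myspellchecker import SpellChecker

checker = SpellChecker(
    joint_tagging={
        "enabled": True,   # turn on the unified Viterbi decoder
        "beam_width": 15,  # hypotheses kept per position
    }
)
```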
Direct Usage
You can also use the JointSegmentTagger class directly:
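A hedged sketch of direct usage (the import path, constructor arguments, and method name are assumptions based on this document, not a confirmed API):

```python
# Hypothetical usage — module path and method name assumed.
from myspellchecker.joint import JointSegmentTagger

tagger = JointSegmentTagger(beam_width=15, max_word_length=20)
pairs = tagger.segment_and_tag("some input text")  # -> [(word, pos_tag), ...]
```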
Configuration Options
| Parameter | Default | Description |
|---|---|---|
| `enabled` | `False` | Enable joint segmentation-tagging mode |
| `beam_width` | `15` | Number of hypotheses to keep (higher = more accurate but slower) |
| `max_word_length` | `20` | Maximum word length in characters |
| `emission_weight` | `1.2` | Weight for tag emission probabilities |
| `word_score_weight` | `1.0` | Weight for word n-gram scores |
| `min_prob` | `1e-10` | Minimum probability for smoothing |
| `use_morphology_fallback` | `True` | Use morphology analysis for OOV words |
Custom Segmenters
You can bring your own segmentation logic by subclassing Segmenter and passing it to SpellChecker.
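A hedged sketch of the wiring (the method name `segment` and the `segmenter=` constructor argument are assumptions; check the base class in src/myspellchecker/segmenters/base.py for the real interface):

```python
# Hypothetical wiring — method and argument names assumed.
from myspellchecker import SpellChecker
from myspellchecker.segmenters.base import Segmenter

class MyCustomSegmenter(Segmenter):
    """Example custom segmenter: replace with your own logic."""

    def segment(self, text: str) -> list[str]:
        return text.split()  # placeholder segmentation

checker = SpellChecker(segmenter=MyCustomSegmenter())
```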