Myanmar text segmentation operates at two levels — syllable and word — because Myanmar script has no spaces between words. mySpellChecker separates these concerns: a fast regex-based segmenter handles syllables, while a configurable word engine (myword, CRF, or transformer) handles word boundaries.

Architecture

Figure: segmentation architecture. Raw Myanmar text forks into syllable segmentation (RegexSegmenter, rule-based) and word segmentation (DefaultSegmenter with a myword/CRF/transformer engine).
| Level | Default Engine | How It Works | Downloads |
|---|---|---|---|
| Syllable | RegexSegmenter | Regex-based Sylbreak algorithm | None |
| Word | myword | Viterbi with unigram/bigram dictionary | segmentation.mmap from HuggingFace |

Quick Start

from myspellchecker.segmenters import DefaultSegmenter

segmenter = DefaultSegmenter()

# Syllable segmentation (RegexSegmenter internally, no download)
syllables = segmenter.segment_syllables("မြန်မာနိုင်ငံ")
# ['မြန်', 'မာ', 'နိုင်', 'ငံ']

# Word segmentation (myword engine, downloads dictionary on first call)
words = segmenter.segment_words("မြန်မာနိုင်ငံသည်")
# ['မြန်မာ', 'နိုင်ငံ', 'သည်']

# Sentence segmentation
sentences = segmenter.segment_sentences("ပထမစာ။ ဒုတိယစာ။")
# ['ပထမစာ။', 'ဒုတိယစာ။']

Syllable Segmentation

RegexSegmenter

All syllable segmentation in mySpellChecker uses RegexSegmenter — a pure-Python, rule-based segmenter with zero external dependencies and no network downloads.
from myspellchecker.segmenters import RegexSegmenter

segmenter = RegexSegmenter(
    allow_extended_myanmar=False,  # Include Extended Myanmar Unicode blocks (default: False)
)

syllables = segmenter.segment_syllables("မြန်မာစကား")
# ['မြန်', 'မာ', 'စ', 'ကား']
Characteristics:
  • Pure Python with optional Cython acceleration
  • No downloads, no model, no dictionary needed
  • Fork-safe for multiprocessing
  • Handles stacked consonants (Virama ္), Kinzi sequences, and non-Myanmar text
RegexSegmenter only supports syllable and sentence segmentation. It raises NotImplementedError for segment_words(). Use DefaultSegmenter for word segmentation.

Sylbreak Algorithm

The segmenter uses an adapted Sylbreak algorithm:
# Pattern components:
# 1. Myanmar consonant not preceded by stacking Virama
p_my_cons = r"(?<!(?<!\u103a)\u1039)[\u1000-\u1021](?![\u103a\u1039])"

# 2. Independent vowels, digits, symbols
p_other_starters = r"[\u1023-\u102a\u103f\u104c-\u104f\u1040-\u1049\u104a\u104b]"

# 3. Non-Myanmar characters (grouped)
p_non_myanmar = r"[^\u1000-\u109F]+"
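The components above can be combined into a working splitter. The following is an illustrative sketch of the Sylbreak idea, not the library's exact implementation: it breaks before every unstacked consonant or independent starter, and at the transition into a run of non-Myanmar characters, which is kept as a single token.

```python
import re

# The pattern components from above.
P_MY_CONS = r"(?<!(?<!\u103a)\u1039)[\u1000-\u1021](?![\u103a\u1039])"
P_OTHER_STARTERS = r"[\u1023-\u102a\u103f\u104c-\u104f\u1040-\u1049\u104a\u104b]"

_BOUNDARY = re.compile(
    f"(?={P_MY_CONS})"          # break before a syllable-initial consonant
    f"|(?={P_OTHER_STARTERS})"  # ...or an independent vowel/digit/symbol
    r"|(?<=[\u1000-\u109F])(?=[^\u1000-\u109F])"  # ...or when leaving Myanmar text
)

def sylbreak(text: str) -> list[str]:
    # Split at the zero-width boundary positions and drop empty pieces.
    return [piece for piece in _BOUNDARY.split(text) if piece]

sylbreak("မြန်မာနိုင်ငံ")
# ['မြန်', 'မာ', 'နိုင်', 'ငံ']
```

Note that the negative lookahead in `P_MY_CONS` is what keeps a consonant killed by Asat (်) or stacked by Virama (္) inside the preceding syllable.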

Cython Acceleration

RegexSegmenter automatically uses a Cython-compiled implementation when available:
from myspellchecker.segmenters.regex import _HAS_CYTHON_SEGMENTER

if _HAS_CYTHON_SEGMENTER:
    print("Using fast Cython implementation")
else:
    print("Using pure Python implementation")

Word Segmentation

Word segmentation is handled by DefaultSegmenter, which delegates to one of three word engines. Both myword and crf download resources from HuggingFace on first use.

Word Engines

| Engine | Accuracy | Speed | Model Source | Dependencies |
|---|---|---|---|---|
| myword (default) | ~90% | Fast | segmentation.mmap from HuggingFace | None (pure Python + Cython) |
| crf | ~92% | Medium | wordseg_c2_crf.crfsuite from HuggingFace | pycrfsuite |
| transformer | ~96% | Slow (CPU) / Fast (GPU) | chuuhtetnaing/myanmar-text-segmentation-model | transformers, torch |

Engine Selection

from myspellchecker.segmenters import DefaultSegmenter

# myword (default) — Viterbi with unigram/bigram dictionary
segmenter = DefaultSegmenter(word_engine="myword")

# CRF — Conditional Random Field sequence tagger
segmenter = DefaultSegmenter(word_engine="crf")

# Transformer — XLM-RoBERTa fine-tuned for word boundaries
segmenter = DefaultSegmenter(
    word_engine="transformer",
    seg_model="chuuhtetnaing/myanmar-text-segmentation-model",  # default
    seg_device=-1,   # -1=CPU, 0+=GPU
)

HuggingFace Resource Downloads

The myword and crf engines download their resources from the thettwe/myspellchecker-resources HuggingFace dataset repository on first use:
| Engine | Resource | File | Notes |
|---|---|---|---|
| myword | Word segmentation dictionary | segmentation/segmentation.mmap | Memory-mapped |
| crf | CRF model | models/wordseg_c2_crf.crfsuite | CRF model file |
Resources are cached at ~/.cache/myspellchecker/resources/ and only downloaded once.
# Override cache directory
export MYSPELL_CACHE_DIR="/path/to/cache"

# Prevent network downloads (fail if resource not cached)
export MYSPELL_OFFLINE=true
Word segmenters use lazy initialization — no download occurs when you create a DefaultSegmenter or SpellChecker. The download happens on the first call to segment_words().
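The resolution order described above can be sketched as follows. This is an illustrative sketch of the documented behaviour, not the library's code; `resolve_cache_dir` and `ensure_cached` are hypothetical helper names.

```python
import os
from pathlib import Path

def resolve_cache_dir() -> Path:
    # MYSPELL_CACHE_DIR overrides the default cache location.
    override = os.environ.get("MYSPELL_CACHE_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "myspellchecker" / "resources"

def ensure_cached(relpath: str) -> Path:
    path = resolve_cache_dir() / relpath
    if path.exists():
        return path  # already cached: no network access
    if os.environ.get("MYSPELL_OFFLINE", "").lower() in ("1", "true", "yes"):
        raise FileNotFoundError(
            f"{relpath} is not cached and MYSPELL_OFFLINE is set"
        )
    # ...otherwise download relpath from the HuggingFace dataset repo into path...
    return path
```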

myword Engine

The default word segmentation engine, based on myWord by Ye Kyaw Thu. Uses a Viterbi algorithm with unigram and bigram probabilities from a memory-mapped dictionary.
segmenter = DefaultSegmenter(word_engine="myword")

words = segmenter.segment_words("မြန်မာနိုင်ငံသည်ကောင်းသည်")
# ['မြန်မာ', 'နိုင်ငံ', 'သည်', 'ကောင်း', 'သည်']

# Load additional custom words into the myword dictionary
segmenter.load_custom_dictionary(["ကျွန်တော်တို့", "မိသားစု"])
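The Viterbi idea behind the engine can be sketched with a toy, unigram-only model. This is illustrative only: the real engine also uses bigram probabilities and reads them from the memory-mapped dictionary, and the probabilities below are made up for the example.

```python
import math

# Made-up unigram probabilities for a handful of words from the examples above.
UNIGRAM = {"မြန်မာ": 0.04, "နိုင်ငံ": 0.03, "သည်": 0.10, "ကောင်း": 0.02}

def viterbi_segment(text: str, max_len: int = 8) -> list[str]:
    n = len(text)
    # best[i] = (log-probability of the best segmentation of text[:i], backpointer)
    best: list[tuple[float, int]] = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            # Unknown single characters get a tiny smoothing probability so
            # that out-of-vocabulary text still segments.
            prob = UNIGRAM.get(word, 1e-9 if len(word) == 1 else 0.0)
            if prob > 0.0:
                score = best[j][0] + math.log(prob)
                if score > best[i][0]:
                    best[i] = (score, j)
    # Follow the backpointers to recover the chosen words.
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))

viterbi_segment("မြန်မာနိုင်ငံသည်ကောင်းသည်")
# ['မြန်မာ', 'နိုင်ငံ', 'သည်', 'ကောင်း', 'သည်']
```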

CRF Engine

CRF-based sequence tagger trained on the myPOS corpus by Ye Kyaw Thu. Requires the python-crfsuite package (imported as pycrfsuite).
pip install python-crfsuite
segmenter = DefaultSegmenter(word_engine="crf")
words = segmenter.segment_words("မြန်မာနိုင်ငံသည်")
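A CRF tagger labels each unit using features drawn from a context window. The following feature function is an illustrative sketch only; the feature template actually used to train wordseg_c2_crf may differ.

```python
def crf_features(syllables: list[str], i: int) -> dict[str, str]:
    # Context-window features for the syllable at position i: the syllable
    # itself plus its immediate neighbours, with boundary markers at the edges.
    return {
        "cur": syllables[i],
        "prev": syllables[i - 1] if i > 0 else "<BOS>",
        "next": syllables[i + 1] if i + 1 < len(syllables) else "<EOS>",
    }

crf_features(["မြန်", "မာ", "နိုင်", "ငံ"], 1)
# {'cur': 'မာ', 'prev': 'မြန်', 'next': 'နိုင်'}
```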

Transformer Engine

XLM-RoBERTa model fine-tuned for Myanmar word boundary detection by Chuu Htet Naing. Uses B/I (Begin/Inside) token classification.
pip install myspellchecker[transformers]
segmenter = DefaultSegmenter(
    word_engine="transformer",
    seg_device=0,  # GPU for speed
)
words = segmenter.segment_words("မြန်မာနိုင်ငံသည်")
| Attribute | Value |
|---|---|
| Model | chuuhtetnaing/myanmar-text-segmentation-model |
| Base | XLM-RoBERTa |
| Accuracy | 96.17% |
| F1 Score | 78.66% |
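The mapping from B/I labels back to words can be sketched as follows. The model's actual tokenization and post-processing may differ; decode_bi_tags is an illustrative helper, not part of the library.

```python
def decode_bi_tags(tokens: list[str], tags: list[str]) -> list[str]:
    # "B" opens a new word; "I" appends to the current word. An "I" at the
    # start of the sequence is treated as "B" for robustness.
    words: list[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B" or not words:
            words.append(token)
        else:
            words[-1] += token
    return words

decode_bi_tags(["မြန်", "မာ", "နိုင်", "ငံ", "သည်"], ["B", "I", "B", "I", "B"])
# ['မြန်မာ', 'နိုင်ငံ', 'သည်']
```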

Segmenter Interface

All segmenters implement the Segmenter abstract base class from myspellchecker.segmenters.base:
from abc import ABC, abstractmethod

class Segmenter(ABC):
    @abstractmethod
    def segment_syllables(self, text: str) -> list[str]:
        """Segment text into syllables."""

    @abstractmethod
    def segment_words(self, text: str) -> list[str]:
        """Segment text into words."""

    @abstractmethod
    def segment_sentences(self, text: str) -> list[str]:
        """Segment text into sentences."""

    def segment_and_tag(self, text: str) -> tuple[list[str], list[str]]:
        """Segment and POS-tag simultaneously. Optional — raises NotImplementedError by default."""
        raise NotImplementedError
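A minimal custom implementation might look like this. The Segmenter ABC is re-declared locally so the sketch is self-contained, and SentenceOnlySegmenter is a hypothetical example, not a library class.

```python
from abc import ABC, abstractmethod

# Stand-in for myspellchecker.segmenters.base.Segmenter, re-declared here so
# the example runs on its own.
class Segmenter(ABC):
    @abstractmethod
    def segment_syllables(self, text: str) -> list[str]: ...

    @abstractmethod
    def segment_words(self, text: str) -> list[str]: ...

    @abstractmethod
    def segment_sentences(self, text: str) -> list[str]: ...

class SentenceOnlySegmenter(Segmenter):
    """Toy segmenter that only implements sentence splitting on ။."""

    def segment_syllables(self, text: str) -> list[str]:
        raise NotImplementedError

    def segment_words(self, text: str) -> list[str]:
        raise NotImplementedError

    def segment_sentences(self, text: str) -> list[str]:
        # Split on the Myanmar sentence separator and re-attach it.
        parts = [p.strip() for p in text.split("။")]
        return [p + "။" for p in parts if p]
```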

DefaultSegmenter

The production segmenter that combines RegexSegmenter (syllables) with a configurable word engine:
from myspellchecker.segmenters import DefaultSegmenter

segmenter = DefaultSegmenter(
    word_engine="myword",           # "myword" | "crf" | "transformer"
    allow_extended_myanmar=False,   # Extended Myanmar Unicode blocks
    seg_model=None,                 # Custom model path (transformer only)
    seg_device=-1,                  # -1=CPU, 0+=GPU (transformer only)
)

Usage with SpellChecker

Via Configuration

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

config = SpellCheckerConfig(
    word_engine="myword",  # or "crf" or "transformer"
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

Custom Segmenter

from myspellchecker import SpellChecker
from myspellchecker.segmenters import DefaultSegmenter

segmenter = DefaultSegmenter(word_engine="crf")
checker = SpellChecker(segmenter=segmenter)

Via Builder

from myspellchecker.core.builder import SpellCheckerBuilder
from myspellchecker.segmenters import DefaultSegmenter

segmenter = DefaultSegmenter(word_engine="crf")
checker = SpellCheckerBuilder().with_segmenter(segmenter).build()

Performance Comparison

| Segmenter | Level | Speed | Memory | Dependencies |
|---|---|---|---|---|
| RegexSegmenter | Syllable | Very Fast | Very Low | None |
| DefaultSegmenter (myword) | Word | Fast | Low | Downloads mmap dictionary |
| DefaultSegmenter (crf) | Word | Medium | Low | pycrfsuite + downloads CRF model |
| DefaultSegmenter (transformer) | Word | Slow (CPU) / Fast (GPU) | High (~500MB) | transformers, torch |

Sentence Boundaries

All segmenters split on Myanmar sentence separator (။):
text = "ပထမစာကြောင်း။ ဒုတိယစာကြောင်း။ တတိယစာကြောင်း။"
sentences = segmenter.segment_sentences(text)
# ['ပထမစာကြောင်း။', 'ဒုတိယစာကြောင်း။', 'တတိယစာကြောင်း။']
DefaultSegmenter also detects sentence-final particles (SFPs) as implicit sentence boundaries in longer texts.

See Also