The tokenizers module provides low-level text-splitting utilities for Myanmar text. Unlike Segmenters (which may involve complex logic and dictionary lookups), Tokenizers provide direct access to the segmentation algorithms.
## Overview
| Tokenizer | Algorithm | Purpose | Speed |
|---|---|---|---|
| SyllableTokenizer | Regex-based | Split text into syllables | Very fast |
| WordTokenizer | CRF or Viterbi | Split text into words | Fast |
| TransformerWordSegmenter | HuggingFace token classification | Split text into words using B/I labels | Model-dependent |
## SyllableTokenizer

A fast, regex-based tokenizer that splits Myanmar text into syllables using the Sylbreak algorithm rules.

### Initialization
### Basic Usage

### How It Works
The tokenizer uses regex patterns to identify syllable boundaries based on:
- Myanmar consonants (U+1000-U+1021)
- Virama/Asat markers (္ and ်) for stacking detection
- Negative lookbehind to preserve stacked consonants
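These rules can be sketched with a simplified Sylbreak-style pattern (a minimal sketch covering only the consonant/asat/virama classes above; the full Sylbreak regex also handles English characters, digits, independent vowels, and punctuation):

```python
import re

# Simplified Sylbreak-style rule (assumption: the real pattern covers
# more character classes than shown here).
CONSONANT = r"\u1000-\u1021"   # Myanmar consonants က-အ
VIRAMA = "\u1039"              # ္ (stacking marker)
ASAT = "\u103A"                # ် (asat)

# Break before any consonant that is not part of a stack (not preceded
# by virama) and not killed (not followed by asat/virama).
BREAK = re.compile(f"(?<![{VIRAMA}])([{CONSONANT}])(?![{ASAT}{VIRAMA}])")

def syllables(text: str) -> list[str]:
    """Insert a boundary marker before each syllable-initial consonant, then split."""
    marked = BREAK.sub(r"|\1", text)
    return [s for s in marked.split("|") if s]

print(syllables("မြန်မာ"))  # → ['မြန်', 'မာ']
```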
### Internal Usage

SyllableTokenizer is the building block for:

- WordTokenizer (inherits from it)
- FrequencyBuilder (data pipeline)
## WordTokenizer

A word tokenizer supporting two segmentation engines:

| Engine | Algorithm | Accuracy | Speed | Notes |
|---|---|---|---|---|
| myword | Viterbi + mmap | ~95% | Fast | Recommended (default) |
| CRF | CRF model | ~92% | Medium | Requires pycrfsuite |
### Initialization

### Basic Usage

### Engine: myword (Viterbi)

The myword engine uses a Viterbi algorithm with unigram/bigram probabilities stored in a memory-mapped file for fork-safe, high-performance segmentation.
Features:
- Memory-mapped dictionary (Copy-on-Write for multiprocessing)
- Cython-optimized Viterbi implementation
- Post-processing for fragment merging and numeral splitting:
  - Fragment merging: merges invalid consonant+asat patterns
  - Numeral splitting: splits word+numeral concatenations (e.g., လ၁ → ['လ', '၁'])
  - Re-merge: handles fragments created by splitting
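As a rough illustration of the Viterbi pass (a toy unigram model with made-up probabilities; the real engine combines unigram/bigram statistics read from the memory-mapped dictionary and runs in Cython):

```python
import math

# Toy unigram log-probability table (hypothetical values for illustration).
UNIGRAM = {
    "မြန်မာ": 0.02,   # "Myanmar"
    "မြန်": 0.005,
    "မာ": 0.003,
    "စာ": 0.01,       # "text"
}
UNK_LOGP = math.log(1e-9)  # heavy penalty for unknown chunks

def logp(word: str) -> float:
    return math.log(UNIGRAM[word]) if word in UNIGRAM else UNK_LOGP

def viterbi_segment(text: str, max_len: int = 6) -> list[str]:
    """best[i] holds (score, segmentation) for the prefix text[:i]."""
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            score = best[j][0] + logp(word)
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [word])
    return best[len(text)][1]

print(viterbi_segment("မြန်မာစာ"))  # → ['မြန်မာ', 'စာ']
```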
### Engine: CRF

The CRF engine uses a trained Conditional Random Fields model for syllable-based word boundary detection.

Features:

- Uses the pycrfsuite library
- Feature extraction includes bigrams, trigrams, BOS/EOS markers
- Good accuracy without requiring large dictionary files
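A sketch of what syllable-level feature extraction can look like (hypothetical feature names and window sizes; the actual feature templates may differ):

```python
def syllable_features(sylls: list[str], i: int) -> dict:
    """Unigram/bigram/trigram features around syllable i, with BOS/EOS markers."""
    feats = {"bias": 1.0, "syll": sylls[i]}
    if i > 0:
        feats["prev_syll"] = sylls[i - 1]
        feats["bigram"] = sylls[i - 1] + "|" + sylls[i]
    else:
        feats["BOS"] = True  # beginning-of-sequence marker
    if i < len(sylls) - 1:
        feats["next_syll"] = sylls[i + 1]
    else:
        feats["EOS"] = True  # end-of-sequence marker
    if 0 < i < len(sylls) - 1:
        feats["trigram"] = "|".join(sylls[i - 1 : i + 2])
    return feats

sylls = ["မြန်", "မာ", "စာ"]
print(syllable_features(sylls, 0))  # BOS features for the first syllable
```

Each syllable's feature dict would then be fed to pycrfsuite for training or tagging.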
### Checking Custom Words

For the myword engine, you can check whether words exist in the dictionary:
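The exact lookup API isn't shown here, so the shape of such a check can only be sketched with a stand-in (contains_word() is a hypothetical name, and a plain set stands in for the mmap dictionary):

```python
# Stand-in for the myword dictionary lookup. The real tokenizer reads a
# memory-mapped unigram/bigram file; contains_word() is a hypothetical name.
class TinyWordDict:
    def __init__(self, words: set[str]):
        self._words = words

    def contains_word(self, word: str) -> bool:
        return word in self._words

d = TinyWordDict({"မြန်မာ", "စာ"})
print(d.contains_word("မြန်မာ"))  # → True
print(d.contains_word("ဘာသာ"))   # → False
```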
### Zero/Wa Normalization
The tokenizer automatically normalizes Myanmar numeral zero (၀, U+1040) to letter wa (ဝ, U+101D) when it is not in a numeric context.
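A minimal sketch of the rule, assuming "numeric context" means an adjacent Myanmar digit (the real heuristic may consider more context):

```python
import re

MYANMAR_DIGITS = "\u1040-\u1049"  # ၀-၉

# Replace ၀ (U+1040) with ဝ (U+101D) when no Myanmar digit is adjacent.
ZERO_TO_WA = re.compile(f"(?<![{MYANMAR_DIGITS}])\u1040(?![{MYANMAR_DIGITS}])")

def normalize_zero_wa(text: str) -> str:
    return ZERO_TO_WA.sub("\u101D", text)

print(normalize_zero_wa("၀ါ"))   # → ဝါ (letter context: converted)
print(normalize_zero_wa("၁၀"))  # → ၁၀ (numeric context: unchanged)
```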
### Cython Extensions

Performance-critical tokenization code uses Cython extensions:

| Module | File | Purpose |
|---|---|---|
| word_segment | tokenizers/cython/word_segment.pyx | Viterbi algorithm |
| mmap_reader | tokenizers/cython/mmap_reader.pyx | Memory-mapped file access |
### Checking Cython Status

### Error Handling
## TransformerWordSegmenter

A model-agnostic word segmenter that uses any HuggingFace token classification model with B/I (Beginning/Inside) labels to identify word boundaries in Myanmar text.

### Requirements
Requires the optional transformers dependency:

- transformers>=4.30.0
- torch>=2.0.0
### Initialization

### Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | Optional[str] | "chuuhtetnaing/myanmar-text-segmentation-model" | HuggingFace model ID or local path |
| device | int | -1 | Device for inference: -1 for CPU, 0+ for GPU index |
| batch_size | int | 32 | Batch size for segment_batch(); auto-tuned to 64 on CPU if left at the default |
| max_length | int | 512 | Maximum sequence length for the tokenizer |
| cache_dir | Optional[str] | None | Directory for caching downloaded models |
| **pipeline_kwargs | — | — | Additional arguments passed to transformers.pipeline() |
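The defaults and the CPU batch-size auto-tune from the table can be mirrored in a small stand-in config (a sketch using the names from the table above; the real class wraps transformers.pipeline() and also accepts **pipeline_kwargs):

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in mirroring the documented constructor defaults; not the real class.
@dataclass
class TransformerWordSegmenterConfig:
    model_name: str = "chuuhtetnaing/myanmar-text-segmentation-model"
    device: int = -1          # -1 = CPU, 0+ = GPU index
    batch_size: int = 32      # default; auto-tuned on CPU (see below)
    max_length: int = 512
    cache_dir: Optional[str] = None

    def effective_batch_size(self) -> int:
        # Documented behaviour: CPU runs bump the default batch size to 64,
        # but an explicitly chosen batch_size is respected.
        if self.device == -1 and self.batch_size == 32:
            return 64
        return self.batch_size

print(TransformerWordSegmenterConfig().effective_batch_size())  # → 64
```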
### Basic Usage

### How It Works

The segmenter uses a HuggingFace token-classification pipeline with aggregation_strategy="simple". The model labels each token as:
- B (Beginning): Start of a new word
- I (Inside): Continuation of the current word
The _merge_bi_tags() method groups consecutive B+I* sequences into complete words:
- I without preceding B: Treated as a new word start
- Unknown tag: Treated as B (new word start)
- Empty tokens: Skipped
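The documented merge rules can be sketched as a short function over (token, tag) pairs (a simplification; the real method consumes the pipeline's entity_group output):

```python
def merge_bi_tags(tagged: list[tuple[str, str]]) -> list[str]:
    """Group consecutive B + I* token sequences into words.

    Mirrors the documented rules: an I without a preceding B, or an
    unknown tag, starts a new word; empty tokens are skipped.
    """
    words: list[str] = []
    for token, tag in tagged:
        if not token:
            continue  # skip empty tokens
        if tag == "I" and words:
            words[-1] += token  # continuation of the current word
        else:
            words.append(token)  # "B", orphan "I", or unknown tag
    return words

print(merge_bi_tags([("မြန်", "B"), ("မာ", "I"), ("စာ", "B")]))  # → ['မြန်မာ', 'စာ']
```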
### Device Support

The segmenter supports CPU, CUDA GPU, and Apple Silicon MPS:

| Device Value | Hardware | Notes |
|---|---|---|
| -1 (default) | CPU | Always available; batch_size auto-tuned to 64 |
| 0 | CUDA GPU 0 | Requires a CUDA-capable GPU and PyTorch with CUDA |
| 0 (on macOS) | MPS (Apple Silicon) | Auto-detected when CUDA is unavailable but MPS is available |
| 1, 2, … | CUDA GPU N | Falls back to CPU if the GPU index is unavailable |
- If a GPU index is requested but unavailable, the segmenter falls back to CPU with a warning
- If PyTorch is not installed, it falls back to CPU with a warning
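The fallback rules above can be sketched as follows (resolve_device() is a hypothetical helper; the real class relies on PyTorch's own device detection):

```python
import warnings

def resolve_device(requested: int, cuda_devices: int, mps_available: bool) -> str:
    """Sketch of the documented device-selection and fallback rules."""
    if requested == -1:
        return "cpu"
    if requested < cuda_devices:
        return f"cuda:{requested}"
    if requested == 0 and mps_available:
        return "mps"  # e.g. Apple Silicon when CUDA is unavailable
    warnings.warn(f"GPU index {requested} unavailable; falling back to CPU")
    return "cpu"

print(resolve_device(-1, 0, False))  # → cpu
print(resolve_device(0, 0, True))    # → mps
print(resolve_device(2, 1, False))   # → cpu (with a warning)
```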
### Batch Processing
segment_batch() is significantly more efficient than calling segment() in a loop:
### Data Pipeline Integration

The transformer engine integrates with the data pipeline for building dictionaries from corpus files.

### CLI Usage
#### CLI Flags
| Flag | Default | Description |
|---|---|---|
| --word-engine transformer | myword | Select the transformer segmentation engine |
| --seg-model MODEL | chuuhtetnaing/myanmar-text-segmentation-model | HuggingFace model ID or local path |
| --seg-device DEVICE | -1 (CPU) | Device for inference: -1 for CPU, 0+ for GPU |
### Python API

### Pipeline Processing Behavior

When using the transformer engine, the pipeline processes chunks sequentially in the main process rather than using multiprocessing. This is because PyTorch's internal C++ state (thread pools, memory allocators, CUDA contexts) does not survive fork(), and loading the model in each spawned worker would be impractical (~1.1GB per worker).
The pipeline automatically:
- Loads the transformer model once in the main process
- Processes chunks sequentially with per-chunk progress reporting
- Uses batch inference (segment_batch()) for efficient processing within each chunk
### Compatible Model Requirements

The TransformerWordSegmenter is model-agnostic. Any HuggingFace model can be used as long as it meets these requirements:
- Task: Must be a token-classification model (compatible with transformers.pipeline("token-classification", ...))
- Labels: Must output entity_group values of "B" and "I":
  - B = Beginning of a new word
  - I = Inside/continuation of the current word
- Tokenizer: Must include a compatible tokenizer (automatically loaded by the HuggingFace pipeline)
- Hosting: Can be hosted on HuggingFace Hub (loaded by model ID) or stored locally (loaded by file path)
The default model is chuuhtetnaing/myanmar-text-segmentation-model, an XLM-RoBERTa model fine-tuned for Myanmar text segmentation.
### Error Handling

### Properties
| Property | Type | Description |
|---|---|---|
| model_name | str | The model ID or path being used |
| device | int | The device being used (-1 = CPU, 0+ = GPU) |
| batch_size | int | The batch size for batch processing |
| max_length | int | Maximum sequence length |
| is_fork_safe | bool | True for CPU mode, False for GPU mode |
### Default Model Attribution

The default model is chuuhtetnaing/myanmar-text-segmentation-model:
- Author: Chuu Htet Naing
- Base: XLM-RoBERTa fine-tuned for token classification
- Labels: B (beginning), I (inside)
- License: See model page for details
## Performance Comparison
| Operation | SyllableTokenizer | WordTokenizer (myword) | WordTokenizer (CRF) |
|---|---|---|---|
| Short text (10 chars) | ~5μs | ~50μs | ~100μs |
| Medium text (100 chars) | ~20μs | ~200μs | ~500μs |
| Long text (1000 chars) | ~100μs | ~1ms | ~3ms |
## Attribution

The word segmentation algorithms are based on research by Ye Kyaw Thu. The transformer word segmentation uses the model by Chuu Htet Naing.

## See Also
- Syllable Segmentation - Algorithm details
- Segmentation - Word segmentation algorithms
- Cython Guide - Performance optimization
- Data Pipeline - Using tokenizers in corpus processing