After ingestion converts raw corpus files into Arrow shards, the processing stage runs normalization and segmentation over every record — breaking continuous Myanmar text into syllables and words that downstream stages can count and index.
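To make syllable segmentation concrete, here is a minimal regex-based sketch. It is a simplified illustration, not the implementation used by myspellchecker. It assumes a new syllable starts at any Myanmar consonant (U+1000 to U+1021) that is not stacked under a virama (U+1039) and is not itself closed by an asat (U+103A) or virama. The function name split_syllables is hypothetical.

```python
import re

CONSONANTS = "\u1000-\u1021"  # Myanmar consonant letters
ASAT = "\u103A"               # "killer" mark that closes a syllable
VIRAMA = "\u1039"             # stacking mark for subscript consonants

# A consonant opens a new syllable unless it is stacked (preceded by virama)
# or acts as a syllable-final consonant (followed by asat or virama).
_BREAK = re.compile(f"(?<!{VIRAMA})[{CONSONANTS}](?![{ASAT}{VIRAMA}])")


def split_syllables(text):
    """Split Myanmar text into approximate syllables (simplified rules)."""
    cuts = sorted({0, *(m.start() for m in _BREAK.finditer(text))})
    cuts.append(len(text))
    return [text[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]


# The word "မြန်မာစာ", written with escapes to avoid rendering ambiguity
word = "\u1019\u103C\u1014\u103A\u1019\u102C\u1005\u102C"
syllables = split_syllables(word)
```

Real segmenters also handle digits, punctuation, independent vowels, and mixed-script text, which this sketch leaves attached to the neighboring syllable.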
Overview
Components
CorpusSegmenter
The CorpusSegmenter processes Arrow shards and produces segmented output:
from myspellchecker.data_pipeline import CorpusSegmenter

segmenter = CorpusSegmenter(
    output_dir="intermediate/",
    word_engine="myword",  # "myword", "crf", or "transformer"
)

# Segment the corpus; input_files is the list of Arrow shard paths from ingestion
segmented_path = segmenter.segment_corpus(input_files)
Segmentation via Pipeline (Recommended)
For most use cases, use the Pipeline class:
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    word_engine="myword",  # Segmentation engine (default)
    num_workers=4,  # Parallel workers
)

pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)
Configuration
from myspellchecker.data_pipeline import PipelineConfig, SegmenterConfig

# Via PipelineConfig (recommended)
config = PipelineConfig(
    word_engine="myword",  # "myword", "crf", or "transformer"
    num_workers=4,  # Parallel workers (None = auto)
    batch_size=10000,  # Records per batch
)

# Segmenter-specific configuration
segmenter_config = SegmenterConfig(
    batch_size=10000,
    word_engine="myword",
    num_workers=4,
    enable_pos_tagging=True,
    chunk_size=50000,  # Lines per chunk for parallel processing
)
Options
| Option | Default | Description |
|---|---|---|
| num_workers | None | Parallel workers (None = auto) |
| batch_size | 10000 | Records per batch |
| word_engine | "myword" | Segmentation engine |
| enable_pos_tagging | True | Enable POS tagging during segmentation |
| chunk_size | 50000 | Lines per chunk for parallel processing |
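The options and defaults in the table can be mirrored in a small validated container, which is handy when loading settings from a file before handing them to the library. The sketch below is a hypothetical stand-in: SegmentationOptions is not a myspellchecker class, and the validation rules are assumptions rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional

VALID_ENGINES = {"myword", "crf", "transformer"}


@dataclass
class SegmentationOptions:
    """Hypothetical container mirroring the documented options and defaults."""

    num_workers: Optional[int] = None  # None = auto
    batch_size: int = 10000
    word_engine: str = "myword"
    enable_pos_tagging: bool = True
    chunk_size: int = 50000

    def __post_init__(self):
        # Assumed sanity checks; the library may validate differently.
        if self.word_engine not in VALID_ENGINES:
            raise ValueError(f"unknown word_engine: {self.word_engine!r}")
        if self.batch_size < 1 or self.chunk_size < 1:
            raise ValueError("batch_size and chunk_size must be positive")
        if self.num_workers is not None and self.num_workers < 1:
            raise ValueError("num_workers must be None or a positive integer")


defaults = SegmentationOptions()
```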
Segmentation Engines
MyWord (Default)
High-accuracy segmentation using the myword library:
config = PipelineConfig(
    word_engine="myword",
)
CRF
Conditional Random Fields: a good balance of speed and accuracy.
config = PipelineConfig(
    word_engine="crf",
)
Transformer
Highest accuracy, using a Hugging Face token-classification model (XLM-RoBERTa fine-tuned for Myanmar word boundary detection). Requires the transformers package.
config = PipelineConfig(
    word_engine="transformer",
    seg_model="chuuhtetnaing/myanmar-text-segmentation-model",  # Optional custom model
    seg_device=-1,  # -1 = CPU, 0 or higher = GPU index
)
Comparison
| Engine | Speed | Accuracy | Dependencies |
|---|---|---|---|
| MyWord | Medium | ~95-98% | myword |
| CRF | Fast | ~92-95% | sklearn-crfsuite |
| Transformer | Slow | ~97-99% | transformers, torch |
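Because each engine depends on different packages, it can help to check which ones are importable before choosing a word_engine. The following is a minimal sketch, not part of myspellchecker. The import names in ENGINE_MODULES are assumptions derived from the Dependencies column (for example, sklearn-crfsuite is imported as sklearn_crfsuite), and available_engines is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed import names for each engine's dependencies.
ENGINE_MODULES = {
    "myword": ["myword"],
    "crf": ["sklearn_crfsuite"],
    "transformer": ["transformers", "torch"],
}


def available_engines(engine_modules=ENGINE_MODULES):
    """Return the engines whose required modules can all be located."""
    return [
        engine
        for engine, modules in engine_modules.items()
        if all(find_spec(name) is not None for name in modules)
    ]


# Demonstration with stand-in module names so the result is predictable
demo = available_engines({"a": ["json"], "b": ["no_such_module_xyz_123"]})
```

The check only confirms that a module can be found; it does not import it or verify its version.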
Parallel Processing
Worker Configuration
Configure parallel workers via PipelineConfig:
# Auto-detect CPU cores
config = PipelineConfig(num_workers=None)

# Manual setting
config = PipelineConfig(num_workers=8)
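As an illustration of what "auto" could mean, the sketch below resolves None to the number of CPUs reported by the operating system. This is an assumption about typical behavior, not the library's documented logic, and resolve_num_workers is a hypothetical helper.

```python
import os


def resolve_num_workers(num_workers=None):
    """Map None to the detected CPU count; pass explicit values through."""
    if num_workers is None:
        # os.cpu_count() may return None on some platforms, so fall back to 1.
        return os.cpu_count() or 1
    if num_workers < 1:
        raise ValueError("num_workers must be a positive integer")
    return num_workers


auto_workers = resolve_num_workers(None)
manual_workers = resolve_num_workers(8)
```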
macOS Note
On macOS, OpenMP requires libomp, which is typically installed with Homebrew (brew install libomp).
Batch Size
Larger batches generally improve throughput, at the cost of more memory per worker:
# Small files
config = PipelineConfig(batch_size=1000)

# Large files
config = PipelineConfig(batch_size=50000)
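To show what a batch size controls in general terms, here is a minimal sketch of grouping an iterable of records into fixed-size batches. It is a generic pattern, not the pipeline's internal code, and iter_batches is a hypothetical name.

```python
from itertools import islice


def iter_batches(records, batch_size):
    """Yield lists of up to batch_size records from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(iter_batches(range(5), 2))
```

Only one batch is materialized at a time, which is why a smaller batch_size lowers peak memory.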
Memory Management
For memory-constrained environments:
config = PipelineConfig(
    batch_size=5000,  # Smaller batches
    num_workers=2,  # Fewer workers
)
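A rough way to reason about these two settings is a back-of-the-envelope estimate. The sketch below assumes each worker holds one batch in memory at a time and that records have a known average size. Both are simplifying assumptions, and estimate_peak_bytes is a hypothetical helper, not a library function.

```python
def estimate_peak_bytes(batch_size, num_workers, avg_record_bytes):
    """Estimate bytes held in flight, assuming one batch per worker."""
    return batch_size * num_workers * avg_record_bytes


# 5,000 records x 2 workers x 200 bytes per record (illustrative numbers)
constrained = estimate_peak_bytes(5000, 2, 200)
```

Actual usage will be higher because of interpreter overhead, model weights, and intermediate segmentation output.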
Benchmarks
| Batch Size | Workers | Throughput |
|---|---|---|
| 1,000 | 1 | ~10K rec/s |
| 10,000 | 4 | ~50K rec/s |
| 50,000 | 8 | ~100K rec/s |
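Throughput depends on hardware and corpus, so it is worth measuring on your own data. The sketch below times a single-process loop and reports records per second. The measure_throughput helper and the whitespace-splitting stand-in workload are hypothetical and are not the library's segmenter.

```python
import time


def measure_throughput(process_batch, records, batch_size):
    """Return records per second for process_batch over records."""
    start = time.perf_counter()
    for i in range(0, len(records), batch_size):
        process_batch(records[i:i + batch_size])
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return len(records) / elapsed if elapsed > 0 else float("inf")


# Stand-in workload: split each record on whitespace
sample = ["a b c"] * 1000
rate = measure_throughput(lambda batch: [r.split() for r in batch], sample, 100)
```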
Integration with Pipeline
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    word_engine="myword",
    num_workers=4,
    batch_size=10000,
)

pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)
See Also