After ingestion converts raw corpus files into Arrow shards, the processing stage runs normalization and segmentation over every record — breaking continuous Myanmar text into syllables and words that downstream stages can count and index.
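
Myanmar script is written without spaces between words, but syllable boundaries can be approximated with a regular expression over Unicode code points. The sketch below is a minimal, illustrative rule-based syllable breaker (in the spirit of tools like sylbreak); it is not the pipeline's actual implementation and ignores stacked consonants and other edge cases:

```python
import re

# Break before a Myanmar consonant (U+1000-U+1021) unless it follows a
# virama (U+1039, stacked consonant) or is a syllable coda, i.e. is
# followed by asat (U+103A) or virama (U+1039).
_SYLLABLE_BREAK = re.compile(r"(?<!\u1039)([\u1000-\u1021])(?![\u103A\u1039])")

def break_syllables(text: str) -> list[str]:
    """Split continuous Myanmar text into approximate syllables."""
    marked = _SYLLABLE_BREAK.sub(r"|\1", text)
    return [s for s in marked.split("|") if s]

print(break_syllables("မြန်မာ"))  # ['မြန်', 'မာ']
```

Word segmentation builds on syllable breaks like these, grouping syllables into words statistically; that is what the engines described below do.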

Overview

*(Figure: pipeline processing flow diagram.)*

Components

CorpusSegmenter

The CorpusSegmenter processes Arrow shards and produces segmented output:

```python
from myspellchecker.data_pipeline import CorpusSegmenter

segmenter = CorpusSegmenter(
    output_dir="intermediate/",
    word_engine="myword",  # "myword", "crf", or "transformer"
)

# Segment corpus from Arrow shards
segmented_path = segmenter.segment_corpus(input_files)
```
For most use cases, use the Pipeline class:

```python
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    word_engine="myword",  # Segmentation engine (default)
    num_workers=4,         # Parallel workers
)

pipeline = Pipeline(config=config)
pipeline.build_database(input_files, database_path)
```
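
Conceptually, build_database segments every record and aggregates word statistics into a database. A minimal sketch of that counting step follows; the table name and schema here are hypothetical stand-ins, not the library's actual layout:

```python
import sqlite3

def build_frequency_db(segmented_sentences, db_path=":memory:"):
    """Count segmented words into a SQLite frequency table (illustrative)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, freq INTEGER)"
    )
    for sentence in segmented_sentences:
        for word in sentence:
            # Upsert: insert the word, or bump its count if already present
            conn.execute(
                "INSERT INTO words (word, freq) VALUES (?, 1) "
                "ON CONFLICT(word) DO UPDATE SET freq = freq + 1",
                (word,),
            )
    conn.commit()
    return conn

conn = build_frequency_db([["မြန်မာ", "စာ"], ["မြန်မာ"]])
print(conn.execute("SELECT freq FROM words WHERE word = ?", ("မြန်မာ",)).fetchone()[0])  # 2
```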

Configuration

```python
from myspellchecker.data_pipeline import PipelineConfig, SegmenterConfig

# Via PipelineConfig (recommended)
config = PipelineConfig(
    word_engine="myword",    # "myword", "crf", or "transformer"
    num_workers=4,           # Parallel workers (None = auto)
    batch_size=10000,        # Records per batch
)

# Segmenter-specific configuration
segmenter_config = SegmenterConfig(
    batch_size=10000,
    word_engine="myword",
    num_workers=4,
    enable_pos_tagging=True,
    chunk_size=50000,        # Lines per chunk for parallel processing
)
```

Options

| Option | Default | Description |
|---|---|---|
| `num_workers` | `None` | Parallel workers (`None` = auto) |
| `batch_size` | `10000` | Records per batch |
| `word_engine` | `"myword"` | Segmentation engine |
| `enable_pos_tagging` | `True` | Enable POS tagging during segmentation |
| `chunk_size` | `50000` | Lines per chunk for parallel processing |

Segmentation Engines

MyWord (Default)

High-accuracy segmentation using the myword library:

```python
config = PipelineConfig(
    word_engine="myword",
)
```

CRF

Conditional Random Fields: a good balance of speed and accuracy.

```python
config = PipelineConfig(
    word_engine="crf",
)
```

Transformer

Highest accuracy using a HuggingFace token classification model (XLM-RoBERTa fine-tuned for Myanmar word boundary detection). Requires the transformers package.

```python
config = PipelineConfig(
    word_engine="transformer",
    seg_model="chuuhtetnaing/myanmar-text-segmentation-model",  # Optional custom model
    seg_device=-1,  # -1=CPU, 0+=GPU
)
```
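
The seg_device value follows the common HuggingFace convention: -1 selects the CPU, and a non-negative index selects a GPU. A small illustrative helper (an assumption about how such an index maps to a torch-style device string, not part of the library):

```python
def resolve_device(seg_device: int) -> str:
    """Map a HuggingFace-style device index to a device string."""
    return "cpu" if seg_device < 0 else f"cuda:{seg_device}"

print(resolve_device(-1))  # cpu
print(resolve_device(0))   # cuda:0
```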

Comparison

| Engine | Speed | Accuracy | Dependencies |
|---|---|---|---|
| MyWord | Medium | ~95-98% | `myword` |
| CRF | Fast | ~92-95% | `sklearn-crfsuite` |
| Transformer | Slow | ~97-99% | `transformers`, `torch` |
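
Selecting among engines by name naturally fits a dispatch table. The sketch below illustrates that pattern with stand-in stubs (whitespace splitting in place of the real segmenters); it is not the library's API:

```python
from typing import Callable

# Stand-in segmenter stubs; a real implementation would wrap myword,
# sklearn-crfsuite, or a transformers token-classification pipeline.
def _segment_myword(text: str) -> list[str]:
    return text.split()

def _segment_crf(text: str) -> list[str]:
    return text.split()

ENGINES: dict[str, Callable[[str], list[str]]] = {
    "myword": _segment_myword,
    "crf": _segment_crf,
}

def get_engine(name: str) -> Callable[[str], list[str]]:
    """Look up a segmentation function by engine name."""
    try:
        return ENGINES[name]
    except KeyError:
        raise ValueError(
            f"Unknown word_engine: {name!r}; expected one of {sorted(ENGINES)}"
        )
```

Failing fast on an unknown engine name surfaces configuration typos before any corpus work starts.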

Parallel Processing

Worker Configuration

Configure parallel workers via PipelineConfig:

```python
# Auto-detect CPU cores
config = PipelineConfig(num_workers=None)

# Manual setting
config = PipelineConfig(num_workers=8)
```
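
When num_workers is None, the worker count is presumably derived from the machine's CPU count. A minimal sketch of that resolution (an assumption, not the library's exact logic):

```python
import os
from typing import Optional

def resolve_workers(num_workers: Optional[int]) -> int:
    """Resolve a worker count, falling back to the number of CPU cores."""
    return num_workers if num_workers is not None else (os.cpu_count() or 1)

print(resolve_workers(8))     # 8
print(resolve_workers(None))  # number of CPU cores on this machine
```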

macOS Note

OpenMP requires libomp on macOS:

```shell
brew install libomp
```

Performance Optimization

Batch Size

Larger batches improve throughput:

```python
# Small files
config = PipelineConfig(batch_size=1000)

# Large files
config = PipelineConfig(batch_size=50000)
```
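
Batching itself is simple to picture: the record stream is sliced into chunks of at most batch_size records. A self-contained sketch of that slicing (illustrative, not the pipeline's internal code):

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of up to batch_size records."""
    it = iter(records)
    while batch := list(islice(it, batch_size)):
        yield batch

print([len(b) for b in batched(range(25), 10)])  # [10, 10, 5]
```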

Memory Management

For memory-constrained environments:

```python
config = PipelineConfig(
    batch_size=5000,  # Smaller batches
    num_workers=2,    # Fewer workers
)
```

Benchmarks

| Batch Size | Workers | Throughput |
|---|---|---|
| 1,000 | 1 | ~10K rec/s |
| 10,000 | 4 | ~50K rec/s |
| 50,000 | 8 | ~100K rec/s |
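
These figures make rough planning arithmetic easy: at ~50K records/s, a 10M-record corpus takes on the order of 200 seconds. A tiny helper for that estimate (illustrative only; real runs vary with hardware and engine choice):

```python
def estimated_seconds(num_records: int, throughput_rec_per_s: float) -> float:
    """Estimate wall-clock time for a corpus at a given throughput."""
    return num_records / throughput_rec_per_s

print(estimated_seconds(10_000_000, 50_000))  # 200.0
```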

Integration with Pipeline

```python
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    word_engine="myword",
    num_workers=4,
    batch_size=10000,
)

pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)
```

See Also