This is the internal subsystem that powers the myspellchecker build CLI command. It coordinates the full corpus-to-database transformation through a staged pipeline: segmenting text, counting frequencies, extracting N-grams, and inferring POS tags. It is designed to handle large datasets (10 GB+) through sharding, intermediate binary formats (Apache Arrow), and resume support.

Usage

CLI Usage

The easiest way to use the pipeline is via the command line interface:
# Build a database from a raw text file
myspellchecker build --input raw_corpus.txt --output my_dictionary.db

# Build from multiple files with custom min frequency
myspellchecker build --input part1.txt part2.txt --min-frequency 5

# Create a sample database for testing
myspellchecker build --sample --output sample.db

Python API Usage

You can also invoke the pipeline programmatically:
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

# Configure the pipeline
config = PipelineConfig(
    batch_size=50000,
    num_workers=4,
    min_frequency=2
)

# Initialize and run
pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus/news.txt", "corpus/wiki.txt"],
    database_path="my_dictionary.db"
)

Architecture

The pipeline executes in five distinct steps. It tracks file modification times to skip steps whose outputs are already up to date (resume capability).

Step 1: Ingestion

  • Input: Raw text files (.txt, .csv, .tsv, .json, .jsonl, .parquet).
  • Process:
    • Reads files in chunks.
    • Normalizes text (Unicode normalization).
    • Splits into shards for parallel processing.
  • Output: raw_shards/*.arrow (Apache Arrow files).
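
Condensed to its core idea, ingestion looks roughly like the sketch below. It is illustrative only: the function name is hypothetical, NFC is an assumed normalization form, and the real pipeline streams chunks rather than buffering whole files in memory:
import unicodedata
import pyarrow as pa
import pyarrow.ipc as ipc

def ingest_to_shards(input_path, num_shards=20):
    """Normalize each line and round-robin the corpus into Arrow shards."""
    buckets = [[] for _ in range(num_shards)]
    with open(input_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            line = unicodedata.normalize("NFC", line.strip())  # assumed form
            if line:
                buckets[i % num_shards].append(line)
    for shard_id, lines in enumerate(buckets):
        table = pa.table({"text": pa.array(lines, type=pa.string())})
        with ipc.new_file(f"raw_shards/shard_{shard_id:04d}.arrow", table.schema) as writer:
            writer.write_table(table)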

Step 2: Segmentation

  • Input: raw_shards/*.arrow
  • Process:
    • Iterates through shards.
    • Segments text into sentences and syllables using the configured word_engine.
      • Default: "myword" (the default for both PipelineConfig and the CLI)
    • Applies POS tagging using the configured pos_tagger (Rule-Based, Viterbi, or Transformer).
  • Output: segmented_corpus.arrow
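
A stripped-down view of the shard-parallel loop this step runs; segment_line below is a hypothetical placeholder for whatever the configured word_engine exposes, and POS tagging is omitted:
import glob
from multiprocessing import Pool
import pyarrow.ipc as ipc

def segment_line(text):
    # Placeholder for the configured engine ("crf", "myword", or "transformer").
    return text.split()

def segment_shard(shard_path):
    """Read one Arrow shard and segment every line it contains."""
    table = ipc.open_file(shard_path).read_all()
    return [segment_line(text) for text in table.column("text").to_pylist()]

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # corresponds to num_workers
        segmented = pool.map(segment_shard, sorted(glob.glob("raw_shards/*.arrow")))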

Step 3: Frequency Building

  • Input: segmented_corpus.arrow
  • Process:
    • Counts occurrences of Syllables, Words, Bigrams, and Trigrams.
    • Calculates POS tag probabilities (Unigram/Bigram/Trigram).
    • Filters items below min_frequency.
  • Output: TSV files (e.g., word_frequencies.tsv, bigram_probabilities.tsv).
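
At its core this step is frequency tallies plus thresholds. A toy version over already-segmented sentences (file names and TSV columns are illustrative; the real step also handles syllables, trigrams, and POS probabilities):
from collections import Counter

def build_frequencies(sentences, min_frequency=50, min_bigram_count=10):
    """Count words and bigrams, drop rare items, and write TSV files."""
    words, bigrams = Counter(), Counter()
    for tokens in sentences:
        words.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    with open("word_frequencies.tsv", "w", encoding="utf-8") as f:
        for word, count in words.most_common():
            if count >= min_frequency:
                f.write(f"{word}\t{count}\n")
    with open("bigram_frequencies.tsv", "w", encoding="utf-8") as f:
        for (w1, w2), count in bigrams.most_common():
            if count >= min_bigram_count:
                f.write(f"{w1}\t{w2}\t{count}\n")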

Step 4: Packaging

  • Input: TSV frequency files.
  • Process:
    • Creates SQLite schema.
    • Bulk loads data using transactions.
    • Optimizes database indices (VACUUM, ANALYZE).
  • Output: Final SQLite .db file.
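
Packaging boils down to a transactional bulk load followed by index maintenance. A simplified single-table sketch (the real schema has many more tables):
import csv
import sqlite3

def package(tsv_path, db_path):
    """Bulk-load a word-frequency TSV into SQLite, then VACUUM/ANALYZE."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, frequency INTEGER)"
    )
    with open(tsv_path, encoding="utf-8") as f:
        rows = ((word, int(count)) for word, count in csv.reader(f, delimiter="\t"))
        with conn:  # one transaction for the whole bulk insert
            conn.executemany("INSERT OR REPLACE INTO words VALUES (?, ?)", rows)
    conn.execute("VACUUM")   # reclaims space; must run outside a transaction
    conn.execute("ANALYZE")  # refreshes the query planner's statistics
    conn.close()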

Step 5: Enrichment

  • Input: The packaged SQLite .db file from Step 4.
  • Process:
    • Mines confusable pairs (phonetic/orthographic variants like aspiration swaps, medial swaps, nasal endings).
    • Detects compound confusions (words incorrectly split during segmentation).
    • Extracts collocations using PMI/NPMI scoring.
    • Tags words with register labels (formal/informal/neutral) based on marker co-occurrence.
  • Output: Enrichment tables added to the SQLite .db file (confusable_pairs, compound_confusions, collocations, register_tags).
  • Disable with --no-enrich on the CLI or enrich=False in PipelineConfig.
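
The PMI/NPMI scores used for collocation extraction follow the standard definitions: PMI(x, y) = log2(P(x, y) / (P(x) P(y))), and NPMI divides that by -log2 P(x, y) to normalize scores into [-1, 1]. A direct translation, with any smoothing the pipeline applies left out:
import math

def npmi(pair_count, x_count, y_count, total):
    """Normalized pointwise mutual information for a word pair."""
    p_xy = pair_count / total
    p_x, p_y = x_count / total, y_count / total
    pmi = math.log2(p_xy / (p_x * p_y))
    return pmi / -math.log2(p_xy)

# A pair that co-occurs far more often than chance scores close to 1.
print(npmi(pair_count=90, x_count=100, y_count=100, total=10_000))  # ~0.96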

Configuration

The PipelineConfig class supports fine-tuning:
  • batch_size (default: 10,000): Rows per Arrow batch.
  • num_shards (default: 20): Number of shards to split ingested data into for parallel processing.
  • num_workers (default: None, auto-detected at runtime): Number of parallel processes for segmentation.
  • min_frequency (default: 50): Words appearing fewer times than this are discarded.
  • min_syllable_frequency (default: 1): Minimum frequency for syllables to be included.
  • min_bigram_count (default: 10): Minimum count for bigrams to be included.
  • min_trigram_count (default: 20): Minimum count for trigrams to be included.
  • min_fourgram_count (default: 3): Minimum count for fourgrams to be included.
  • min_fivegram_count (default: 2): Minimum count for fivegrams to be included.
  • deduplicate_lines (default: True): Hash-based deduplication of lines within and across files. Disable with --no-dedup.
  • remove_segmentation_markers (default: True): Strip artificial word segmentation markers (spaces/underscores between Myanmar characters). Disable with --no-desegment.
  • allow_extended_myanmar (default: False): Include extended Myanmar Unicode blocks (U+1050-U+109F, U+AA60-U+AA7F, U+A9E0-U+A9FF).
  • keep_intermediate (default: False): If True, temporary files are not deleted after a successful run.
  • text_col (default: "text"): Column name for CSV/TSV text ingestion.
  • json_key (default: "text"): Key name for JSON/JSONL text ingestion.
  • word_engine (default: "myword"): Word segmentation engine: "crf", "myword", or "transformer".
  • enrich (default: True): Master toggle for Step 5 enrichment. Disable with --no-enrich.
  • enrich_confusables (default: True): Mine confusable pairs (aspiration swaps, medial swaps, etc.).
  • enrich_compounds (default: True): Detect compound confusions (incorrectly split words).
  • enrich_collocations (default: True): Extract collocations using PMI/NPMI scoring.
  • enrich_register (default: True): Tag words with formal/informal register labels.
  • disk_space_check_mb (default: 51200, i.e. 50 GB): Minimum free disk space required, in MB. Set to 0 to disable the check.
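
As an example, a configuration that touches several of these knobs (values are illustrative, not recommendations):
from myspellchecker.data_pipeline import PipelineConfig

config = PipelineConfig(
    batch_size=50_000,
    num_shards=40,
    num_workers=8,
    min_frequency=20,
    word_engine="myword",
    keep_intermediate=True,     # keep Arrow/TSV intermediates for debugging
    enrich_collocations=False,  # skip PMI/NPMI collocation mining
    disk_space_check_mb=0,      # disable the free-space preflight check
)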

Incremental Updates

The pipeline supports incremental updates, which add new data to an existing database without rebuilding from scratch:
myspellchecker build --incremental --input new_data.txt --output existing.db
This merges the new counts with the existing database statistics.
If corpus files are removed between incremental runs, the pipeline will log a warning listing the missing files. Data from deleted files persists in the database. Run a full (non-incremental) rebuild to clean up stale data.
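
Conceptually the merge is an additive upsert. A sketch of the idea against a hypothetical words(word, frequency) table with word as its primary key; the pipeline's actual merge logic is internal and covers all of its statistics tables:
import sqlite3

def merge_counts(db_path, new_counts):
    """Add freshly counted frequencies onto an existing database's counts."""
    conn = sqlite3.connect(db_path)
    with conn:
        conn.executemany(
            "INSERT INTO words (word, frequency) VALUES (?, ?) "
            "ON CONFLICT(word) DO UPDATE SET "
            "frequency = frequency + excluded.frequency",
            new_counts.items(),  # e.g. {"word": count, ...}
        )
    conn.close()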

Curated Lexicon Support

You can mark specific words as trusted/curated (is_curated=1) in the database using the --curated-input option:
# Build with curated lexicon
myspellchecker build --input corpus.txt --output dictionary.db \
  --curated-input curated_lexicon.csv
The curated lexicon must be a CSV file with a word column header:
word
ဆေးရုံ
ဆရာဝန်
လူနာ
Priority hierarchy for the is_curated flag:
  1. Words from POS seed file → is_curated=1 (with POS tags)
  2. Words from curated lexicon → is_curated=1
  3. Other corpus words → is_curated=0
Use the scripts/merge_vocabulary.py utility to prepare curated lexicons by merging and deduplicating vocabulary files from multiple sources. See the Custom Dictionaries Guide for detailed usage examples.
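
If you are assembling a curated lexicon by hand, Python's csv module is enough to produce the expected format (the word list here is just the example from above):
import csv

curated_words = ["ဆေးရုံ", "ဆရာဝန်", "လူနာ"]

with open("curated_lexicon.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["word"])  # required header
    writer.writerows([w] for w in curated_words)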