The Data Pipeline is the subsystem responsible for ingesting raw Myanmar text corpora and converting them into the optimized SQLite database (.db) that mySpellChecker uses at runtime. It is designed to handle large datasets (10 GB+) through sharding, intermediate binary formats (Apache Arrow), and resume capability.

Usage

CLI Usage

The easiest way to use the pipeline is via the command line interface:
# Build a database from a raw text file
myspellchecker build --input raw_corpus.txt --output my_dictionary.db

# Build from multiple files with custom min frequency
myspellchecker build --input part1.txt part2.txt --min-frequency 5

# Create a sample database for testing
myspellchecker build --sample --output sample.db

Python API Usage

You can also invoke the pipeline programmatically:
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

# Configure the pipeline
config = PipelineConfig(
    batch_size=50000,
    num_workers=4,
    min_frequency=2
)

# Initialize and run
pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus/news.txt", "corpus/wiki.txt"],
    database_path="my_dictionary.db"
)

Architecture

The pipeline executes in four distinct steps. It tracks file modification times and skips any step whose outputs are already up to date (resume capability).
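The mtime-based resume check can be sketched as a simple comparison of file timestamps. The function name and exact bookkeeping below are illustrative, not the pipeline's actual API:

```python
import os

def step_is_up_to_date(inputs, output):
    """Return True if `output` exists and is at least as new as every input.

    Minimal sketch of an mtime-based resume check; the real pipeline
    may track more state than this.
    """
    if not os.path.exists(output):
        return False
    newest_input = max(os.path.getmtime(p) for p in inputs)
    return os.path.getmtime(output) >= newest_input
```

A step whose output file postdates all of its inputs can be skipped; touching any input invalidates the step.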
Step 1: Ingestion

  • Input: Raw text files (.txt, .csv, .tsv, .json, .jsonl, .parquet).
  • Process:
    • Reads files in chunks.
    • Normalizes text (Unicode normalization).
    • Splits into shards for parallel processing.
  • Output: raw_shards/*.arrow (Apache Arrow files).
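The normalize-and-shard portion of ingestion can be sketched as below. The helper names are illustrative, and the real pipeline writes Arrow shards and may apply richer Unicode normalization rules than plain NFC:

```python
import unicodedata

def normalize_chunk(lines):
    """NFC-normalize one chunk of raw text lines, dropping blank lines.

    Sketch of the normalization step; the pipeline's actual rules
    for Myanmar text may be more involved.
    """
    return [unicodedata.normalize("NFC", ln.strip()) for ln in lines if ln.strip()]

def shard(lines, num_shards):
    """Round-robin split normalized lines into shards for parallel workers."""
    shards = [[] for _ in range(num_shards)]
    for i, line in enumerate(lines):
        shards[i % num_shards].append(line)
    return shards
```

Round-robin sharding keeps shard sizes balanced regardless of how the input files are ordered.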
Step 2: Segmentation

  • Input: raw_shards/*.arrow
  • Process:
    • Iterates through shards.
    • Segments text into sentences and syllables using the configured word_engine.
    • Default engine: "myword" (the default in both PipelineConfig and the CLI)
    • Applies POS tagging using the configured pos_tagger (Rule-Based, Viterbi, or Transformer).
  • Output: segmented_corpus.arrow
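The shard-iteration structure of this step can be sketched with a pluggable segmenter callable standing in for the configured word_engine. The function below is illustrative; the whitespace splitter used in the example is a stand-in, not the real "myword" engine:

```python
from typing import Callable, Iterable, List

def segment_shards(shards: Iterable[List[str]],
                   segment: Callable[[str], List[str]]) -> List[List[str]]:
    """Run the configured segmenter over every sentence in every shard.

    Sketch only: `segment` stands in for the configured word_engine,
    and the real step also applies POS tagging to each sentence.
    """
    segmented = []
    for shard in shards:
        for sentence in shard:
            segmented.append(segment(sentence))
    return segmented
```

Keeping the engine behind a callable is what lets the pipeline swap between segmenters (and between the Rule-Based, Viterbi, and Transformer POS taggers) via configuration.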
Step 3: Frequency Building

  • Input: segmented_corpus.arrow
  • Process:
    • Counts occurrences of Syllables, Words, Bigrams, and Trigrams.
    • Calculates POS tag probabilities (Unigram/Bigram/Trigram).
    • Filters items below min_frequency.
  • Output: TSV files (e.g., word_frequencies.tsv, bigram_probabilities.tsv).
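The word n-gram counting and min_frequency filtering can be sketched with collections.Counter. This covers only the word uni/bi/trigram counts; the real step also counts syllables and computes POS tag probabilities, and the TSV column layout is not shown here:

```python
from collections import Counter

def build_frequencies(sentences, min_frequency=2):
    """Count word unigrams, bigrams, and trigrams across segmented
    sentences, then drop anything seen fewer than min_frequency times.
    Illustrative sketch of the frequency-building step."""
    words, bigrams, trigrams = Counter(), Counter(), Counter()
    for toks in sentences:
        words.update(toks)
        bigrams.update(zip(toks, toks[1:]))
        trigrams.update(zip(toks, toks[1:], toks[2:]))
    prune = lambda c: {k: v for k, v in c.items() if v >= min_frequency}
    return prune(words), prune(bigrams), prune(trigrams)
```

Pruning happens before packaging so that rare, likely-noisy items never reach the SQLite database.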
Step 4: Packaging

  • Input: TSV frequency files.
  • Process:
    • Creates SQLite schema.
    • Bulk loads data using transactions.
    • Optimizes database indices (VACUUM, ANALYZE).
  • Output: Final SQLite .db file.
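The transactional bulk load and post-load optimization can be sketched with the standard sqlite3 module. The single-table schema here is illustrative; the real database contains more tables and indices:

```python
import sqlite3

def package(word_rows, db_path):
    """Bulk-load (word, frequency) rows in a single transaction,
    then optimize the database. Sketch of the packaging step with
    a simplified one-table schema."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, frequency INTEGER)"
    )
    with con:  # one transaction for the whole bulk insert
        con.executemany("INSERT INTO words VALUES (?, ?)", word_rows)
    con.execute("ANALYZE")  # refresh query-planner statistics
    con.execute("VACUUM")   # compact the file after the bulk load
    con.close()
```

Wrapping the insert in one transaction avoids a per-row fsync, which is the difference between minutes and hours on multi-million-row loads.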

Configuration

The PipelineConfig class supports fine-tuning:
Parameter            Default                        Description
min_frequency        50                             Words appearing fewer times than this are discarded.
num_workers          None (auto-detect at runtime)  Number of parallel processes for ingestion.
batch_size           10,000                         Rows per Arrow batch.
disk_space_check_mb  51200                          Minimum free disk space required in MB (50 GB). Set to 0 to disable.
keep_intermediate    False                          If True, temporary files are not deleted after success.
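The disk_space_check_mb guard can be sketched with shutil.disk_usage. The function name is illustrative, but the semantics match the table: a value of 0 disables the check.

```python
import shutil

def has_enough_disk_space(path, required_mb):
    """Return True if the filesystem holding `path` has at least
    required_mb of free space; 0 disables the check entirely.
    Sketch of the disk_space_check_mb guard."""
    if required_mb == 0:
        return True
    free_mb = shutil.disk_usage(path).free // (1024 * 1024)
    return free_mb >= required_mb
```

Checking up front matters because the intermediate Arrow shards can transiently need several times the size of the final .db file.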

Incremental Updates

The pipeline supports incremental updates, which add new data to an existing database without rebuilding it from scratch:
myspellchecker build --incremental --input new_data.txt --output existing.db
This merges the new counts with the existing database statistics.
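For one table, the count merge can be sketched as a SQLite UPSERT: rows for new words are inserted, and rows for existing words have the new count added to the stored one. The function name and single-table focus are illustrative:

```python
import sqlite3

def merge_counts(con, new_counts):
    """Merge new (word, count) pairs into an existing words table.

    Sketch of the incremental merge for a single table; existing
    frequencies are incremented, unseen words are inserted.
    """
    with con:
        con.executemany(
            "INSERT INTO words (word, frequency) VALUES (?, ?) "
            "ON CONFLICT(word) DO UPDATE SET "
            "frequency = frequency + excluded.frequency",
            new_counts,
        )
```

The UPSERT form (SQLite 3.24+) does the read-modify-write atomically, so an interrupted incremental run cannot leave a word half-updated.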

Curated Lexicon Support

You can mark specific words as trusted/curated (is_curated=1) in the database using the --curated-input option:
# Build with curated lexicon
myspellchecker build --input corpus.txt --output dictionary.db \
  --curated-input curated_lexicon.csv
The curated lexicon must be a CSV file with a word column header:
word
ဆေးရုံ
ဆရာဝန်
လူနာ
Priority hierarchy for the is_curated flag:
  1. Words from POS seed file → is_curated=1 (with POS tags)
  2. Words from curated lexicon → is_curated=1
  3. Other corpus words → is_curated=0
Use the scripts/merge_vocabulary.py utility to prepare curated lexicons by merging and deduplicating vocabulary files from multiple sources. See Custom Dictionaries Guide for detailed usage examples.
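Reading the curated lexicon and applying the priority hierarchy can be sketched as follows. The helper names are illustrative, not the pipeline's internal API; only the CSV format (a single `word` column header) comes from the documentation above:

```python
import csv
import io

def load_curated(csv_text):
    """Parse a curated-lexicon CSV with a `word` column header into a
    set of words. Sketch of how --curated-input could be read."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["word"].strip() for row in reader if row.get("word", "").strip()}

def curated_flag(word, pos_seed_words, curated_words):
    """Apply the priority hierarchy: words from the POS seed file or
    the curated lexicon get is_curated=1, all other corpus words 0."""
    return 1 if word in pos_seed_words or word in curated_words else 0
```

Because POS seed words already carry is_curated=1 (with POS tags), listing a word in both sources is harmless; the flag simply stays 1.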