This is the internal subsystem that powers the myspellchecker build CLI command. It coordinates the full corpus-to-database transformation through a staged pipeline: segmenting text, counting frequencies, extracting N-grams, and inferring POS tags. It is designed to handle large datasets (10 GB+) through sharding, intermediate binary formats (Apache Arrow), and resume support.

Usage

CLI Usage

The easiest way to use the pipeline is via the command line interface:
# Build a database from a raw text file
myspellchecker build --input raw_corpus.txt --output my_dictionary.db

# Build from multiple files with custom min frequency
myspellchecker build --input part1.txt part2.txt --min-frequency 5

# Create a sample database for testing
myspellchecker build --sample --output sample.db

Python API Usage

You can also invoke the pipeline programmatically:
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

# Configure the pipeline
config = PipelineConfig(
    batch_size=50000,
    num_workers=4,
    min_frequency=2
)

# Initialize and run
pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus/news.txt", "corpus/wiki.txt"],
    database_path="my_dictionary.db"
)

Architecture

The pipeline executes in five distinct steps. It tracks file modification times to skip steps whose outputs are already up to date (resume capability).

Step 1: Ingestion

  • Input: Raw text files (.txt, .csv, .tsv, .json, .jsonl, .parquet).
  • Process:
    • Reads files in chunks.
    • Normalizes text (Unicode normalization).
    • Splits into shards for parallel processing.
  • Output: raw_shards/*.arrow (Apache Arrow files).
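
Condensed to its core idea, ingestion looks roughly like the sketch below. It is illustrative only: the function name is hypothetical, NFC is an assumed normalization form, and the real pipeline streams chunks rather than buffering whole files in memory:
import unicodedata
import pyarrow as pa
import pyarrow.ipc as ipc

def ingest_to_shards(input_path, num_shards=20):
    """Normalize each line and round-robin the corpus into Arrow shards."""
    buckets = [[] for _ in range(num_shards)]
    with open(input_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            line = unicodedata.normalize("NFC", line.strip())  # assumed form
            if line:
                buckets[i % num_shards].append(line)
    for shard_id, lines in enumerate(buckets):
        table = pa.table({"text": pa.array(lines, type=pa.string())})
        with ipc.new_file(f"raw_shards/shard_{shard_id:04d}.arrow", table.schema) as writer:
            writer.write_table(table)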

Step 2: Segmentation

  • Input: raw_shards/*.arrow
  • Process:
    • Iterates through shards.
    • Segments text into sentences and syllables using the configured word_engine.
      • Default: "myword" (the default for both PipelineConfig and the CLI)
    • Applies POS tagging using the configured pos_tagger (Rule-Based, Viterbi, or Transformer).
  • Output: segmented_corpus.arrow
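
A stripped-down view of the shard-parallel loop this step runs; segment_line below is a hypothetical placeholder for whatever the configured word_engine exposes, and POS tagging is omitted:
import glob
from multiprocessing import Pool
import pyarrow.ipc as ipc

def segment_line(text):
    # Placeholder for the configured engine ("crf", "myword", or "transformer").
    return text.split()

def segment_shard(shard_path):
    """Read one Arrow shard and segment every line it contains."""
    table = ipc.open_file(shard_path).read_all()
    return [segment_line(text) for text in table.column("text").to_pylist()]

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # corresponds to num_workers
        segmented = pool.map(segment_shard, sorted(glob.glob("raw_shards/*.arrow")))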

Step 3: Frequency Building

  • Input: segmented_corpus.arrow
  • Process:
    • Counts occurrences of Syllables, Words, Bigrams, and Trigrams.
    • Calculates POS tag probabilities (Unigram/Bigram/Trigram).
    • Filters items below min_frequency.
  • Output: TSV files (e.g., word_frequencies.tsv, bigram_probabilities.tsv).
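
At its core this step is frequency tallies plus thresholds. A toy version over already-segmented sentences (file names and TSV columns are illustrative; the real step also handles syllables, trigrams, and POS probabilities):
from collections import Counter

def build_frequencies(sentences, min_frequency=50, min_bigram_count=10):
    """Count words and bigrams, drop rare items, and write TSV files."""
    words, bigrams = Counter(), Counter()
    for tokens in sentences:
        words.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    with open("word_frequencies.tsv", "w", encoding="utf-8") as f:
        for word, count in words.most_common():
            if count >= min_frequency:
                f.write(f"{word}\t{count}\n")
    with open("bigram_frequencies.tsv", "w", encoding="utf-8") as f:
        for (w1, w2), count in bigrams.most_common():
            if count >= min_bigram_count:
                f.write(f"{w1}\t{w2}\t{count}\n")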

Step 4: Packaging

  • Input: TSV frequency files.
  • Process:
    • Creates SQLite schema.
    • Bulk loads data using transactions.
    • Optimizes database indices (VACUUM, ANALYZE).
  • Output: Final SQLite .db file.
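
Packaging boils down to a transactional bulk load followed by index maintenance. A simplified single-table sketch (the real schema has many more tables):
import csv
import sqlite3

def package(tsv_path, db_path):
    """Bulk-load a word-frequency TSV into SQLite, then VACUUM/ANALYZE."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, frequency INTEGER)"
    )
    with open(tsv_path, encoding="utf-8") as f:
        rows = ((word, int(count)) for word, count in csv.reader(f, delimiter="\t"))
        with conn:  # one transaction for the whole bulk insert
            conn.executemany("INSERT OR REPLACE INTO words VALUES (?, ?)", rows)
    conn.execute("VACUUM")   # reclaims space; must run outside a transaction
    conn.execute("ANALYZE")  # refreshes the query planner's statistics
    conn.close()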

Step 5: Enrichment

  • Input: The packaged SQLite .db file from Step 4.
  • Process:
    • Mines confusable pairs (phonetic/orthographic variants like aspiration swaps, medial swaps, nasal endings).
    • Detects compound confusions (words incorrectly split during segmentation).
    • Extracts collocations using PMI/NPMI scoring.
    • Tags words with register labels (formal/informal/neutral) based on marker co-occurrence.
  • Output: Enrichment tables added to the SQLite .db file (confusable_pairs, compound_confusions, collocations, register_tags).
  • Disable with --no-enrich on the CLI or enrich=False in PipelineConfig.
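
The PMI/NPMI scores used for collocation extraction follow the standard definitions: PMI(x, y) = log2(P(x, y) / (P(x) P(y))), and NPMI divides that by -log2 P(x, y) to normalize scores into [-1, 1]. A direct translation, with any smoothing the pipeline applies left out:
import math

def npmi(pair_count, x_count, y_count, total):
    """Normalized pointwise mutual information for a word pair."""
    p_xy = pair_count / total
    p_x, p_y = x_count / total, y_count / total
    pmi = math.log2(p_xy / (p_x * p_y))
    return pmi / -math.log2(p_xy)

# A pair that co-occurs far more often than chance scores close to 1.
print(npmi(pair_count=90, x_count=100, y_count=100, total=10_000))  # ~0.96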

Configuration

The PipelineConfig class supports fine-tuning:
  • batch_size (default: 10,000): Rows per Arrow batch.
  • num_shards (default: 20): Number of shards to split ingested data into for parallel processing.
  • num_workers (default: None, auto-detected at runtime): Number of parallel processes for segmentation.
  • min_frequency (default: 50): Words appearing fewer times than this are discarded.
  • min_syllable_frequency (default: 1): Minimum frequency for syllables to be included.
  • min_bigram_count (default: 10): Minimum count for bigrams to be included.
  • min_trigram_count (default: 20): Minimum count for trigrams to be included.
  • min_fourgram_count (default: 3): Minimum count for fourgrams to be included.
  • min_fivegram_count (default: 2): Minimum count for fivegrams to be included.
  • deduplicate_lines (default: True): Hash-based deduplication of lines within and across files. Disable with --no-dedup.
  • remove_segmentation_markers (default: True): Strip artificial word segmentation markers (spaces/underscores between Myanmar characters). Disable with --no-desegment.
  • allow_extended_myanmar (default: False): Include extended Myanmar Unicode blocks (U+1050-U+109F, U+AA60-U+AA7F, U+A9E0-U+A9FF).
  • keep_intermediate (default: False): If True, temporary files are not deleted after a successful run.
  • text_col (default: "text"): Column name for CSV/TSV text ingestion.
  • json_key (default: "text"): Key name for JSON/JSONL text ingestion.
  • word_engine (default: "myword"): Word segmentation engine: "crf", "myword", or "transformer".
  • enrich (default: True): Master toggle for Step 5 enrichment. Disable with --no-enrich.
  • enrich_confusables (default: True): Mine confusable pairs (aspiration swaps, medial swaps, etc.).
  • enrich_compounds (default: True): Detect compound confusions (incorrectly split words).
  • enrich_collocations (default: True): Extract collocations using PMI/NPMI scoring.
  • enrich_register (default: True): Tag words with formal/informal register labels.
  • disk_space_check_mb (default: 51200, i.e. 50 GB): Minimum free disk space required, in MB. Set to 0 to disable the check.
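
As an example, a configuration that touches several of these knobs (values are illustrative, not recommendations):
from myspellchecker.data_pipeline import PipelineConfig

config = PipelineConfig(
    batch_size=50_000,
    num_shards=40,
    num_workers=8,
    min_frequency=20,
    word_engine="myword",
    keep_intermediate=True,     # keep Arrow/TSV intermediates for debugging
    enrich_collocations=False,  # skip PMI/NPMI collocation mining
    disk_space_check_mb=0,      # disable the free-space preflight check
)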

Incremental Updates

The pipeline supports incremental updates, which add new data to an existing database without rebuilding from scratch:
myspellchecker build --incremental --input new_data.txt --output existing.db
This merges the new counts with the existing database statistics.
If corpus files are removed between incremental runs, the pipeline will log a warning listing the missing files. Data from deleted files persists in the database. Run a full (non-incremental) rebuild to clean up stale data.
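
Conceptually the merge is an additive upsert. A sketch of the idea against a hypothetical words(word, frequency) table with word as its primary key; the pipeline's actual merge logic is internal and covers all of its statistics tables:
import sqlite3

def merge_counts(db_path, new_counts):
    """Add freshly counted frequencies onto an existing database's counts."""
    conn = sqlite3.connect(db_path)
    with conn:
        conn.executemany(
            "INSERT INTO words (word, frequency) VALUES (?, ?) "
            "ON CONFLICT(word) DO UPDATE SET "
            "frequency = frequency + excluded.frequency",
            new_counts.items(),  # e.g. {"word": count, ...}
        )
    conn.close()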

Curated Lexicon Support

You can mark specific words as trusted/curated (is_curated=1) in the database using the --curated-input option:
# Build with curated lexicon
myspellchecker build --input corpus.txt --output dictionary.db \
  --curated-input curated_lexicon.csv
The curated lexicon must be a CSV file with a word column header:
word
ဆေးရုံ
ဆရာဝန်
လူနာ
Priority hierarchy for the is_curated flag:
  1. Words from POS seed file → is_curated=1 (with POS tags)
  2. Words from curated lexicon → is_curated=1
  3. Other corpus words → is_curated=0
Use the scripts/merge_vocabulary.py utility to prepare curated lexicons by merging and deduplicating vocabulary files from multiple sources. See the Custom Dictionaries Guide for detailed usage examples.
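
If you are assembling a curated lexicon by hand, Python's csv module is enough to produce the expected format (the word list here is just the example from above):
import csv

curated_words = ["ဆေးရုံ", "ဆရာဝန်", "လူနာ"]

with open("curated_lexicon.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["word"])  # required header
    writer.writerows([w] for w in curated_words)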