The `myspellchecker build` CLI command coordinates the full corpus-to-database transformation: segmenting text, counting frequencies, extracting N-grams, and inferring POS tags through a staged pipeline. It is designed to handle large datasets (10 GB+) by using sharding, intermediate binary formats (Apache Arrow), and resume support.
## Usage

### CLI Usage
The easiest way to use the pipeline is via the command line interface:
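A typical invocation might look like the following sketch. The `myspellchecker build` command and the `--no-dedup` and `--curated-input` options are documented in this page; the `--input` and `--output` flag names are illustrative assumptions, not confirmed flags.

```shell
# Sketch only: --input/--output are assumed flag names.
# --no-dedup and --curated-input are documented options.
myspellchecker build \
    --input ./corpus/ \
    --output ./spellcheck.db \
    --curated-input ./curated_lexicon.tsv \
    --no-dedup
```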
### Python API Usage

You can also invoke the pipeline programmatically:
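Programmatic use might look like this configuration sketch. `PipelineConfig` and its fields are documented below; the module path and the `run_pipeline` entry point are assumed names.

```python
# Sketch: the module path and run_pipeline() are assumed, not confirmed.
from myspellchecker.pipeline import PipelineConfig, run_pipeline

config = PipelineConfig(
    word_engine="myword",   # documented default
    min_frequency=50,       # documented default
    enrich=True,            # run Step 5 enrichment
)
run_pipeline(input_dir="corpus/", output_db="spellcheck.db", config=config)
```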
## Architecture

The pipeline executes in five distinct steps. It tracks file modification times to skip steps whose outputs are already up to date (resume capability).

### Step 1: Ingestion
- Input: Raw text files (`.txt`, `.csv`, `.tsv`, `.json`, `.jsonl`, `.parquet`).
- Process:
  - Reads files in chunks.
  - Normalizes text (Unicode normalization).
  - Splits into shards for parallel processing.
- Output: `raw_shards/*.arrow` (Apache Arrow files).
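The normalize-and-shard idea can be sketched with the standard library alone. The real step streams chunks into Arrow shard files; here in-memory lists stand in, and NFC is an assumed normalization form.

```python
import hashlib
import unicodedata

NUM_SHARDS = 4  # the real pipeline defaults to num_shards=20

def normalize(line: str) -> str:
    """Unicode-normalize text (NFC assumed here) so equivalent
    byte sequences compare equal downstream."""
    return unicodedata.normalize("NFC", line)

def shard_of(line: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable hash-based shard assignment so reruns produce identical shards."""
    digest = hashlib.md5(line.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

# The same text in precomposed (NFC) and decomposed (NFD) form:
lines = ["ka\u00e9", "kae\u0301"]
shards: dict[int, list[str]] = {}
for raw in lines:
    text = normalize(raw)
    shards.setdefault(shard_of(text), []).append(text)
# Both lines normalize to the same string, so they land in the same shard.
```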
### Step 2: Segmentation
- Input: `raw_shards/*.arrow`
- Process:
  - Iterates through shards.
  - Segments text into sentences and syllables using the configured `word_engine`.
    - Default: `"myword"` (both `PipelineConfig` and CLI)
  - Applies POS tagging using the configured `pos_tagger` (rule-based, Viterbi, or Transformer).
- Output: `segmented_corpus.arrow`
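The engine selection can be sketched as a registry keyed by `word_engine`. Whitespace splitting stands in for the real `"myword"`/`"crf"`/`"transformer"` engines, which this sketch does not implement.

```python
from typing import Callable, Dict, List

# Registry mapping engine names to segmenter callables.
# Whitespace splitting is a stand-in for the real engines.
ENGINES: Dict[str, Callable[[str], List[str]]] = {
    "myword": lambda text: text.split(),
}

def segment_shard(rows: List[str], word_engine: str = "myword") -> List[List[str]]:
    """Apply the configured segmenter to every row of a shard."""
    segment = ENGINES[word_engine]
    return [segment(row) for row in rows]

result = segment_shard(["hello world"])
```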
### Step 3: Frequency Building
- Input: `segmented_corpus.arrow`
- Process:
  - Counts occurrences of syllables, words, bigrams, and trigrams.
  - Calculates POS tag probabilities (unigram/bigram/trigram).
  - Filters items below `min_frequency`.
- Output: TSV files (e.g., `word_frequencies.tsv`, `bigram_probabilities.tsv`).
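The counting-and-thresholding logic amounts to sliding-window n-gram counts filtered by a minimum count. A minimal sketch, independent of the pipeline's actual code:

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

def count_ngrams(sentences: Iterable[List[str]], n: int) -> Counter:
    """Slide a window of width n over each sentence and tally tuples."""
    counts: Counter = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i : i + n])] += 1
    return counts

def filter_min(counts: Counter, minimum: int) -> Dict[Tuple[str, ...], int]:
    """Drop items below the threshold (cf. min_frequency, min_bigram_count)."""
    return {k: v for k, v in counts.items() if v >= minimum}

sentences = [["a", "b", "a"], ["a", "b"]]
unigrams = count_ngrams(sentences, 1)   # a: 3, b: 2
bigrams = count_ngrams(sentences, 2)    # (a, b): 2, (b, a): 1
kept = filter_min(bigrams, 2)           # only (a, b) survives
```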
### Step 4: Packaging
- Input: TSV frequency files.
- Process:
  - Creates the SQLite schema.
  - Bulk loads data using transactions.
  - Optimizes the database (`VACUUM` to compact it, `ANALYZE` to refresh query-planner statistics).
- Output: Final SQLite `.db` file.
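The transactional bulk load plus `ANALYZE`/`VACUUM` can be sketched with the standard `sqlite3` module. Table and column names here are illustrative, loosely mirroring `word_frequencies.tsv`; the real schema may differ.

```python
import sqlite3

rows = [("word1", 120), ("word2", 75)]  # (word, frequency) pairs from the TSVs

con = sqlite3.connect(":memory:")  # the real pipeline writes a .db file
con.execute("CREATE TABLE word_frequencies (word TEXT PRIMARY KEY, freq INTEGER)")
with con:  # one transaction for the whole bulk load
    con.executemany("INSERT INTO word_frequencies VALUES (?, ?)", rows)
con.execute("CREATE INDEX idx_freq ON word_frequencies(freq)")
con.execute("ANALYZE")   # refresh query-planner statistics
con.execute("VACUUM")    # compact the database file
```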
### Step 5: Enrichment
- Input: The packaged SQLite `.db` file from Step 4.
- Process:
  - Mines confusable pairs (phonetic/orthographic variants such as aspiration swaps, medial swaps, and nasal endings).
  - Detects compound confusions (words incorrectly split during segmentation).
  - Extracts collocations using PMI/NPMI scoring.
  - Tags words with register labels (formal/informal/neutral) based on marker co-occurrence.
- Output: Enrichment tables added to the SQLite `.db` file (`confusable_pairs`, `compound_confusions`, `collocations`, `register_tags`).
- Disable with `--no-enrich` on the CLI or `enrich=False` in `PipelineConfig`.
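The PMI/NPMI scores used for collocation extraction follow the standard definitions, PMI(x, y) = log(p(x, y) / (p(x) p(y))) and NPMI(x, y) = PMI(x, y) / (−log p(x, y)). A minimal sketch of the scoring (not the pipeline's own implementation):

```python
import math

def pmi(p_xy: float, p_x: float, p_y: float) -> float:
    """Pointwise mutual information: how much more often x and y
    co-occur than independence would predict."""
    return math.log(p_xy / (p_x * p_y))

def npmi(p_xy: float, p_x: float, p_y: float) -> float:
    """Normalized PMI: ranges from -1 (never co-occur)
    through 0 (independent) to 1 (always co-occur)."""
    return pmi(p_xy, p_x, p_y) / -math.log(p_xy)
```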
## Configuration
The `PipelineConfig` class supports fine-tuning:
| Parameter | Default | Description |
|---|---|---|
| `batch_size` | 10,000 | Rows per Arrow batch. |
| `num_shards` | 20 | Number of shards to split ingested data into for parallel processing. |
| `num_workers` | `None` (auto-detect at runtime) | Number of parallel processes for segmentation. |
| `min_frequency` | 50 | Words appearing fewer times than this are discarded. |
| `min_syllable_frequency` | 1 | Minimum frequency for syllables to be included. |
| `min_bigram_count` | 10 | Minimum count for bigrams to be included. |
| `min_trigram_count` | 20 | Minimum count for trigrams to be included. |
| `min_fourgram_count` | 3 | Minimum count for fourgrams to be included. |
| `min_fivegram_count` | 2 | Minimum count for fivegrams to be included. |
| `deduplicate_lines` | `True` | Hash-based deduplication of lines within and across files. Disable with `--no-dedup`. |
| `remove_segmentation_markers` | `True` | Strip artificial word segmentation markers (spaces/underscores between Myanmar characters). Disable with `--no-desegment`. |
| `allow_extended_myanmar` | `False` | Include extended Myanmar Unicode blocks (U+1050–U+109F, U+AA60–U+AA7F, U+A9E0–U+A9FF). |
| `keep_intermediate` | `False` | If `True`, intermediate files are not deleted after a successful run. |
| `text_col` | `"text"` | Column name for CSV/TSV text ingestion. |
| `json_key` | `"text"` | Key name for JSON/JSONL text ingestion. |
| `word_engine` | `"myword"` | Word segmentation engine: `"crf"`, `"myword"`, or `"transformer"`. |
| `enrich` | `True` | Master toggle for Step 5 enrichment. Disable with `--no-enrich`. |
| `enrich_confusables` | `True` | Mine confusable pairs (aspiration swaps, medial swaps, etc.). |
| `enrich_compounds` | `True` | Detect compound confusions (incorrectly split words). |
| `enrich_collocations` | `True` | Extract collocations using PMI/NPMI scoring. |
| `enrich_register` | `True` | Tag words with formal/informal register labels. |
| `disk_space_check_mb` | 51,200 | Minimum free disk space required, in MB (50 GB). Set to 0 to disable. |
## Incremental Updates
The pipeline supports incremental updates, adding new data to an existing database without rebuilding from scratch.
## Curated Lexicon Support

You can mark specific words as trusted/curated (`is_curated=1`) in the database using the `--curated-input` option.
The curated input file is a word list with a `word` column header. During the build, each word receives an `is_curated` flag:

- Words from the POS seed file → `is_curated=1` (with POS tags)
- Words from the curated lexicon → `is_curated=1`
- Other corpus words → `is_curated=0`
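As an illustration, a minimal curated lexicon file could look like the following (the documented `word` header followed by one entry per line; the example entries are placeholders):

```
word
ကျောင်း
စာအုပ်
```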
Use the `scripts/merge_vocabulary.py` utility to prepare curated lexicons by merging and deduplicating vocabulary files from multiple sources.
See the Custom Dictionaries Guide for detailed usage examples.