CLI Reference
Basic Build
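A minimal build reads one corpus file and writes a SQLite database, using the flags documented below (file names here are illustrative):

```shell
# Build a dictionary database from a single corpus file
myspellchecker build --input corpus.txt --output myspellchecker.db

# Or produce a small sample database without any input corpus
myspellchecker build --sample
```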
Build Options
| Option | Default | Description |
|---|---|---|
| `--input FILE` | Required | Input corpus file (TXT, CSV, TSV, JSON, JSONL, Parquet) |
| `--output FILE` | `mySpellChecker-default.db` | Output SQLite database path |
| `--sample` | — | Build a small sample database (no input needed) |
| `--incremental` | — | Update existing database instead of rebuilding |
| `--min-frequency N` | 50 | Minimum word frequency to include |
| `--pos-tagger TYPE` | None | POS tagger: `rule_based`, `viterbi`, or `transformer`. No POS tagging when omitted. |
| `--pos-model NAME` | — | HuggingFace model for transformer tagger |
| `--pos-device ID` | -1 | Device for transformer POS tagger (-1 = CPU, 0+ = GPU) |
| `--num-workers N` | CPU count | Parallel worker processes |
| `--batch-size N` | 10000 | Records per processing batch |
| `--curated-input FILE` | — | CSV file with trusted vocabulary words |
| `--word-engine TYPE` | `myword` | Word segmentation engine: `myword`, `crf`, or `transformer` |
| `--validate` | — | Pre-flight validation of input (no build) |
| `--no-enrich` | — | Skip enrichment step (confusable pairs, compounds, collocations, register) |
| `--seg-model NAME` | — | HuggingFace model name/path for transformer word segmentation (only with `--word-engine=transformer`) |
| `--seg-device ID` | -1 | Device for transformer segmentation inference (-1 = CPU, 0+ = GPU) |
| `--curated-lexicon-hf` | — | Download and use the official curated lexicon from HuggingFace (`thettwe/myspellchecker-resources`) |
| `--no-dedup` | — | Disable line deduplication during ingestion |
| `--no-desegment` | — | Keep segmentation markers in text |
| `--verbose` / `-v` | — | Enable verbose logging with detailed timing breakdowns |
Run `myspellchecker build --help` for additional flags, including `--work-dir`, `--keep-intermediate`, `--col`, `--json-key`, and `--worker-timeout`.

Python API
Basic Usage
With Configuration
Building from Multiple Files
POS Tagging
Add POS tags to dictionary entries during build for grammar checking support:

POS Inference on Existing Database
Apply rule-based POS inference to an existing database without rebuilding:

Curated Lexicons
Curated words are trusted vocabulary inserted directly into the database before corpus processing. They are always recognized as valid regardless of corpus frequency.

Create a Lexicon CSV
Build with Curated Words
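Assuming a lexicon CSV with a single `word` column (an assumption for illustration; the authoritative column layout is on the Custom Dictionaries page), a curated build uses the documented `--curated-input` flag:

```shell
# Create a tiny curated lexicon (single "word" column is an assumption;
# see Custom Dictionaries for the exact format)
cat > lexicon.csv <<'EOF'
word
မြန်မာ
ရန်ကုန်
EOF

# Insert the curated words before corpus processing begins
myspellchecker build --input corpus.txt --curated-input lexicon.csv

# Or pull the official curated lexicon from HuggingFace instead
myspellchecker build --input corpus.txt --curated-lexicon-hf
```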
How Curated Words are Processed
| Scenario | frequency | is_curated |
|---|---|---|
| Curated only (not in corpus) | 0 | 1 |
| Curated + corpus overlap | corpus_freq | 1 |
| Corpus only | corpus_freq | 0 |
Curated entries are inserted first (`is_curated=1`, `frequency=0`); corpus processing then updates `frequency` while preserving the `is_curated` flag via `MAX()`.
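The merge behavior in the table above can be illustrated with a small SQLite sketch. The table layout here is deliberately simplified to the two columns under discussion; the real schema is on the Database Schema page:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE words (word TEXT PRIMARY KEY, frequency INTEGER, is_curated INTEGER)"
)

# 1. Curated entries are inserted first with frequency=0, is_curated=1.
curated = ["မြန်မာ", "ရန်ကုန်"]
con.executemany("INSERT INTO words VALUES (?, 0, 1)", [(w,) for w in curated])

# 2. Corpus frequencies are merged in afterwards; the scalar MAX() keeps
#    the curated flag for words that also appear in the corpus.
corpus_counts = {"မြန်မာ": 1200, "စာအုပ်": 300}
con.executemany(
    """INSERT INTO words VALUES (?, ?, 0)
       ON CONFLICT(word) DO UPDATE SET
           frequency = excluded.frequency,
           is_curated = MAX(is_curated, excluded.is_curated)""",
    list(corpus_counts.items()),
)

for row in con.execute("SELECT word, frequency, is_curated FROM words"):
    print(row)
```

The three printed rows reproduce the three scenarios in the table: curated-only (0, 1), curated + corpus overlap (1200, 1), and corpus-only (300, 0).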
Incremental Updates
Add new data to an existing dictionary without rebuilding from scratch:

Previously ingested files are recorded in the `processed_files` table to avoid reprocessing.
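Using the documented flags, an incremental update points `--output` at the existing database and adds `--incremental` (file names illustrative):

```shell
# Merge a new corpus file into an existing dictionary database
myspellchecker build --input new_articles.jsonl --output myspellchecker.db --incremental
```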
Enrichment
After frequency computation, the pipeline runs an enrichment step that mines additional linguistic data from the corpus. Disable it with `--no-enrich` on the CLI or `enrich=False` in `PipelineConfig`.
What Gets Mined
| Enrichment | Table | Purpose |
|---|---|---|
| Confusable pairs | confusable_pairs | Phonetically/orthographically similar word pairs (aspiration swaps, medial swaps, tone marks, nasal endings) |
| Compound confusions | compound_confusions | Words that may be incorrectly split during segmentation (e.g., “မြန်မာ” split as “မြန်” + “မာ”) |
| Collocations | collocations | Statistically significant word pairs with PMI/NPMI scores |
| Register tags | register_tags | Formal/informal register classification based on marker co-occurrence |
Configuration
Enrichment Thresholds
Fine-tune the mining process via `EnrichmentConfig` (passed internally from `PipelineConfig`):
| Parameter | Default | Description |
|---|---|---|
| `confusable_min_freq` | 50 | Minimum word frequency to generate confusable variants |
| `confusable_max_freq_ratio` | 1000.0 | Maximum frequency ratio between pair members |
| `compound_min_freq` | 100 | Minimum compound frequency to include |
| `compound_min_split_count` | 10 | Minimum bigram count for split form |
| `compound_min_pmi` | 2.0 | Minimum PMI for compound pairs |
| `collocation_min_count` | 20 | Minimum bigram occurrences |
| `collocation_min_pmi` | 3.0 | Minimum PMI for collocations |
| `register_min_total` | 50 | Minimum co-occurrence count with register markers |
| `register_threshold` | 0.3 | Score cutoff for formal/informal classification |
Confusable Pair Mining
Generates phonetic/orthographic variants for every word above a frequency threshold, then checks which variants are also valid dictionary words. Context overlap (cosine similarity of bigram context vectors) and frequency ratio are computed for each pair. Variant types mined:

- Aspiration swaps (က↔ခ, ပ↔ဖ, etc.)
- Medial swaps (ျ↔ြ) and medial insertion/deletion
- Nasal ending confusion (န်↔မ်↔ံ)
- Stop-coda confusion
- Tone mark changes (visarga add/remove)
- Vowel length changes
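The context-overlap score mentioned above is a cosine similarity between co-occurrence count vectors. A simplified sketch (the pipeline's real feature extraction may differ in windowing and weighting):

```python
from collections import Counter
from math import sqrt

def context_vector(tokens, target, window=1):
    """Count tokens appearing within `window` positions of `target`."""
    ctx = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx.update(t for t in tokens[lo:hi] if t != target)
    return ctx

def context_overlap(tokens, w1, w2):
    """Cosine similarity between the context vectors of w1 and w2."""
    a, b = context_vector(tokens, w1), context_vector(tokens, w2)
    dot = sum(a[k] * b[k] for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Two words that occur in identical contexts score near 1.0.
tokens = "the cat sat on the mat the bat sat on the mat".split()
print(round(context_overlap(tokens, "cat", "bat"), 3))  # → 1.0
```

A high overlap means the pair members are used interchangeably in context, which makes a mined variant a more plausible real-world confusion.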
Compound Confusion Detection
Finds bigrams `(w1, w2)` where the concatenation `w1+w2` is a high-frequency dictionary word. Computes PMI to measure how strongly the parts associate:
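Both this step and the collocation miner below score pairs with standard pointwise mutual information; a minimal sketch with illustrative counts, including the normalized variant (NPMI) used for collocations:

```python
from math import log2

def pmi(pair_count, c1, c2, total):
    """PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    return log2((pair_count / total) / ((c1 / total) * (c2 / total)))

def npmi(pair_count, c1, c2, total):
    """Normalized PMI: PMI / -log2(P(w1, w2)); always in [-1, 1]."""
    return pmi(pair_count, c1, c2, total) / -log2(pair_count / total)

# A split form seen 40 times among 10,000 bigrams, parts seen 200 and 100 times:
print(round(pmi(40, 200, 100, 10_000), 2))   # → 4.32 (above the 2.0 compound threshold)
print(round(npmi(40, 200, 100, 10_000), 2))  # → 0.54
```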
Collocation Mining
Extracts statistically significant word pairs using Pointwise Mutual Information. Normalized PMI (NPMI) provides a scale-independent score in [-1, 1].

Register Tagging
Classifies words as formal, informal, or neutral based on co-occurrence with register markers. Words appearing predominantly with formal sentence-final particles are tagged formal, and vice versa.
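One way to picture the classification: a signed score from the marker co-occurrence counts, cut at `register_threshold`. The score formula below is an assumption for illustration; only the threshold (0.3) and minimum count (50) come from the config table above:

```python
def register_tag(formal_count, informal_count, threshold=0.3, min_total=50):
    """Tag a word's register from co-occurrence counts with formal vs.
    informal markers. The (formal - informal) / total score is an assumed
    formula; defaults mirror register_threshold and register_min_total."""
    total = formal_count + informal_count
    if total < min_total:
        return "neutral"  # too little evidence either way
    score = (formal_count - informal_count) / total
    if score >= threshold:
        return "formal"
    if score <= -threshold:
        return "informal"
    return "neutral"

print(register_tag(90, 10))   # → formal
print(register_tag(20, 80))   # → informal
print(register_tag(55, 45))   # → neutral (score 0.1 is below the cutoff)
```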
Output Database Schema
Query Examples
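The enrichment tables can be inspected directly with the `sqlite3` CLI. The word-table name and the `npmi` column below are assumptions; see the Database Schema page for the authoritative layout:

```shell
# Curated entries that never appeared in the corpus
sqlite3 mySpellChecker-default.db \
  "SELECT word FROM words WHERE is_curated = 1 AND frequency = 0;"

# Strongest collocations first (npmi column name is an assumption)
sqlite3 mySpellChecker-default.db \
  "SELECT * FROM collocations ORDER BY npmi DESC LIMIT 10;"

# A sample of mined confusable pairs
sqlite3 mySpellChecker-default.db \
  "SELECT * FROM confusable_pairs LIMIT 10;"
```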
Verification
Performance
For large corpora, see Optimization for DuckDB acceleration (3-15x faster frequency counting) and Cython parallelization.

See Also
- Corpus Format - Input file specifications
- Optimization - DuckDB, Cython, performance tuning
- Custom Dictionaries - Curated lexicons, domain builds
- Database Schema - Full schema reference
- CLI Reference - All CLI commands