The myspellchecker CLI is installed with the package and provides commands for checking text, building dictionaries, training models, segmenting text, and managing configuration.
Installation
The CLI is installed automatically with the package:
Commands Overview
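A typical installation, assuming the package is published on PyPI under the name myspellchecker (the exact distribution name is an assumption):

```shell
# Install the package; the CLI entry point is registered automatically
pip install myspellchecker

# Verify the CLI is on PATH
myspellchecker --help
```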
| Command | Description |
|---|---|
| check | Check text for spelling errors (default when no command given) |
| build | Build dictionary database from corpus |
| train-model | Train a custom semantic model |
| train-detector | Train an error detection model (token classification) |
| segment | Segment text into words |
| config | Manage configuration |
| infer-pos | Infer POS tags for database |
| completion | Generate shell completion |
check
Check text for spelling errors. This is the default command: if no subcommand is recognized, check is assumed.
Usage
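A plausible synopsis, inferred from the arguments and options below (the bracketed grammar is an assumption, not verified CLI output):

```shell
myspellchecker check [OPTIONS] [INPUT]

# check is the default command, so this is equivalent:
myspellchecker [OPTIONS] [INPUT]
```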
Arguments
| Argument | Description |
|---|---|
| INPUT | Input file path (or stdin if omitted) |
Options
| Option | Short | Description |
|---|---|---|
| --output | -o | Output file path (default: stdout) |
| --format | -f | Output format: json, text, csv, rich (default: rich for TTY, json for pipes) |
| --color | | Force color output even when not a TTY |
| --no-color | | Disable color output |
| --level | | Validation level: syllable, word (default: syllable) |
| --db | | Custom database path |
| --no-phonetic | | Disable phonetic matching |
| --no-context | | Disable context checking |
| --preset | -p | Configuration preset: default, fast, accurate, minimal, strict |
| --verbose | -v | Enable verbose logging |
| --config | -c | Path to configuration file (YAML or JSON format) |
--color and --no-color are mutually exclusive.
Examples
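Illustrative invocations based on the options above; file names are placeholders and the exact syntax is an assumption:

```shell
# Check a file with rich terminal output
myspellchecker check input.txt

# Read from stdin, emit JSON for downstream tooling
cat input.txt | myspellchecker check --format json

# Word-level validation with a custom database, report written to a file
myspellchecker check input.txt --level word --db ./custom.db -o report.json -f json

# Fast preset with context checking disabled
myspellchecker check input.txt --preset fast --no-context
```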
Output Formats
Rich (default in terminal): Colored, formatted output with panels and tables using the Rich library. Auto-selected when running in an interactive terminal.
Text (grep-like): Plain, line-oriented output suitable for piping.
build
Build a dictionary database from corpus files.
Usage
Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input corpus file(s) (UTF-8 encoded text, CSV, TSV, JSON) |
| --output | -o | Output database path (default: mySpellChecker-default.db) |
| --work-dir | | Directory for intermediate files (default: temp_build) |
| --keep-intermediate | | Keep intermediate files after build |
| --sample | | Generate sample corpus for testing |
| --col | | Column name/index for CSV/TSV files (default: text) |
| --json-key | | Key name for JSON objects (default: text) |
| --pos-tagger | | POS tagger type: rule_based, viterbi, transformer |
| --pos-model | | HuggingFace model ID or local path for transformer tagger |
| --pos-device | | Device for transformer POS tagger: -1=CPU, 0+=GPU (default: -1) |
| --incremental | | Perform incremental update on existing database |
| --curated-input | | Path to curated lexicon CSV file (words marked as is_curated=1) |
| --word-engine | | Word segmentation engine: myword, crf, transformer (default: myword) |
| --seg-model | | HuggingFace model ID or local path for transformer word segmentation (only used when --word-engine=transformer) |
| --seg-device | | Device for transformer word segmenter: -1=CPU, 0+=GPU (default: -1; only used when --word-engine=transformer) |
| --validate | | Validate inputs only without building (pre-flight check) |
| --min-frequency | | Minimum word frequency threshold (default: from config) |
| --num-workers | | Number of parallel workers (default: auto-detect based on CPU cores) |
| --batch-size | | Batch size for processing (default: 10000) |
| --worker-timeout | | Worker timeout in seconds for parallel processing (default: 300) |
| --no-dedup | | Disable line-level deduplication during ingestion |
| --no-desegment | | Keep word segmentation markers in output |
| --verbose | -v | Enable verbose logging with detailed timing breakdowns |
Examples
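Illustrative invocations based on the options above; paths are placeholders and the exact syntax is an assumption:

```shell
# Build a database from a plain-text corpus
myspellchecker build -i corpus.txt -o mySpellChecker-default.db

# Build from a CSV column using a transformer POS tagger on GPU 0
myspellchecker build -i corpus.csv --col text --pos-tagger transformer --pos-device 0

# Pre-flight check: validate inputs without building
myspellchecker build -i corpus.txt --validate

# Incremental update of an existing database
myspellchecker build -i new_corpus.txt -o mySpellChecker-default.db --incremental
```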
Build Process
- Ingestion: Read and parse input files
- Segmentation: Break text into syllables and words
- Frequency Calculation: Count occurrences and N-grams
- POS Tagging: Tag words with part-of-speech (if enabled)
- Packaging: Create optimized SQLite database
train-model
Train semantic models for context checking.
Usage
Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input corpus file (required; raw text, one sentence per line) |
| --output | -o | Output directory for the model (required) |
| --architecture | -a | Model architecture: roberta, bert (default: roberta) |
| --epochs | | Number of training epochs (default: 5) |
| --batch-size | | Training batch size (default: 16) |
| --learning-rate | | Peak learning rate (default: 5e-5) |
| --warmup-ratio | | Ratio of steps for LR warmup (default: 0.1) |
| --weight-decay | | Weight decay for optimizer (default: 0.01) |
| --hidden-size | | Size of hidden layers (default: 256) |
| --layers | | Number of transformer layers (default: 4) |
| --heads | | Number of attention heads (default: 4) |
| --max-length | | Maximum sequence length (default: 128) |
| --vocab-size | | Tokenizer vocabulary size (default: 15000) |
| --min-frequency | | Minimum token frequency (default: 2) |
| --resume | | Resume training from checkpoint directory |
| --keep-checkpoints | | Keep intermediate PyTorch checkpoints after export |
| --no-metrics | | Disable saving training metrics to JSON |
Architectures
| Architecture | Description |
|---|---|
| roberta | RoBERTa (default) - Dynamic masking, no NSP |
| bert | BERT - Static masking, with NSP capability |
Examples
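Illustrative invocations based on the options above; paths are placeholders and the exact syntax is an assumption:

```shell
# Train a small RoBERTa-style model with defaults
myspellchecker train-model -i corpus.txt -o ./models/semantic

# BERT architecture with longer training and larger batches
myspellchecker train-model -i corpus.txt -o ./models/semantic \
  -a bert --epochs 10 --batch-size 32

# Resume an interrupted run from its checkpoint directory
myspellchecker train-model -i corpus.txt -o ./models/semantic \
  --resume ./models/semantic/checkpoint
```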
train-detector
Train an error detection model using token classification (fine-tunes XLM-RoBERTa).
Usage
Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input corpus file (required; clean text, one sentence per line) |
| --output | -o | Output directory for the model (required) |
| --base-model | | Base model for fine-tuning (default: xlm-roberta-base) |
| --epochs | | Number of training epochs (default: 3) |
| --batch-size | | Training batch size (default: 16) |
| --learning-rate | | Peak learning rate (default: 2e-5) |
| --corruption-ratio | | Fraction of words to corrupt per sentence (default: 0.15) |
| --max-length | | Maximum sequence length (default: 256) |
| --keep-checkpoints | | Keep intermediate PyTorch checkpoints after ONNX export |
| --no-metrics | | Disable saving training metrics to JSON |
| --seed | | Random seed for reproducibility |
| --skip-preprocessing | | Skip corpus preprocessing (Zawgyi conversion, normalization) |
Examples
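Illustrative invocations based on the options above; paths and values are placeholders and the exact syntax is an assumption:

```shell
# Fine-tune the default base model on a clean corpus
myspellchecker train-detector -i clean_corpus.txt -o ./models/detector

# Heavier corruption and a fixed seed for reproducibility
myspellchecker train-detector -i clean_corpus.txt -o ./models/detector \
  --corruption-ratio 0.25 --seed 42

# Corpus already normalized: skip preprocessing
myspellchecker train-detector -i clean_corpus.txt -o ./models/detector \
  --skip-preprocessing
```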
Training Process
- Load Corpus — Read input file (one sentence per line)
- Preprocess — Zawgyi conversion, Unicode normalization, quality filtering
- Generate Errors — Create synthetic errors from YAML rules
- Build Dataset — Tokenize with subword label alignment
- Train Model — Fine-tune XLM-RoBERTa for token classification
- Export to ONNX — Quantize and export for inference
segment
Segment text into words and optionally tag with POS.
Usage
Options
| Option | Short | Description |
|---|---|---|
| --output | -o | Output file path (default: stdout) |
| --format | -f | Output format: text, json, tsv (default: text) |
| --tag | | Include POS tags (uses joint segmentation-tagging) |
| --db | | Custom database path |
| --verbose | -v | Enable verbose logging |
Examples
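Illustrative invocations based on the options above, assuming segment reads from stdin when no input file is given (an assumption; the exact input handling is not documented here):

```shell
# Segment stdin to stdout
cat input.txt | myspellchecker segment

# Segment with POS tags, TSV output written to a file
cat input.txt | myspellchecker segment --tag -f tsv -o segmented.tsv
```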
Output
Word mode with tags:
config
Manage configuration files.
Usage
Subcommands
| Subcommand | Description |
|---|---|
| init | Create a new configuration file with defaults |
| show | Show configuration file search paths and current config |
config init Options
| Option | Description |
|---|---|
| --path | Path for configuration file (default: ~/.config/myspellchecker/myspellchecker.yaml) |
| --force | Overwrite existing configuration file |
Examples
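Illustrative invocations based on the subcommands and options above; the exact syntax is an assumption:

```shell
# Create a default configuration file at the standard location
myspellchecker config init

# Create (or overwrite) a config at a custom path
myspellchecker config init --path ./myspellchecker.yaml --force

# Inspect search paths and the effective configuration
myspellchecker config show
```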
Configuration File Locations
Configuration files are searched in this order:
- Path specified with the --config flag
- Current directory: myspellchecker.yaml, myspellchecker.yml, or myspellchecker.json
- User config directory: ~/.config/myspellchecker/myspellchecker.{yaml,yml,json}
infer-pos
Infer POS tags for untagged words in the database using a rule-based engine.
Usage
Options
| Option | Short | Description |
|---|---|---|
| --db | | Database path to update with inferred POS tags (required) |
| --min-frequency | | Minimum word frequency for inference (default: 0, infer all) |
| --min-confidence | | Minimum confidence threshold, 0.0-1.0 (default: 0.0) |
| --include-tagged | | Also infer for words that already have pos_tag (updates inferred_pos only) |
| --dry-run | | Show statistics without modifying the database |
| --verbose | -v | Enable verbose output with detailed statistics |
Inference Sources
| Source | Description |
|---|---|
| numeral_detection | Myanmar numerals and numeral words |
| prefix_pattern | Words with prefix patterns (e.g., အ prefix -> Noun) |
| proper_noun_suffix | Proper noun suffixes (country, city names) |
| ambiguous_registry | Known ambiguous words (multi-POS) |
| morphological | Suffix-based morphological analysis |
Examples
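Illustrative invocations based on the options above; the database path is a placeholder and the exact syntax is an assumption:

```shell
# Preview what would be inferred, without touching the database
myspellchecker infer-pos --db mySpellChecker-default.db --dry-run -v

# Infer tags only for frequent words, with a confidence floor
myspellchecker infer-pos --db mySpellChecker-default.db \
  --min-frequency 5 --min-confidence 0.8
```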
completion
Generate shell completion scripts.
Usage
Options
| Option | Description |
|---|---|
| --shell | Shell type: bash, zsh, fish (default: bash) |
Examples
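Illustrative invocations based on the option above; the installation targets are conventional locations, and the exact syntax is an assumption:

```shell
# Bash (default)
myspellchecker completion >> ~/.bashrc

# Zsh
myspellchecker completion --shell zsh >> ~/.zshrc

# Fish
myspellchecker completion --shell fish > ~/.config/fish/completions/myspellchecker.fish
```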
Global Options
Available for all commands:
| Option | Description |
|---|---|
| --help | Show help message |
--verbose/-v is available on most subcommands (check, build, segment, infer-pos) but is defined per-subcommand, not globally.
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success (no errors found, or validation passed) |
| 1 | General runtime error (configuration, data loading, etc.) |
| 2 | Invalid arguments, file not found, or permission error |
| 130 | Process interrupted (Ctrl+C) |
Configuration File
Create ~/.config/myspellchecker/myspellchecker.yaml:
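A minimal sketch of what the file might contain; the key names below are assumptions inferred from the environment variables and CLI options, not a verified schema:

```yaml
# ~/.config/myspellchecker/myspellchecker.yaml (hypothetical keys)
database_path: ~/.local/share/myspellchecker/mySpellChecker-default.db
max_edit_distance: 2        # valid range 1-3
use_context_checker: true
```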
Environment Variables
| Variable | Description |
|---|---|
| MYSPELL_DATABASE_PATH | Default database path |
| MYSPELL_MAX_EDIT_DISTANCE | Max edit distance (1-3) |
| MYSPELL_USE_CONTEXT_CHECKER | Enable context validation (true/false) |
Next Steps
- Configuration - Full configuration options
- Data Pipeline - Building dictionaries
- API Reference - Python API