The myspellchecker CLI is installed with the package and provides commands for checking text, building dictionaries, training AI models, segmenting text, and managing configuration.
Installation
The CLI is installed automatically with the package:
pip install myspellchecker
myspellchecker --help
Commands Overview
| Command | Description |
|---|---|
| check | Check text for spelling errors (default when no command given) |
| build | Build dictionary database from corpus |
| train-model | Train a custom semantic model |
| segment | Segment text into words |
| config | Manage configuration |
| infer-pos | Infer POS tags for database |
| completion | Generate shell completion |
check
Check text for spelling errors. This is the default command: if no subcommand is recognized, check is assumed.
Usage
myspellchecker check [OPTIONS] [INPUT]
Arguments
| Argument | Description |
|---|---|
| INPUT | Input file path (or stdin if omitted) |
Options
| Option | Short | Description |
|---|---|---|
| --output | -o | Output file path (default: stdout) |
| --format | -f | Output format: json, text, csv, rich (default: rich for TTY, json for pipes) |
| --color | | Force color output even when not a TTY |
| --no-color | | Disable color output |
| --level | | Validation level: syllable, word (default: syllable) |
| --db | | Custom database path |
| --no-phonetic | | Disable phonetic matching |
| --no-context | | Disable context checking |
| --no-ner | | Disable Named Entity Recognition |
| --ner-model | | HuggingFace model name for transformer NER (default: chuuhtetnaing/myanmar-ner-model). Requires pip install myspellchecker[transformers] |
| --ner-device | | Device for NER inference: -1=CPU, 0+=GPU index (default: -1) |
| --preset | -p | Configuration preset: default, fast, accurate, minimal, strict |
| --verbose | -v | Enable verbose logging |
| --config | -c | Path to configuration file (YAML or JSON format) |
Note: --color and --no-color are mutually exclusive.
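The default output format depends on where stdout goes: rich in an interactive terminal, json when piped or redirected. A wrapper script can reproduce the same decision; this is an illustrative sketch of the TTY check, not the package's actual code:

```python
import sys

def default_format() -> str:
    """Mirror the documented default: rich for a TTY, json for pipes."""
    return "rich" if sys.stdout.isatty() else "json"

print(default_format())
```

Running this directly in a terminal prints `rich`; piping it through another command prints `json`, matching the CLI's behavior.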
Examples
# Check a file
myspellchecker check document.txt
# Check with JSON output
myspellchecker check document.txt -f json -o results.json
# Check from stdin
echo "မြန်မာနိုင်ငံ" | myspellchecker check
# Use specific database
myspellchecker check document.txt --db custom.db
# Fast checking (syllable only)
myspellchecker check document.txt --level syllable
# Thorough checking
myspellchecker check document.txt --level word -p accurate
# With custom config file
myspellchecker check document.txt -c config.yaml
# Force color output in a pipe
myspellchecker check document.txt --color | less -R
# Rich formatted output (default in terminal)
myspellchecker check document.txt -f rich
Rich (default in terminal): Colored, formatted output with panels and tables using the Rich library. Auto-selected when running in an interactive terminal.
Text (grep-like):
# WARNING: Myanmar text may not render correctly in your terminal.
# Use a text editor with proper font support to view this output.
document.txt:1:5: invalid_syllable 'xyz' -> Try: [abc, def, ghi]
# Summary: 1 errors found in 10 lines.
JSON (default in pipes):
{
"summary": {
"total_errors": 1,
"total_lines": 10
},
"results": [
{
"file": "document.txt",
"line": 1,
"text": "...",
"has_errors": true,
"errors": [
{
"text": "xyz",
"position": 5,
"error_type": "invalid_syllable",
"suggestions": ["abc", "def", "ghi"],
"confidence": 1.0
}
]
}
]
}
Additional fields appear in each error object depending on the error type: action and message for syllable errors, syllable_count for word errors, and probability and prev_word for context errors.
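Downstream scripts can consume the JSON format directly. A minimal sketch that extracts a fix list from a report (the inline payload stands in for a real `-f json` run):

```python
import json

# Sample payload matching the documented schema.
raw = """
{
  "summary": {"total_errors": 1, "total_lines": 10},
  "results": [
    {
      "file": "document.txt",
      "line": 1,
      "text": "...",
      "has_errors": true,
      "errors": [
        {
          "text": "xyz",
          "position": 5,
          "error_type": "invalid_syllable",
          "suggestions": ["abc", "def", "ghi"],
          "confidence": 1.0
        }
      ]
    }
  ]
}
"""

report = json.loads(raw)
# Collect (file, line, position, top suggestion) for every error.
fixes = [
    (r["file"], r["line"], e["position"], e["suggestions"][0])
    for r in report["results"] if r["has_errors"]
    for e in r["errors"]
]
print(fixes)  # [('document.txt', 1, 5, 'abc')]
```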
CSV:
file,line,position,error_type,text,suggestions
document.txt,1,5,invalid_syllable,xyz,"abc,def,ghi"
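Note that the suggestions column is itself comma-separated and quoted, so naive splitting on commas would break it; use a CSV-aware parser. A sketch with the sample row above:

```python
import csv
import io

# Sample data in the documented CSV layout.
data = (
    "file,line,position,error_type,text,suggestions\n"
    'document.txt,1,5,invalid_syllable,xyz,"abc,def,ghi"\n'
)

rows = list(csv.DictReader(io.StringIO(data)))
suggestions = rows[0]["suggestions"].split(",")
print(suggestions)  # ['abc', 'def', 'ghi']
```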
build
Build a dictionary database from corpus files.
Usage
myspellchecker build [OPTIONS]
Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input corpus file(s) (UTF-8 encoded text, CSV, TSV, JSON) |
| --output | -o | Output database path (default: mySpellChecker-default.db) |
| --work-dir | | Directory for intermediate files (default: temp_build) |
| --keep-intermediate | | Keep intermediate files after build |
| --sample | | Generate sample corpus for testing |
| --col | | Column name/index for CSV/TSV files (default: text) |
| --json-key | | Key name for JSON objects (default: text) |
| --pos-tagger | | POS tagger type: rule_based, viterbi, transformer |
| --pos-model | | HuggingFace model ID or local path for transformer tagger (default: chuuhtetnaing/myanmar-pos-model) |
| --pos-device | | Device for transformer POS tagger: -1=CPU, 0+=GPU (default: -1) |
| --incremental | | Perform incremental update on existing database |
| --curated-input | | Path to curated lexicon CSV file (words marked as is_curated=1) |
| --curated-lexicon-hf | | Download and use the official curated lexicon from HuggingFace (thettwe/myspellchecker-resources). Cached after first download. Mutually exclusive with --curated-input |
| --word-engine | | Word segmentation engine: myword, crf, transformer (default: myword) |
| --seg-model | | HuggingFace model ID or local path for transformer word segmentation (only used when --word-engine=transformer) |
| --seg-device | | Device for transformer word segmenter: -1=CPU, 0+=GPU (default: -1, only used when --word-engine=transformer) |
| --validate | | Validate inputs only without building (pre-flight check) |
| --min-frequency | | Minimum word frequency threshold (default: from config) |
| --num-workers | | Number of parallel workers (default: auto-detect based on CPU cores) |
| --batch-size | | Batch size for processing (default: 10000) |
| --worker-timeout | | Worker timeout in seconds for parallel processing (default: 1800) |
| --no-dedup | | Disable line-level deduplication during ingestion |
| --no-desegment | | Keep word segmentation markers (spaces/underscores between Myanmar chars) |
| --no-enrich | | Skip enrichment step (confusable pairs, compounds, collocations, register tags) |
| --verbose | -v | Enable verbose logging with detailed timing breakdowns |
Examples
# Build sample database
myspellchecker build --sample
# Build from corpus file
myspellchecker build -i corpus.txt -o dictionary.db
# Build from multiple files
myspellchecker build -i "data/*.txt" "extra/*.json"
# Build from directory (auto-detects txt, json, jsonl)
myspellchecker build -i ./corpus/ -o dictionary.db
# Validate before building
myspellchecker build -i corpus.txt --validate
# With POS tagging
myspellchecker build -i corpus.txt --pos-tagger viterbi
# Incremental update
myspellchecker build -i new_data.txt -o dictionary.db --incremental
# Filter by frequency
myspellchecker build -i corpus.txt -o dictionary.db --min-frequency 5
# Build with curated lexicon
myspellchecker build -i corpus.txt --curated-input data/curated_lexicon.csv -o dictionary.db
# Combine corpus with curated lexicon and transformer POS tagger
myspellchecker build -i corpus.txt --curated-input data/curated_lexicon.csv \
--pos-tagger transformer --pos-device 0 -o dictionary.db
Build Process
- Ingestion: Read and parse input files
- Segmentation: Break text into syllables and words
- Frequency Calculation: Count occurrences and N-grams
- POS Tagging: Tag words with part-of-speech (if enabled)
- Enrichment: Mine confusable pairs, compounds, collocations, register tags (unless --no-enrich)
- Packaging: Create optimized SQLite database
train-model
Train semantic models for context checking.
Usage
myspellchecker train-model [OPTIONS]
Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input corpus file (required; raw text, one sentence per line) |
| --output | -o | Output directory for the model (required) |
| --architecture | -a | Model architecture: roberta, bert (default: roberta) |
| --epochs | | Number of training epochs (default: 5) |
| --batch-size | | Training batch size (default: 16) |
| --learning-rate | | Peak learning rate (default: 5e-5) |
| --warmup-ratio | | Ratio of steps for LR warmup (default: 0.1) |
| --weight-decay | | Weight decay for optimizer (default: 0.01) |
| --hidden-size | | Size of hidden layers (default: 256) |
| --layers | | Number of transformer layers (default: 4) |
| --heads | | Number of attention heads (default: 4) |
| --max-length | | Maximum sequence length (default: 128) |
| --vocab-size | | Tokenizer vocabulary size (default: 30000) |
| --min-frequency | | Minimum token frequency (default: 2) |
| --resume | | Resume training from checkpoint directory |
| --keep-checkpoints | | Keep intermediate PyTorch checkpoints after export |
| --no-metrics | | Disable saving training metrics to JSON |
| --streaming | | Use streaming mode for large corpora (constant memory usage) |
| --checkpoint-dir | | Persistent checkpoint directory for job resume (e.g. /opt/ml/checkpoints) |
| --max-steps | | Cap total training steps (overrides epochs x steps_per_epoch) |
| --fp16 | | Enable mixed-precision (FP16) training for faster speed and lower memory |
| --gradient-accumulation-steps | | Accumulate gradients over N steps (effective batch = batch-size x N, default: 1) |
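Before launching a long run, the interaction of --batch-size, --gradient-accumulation-steps, --epochs, and --warmup-ratio can be sanity-checked with simple arithmetic. A sketch using the documented formulas; the corpus size here is an assumed figure for illustration:

```python
# Documented relationship: effective batch = batch-size x accumulation steps.
batch_size = 16            # --batch-size default
grad_accum = 4             # e.g. --gradient-accumulation-steps 4
effective_batch = batch_size * grad_accum

corpus_sentences = 100_000  # assumed corpus size (one sentence per line)
epochs = 5                  # --epochs default
steps_per_epoch = corpus_sentences // effective_batch
total_steps = epochs * steps_per_epoch

# --warmup-ratio 0.1 means 10% of total steps are spent warming up the LR.
warmup_steps = int(0.1 * total_steps)

print(effective_batch, total_steps, warmup_steps)  # 64 7810 781
```

If total_steps exceeds your budget, --max-steps caps it directly.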
Architectures
| Architecture | Description |
|---|---|
| roberta | RoBERTa (default) - Dynamic masking, no NSP |
| bert | BERT - Static masking, with NSP capability |
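The masking difference in the table can be illustrated with a toy example (this is a conceptual sketch, not the package's training code): static masking fixes the masked positions once when the dataset is built, while dynamic masking re-samples them every time a sentence is seen.

```python
import random

tokens = ["the", "cat", "sat", "on", "mat"]
random.seed(0)

# Static masking (BERT-style): positions chosen once, reused every epoch.
static_positions = sorted(random.sample(range(len(tokens)), k=2))

# Dynamic masking (RoBERTa-style): fresh positions on every pass,
# so the model sees more varied prediction targets from the same corpus.
def dynamic_positions():
    return sorted(random.sample(range(len(tokens)), k=2))

print(static_positions, dynamic_positions(), dynamic_positions())
```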
Examples
# Train with default settings (RoBERTa architecture)
myspellchecker train-model -i corpus.txt -o ./models/
# Train BERT model with more epochs
myspellchecker train-model -i corpus.txt -o ./models/ --architecture bert --epochs 10
# Train with custom hyperparameters
myspellchecker train-model -i corpus.txt -o ./models/ \
--learning-rate 3e-5 --warmup-ratio 0.1 --weight-decay 0.01
# Train larger model
myspellchecker train-model -i corpus.txt -o ./models/ \
--hidden-size 512 --layers 6 --heads 8
# Resume training from checkpoint
myspellchecker train-model -i corpus.txt -o ./models/ \
--resume ./models/checkpoints/checkpoint-500
# Keep checkpoints and disable metrics
myspellchecker train-model -i corpus.txt -o ./models/ \
--keep-checkpoints --no-metrics
segment
Segment text into words and optionally tag with POS.
Usage
myspellchecker segment [OPTIONS] [INPUT]
Options
| Option | Short | Description |
|---|---|---|
| --output | -o | Output file path (default: stdout) |
| --format | -f | Output format: text, json, tsv (default: text) |
| --tag | | Include POS tags (uses joint segmentation-tagging) |
| --db | | Custom database path |
| --verbose | -v | Enable verbose logging |
Examples
# Segment text (default text format)
myspellchecker segment document.txt
# Output as JSON
myspellchecker segment document.txt -f json
# Output as TSV
myspellchecker segment document.txt -f tsv
# With POS tags
myspellchecker segment document.txt --tag
# From stdin
echo "မြန်မာနိုင်ငံ" | myspellchecker segment
Output
Word mode with tags:
config
Manage configuration files.
Usage
myspellchecker config [SUBCOMMAND]
Subcommands
| Subcommand | Description |
|---|---|
| init | Create a new configuration file with defaults |
| show | Show configuration file search paths and current config |
config init Options
| Option | Description |
|---|---|
| --path | Path for configuration file (default: ~/.config/myspellchecker/myspellchecker.yaml) |
| --force | Overwrite existing configuration file |
Examples
# Create config file (default location)
myspellchecker config init
# Create config file at custom path
myspellchecker config init --path ./myspellchecker.yaml
# Overwrite existing config file
myspellchecker config init --force
# Show current config and search paths
myspellchecker config show
Configuration File Locations
Configuration files are searched in this order:
- Path specified with --config flag
- Current directory: myspellchecker.yaml, myspellchecker.yml, or myspellchecker.json
- User config directory: ~/.config/myspellchecker/myspellchecker.{yaml,yml,json}
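The search order above can be sketched as a candidate list; this is an illustrative reconstruction, not the package's actual loader:

```python
import os

def config_candidates(cli_path=None):
    """Return config file paths in the documented search order."""
    candidates = []
    if cli_path:                            # 1. --config flag wins
        candidates.append(cli_path)
    for ext in ("yaml", "yml", "json"):     # 2. current directory
        candidates.append(f"./myspellchecker.{ext}")
    home = os.path.expanduser("~/.config/myspellchecker")
    for ext in ("yaml", "yml", "json"):     # 3. user config directory
        candidates.append(f"{home}/myspellchecker.{ext}")
    return candidates

print(config_candidates("custom.yaml")[0])  # custom.yaml
```

The first existing file in the list would be loaded.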
infer-pos
Infer POS tags for untagged words in the database using a rule-based engine.
Usage
myspellchecker infer-pos [OPTIONS]
Options
| Option | Short | Description |
|---|---|---|
| --db | | Database path to update with inferred POS tags (required) |
| --min-frequency | | Minimum word frequency for inference (default: 0, infer all) |
| --min-confidence | | Minimum confidence threshold, 0.0-1.0 (default: 0.0) |
| --include-tagged | | Also infer for words that already have pos_tag (updates inferred_pos only) |
| --dry-run | | Show statistics without modifying the database |
| --verbose | -v | Enable verbose output with detailed statistics |
Inference Sources
| Source | Description |
|---|---|
| numeral_detection | Myanmar numerals and numeral words |
| prefix_pattern | Words with prefix patterns (e.g., အ prefix -> Noun) |
| proper_noun_suffix | Proper noun suffixes (country, city names) |
| ambiguous_registry | Known ambiguous words (multi-POS) |
| morphological | Suffix-based morphological analysis |
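To give a feel for how these sources work, here is a toy version of the prefix_pattern rule from the table. The actual rules and confidence values are internal to the package; the confidence here is purely illustrative:

```python
def infer_pos(word: str):
    """Toy prefix_pattern rule: words with the အ prefix lean Noun."""
    if word.startswith("အ"):
        # (tag, source, confidence) - confidence value is illustrative
        return ("noun", "prefix_pattern", 0.8)
    return (None, None, 0.0)

print(infer_pos("အလုပ်"))  # ('noun', 'prefix_pattern', 0.8)
```

With --min-confidence 0.7, inferences scoring below 0.7 would be discarded.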
Examples
# Infer POS tags for all untagged words
myspellchecker infer-pos --db dictionary.db
# Infer only for high-frequency words
myspellchecker infer-pos --db dictionary.db --min-frequency 10
# Set minimum confidence threshold
myspellchecker infer-pos --db dictionary.db --min-confidence 0.7
# Preview changes without modifying database
myspellchecker infer-pos --db dictionary.db --dry-run
# Include already-tagged words for re-inference
myspellchecker infer-pos --db dictionary.db --include-tagged
completion
Generate shell completion scripts.
Usage
myspellchecker completion --shell [bash|zsh|fish]
Options
| Option | Description |
|---|---|
| --shell | Shell type: bash, zsh, fish (default: bash) |
Examples
# Generate bash completion
myspellchecker completion --shell bash > ~/.bash_completion.d/myspellchecker
source ~/.bash_completion.d/myspellchecker
# Generate zsh completion
myspellchecker completion --shell zsh > ~/.zsh/completions/_myspellchecker
# Generate fish completion
myspellchecker completion --shell fish > ~/.config/fish/completions/myspellchecker.fish
Global Options
Available for all commands:
| Option | Description |
|---|---|
| --help | Show help message |
Note: --verbose/-v is available on most subcommands (check, build, segment, infer-pos) but is defined per-subcommand, not globally.
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success (no errors found, or validation passed) |
| 1 | General runtime error (configuration, data loading, etc.) |
| 2 | Invalid arguments, file not found, or permission error |
| 130 | Process interrupted (Ctrl+C) |
Configuration File
Create ~/.config/myspellchecker/myspellchecker.yaml:
# Database path (required - no bundled database included)
database: /path/to/your/custom.db
# Use a preset (default, fast, accurate, minimal, strict)
preset: default
# Core settings
max_edit_distance: 2
max_suggestions: 5
# Feature toggles
use_phonetic: true
use_context_checker: true
# Provider configuration
provider_config:
pool_max_size: 5
pool_timeout: 5.0
Environment Variables
All environment variables use the MYSPELL_ prefix. They override config file values but are overridden by CLI flags.
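That precedence (CLI flag over environment variable over config file value) can be sketched as follows; this is an illustrative reconstruction, not the package's actual resolution code:

```python
import os

def resolve(key: str, env_var: str, config: dict, cli: dict):
    """CLI flag > MYSPELL_* environment variable > config file value."""
    if key in cli:
        return cli[key]
    if env_var in os.environ:
        return os.environ[env_var]
    return config.get(key)

os.environ["MYSPELL_MAX_SUGGESTIONS"] = "7"
config = {"max_suggestions": 5}

# Environment overrides the config file...
print(resolve("max_suggestions", "MYSPELL_MAX_SUGGESTIONS", config, {}))
# ...but a CLI flag overrides both.
print(resolve("max_suggestions", "MYSPELL_MAX_SUGGESTIONS", config,
              {"max_suggestions": 3}))
```

Note that environment values arrive as strings and are coerced to the documented types (integers, booleans) before use.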
Core Settings
| Variable | Description | Values |
|---|---|---|
| MYSPELL_DATABASE_PATH | Default database path | File path |
| MYSPELL_MAX_EDIT_DISTANCE | Max edit distance | 1-3 |
| MYSPELL_MAX_SUGGESTIONS | Max suggestions returned | Integer >= 1 |
| MYSPELL_USE_CONTEXT_CHECKER | Enable context validation | true/false |
| MYSPELL_USE_PHONETIC | Enable phonetic matching | true/false |
| MYSPELL_USE_NER | Enable Named Entity Recognition | true/false |
| MYSPELL_USE_RULE_BASED_VALIDATION | Enable rule-based validation | true/false |
| MYSPELL_WORD_ENGINE | Word segmentation engine | myword, crf |
| MYSPELL_FALLBACK_TO_EMPTY_PROVIDER | Fall back to empty provider if DB missing | true/false |
| MYSPELL_ALLOW_EXTENDED_MYANMAR | Allow Extended Myanmar characters (Shan, Mon) | true/false |
POS Tagger Settings
| Variable | Description | Values |
|---|---|---|
| MYSPELL_POS_TAGGER_TYPE | POS tagger type | rule_based, viterbi, transformer |
| MYSPELL_POS_TAGGER_BEAM_WIDTH | Beam width for Viterbi tagger | Integer >= 1 |
| MYSPELL_POS_TAGGER_MODEL_NAME | Transformer model name/path | String |
Provider Settings
| Variable | Description | Values |
|---|---|---|
| MYSPELL_POOL_MIN_SIZE | Connection pool minimum size | Integer >= 0 |
| MYSPELL_POOL_MAX_SIZE | Connection pool maximum size | Integer >= 1 |
Semantic Checker Settings
| Variable | Description | Values |
|---|---|---|
| MYSPELL_SEMANTIC_MODEL_PATH | Path to ONNX model file | File path |
| MYSPELL_SEMANTIC_TOKENIZER_PATH | Path to tokenizer directory | Directory path |
| MYSPELL_SEMANTIC_NUM_THREADS | Inference threads | Integer |
| MYSPELL_SEMANTIC_PREDICT_TOP_K | Top-K predictions for mask filling | Integer |
| MYSPELL_SEMANTIC_CHECK_TOP_K | Top-K candidates to check | Integer |
SymSpell Settings
| Variable | Description | Values |
|---|---|---|
| MYSPELL_SYMSPELL_PREFIX_LENGTH | Prefix length for SymSpell | 4-10 |
| MYSPELL_SYMSPELL_BEAM_WIDTH | Beam width | Integer >= 1 |
| MYSPELL_SYMSPELL_USE_WEIGHTED_DISTANCE | Use Myanmar-weighted edit distance | true/false |
N-gram Context Settings
| Variable | Description | Values |
|---|---|---|
| MYSPELL_NGRAM_BIGRAM_THRESHOLD | Bigram probability threshold | 0.0-1.0 |
| MYSPELL_NGRAM_TRIGRAM_THRESHOLD | Trigram probability threshold | 0.0-1.0 |
| MYSPELL_NGRAM_RERANK_LEFT_WEIGHT | Left-context rerank weight | 0.0-1.0 |
| MYSPELL_NGRAM_RERANK_RIGHT_WEIGHT | Right-context rerank weight | 0.0-1.0 |
Phonetic Settings
| Variable | Description | Values |
|---|---|---|
| MYSPELL_PHONETIC_BYPASS_THRESHOLD | Phonetic similarity threshold | 0.0-1.0 |
| MYSPELL_PHONETIC_EXTRA_DISTANCE | Extra edit distance for phonetic bypass | 0-3 |
Ranker Settings
| Variable | Description | Values |
|---|---|---|
| MYSPELL_RANKER_UNIFIED_BASE_TYPE | Base ranker type | default, frequency_first, phonetic_first, edit_distance_only |
| MYSPELL_RANKER_ENABLE_TARGETED_RERANK_HINTS | Enable targeted rerank hints | true/false |
| MYSPELL_RANKER_ENABLE_TARGETED_CANDIDATE_INJECTIONS | Enable targeted candidate injections | true/false |
| MYSPELL_RANKER_ENABLE_TARGETED_GRAMMAR_COMPLETION_TEMPLATES | Enable grammar completion templates | true/false |
Next Steps