The myspellchecker CLI is installed with the package and provides commands for checking text, building dictionaries, training AI models, segmenting text, and managing configuration.

Installation

The CLI is installed automatically with the package:
pip install myspellchecker
myspellchecker --help

Commands Overview

Command      Description
check        Check text for spelling errors (default when no command given)
build        Build dictionary database from corpus
train-model  Train a custom semantic model
segment      Segment text into words
config       Manage configuration
infer-pos    Infer POS tags for database
completion   Generate shell completion

check

Check text for spelling errors. This is the default command: if no subcommand is recognized, check is assumed.

Usage

myspellchecker check [OPTIONS] [INPUT]

Arguments

Argument  Description
INPUT     Input file path (or stdin if omitted)

Options

Option         Short  Description
--output       -o     Output file path (default: stdout)
--format       -f     Output format: json, text, csv, rich (default: rich for TTY, json for pipes)
--color               Force color output even when not a TTY
--no-color            Disable color output
--level               Validation level: syllable, word (default: syllable)
--db                  Custom database path
--no-phonetic         Disable phonetic matching
--no-context          Disable context checking
--no-ner              Disable Named Entity Recognition
--ner-model           HuggingFace model name for transformer NER (default: chuuhtetnaing/myanmar-ner-model). Requires pip install myspellchecker[transformers]
--ner-device          Device for NER inference: -1=CPU, 0+=GPU index (default: -1)
--preset       -p     Configuration preset: default, fast, accurate, minimal, strict
--verbose      -v     Enable verbose logging
--config       -c     Path to configuration file (YAML or JSON format)
Note: --color and --no-color are mutually exclusive.

Examples

# Check a file
myspellchecker check document.txt

# Check with JSON output
myspellchecker check document.txt -f json -o results.json

# Check from stdin
echo "မြန်မာနိုင်ငံ" | myspellchecker check

# Use specific database
myspellchecker check document.txt --db custom.db

# Fast checking (syllable only)
myspellchecker check document.txt --level syllable

# Thorough checking
myspellchecker check document.txt --level word -p accurate

# With custom config file
myspellchecker check document.txt -c config.yaml

# Force color output in a pipe
myspellchecker check document.txt --color | less -R

# Rich formatted output (default in terminal)
myspellchecker check document.txt -f rich

Output Formats

Rich (default in terminal): Colored, formatted output with panels and tables using the Rich library. Auto-selected when running in an interactive terminal.

Text (grep-like):
# WARNING: Myanmar text may not render correctly in your terminal.
# Use a text editor with proper font support to view this output.

document.txt:1:5: invalid_syllable 'xyz' -> Try: [abc, def, ghi]

# Summary: 1 errors found in 10 lines.

JSON (default in pipes):
{
  "summary": {
    "total_errors": 1,
    "total_lines": 10
  },
  "results": [
    {
      "file": "document.txt",
      "line": 1,
      "text": "...",
      "has_errors": true,
      "errors": [
        {
          "text": "xyz",
          "position": 5,
          "error_type": "invalid_syllable",
          "suggestions": ["abc", "def", "ghi"],
          "confidence": 1.0
        }
      ]
    }
  ]
}
Additional fields appear in each error object depending on the error type: action and message for syllable errors, syllable_count for word errors, and probability and prev_word for context errors.
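The JSON output is straightforward to post-process in scripts. As an illustrative sketch (the sample file below is written by hand to mirror the schema above; in practice it would come from `myspellchecker check document.txt -f json -o results.json`), a short Python snippet driven from the shell can flatten every error into a grep-style line:

```shell
# Save a hand-written sample of the documented JSON schema.
cat > results.json <<'EOF'
{"summary": {"total_errors": 1, "total_lines": 10},
 "results": [{"file": "document.txt", "line": 1, "text": "...", "has_errors": true,
   "errors": [{"text": "xyz", "position": 5, "error_type": "invalid_syllable",
     "suggestions": ["abc", "def", "ghi"], "confidence": 1.0}]}]}
EOF

# Flatten each error into a file:line:position summary line.
python3 - <<'EOF'
import json

with open("results.json") as f:
    data = json.load(f)

for result in data["results"]:
    for err in result["errors"]:
        print(f'{result["file"]}:{result["line"]}:{err["position"]} '
              f'{err["error_type"]} {err["text"]!r} -> {", ".join(err["suggestions"])}')
EOF
```

This prints lines in the same shape as the text format above (file, line, position, error type, token, suggestions).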

CSV:
file,line,position,error_type,text,suggestions
document.txt,1,5,invalid_syllable,xyz,"abc,def,ghi"
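Because the CSV format keeps one error per row, standard shell tools can aggregate it directly. A sketch (the sample rows are invented for illustration; only the first field is read, so the commas inside the quoted suggestions column do not affect the tally):

```shell
# Sample CSV in the documented layout (normally produced with -f csv).
cat > results.csv <<'EOF'
file,line,position,error_type,text,suggestions
document.txt,1,5,invalid_syllable,xyz,"abc,def,ghi"
document.txt,3,2,invalid_syllable,pqr,"abc"
notes.txt,7,1,invalid_syllable,foo,"bar"
EOF

# Count errors per file: skip the header, tally on the first column.
awk -F',' 'NR > 1 { count[$1]++ } END { for (f in count) print f, count[f] }' results.csv | sort
```

For the sample above this prints `document.txt 2` and `notes.txt 1`.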

build

Build a dictionary database from corpus files.

Usage

myspellchecker build [OPTIONS]

Options

Option                Short  Description
--input               -i     Input corpus file(s) (UTF-8 encoded text, CSV, TSV, JSON)
--output              -o     Output database path (default: mySpellChecker-default.db)
--work-dir                   Directory for intermediate files (default: temp_build)
--keep-intermediate          Keep intermediate files after build
--sample                     Generate sample corpus for testing
--col                        Column name/index for CSV/TSV files (default: text)
--json-key                   Key name for JSON objects (default: text)
--pos-tagger                 POS tagger type: rule_based, viterbi, transformer
--pos-model                  HuggingFace model ID or local path for transformer tagger (default: chuuhtetnaing/myanmar-pos-model)
--pos-device                 Device for transformer POS tagger: -1=CPU, 0+=GPU (default: -1)
--incremental                Perform incremental update on existing database
--curated-input              Path to curated lexicon CSV file (words marked as is_curated=1)
--curated-lexicon-hf         Download and use the official curated lexicon from HuggingFace (thettwe/myspellchecker-resources). Cached after first download. Mutually exclusive with --curated-input
--word-engine                Word segmentation engine: myword, crf, transformer (default: myword)
--seg-model                  HuggingFace model ID or local path for transformer word segmentation (only used when --word-engine=transformer)
--seg-device                 Device for transformer word segmenter: -1=CPU, 0+=GPU (default: -1, only used when --word-engine=transformer)
--validate                   Validate inputs only without building (pre-flight check)
--min-frequency              Minimum word frequency threshold (default: from config)
--num-workers                Number of parallel workers (default: auto-detect based on CPU cores)
--batch-size                 Batch size for processing (default: 10000)
--worker-timeout             Worker timeout in seconds for parallel processing (default: 1800)
--no-dedup                   Disable line-level deduplication during ingestion
--no-desegment               Keep word segmentation markers (spaces/underscores between Myanmar chars)
--no-enrich                  Skip enrichment step (confusable pairs, compounds, collocations, register tags)
--verbose             -v     Enable verbose logging with detailed timing breakdowns

Examples

# Build sample database
myspellchecker build --sample

# Build from corpus file
myspellchecker build -i corpus.txt -o dictionary.db

# Build from multiple files
myspellchecker build -i "data/*.txt" "extra/*.json"

# Build from directory (auto-detects txt, json, jsonl)
myspellchecker build -i ./corpus/ -o dictionary.db

# Validate before building
myspellchecker build -i corpus.txt --validate

# With POS tagging
myspellchecker build -i corpus.txt --pos-tagger viterbi

# Incremental update
myspellchecker build -i new_data.txt -o dictionary.db --incremental

# Filter by frequency
myspellchecker build -i corpus.txt -o dictionary.db --min-frequency 5

# Build with curated lexicon
myspellchecker build -i corpus.txt --curated-input data/curated_lexicon.csv -o dictionary.db

# Combine corpus with curated lexicon and transformer POS tagger
myspellchecker build -i corpus.txt --curated-input data/curated_lexicon.csv \
  --pos-tagger transformer --pos-device 0 -o dictionary.db

Build Process

  1. Ingestion: Read and parse input files
  2. Segmentation: Break text into syllables and words
  3. Frequency Calculation: Count occurrences and N-grams
  4. POS Tagging: Tag words with part-of-speech (if enabled)
  5. Enrichment: Mine confusable pairs, compounds, collocations, register tags (unless --no-enrich)
  6. Packaging: Create optimized SQLite database

train-model

Train semantic models for context checking.

Usage

myspellchecker train-model [OPTIONS]

Options

Option                         Short  Description
--input                        -i     Input corpus file (required; raw text, one sentence per line)
--output                       -o     Output directory for the model (required)
--architecture                 -a     Model architecture: roberta, bert (default: roberta)
--epochs                              Number of training epochs (default: 5)
--batch-size                          Training batch size (default: 16)
--learning-rate                       Peak learning rate (default: 5e-5)
--warmup-ratio                        Ratio of steps for LR warmup (default: 0.1)
--weight-decay                        Weight decay for optimizer (default: 0.01)
--hidden-size                         Size of hidden layers (default: 256)
--layers                              Number of transformer layers (default: 4)
--heads                               Number of attention heads (default: 4)
--max-length                          Maximum sequence length (default: 128)
--vocab-size                          Tokenizer vocabulary size (default: 30000)
--min-frequency                       Minimum token frequency (default: 2)
--resume                              Resume training from checkpoint directory
--keep-checkpoints                    Keep intermediate PyTorch checkpoints after export
--no-metrics                          Disable saving training metrics to JSON
--streaming                           Use streaming mode for large corpora (constant memory usage)
--checkpoint-dir                      Persistent checkpoint directory for job resume (e.g. /opt/ml/checkpoints)
--max-steps                           Cap total training steps (overrides epochs x steps_per_epoch)
--fp16                                Enable mixed-precision (FP16) training for faster speed and lower memory
--gradient-accumulation-steps         Accumulate gradients over N steps (effective batch = batch-size x N, default: 1)
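The interaction between the last two batch options is worth spelling out: with gradient accumulation, the effective batch size is the per-step batch size multiplied by the accumulation steps, while per-step memory use stays at the smaller batch size. A quick arithmetic check:

```shell
# Effective batch size when combining --batch-size 16 with
# --gradient-accumulation-steps 4 (memory per step stays at 16).
batch_size=16
accum_steps=4
echo "effective batch: $(( batch_size * accum_steps ))"   # effective batch: 64
```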

Architectures

Architecture  Description
roberta       RoBERTa (default) - Dynamic masking, no NSP
bert          BERT - Static masking, with NSP capability

Examples

# Train with default settings (RoBERTa architecture)
myspellchecker train-model -i corpus.txt -o ./models/

# Train BERT model with more epochs
myspellchecker train-model -i corpus.txt -o ./models/ --architecture bert --epochs 10

# Train with custom hyperparameters
myspellchecker train-model -i corpus.txt -o ./models/ \
    --learning-rate 3e-5 --warmup-ratio 0.1 --weight-decay 0.01

# Train larger model
myspellchecker train-model -i corpus.txt -o ./models/ \
    --hidden-size 512 --layers 6 --heads 8

# Resume training from checkpoint
myspellchecker train-model -i corpus.txt -o ./models/ \
    --resume ./models/checkpoints/checkpoint-500

# Keep checkpoints and disable metrics
myspellchecker train-model -i corpus.txt -o ./models/ \
    --keep-checkpoints --no-metrics

segment

Segment text into words and optionally tag with POS.

Usage

myspellchecker segment [OPTIONS] [INPUT]

Options

Option     Short  Description
--output   -o     Output file path (default: stdout)
--format   -f     Output format: text, json, tsv (default: text)
--tag             Include POS tags (uses joint segmentation-tagging)
--db              Custom database path
--verbose  -v     Enable verbose logging

Examples

# Segment text (default text format)
myspellchecker segment document.txt

# Output as JSON
myspellchecker segment document.txt -f json

# Output as TSV
myspellchecker segment document.txt -f tsv

# With POS tags
myspellchecker segment document.txt --tag

# From stdin
echo "မြန်မာနိုင်ငံ" | myspellchecker segment

Output

Word mode with tags:
မြန်မာ/N နိုင်ငံ/N
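The slash-delimited word/TAG pairs are convenient for downstream shell processing. A small sketch (it assumes the format shown above, with pairs separated by spaces and no slashes inside words):

```shell
# Turn "word/TAG word/TAG ..." into a two-column word<TAB>tag listing:
# one pair per line, then split each pair on the slash.
echo 'မြန်မာ/N နိုင်ငံ/N' \
  | tr ' ' '\n' \
  | awk -F'/' '{ printf "%s\t%s\n", $1, $2 }'
```

This emits one tab-separated `word<TAB>tag` line per pair, ready for cut, sort, or import into a spreadsheet.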

config

Manage configuration files.

Usage

myspellchecker config [SUBCOMMAND]

Subcommands

Subcommand  Description
init        Create a new configuration file with defaults
show        Show configuration file search paths and current config

config init Options

Option   Description
--path   Path for configuration file (default: ~/.config/myspellchecker/myspellchecker.yaml)
--force  Overwrite existing configuration file

Examples

# Create config file (default location)
myspellchecker config init

# Create config file at custom path
myspellchecker config init --path ./myspellchecker.yaml

# Overwrite existing config file
myspellchecker config init --force

# Show current config and search paths
myspellchecker config show

Configuration File Locations

Configuration files are searched in this order:
  1. Path specified with --config flag
  2. Current directory: myspellchecker.yaml, myspellchecker.yml, or myspellchecker.json
  3. User config directory: ~/.config/myspellchecker/myspellchecker.{yaml,yml,json}
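The search stops at the first match, so a project-local file shadows the user-level one, and within a directory .yaml wins over .yml and .json. The following sketch mimics the lookup (in a scratch directory, so it is safe to run anywhere; it is an illustration of the order above, not the package's actual loader):

```shell
# Mimic the documented lookup order in an empty scratch directory.
dir=$(mktemp -d)
cd "$dir"
touch myspellchecker.json myspellchecker.yaml   # two candidates present

for f in myspellchecker.yaml myspellchecker.yml myspellchecker.json; do
  if [ -f "$f" ]; then
    echo "would load: $f"   # first hit wins
    break
  fi
done
```

With both candidates present, this prints `would load: myspellchecker.yaml`, because .yaml is checked first.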

infer-pos

Infer POS tags for untagged words in the database using a rule-based engine.

Usage

myspellchecker infer-pos [OPTIONS]

Options

Option            Short  Description
--db                     Database path to update with inferred POS tags (required)
--min-frequency          Minimum word frequency for inference (default: 0, infer all)
--min-confidence         Minimum confidence threshold, 0.0-1.0 (default: 0.0)
--include-tagged         Also infer for words that already have pos_tag (updates inferred_pos only)
--dry-run                Show statistics without modifying the database
--verbose         -v     Enable verbose output with detailed statistics

Inference Sources

Source              Description
numeral_detection   Myanmar numerals and numeral words
prefix_pattern      Words with prefix patterns (e.g., အ prefix -> Noun)
proper_noun_suffix  Proper noun suffixes (country, city names)
ambiguous_registry  Known ambiguous words (multi-POS)
morphological       Suffix-based morphological analysis

Examples

# Infer POS tags for all untagged words
myspellchecker infer-pos --db dictionary.db

# Infer only for high-frequency words
myspellchecker infer-pos --db dictionary.db --min-frequency 10

# Set minimum confidence threshold
myspellchecker infer-pos --db dictionary.db --min-confidence 0.7

# Preview changes without modifying database
myspellchecker infer-pos --db dictionary.db --dry-run

# Include already-tagged words for re-inference
myspellchecker infer-pos --db dictionary.db --include-tagged

completion

Generate shell completion scripts.

Usage

myspellchecker completion --shell [bash|zsh|fish]

Options

Option   Description
--shell  Shell type: bash, zsh, fish (default: bash)

Examples

# Generate bash completion
myspellchecker completion --shell bash > ~/.bash_completion.d/myspellchecker
source ~/.bash_completion.d/myspellchecker

# Generate zsh completion
myspellchecker completion --shell zsh > ~/.zsh/completions/_myspellchecker

# Generate fish completion
myspellchecker completion --shell fish > ~/.config/fish/completions/myspellchecker.fish

Global Options

Available for all commands:
Option  Description
--help  Show help message
Note: --verbose/-v is available on most subcommands (check, build, segment, infer-pos) but is defined per-subcommand, not globally.

Exit Codes

Code  Meaning
0     Success (no errors found, or validation passed)
1     General runtime error (configuration, data loading, etc.)
2     Invalid arguments, file not found, or permission error
130   Process interrupted (Ctrl+C)
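In CI, these codes let a wrapper script distinguish setup problems from a clean run. A minimal sketch (describe_exit is a hypothetical helper for this example, not part of the CLI; in practice you would run the checker and pass it `$?`):

```shell
# Map the documented exit codes to messages. Typical CI usage would be:
#   myspellchecker check document.txt -f json -o results.json
#   describe_exit $?
describe_exit() {
  case "$1" in
    0)   echo "success" ;;
    1)   echo "runtime error" ;;
    2)   echo "invalid arguments, missing file, or permission error" ;;
    130) echo "interrupted (Ctrl+C)" ;;
    *)   echo "unexpected exit code: $1" ;;
  esac
}

describe_exit 0     # success
describe_exit 130   # interrupted (Ctrl+C)
```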

Configuration File

Create ~/.config/myspellchecker/myspellchecker.yaml:
# Database path (required - no bundled database included)
database: /path/to/your/custom.db

# Use a preset (default, fast, accurate, minimal, strict)
preset: default

# Core settings
max_edit_distance: 2
max_suggestions: 5

# Feature toggles
use_phonetic: true
use_context_checker: true

# Provider configuration
provider_config:
  pool_max_size: 5
  pool_timeout: 5.0

Environment Variables

All environment variables use the MYSPELL_ prefix. They override config file values but are overridden by CLI flags.
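That precedence (CLI flag > environment variable > config file > built-in default) can be pictured as a tiny resolver. The function below is purely illustrative, not how the package implements it, and the fallback value of 5 is only an assumption for the sketch:

```shell
# Illustrative precedence resolver for one setting.
# $1 = value from a CLI flag (may be empty), $2 = value from the config file.
resolve_max_suggestions() {
  if [ -n "$1" ]; then
    echo "$1"                            # CLI flag wins
  elif [ -n "$MYSPELL_MAX_SUGGESTIONS" ]; then
    echo "$MYSPELL_MAX_SUGGESTIONS"      # then the environment variable
  elif [ -n "$2" ]; then
    echo "$2"                            # then the config file
  else
    echo 5                               # then a built-in default (assumed here)
  fi
}

MYSPELL_MAX_SUGGESTIONS=10
resolve_max_suggestions "" 3    # env beats config: prints 10
resolve_max_suggestions 7 3     # CLI beats both: prints 7
```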

Core Settings

Variable                            Description                                    Values
MYSPELL_DATABASE_PATH               Default database path                          File path
MYSPELL_MAX_EDIT_DISTANCE           Max edit distance                              1-3
MYSPELL_MAX_SUGGESTIONS             Max suggestions returned                       Integer >= 1
MYSPELL_USE_CONTEXT_CHECKER         Enable context validation                      true/false
MYSPELL_USE_PHONETIC                Enable phonetic matching                       true/false
MYSPELL_USE_NER                     Enable Named Entity Recognition                true/false
MYSPELL_USE_RULE_BASED_VALIDATION   Enable rule-based validation                   true/false
MYSPELL_WORD_ENGINE                 Word segmentation engine                       myword, crf
MYSPELL_FALLBACK_TO_EMPTY_PROVIDER  Fall back to empty provider if DB missing      true/false
MYSPELL_ALLOW_EXTENDED_MYANMAR      Allow Extended Myanmar characters (Shan, Mon)  true/false

POS Tagger Settings

Variable                       Description                    Values
MYSPELL_POS_TAGGER_TYPE        POS tagger type                rule_based, viterbi, transformer
MYSPELL_POS_TAGGER_BEAM_WIDTH  Beam width for Viterbi tagger  Integer >= 1
MYSPELL_POS_TAGGER_MODEL_NAME  Transformer model name/path    String

Provider Settings

Variable               Description                   Values
MYSPELL_POOL_MIN_SIZE  Connection pool minimum size  Integer >= 0
MYSPELL_POOL_MAX_SIZE  Connection pool maximum size  Integer >= 1

Semantic Checker Settings

Variable                         Description                         Values
MYSPELL_SEMANTIC_MODEL_PATH      Path to ONNX model file             File path
MYSPELL_SEMANTIC_TOKENIZER_PATH  Path to tokenizer directory         Directory path
MYSPELL_SEMANTIC_NUM_THREADS     Inference threads                   Integer
MYSPELL_SEMANTIC_PREDICT_TOP_K   Top-K predictions for mask filling  Integer
MYSPELL_SEMANTIC_CHECK_TOP_K     Top-K candidates to check           Integer

SymSpell Settings

Variable                                Description                         Values
MYSPELL_SYMSPELL_PREFIX_LENGTH          Prefix length for SymSpell          4-10
MYSPELL_SYMSPELL_BEAM_WIDTH             Beam width                          Integer >= 1
MYSPELL_SYMSPELL_USE_WEIGHTED_DISTANCE  Use Myanmar-weighted edit distance  true/false

N-gram Context Settings

Variable                           Description                    Values
MYSPELL_NGRAM_BIGRAM_THRESHOLD     Bigram probability threshold   0.0-1.0
MYSPELL_NGRAM_TRIGRAM_THRESHOLD    Trigram probability threshold  0.0-1.0
MYSPELL_NGRAM_RERANK_LEFT_WEIGHT   Left-context rerank weight     0.0-1.0
MYSPELL_NGRAM_RERANK_RIGHT_WEIGHT  Right-context rerank weight    0.0-1.0

Phonetic Settings

Variable                           Description                              Values
MYSPELL_PHONETIC_BYPASS_THRESHOLD  Phonetic similarity threshold            0.0-1.0
MYSPELL_PHONETIC_EXTRA_DISTANCE    Extra edit distance for phonetic bypass  0-3

Ranker Settings

Variable                                                     Description                           Values
MYSPELL_RANKER_UNIFIED_BASE_TYPE                             Base ranker type                      default, frequency_first, phonetic_first, edit_distance_only
MYSPELL_RANKER_ENABLE_TARGETED_RERANK_HINTS                  Enable targeted rerank hints          true/false
MYSPELL_RANKER_ENABLE_TARGETED_CANDIDATE_INJECTIONS          Enable targeted candidate injections  true/false
MYSPELL_RANKER_ENABLE_TARGETED_GRAMMAR_COMPLETION_TEMPLATES  Enable grammar completion templates   true/false
