The myspellchecker CLI is installed with the package and provides commands for checking text, building dictionaries, training models, segmenting text, and managing configuration.
Installation
The CLI is installed automatically with the package:
Commands Overview
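A typical installation, assuming the package is published on PyPI under the name myspellchecker (the exact distribution name is an assumption):

```shell
# Install the package; the CLI entry point is registered automatically
pip install myspellchecker

# Verify the CLI is on PATH
myspellchecker --help
```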
| Command | Description |
|---|---|
| check | Check text for spelling errors (default when no command given) |
| build | Build dictionary database from corpus |
| train-model | Train a custom semantic model |
| train-detector | Train an error detection model (token classification) |
| segment | Segment text into words |
| config | Manage configuration |
| infer-pos | Infer POS tags for database |
| completion | Generate shell completion |
check
Check text for spelling errors. This is the default command: if no subcommand is recognized, check is assumed.
Usage
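A plausible synopsis, inferred from the arguments and options below (the bracketed grammar is an assumption, not verified CLI output):

```shell
myspellchecker check [OPTIONS] [INPUT]

# check is the default command, so this is equivalent:
myspellchecker [OPTIONS] [INPUT]
```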
Arguments
| Argument | Description |
|---|---|
| INPUT | Input file path (or stdin if omitted) |
Options
| Option | Short | Description |
|---|---|---|
| --output | -o | Output file path (default: stdout) |
| --format | -f | Output format: json, text, csv, rich (default: rich for TTY, json for pipes) |
| --color | | Force color output even when not a TTY |
| --no-color | | Disable color output |
| --level | | Validation level: syllable, word (default: syllable) |
| --db | | Custom database path |
| --no-phonetic | | Disable phonetic matching |
| --no-context | | Disable context checking |
| --preset | -p | Configuration preset: default, fast, accurate, minimal, strict |
| --verbose | -v | Enable verbose logging |
| --config | -c | Path to configuration file (YAML or JSON format) |
--color and --no-color are mutually exclusive.
Examples
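Illustrative invocations based on the options above; file names are placeholders and the exact syntax is an assumption:

```shell
# Check a file with rich terminal output
myspellchecker check input.txt

# Read from stdin, emit JSON for downstream tooling
cat input.txt | myspellchecker check --format json

# Word-level validation with a custom database, report written to a file
myspellchecker check input.txt --level word --db ./custom.db -o report.json -f json

# Fast preset with context checking disabled
myspellchecker check input.txt --preset fast --no-context
```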
Output Formats
Rich (default in terminal): Colored, formatted output with panels and tables using the Rich library. Auto-selected when running in an interactive terminal.
Text (grep-like): Plain, line-oriented output suitable for piping.
build
Build a dictionary database from corpus files.
Usage
Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input corpus file(s) (UTF-8 encoded text, CSV, TSV, JSON) |
| --output | -o | Output database path (default: mySpellChecker-default.db) |
| --work-dir | | Directory for intermediate files (default: temp_build) |
| --keep-intermediate | | Keep intermediate files after build |
| --sample | | Generate sample corpus for testing |
| --col | | Column name/index for CSV/TSV files (default: text) |
| --json-key | | Key name for JSON objects (default: text) |
| --pos-tagger | | POS tagger type: rule_based, viterbi, transformer |
| --pos-model | | HuggingFace model ID or local path for transformer tagger |
| --pos-device | | Device for transformer POS tagger: -1=CPU, 0+=GPU (default: -1) |
| --incremental | | Perform incremental update on existing database |
| --curated-input | | Path to curated lexicon CSV file (words marked as is_curated=1) |
| --word-engine | | Word segmentation engine: myword, crf, transformer (default: myword) |
| --seg-model | | HuggingFace model ID or local path for transformer word segmentation (only used when --word-engine=transformer) |
| --seg-device | | Device for transformer word segmenter: -1=CPU, 0+=GPU (default: -1; only used when --word-engine=transformer) |
| --validate | | Validate inputs only without building (pre-flight check) |
| --min-frequency | | Minimum word frequency threshold (default: from config) |
| --num-workers | | Number of parallel workers (default: auto-detect based on CPU cores) |
| --batch-size | | Batch size for processing (default: 10000) |
| --worker-timeout | | Worker timeout in seconds for parallel processing (default: 300) |
| --no-dedup | | Disable line-level deduplication during ingestion |
| --no-desegment | | Keep word segmentation markers in output |
| --verbose | -v | Enable verbose logging with detailed timing breakdowns |
Examples
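Illustrative invocations based on the options above; paths are placeholders and the exact syntax is an assumption:

```shell
# Build a database from a plain-text corpus
myspellchecker build -i corpus.txt -o mySpellChecker-default.db

# Build from a CSV column using a transformer POS tagger on GPU 0
myspellchecker build -i corpus.csv --col text --pos-tagger transformer --pos-device 0

# Pre-flight check: validate inputs without building
myspellchecker build -i corpus.txt --validate

# Incremental update of an existing database
myspellchecker build -i new_corpus.txt -o mySpellChecker-default.db --incremental
```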
Build Process
- Ingestion: Read and parse input files
- Segmentation: Break text into syllables and words
- Frequency Calculation: Count occurrences and N-grams
- POS Tagging: Tag words with part-of-speech (if enabled)
- Packaging: Create optimized SQLite database
train-model
Train semantic models for context checking.
Usage
Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input corpus file (required; raw text, one sentence per line) |
| --output | -o | Output directory for the model (required) |
| --architecture | -a | Model architecture: roberta, bert (default: roberta) |
| --epochs | | Number of training epochs (default: 5) |
| --batch-size | | Training batch size (default: 16) |
| --learning-rate | | Peak learning rate (default: 5e-5) |
| --warmup-ratio | | Ratio of steps for LR warmup (default: 0.1) |
| --weight-decay | | Weight decay for optimizer (default: 0.01) |
| --hidden-size | | Size of hidden layers (default: 256) |
| --layers | | Number of transformer layers (default: 4) |
| --heads | | Number of attention heads (default: 4) |
| --max-length | | Maximum sequence length (default: 128) |
| --vocab-size | | Tokenizer vocabulary size (default: 15000) |
| --min-frequency | | Minimum token frequency (default: 2) |
| --resume | | Resume training from checkpoint directory |
| --keep-checkpoints | | Keep intermediate PyTorch checkpoints after export |
| --no-metrics | | Disable saving training metrics to JSON |
Architectures
| Architecture | Description |
|---|---|
| roberta | RoBERTa (default) - Dynamic masking, no NSP |
| bert | BERT - Static masking, with NSP capability |
Examples
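Illustrative invocations based on the options above; paths are placeholders and the exact syntax is an assumption:

```shell
# Train a small RoBERTa-style model with defaults
myspellchecker train-model -i corpus.txt -o ./models/semantic

# BERT architecture with longer training and larger batches
myspellchecker train-model -i corpus.txt -o ./models/semantic \
  -a bert --epochs 10 --batch-size 32

# Resume an interrupted run from its checkpoint directory
myspellchecker train-model -i corpus.txt -o ./models/semantic \
  --resume ./models/semantic/checkpoint
```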
train-detector
Train an error detection model using token classification (fine-tunes XLM-RoBERTa).
Usage
Options
| Option | Short | Description |
|---|---|---|
| --input | -i | Input corpus file (required; clean text, one sentence per line) |
| --output | -o | Output directory for the model (required) |
| --base-model | | Base model for fine-tuning (default: xlm-roberta-base) |
| --epochs | | Number of training epochs (default: 3) |
| --batch-size | | Training batch size (default: 16) |
| --learning-rate | | Peak learning rate (default: 2e-5) |
| --corruption-ratio | | Fraction of words to corrupt per sentence (default: 0.15) |
| --max-length | | Maximum sequence length (default: 256) |
| --keep-checkpoints | | Keep intermediate PyTorch checkpoints after ONNX export |
| --no-metrics | | Disable saving training metrics to JSON |
| --seed | | Random seed for reproducibility |
| --skip-preprocessing | | Skip corpus preprocessing (Zawgyi conversion, normalization) |
Examples
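Illustrative invocations based on the options above; paths and values are placeholders and the exact syntax is an assumption:

```shell
# Fine-tune the default base model on a clean corpus
myspellchecker train-detector -i clean_corpus.txt -o ./models/detector

# Heavier corruption and a fixed seed for reproducibility
myspellchecker train-detector -i clean_corpus.txt -o ./models/detector \
  --corruption-ratio 0.25 --seed 42

# Corpus already normalized: skip preprocessing
myspellchecker train-detector -i clean_corpus.txt -o ./models/detector \
  --skip-preprocessing
```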
Training Process
- Load Corpus — Read input file (one sentence per line)
- Preprocess — Zawgyi conversion, Unicode normalization, quality filtering
- Generate Errors — Create synthetic errors from YAML rules
- Build Dataset — Tokenize with subword label alignment
- Train Model — Fine-tune XLM-RoBERTa for token classification
- Export to ONNX — Quantize and export for inference
segment
Segment text into words and optionally tag with POS.
Usage
Options
| Option | Short | Description |
|---|---|---|
| --output | -o | Output file path (default: stdout) |
| --format | -f | Output format: text, json, tsv (default: text) |
| --tag | | Include POS tags (uses joint segmentation-tagging) |
| --db | | Custom database path |
| --verbose | -v | Enable verbose logging |
Examples
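Illustrative invocations based on the options above, assuming segment reads from stdin when no input file is given (an assumption; the exact input handling is not documented here):

```shell
# Segment stdin to stdout
cat input.txt | myspellchecker segment

# Segment with POS tags, TSV output written to a file
cat input.txt | myspellchecker segment --tag -f tsv -o segmented.tsv
```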
Output
Word mode with tags:
config
Manage configuration files.
Usage
Subcommands
| Subcommand | Description |
|---|---|
| init | Create a new configuration file with defaults |
| show | Show configuration file search paths and current config |
config init Options
| Option | Description |
|---|---|
| --path | Path for configuration file (default: ~/.config/myspellchecker/myspellchecker.yaml) |
| --force | Overwrite existing configuration file |
Examples
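Illustrative invocations based on the subcommands and options above; the exact syntax is an assumption:

```shell
# Create a default configuration file at the standard location
myspellchecker config init

# Create (or overwrite) a config at a custom path
myspellchecker config init --path ./myspellchecker.yaml --force

# Inspect search paths and the effective configuration
myspellchecker config show
```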
Configuration File Locations
Configuration files are searched in this order:
- Path specified with the --config flag
- Current directory: myspellchecker.yaml, myspellchecker.yml, or myspellchecker.json
- User config directory: ~/.config/myspellchecker/myspellchecker.{yaml,yml,json}
infer-pos
Infer POS tags for untagged words in the database using a rule-based engine.
Usage
Options
| Option | Short | Description |
|---|---|---|
| --db | | Database path to update with inferred POS tags (required) |
| --min-frequency | | Minimum word frequency for inference (default: 0, infer all) |
| --min-confidence | | Minimum confidence threshold, 0.0-1.0 (default: 0.0) |
| --include-tagged | | Also infer for words that already have pos_tag (updates inferred_pos only) |
| --dry-run | | Show statistics without modifying the database |
| --verbose | -v | Enable verbose output with detailed statistics |
Inference Sources
| Source | Description |
|---|---|
| numeral_detection | Myanmar numerals and numeral words |
| prefix_pattern | Words with prefix patterns (e.g., အ prefix -> Noun) |
| proper_noun_suffix | Proper noun suffixes (country, city names) |
| ambiguous_registry | Known ambiguous words (multi-POS) |
| morphological | Suffix-based morphological analysis |
Examples
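Illustrative invocations based on the options above; the database path is a placeholder and the exact syntax is an assumption:

```shell
# Preview what would be inferred, without touching the database
myspellchecker infer-pos --db mySpellChecker-default.db --dry-run -v

# Infer tags only for frequent words, with a confidence floor
myspellchecker infer-pos --db mySpellChecker-default.db \
  --min-frequency 5 --min-confidence 0.8
```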
completion
Generate shell completion scripts.
Usage
Options
| Option | Description |
|---|---|
| --shell | Shell type: bash, zsh, fish (default: bash) |
Examples
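Illustrative invocations based on the option above; the installation targets are conventional locations, and the exact syntax is an assumption:

```shell
# Bash (default)
myspellchecker completion >> ~/.bashrc

# Zsh
myspellchecker completion --shell zsh >> ~/.zshrc

# Fish
myspellchecker completion --shell fish > ~/.config/fish/completions/myspellchecker.fish
```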
Global Options
Available for all commands:
| Option | Description |
|---|---|
| --help | Show help message |
--verbose/-v is available on most subcommands (check, build, segment, infer-pos) but is defined per-subcommand, not globally.
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success (no errors found, or validation passed) |
| 1 | General runtime error (configuration, data loading, etc.) |
| 2 | Invalid arguments, file not found, or permission error |
| 130 | Process interrupted (Ctrl+C) |
Configuration File
Create ~/.config/myspellchecker/myspellchecker.yaml:
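A minimal sketch of what the file might contain; the key names below are assumptions inferred from the environment variables and CLI options, not a verified schema:

```yaml
# ~/.config/myspellchecker/myspellchecker.yaml (hypothetical keys)
database_path: ~/.local/share/myspellchecker/mySpellChecker-default.db
max_edit_distance: 2        # valid range 1-3
use_context_checker: true
```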
Environment Variables
| Variable | Description |
|---|---|
| MYSPELL_DATABASE_PATH | Default database path |
| MYSPELL_MAX_EDIT_DISTANCE | Max edit distance (1-3) |
| MYSPELL_USE_CONTEXT_CHECKER | Enable context validation (true/false) |
Next Steps
- Configuration - Full configuration options
- Data Pipeline - Building dictionaries
- API Reference - Python API