This page covers everything you need to build a dictionary — from a simple CLI command to advanced Python API usage with POS tagging, curated lexicons, and incremental updates.

CLI Reference

Basic Build

# Build from text corpus
myspellchecker build --input corpus.txt --output dict.db

# Build sample database for testing
myspellchecker build --sample

# Build with POS tagging
myspellchecker build --input corpus.txt --output dict.db --pos-tagger transformer

Build Options

| Option | Default | Description |
|---|---|---|
| --input FILE | Required | Input corpus file (TXT, CSV, TSV, JSON, JSONL, Parquet) |
| --output FILE | mySpellChecker-default.db | Output SQLite database path |
| --sample | | Build a small sample database (no input needed) |
| --incremental | | Update existing database instead of rebuilding |
| --min-frequency N | 50 | Minimum word frequency to include |
| --pos-tagger TYPE | rule_based | POS tagger: rule_based, viterbi, or transformer |
| --pos-model NAME | | HuggingFace model for transformer tagger |
| --pos-device ID | | GPU device ID for transformer tagger |
| --num-workers N | CPU count | Parallel worker processes |
| --batch-size N | 10000 | Records per processing batch |
| --curated-input FILE | | CSV file with trusted vocabulary words |
| --word-engine TYPE | myword | Word segmentation engine: myword or crf |
| --validate | | Pre-flight validation of input (no build) |
Run myspellchecker build --help for additional flags including --work-dir, --keep-intermediate, --col, --json-key, --worker-timeout, and --verbose.

Python API

Basic Usage

from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)

With Configuration

from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    batch_size=10000,        # Records per batch
    num_shards=20,           # Shards for ingestion
    num_workers=4,           # Parallel workers (None = auto-detect)
    min_frequency=50,        # Minimum word frequency to include
    word_engine="myword",    # Word segmentation engine ("myword", "crf", "transformer")
    keep_intermediate=False, # Keep intermediate Arrow files
    text_col="text",         # Column name for CSV/TSV
    json_key="text",         # Key name for JSON
)

pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)

Building from Multiple Files

pipeline.build_database(
    input_files=[
        "general_corpus.txt",
        "domain_specific.txt",
        "organization_names.txt",
    ],
    database_path="combined.db",
)

POS Tagging

Add POS tags to dictionary entries during build for grammar checking support:
# CLI: rule-based (fast, no dependencies)
myspellchecker build --input corpus.txt --output dict.db --pos-tagger rule_based

# CLI: transformer (highest accuracy, requires GPU)
myspellchecker build --input corpus.txt --output dict.db \
  --pos-tagger transformer \
  --pos-model chuuhtetnaing/myanmar-pos-model \
  --pos-device 0

# Python API
from myspellchecker.data_pipeline import Pipeline, PipelineConfig
from myspellchecker.core.config import POSTaggerConfig

config = PipelineConfig(
    pos_tagger=POSTaggerConfig(
        tagger_type="transformer",
        model_name="chuuhtetnaing/myanmar-pos-model",
        device=0,
    ),
)
pipeline = Pipeline(config=config)
pipeline.build_database(["corpus.txt"], "dict.db")

POS Inference on Existing Database

Apply rule-based POS inference to an existing database without rebuilding:
from myspellchecker.data_pipeline import DatabasePackager

packager = DatabasePackager.from_existing("dictionary.db")
stats = packager.apply_inferred_pos(
    min_frequency=0,
    min_confidence=0.0,
)
packager.close()
print(f"Inferred POS for {stats['inferred']} words")

Curated Lexicons

Curated words are trusted vocabulary inserted directly into the database before corpus processing. They are always recognized as valid regardless of corpus frequency.

Create a Lexicon CSV

word
ဆေးရုံ
ဆရာဝန်
လူနာ
ကုမ္ပဏီ

Build with Curated Words

myspellchecker build --input corpus.txt --output dict.db \
  --curated-input curated_lexicon.csv

How Curated Words are Processed

| Scenario | frequency | is_curated |
|---|---|---|
| Curated only (not in corpus) | 0 | 1 |
| Curated + corpus overlap | corpus_freq | 1 |
| Corpus only | corpus_freq | 0 |
Curated words are inserted first (is_curated=1, frequency=0), then corpus processing updates frequency while preserving the is_curated flag via MAX().
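
For reference, here is a minimal sqlite3 sketch of that merge rule. It is illustrative only, not the pipeline's actual SQL, and it assumes a UNIQUE constraint on words.word so that ON CONFLICT applies:

import sqlite3

conn = sqlite3.connect("dict.db")

# Curated word is inserted first: frequency 0, is_curated = 1.
conn.execute(
    "INSERT OR IGNORE INTO words (word, syllable_count, frequency, is_curated) "
    "VALUES (?, ?, 0, 1)",
    ("ဆေးရုံ", 2),
)

# Corpus processing later upserts the observed count; MAX() preserves is_curated.
conn.execute(
    """
    INSERT INTO words (word, syllable_count, frequency, is_curated)
    VALUES (?, ?, ?, 0)
    ON CONFLICT(word) DO UPDATE SET
        frequency  = excluded.frequency,
        is_curated = MAX(is_curated, excluded.is_curated)
    """,
    ("ဆေးရုံ", 2, 137),
)
conn.commit()
conn.close()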

Incremental Updates

Add new data to an existing dictionary without rebuilding from scratch:
# CLI
myspellchecker build --input new_data.txt --output existing.db --incremental

# Python API
pipeline.build_database(
    input_files=["new_data.txt"],
    database_path="existing.db",
    incremental=True,
)
The pipeline tracks processed files in a processed_files table to avoid reprocessing.
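
You can see which files an existing database has already ingested with a direct query against that table (columns as in the schema below):

import sqlite3

conn = sqlite3.connect("existing.db")
for path, mtime, size in conn.execute(
    "SELECT path, mtime, size FROM processed_files ORDER BY path"
):
    print(f"{path}  mtime={mtime}  size={size}")
conn.close()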

Output Database Schema

-- Core tables
syllables(id, syllable, frequency)
words(id, word, syllable_count, frequency, pos_tag, is_curated,
      inferred_pos, inferred_confidence, inferred_source)
bigrams(id, word1_id, word2_id, probability, count)
trigrams(id, word1_id, word2_id, word3_id, probability, count)

-- POS probability tables (for Viterbi tagger)
pos_unigrams(pos, probability)
pos_bigrams(pos1, pos2, probability)
pos_trigrams(pos1, pos2, pos3, probability)

-- File tracking (for incremental builds)
processed_files(path, mtime, size)

Query Examples

import sqlite3

conn = sqlite3.connect("dict.db")
cursor = conn.cursor()

# Look up word frequency
cursor.execute("SELECT frequency FROM words WHERE word = ?", ("မြန်မာ",))

# Get bigram probability
cursor.execute("""
    SELECT b.probability
    FROM bigrams b
    JOIN words w1 ON b.word1_id = w1.id
    JOIN words w2 ON b.word2_id = w2.id
    WHERE w1.word = ? AND w2.word = ?
""", ("ထမင်း", "စား"))

Verification

from myspellchecker.data_pipeline import DatabasePackager

input_dir = "work"         # example value; directory of intermediate pipeline output
database_path = "dict.db"  # example value; database to verify

packager = DatabasePackager(input_dir, database_path)
packager.connect()
packager.verify_database()
packager.print_stats()
packager.close()

Performance

Build Time

| Corpus Size | Build Time | Peak Memory |
|---|---|---|
| 1M words | ~30s | ~200MB |
| 10M words | ~5min | ~500MB |
| 100M words | ~45min | ~2GB |
For large corpora, see Optimization for DuckDB acceleration (3-50x faster frequency counting) and Cython parallelization.
# Tune for large corpora
myspellchecker build --input huge_corpus.txt \
  --num-workers 8 \
  --batch-size 500000
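
If you drive the build from Python instead, the same knobs are available through the batch_size and num_workers fields of PipelineConfig shown in the configuration example above:

from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    batch_size=500_000,  # records per processing batch
    num_workers=8,       # parallel worker processes
)

Pipeline(config=config).build_database(
    input_files=["huge_corpus.txt"],
    database_path="dict.db",
)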

See Also