This page covers everything you need to build a dictionary — from a simple CLI command to advanced Python API usage with POS tagging, curated lexicons, and incremental updates.

CLI Reference

Basic Build

# Build from text corpus
myspellchecker build --input corpus.txt --output dict.db

# Build sample database for testing
myspellchecker build --sample

# Build with POS tagging
myspellchecker build --input corpus.txt --output dict.db --pos-tagger transformer

Build Options

| Option | Default | Description |
|---|---|---|
| --input FILE | Required | Input corpus file (TXT, CSV, TSV, JSON, JSONL, Parquet) |
| --output FILE | mySpellChecker-default.db | Output SQLite database path |
| --sample | | Build a small sample database (no input needed) |
| --incremental | | Update existing database instead of rebuilding |
| --min-frequency N | 50 | Minimum word frequency to include |
| --pos-tagger TYPE | rule_based | POS tagger: rule_based, viterbi, or transformer |
| --pos-model NAME | | HuggingFace model for transformer tagger |
| --pos-device ID | | GPU device ID for transformer tagger |
| --num-workers N | CPU count | Parallel worker processes |
| --batch-size N | 10000 | Records per processing batch |
| --curated-input FILE | | CSV file with trusted vocabulary words |
| --word-engine TYPE | myword | Word segmentation engine: myword or crf |
| --validate | | Pre-flight validation of input (no build) |
Run myspellchecker build --help for additional flags including --work-dir, --keep-intermediate, --col, --json-key, --worker-timeout, and --verbose.

Python API

Basic Usage

from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)

With Configuration

from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    batch_size=10000,        # Records per batch
    num_shards=20,           # Shards for ingestion
    num_workers=4,           # Parallel workers (None = auto-detect)
    min_frequency=50,        # Minimum word frequency to include
    word_engine="myword",    # Word segmentation engine ("myword", "crf", "transformer")
    keep_intermediate=False, # Keep intermediate Arrow files
    text_col="text",         # Column name for CSV/TSV
    json_key="text",         # Key name for JSON
)

pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)

Building from Multiple Files

pipeline.build_database(
    input_files=[
        "general_corpus.txt",
        "domain_specific.txt",
        "organization_names.txt",
    ],
    database_path="combined.db",
)

POS Tagging

Add POS tags to dictionary entries during build for grammar checking support:
# CLI: rule-based (fast, no dependencies)
myspellchecker build --input corpus.txt --output dict.db --pos-tagger rule_based

# CLI: transformer (highest accuracy, requires GPU)
myspellchecker build --input corpus.txt --output dict.db \
  --pos-tagger transformer \
  --pos-model chuuhtetnaing/myanmar-pos-model \
  --pos-device 0

# Python API
from myspellchecker.data_pipeline import Pipeline, PipelineConfig
from myspellchecker.core.config import POSTaggerConfig

config = PipelineConfig(
    pos_tagger=POSTaggerConfig(
        tagger_type="transformer",
        model_name="chuuhtetnaing/myanmar-pos-model",
        device=0,
    ),
)
pipeline = Pipeline(config=config)
pipeline.build_database(["corpus.txt"], "dict.db")

POS Inference on Existing Database

Apply rule-based POS inference to an existing database without rebuilding:
from myspellchecker.data_pipeline import DatabasePackager

packager = DatabasePackager.from_existing("dictionary.db")
stats = packager.apply_inferred_pos(
    min_frequency=0,
    min_confidence=0.0,
)
packager.close()
print(f"Inferred POS for {stats['inferred']} words")

Curated Lexicons

Curated words are trusted vocabulary inserted directly into the database before corpus processing. They are always recognized as valid regardless of corpus frequency.

Create a Lexicon CSV

word
ဆေးရုံ
ဆရာဝန်
လူနာ
ကုမ္ပဏီ

Build with Curated Words

myspellchecker build --input corpus.txt --output dict.db \
  --curated-input curated_lexicon.csv

How Curated Words are Processed

| Scenario | frequency | is_curated |
|---|---|---|
| Curated only (not in corpus) | 0 | 1 |
| Curated + corpus overlap | corpus_freq | 1 |
| Corpus only | corpus_freq | 0 |
Curated words are inserted first (is_curated=1, frequency=0), then corpus processing updates frequency while preserving the is_curated flag via MAX().
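
For reference, here is a minimal sqlite3 sketch of that merge rule. It is illustrative only, not the pipeline's actual SQL, and it assumes a UNIQUE constraint on words.word so that ON CONFLICT applies:

import sqlite3

conn = sqlite3.connect("dict.db")

# Curated word is inserted first: frequency 0, is_curated = 1.
conn.execute(
    "INSERT OR IGNORE INTO words (word, syllable_count, frequency, is_curated) "
    "VALUES (?, ?, 0, 1)",
    ("ဆေးရုံ", 2),
)

# Corpus processing later upserts the observed count; MAX() preserves is_curated.
conn.execute(
    """
    INSERT INTO words (word, syllable_count, frequency, is_curated)
    VALUES (?, ?, ?, 0)
    ON CONFLICT(word) DO UPDATE SET
        frequency  = excluded.frequency,
        is_curated = MAX(is_curated, excluded.is_curated)
    """,
    ("ဆေးရုံ", 2, 137),
)
conn.commit()
conn.close()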

Incremental Updates

Add new data to an existing dictionary without rebuilding from scratch:
# CLI
myspellchecker build --input new_data.txt --output existing.db --incremental

# Python API
pipeline.build_database(
    input_files=["new_data.txt"],
    database_path="existing.db",
    incremental=True,
)
The pipeline tracks processed files in a processed_files table to avoid reprocessing.
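
You can see which files an existing database has already ingested with a direct query against that table (columns as in the schema below):

import sqlite3

conn = sqlite3.connect("existing.db")
for path, mtime, size in conn.execute(
    "SELECT path, mtime, size FROM processed_files ORDER BY path"
):
    print(f"{path}  mtime={mtime}  size={size}")
conn.close()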

Output Database Schema

-- Core tables
syllables(id, syllable, frequency)
words(id, word, syllable_count, frequency, pos_tag, is_curated,
      inferred_pos, inferred_confidence, inferred_source)
bigrams(id, word1_id, word2_id, probability, count)
trigrams(id, word1_id, word2_id, word3_id, probability, count)

-- POS probability tables (for Viterbi tagger)
pos_unigrams(pos, probability)
pos_bigrams(pos1, pos2, probability)
pos_trigrams(pos1, pos2, pos3, probability)

-- File tracking (for incremental builds)
processed_files(path, mtime, size)

Query Examples

import sqlite3

conn = sqlite3.connect("dict.db")
cursor = conn.cursor()

# Look up word frequency
cursor.execute("SELECT frequency FROM words WHERE word = ?", ("မြန်မာ",))

# Get bigram probability
cursor.execute("""
    SELECT b.probability
    FROM bigrams b
    JOIN words w1 ON b.word1_id = w1.id
    JOIN words w2 ON b.word2_id = w2.id
    WHERE w1.word = ? AND w2.word = ?
""", ("ထမင်း", "စား"))

Verification

from myspellchecker.data_pipeline import DatabasePackager

input_dir = "work"         # example value; directory of intermediate pipeline output
database_path = "dict.db"  # example value; database to verify

packager = DatabasePackager(input_dir, database_path)
packager.connect()
packager.verify_database()
packager.print_stats()
packager.close()

Performance

Build Time

| Corpus Size | Build Time | Peak Memory |
|---|---|---|
| 1M words | ~30s | ~200MB |
| 10M words | ~5min | ~500MB |
| 100M words | ~45min | ~2GB |
For large corpora, see Optimization for DuckDB acceleration (3-50x faster frequency counting) and Cython parallelization.
# Tune for large corpora
myspellchecker build --input huge_corpus.txt \
  --num-workers 8 \
  --batch-size 500000
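
If you drive the build from Python instead, the same knobs are available through the batch_size and num_workers fields of PipelineConfig shown in the configuration example above:

from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    batch_size=500_000,  # records per processing batch
    num_workers=8,       # parallel worker processes
)

Pipeline(config=config).build_database(
    input_files=["huge_corpus.txt"],
    database_path="dict.db",
)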

See Also