Before you can spell-check Myanmar text, you need a dictionary database built from your corpus. This page walks through both the quick CLI path and the full Python API for fine-grained control.

CLI Reference

Basic Build

# Build from text corpus
myspellchecker build --input corpus.txt --output dict.db

# Build sample database for testing
myspellchecker build --sample

# Build with POS tagging
myspellchecker build --input corpus.txt --output dict.db --pos-tagger transformer

Build Options

| Option | Default | Description |
|---|---|---|
| --input FILE | Required | Input corpus file (TXT, CSV, TSV, JSON, JSONL, Parquet) |
| --output FILE | mySpellChecker-default.db | Output SQLite database path |
| --sample | — | Build a small sample database (no input needed) |
| --incremental | — | Update existing database instead of rebuilding |
| --min-frequency N | 50 | Minimum word frequency to include |
| --pos-tagger TYPE | None | POS tagger: rule_based, viterbi, or transformer. No POS tagging when omitted. |
| --pos-model NAME | — | HuggingFace model for transformer tagger |
| --pos-device ID | -1 | Device for transformer POS tagger (-1=CPU, 0+=GPU) |
| --num-workers N | CPU count | Parallel worker processes |
| --batch-size N | 10000 | Records per processing batch |
| --curated-input FILE | — | CSV file with trusted vocabulary words |
| --word-engine TYPE | myword | Word segmentation engine: myword, crf, or transformer |
| --validate | — | Pre-flight validation of input (no build) |
| --no-enrich | — | Skip enrichment step (confusable pairs, compounds, collocations, register) |
| --seg-model NAME | — | HuggingFace model name/path for transformer word segmentation (only with --word-engine=transformer) |
| --seg-device ID | -1 | Device for transformer segmentation inference (-1=CPU, 0+=GPU) |
| --curated-lexicon-hf | — | Download and use the official curated lexicon from HuggingFace (thettwe/myspellchecker-resources) |
| --no-dedup | — | Disable line deduplication during ingestion |
| --no-desegment | — | Keep segmentation markers in text |
| --verbose / -v | — | Enable verbose logging with detailed timing breakdowns |
Run myspellchecker build --help for additional flags including --work-dir, --keep-intermediate, --col, --json-key, and --worker-timeout.

Python API

Basic Usage

from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)

With Configuration

from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    batch_size=10000,        # Records per batch
    num_shards=20,           # Shards for ingestion
    num_workers=4,           # Parallel workers (None = auto-detect)
    min_frequency=50,        # Minimum word frequency to include
    word_engine="myword",    # Word segmentation engine: "myword", "crf", or "transformer"
    keep_intermediate=False, # Keep intermediate Arrow files
    text_col="text",         # Column name for CSV/TSV
    json_key="text",         # Key name for JSON
)

pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)

Building from Multiple Files

pipeline.build_database(
    input_files=[
        "general_corpus.txt",
        "domain_specific.txt",
        "organization_names.txt",
    ],
    database_path="combined.db",
)

POS Tagging

Add POS tags to dictionary entries during build for grammar checking support:
# CLI: rule-based (fast, no dependencies)
myspellchecker build --input corpus.txt --output dict.db --pos-tagger rule_based

# CLI: transformer (highest accuracy; GPU recommended)
myspellchecker build --input corpus.txt --output dict.db \
  --pos-tagger transformer \
  --pos-model chuuhtetnaing/myanmar-pos-model \
  --pos-device 0
# Python API
from myspellchecker.core.config import POSTaggerConfig

config = PipelineConfig(
    pos_tagger=POSTaggerConfig(
        tagger_type="transformer",
        model_name="chuuhtetnaing/myanmar-pos-model",
        device=0,
    ),
)
pipeline = Pipeline(config=config)
pipeline.build_database(["corpus.txt"], "dict.db")

POS Inference on Existing Database

Apply rule-based POS inference to an existing database without rebuilding:
from myspellchecker.data_pipeline import DatabasePackager

packager = DatabasePackager.from_existing("dictionary.db")
stats = packager.apply_inferred_pos(
    min_frequency=0,
    min_confidence=0.0,
)
packager.close()
print(f"Inferred POS for {stats['inferred']} words")

Curated Lexicons

Curated words are trusted vocabulary inserted directly into the database before corpus processing. They are always recognized as valid regardless of corpus frequency.

Create a Lexicon CSV

word
ဆေးရုံ
ဆရာဝန်
လူနာ
ကုမ္ပဏီ
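
For larger lexicons, the CSV can be generated programmatically. A minimal sketch using only the standard library, matching the single "word" header column shown above:

```python
# Sketch: write curated_lexicon.csv with the single "word" header column.
import csv

curated_words = ["ဆေးရုံ", "ဆရာဝန်", "လူနာ", "ကုမ္ပဏီ"]

with open("curated_lexicon.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["word"])
    for word in curated_words:
        writer.writerow([word])
```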

Build with Curated Words

myspellchecker build --input corpus.txt --output dict.db \
  --curated-input curated_lexicon.csv

How Curated Words are Processed

| Scenario | frequency | is_curated |
|---|---|---|
| Curated only (not in corpus) | 0 | 1 |
| Curated + corpus overlap | corpus_freq | 1 |
| Corpus only | corpus_freq | 0 |
Curated words are inserted first (is_curated=1, frequency=0), then corpus processing updates frequency while preserving the is_curated flag via MAX().
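
The insert-then-upsert order can be illustrated with a minimal stand-in for the words table. The exact SQL the pipeline runs is internal; this sketch only demonstrates the MAX() preservation described above:

```python
# Illustrative sketch: curated words first, then corpus counts upserted
# while preserving is_curated via MAX(). Not the pipeline's actual SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, frequency INTEGER, is_curated INTEGER)")

# Step 1: curated words inserted with frequency=0, is_curated=1
conn.execute("INSERT INTO words VALUES (?, 0, 1)", ("ဆေးရုံ",))

# Step 2: corpus processing updates frequency; MAX() keeps the curated flag
corpus_counts = {"ဆေးရုံ": 120, "မြန်မာ": 5000}
for word, freq in corpus_counts.items():
    conn.execute(
        """INSERT INTO words VALUES (?, ?, 0)
           ON CONFLICT(word) DO UPDATE SET
               frequency = excluded.frequency,
               is_curated = MAX(is_curated, excluded.is_curated)""",
        (word, freq),
    )

rows = {w: (f, c) for w, f, c in conn.execute("SELECT word, frequency, is_curated FROM words")}
```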

Incremental Updates

Add new data to an existing dictionary without rebuilding from scratch:

# CLI
myspellchecker build --input new_data.txt --output existing.db --incremental

# Python API
pipeline.build_database(
    input_files=["new_data.txt"],
    database_path="existing.db",
    incremental=True,
)
The pipeline tracks processed files in a processed_files table to avoid reprocessing.
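
Based on the processed_files(path, mtime, size) schema, the skip check likely works along these lines. The comparison logic below is an assumption for illustration, not the pipeline's actual implementation:

```python
# Sketch of a skip check over the processed_files(path, mtime, size) table.
import os
import sqlite3

def needs_processing(conn, path):
    """Return True if the file is new or has changed since the last build."""
    st = os.stat(path)
    row = conn.execute(
        "SELECT mtime, size FROM processed_files WHERE path = ?", (path,)
    ).fetchone()
    if row is None:
        return True
    return (row[0], row[1]) != (st.st_mtime, st.st_size)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_files (path TEXT PRIMARY KEY, mtime REAL, size INTEGER)")
```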

Enrichment

After frequency computation, the pipeline runs an enrichment step that mines additional linguistic data from the corpus. Disable with --no-enrich on the CLI or enrich=False in PipelineConfig.

What Gets Mined

| Enrichment | Table | Purpose |
|---|---|---|
| Confusable pairs | confusable_pairs | Phonetically/orthographically similar word pairs (aspiration swaps, medial swaps, tone marks, nasal endings) |
| Compound confusions | compound_confusions | Words that may be incorrectly split during segmentation (e.g., “မြန်မာ” split as “မြန်” + “မာ”) |
| Collocations | collocations | Statistically significant word pairs with PMI/NPMI scores |
| Register tags | register_tags | Formal/informal register classification based on marker co-occurrence |

Configuration

from myspellchecker.data_pipeline.config import PipelineConfig

config = PipelineConfig(
    # Master toggle
    enrich=True,

    # Individual toggles
    enrich_confusables=True,
    enrich_compounds=True,
    enrich_collocations=True,
    enrich_register=True,
)

Enrichment Thresholds

Fine-tune the mining process via EnrichmentConfig (passed internally from PipelineConfig):
| Parameter | Default | Description |
|---|---|---|
| confusable_min_freq | 50 | Minimum word frequency to generate confusable variants |
| confusable_max_freq_ratio | 1000.0 | Maximum frequency ratio between pair members |
| compound_min_freq | 100 | Minimum compound frequency to include |
| compound_min_split_count | 10 | Minimum bigram count for the split form |
| compound_min_pmi | 2.0 | Minimum PMI for compound pairs |
| collocation_min_count | 20 | Minimum bigram occurrences |
| collocation_min_pmi | 3.0 | Minimum PMI for collocations |
| register_min_total | 50 | Minimum co-occurrence count with register markers |
| register_threshold | 0.3 | Score cutoff for formal/informal classification |

Confusable Pair Mining

Generates phonetic/orthographic variants for every word above a frequency threshold, then checks which variants are also valid dictionary words. Context overlap (cosine similarity of bigram context vectors) and frequency ratio are computed for each pair. Variant types mined:
  • Aspiration swaps (က↔ခ, ပ↔ဖ, etc.)
  • Medial swaps (ျ↔ြ) and medial insertion/deletion
  • Nasal ending confusion (န်↔မ်↔ံ)
  • Stop-coda confusion
  • Tone mark changes (visarga add/remove)
  • Vowel length changes
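
The variant-generation step can be sketched for one family, aspiration swaps. The pair list and dictionary below are illustrative; the real miner covers all the variant types listed above:

```python
# Minimal sketch of aspiration-swap variant generation (one variant family).
# The consonant pair list is illustrative, not the miner's full table.
ASPIRATION_PAIRS = {"က": "ခ", "ခ": "က", "ပ": "ဖ", "ဖ": "ပ"}

def aspiration_variants(word):
    """Yield words with exactly one aspiration consonant swapped."""
    variants = set()
    for i, ch in enumerate(word):
        if ch in ASPIRATION_PAIRS:
            variants.add(word[:i] + ASPIRATION_PAIRS[ch] + word[i + 1:])
    return variants

# A variant becomes a confusable pair only if it is itself a valid
# dictionary word above the frequency threshold.
dictionary = {"ကလေး", "ခလေး"}
pairs = [(w, v) for w in dictionary for v in aspiration_variants(w) if v in dictionary]
```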

Compound Confusion Detection

Finds bigrams (w1, w2) where the concatenation w1+w2 is a high-frequency dictionary word. Computes PMI to measure how strongly the parts associate:
PMI = log2( P(compound) / (P(w1) × P(w2)) )
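
A worked computation of this formula, with illustrative counts (the corpus size and frequencies are assumptions):

```python
# Worked example of the compound PMI formula, with illustrative counts.
import math

total_words = 1_000_000   # assumed total token count
freq = {"မြန်မာ": 5000, "မြန်": 6000, "မာ": 7000}

p_compound = freq["မြန်မာ"] / total_words
p_w1 = freq["မြန်"] / total_words
p_w2 = freq["မာ"] / total_words

pmi = math.log2(p_compound / (p_w1 * p_w2))
# A high PMI means the split parts co-occur far more than chance predicts,
# so "မြန်" + "မာ" is likely a mis-split of the compound.
```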

Collocation Mining

Extracts statistically significant word pairs using Pointwise Mutual Information. Normalized PMI (NPMI) provides a scale-independent score in [-1, 1].
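
The normalization is the standard NPMI construction, dividing PMI by -log2 of the joint probability. A sketch with illustrative counts:

```python
# Sketch of NPMI = PMI / -log2(p_joint), which maps PMI onto [-1, 1].
import math

def pmi_npmi(pair_count, w1_count, w2_count, total):
    p_joint = pair_count / total
    p1, p2 = w1_count / total, w2_count / total
    pmi = math.log2(p_joint / (p1 * p2))
    npmi = pmi / -math.log2(p_joint)
    return pmi, npmi

pmi, npmi = pmi_npmi(pair_count=400, w1_count=2000, w2_count=3000, total=1_000_000)
```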

Register Tagging

Classifies words as formal, informal, or neutral based on co-occurrence with register markers. Words appearing predominantly with formal sentence-final particles are tagged formal, and vice versa.
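
A threshold-based classification over marker co-occurrence counts might look like the following. The score formula is an assumption; the defaults (register_min_total=50, register_threshold=0.3) come from the thresholds table above:

```python
# Sketch of register classification from marker co-occurrence counts.
# The score formula is assumed; defaults match the documented thresholds.
def classify_register(formal_count, informal_count,
                      min_total=50, threshold=0.3):
    total = formal_count + informal_count
    if total < min_total:
        return "neutral"          # too few marker co-occurrences to decide
    score = (formal_count - informal_count) / total   # in [-1, 1]
    if score >= threshold:
        return "formal"
    if score <= -threshold:
        return "informal"
    return "neutral"
```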

Output Database Schema

-- Core tables
syllables(id, syllable, frequency)
words(id, word, syllable_count, frequency, pos_tag, is_curated,
      inferred_pos, inferred_confidence, inferred_source)
bigrams(id, word1_id, word2_id, probability, count)
trigrams(id, word1_id, word2_id, word3_id, probability, count)
fourgrams(id, word1_id, word2_id, word3_id, word4_id, probability, count)
fivegrams(id, word1_id, word2_id, word3_id, word4_id, word5_id, probability, count)

-- POS probability tables (for Viterbi tagger)
pos_unigrams(pos, probability)
pos_bigrams(pos1, pos2, probability)
pos_trigrams(pos1, pos2, pos3, probability)

-- Enrichment tables (skipped with --no-enrich)
confusable_pairs(id, word1, word2, confusion_type, context_overlap, freq_ratio, suppress, source)
compound_confusions(id, compound, part1, part2, compound_freq, split_freq, pmi)
collocations(id, word1, word2, pmi, npmi, count)
register_tags(word, register, confidence, formal_count, informal_count)

-- Metadata and file tracking
metadata(key, value)
processed_files(path, mtime, size)

Query Examples

import sqlite3

conn = sqlite3.connect("dict.db")
cursor = conn.cursor()

# Look up word frequency
cursor.execute("SELECT frequency FROM words WHERE word = ?", ("မြန်မာ",))

# Get bigram probability
cursor.execute("""
    SELECT b.probability
    FROM bigrams b
    JOIN words w1 ON b.word1_id = w1.id
    JOIN words w2 ON b.word2_id = w2.id
    WHERE w1.word = ? AND w2.word = ?
""", ("ထမင်း", "စား"))

Verification

from myspellchecker.data_pipeline import DatabasePackager

# Verify an existing database (from_existing() opens the connection)
packager = DatabasePackager.from_existing("output.db")
packager.verify_database()
packager.print_stats()
packager.close()

Performance

For large corpora, see Optimization for DuckDB acceleration (3-15x faster frequency counting) and Cython parallelization.
# Tune for large corpora
myspellchecker build --input huge_corpus.txt \
  --num-workers 8 \
  --batch-size 500000

See Also