mySpellChecker does not include a bundled dictionary. You build your own from a text corpus. The data pipeline reads raw text, segments it into syllables and words, calculates N-gram probabilities, and packages everything into an optimized SQLite database.

Quick Start

Build a dictionary from the command line:
# Build from a text corpus
myspellchecker build --input corpus.txt --output dict.db

# Build a sample database for testing
myspellchecker build --sample
Or use the Python API for programmatic control:
from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)

Pipeline Architecture

The dictionary build runs as a linear pipeline: Input Files → Ingestion → Segmentation → Frequency Counting → Packaging → SQLite Dictionary.
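To make the Frequency Counting stage concrete, here is a minimal, self-contained sketch of how word-level n-gram counts are typically gathered from tokenized text. The count_ngrams helper is illustrative only and is not part of the library's API:

# Illustrative sketch of n-gram frequency counting, not the library's implementation.
from collections import Counter

def count_ngrams(tokens, n):
    """Count n-token sequences (n-grams) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat".split()
word_counts = count_ngrams(tokens, 1)    # per-word frequencies, as stored in the words table
bigram_counts = count_ngrams(tokens, 2)  # word-pair counts, as stored in the bigrams table
print(bigram_counts[("the", "cat")])     # 1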

What the Database Contains

| Table | Content | Purpose |
|---|---|---|
| syllables | Valid syllables + frequencies | Syllable validation |
| words | Words + frequencies + POS tags | Word validation, suggestions |
| bigrams | Word pair probabilities | Context checking (2-gram) |
| trigrams | Word triple probabilities | Context checking (3-gram) |
| fourgrams | 4-word sequence probabilities | Context checking (4-gram) |
| fivegrams | 5-word sequence probabilities | Context checking (5-gram) |
| pos_unigrams/bigrams/trigrams | POS tag probabilities | Grammar checking |
| metadata | Key-value build metadata | Build info, versioning |
| processed_files | Ingested file paths + timestamps | Incremental updates |
| confusable_pairs | Confusable word pairs + type + overlap | Confusable error detection |
| compound_confusions | Compound vs. split-word pairs + PMI | Compound error detection |
| collocations | Word pair co-occurrence + PMI/NPMI | Collocation error detection |
| register_tags | Word register labels (formal/informal) | Register mixing detection |
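Because the output is a plain SQLite file, you can inspect a finished build with Python's standard sqlite3 module. The snippet below lists the tables and their row counts as a quick sanity check; it relies only on standard SQLite behavior, not on any particular column layout:

import sqlite3

conn = sqlite3.connect("dict.db")

# sqlite_master is SQLite's built-in catalog of schema objects.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)

# Row counts give a rough sense of corpus coverage per table.
for table in tables:
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    print(f"{table}: {count} rows")

conn.close()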

Next Steps

- Corpus Format: supported input formats and requirements
- Building Dictionaries: CLI reference, Python API, config options
- Optimization: DuckDB acceleration, Cython, parallel workers
- Custom Dictionaries: curated lexicons, domain-specific builds