mySpellChecker does not include a bundled dictionary; you build your own from a text corpus. The data pipeline reads raw text, segments it into syllables and words, calculates n-gram probabilities, and packages the results into an optimized SQLite database.

Quick Start

Build a dictionary from the command line:
# Build from a text corpus
myspellchecker build --input corpus.txt --output dict.db

# Build a sample database for testing
myspellchecker build --sample
Or use the Python API for programmatic control:
from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)

Pipeline Architecture

Dictionary building pipeline: Input Files → Ingestion → Segmentation → Frequency Counting → Packaging → SQLite Dictionary
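
To make the stages concrete, here is a minimal, self-contained sketch of the same flow using only the Python standard library. It is not mySpellChecker's implementation: the function names, the whitespace-based segmentation, and the table columns (`word`, `frequency`, `word1`, `word2`, `probability`) are all assumptions chosen for illustration. Bigram probabilities are plain maximum-likelihood estimates, count(w1, w2) / count(w1).

```python
import sqlite3
from collections import Counter


def ingest(lines):
    """Ingestion: strip surrounding whitespace and drop empty lines."""
    return [line.strip() for line in lines if line.strip()]


def segment(sentence):
    """Segmentation stand-in: split on whitespace.

    Assumption for illustration only; the real pipeline segments text
    into syllables and words.
    """
    return sentence.split()


def count_frequencies(sentences):
    """Frequency counting: word counts and adjacent word-pair counts."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = segment(sentence)
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def package(unigrams, bigrams, database_path=":memory:"):
    """Packaging: write counts and bigram probabilities to SQLite.

    The schema below is hypothetical, not the library's actual schema.
    """
    conn = sqlite3.connect(database_path)
    conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, frequency INTEGER)")
    conn.execute("CREATE TABLE bigrams (word1 TEXT, word2 TEXT, probability REAL)")
    conn.executemany("INSERT INTO words VALUES (?, ?)", list(unigrams.items()))
    # Maximum-likelihood estimate: P(w2 | w1) = count(w1, w2) / count(w1)
    rows = [(w1, w2, n / unigrams[w1]) for (w1, w2), n in bigrams.items()]
    conn.executemany("INSERT INTO bigrams VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


corpus = ["the cat sat", "", "the cat ran ", "the dog sat"]
conn = package(*count_frequencies(ingest(corpus)))
for row in conn.execute("SELECT word1, word2, probability FROM bigrams ORDER BY word1, word2"):
    print(row)
```

Keeping each stage as a separate function mirrors the diagram: any stage can be swapped (for example, a better segmenter) without touching the others.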

What the Database Contains

| Table | Content | Purpose |
| --- | --- | --- |
| syllables | Valid syllables + frequencies | Syllable validation |
| words | Words + frequencies + POS tags | Word validation, suggestions |
| bigrams | Word pair probabilities | Context checking |
| trigrams | Word triple probabilities | Context checking |
| pos_unigrams/bigrams/trigrams | POS tag probabilities | Grammar checking |
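
Because the output is an ordinary SQLite file, it can be inspected with Python's built-in `sqlite3` module. The sketch below lists every table with its row count without assuming any column names. The helper `summarize_tables` is a hypothetical name, and the in-memory database with its columns is a stand-in so the example runs without a built dictionary; for a real build you would connect to the generated file (for example `dict.db`) instead.

```python
import sqlite3


def summarize_tables(conn):
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    # Names are read from sqlite_master, so quoting them as identifiers is safe here.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# Stand-in database; the column layouts are hypothetical, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE syllables (syllable TEXT, frequency INTEGER)")
conn.execute("CREATE TABLE words (word TEXT, frequency INTEGER)")
conn.executemany("INSERT INTO words VALUES (?, ?)", [("alpha", 3), ("beta", 1)])

summary = summarize_tables(conn)
for name, count in summary.items():
    print(f"{name}: {count} rows")
```

A quick check like this is a simple way to confirm that a build produced the expected tables before pointing the spell checker at the file.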

Next Steps