mySpellChecker does not include a bundled dictionary — you build your own from a text corpus using the data pipeline. This guide covers building, customizing, and combining dictionaries for domain-specific spell checking.

Why Custom Dictionaries?

Since no dictionary ships with the package, building one is your first step. Custom dictionaries let you tailor the vocabulary for:
  • Domain terminology - Medical, legal, technical terms
  • Organization names - Company names, product names
  • Regional variations - Dialect-specific words
  • New vocabulary - Recent additions to the language

Building a Dictionary

Using Curated Lexicons

A curated lexicon is a carefully verified list of words that you want to mark as trusted in the database. Words from curated lexicons are stored with is_curated=1, ensuring they are always recognized as valid vocabulary.

Key feature: curated words are inserted directly into the database before corpus processing, so every curated word is included whether or not it appears in the corpus.

Create a curated lexicon CSV file with a word column header:
word
ဆေးရုံ
ဆရာဝန်
လူနာ
ကုမ္ပဏီ
Build with curated lexicon:
# Build database with curated words marked as trusted
myspellchecker build -i corpus.txt -o dictionary.db \
  --curated-input curated_lexicon.csv
The curated lexicon can be combined with other build options:
# Combine corpus + curated lexicon + transformer POS tagging
myspellchecker build -i corpus.txt -o dictionary.db \
  --curated-input data/curated_lexicon.csv \
  --pos-tagger transformer \
  --min-frequency 5
How curated words are processed:
Pipeline Flow:
1. load_curated_words()  →  INSERT (freq=0, is_curated=1, syllables segmented)
2. load_words() (corpus) →  UPDATE frequency, preserve is_curated=1
Scenario                        frequency     is_curated
Curated only (not in corpus)    0             1
Curated + corpus overlap        corpus_freq   1
Corpus only                     corpus_freq   0
Benefits:
  • All curated words are in the database regardless of corpus coverage
  • Frequency reflects the corpus whenever the word appears in it
  • Syllable segmentation is applied for syllable_count
  • is_curated=1 is preserved even when corpus updates frequency
Preparing curated lexicons: Use the scripts/merge_vocabulary.py utility to merge and deduplicate vocabulary files:
# Merge CSV vocabulary files
python scripts/merge_vocabulary.py /path/to/csv/folder -o data/curated_lexicon.csv

# Merge CSV and TXT files
python scripts/merge_vocabulary.py /path/to/csv -t /path/to/text/files -o data/curated_lexicon.csv

# Append new files to existing lexicon
python scripts/merge_vocabulary.py -t /path/to/new/files --append -o data/curated_lexicon.csv
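As a rough illustration of the merge-and-deduplicate idea, the sketch below collects words from CSV `word` columns and plain-text files into one deduplicated CSV. It is not the actual scripts/merge_vocabulary.py, which handles more formats and options:

```python
import csv
from pathlib import Path

def merge_vocabulary(csv_paths, txt_paths, out_path):
    """Collect words from CSV 'word' columns and text files, dedupe, write one CSV."""
    words = []
    seen = set()

    def add(word):
        word = word.strip()
        if word and word not in seen:  # keep first occurrence, preserve order
            seen.add(word)
            words.append(word)

    for path in csv_paths:
        with open(path, encoding="utf-8", newline="") as f:
            for row in csv.DictReader(f):
                add(row.get("word", ""))
    for path in txt_paths:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            add(line)

    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word"])
        writer.writerows([w] for w in words)
```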
Priority hierarchy during database build:
  1. Curated words inserted first (--curated-input) → is_curated=1, freq=0
  2. Corpus words loaded → frequency updated, is_curated preserved via MAX()
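This two-phase flow can be illustrated with a self-contained sqlite3 sketch. The table layout below is a simplified stand-in, not mySpellChecker's actual schema; it only demonstrates how a MAX()-based upsert preserves is_curated when corpus counts arrive:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words (text TEXT PRIMARY KEY, frequency INTEGER, is_curated INTEGER)"
)

# Phase 1: curated words inserted first with freq=0, is_curated=1
for word in ["ဆေးရုံ", "ဆရာဝန်"]:
    conn.execute("INSERT INTO words VALUES (?, 0, 1)", (word,))

# Phase 2: corpus counts update frequency; MAX() keeps the curated flag
corpus_counts = {"ဆရာဝန်": 120, "လူနာ": 45}
for word, freq in corpus_counts.items():
    conn.execute(
        """INSERT INTO words VALUES (?, ?, 0)
           ON CONFLICT(text) DO UPDATE SET
             frequency = excluded.frequency,
             is_curated = MAX(is_curated, excluded.is_curated)""",
        (word, freq),
    )

rows = {t: (f, c) for t, f, c in conn.execute("SELECT * FROM words")}
# Matches the scenario table: curated-only (0, 1), overlap (120, 1), corpus-only (45, 0)
```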

From Text Corpus

# Prepare a text file with domain content
myspellchecker build --input medical_corpus.txt --output medical.db

# With frequency threshold
myspellchecker build --input corpus.txt --output custom.db --min-frequency 2

From CSV

text,frequency,pos
ဆေးရုံ,5000,N
ဆရာဝန်,3000,N
လူနာ,2500,N
myspellchecker build --input medical_terms.csv --output medical.db

From JSON

{
  "entries": [
    {"text": "ဆေးရုံ", "frequency": 5000, "pos": "N"},
    {"text": "ဆရာဝန်", "frequency": 3000, "pos": "N"}
  ]
}
myspellchecker build --input terms.json --output custom.db

Using Custom Dictionaries

Single Custom Dictionary

from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider

# Use custom dictionary
provider = SQLiteProvider(database_path="medical.db")
checker = SpellChecker(provider=provider)

result = checker.check("ဆရာဝန်က လူနာကို ကြည့်သည်")

Using Multiple Data Sources

To combine vocabulary from multiple sources, use the data pipeline to merge them into a single database:
from myspellchecker import SpellChecker
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

# Build a unified database from multiple corpora
pipeline = Pipeline()
pipeline.build_database(
    input_files=["general_corpus.txt", "medical_corpus.txt", "organization_names.txt"],
    database_path="combined.db",
)

# Use the combined database via SQLiteProvider
from myspellchecker.providers import SQLiteProvider
provider = SQLiteProvider(database_path="combined.db")
checker = SpellChecker(provider=provider)

Alternative: Sequential Lookup

For runtime lookup across multiple databases, use custom logic:
from myspellchecker.providers import SQLiteProvider

class MultiProvider:
    """Custom provider that checks multiple databases."""

    def __init__(self, db_paths: list):
        self.providers = [SQLiteProvider(database_path=p) for p in db_paths]

    def is_valid_word(self, word: str) -> bool:
        return any(p.is_valid_word(word) for p in self.providers)
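The pattern can be exercised without real databases by substituting any object that exposes is_valid_word. The stub class below is illustrative only, not part of the library:

```python
class StubProvider:
    """Stand-in for SQLiteProvider: answers lookups from an in-memory word set."""

    def __init__(self, words):
        self.words = set(words)

    def is_valid_word(self, word: str) -> bool:
        return word in self.words

class MultiProvider:
    """Checks each provider in turn; a word is valid if any source knows it."""

    def __init__(self, providers: list):
        self.providers = providers

    def is_valid_word(self, word: str) -> bool:
        return any(p.is_valid_word(word) for p in self.providers)

general = StubProvider(["စကား"])
medical = StubProvider(["ဆေးရုံ", "ဆရာဝန်"])
multi = MultiProvider([general, medical])
# A word found in either source is accepted
print(multi.is_valid_word("ဆေးရုံ"))  # True
```

Because any() short-circuits, lookup stops at the first database that recognizes the word, so order the providers from most to least likely to match.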

Python API for Building

Basic Pipeline

from myspellchecker.data_pipeline import Pipeline, PipelineConfig

# Configure pipeline settings
config = PipelineConfig(
    min_frequency=2,
    batch_size=50000,
)

# Create pipeline with config
pipeline = Pipeline(config=config)

# Build database from corpus files
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="custom.db",
)

With POS Tagging

from myspellchecker.core.config import POSTaggerConfig
from myspellchecker.data_pipeline import Pipeline

# Configure POS tagger
pos_config = POSTaggerConfig(
    tagger_type="viterbi",
)

pipeline = Pipeline()
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="custom.db",
    pos_tagger_config=pos_config,
)

Incremental Updates

Add new words without rebuilding from scratch:
from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()
pipeline.build_database(
    input_files=["new_words.txt"],
    database_path="existing.db",
    incremental=True,  # Merge with existing data
)

Customizing Dictionary Content

Dictionary content is managed through the data pipeline by modifying your input corpus files. The pipeline builds a fresh database each time, ensuring consistency.

Adding New Words

Add new vocabulary by including them in your corpus or creating a supplementary file:
# Create a supplementary corpus file with domain terms
domain_terms = """
ကုမ္ပဏီအသစ်
အမည်သစ်တစ်ခု
ဝန်ဆောင်မှုသစ်
"""

with open("domain_terms.txt", "w", encoding="utf-8") as f:
    f.write(domain_terms)

# Rebuild database with the new terms included
from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()
pipeline.build_database(
    input_files=["main_corpus.txt", "domain_terms.txt"],
    database_path="custom.db",
)

Filtering Low-Frequency Words

Control which words are included using the min_frequency parameter:
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

# Only include words appearing 5+ times
config = PipelineConfig(min_frequency=5)
pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="filtered.db",
)

Combining Multiple Corpora

The recommended approach is to combine corpora at build time rather than merging databases:
from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()

# Combine multiple source files into one database
pipeline.build_database(
    input_files=[
        "general_corpus.txt",
        "domain_specific.txt",
        "organization_names.txt",
    ],
    database_path="merged.db",
)

Validation and Testing

Test Coverage

from myspellchecker.providers import SQLiteProvider

def test_dictionary_coverage(test_words: list, db_path: str) -> dict:
    """Test how many words are covered by dictionary."""
    provider = SQLiteProvider(database_path=db_path)

    found = sum(1 for w in test_words if provider.is_valid_word(w))

    return {
        "total": len(test_words),
        "found": found,
        "coverage": found / len(test_words) * 100,
    }

domain_terms = ["ဆေးရုံ", "ဆရာဝန်", "လူနာ"]  # words the dictionary should cover
result = test_dictionary_coverage(domain_terms, "custom.db")
print(f"Coverage: {result['coverage']:.1f}%")

Best Practices

Corpus Quality

  1. Clean input - Remove HTML, special characters
  2. Normalize encoding - Ensure UTF-8, convert Zawgyi
  3. Remove duplicates - Deduplicate sentences
  4. Balance content - Include variety of contexts
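The first three steps can be sketched with the standard library. The regex and NFC normalization below are simplifying assumptions, real corpora usually need heavier cleaning, and Zawgyi-to-Unicode conversion requires a dedicated converter:

```python
import re
import unicodedata

def clean_corpus(lines):
    """Strip HTML tags, normalize to NFC, and deduplicate non-empty lines."""
    seen = set()
    cleaned = []
    for line in lines:
        line = re.sub(r"<[^>]+>", "", line)        # 1. remove HTML tags
        line = unicodedata.normalize("NFC", line)  # 2. normalize encoding
        line = line.strip()
        if line and line not in seen:              # 3. deduplicate
            seen.add(line)
            cleaned.append(line)
    return cleaned

raw = ["<p>ဆေးရုံ</p>", "ဆေးရုံ", "  ", "ဆရာဝန်"]
print(clean_corpus(raw))
```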

Dictionary Size

Use Case          Recommended Size
Quick testing     1,000-10,000 words
Domain-specific   10,000-50,000 words
General use       50,000-200,000 words
Comprehensive     200,000+ words

Frequency Thresholds

from myspellchecker.data_pipeline import PipelineConfig

# Choose threshold based on corpus quality
config_rare = PipelineConfig(min_frequency=1)     # Include rare words (noisy)
config_balanced = PipelineConfig(min_frequency=2) # Exclude hapax legomena
config_clean = PipelineConfig(min_frequency=5)    # Only common words

Troubleshooting

Missing Words

from myspellchecker.providers import SQLiteProvider

# Check if word exists
provider = SQLiteProvider(database_path="custom.db")
exists = provider.is_valid_word("ရှာနေသောစကား")
print(f"Word exists: {exists}")

# Check word frequency
freq = provider.get_word_frequency("ရှာနေသောစကား")
print(f"Frequency: {freq}")

Wrong Suggestions

Low-frequency words receive lower suggestion priority. To boost the ranking of specific words, increase their corpus frequency by repeating them in a supplementary file:
# Create supplementary file with repeated important terms
important_terms = "စကား\n" * 100  # Repeat 100 times to boost frequency
with open("boost_terms.txt", "w", encoding="utf-8") as f:
    f.write(important_terms)

# Rebuild database with boosted terms
from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()
pipeline.build_database(
    input_files=["corpus.txt", "boost_terms.txt"],
    database_path="custom.db",
)

Large Dictionary Performance

from myspellchecker.core.config import SpellCheckerConfig, AlgorithmCacheConfig

# Increase cache sizes for large dictionaries
config = SpellCheckerConfig(
    cache=AlgorithmCacheConfig(
        word_cache_size=50000,
        syllable_cache_size=20000,
    )
)

See Also