## Why Custom Dictionaries?
mySpellChecker does not include a bundled dictionary, so you must build one first. Custom dictionaries let you tailor the vocabulary for:

- Domain terminology - Medical, legal, technical terms
- Organization names - Company names, product names
- Regional variations - Dialect-specific words
- New vocabulary - Recent additions to the language
## Building a Dictionary
### Using Curated Lexicons
A curated lexicon is a carefully verified list of words that you want to mark as trusted in the database. Words from curated lexicons are stored with `is_curated=1`, ensuring they are always recognized as valid vocabulary.
**Key feature:** curated words are inserted directly into the database before corpus processing. This ensures all curated vocabulary is included regardless of whether it appears in the corpus.
Create a curated lexicon CSV file with a `word` column header:
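A minimal example; the entries below are illustrative Burmese words, and only the `word` header is required by this description:

```csv
word
ဆေးရုံ
ကျောင်း
နည်းပညာ
```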
| Scenario | `frequency` | `is_curated` |
|---|---|---|
| Curated only (not in corpus) | 0 | 1 |
| Curated + corpus overlap | corpus_freq | 1 |
| Corpus only | corpus_freq | 0 |
- All curated words are in the database regardless of corpus coverage
- Frequency is accurate from the corpus (when the word appears)
- Syllable segmentation is applied for `syllable_count`
- `is_curated=1` is preserved even when the corpus updates the frequency
Use the `scripts/merge_vocabulary.py` utility to merge and deduplicate vocabulary files:

- Curated words are inserted first (`--curated-input`) → `is_curated=1`, `freq=0`
- Corpus words are loaded → frequency updated, `is_curated` preserved via `MAX()`
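The merge order can be sketched in plain Python with SQLite. The table and column names here (`words`, `word`, `frequency`, `is_curated`) are assumptions for illustration, not necessarily the schema produced by `merge_vocabulary.py`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words (word TEXT PRIMARY KEY, frequency INTEGER, is_curated INTEGER)"
)

# Step 1: curated words go in first with is_curated=1 and freq=0.
for word in ["ဆေးရုံ", "ကျောင်း"]:
    conn.execute("INSERT INTO words VALUES (?, 0, 1)", (word,))

# Step 2: corpus words are folded in. Frequency comes from the corpus,
# while MAX() keeps is_curated=1 for words already marked curated
# (corpus-only words stay at is_curated=0).
for word, freq in {"ဆေးရုံ": 42, "စာအုပ်": 7}.items():
    conn.execute("INSERT OR IGNORE INTO words VALUES (?, 0, 0)", (word,))
    conn.execute(
        "UPDATE words SET frequency = ?, is_curated = MAX(is_curated, 0) WHERE word = ?",
        (freq, word),
    )
```

After the merge, the curated-only word keeps `frequency=0` with `is_curated=1`, the overlapping word gets its corpus frequency with `is_curated=1` preserved, and the corpus-only word carries `is_curated=0`, matching the scenario table above.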
### From Text Corpus
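A minimal sketch of building word frequencies from raw text. The real pipeline presumably uses a proper segmenter (this page mentions syllable segmentation, and Burmese text is not whitespace-delimited); the naive regex tokenizer below is for illustration only:

```python
import re
import sqlite3
from collections import Counter

def build_from_corpus(text: str, db_path: str = ":memory:") -> sqlite3.Connection:
    # Naive tokenization: runs of word characters, lowercased.
    counts = Counter(re.findall(r"\w+", text.lower()))
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, frequency INTEGER)")
    conn.executemany("INSERT OR REPLACE INTO words VALUES (?, ?)", counts.items())
    return conn

conn = build_from_corpus("the cat sat on the mat")
print(conn.execute("SELECT frequency FROM words WHERE word = 'the'").fetchone())  # (2,)
```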
### From CSV
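A sketch of loading vocabulary from CSV with the standard library. The `word` and `frequency` column names are assumptions; check the corpus format reference for the exact headers:

```python
import csv
import io

csv_text = "word,frequency\nmorning,120\nhello,95\n"
# csv.DictReader keys each row by the header line.
rows = csv.DictReader(io.StringIO(csv_text))
vocab = {row["word"]: int(row["frequency"]) for row in rows}
print(vocab)  # {'morning': 120, 'hello': 95}
```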
### From JSON
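A sketch of loading vocabulary from JSON. The list-of-objects shape shown here is an assumption about the input format:

```python
import json

json_text = '[{"word": "morning", "frequency": 120}, {"word": "hello", "frequency": 95}]'
vocab = {entry["word"]: entry["frequency"] for entry in json.loads(json_text)}
print(vocab)  # {'morning': 120, 'hello': 95}
```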
## Using Custom Dictionaries
### Single Custom Dictionary
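Runtime usage depends on mySpellChecker's loading API, which is not shown on this page. As a stand-in, a direct SQLite lookup against the built database (assuming a `words` table):

```python
import sqlite3

def is_known(conn: sqlite3.Connection, word: str) -> bool:
    # A word is valid if it has a row in the dictionary database.
    return conn.execute("SELECT 1 FROM words WHERE word = ?", (word,)).fetchone() is not None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, frequency INTEGER)")
conn.execute("INSERT INTO words VALUES ('hello', 95)")
print(is_known(conn, "hello"), is_known(conn, "helo"))  # True False
```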
### Using Multiple Data Sources
To combine vocabulary from multiple sources, use the data pipeline to merge them into a single database.

#### Alternative: Sequential Lookup
For runtime lookup across multiple databases, use custom logic.

## Python API for Building
### Basic Pipeline
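The project's build API is not shown on this page, so the class and method names below (`DictionaryPipeline`, `add_corpus`, `build`) are illustrative stand-ins for the count-filter-write flow, not mySpellChecker's actual API:

```python
import re
import sqlite3
from collections import Counter

class DictionaryPipeline:
    def __init__(self, min_frequency: int = 1):
        self.min_frequency = min_frequency
        self.counts = Counter()

    def add_corpus(self, text: str) -> None:
        # Accumulate token counts across all added corpora.
        self.counts.update(re.findall(r"\w+", text.lower()))

    def build(self, db_path: str = ":memory:") -> sqlite3.Connection:
        # Write only words that meet the frequency threshold.
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, frequency INTEGER)")
        conn.executemany(
            "INSERT INTO words VALUES (?, ?)",
            [(w, c) for w, c in self.counts.items() if c >= self.min_frequency],
        )
        return conn

pipe = DictionaryPipeline(min_frequency=2)
pipe.add_corpus("rain rain go away")
conn = pipe.build()
print(conn.execute("SELECT word, frequency FROM words").fetchall())  # [('rain', 2)]
```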
### With POS Tagging
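This page does not show the tagger itself, so the sketch below fakes tags with a lookup table purely to illustrate storing a part-of-speech column alongside each word; the `pos` column name is an assumption:

```python
import sqlite3

# Stand-in for a real POS tagger: a fixed word → tag mapping.
POS_TAGS = {"run": "VERB", "dog": "NOUN"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, frequency INTEGER, pos TEXT)")
for word, freq in [("run", 30), ("dog", 12)]:
    conn.execute(
        "INSERT INTO words VALUES (?, ?, ?)", (word, freq, POS_TAGS.get(word, "UNK"))
    )
print(conn.execute("SELECT pos FROM words WHERE word = 'dog'").fetchone()[0])  # NOUN
```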
### Incremental Updates
Add new words without rebuilding from scratch.

## Customizing Dictionary Content
Dictionary content is managed through the data pipeline by modifying your input corpus files. The pipeline builds a fresh database each time, ensuring consistency.

### Adding New Words
Add new vocabulary by including it in your corpus or creating a supplementary file.

### Filtering Low-Frequency Words
Control which words are included using the `min_frequency` parameter:
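The effect of the cutoff can be sketched in plain Python (the actual parameter lives in the pipeline, whose API is not shown here): words seen fewer times than the threshold are dropped before they reach the database.

```python
from collections import Counter

counts = Counter({"the": 500, "rare": 2, "typo": 1})
min_frequency = 2
# Keep only words at or above the threshold.
kept = {w, c} if False else {w: c for w, c in counts.items() if c >= min_frequency}
print(sorted(kept))  # ['rare', 'the']
```

A higher threshold trades vocabulary coverage for fewer noisy entries (typos, OCR errors) in the dictionary.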
### Combining Multiple Corpora
The recommended approach is to combine corpora at build time rather than merging databases.

## Validation and Testing
### Test Coverage
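One common way to validate a built dictionary is to measure coverage: the share of tokens in a held-out test text that the vocabulary recognizes. A sketch with illustrative names and a naive tokenizer:

```python
import re

vocab = {"the", "cat", "sat"}
test_text = "the cat sat on the mat"
tokens = re.findall(r"\w+", test_text.lower())
# Fraction of test tokens found in the dictionary.
coverage = sum(t in vocab for t in tokens) / len(tokens)
print(f"{coverage:.0%}")  # 67%
```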
## Best Practices

### Corpus Quality
- Clean input - Remove HTML, special characters
- Normalize encoding - Ensure UTF-8, convert Zawgyi
- Remove duplicates - Deduplicate sentences
- Balance content - Include variety of contexts
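The first three cleaning steps can be sketched with the standard library. Zawgyi→Unicode conversion needs a dedicated converter (e.g. the myanmar-tools library) and is not shown:

```python
import html
import re
import unicodedata

def clean_corpus(lines):
    seen, out = set(), []
    for line in lines:
        text = re.sub(r"<[^>]+>", " ", html.unescape(line))  # strip HTML tags/entities
        text = unicodedata.normalize("NFC", text)            # normalize encoding
        text = " ".join(text.split())                        # collapse whitespace
        if text and text not in seen:                        # deduplicate sentences
            seen.add(text)
            out.append(text)
    return out

print(clean_corpus(["<p>Hello world</p>", "Hello world", ""]))  # ['Hello world']
```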
### Dictionary Size
| Use Case | Recommended Size |
|---|---|
| Quick testing | 1,000-10,000 words |
| Domain-specific | 10,000-50,000 words |
| General use | 50,000-200,000 words |
| Comprehensive | 200,000+ words |
### Frequency Thresholds
## Troubleshooting

### Missing Words

### Wrong Suggestions
Low-frequency words get lower suggestion priority. To boost the frequency of specific words, repeat them in your corpus or create a supplementary file.

### Large Dictionary Performance
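One general technique for keeping lookups fast in large dictionaries is making sure the lookup column is indexed. A sketch assuming a plain `words` table; if the real schema declares `word` as a PRIMARY KEY, SQLite indexes it implicitly and no extra index is needed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, frequency INTEGER)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_words_word ON words(word)")

# EXPLAIN QUERY PLAN confirms equality lookups use the index.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT 1 FROM words WHERE word = 'x'").fetchone()
print(plan)  # the detail column should mention idx_words_word
```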
## See Also
- Data Pipeline - Build process details
- Corpus Format - Input specifications
- Database Schema - Schema reference