There is no “one-size-fits-all” Myanmar dictionary. Medical reports, news articles, and social media posts use different vocabularies. mySpellChecker ships without a bundled dictionary intentionally: you build one from your own text corpus using the data pipeline, ensuring the vocabulary matches exactly what your users write.
Why Custom Dictionaries?
Custom dictionaries let you tailor the vocabulary for:
- Domain terminology - Medical, legal, technical terms
- Organization names - Company names, product names
- Regional variations - Dialect-specific words
- New vocabulary - Recent additions to the language
Building a Dictionary
Using Curated Lexicons
A curated lexicon is a carefully verified list of words that you want to mark as trusted in the database. Words from curated lexicons are stored with is_curated=1, ensuring they are always recognized as valid vocabulary.
Key Feature: Curated words are inserted directly into the database before corpus processing, so all curated vocabulary is included whether or not it appears in the corpus.
Create a curated lexicon CSV file with a word column header:
word
ဆေးရုံ
ဆရာဝန်
လူနာ
ကုမ္ပဏီ
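If your trusted terms already live in Python, a file in this shape can be produced with the standard csv module. This is a minimal sketch: the helper name write_curated_lexicon and the temporary output path are illustrative and not part of mySpellChecker.

```python
import csv
import tempfile
from pathlib import Path


def write_curated_lexicon(words, path):
    """Write unique, non-empty words to a one-column CSV with a `word` header."""
    seen = set()
    unique = []
    for word in words:
        word = word.strip()
        # Skip blanks and duplicates while keeping first-seen order
        if word and word not in seen:
            seen.add(word)
            unique.append(word)
    with Path(path).open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word"])  # the column header described above
        writer.writerows([w] for w in unique)
    return unique


# Demo: write into a temporary directory; in practice, use your project path.
out_path = Path(tempfile.mkdtemp()) / "curated_lexicon.csv"
kept = write_curated_lexicon(["ဆေးရုံ", "ဆရာဝန်", " လူနာ ", "ဆေးရုံ", ""], out_path)
```

Deduplicating before the build keeps the lexicon easy to review. The scripts/merge_vocabulary.py utility covers the same need for existing files.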
Build with curated lexicon:
# Build database with curated words marked as trusted
myspellchecker build -i corpus.txt -o dictionary.db \
--curated-input curated_lexicon.csv
The curated lexicon can be combined with other build options:
# Combine corpus + curated lexicon + transformer POS tagging
myspellchecker build -i corpus.txt -o dictionary.db \
--curated-input data/curated_lexicon.csv \
--pos-tagger transformer \
--min-frequency 5
How curated words are processed:
Pipeline Flow:
1. load_curated_words() → INSERT (freq=0, is_curated=1, syllables segmented)
2. load_words() (corpus) → UPDATE frequency, preserve is_curated=1
| Scenario | frequency | is_curated |
|---|---|---|
| Curated only (not in corpus) | 0 | 1 |
| Curated + corpus overlap | corpus_freq | 1 |
| Corpus only | corpus_freq | 0 |
Benefits:
- All curated words are in the database regardless of corpus coverage
- Frequency reflects the corpus count whenever the word appears there
- Syllable segmentation is applied for syllable_count
- is_curated=1 is preserved even when corpus updates frequency
Preparing curated lexicons:
Use the scripts/merge_vocabulary.py utility to merge and deduplicate vocabulary files:
# Merge CSV vocabulary files
python scripts/merge_vocabulary.py /path/to/csv/folder -o data/curated_lexicon.csv
# Merge CSV and TXT files
python scripts/merge_vocabulary.py /path/to/csv -t /path/to/text/files -o data/curated_lexicon.csv
# Append new files to existing lexicon
python scripts/merge_vocabulary.py -t /path/to/new/files --append -o data/curated_lexicon.csv
Priority hierarchy during database build:
- Curated words inserted first (--curated-input) → is_curated=1, freq=0
- Corpus words loaded → frequency updated, is_curated preserved via MAX()
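The two-step flow can be reproduced in isolation with Python's built-in sqlite3 module. This is only a sketch of the idea: the words table and its three columns are a simplified, hypothetical schema, and the upsert statement is an assumption about how such a merge could be written (it needs SQLite 3.24 or newer), not the pipeline's actual SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema for illustration only.
conn.execute(
    "CREATE TABLE words ("
    "word TEXT PRIMARY KEY, frequency INTEGER NOT NULL, is_curated INTEGER NOT NULL)"
)

# Step 1: curated words are inserted first with frequency 0 and is_curated=1.
curated = ["ဆေးရုံ", "ကုမ္ပဏီ"]
conn.executemany(
    "INSERT INTO words (word, frequency, is_curated) VALUES (?, 0, 1)",
    [(w,) for w in curated],
)

# Step 2: corpus counts arrive. The upsert overwrites frequency and keeps
# the curated flag by taking MAX() of the stored and incoming values.
corpus_counts = {"ဆေးရုံ": 120, "လူနာ": 45}
conn.executemany(
    """
    INSERT INTO words (word, frequency, is_curated) VALUES (?, ?, 0)
    ON CONFLICT(word) DO UPDATE SET
        frequency = excluded.frequency,
        is_curated = MAX(words.is_curated, excluded.is_curated)
    """,
    list(corpus_counts.items()),
)

rows = {
    word: (freq, flag)
    for word, freq, flag in conn.execute(
        "SELECT word, frequency, is_curated FROM words"
    )
}
```

After both steps, rows holds one entry per scenario in the table above: a curated-only word with frequency 0, an overlapping word with its corpus frequency and the curated flag intact, and a corpus-only word with is_curated=0.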
From Text Corpus
# Prepare a text file with domain content
myspellchecker build --input medical_corpus.txt --output medical.db
# With frequency threshold
myspellchecker build --input corpus.txt --output custom.db --min-frequency 2
From CSV
text,frequency,pos
ဆေးရုံ,5000,N
ဆရာဝန်,3000,N
လူနာ,2500,N
myspellchecker build --input medical_terms.csv --output medical.db
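Before building from a hand-edited CSV, it can help to check the rows first. The sketch below uses only the standard library. The function name check_term_rows is hypothetical, and treating all three columns as required is an assumption, since only the example above defines the format.

```python
import csv
import io


def check_term_rows(csv_text):
    """Return (line_number, message) pairs for problems in text,frequency,pos content."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {"text", "frequency", "pos"} - set(reader.fieldnames or [])
    if missing:
        return [(1, f"missing columns: {sorted(missing)}")]
    problems = []
    for row in reader:
        line = reader.line_num  # line of the row just read (header is line 1)
        if not (row["text"] or "").strip():
            problems.append((line, "empty text"))
        freq = (row["frequency"] or "").strip()
        # Accept only plain ASCII digits, i.e. a non-negative integer
        if not (freq.isascii() and freq.isdigit()):
            problems.append((line, f"bad frequency: {freq!r}"))
    return problems


sample = "text,frequency,pos\nဆေးရုံ,5000,N\nဆရာဝန်,abc,N\n,10,N\n"
problems = check_term_rows(sample)
```

For the sample, the function flags the non-numeric frequency on line 3 and the empty text on line 4.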
From JSON
{
  "entries": [
    {"text": "ဆေးရုံ", "frequency": 5000, "pos": "N"},
    {"text": "ဆရာဝန်", "frequency": 3000, "pos": "N"}
  ]
}
myspellchecker build --input terms.json --output custom.db
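A file in this shape can be generated from existing data with the standard json module. This is a minimal sketch: the terms list is illustrative data, and the entries layout simply mirrors the example above.

```python
import json

# Illustrative (text, frequency, pos) tuples
terms = [("ဆေးရုံ", 5000, "N"), ("ဆရာဝန်", 3000, "N")]

document = {
    "entries": [
        {"text": text, "frequency": frequency, "pos": pos}
        for text, frequency, pos in terms
    ]
}

# ensure_ascii=False keeps Myanmar script readable instead of \uXXXX escapes
payload = json.dumps(document, ensure_ascii=False, indent=2)
```

Write payload to terms.json with encoding="utf-8" so the text round-trips unchanged.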
Using Custom Dictionaries
Single Custom Dictionary
from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider
# Use custom dictionary
provider = SQLiteProvider(database_path="medical.db")
checker = SpellChecker(provider=provider)
result = checker.check("ဆရာဝန်က လူနာကို ကြည့်သည်")
Using Multiple Data Sources
To combine vocabulary from multiple sources, use the data pipeline to merge them into a single database:
from myspellchecker import SpellChecker
from myspellchecker.data_pipeline import Pipeline
from myspellchecker.providers import SQLiteProvider
# Build a unified database from multiple corpora
pipeline = Pipeline()
pipeline.build_database(
    input_files=["general_corpus.txt", "medical_corpus.txt", "organization_names.txt"],
    database_path="combined.db",
)
# Use the combined database via SQLiteProvider
provider = SQLiteProvider(database_path="combined.db")
checker = SpellChecker(provider=provider)
Alternative: Sequential Lookup
For runtime lookup across multiple databases, use custom logic:
from myspellchecker.providers import SQLiteProvider
class MultiProvider:
    """Custom provider that checks multiple databases."""
    def __init__(self, db_paths: list):
        self.providers = [SQLiteProvider(database_path=p) for p in db_paths]
    def is_valid_word(self, word: str) -> bool:
        return any(p.is_valid_word(word) for p in self.providers)
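The same pattern extends to frequency lookups. The sketch below uses in-memory stand-ins (DictProvider is hypothetical) so it runs without any database. It assumes providers expose is_valid_word and get_word_frequency as used elsewhere on this page. Whether SpellChecker accepts such a duck-typed object, or requires further provider methods, is not confirmed here.

```python
class DictProvider:
    """Stand-in for a dictionary provider, backed by a plain dict (illustration only)."""

    def __init__(self, frequencies):
        self.frequencies = frequencies

    def is_valid_word(self, word):
        return word in self.frequencies

    def get_word_frequency(self, word):
        return self.frequencies.get(word, 0)


class FirstMatchProvider:
    """Checks providers in order; the first one that knows the word wins."""

    def __init__(self, providers):
        self.providers = list(providers)

    def is_valid_word(self, word):
        return any(p.is_valid_word(word) for p in self.providers)

    def get_word_frequency(self, word):
        for p in self.providers:
            if p.is_valid_word(word):
                return p.get_word_frequency(word)
        return 0  # unknown to every provider


domain = DictProvider({"ဆေးရုံ": 5000})
general = DictProvider({"ဆေးရုံ": 40, "စကား": 900})
combined = FirstMatchProvider([domain, general])
```

Listing the domain dictionary first means its frequencies take precedence over the general one, which is usually the point of sequential lookup.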
Python API for Building
Basic Pipeline
from myspellchecker.data_pipeline import Pipeline, PipelineConfig
# Configure pipeline settings
config = PipelineConfig(
    min_frequency=2,
    batch_size=50000,
)
# Create pipeline with config
pipeline = Pipeline(config=config)
# Build database from corpus files
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="custom.db",
)
With POS Tagging
from myspellchecker.core.config import POSTaggerConfig
from myspellchecker.data_pipeline import Pipeline
# Configure POS tagger
pos_config = POSTaggerConfig(
    tagger_type="viterbi",
)
pipeline = Pipeline()
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="custom.db",
    pos_tagger_config=pos_config,
)
Incremental Updates
Add new words without rebuilding from scratch:
from myspellchecker.data_pipeline import Pipeline
pipeline = Pipeline()
pipeline.build_database(
    input_files=["new_words.txt"],
    database_path="existing.db",
    incremental=True,  # Merge with existing data
)
Customizing Dictionary Content
Dictionary content is managed through the data pipeline by modifying your input corpus files. Unless you pass incremental=True (see Incremental Updates), the pipeline builds a fresh database each time, which keeps the result consistent with its inputs.
Adding New Words
Add new vocabulary by including them in your corpus or creating a supplementary file:
from myspellchecker.data_pipeline import Pipeline
# Create a supplementary corpus file with domain terms
domain_terms = """
ကုမ္ပဏီအသစ်
အမည်သစ်တစ်ခု
ဝန်ဆောင်မှုသစ်
"""
with open("domain_terms.txt", "w", encoding="utf-8") as f:
    f.write(domain_terms)
# Rebuild database with the new terms included
pipeline = Pipeline()
pipeline.build_database(
    input_files=["main_corpus.txt", "domain_terms.txt"],
    database_path="custom.db",
)
Filtering Low-Frequency Words
Control which words are included using the min_frequency parameter:
from myspellchecker.data_pipeline import Pipeline, PipelineConfig
# Only include words appearing 5+ times
config = PipelineConfig(min_frequency=5)
pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="filtered.db",
)
Combining Multiple Corpora
The recommended approach is to combine corpora at build time rather than merging databases:
from myspellchecker.data_pipeline import Pipeline
pipeline = Pipeline()
# Combine multiple source files into one database
pipeline.build_database(
    input_files=[
        "general_corpus.txt",
        "domain_specific.txt",
        "organization_names.txt",
    ],
    database_path="merged.db",
)
Validation and Testing
Test Coverage
from myspellchecker.providers import SQLiteProvider
def test_dictionary_coverage(test_words: list, db_path: str) -> dict:
    """Report how many of test_words the dictionary recognizes."""
    provider = SQLiteProvider(database_path=db_path)
    found = sum(1 for w in test_words if provider.is_valid_word(w))
    total = len(test_words)
    return {
        "total": total,
        "found": found,
        # Avoid dividing by zero when the word list is empty
        "coverage": found / total * 100 if total else 0.0,
    }
# Pass a list of words; a raw string would be checked character by character
domain_terms = ["ကုမ္ပဏီအသစ်", "အမည်သစ်တစ်ခု", "ဝန်ဆောင်မှုသစ်"]
result = test_dictionary_coverage(domain_terms, "custom.db")
print(f"Coverage: {result['coverage']:.1f}%")
Best Practices
Corpus Quality
- Clean input - Remove HTML, special characters
- Normalize encoding - Ensure UTF-8, convert Zawgyi
- Remove duplicates - Deduplicate sentences
- Balance content - Include variety of contexts
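The cleaning steps above can be approximated with the standard library. This is a rough sketch under stated assumptions: the regex tag stripper is a shortcut rather than a full HTML parser, clean_lines is a hypothetical helper, and Zawgyi-to-Unicode conversion is not handled here because it needs a dedicated converter.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")


def clean_lines(lines):
    """Strip markup, normalize Unicode, and drop blank or repeated lines."""
    seen = set()
    cleaned = []
    for line in lines:
        # Remove tags first, then decode entities such as &amp;
        text = html.unescape(TAG_RE.sub(" ", line))
        text = unicodedata.normalize("NFC", text)
        text = " ".join(text.split())  # collapse runs of whitespace
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


raw = ["<p>ဆရာဝန်က  လူနာကို</p>", "ဆရာဝန်က လူနာကို", "   ", "A &amp; B"]
corpus = clean_lines(raw)
```

Deduplication runs after normalization, so lines that differ only in markup or spacing collapse into one.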
Dictionary Size
| Use Case | Recommended Size |
|---|---|
| Quick testing | 1,000-10,000 words |
| Domain-specific | 10,000-50,000 words |
| General use | 50,000-200,000 words |
| Comprehensive | 200,000+ words |
Frequency Thresholds
from myspellchecker.data_pipeline import PipelineConfig
# Choose threshold based on corpus quality
config_rare = PipelineConfig(min_frequency=1) # Include rare words (noisy)
config_balanced = PipelineConfig(min_frequency=2) # Exclude hapax legomena
config_clean = PipelineConfig(min_frequency=5) # Only common words
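To preview how a threshold affects vocabulary size before running a full build, you can count words directly. The sketch below assumes the input is already split into words, whereas the real pipeline performs its own segmentation. vocabulary_sizes is a hypothetical helper, not a library function.

```python
from collections import Counter


def vocabulary_sizes(tokens, thresholds=(1, 2, 5)):
    """Map each min_frequency value to the number of distinct words that survive it."""
    counts = Counter(tokens)
    return {t: sum(1 for c in counts.values() if c >= t) for t in thresholds}


# Toy input: one word seen 6 times, one seen twice, one seen once
tokens = ["ဆေးရုံ"] * 6 + ["လူနာ"] * 2 + ["ဆရာဝန်"]
sizes = vocabulary_sizes(tokens)
```

For the toy input, three distinct words survive a threshold of 1, two survive a threshold of 2, and only one survives a threshold of 5. A sharp drop between 1 and 2 on real data usually indicates many one-off tokens, which are often typos or segmentation noise.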
Troubleshooting
Missing Words
from myspellchecker.providers import SQLiteProvider
# Check if word exists
provider = SQLiteProvider(database_path="custom.db")
exists = provider.is_valid_word("ရှာနေသောစကား")
print(f"Word exists: {exists}")
# Check word frequency
freq = provider.get_word_frequency("ရှာနေသောစကား")
print(f"Frequency: {freq}")
Wrong Suggestions
Low-frequency words get lower suggestion priority. To boost frequency for specific words, include them more times in your corpus or create a supplementary file:
from myspellchecker.data_pipeline import Pipeline
# Create supplementary file with repeated important terms
important_terms = "စကား\n" * 100  # Repeat 100 times to boost frequency
with open("boost_terms.txt", "w", encoding="utf-8") as f:
    f.write(important_terms)
# Rebuild database with boosted terms
pipeline = Pipeline()
pipeline.build_database(
    input_files=["corpus.txt", "boost_terms.txt"],
    database_path="custom.db",
)
With a large dictionary, you can also raise the cache sizes in the checker configuration:
from myspellchecker.core.config import SpellCheckerConfig, AlgorithmCacheConfig
# Increase cache sizes for large dictionaries
config = SpellCheckerConfig(
    cache=AlgorithmCacheConfig(
        word_cache_size=50000,
        syllable_cache_size=20000,
    )
)