There is no "one-size-fits-all" Myanmar dictionary. Medical reports, news articles, and social media posts use different vocabularies. mySpellChecker intentionally ships without a bundled dictionary: you build one from your own text corpus using the data pipeline, ensuring the vocabulary matches exactly what your users write.

Why Custom Dictionaries?

Custom dictionaries let you tailor the vocabulary for:
  • Domain terminology - Medical, legal, technical terms
  • Organization names - Company names, product names
  • Regional variations - Dialect-specific words
  • New vocabulary - Recent additions to the language

Building a Dictionary

Using Curated Lexicons

A curated lexicon is a carefully verified list of words that you want to mark as trusted in the database. Words from curated lexicons are stored with is_curated=1, ensuring they are always recognized as valid vocabulary.

Key feature: curated words are inserted directly into the database before corpus processing, so all curated vocabulary is included regardless of whether it appears in the corpus.

Create a curated lexicon CSV file with a word column header:
word
ဆေးရုံ
ဆရာဝန်
လူနာ
ကုမ္ပဏီ
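If your verified word list already lives in Python, a short script can produce this file. A minimal sketch using only the standard library; the word list here is a placeholder:
import csv

# Placeholder list of verified domain terms
curated_words = ["ဆေးရုံ", "ဆရာဝန်", "လူနာ", "ကုမ္ပဏီ"]

with open("curated_lexicon.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["word"])  # required header
    for word in curated_words:
        writer.writerow([word])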
Build with curated lexicon:
# Build database with curated words marked as trusted
myspellchecker build -i corpus.txt -o dictionary.db \
  --curated-input curated_lexicon.csv
The curated lexicon can be combined with other build options:
# Combine corpus + curated lexicon + transformer POS tagging
myspellchecker build -i corpus.txt -o dictionary.db \
  --curated-input data/curated_lexicon.csv \
  --pos-tagger transformer \
  --min-frequency 5
How curated words are processed:
Pipeline Flow:
1. load_curated_words()  →  INSERT (freq=0, is_curated=1, syllables segmented)
2. load_words() (corpus) →  UPDATE frequency, preserve is_curated=1
Scenario                      | frequency   | is_curated
Curated only (not in corpus)  | 0           | 1
Curated + corpus overlap      | corpus_freq | 1
Corpus only                   | corpus_freq | 0
Benefits:
  • All curated words are in the database regardless of corpus coverage
  • Frequency is taken from the corpus when the word appears there
  • Syllable segmentation is applied to populate syllable_count
  • is_curated=1 is preserved even when the corpus updates the frequency
Preparing curated lexicons: Use the scripts/merge_vocabulary.py utility to merge and deduplicate vocabulary files:
# Merge CSV vocabulary files
python scripts/merge_vocabulary.py /path/to/csv/folder -o data/curated_lexicon.csv

# Merge CSV and TXT files
python scripts/merge_vocabulary.py /path/to/csv -t /path/to/text/files -o data/curated_lexicon.csv

# Append new files to existing lexicon
python scripts/merge_vocabulary.py -t /path/to/new/files --append -o data/curated_lexicon.csv
Priority hierarchy during database build:
  1. Curated words inserted first (--curated-input) → is_curated=1, freq=0
  2. Corpus words loaded → frequency updated, is_curated preserved via MAX()
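The MAX() preservation can be illustrated with a standalone SQLite upsert. This is a sketch of the described semantics against an assumed minimal words table, not the library's actual schema or SQL:
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE words (text TEXT PRIMARY KEY, frequency INTEGER, is_curated INTEGER)"
)

# Step 1: curated pass inserts with freq=0, is_curated=1
con.execute("INSERT INTO words VALUES ('ဆေးရုံ', 0, 1)")

# Step 2: corpus pass upserts; MAX() keeps any existing is_curated flag
con.execute(
    "INSERT INTO words VALUES ('ဆေးရုံ', 120, 0) "
    "ON CONFLICT(text) DO UPDATE SET "
    "frequency = excluded.frequency, "
    "is_curated = MAX(words.is_curated, excluded.is_curated)"
)

print(con.execute("SELECT * FROM words").fetchone())  # ('ဆေးရုံ', 120, 1)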

From Text Corpus

# Prepare a text file with domain content
myspellchecker build --input medical_corpus.txt --output medical.db

# With frequency threshold
myspellchecker build --input corpus.txt --output custom.db --min-frequency 2

From CSV

text,frequency,pos
ဆေးရုံ,5000,N
ဆရာဝန်,3000,N
လူနာ,2500,N
myspellchecker build --input medical_terms.csv --output medical.db

From JSON

{
  "entries": [
    {"text": "ဆေးရုံ", "frequency": 5000, "pos": "N"},
    {"text": "ဆရာဝန်", "frequency": 3000, "pos": "N"}
  ]
}
myspellchecker build --input terms.json --output custom.db

Using Custom Dictionaries

Single Custom Dictionary

from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider

# Use custom dictionary
provider = SQLiteProvider(database_path="medical.db")
checker = SpellChecker(provider=provider)

result = checker.check("ဆရာဝန်က လူနာကို ကြည့်သည်")

Using Multiple Data Sources

To combine vocabulary from multiple sources, use the data pipeline to merge them into a single database:
from myspellchecker import SpellChecker
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

# Build a unified database from multiple corpora
pipeline = Pipeline()
pipeline.build_database(
    input_files=["general_corpus.txt", "medical_corpus.txt", "organization_names.txt"],
    database_path="combined.db",
)

# Use the combined database via SQLiteProvider
from myspellchecker.providers import SQLiteProvider
provider = SQLiteProvider(database_path="combined.db")
checker = SpellChecker(provider=provider)

Alternative: Sequential Lookup

For runtime lookup across multiple databases, use custom logic:
from myspellchecker.providers import SQLiteProvider

class MultiProvider:
    """Custom provider that checks multiple databases."""

    def __init__(self, db_paths: list):
        self.providers = [SQLiteProvider(database_path=p) for p in db_paths]

    def is_valid_word(self, word: str) -> bool:
        return any(p.is_valid_word(word) for p in self.providers)
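A quick usage sketch (the database paths are placeholders). Note that this minimal class only answers validity lookups; suggestion generation still needs the full provider interface, which is not implemented here:
multi = MultiProvider(["general.db", "medical.db"])
print(multi.is_valid_word("ဆရာဝန်"))  # True if any database contains the word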

Python API for Building

Basic Pipeline

from myspellchecker.data_pipeline import Pipeline, PipelineConfig

# Configure pipeline settings
config = PipelineConfig(
    min_frequency=2,
    batch_size=50000,
)

# Create pipeline with config
pipeline = Pipeline(config=config)

# Build database from corpus files
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="custom.db",
)

With POS Tagging

from myspellchecker.core.config import POSTaggerConfig

# Configure POS tagger
pos_config = POSTaggerConfig(
    tagger_type="viterbi",
)

pipeline = Pipeline()
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="custom.db",
    pos_tagger_config=pos_config,
)

Incremental Updates

Add new words without rebuilding from scratch:
pipeline = Pipeline()
pipeline.build_database(
    input_files=["new_words.txt"],
    database_path="existing.db",
    incremental=True,  # Merge with existing data
)

Customizing Dictionary Content

Dictionary content is managed through the data pipeline by modifying your input corpus files. The pipeline builds a fresh database each time, ensuring consistency.

Adding New Words

Add new vocabulary by including them in your corpus or creating a supplementary file:
# Create a supplementary corpus file with domain terms
domain_terms = """
ကုမ္ပဏီအသစ်
အမည်သစ်တစ်ခု
ဝန်ဆောင်မှုသစ်
"""

with open("domain_terms.txt", "w", encoding="utf-8") as f:
    f.write(domain_terms)

# Rebuild database with the new terms included
from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()
pipeline.build_database(
    input_files=["main_corpus.txt", "domain_terms.txt"],
    database_path="custom.db",
)

Filtering Low-Frequency Words

Control which words are included using the min_frequency parameter:
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

# Only include words appearing 5+ times
config = PipelineConfig(min_frequency=5)
pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="filtered.db",
)

Combining Multiple Corpora

The recommended approach is to combine corpora at build time rather than merging databases:
from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()

# Combine multiple source files into one database
pipeline.build_database(
    input_files=[
        "general_corpus.txt",
        "domain_specific.txt",
        "organization_names.txt",
    ],
    database_path="merged.db",
)

Validation and Testing

Test Coverage

from myspellchecker.providers import SQLiteProvider

def test_dictionary_coverage(test_words: list, db_path: str) -> dict:
    """Test how many words are covered by dictionary."""
    provider = SQLiteProvider(database_path=db_path)

    found = sum(1 for w in test_words if provider.is_valid_word(w))

    return {
        "total": len(test_words),
        "found": found,
        "coverage": found / len(test_words) * 100,
    }

# Example coverage check with a small, illustrative list of domain terms
domain_terms = ["ဆေးရုံ", "ဆရာဝန်", "လူနာ"]
result = test_dictionary_coverage(domain_terms, "custom.db")
print(f"Coverage: {result['coverage']:.1f}%")

Best Practices

Corpus Quality

  1. Clean input - Remove HTML markup and stray special characters
  2. Normalize encoding - Ensure UTF-8 (NFC), convert Zawgyi to Unicode
  3. Remove duplicates - Deduplicate sentences (steps 1-3 are sketched below)
  4. Balance content - Include a variety of contexts
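A minimal cleaning sketch covering the first three practices, using only the standard library. Zawgyi detection and conversion require a dedicated tool (e.g. myanmar-tools) and are deliberately left out here:
import re
import unicodedata

def clean_corpus(in_path: str, out_path: str) -> None:
    """Strip HTML tags, normalize to NFC, and deduplicate lines."""
    seen = set()
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = re.sub(r"<[^>]+>", "", line)            # drop HTML tags
            line = unicodedata.normalize("NFC", line).strip()
            if line and line not in seen:                  # dedupe sentences
                seen.add(line)
                dst.write(line + "\n")

clean_corpus("raw_corpus.txt", "corpus.txt")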

Dictionary Size

Use Case        | Recommended Size
Quick testing   | 1,000-10,000 words
Domain-specific | 10,000-50,000 words
General use     | 50,000-200,000 words
Comprehensive   | 200,000+ words
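To see where a built database falls in this range, you can count its entries directly. The table name words is an assumption about the schema; inspect your database first if unsure:
import sqlite3

con = sqlite3.connect("custom.db")
# NOTE: "words" is an assumed table name; to list actual tables, run:
#   SELECT name FROM sqlite_master WHERE type='table';
count = con.execute("SELECT COUNT(*) FROM words").fetchone()[0]
print(f"Dictionary size: {count} words")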

Frequency Thresholds

from myspellchecker.data_pipeline import PipelineConfig

# Choose threshold based on corpus quality
config_rare = PipelineConfig(min_frequency=1)     # Include rare words (noisy)
config_balanced = PipelineConfig(min_frequency=2) # Exclude hapax legomena
config_clean = PipelineConfig(min_frequency=5)    # Only common words

Troubleshooting

Missing Words

from myspellchecker.providers import SQLiteProvider

# Check if word exists
provider = SQLiteProvider(database_path="custom.db")
exists = provider.is_valid_word("ရှာနေသောစကား")
print(f"Word exists: {exists}")

# Check word frequency
freq = provider.get_word_frequency("ရှာနေသောစကား")
print(f"Frequency: {freq}")

Wrong Suggestions

Low-frequency words get lower suggestion priority. To boost the frequency of specific words, include them more often in your corpus or create a supplementary file:
# Create supplementary file with repeated important terms
important_terms = "စကား\n" * 100  # Repeat 100 times to boost frequency
with open("boost_terms.txt", "w", encoding="utf-8") as f:
    f.write(important_terms)

# Rebuild database with boosted terms
from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()
pipeline.build_database(
    input_files=["corpus.txt", "boost_terms.txt"],
    database_path="custom.db",
)

Large Dictionary Performance

from myspellchecker.core.config import SpellCheckerConfig, AlgorithmCacheConfig

# Increase cache sizes for large dictionaries
config = SpellCheckerConfig(
    cache=AlgorithmCacheConfig(
        word_cache_size=50000,
        syllable_cache_size=20000,
    )
)
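Wiring the config into the checker might look like the following. Whether SpellChecker accepts a config keyword argument is an assumption here; check your version's API reference:
from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider

provider = SQLiteProvider(database_path="custom.db")
# Assumption: SpellChecker accepts a config keyword argument
checker = SpellChecker(provider=provider, config=config)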

See Also