The DictionaryProvider interface abstracts the storage of vocabulary and frequency data. This lets mySpellChecker run in different environments (server, desktop, mobile, embedded) by swapping the storage backend.
Provider Comparison
| Provider | Storage | Speed | Memory | Use Case |
|---|---|---|---|---|
| SQLiteProvider | Disk | Fast | Low | Production (default) |
| MemoryProvider | RAM | Fastest | High | Testing, small dictionaries |
| JSONProvider | File | Slow | Medium | Development, debugging |
| CSVProvider | File | Slow | Medium | Data import/export |
Types of Providers
1. SQLiteProvider (Default)
- Storage: Disk-based (.db file).
- Pros: Low memory footprint. Handles massive datasets (millions of n-grams).
- Cons: Slightly slower than RAM (but mitigated by caching).
- Use Case: General purpose, desktop apps, web servers with limited RAM.
2. MemoryProvider
- Storage: RAM (Python Dictionary).
- Pros: Extremely fast (hash map lookup).
- Cons: High memory usage. Long startup time (loading data into RAM).
- Use Case: High-performance servers where RAM is abundant and latency must be minimized.
| Data | Approx. Memory |
|---|---|
| 100K words | ~50 MB |
| 1M bigrams | ~200 MB |
| 1M trigrams | ~300 MB |
3. JSONProvider
- Storage: JSON file.
- Pros: Human-readable, easy to edit/debug.
- Cons: Slow to load, memory inefficient for large datasets.
- Use Case: Unit testing, small custom vocabularies, config files.
4. CSVProvider
- Storage: CSV/TSV file.
- Pros: Easy to export from spreadsheets.
- Cons: Similar performance issues to JSON for large data.
- Use Case: Importing word lists from Excel/Sheets.
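As a rough illustration of the import use case, the sketch below reads a word list exported from a spreadsheet into a plain dictionary using only the standard library. The column names `word` and `frequency` and the helper `load_word_frequencies` are assumptions made for this example; they are not taken from CSVProvider itself.

```python
import csv
import io


def load_word_frequencies(stream) -> dict[str, int]:
    """Read rows with assumed 'word' and 'frequency' columns into a dict."""
    reader = csv.DictReader(stream)
    return {row["word"]: int(row["frequency"]) for row in reader}


# A small in-memory stand-in for a file exported from Excel/Sheets.
sample = io.StringIO("word,frequency\nhello,10\nworld,5\n")
frequencies = load_word_frequencies(sample)
print(frequencies)
```

For a TSV export, passing `delimiter="\t"` to `csv.DictReader` should be the only change needed.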
DictionaryProvider Interface
All providers implement the DictionaryProvider abstract base class.
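The exact method signatures are not reproduced here. As a minimal sketch of the pattern, the block below defines a stand-in abstract base class and a toy RAM-backed implementation. The method names (`is_valid_word`, `get_word_frequency`, `get_bigram_probability`), the class `ToyMemoryProvider`, and the helper `known_words` are hypothetical and chosen only to mirror the operations discussed on this page; consult the library source for the real interface.

```python
from abc import ABC, abstractmethod


class DictionaryProvider(ABC):
    """Illustrative stand-in for the storage interface (method names are assumed)."""

    @abstractmethod
    def is_valid_word(self, word: str) -> bool:
        """Return True if the word exists in the vocabulary."""

    @abstractmethod
    def get_word_frequency(self, word: str) -> int:
        """Return the stored frequency of the word, or 0 if unknown."""

    @abstractmethod
    def get_bigram_probability(self, word1: str, word2: str) -> float:
        """Return the stored probability for the pair, or 0.0 if unknown."""


class ToyMemoryProvider(DictionaryProvider):
    """Hypothetical RAM-backed implementation built on plain Python dicts."""

    def __init__(self, words: dict[str, int], bigrams: dict[tuple[str, str], float]):
        self._words = words
        self._bigrams = bigrams

    def is_valid_word(self, word: str) -> bool:
        return word in self._words

    def get_word_frequency(self, word: str) -> int:
        return self._words.get(word, 0)

    def get_bigram_probability(self, word1: str, word2: str) -> float:
        return self._bigrams.get((word1, word2), 0.0)


def known_words(provider: DictionaryProvider, tokens: list[str]) -> list[str]:
    """Caller code depends only on the interface, so any backend can be passed in."""
    return [t for t in tokens if provider.is_valid_word(t)]
```

Because `known_words` only sees the abstract type, a disk-backed or file-backed provider could be substituted without changing the calling code.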
Configuration
You can switch providers during initialization.
Caching
The SQLiteProvider uses an LRU cache to speed up repeated lookups. Configure it via ProviderConfig.
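To show the general idea of an LRU cache in front of SQLite, here is a minimal sketch built on `functools.lru_cache` and the standard `sqlite3` module. It is not the library's implementation: the class `CachedWordLookup`, its `cache_size` parameter, and the `db_hits` counter are hypothetical, and the fields of ProviderConfig are not shown here. The `words` table uses only the `word` and `frequency` columns from the schema described on this page.

```python
import sqlite3
from functools import lru_cache


class CachedWordLookup:
    """Hypothetical helper: an LRU cache in front of SQLite word lookups."""

    def __init__(self, conn: sqlite3.Connection, cache_size: int = 1024):
        self._conn = conn
        self.db_hits = 0  # counts lookups that actually reached SQLite
        # Wrap the bound method so each instance owns its own cache.
        self.frequency = lru_cache(maxsize=cache_size)(self._query_frequency)

    def _query_frequency(self, word: str) -> int:
        self.db_hits += 1
        row = self._conn.execute(
            "SELECT frequency FROM words WHERE word = ?", (word,)
        ).fetchone()
        return row[0] if row else 0


# Usage with a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT UNIQUE, frequency INTEGER)")
conn.execute("INSERT INTO words VALUES ('hello', 42)")

lookup = CachedWordLookup(conn, cache_size=2)
lookup.frequency("hello")  # first call queries SQLite
lookup.frequency("hello")  # second call is served from the cache
print(lookup.db_hits)
```

A larger cache trades memory for fewer disk reads, which is the same trade-off the provider comparison describes between the SQLite and memory backends.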
Performance Benchmarks
| Operation | SQLite | Memory | JSON |
|---|---|---|---|
| Word lookup | ~0.1ms | ~0.01ms | ~1ms |
| Syllable check | ~0.05ms | ~0.005ms | ~0.5ms |
| Bigram probability | ~0.2ms | ~0.02ms | ~2ms |
| Suggestions (top 5) | ~5ms | ~1ms | ~50ms |
Database Schema
If you wish to inspect the database directly or build one manually, here is the SQLite schema:
syllables
Stores unique syllables and their frequencies.
- id: Integer (PK)
- syllable: Text (Unique)
- frequency: Integer
words
Stores valid words, frequency data, and POS tags.
- id: Integer (PK)
- word: Text (Unique)
- syllable_count: Integer
- frequency: Integer
- pos_tag: Text (Optional, e.g., 'N', 'V')
- is_curated: Integer (0 or 1, default 0)
- inferred_pos: Text (POS tag from inference)
- inferred_confidence: Real (confidence score)
- inferred_source: Text (inference method used)
bigrams
Stores 2-word sequences and their probabilities.
- id: Integer (PK)
- word1_id: Integer (FK -> words.id)
- word2_id: Integer (FK -> words.id)
- probability: Real ()
- count: Integer (Raw frequency)
trigrams
Stores 3-word sequences.
- id: Integer (PK)
- word1_id, word2_id, word3_id: Integers (FK -> words.id)
- probability: Real ()
- count: Integer
processed_files
Tracks ingested files for incremental updates.
- path: Text (PK)
- mtime: Real
- size: Integer
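For building a database by hand, the tables above can be sketched as SQLite DDL. The column names and types follow the descriptions on this page; anything the page does not state (NOT NULL constraints, indexes, ON DELETE behavior) is left out rather than guessed, so the real database may differ. The inserted rows are made-up sample data for illustration only.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS syllables (
    id INTEGER PRIMARY KEY,
    syllable TEXT UNIQUE,
    frequency INTEGER
);
CREATE TABLE IF NOT EXISTS words (
    id INTEGER PRIMARY KEY,
    word TEXT UNIQUE,
    syllable_count INTEGER,
    frequency INTEGER,
    pos_tag TEXT,
    is_curated INTEGER DEFAULT 0,
    inferred_pos TEXT,
    inferred_confidence REAL,
    inferred_source TEXT
);
CREATE TABLE IF NOT EXISTS bigrams (
    id INTEGER PRIMARY KEY,
    word1_id INTEGER REFERENCES words(id),
    word2_id INTEGER REFERENCES words(id),
    probability REAL,
    count INTEGER
);
CREATE TABLE IF NOT EXISTS trigrams (
    id INTEGER PRIMARY KEY,
    word1_id INTEGER REFERENCES words(id),
    word2_id INTEGER REFERENCES words(id),
    word3_id INTEGER REFERENCES words(id),
    probability REAL,
    count INTEGER
);
CREATE TABLE IF NOT EXISTS processed_files (
    path TEXT PRIMARY KEY,
    mtime REAL,
    size INTEGER
);
"""

conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
conn.executescript(SCHEMA)

# Insert two sample words and one bigram linking them by id.
insert_word = "INSERT INTO words (word, syllable_count, frequency) VALUES (?, ?, ?)"
w1 = conn.execute(insert_word, ("hello", 2, 10)).lastrowid
w2 = conn.execute(insert_word, ("world", 1, 5)).lastrowid
conn.execute(
    "INSERT INTO bigrams (word1_id, word2_id, probability, count) VALUES (?, ?, ?, ?)",
    (w1, w2, 0.25, 3),
)

# Resolve the bigram's word ids back to text with a join on the words table.
row = conn.execute(
    """
    SELECT a.word, b.word, g.probability
    FROM bigrams AS g
    JOIN words AS a ON a.id = g.word1_id
    JOIN words AS b ON b.id = g.word2_id
    """
).fetchone()
print(row)
```

Storing word ids instead of repeated text in the n-gram tables keeps those tables compact, at the cost of a join when reading them back.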