The DictionaryProvider interface abstracts the storage of vocabulary and frequency data. This allows mySpellChecker to run in different environments (server, desktop, mobile, embedded) by swapping the backend.
Provider Comparison
| Provider | Storage | Speed | Memory | Use Case |
|---|---|---|---|---|
| SQLiteProvider | Disk | Fast | Low | Production (default) |
| MemoryProvider | RAM | Fastest | High | Testing, small dictionaries |
| JSONProvider | File | Slow | Medium | Development, debugging |
| CSVProvider | File | Slow | Medium | Data import/export |
Types of Providers
1. SQLiteProvider (Default)
- Storage: Disk-based (.db file).
- Pros: Low memory footprint. Handles massive datasets (millions of N-grams).
- Cons: Slower than in-memory lookups, though the LRU cache (see Caching) mitigates this for repeated lookups.
- Use Case: General purpose, desktop apps, web servers with limited RAM.
from myspellchecker.providers import SQLiteProvider
# Basic usage
provider = SQLiteProvider("/path/to/dictionary.db")
# With connection pooling
provider = SQLiteProvider(
    database_path="/path/to/dictionary.db",
    pool_max_size=5,          # Maximum connections in pool (default: 5)
    pool_timeout=5.0,         # Checkout timeout in seconds
    sqlite_timeout=30.0,      # SQLite busy timeout in seconds
    check_same_thread=False,  # Allow use from multiple threads
)
# Check word validity
is_valid = provider.is_valid_word("မြန်မာ")
if is_valid:
    print(f"Frequency: {provider.get_word_frequency('မြန်မာ')}")
    print(f"POS tag: {provider.get_word_pos('မြန်မာ')}")
# Check syllable validity
is_valid = provider.is_valid_syllable("မြန်")
# Get n-gram probability
prob = provider.get_bigram_probability("မြန်မာ", "နိုင်ငံ")
2. MemoryProvider
- Storage: RAM (Python Dictionary).
- Pros: Fastest lookups (hash map, no disk I/O).
- Cons: High memory usage. Long startup time (loading data into RAM).
- Use Case: Testing and small dictionaries; also latency-critical servers where RAM is abundant.
from myspellchecker.providers import MemoryProvider
# Create empty provider
provider = MemoryProvider()
# Add data programmatically
provider.add_word("မြန်မာ", frequency=1000)
provider.add_word_pos("မြန်မာ", "N") # Add POS tag separately
provider.add_syllable("မြန်")
provider.add_bigram("မြန်မာ", "နိုင်ငံ", probability=0.5)
# Load from lists (for testing or custom data)
syllables = [("မြန်", 1000), ("မာ", 800)] # (syllable, frequency) tuples
words = [("မြန်မာ", 500)] # (word, frequency) tuples
bigrams = [("မြန်မာ", "နိုင်ငံ", 0.5)] # (prev, curr, probability) tuples
provider.load_from_lists(syllable_list=syllables, word_list=words, bigram_list=bigrams)
Memory Usage:
| Data | Approx. Memory |
|---|---|
| 100K words | ~50 MB |
| 1M bigrams | ~200 MB |
| 1M trigrams | ~300 MB |
3. JSONProvider
- Storage: JSON file.
- Pros: Human-readable, easy to edit/debug.
- Cons: Slow to load, memory inefficient for large datasets.
- Use Case: Unit testing, small custom vocabularies, config files.
from myspellchecker.providers import JSONProvider
# Load from JSON file
provider = JSONProvider("/path/to/dictionary.json")
JSON Format:
{
  "syllables": {
    "မြန်": 15432,
    "မာ": 12341
  },
  "words": {
    "မြန်မာ": {"frequency": 8752, "syllable_count": 2},
    "နိုင်ငံ": {"frequency": 12341, "syllable_count": 2}
  },
  "bigrams": {
    "သူ|သွား": 0.234,
    "သူ|ဘယ်": 0.012
  }
}
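To make the layout concrete, here is a minimal sketch that reads a document in this format with the standard json module. It is not how JSONProvider loads data internally; the parse_dictionary helper is hypothetical, and the sketch assumes that bigram keys join the two words with a single "|" and that words never contain that character.

```python
import json

# Sample document in the format shown above (values taken from that example).
SAMPLE = """
{
  "syllables": {"မြန်": 15432, "မာ": 12341},
  "words": {
    "မြန်မာ": {"frequency": 8752, "syllable_count": 2}
  },
  "bigrams": {"သူ|သွား": 0.234}
}
"""


def parse_dictionary(text):
    """Hypothetical helper: split the JSON document into three lookup tables."""
    data = json.loads(text)
    syllables = dict(data.get("syllables", {}))
    # Keep only the frequency of each word entry.
    words = {word: entry["frequency"] for word, entry in data.get("words", {}).items()}
    # Assumption: "prev|curr" keys split once on "|" into a (prev, curr) tuple.
    bigrams = {
        tuple(key.split("|", 1)): probability
        for key, probability in data.get("bigrams", {}).items()
    }
    return syllables, words, bigrams


syllables, words, bigrams = parse_dictionary(SAMPLE)
```

Keying bigrams by a tuple rather than the raw "prev|curr" string avoids re-joining strings on every lookup.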
4. CSVProvider
- Storage: CSV/TSV file.
- Pros: Easy to export from spreadsheets.
- Cons: Like JSON, slow to load and memory-inefficient for large datasets.
- Use Case: Importing word lists from Excel/Sheets.
from myspellchecker.providers import CSVProvider
provider = CSVProvider(
    syllables_csv="/path/to/syllables.csv",
    words_csv="/path/to/words.csv",
    bigrams_csv="/path/to/bigrams.csv",
)
DictionaryProvider Interface
All providers implement the DictionaryProvider abstract base class:
# The real class is importable with:
#   from myspellchecker.providers import DictionaryProvider
# The outline below lists its abstract methods.
from abc import ABC, abstractmethod
from typing import Iterator, List, Optional, Tuple


class DictionaryProvider(ABC):
    @abstractmethod
    def is_valid_word(self, word: str) -> bool: ...

    @abstractmethod
    def is_valid_syllable(self, syllable: str) -> bool: ...

    @abstractmethod
    def get_word_frequency(self, word: str) -> int: ...

    @abstractmethod
    def get_syllable_frequency(self, syllable: str) -> int: ...

    @abstractmethod
    def get_word_pos(self, word: str) -> Optional[str]: ...

    @abstractmethod
    def get_bigram_probability(self, prev_word: str, current_word: str) -> float: ...

    @abstractmethod
    def get_trigram_probability(self, w1: str, w2: str, w3: str) -> float: ...

    @abstractmethod
    def get_fourgram_probability(self, word1: str, word2: str, word3: str, word4: str) -> float: ...

    @abstractmethod
    def get_fivegram_probability(self, word1: str, word2: str, word3: str, word4: str, word5: str) -> float: ...

    @abstractmethod
    def get_pos_unigram_probabilities(self) -> dict[str, float]: ...

    @abstractmethod
    def get_pos_bigram_probabilities(self) -> dict[tuple[str, str], float]: ...

    @abstractmethod
    def get_pos_trigram_probabilities(self) -> dict[tuple[str, str, str], float]: ...

    @abstractmethod
    def get_top_continuations(self, prev_word: str, limit: int = 20) -> List[Tuple[str, float]]: ...

    @abstractmethod
    def get_all_syllables(self) -> Iterator[Tuple[str, int]]: ...

    @abstractmethod
    def get_all_words(self) -> Iterator[Tuple[str, int]]: ...
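As an illustration of what a custom backend might look like, the sketch below implements a handful of the methods listed above on top of plain dictionaries. It is standalone and does not subclass the real DictionaryProvider, so it runs without the library installed; a real provider would subclass it and implement every abstract method. The TinyProvider class and its put_word and put_bigram helpers are hypothetical, and returning 0 or 0.0 for unknown entries is an assumption of this sketch, not documented library behavior.

```python
from typing import Dict, Iterator, List, Optional, Tuple


class TinyProvider:
    """Hypothetical in-memory backend mirroring part of the interface."""

    def __init__(self) -> None:
        self._words: Dict[str, int] = {}
        self._pos: Dict[str, str] = {}
        self._bigrams: Dict[Tuple[str, str], float] = {}

    # --- hypothetical loading helpers (not part of the interface) ---
    def put_word(self, word: str, frequency: int, pos: Optional[str] = None) -> None:
        self._words[word] = frequency
        if pos is not None:
            self._pos[word] = pos

    def put_bigram(self, prev_word: str, current_word: str, probability: float) -> None:
        self._bigrams[(prev_word, current_word)] = probability

    # --- methods named after the interface ---
    def is_valid_word(self, word: str) -> bool:
        return word in self._words

    def get_word_frequency(self, word: str) -> int:
        # Assumption: unknown words report a frequency of 0.
        return self._words.get(word, 0)

    def get_word_pos(self, word: str) -> Optional[str]:
        return self._pos.get(word)

    def get_bigram_probability(self, prev_word: str, current_word: str) -> float:
        # Assumption: unseen bigrams report a probability of 0.0.
        return self._bigrams.get((prev_word, current_word), 0.0)

    def get_top_continuations(self, prev_word: str, limit: int = 20) -> List[Tuple[str, float]]:
        # Collect every bigram starting with prev_word, most probable first.
        matches = [(curr, p) for (prev, curr), p in self._bigrams.items() if prev == prev_word]
        matches.sort(key=lambda item: item[1], reverse=True)
        return matches[:limit]

    def get_all_words(self) -> Iterator[Tuple[str, int]]:
        return iter(self._words.items())


provider = TinyProvider()
provider.put_word("မြန်မာ", 1000, pos="N")
provider.put_bigram("မြန်မာ", "နိုင်ငံ", 0.5)
```

Because callers only rely on the method names and return types, a backend like this can be swapped for a disk-based one without changing the calling code.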
Configuration
You can switch providers during initialization:
from myspellchecker import SpellChecker
from myspellchecker.providers import MemoryProvider, JSONProvider
from myspellchecker.core.config import SpellCheckerConfig, ProviderConfig
# Direct provider injection with data
provider = MemoryProvider()
# Load data from lists (syllables, words, bigrams)
provider.load_from_lists(
    syllable_list=[("မြန်", 1500), ("မာ", 2300)],
    word_list=[("မြန်မာ", 850)],
)
checker = SpellChecker(provider=provider)
# Via configuration
config = SpellCheckerConfig(
    provider_config=ProviderConfig(
        cache_size=10000,
        pool_max_size=5,
        pool_timeout=5.0,
    )
)
checker = SpellChecker(config=config)
Caching
The SQLiteProvider uses an LRU cache to speed up repeated lookups. Configure via ProviderConfig:
from myspellchecker.core.config.validation_configs import ProviderConfig
provider_config = ProviderConfig(
    cache_size=10000  # Number of cached entries (0 to disable)
)
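To show what an LRU cache in front of SQLite buys you, here is a minimal standalone sketch using functools.lru_cache and the standard sqlite3 module. It is not the library's implementation; the one-table layout, the word_frequency helper, and the query_count counter are hypothetical and exist only to make the effect visible.

```python
import sqlite3
from functools import lru_cache

# Throwaway in-memory database with a single sample row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT UNIQUE, frequency INTEGER)")
conn.execute("INSERT INTO words VALUES (?, ?)", ("မြန်မာ", 8752))

query_count = 0  # counts how often the database is actually queried


@lru_cache(maxsize=10000)
def word_frequency(word: str) -> int:
    """Hypothetical lookup: hits SQLite only when the word is not cached."""
    global query_count
    query_count += 1
    row = conn.execute(
        "SELECT frequency FROM words WHERE word = ?", (word,)
    ).fetchone()
    return row[0] if row else 0


word_frequency("မြန်မာ")  # cache miss: runs the SQL query
word_frequency("မြန်မာ")  # cache hit: served from memory, no query
```

After the two calls, query_count is 1 and word_frequency.cache_info() reports one hit and one miss, which is the behavior that makes repeated lookups cheap.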
Approximate latency per operation for each provider:
| Operation | SQLite | Memory | JSON |
|---|---|---|---|
| Word lookup | ~0.1ms | ~0.01ms | ~1ms |
| Syllable check | ~0.05ms | ~0.005ms | ~0.5ms |
| Bigram probability | ~0.2ms | ~0.02ms | ~2ms |
| Suggestions (top 5) | ~5ms | ~1ms | ~50ms |
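The figures in the table are approximate and will vary with hardware and dictionary size. One rough way to take comparable measurements on your own machine is sketched below with the standard timeit and sqlite3 modules. It compares a plain dict against an in-memory SQLite table rather than the library's providers; the synthetic word list and both lookup helpers are hypothetical.

```python
import sqlite3
import timeit

# Synthetic vocabulary: 10,000 placeholder words with made-up frequencies.
words = {f"word{i}": i for i in range(10_000)}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, frequency INTEGER)")
conn.executemany("INSERT INTO words VALUES (?, ?)", words.items())


def dict_lookup():
    """Hash map lookup, comparable to an in-memory backend."""
    return words.get("word5000")


def sqlite_lookup():
    """Indexed SQL lookup, comparable to an uncached disk-backed backend."""
    row = conn.execute(
        "SELECT frequency FROM words WHERE word = ?", ("word5000",)
    ).fetchone()
    return row[0] if row else None


runs = 1000
# timeit returns total seconds for all runs; convert to milliseconds per call.
dict_ms = timeit.timeit(dict_lookup, number=runs) / runs * 1000
sqlite_ms = timeit.timeit(sqlite_lookup, number=runs) / runs * 1000
print(f"dict: {dict_ms:.5f} ms/lookup, sqlite: {sqlite_ms:.5f} ms/lookup")
```

An in-memory SQLite database avoids disk I/O entirely, so treat the SQLite number here as a lower bound for a file-backed database.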
Database Schema
To inspect the database directly or build one manually, use the following SQLite schema:
syllables
Stores unique syllables and their frequencies.
id: Integer (PK)
syllable: Text (Unique)
frequency: Integer
words
Stores valid words, frequency data, and POS tags.
id: Integer (PK)
word: Text (Unique)
syllable_count: Integer
frequency: Integer
pos_tag: Text (Optional, e.g., 'N', 'V')
is_curated: Integer (0 or 1, default 0)
inferred_pos: Text (POS tag from inference)
inferred_confidence: Real (confidence score)
inferred_source: Text (inference method used)
bigrams
Stores 2-word sequences and their probabilities.
id: Integer (PK)
word1_id: Integer (FK -> words.id)
word2_id: Integer (FK -> words.id)
probability: Real, P(w2 | w1)
count: Integer (Raw frequency)
trigrams
Stores 3-word sequences.
id: Integer (PK)
word1_id, word2_id, word3_id: Integers (FK -> words.id)
probability: Real, P(w3 | w1, w2)
count: Integer
processed_files
Tracks ingested files for incremental updates.
path: Text (PK)
mtime: Real
size: Integer
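For building a database manually, the schema can be sketched as CREATE TABLE statements run through the standard sqlite3 module. The column names and types follow the listing above; constraints, indexes, and anything else not listed there are assumptions, so compare against a database produced by the library before relying on it. The inserted rows are sample values for illustration only.

```python
import sqlite3

# Assumption: DDL reconstructed from the column listing; the real schema
# may add indexes or constraints that are not shown here.
SCHEMA = """
CREATE TABLE syllables (
    id INTEGER PRIMARY KEY,
    syllable TEXT UNIQUE,
    frequency INTEGER
);
CREATE TABLE words (
    id INTEGER PRIMARY KEY,
    word TEXT UNIQUE,
    syllable_count INTEGER,
    frequency INTEGER,
    pos_tag TEXT,
    is_curated INTEGER DEFAULT 0,
    inferred_pos TEXT,
    inferred_confidence REAL,
    inferred_source TEXT
);
CREATE TABLE bigrams (
    id INTEGER PRIMARY KEY,
    word1_id INTEGER REFERENCES words(id),
    word2_id INTEGER REFERENCES words(id),
    probability REAL,
    "count" INTEGER
);
CREATE TABLE trigrams (
    id INTEGER PRIMARY KEY,
    word1_id INTEGER REFERENCES words(id),
    word2_id INTEGER REFERENCES words(id),
    word3_id INTEGER REFERENCES words(id),
    probability REAL,
    "count" INTEGER
);
CREATE TABLE processed_files (
    path TEXT PRIMARY KEY,
    mtime REAL,
    size INTEGER
);
"""

conn = sqlite3.connect(":memory:")  # use a file path to persist the database
conn.executescript(SCHEMA)

# Sample rows: two words and one bigram linking them by id.
insert_word = "INSERT INTO words (word, syllable_count, frequency) VALUES (?, ?, ?)"
w1_id = conn.execute(insert_word, ("မြန်မာ", 2, 8752)).lastrowid
w2_id = conn.execute(insert_word, ("နိုင်ငံ", 2, 12341)).lastrowid
conn.execute(
    'INSERT INTO bigrams (word1_id, word2_id, probability, "count") VALUES (?, ?, ?, ?)',
    (w1_id, w2_id, 0.5, 10),
)

# Look up P(w2 | w1) by word text: bigrams store ids, so join back to words.
row = conn.execute(
    """
    SELECT b.probability
    FROM bigrams AS b
    JOIN words AS w1 ON w1.id = b.word1_id
    JOIN words AS w2 ON w2.id = b.word2_id
    WHERE w1.word = ? AND w2.word = ?
    """,
    ("မြန်မာ", "နိုင်ငံ"),
).fetchone()
```

Storing word ids instead of word text in the n-gram tables keeps those rows small, at the cost of a join whenever a lookup starts from the text.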