The DictionaryProvider interface abstracts the storage of vocabulary and frequency data. This allows mySpellChecker to run in different environments (server, desktop, mobile, embedded) by swapping the backend.

Provider Comparison

| Provider       | Storage | Speed   | Memory | Use Case                     |
|----------------|---------|---------|--------|------------------------------|
| SQLiteProvider | Disk    | Fast    | Low    | Production (default)         |
| MemoryProvider | RAM     | Fastest | High   | Testing, small dictionaries  |
| JSONProvider   | File    | Slow    | Medium | Development, debugging       |
| CSVProvider    | File    | Slow    | Medium | Data import/export           |

Types of Providers

1. SQLiteProvider (Default)

  • Storage: Disk-based (.db file).
  • Pros: Low memory footprint. Handles massive datasets (millions of n-grams).
  • Cons: Slightly slower than RAM (but mitigated by caching).
  • Use Case: General purpose, desktop apps, web servers with limited RAM.
from myspellchecker.providers import SQLiteProvider

# Basic usage
provider = SQLiteProvider("/path/to/dictionary.db")

# With connection pooling
provider = SQLiteProvider(
    database_path="/path/to/dictionary.db",
    pool_max_size=5,           # Maximum connections in pool (default: 5)
    pool_timeout=5.0,          # Checkout timeout in seconds
    sqlite_timeout=30.0,       # SQLite busy timeout
    check_same_thread=False    # Allow multi-threading
)

# Check word validity
is_valid = provider.is_valid_word("မြန်မာ")
if is_valid:
    print(f"Frequency: {provider.get_word_frequency('မြန်မာ')}")
    print(f"POS tag: {provider.get_word_pos('မြန်မာ')}")

# Check syllable validity
is_valid = provider.is_valid_syllable("မြန်")

# Get n-gram probability
prob = provider.get_bigram_probability("မြန်မာ", "နိုင်ငံ")

2. MemoryProvider

  • Storage: RAM (Python Dictionary).
  • Pros: Extremely fast (hash map lookup).
  • Cons: High memory usage. Long startup time (loading data into RAM).
  • Use Case: High-performance servers where RAM is abundant and latency must be minimized.
from myspellchecker.providers import MemoryProvider

# Create empty provider
provider = MemoryProvider()

# Add data programmatically
provider.add_word("မြန်မာ", frequency=1000)
provider.add_word_pos("မြန်မာ", "N")  # Add POS tag separately
provider.add_syllable("မြန်")
provider.add_bigram("မြန်မာ", "နိုင်ငံ", probability=0.5)

# Load from lists (for testing or custom data)
syllables = [("မြန်", 1000), ("မာ", 800)]  # (syllable, frequency) tuples
words = [("မြန်မာ", 500)]                   # (word, frequency) tuples
bigrams = [("မြန်မာ", "နိုင်ငံ", 0.5)]       # (prev, curr, probability) tuples
provider.load_from_lists(syllables, words, bigrams)
Memory Usage:
| Data        | Approx. Memory |
|-------------|----------------|
| 100K words  | ~50 MB         |
| 1M bigrams  | ~200 MB        |
| 1M trigrams | ~300 MB        |

3. JSONProvider

  • Storage: JSON file.
  • Pros: Human-readable, easy to edit/debug.
  • Cons: Slow to load, memory inefficient for large datasets.
  • Use Case: Unit testing, small custom vocabularies, config files.
from myspellchecker.providers import JSONProvider

# Load from JSON file
provider = JSONProvider("/path/to/dictionary.json")
JSON Format:
{
  "syllables": {
    "မြန်": 15432,
    "မာ": 12341
  },
  "words": {
    "မြန်မာ": {"frequency": 8752, "syllable_count": 2},
    "နိုင်ငံ": {"frequency": 12341, "syllable_count": 2}
  },
  "bigrams": {
    "သူ|သွား": 0.234,
    "သူ|ဘယ်": 0.012
  }
}
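A dictionary file in the format above can be produced with the stdlib `json` module. A minimal sketch (the file name and data are illustrative; bigram keys join the two words with `|` as in the example above):

```python
import json

# Build a tiny dictionary matching the documented JSON layout.
data = {
    "syllables": {"မြန်": 15432, "မာ": 12341},
    "words": {
        "မြန်မာ": {"frequency": 8752, "syllable_count": 2},
    },
    # Bigram keys are "prev|curr" pairs.
    "bigrams": {"မြန်မာ|နိုင်ငံ": 0.5},
}

with open("dictionary.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# Round-trip check: the loaded data matches what was written.
with open("dictionary.json", encoding="utf-8") as f:
    loaded = json.load(f)
assert loaded["words"]["မြန်မာ"]["frequency"] == 8752
```

`ensure_ascii=False` keeps the Burmese text human-readable in the file, which is the main reason to use this provider in the first place.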

4. CSVProvider

  • Storage: CSV/TSV file.
  • Pros: Easy to export from spreadsheets.
  • Cons: Similar performance issues to JSON for large data.
  • Use Case: Importing word lists from Excel/Sheets.
from myspellchecker.providers import CSVProvider

provider = CSVProvider(
    syllables_csv="/path/to/syllables.csv",
    words_csv="/path/to/words.csv",
    bigrams_csv="/path/to/bigrams.csv",
)
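Word lists exported from a spreadsheet can be written with the stdlib `csv` module. A sketch for `words.csv` — note the column layout (`word,frequency`) is an assumption here; check the column names expected by your installed version:

```python
import csv

# Assumed layout: a header row, then one (word, frequency) pair per line.
rows = [("မြန်မာ", 8752), ("နိုင်ငံ", 12341)]

with open("words.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "frequency"])
    writer.writerows(rows)

# Read it back to verify the file parses.
with open("words.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    freqs = {row["word"]: int(row["frequency"]) for row in reader}
assert freqs["မြန်မာ"] == 8752
```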

DictionaryProvider Interface

All providers implement the DictionaryProvider abstract base class:
from abc import ABC, abstractmethod
from typing import Iterator, List, Optional, Tuple

# Importable as: from myspellchecker.providers import DictionaryProvider

class DictionaryProvider(ABC):
    @abstractmethod
    def is_valid_word(self, word: str) -> bool: ...

    @abstractmethod
    def is_valid_syllable(self, syllable: str) -> bool: ...

    @abstractmethod
    def get_word_frequency(self, word: str) -> int: ...

    @abstractmethod
    def get_syllable_frequency(self, syllable: str) -> int: ...

    @abstractmethod
    def get_word_pos(self, word: str) -> Optional[str]: ...

    @abstractmethod
    def get_bigram_probability(self, prev_word: str, current_word: str) -> float: ...

    @abstractmethod
    def get_trigram_probability(self, w1: str, w2: str, w3: str) -> float: ...

    @abstractmethod
    def get_top_continuations(self, prev_word: str, limit: int = 20) -> List[Tuple[str, float]]: ...

    @abstractmethod
    def get_all_syllables(self) -> Iterator[Tuple[str, int]]: ...

    @abstractmethod
    def get_all_words(self) -> Iterator[Tuple[str, int]]: ...
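To illustrate the contract, here is a toy in-memory implementation of the core lookup methods backed by plain dicts. This is a self-contained sketch (the trigram and iteration methods are omitted for brevity); real code would subclass `DictionaryProvider` and implement every abstract method:

```python
from typing import List, Optional, Tuple

class DictBackedProvider:
    """Toy provider implementing the DictionaryProvider lookup contract."""

    def __init__(self, words, syllables, bigrams):
        self._words = words          # word -> (frequency, pos_tag)
        self._syllables = syllables  # syllable -> frequency
        self._bigrams = bigrams      # (prev, curr) -> probability

    def is_valid_word(self, word: str) -> bool:
        return word in self._words

    def is_valid_syllable(self, syllable: str) -> bool:
        return syllable in self._syllables

    def get_word_frequency(self, word: str) -> int:
        return self._words.get(word, (0, None))[0]

    def get_syllable_frequency(self, syllable: str) -> int:
        return self._syllables.get(syllable, 0)

    def get_word_pos(self, word: str) -> Optional[str]:
        return self._words.get(word, (0, None))[1]

    def get_bigram_probability(self, prev_word: str, current_word: str) -> float:
        return self._bigrams.get((prev_word, current_word), 0.0)

    def get_top_continuations(self, prev_word: str, limit: int = 20) -> List[Tuple[str, float]]:
        conts = [(curr, p) for (prev, curr), p in self._bigrams.items() if prev == prev_word]
        return sorted(conts, key=lambda x: -x[1])[:limit]

provider = DictBackedProvider(
    words={"မြန်မာ": (8752, "N")},
    syllables={"မြန်": 15432, "မာ": 12341},
    bigrams={("မြန်မာ", "နိုင်ငံ"): 0.5},
)
assert provider.is_valid_word("မြန်မာ")
assert provider.get_top_continuations("မြန်မာ") == [("နိုင်ငံ", 0.5)]
```

Unknown words and bigrams return `0`/`0.0` rather than raising, which matches how the lookup methods are used in the examples above.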

Configuration

You can switch providers during initialization:
from myspellchecker import SpellChecker
from myspellchecker.providers import MemoryProvider, JSONProvider
from myspellchecker.core.config import SpellCheckerConfig, ProviderConfig

# Direct provider injection with data
provider = MemoryProvider()
# Load data from lists (syllables, words, bigrams)
provider.load_from_lists(
    syllable_list=[("မြန်", 1500), ("မာ", 2300)],
    word_list=[("မြန်မာ", 850)],
)
checker = SpellChecker(provider=provider)

# Via configuration
config = SpellCheckerConfig(
    provider_config=ProviderConfig(
        cache_size=10000,
        pool_max_size=5,
        pool_timeout=5.0,
    )
)
checker = SpellChecker(config=config)

Caching

The SQLiteProvider uses an LRU cache to speed up repeated lookups. Configure via ProviderConfig:
from myspellchecker.core.config.validation_configs import ProviderConfig

provider_config = ProviderConfig(
    cache_size=10000  # Number of cached entries (0 to disable)
)
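The effect of an LRU cache on repeated lookups can be illustrated with the stdlib `functools.lru_cache`. This is a sketch of the caching pattern, not the provider's actual internals:

```python
from functools import lru_cache

backend_hits = {"count": 0}  # track how often the "slow" backend is queried

@lru_cache(maxsize=10000)
def cached_lookup(word: str) -> bool:
    backend_hits["count"] += 1   # stands in for a disk/SQLite query
    return word == "မြန်မာ"

cached_lookup("မြန်မာ")
cached_lookup("မြန်မာ")  # served from the cache; backend is not hit again
assert backend_hits["count"] == 1
print(cached_lookup.cache_info())  # hits=1, misses=1
```

Repeated lookups of the same word (common during suggestion ranking) hit the cache instead of the database, which is why the SQLite benchmarks below stay close to in-memory speeds.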

Performance Benchmarks

| Operation           | SQLite  | Memory   | JSON   |
|---------------------|---------|----------|--------|
| Word lookup         | ~0.1ms  | ~0.01ms  | ~1ms   |
| Syllable check      | ~0.05ms | ~0.005ms | ~0.5ms |
| Bigram probability  | ~0.2ms  | ~0.02ms  | ~2ms   |
| Suggestions (top 5) | ~5ms    | ~1ms     | ~50ms  |

Database Schema

If you wish to inspect the database directly or build one manually, here is the SQLite schema:

syllables

Stores unique syllables and their frequencies.
  • id: Integer (PK)
  • syllable: Text (Unique)
  • frequency: Integer

words

Stores valid words, frequency data, and POS tags.
  • id: Integer (PK)
  • word: Text (Unique)
  • syllable_count: Integer
  • frequency: Integer
  • pos_tag: Text (Optional, e.g., ‘N’, ‘V’)
  • is_curated: Integer (0 or 1, default 0)
  • inferred_pos: Text (POS tag from inference)
  • inferred_confidence: Real (confidence score)
  • inferred_source: Text (inference method used)

bigrams

Stores 2-word sequences and their probabilities.
  • id: Integer (PK)
  • word1_id: Integer (FK -> words.id)
  • word2_id: Integer (FK -> words.id)
  • probability: Real (P(w2|w1))
  • count: Integer (Raw frequency)

trigrams

Stores 3-word sequences.
  • id: Integer (PK)
  • word1_id, word2_id, word3_id: Integers (FK -> words.id)
  • probability: Real (P(w3|w1,w2))
  • count: Integer

processed_files

Tracks ingested files for incremental updates.
  • path: Text (PK)
  • mtime: Real
  • size: Integer
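A sketch of building the core tables by hand with the stdlib `sqlite3` module. The column names and types follow the schema above; any constraints beyond those listed (and the trigrams/processed_files tables, omitted here) are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a .db path for a real dictionary
conn.executescript("""
CREATE TABLE syllables (
    id INTEGER PRIMARY KEY,
    syllable TEXT UNIQUE,
    frequency INTEGER
);
CREATE TABLE words (
    id INTEGER PRIMARY KEY,
    word TEXT UNIQUE,
    syllable_count INTEGER,
    frequency INTEGER,
    pos_tag TEXT,
    is_curated INTEGER DEFAULT 0,
    inferred_pos TEXT,
    inferred_confidence REAL,
    inferred_source TEXT
);
CREATE TABLE bigrams (
    id INTEGER PRIMARY KEY,
    word1_id INTEGER REFERENCES words(id),
    word2_id INTEGER REFERENCES words(id),
    probability REAL,
    count INTEGER
);
""")

# Insert one word and read it back.
conn.execute(
    "INSERT INTO words (word, syllable_count, frequency, pos_tag) VALUES (?, ?, ?, ?)",
    ("မြန်မာ", 2, 8752, "N"),
)
row = conn.execute(
    "SELECT frequency, pos_tag FROM words WHERE word = ?", ("မြန်မာ",)
).fetchone()
assert row == (8752, "N")
```

Storing bigrams and trigrams as integer foreign keys into `words` (rather than raw text) keeps the n-gram tables compact, which is what lets SQLiteProvider handle millions of n-grams at a low memory footprint.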