The DictionaryProvider interface abstracts the storage of vocabulary and frequency data. This allows mySpellChecker to run in different environments (server, desktop, mobile, embedded) by swapping the backend.

Provider Comparison

| Provider       | Storage | Speed   | Memory | Use Case                     |
|----------------|---------|---------|--------|------------------------------|
| SQLiteProvider | Disk    | Fast    | Low    | Production (default)         |
| MemoryProvider | RAM     | Fastest | High   | Testing, small dictionaries  |
| JSONProvider   | File    | Slow    | Medium | Development, debugging       |
| CSVProvider    | File    | Slow    | Medium | Data import/export           |

Types of Providers

1. SQLiteProvider (Default)

  • Storage: Disk-based (.db file).
  • Pros: Low memory footprint. Handles massive datasets (millions of n-grams).
  • Cons: Slightly slower than RAM (but mitigated by caching).
  • Use Case: General purpose, desktop apps, web servers with limited RAM.
from myspellchecker.providers import SQLiteProvider

# Basic usage
provider = SQLiteProvider("/path/to/dictionary.db")

# With connection pooling
provider = SQLiteProvider(
    database_path="/path/to/dictionary.db",
    pool_max_size=5,           # Maximum connections in pool (default: 5)
    pool_timeout=5.0,          # Checkout timeout in seconds
    sqlite_timeout=30.0,       # SQLite busy timeout
    check_same_thread=False    # Allow multi-threading
)

# Check word validity
is_valid = provider.is_valid_word("မြန်မာ")
if is_valid:
    print(f"Frequency: {provider.get_word_frequency('မြန်မာ')}")
    print(f"POS tag: {provider.get_word_pos('မြန်မာ')}")

# Check syllable validity
is_valid = provider.is_valid_syllable("မြန်")

# Get n-gram probability
prob = provider.get_bigram_probability("မြန်မာ", "နိုင်ငံ")

2. MemoryProvider

  • Storage: RAM (Python Dictionary).
  • Pros: Extremely fast (hash map lookup).
  • Cons: High memory usage. Long startup time (loading data into RAM).
  • Use Case: High-performance servers where RAM is abundant and latency must be minimized.
from myspellchecker.providers import MemoryProvider

# Create empty provider
provider = MemoryProvider()

# Add data programmatically
provider.add_word("မြန်မာ", frequency=1000)
provider.add_word_pos("မြန်မာ", "N")  # Add POS tag separately
provider.add_syllable("မြန်")
provider.add_bigram("မြန်မာ", "နိုင်ငံ", probability=0.5)

# Load from lists (for testing or custom data)
syllables = [("မြန်", 1000), ("မာ", 800)]  # (syllable, frequency) tuples
words = [("မြန်မာ", 500)]                   # (word, frequency) tuples
bigrams = [("မြန်မာ", "နိုင်ငံ", 0.5)]       # (prev, curr, probability) tuples
provider.load_from_lists(syllables, words, bigrams)
Memory Usage:
| Data        | Approx. Memory |
|-------------|----------------|
| 100K words  | ~50 MB         |
| 1M bigrams  | ~200 MB        |
| 1M trigrams | ~300 MB        |

3. JSONProvider

  • Storage: JSON file.
  • Pros: Human-readable, easy to edit/debug.
  • Cons: Slow to load, memory inefficient for large datasets.
  • Use Case: Unit testing, small custom vocabularies, config files.
from myspellchecker.providers import JSONProvider

# Load from JSON file
provider = JSONProvider("/path/to/dictionary.json")
JSON Format:
{
  "syllables": {
    "မြန်": 15432,
    "မာ": 12341
  },
  "words": {
    "မြန်မာ": {"frequency": 8752, "syllable_count": 2},
    "နိုင်ငံ": {"frequency": 12341, "syllable_count": 2}
  },
  "bigrams": {
    "သူ|သွား": 0.234,
    "သူ|ဘယ်": 0.012
  }
}
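A dictionary file in the format above can be produced with the stdlib `json` module. A minimal sketch (the file name and data are illustrative; bigram keys join the two words with `|` as in the example above):

```python
import json

# Build a tiny dictionary matching the documented JSON layout.
data = {
    "syllables": {"မြန်": 15432, "မာ": 12341},
    "words": {
        "မြန်မာ": {"frequency": 8752, "syllable_count": 2},
    },
    # Bigram keys are "prev|curr" pairs.
    "bigrams": {"မြန်မာ|နိုင်ငံ": 0.5},
}

with open("dictionary.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# Round-trip check: the loaded data matches what was written.
with open("dictionary.json", encoding="utf-8") as f:
    loaded = json.load(f)
assert loaded["words"]["မြန်မာ"]["frequency"] == 8752
```

`ensure_ascii=False` keeps the Burmese text human-readable in the file, which is the main reason to use this provider in the first place.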

4. CSVProvider

  • Storage: CSV/TSV file.
  • Pros: Easy to export from spreadsheets.
  • Cons: Similar performance issues to JSON for large data.
  • Use Case: Importing word lists from Excel/Sheets.
from myspellchecker.providers import CSVProvider

provider = CSVProvider(
    syllables_csv="/path/to/syllables.csv",
    words_csv="/path/to/words.csv",
    bigrams_csv="/path/to/bigrams.csv",
)
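Word lists exported from a spreadsheet can be written with the stdlib `csv` module. A sketch for `words.csv` — note the column layout (`word,frequency`) is an assumption here; check the column names expected by your installed version:

```python
import csv

# Assumed layout: a header row, then one (word, frequency) pair per line.
rows = [("မြန်မာ", 8752), ("နိုင်ငံ", 12341)]

with open("words.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "frequency"])
    writer.writerows(rows)

# Read it back to verify the file parses.
with open("words.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    freqs = {row["word"]: int(row["frequency"]) for row in reader}
assert freqs["မြန်မာ"] == 8752
```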

DictionaryProvider Interface

All providers implement the DictionaryProvider abstract base class:
from abc import ABC, abstractmethod
from typing import Iterator, List, Optional, Tuple

# Importable as: from myspellchecker.providers import DictionaryProvider

class DictionaryProvider(ABC):
    @abstractmethod
    def is_valid_word(self, word: str) -> bool: ...

    @abstractmethod
    def is_valid_syllable(self, syllable: str) -> bool: ...

    @abstractmethod
    def get_word_frequency(self, word: str) -> int: ...

    @abstractmethod
    def get_syllable_frequency(self, syllable: str) -> int: ...

    @abstractmethod
    def get_word_pos(self, word: str) -> Optional[str]: ...

    @abstractmethod
    def get_bigram_probability(self, prev_word: str, current_word: str) -> float: ...

    @abstractmethod
    def get_trigram_probability(self, w1: str, w2: str, w3: str) -> float: ...

    @abstractmethod
    def get_top_continuations(self, prev_word: str, limit: int = 20) -> List[Tuple[str, float]]: ...

    @abstractmethod
    def get_all_syllables(self) -> Iterator[Tuple[str, int]]: ...

    @abstractmethod
    def get_all_words(self) -> Iterator[Tuple[str, int]]: ...
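To illustrate the contract, here is a toy in-memory implementation of the core lookup methods backed by plain dicts. This is a self-contained sketch (the trigram and iteration methods are omitted for brevity); real code would subclass `DictionaryProvider` and implement every abstract method:

```python
from typing import List, Optional, Tuple

class DictBackedProvider:
    """Toy provider implementing the DictionaryProvider lookup contract."""

    def __init__(self, words, syllables, bigrams):
        self._words = words          # word -> (frequency, pos_tag)
        self._syllables = syllables  # syllable -> frequency
        self._bigrams = bigrams      # (prev, curr) -> probability

    def is_valid_word(self, word: str) -> bool:
        return word in self._words

    def is_valid_syllable(self, syllable: str) -> bool:
        return syllable in self._syllables

    def get_word_frequency(self, word: str) -> int:
        return self._words.get(word, (0, None))[0]

    def get_syllable_frequency(self, syllable: str) -> int:
        return self._syllables.get(syllable, 0)

    def get_word_pos(self, word: str) -> Optional[str]:
        return self._words.get(word, (0, None))[1]

    def get_bigram_probability(self, prev_word: str, current_word: str) -> float:
        return self._bigrams.get((prev_word, current_word), 0.0)

    def get_top_continuations(self, prev_word: str, limit: int = 20) -> List[Tuple[str, float]]:
        conts = [(curr, p) for (prev, curr), p in self._bigrams.items() if prev == prev_word]
        return sorted(conts, key=lambda x: -x[1])[:limit]

provider = DictBackedProvider(
    words={"မြန်မာ": (8752, "N")},
    syllables={"မြန်": 15432, "မာ": 12341},
    bigrams={("မြန်မာ", "နိုင်ငံ"): 0.5},
)
assert provider.is_valid_word("မြန်မာ")
assert provider.get_top_continuations("မြန်မာ") == [("နိုင်ငံ", 0.5)]
```

Unknown words and bigrams return `0`/`0.0` rather than raising, which matches how the lookup methods are used in the examples above.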

Configuration

You can switch providers during initialization:
from myspellchecker import SpellChecker
from myspellchecker.providers import MemoryProvider, JSONProvider
from myspellchecker.core.config import SpellCheckerConfig, ProviderConfig

# Direct provider injection with data
provider = MemoryProvider()
# Load data from lists (syllables, words, bigrams)
provider.load_from_lists(
    syllable_list=[("မြန်", 1500), ("မာ", 2300)],
    word_list=[("မြန်မာ", 850)],
)
checker = SpellChecker(provider=provider)

# Via configuration
config = SpellCheckerConfig(
    provider_config=ProviderConfig(
        cache_size=10000,
        pool_max_size=5,
        pool_timeout=5.0,
    )
)
checker = SpellChecker(config=config)

Caching

The SQLiteProvider uses an LRU cache to speed up repeated lookups. Configure via ProviderConfig:
from myspellchecker.core.config.validation_configs import ProviderConfig

provider_config = ProviderConfig(
    cache_size=10000  # Number of cached entries (0 to disable)
)
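The effect of an LRU cache on repeated lookups can be illustrated with the stdlib `functools.lru_cache`. This is a sketch of the caching pattern, not the provider's actual internals:

```python
from functools import lru_cache

backend_hits = {"count": 0}  # track how often the "slow" backend is queried

@lru_cache(maxsize=10000)
def cached_lookup(word: str) -> bool:
    backend_hits["count"] += 1   # stands in for a disk/SQLite query
    return word == "မြန်မာ"

cached_lookup("မြန်မာ")
cached_lookup("မြန်မာ")  # served from the cache; backend is not hit again
assert backend_hits["count"] == 1
print(cached_lookup.cache_info())  # hits=1, misses=1
```

Repeated lookups of the same word (common during suggestion ranking) hit the cache instead of the database, which is why the SQLite benchmarks below stay close to in-memory speeds.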

Performance Benchmarks

| Operation           | SQLite  | Memory   | JSON   |
|---------------------|---------|----------|--------|
| Word lookup         | ~0.1ms  | ~0.01ms  | ~1ms   |
| Syllable check      | ~0.05ms | ~0.005ms | ~0.5ms |
| Bigram probability  | ~0.2ms  | ~0.02ms  | ~2ms   |
| Suggestions (top 5) | ~5ms    | ~1ms     | ~50ms  |

Database Schema

If you wish to inspect the database directly or build one manually, here is the SQLite schema:

syllables

Stores unique syllables and their frequencies.
  • id: Integer (PK)
  • syllable: Text (Unique)
  • frequency: Integer

words

Stores valid words, frequency data, and POS tags.
  • id: Integer (PK)
  • word: Text (Unique)
  • syllable_count: Integer
  • frequency: Integer
  • pos_tag: Text (Optional, e.g., ‘N’, ‘V’)
  • is_curated: Integer (0 or 1, default 0)
  • inferred_pos: Text (POS tag from inference)
  • inferred_confidence: Real (confidence score)
  • inferred_source: Text (inference method used)

bigrams

Stores 2-word sequences and their probabilities.
  • id: Integer (PK)
  • word1_id: Integer (FK -> words.id)
  • word2_id: Integer (FK -> words.id)
  • probability: Real (P(w2|w1))
  • count: Integer (Raw frequency)

trigrams

Stores 3-word sequences.
  • id: Integer (PK)
  • word1_id, word2_id, word3_id: Integers (FK -> words.id)
  • probability: Real (P(w3|w1,w2))
  • count: Integer

processed_files

Tracks ingested files for incremental updates.
  • path: Text (PK)
  • mtime: Real
  • size: Integer
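A sketch of building the core tables by hand with the stdlib `sqlite3` module. The column names and types follow the schema above; any constraints beyond those listed (and the trigrams/processed_files tables, omitted here) are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a .db path for a real dictionary
conn.executescript("""
CREATE TABLE syllables (
    id INTEGER PRIMARY KEY,
    syllable TEXT UNIQUE,
    frequency INTEGER
);
CREATE TABLE words (
    id INTEGER PRIMARY KEY,
    word TEXT UNIQUE,
    syllable_count INTEGER,
    frequency INTEGER,
    pos_tag TEXT,
    is_curated INTEGER DEFAULT 0,
    inferred_pos TEXT,
    inferred_confidence REAL,
    inferred_source TEXT
);
CREATE TABLE bigrams (
    id INTEGER PRIMARY KEY,
    word1_id INTEGER REFERENCES words(id),
    word2_id INTEGER REFERENCES words(id),
    probability REAL,
    count INTEGER
);
""")

# Insert one word and read it back.
conn.execute(
    "INSERT INTO words (word, syllable_count, frequency, pos_tag) VALUES (?, ?, ?, ?)",
    ("မြန်မာ", 2, 8752, "N"),
)
row = conn.execute(
    "SELECT frequency, pos_tag FROM words WHERE word = ?", ("မြန်မာ",)
).fetchone()
assert row == (8752, "N")
```

Storing bigrams and trigrams as integer foreign keys into `words` (rather than raw text) keeps the n-gram tables compact, which is what lets SQLiteProvider handle millions of n-grams at a low memory footprint.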