Overview
N-gram models capture the likelihood of word sequences, enabling detection of “real-word errors”: correctly spelled words used incorrectly in context.
How It Works
Bigram Model (2-gram)
Calculates the probability of word pairs:
P(w2 | w1) = count(w1, w2) / count(w1)
# Example
P("စား" | "ထမင်း") = count("ထမင်း စား") / count("ထမင်း")
= 8500 / 10000
= 0.85
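The bigram calculation above can be sketched in a few lines of Python. The counts here are the hypothetical figures from the example; real counts come from the library's N-gram tables.

```python
from collections import Counter

# Hypothetical corpus counts matching the example above.
unigram_counts = Counter({"ထမင်း": 10000})
bigram_counts = Counter({("ထမင်း", "စား"): 8500})

def bigram_probability(w1: str, w2: str) -> float:
    """P(w2 | w1) = count(w1, w2) / count(w1); 0.0 if w1 is unseen."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_probability("ထမင်း", "စား"))  # 0.85
```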
Trigram Model (3-gram)
Extends to word triplets for richer context:
P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
4-gram and 5-gram Models
Higher-order N-grams provide deeper context for detecting errors in longer phrases:
P(w4 | w1, w2, w3) = count(w1, w2, w3, w4) / count(w1, w2, w3)
P(w5 | w1, w2, w3, w4) = count(w1, w2, w3, w4, w5) / count(w1, w2, w3, w4)
These are stored in fourgrams and fivegrams database tables and configured via fourgram_threshold and fivegram_threshold in NgramContextConfig (both default to 0.0001). The context checker uses backoff: if a 5-gram probability is available it takes precedence, falling back to 4-gram, trigram, then bigram.
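The backoff order (5-gram first, then 4-gram, trigram, bigram) can be illustrated with a minimal sketch. The probability dictionary and helper below are assumptions for illustration; the checker's actual internals read from the database tables.

```python
def backoff_probability(words: list[str], probs: dict[tuple, float]) -> float:
    """Try the highest-order N-gram available, falling back to shorter ones."""
    for order in (5, 4, 3, 2):
        if len(words) >= order:
            key = tuple(words[-order:])
            if key in probs:
                return probs[key]
    return 0.0

# Hypothetical probabilities: a trigram entry and a bigram entry.
probs = {("ဒီနေ့", "ထမင်း", "စား"): 0.6, ("ထမင်း", "စား"): 0.85}
print(backoff_probability(["ဒီနေ့", "ထမင်း", "စား"], probs))  # 0.6 (trigram wins)
print(backoff_probability(["ထမင်း", "စား"], probs))           # 0.85 (bigram fallback)
```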
Smoothing Strategies
The library supports multiple smoothing strategies for handling unseen N-grams:
Stupid Backoff (Default)
Fast and effective for most use cases:
P_backoff(w2 | w1) =
P(w2 | w1) if count(w1, w2) > 0
α * P(w2) otherwise
# Where α is the backoff weight (default: 0.4)
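A minimal sketch of Stupid Backoff, using the default α = 0.4. The probability dictionaries are illustrative stand-ins for the library's lookups.

```python
def stupid_backoff_probability(w1, w2, bigram_probs, unigram_probs, alpha=0.4):
    """Use P(w2 | w1) when the bigram was seen; otherwise alpha * P(w2)."""
    p = bigram_probs.get((w1, w2), 0.0)
    if p > 0:
        return p
    return alpha * unigram_probs.get(w2, 0.0)

# Hypothetical probabilities for illustration.
bigram_probs = {("ထမင်း", "စား"): 0.85}
unigram_probs = {"သွား": 0.01}

print(stupid_backoff_probability("ထမင်း", "စား", bigram_probs, unigram_probs))  # 0.85
print(stupid_backoff_probability("ထမင်း", "သွား", bigram_probs, unigram_probs))  # ≈ 0.004
```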
Add-K (Laplace) Smoothing
Adds constant k to all counts:
P_smooth(w2 | w1) = (count(w1, w2) + k) / (count(w1) + k * V)
# Where V is vocabulary size, k is smoothing constant
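The Add-K formula is a direct translation into code. The counts and vocabulary size below are hypothetical; note how an unseen pair receives a small non-zero probability instead of 0.

```python
def add_k_probability(pair_count, prefix_count, k, vocab_size):
    """(count(w1, w2) + k) / (count(w1) + k * V)."""
    return (pair_count + k) / (prefix_count + k * vocab_size)

# Illustrative counts with k=1 (classic Laplace) and V=50,000:
seen = add_k_probability(8500, 10000, 1.0, 50000)    # seen pair, barely changed
unseen = add_k_probability(0, 10000, 1.0, 50000)     # unseen pair, small non-zero
```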
No Smoothing
Returns raw probabilities (for pre-smoothed data):
P(w2 | w1) = count(w1, w2) / count(w1)
# Returns 0 for unseen pairs
Configuration
from myspellchecker.algorithms.ngram_context_checker import (
NgramContextChecker,
SmoothingStrategy,
)
from myspellchecker.core.config import NgramContextConfig
# All thresholds and weights are configured via NgramContextConfig
config = NgramContextConfig(
bigram_threshold=0.0001, # Probability threshold for flagging bigram errors
trigram_threshold=0.0001, # Probability threshold for flagging trigram errors
fourgram_threshold=0.0001, # 4-gram error threshold
fivegram_threshold=0.0001, # 5-gram error threshold
smoothing_strategy="stupid_backoff",
backoff_weight=0.4,
add_k_smoothing=0.0, # For ADD_K strategy (0.0 = disabled by default)
edit_distance_weight=0.6,
probability_weight=0.4,
)
checker = NgramContextChecker(
provider=provider,
config=config, # NgramContextConfig (optional, uses defaults if None)
symspell=symspell, # SymSpell instance for candidate generation (optional)
pos_unigram_probs=None, # POS unigram probabilities (optional)
pos_bigram_probs=None, # POS bigram probabilities (optional)
)
When using SpellCheckerConfig (recommended), NgramContextConfig is created automatically from your config and passed to the checker. The values shown above are the defaults. Individual threshold parameters are not accepted as constructor kwargs; they must go through NgramContextConfig.
Error Detection
The checker uses a two-path detection strategy based on raw trigram availability:
- Trigram path: When a raw trigram probability exists in the corpus (P_raw(w3 | w1, w2) > 0), the checker uses the trigram-specific threshold (trigram_threshold, default 0.0001) to decide whether the word is an error. This avoids false positives from smoothed backoff values.
- Bigram fallback path: When no raw trigram is found, the checker falls back to bigram probabilities with bidirectional context checking, unigram backoff for common words, and typo-neighbor detection via SymSpell.
This design ensures that smoothed backoff values (which are always > 0) do not gate the checker into the trigram path, keeping the bigram heuristics reachable.
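The two-path gate can be reduced to a short sketch. This is a deliberate simplification: the real bigram path also applies bidirectional checks, unigram backoff, and SymSpell typo neighbors, all of which are omitted here.

```python
def is_contextual_error_sketch(raw_trigram_p, smoothed_trigram_p, bigram_p,
                               trigram_threshold=0.0001,
                               bigram_threshold=0.0001):
    """Simplified gate: a raw trigram count selects the trigram path;
    otherwise fall through to the bigram heuristics (reduced here to a
    single threshold comparison)."""
    if raw_trigram_p > 0:  # raw count exists -> trigram path only
        return smoothed_trigram_p < trigram_threshold
    return bigram_p < bigram_threshold  # bigram fallback path

print(is_contextual_error_sketch(0.5, 0.5, 0.0))        # False: strong trigram
print(is_contextual_error_sketch(0.0, 0.1, 0.00005))    # True: weak bigram fallback
```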
# Input: "ထမင်းသွား" (rice go)
# P("သွား" | "ထမင်း") = 0.0001 # Very low
# P("စား" | "ထမင်း") = 0.85 # High
# → Flag "သွား", suggest "စား"
Suggestion Generation
Suggestions are generated by:
- Finding words with higher conditional probability
- Filtering by edit distance (max 2)
- Ranking by combined probability and distance score
suggestions = checker.suggest(
prev_word="ထမင်း",
current_word="သွား",
max_edit_distance=2,
next_word="ပြီ",
)
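The ranking step can be sketched with the default weights (edit_distance_weight=0.6, probability_weight=0.4). The exact formula below is illustrative; the library's internal scoring may differ in detail.

```python
def combined_score(probability, edit_distance,
                   edit_distance_weight=0.6, probability_weight=0.4,
                   max_edit_distance=2):
    """Illustrative ranking: nearer edits and likelier words both raise
    the score. Distance is normalized so distance 0 scores highest."""
    distance_score = 1.0 - edit_distance / (max_edit_distance + 1)
    return (edit_distance_weight * distance_score
            + probability_weight * probability)

# A distant but very likely candidate can outrank a nearer, unlikely one:
score_a = combined_score(0.85, 2)  # ≈ 0.54
score_b = combined_score(0.10, 1)  # ≈ 0.44
print(score_a > score_b)           # True
```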
Performance

| Operation | Complexity | Typical Time |
|---|---|---|
| Bigram lookup | O(1) | <1ms |
| Trigram lookup | O(1) | <1ms |
| Context analysis | O(n) | ~100ms for average-length text |
| Suggestion generation | O(k) | ~50ms |
Database Schema
N-gram data is stored in SQLite using INTEGER foreign keys referencing the words table (not TEXT columns):
-- Bigrams (word IDs reference words.id)
CREATE TABLE bigrams (
id INTEGER PRIMARY KEY AUTOINCREMENT,
word1_id INTEGER,
word2_id INTEGER,
probability REAL DEFAULT 0.0,
count INTEGER DEFAULT 0,
FOREIGN KEY(word1_id) REFERENCES words(id),
FOREIGN KEY(word2_id) REFERENCES words(id),
UNIQUE(word1_id, word2_id)
);
-- Trigrams (word IDs reference words.id)
CREATE TABLE trigrams (
id INTEGER PRIMARY KEY AUTOINCREMENT,
word1_id INTEGER,
word2_id INTEGER,
word3_id INTEGER,
probability REAL DEFAULT 0.0,
count INTEGER DEFAULT 0,
FOREIGN KEY(word1_id) REFERENCES words(id),
FOREIGN KEY(word2_id) REFERENCES words(id),
FOREIGN KEY(word3_id) REFERENCES words(id),
UNIQUE(word1_id, word2_id, word3_id)
);
To query bigrams for a specific word, use a JOIN:
SELECT w1.word, w2.word, b.probability
FROM bigrams b
JOIN words w1 ON b.word1_id = w1.id
JOIN words w2 ON b.word2_id = w2.id
WHERE w1.word = 'ထမင်း'
ORDER BY b.probability DESC;
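The same lookup works from Python via the standard sqlite3 module. The snippet below builds a tiny in-memory copy of the schema with one hypothetical row, so it runs standalone; against the real database you would connect to its file instead.

```python
import sqlite3

# In-memory database with a minimal subset of the schema above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
CREATE TABLE bigrams (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word1_id INTEGER, word2_id INTEGER,
    probability REAL DEFAULT 0.0, count INTEGER DEFAULT 0,
    FOREIGN KEY(word1_id) REFERENCES words(id),
    FOREIGN KEY(word2_id) REFERENCES words(id),
    UNIQUE(word1_id, word2_id)
);
INSERT INTO words (id, word) VALUES (1, 'ထမင်း'), (2, 'စား');
INSERT INTO bigrams (word1_id, word2_id, probability, count)
VALUES (1, 2, 0.85, 8500);
""")

# Parameterized JOIN query resolving words to their integer IDs.
row = conn.execute(
    """SELECT b.probability FROM bigrams b
       JOIN words w1 ON b.word1_id = w1.id
       JOIN words w2 ON b.word2_id = w2.id
       WHERE w1.word = ? AND w2.word = ?""",
    ("ထမင်း", "စား"),
).fetchone()
probability = row[0] if row else 0.0
print(probability)  # 0.85
```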
NgramContextChecker Methods
| Method | Description |
|---|---|
| get_smoothed_bigram_probability(w1, w2) | Smoothed P(w2 \| w1) |
| get_smoothed_trigram_probability(w1, w2, w3) | Smoothed P(w3 \| w1, w2) |
| get_smoothed_fourgram_probability(w1, w2, w3, w4) | Smoothed P(w4 \| w1, w2, w3) |
| get_smoothed_fivegram_probability(w1, w2, w3, w4, w5) | Smoothed P(w5 \| w1, w2, w3, w4) |
| get_best_left_probability(prev_words, candidate) | Best left-context probability across available N-gram orders |
| get_best_right_probability(candidate, next_words) | Best right-context probability |
| is_contextual_error(prev_word, current_word, ...) | Check whether a word is contextually unlikely |
| check_word_in_context(word, prev_words, next_words) | Full context check with suggestions |
| compare_contextual_probability(w1, w2, context) | Compare probabilities between two candidates |
| suggest(prev_word, current_word, ...) | Generate context-aware suggestions |
| clear_context_cache() | Clear all internal probability caches |
See Also