Documentation Index
Fetch the complete documentation index at: https://docs.myspellchecker.com/llms.txt
Use this file to discover all available pages before exploring further.
This page covers the implementation details, database requirements, beam search tuning, and Cython acceleration for the Viterbi decoder used in mySpellChecker’s POS tagging and joint segmentation pipeline.
Overview
The Viterbi algorithm finds the most likely sequence of hidden states (POS tags) given observed emissions (words). It uses dynamic programming with beam search, running in O(N × B² × T) time where N is the sequence length, B is the beam width, and T is the tagset size.
Requirements
The Viterbi POS tagger requires probability tables in the database to function effectively:
| Table | Purpose | Required? |
|---|
pos_unigrams | P(tag) - Tag frequency distribution | Recommended |
pos_bigrams | P(tag₂ | tag₁) - Tag transition probabilities | Required |
pos_trigrams | P(tag₃ | tag₁, tag₂) - Trigram transitions | Recommended |
Building Database with POS Probabilities
POS probability tables are populated when you build a database with a POS tagger:
# Build with Viterbi tagger (uses rule-based for initial tagging)
myspellchecker build --input corpus.txt --output dict.db --pos-tagger viterbi
# Build with transformer tagger (higher accuracy, requires GPU)
myspellchecker build --input corpus.txt --output dict.db --pos-tagger transformer
# Build sample database (includes pre-computed POS probabilities)
myspellchecker build --sample
Checking Database for POS Tables
import sqlite3
conn = sqlite3.connect("dict.db")
cursor = conn.cursor()
# Check POS table counts
for table in ["pos_unigrams", "pos_bigrams", "pos_trigrams"]:
cursor.execute(f"SELECT COUNT(*) FROM {table}")
count = cursor.fetchone()[0]
print(f"{table}: {count} entries")
conn.close()
Fallback Behavior
If POS probability tables are empty or missing:
- Viterbi falls back to morphological analysis for tag prediction
- Accuracy drops from ~85% to ~70% (similar to rule-based tagger)
- A warning is logged: “Provider does not support bigram probabilities”
How It Works
Trigram Hidden Markov Model
The implementation uses a trigram HMM with deleted interpolation smoothing:
States: POS tag pairs (tag_{i-2}, tag_{i-1})
Observations: Words/syllables
Transitions: P(tag_i | tag_{i-2}, tag_{i-1}) # trigram
Emissions: P(word | tag)
Algorithm Steps
- Initialization: Set probabilities for first position
- Recursion: For each position, compute best path using trigram transitions with beam pruning
- Termination: Find best final state
- Backtracking: Reconstruct optimal path
# Pseudocode (trigram with deleted interpolation)
for t in range(1, T):
for s in beam_states:
score = smoothed_trigram(tag_prev_prev, tag_prev, s) * emission[s][word[t]]
# smoothed_trigram interpolates unigram, bigram, and trigram probabilities
Implementation
Python Wrapper
from myspellchecker.algorithms.viterbi import ViterbiTagger
tagger = ViterbiTagger(
provider=provider,
pos_bigram_probs=bigram_probs, # P(t2|t1) transition probabilities
pos_trigram_probs=trigram_probs, # P(t3|t1,t2) trigram transitions
pos_unigram_probs=unigram_probs, # P(t) unigram probabilities (optional)
unknown_word_tag="UNK", # Fallback tag for unknown words (default: "UNK")
min_prob=1e-10, # Minimum probability floor (default: 1e-10)
beam_width=10, # Beam search pruning (default: 10)
emission_weight=1.2, # Weight for emission scores (default: 1.2)
use_morphology_fallback=True, # Use MorphologyAnalyzer for unknown words
lambda_unigram=0.1, # Deleted interpolation: unigram weight
lambda_bigram=0.3, # Deleted interpolation: bigram weight
lambda_trigram=0.6, # Deleted interpolation: trigram weight
adaptive_beam=False, # Enable adaptive beam width
min_beam_width=5, # Minimum beam width for adaptive mode
max_beam_width=20, # Maximum beam width for adaptive mode
)
tags = tagger.tag_sequence(["ကျွန်တော်", "သွားပါမယ်"])
# Output: ["PRON", "VERB"]
Cython Acceleration
The library includes a Cython implementation for 2-5x speedup:
# Automatically uses Cython if available
from myspellchecker.algorithms.viterbi import ViterbiTagger
# Check if Cython is loaded
from myspellchecker.algorithms.viterbi import _HAS_CYTHON_VITERBI
print(_HAS_CYTHON_VITERBI) # True if Cython extension loaded
Configuration
from myspellchecker.core.config import POSTaggerConfig
config = POSTaggerConfig(
tagger_type="viterbi", # Use Viterbi tagger
beam_width=10, # Beam search width (default: 10)
)
Note: Unknown word handling is controlled via the min_prob (default: 1e-10) and unknown_word_tag (default: "UNK") parameters on the ViterbiTagger constructor, not via a separate penalty parameter.
Config resolution: When both constructor arguments and a POSTaggerConfig are provided, explicit constructor arguments always take precedence, even if they match the default value. This means ViterbiTagger(beam_width=10, config=config) uses 10 regardless of what config.viterbi_beam_width says. Omitted parameters fall back to the config, then to built-in defaults.
POSTaggerConfig Viterbi Fields
The POSTaggerConfig class exposes Viterbi-specific tuning parameters. These are separate from the constructor parameters on ViterbiTagger — they control the config-driven pipeline integration.
from myspellchecker.core.config import POSTaggerConfig
config = POSTaggerConfig(
tagger_type="viterbi",
viterbi_beam_width=10,
viterbi_emission_weight=1.2,
viterbi_lambda_trigram=0.6,
)
| Field | Type | Default | Description |
|---|
viterbi_min_prob | float | 1e-10 | Minimum probability floor to prevent underflow |
viterbi_beam_width | int | 10 | Default beam search width for pruning |
viterbi_emission_weight | float | 1.2 | Weight for emission probabilities in scoring |
viterbi_lambda_unigram | float | 0.1 | Deleted interpolation weight for unigram |
viterbi_lambda_bigram | float | 0.3 | Deleted interpolation weight for bigram |
viterbi_lambda_trigram | float | 0.6 | Deleted interpolation weight for trigram |
viterbi_emission_cache_size | int | 4096 | LRU cache size for emission score lookups |
viterbi_transition_cache_size | int | 2048 | LRU cache size for transition probability lookups |
viterbi_min_beam_width | int | 5 | Minimum beam width for adaptive beam mode |
viterbi_max_beam_width | int | 20 | Maximum beam width for adaptive beam mode |
viterbi_short_sequence_threshold | int | 5 | Sequences at or below this length use max beam |
viterbi_long_sequence_threshold | int | 20 | Sequences above this length use min beam |
The lambda_* weights control deleted interpolation smoothing and must sum to 1.0. Higher lambda_trigram gives more weight to trigram context (better for well-trained models), while higher lambda_unigram adds robustness when training data is sparse.
Beam Search
Beam search prunes unlikely paths for efficiency:
# Instead of tracking all states, keep top-k
beam_width = 10 # Default
# At each step, only keep top beam_width paths
# Reduces complexity from O(|S|^T) to O(k * |S| * T)
| Beam Width | Speed | Accuracy |
|---|
| 5 | Fast | ~83% |
| 10 | Balanced (default) | ~85% |
| 20+ | Slow | ~86% |
Joint Segmentation + Tagging
The Viterbi algorithm can jointly segment and tag text:
from myspellchecker.algorithms.joint_segment_tagger import JointSegmentTagger
tagger = JointSegmentTagger(
provider=provider,
beam_width=15,
word_score_weight=1.0,
)
segments, tags = tagger.segment_and_tag("ကျွန်တော်သွားပါမယ်")
# segments: ["ကျွန်တော်", "သွားပါမယ်"]
# tags: ["PRON", "VERB"]
See Joint Segment Tagger for details.
Emission Score Logic
For the first token (t=1), emission scores are handled specially:
# At t=1, emission score = P(word | tag)
# No transition from previous state
# For t>1, score = transition * emission
score = transition[prev_tag][tag] * emission[tag][word]
| Operation | Cython | Python | Speedup |
|---|
| Tag 10 words | 1.5ms | 5ms | ~3x |
| Tag 100 words | 10ms | 30ms | ~3x |
| Tag 1000 words | 80ms | 250ms | ~3x |
Integration
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, POSTaggerConfig
config = SpellCheckerConfig(
pos_tagger=POSTaggerConfig(
tagger_type="viterbi",
beam_width=10,
)
)
checker = SpellChecker(config=config)
See Also