Overview
The Viterbi algorithm finds the most likely sequence of hidden states (POS tags) given observed emissions (words). It uses dynamic programming with beam search, running in O(N × B² × T) time, where N is the sequence length, B is the beam width, and T is the tagset size.
Requirements
The Viterbi POS tagger requires probability tables in the database to function effectively:

| Table | Purpose | Required? |
|---|---|---|
| pos_unigrams | P(tag) - Tag frequency distribution | Recommended |
| pos_bigrams | P(tag₂ \| tag₁) - Tag transition probabilities | Required |
| pos_trigrams | P(tag₃ \| tag₁, tag₂) - Trigram transitions | Recommended |
Building Database with POS Probabilities
POS probability tables are populated when you build a database with a POS tagger.
Checking Database for POS Tables
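Before tagging, it can help to verify that the probability tables exist and are populated. A minimal sketch, assuming the database is SQLite and the table names match those in the requirements table above (the helper name `has_pos_tables` is hypothetical, not part of the library API):

```python
import sqlite3

def has_pos_tables(db_path: str) -> dict:
    """Return a row count for each POS probability table (0 if missing)."""
    conn = sqlite3.connect(db_path)
    status = {}
    for table in ("pos_unigrams", "pos_bigrams", "pos_trigrams"):
        row = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
            (table,),
        ).fetchone()
        if row is None:
            status[table] = 0  # table absent entirely
        else:
            status[table] = conn.execute(
                f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    conn.close()
    return status
```

A zero count for `pos_bigrams` means the tagger will use the fallback behavior described below.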
Fallback Behavior
If POS probability tables are empty or missing:
- Viterbi falls back to morphological analysis for tag prediction
- Accuracy drops from ~85% to ~70% (similar to rule-based tagger)
- A warning is logged: "Provider does not support bigram probabilities"
How It Works
Trigram Hidden Markov Model
The implementation uses a trigram HMM with deleted interpolation smoothing.
Algorithm Steps
- Initialization: Set probabilities for first position
- Recursion: For each position, compute best path using trigram transitions with beam pruning
- Termination: Find best final state
- Backtracking: Reconstruct optimal path
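The steps above can be sketched in simplified form. This is not the library's implementation, just an illustration of trigram Viterbi with beam pruning and deleted-interpolation smoothing (the lambda defaults mirror the config table below; for clarity the sketch carries full paths instead of backpointers, so the explicit backtracking step is folded into termination):

```python
import math

def interpolated_transition(t1, t2, t3, uni, bi, tri,
                            lam=(0.1, 0.3, 0.6), min_prob=1e-10):
    """P(t3 | t1, t2) via deleted interpolation of the three tables."""
    p = (lam[0] * uni.get(t3, 0.0)
         + lam[1] * bi.get((t2, t3), 0.0)
         + lam[2] * tri.get((t1, t2, t3), 0.0))
    return max(p, min_prob)  # probability floor prevents log(0)

def viterbi_tag(words, tags, emit, uni, bi, tri, beam_width=10):
    # Each beam entry: (log probability, (t1, t2) state, path so far).
    beam = [(0.0, ("<s>", "<s>"), [])]  # initialization at sentence start
    for w in words:
        candidates = []
        for logp, (t1, t2), path in beam:
            for t3 in tags:  # recursion: extend every surviving path
                e = emit.get((w, t3), 1e-10)  # emission P(word | tag)
                tr = interpolated_transition(t1, t2, t3, uni, bi, tri)
                candidates.append(
                    (logp + math.log(tr) + math.log(e), (t2, t3), path + [t3]))
        # Beam pruning: keep only the top `beam_width` partial paths.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][2]  # termination: best surviving complete path
```

Because scores are sums of log probabilities, long sequences never underflow, and the `min_prob` floor guarantees every transition has a finite log score.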
Implementation
Python Wrapper
Cython Acceleration
The library includes a Cython implementation for a 2-5x speedup.
Configuration
Note: Unknown word handling is controlled via the `min_prob` (default: `1e-10`) and `unknown_word_tag` (default: `"UNK"`) parameters on the `ViterbiTagger` constructor, not via a separate penalty parameter.
Config resolution: When both constructor arguments and a `POSTaggerConfig` are provided, explicit constructor arguments always take precedence, even if they match the default value. This means `ViterbiTagger(beam_width=10, config=config)` uses `10` regardless of what `config.viterbi_beam_width` says. Omitted parameters fall back to the config, then to built-in defaults.
POSTaggerConfig Viterbi Fields
The `POSTaggerConfig` class exposes Viterbi-specific tuning parameters. These are separate from the constructor parameters on `ViterbiTagger`; they control the config-driven pipeline integration.
| Field | Type | Default | Description |
|---|---|---|---|
| `viterbi_min_prob` | float | 1e-10 | Minimum probability floor to prevent underflow |
| `viterbi_beam_width` | int | 10 | Default beam search width for pruning |
| `viterbi_emission_weight` | float | 1.2 | Weight for emission probabilities in scoring |
| `viterbi_lambda_unigram` | float | 0.1 | Deleted interpolation weight for unigram |
| `viterbi_lambda_bigram` | float | 0.3 | Deleted interpolation weight for bigram |
| `viterbi_lambda_trigram` | float | 0.6 | Deleted interpolation weight for trigram |
| `viterbi_emission_cache_size` | int | 4096 | LRU cache size for emission score lookups |
| `viterbi_transition_cache_size` | int | 2048 | LRU cache size for transition probability lookups |
| `viterbi_min_beam_width` | int | 5 | Minimum beam width for adaptive beam mode |
| `viterbi_max_beam_width` | int | 20 | Maximum beam width for adaptive beam mode |
| `viterbi_short_sequence_threshold` | int | 5 | Sequences at or below this length use max beam |
| `viterbi_long_sequence_threshold` | int | 20 | Sequences above this length use min beam |
The `lambda_*` weights control deleted interpolation smoothing and must sum to 1.0. A higher `lambda_trigram` gives more weight to trigram context (better for well-trained models), while a higher `lambda_unigram` adds robustness when training data is sparse.
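The adaptive beam fields above pin down only the endpoints: sequences at or below the short threshold get the maximum beam, and sequences above the long threshold get the minimum. A plausible sketch of how a width could be chosen in between, assuming linear interpolation between the two thresholds (an assumption; the document does not specify the in-between behavior, and `adaptive_beam_width` is a hypothetical name):

```python
def adaptive_beam_width(seq_len, min_beam=5, max_beam=20,
                        short_threshold=5, long_threshold=20):
    """Pick a beam width from sequence length (defaults match the config)."""
    if seq_len <= short_threshold:
        return max_beam  # short sequences are cheap: search widely
    if seq_len > long_threshold:
        return min_beam  # long sequences are costly: prune aggressively
    # In-between: linear interpolation (assumed, not documented behavior).
    frac = (seq_len - short_threshold) / (long_threshold - short_threshold)
    return round(max_beam - frac * (max_beam - min_beam))
```

The motivation is the cost model from the overview: runtime grows with N × B², so shrinking B on long inputs bounds the total work.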
Beam Search
Beam search prunes unlikely paths for efficiency:

| Beam Width | Speed | Accuracy |
|---|---|---|
| 5 | Fast | ~83% |
| 10 | Balanced (default) | ~85% |
| 20+ | Slow | ~86% |
Joint Segmentation + Tagging
The Viterbi algorithm can jointly segment and tag text.
Emission Score Logic
For the first token (t=1), emission scores are handled specially.
Performance
| Operation | Cython | Python | Speedup |
|---|---|---|---|
| Tag 10 words | 1.5ms | 5ms | ~3x |
| Tag 100 words | 10ms | 30ms | ~3x |
| Tag 1000 words | 80ms | 250ms | ~3x |
Integration
See Also
- POS Tagging - Feature documentation
- Joint Segment Tagger - Combined segmentation + tagging
- Transformer POS - Neural POS tagger