Skip to main content
This page covers the implementation details, database requirements, beam search tuning, and Cython acceleration for the Viterbi decoder used in mySpellChecker’s POS tagging and joint segmentation pipeline.

Overview

The Viterbi algorithm finds the most likely sequence of hidden states (POS tags) given observed emissions (words). It uses dynamic programming with beam search, running in O(N × B² × T) time where N is the sequence length, B is the beam width, and T is the tagset size.

Requirements

The Viterbi POS tagger requires probability tables in the database to function effectively:
TablePurposeRequired?
pos_unigramsP(tag) - Tag frequency distributionRecommended
pos_bigramsP(tag₂ | tag₁) - Tag transition probabilitiesRequired
pos_trigramsP(tag₃ | tag₁, tag₂) - Trigram transitionsRecommended

Building Database with POS Probabilities

POS probability tables are populated when you build a database with a POS tagger:
# Build with Viterbi tagger (uses rule-based for initial tagging)
myspellchecker build --input corpus.txt --output dict.db --pos-tagger viterbi

# Build with transformer tagger (higher accuracy, requires GPU)
myspellchecker build --input corpus.txt --output dict.db --pos-tagger transformer

# Build sample database (includes pre-computed POS probabilities)
myspellchecker build --sample

Checking Database for POS Tables

import sqlite3

conn = sqlite3.connect("dict.db")
cursor = conn.cursor()

# Check POS table counts
for table in ["pos_unigrams", "pos_bigrams", "pos_trigrams"]:
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    count = cursor.fetchone()[0]
    print(f"{table}: {count} entries")

conn.close()

Fallback Behavior

If POS probability tables are empty or missing:
  1. Viterbi falls back to morphological analysis for tag prediction
  2. Accuracy drops from ~85% to ~70% (similar to rule-based tagger)
  3. A warning is logged: “Provider does not support bigram probabilities”

How It Works

Trigram Hidden Markov Model

The implementation uses a trigram HMM with deleted interpolation smoothing:
States: POS tag pairs (tag_{i-2}, tag_{i-1})
Observations: Words/syllables
Transitions: P(tag_i | tag_{i-2}, tag_{i-1})  # trigram
Emissions: P(word | tag)

Algorithm Steps

  1. Initialization: Set probabilities for first position
  2. Recursion: For each position, compute best path using trigram transitions with beam pruning
  3. Termination: Find best final state
  4. Backtracking: Reconstruct optimal path
# Pseudocode (trigram with deleted interpolation)
for t in range(1, T):
    for s in beam_states:
        score = smoothed_trigram(tag_prev_prev, tag_prev, s) * emission[s][word[t]]
        # smoothed_trigram interpolates unigram, bigram, and trigram probabilities

Implementation

Python Wrapper

from myspellchecker.algorithms.viterbi import ViterbiTagger

tagger = ViterbiTagger(
    provider=provider,
    pos_bigram_probs=bigram_probs,      # P(t2|t1) transition probabilities
    pos_trigram_probs=trigram_probs,     # P(t3|t1,t2) trigram transitions
    pos_unigram_probs=unigram_probs,    # P(t) unigram probabilities (optional)
    unknown_word_tag="UNK",             # Fallback tag for unknown words (default: "UNK")
    min_prob=1e-10,                     # Minimum probability floor (default: 1e-10)
    beam_width=10,                      # Beam search pruning (default: 10)
    emission_weight=1.2,                # Weight for emission scores (default: 1.2)
    use_morphology_fallback=True,       # Use MorphologyAnalyzer for unknown words
    lambda_unigram=0.1,                 # Deleted interpolation: unigram weight
    lambda_bigram=0.3,                  # Deleted interpolation: bigram weight
    lambda_trigram=0.6,                 # Deleted interpolation: trigram weight
    adaptive_beam=False,                # Enable adaptive beam width
    min_beam_width=5,                   # Minimum beam width for adaptive mode
    max_beam_width=20,                  # Maximum beam width for adaptive mode
)

tags = tagger.tag_sequence(["ကျွန်တော်", "သွားပါမယ်"])
# Output: ["PRON", "VERB"]

Cython Acceleration

The library includes a Cython implementation for 2-5x speedup:
# Automatically uses Cython if available
from myspellchecker.algorithms.viterbi import ViterbiTagger

# Check if Cython is loaded
from myspellchecker.algorithms.viterbi import _HAS_CYTHON_VITERBI
print(_HAS_CYTHON_VITERBI)  # True if Cython extension loaded

Configuration

from myspellchecker.core.config import POSTaggerConfig

config = POSTaggerConfig(
    tagger_type="viterbi",      # Use Viterbi tagger
    beam_width=10,              # Beam search width (default: 10)
)
Note: Unknown word handling is controlled via the min_prob (default: 1e-10) and unknown_word_tag (default: "UNK") parameters on the ViterbiTagger constructor, not via a separate penalty parameter.
Config resolution: When both constructor arguments and a POSTaggerConfig are provided, explicit constructor arguments always take precedence, even if they match the default value. This means ViterbiTagger(beam_width=10, config=config) uses 10 regardless of what config.viterbi_beam_width says. Omitted parameters fall back to the config, then to built-in defaults.

POSTaggerConfig Viterbi Fields

The POSTaggerConfig class exposes Viterbi-specific tuning parameters. These are separate from the constructor parameters on ViterbiTagger — they control the config-driven pipeline integration.
from myspellchecker.core.config import POSTaggerConfig

config = POSTaggerConfig(
    tagger_type="viterbi",
    viterbi_beam_width=10,
    viterbi_emission_weight=1.2,
    viterbi_lambda_trigram=0.6,
)
FieldTypeDefaultDescription
viterbi_min_probfloat1e-10Minimum probability floor to prevent underflow
viterbi_beam_widthint10Default beam search width for pruning
viterbi_emission_weightfloat1.2Weight for emission probabilities in scoring
viterbi_lambda_unigramfloat0.1Deleted interpolation weight for unigram
viterbi_lambda_bigramfloat0.3Deleted interpolation weight for bigram
viterbi_lambda_trigramfloat0.6Deleted interpolation weight for trigram
viterbi_emission_cache_sizeint4096LRU cache size for emission score lookups
viterbi_transition_cache_sizeint2048LRU cache size for transition probability lookups
viterbi_min_beam_widthint5Minimum beam width for adaptive beam mode
viterbi_max_beam_widthint20Maximum beam width for adaptive beam mode
viterbi_short_sequence_thresholdint5Sequences at or below this length use max beam
viterbi_long_sequence_thresholdint20Sequences above this length use min beam
The lambda_* weights control deleted interpolation smoothing and must sum to 1.0. Higher lambda_trigram gives more weight to trigram context (better for well-trained models), while higher lambda_unigram adds robustness when training data is sparse. Beam search prunes unlikely paths for efficiency:
# Instead of tracking all states, keep top-k
beam_width = 10  # Default

# At each step, only keep top beam_width paths
# Reduces complexity from O(|S|^T) to O(k * |S| * T)
Beam WidthSpeedAccuracy
5Fast~83%
10Balanced (default)~85%
20+Slow~86%

Joint Segmentation + Tagging

The Viterbi algorithm can jointly segment and tag text:
from myspellchecker.algorithms.joint_segment_tagger import JointSegmentTagger

tagger = JointSegmentTagger(
    provider=provider,
    beam_width=15,
    word_score_weight=1.0,
)

segments, tags = tagger.segment_and_tag("ကျွန်တော်သွားပါမယ်")
# segments: ["ကျွန်တော်", "သွားပါမယ်"]
# tags: ["PRON", "VERB"]
See Joint Segment Tagger for details.

Emission Score Logic

For the first token (t=1), emission scores are handled specially:
# At t=1, emission score = P(word | tag)
# No transition from previous state

# For t>1, score = transition * emission
score = transition[prev_tag][tag] * emission[tag][word]

Performance

OperationCythonPythonSpeedup
Tag 10 words1.5ms5ms~3x
Tag 100 words10ms30ms~3x
Tag 1000 words80ms250ms~3x

Integration

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, POSTaggerConfig

config = SpellCheckerConfig(
    pos_tagger=POSTaggerConfig(
        tagger_type="viterbi",
        beam_width=10,
    )
)

checker = SpellChecker(config=config)

See Also