Joint Segment Tagger - mySpellChecker

Traditional pipelines segment text first, then tag the results — propagating early segmentation errors into downstream POS tags. This module eliminates that cascade by searching over word boundaries and POS labels in a single Viterbi pass, finding the globally optimal combination.

Overview

from myspellchecker.algorithms.joint_segment_tagger import JointSegmentTagger

tagger = JointSegmentTagger(
    provider,
    pos_bigram_probs,
    pos_trigram_probs,
)

words, tags = tagger.segment_and_tag("မြန်မာနိုင်ငံ")
print(list(zip(words, tags)))
# [('မြန်မာ', 'N'), ('နိုင်ငံ', 'N')]

The Joint Optimization Problem

Traditional approach (sequential):

Text → Segment → Words → Tag → POS
       ↓          ↓
    (errors)  (propagate)

Joint approach (this module):

Text → Optimize(Segment + Tag) → Words + POS
                ↓
        (global optimum)

Benefits

Aspect	Sequential	Joint
Optimization	Local	Global
Error propagation	Yes	Minimal
Passes	Multiple	Single
Ambiguity handling	Limited	Better

Mathematical Formulation

The tagger finds:

argmax P(words, tags | text)
= argmax Π P(word_i) × P(tag_i | tag_{i-1}, tag_{i-2}) × P(tag_i | word_i)

In log space:

= argmax Σ [log P(word_i) + log P(tag_i | tags) + log P(tag_i | word_i)]

Components:

P(word_i) - Word N-gram probability
P(tag_i | tags) - Tag transition probability (HMM)
P(tag_i | word_i) - Emission probability

JointSegmentTagger Class

class JointSegmentTagger:
    """Joint word segmentation and POS tagging using unified Viterbi."""

    def __init__(
        self,
        provider: DictionaryProvider,
        pos_bigram_probs: Dict[Tuple[str, str], float],
        pos_trigram_probs: Dict[Tuple[str, str, str], float],
        pos_unigram_probs: Optional[Dict[str, float]] = None,
        word_tag_probs: Optional[Dict[str, Dict[str, float]]] = None,
        min_prob: float = 1e-10,
        max_word_length: int = 20,
        beam_width: int = 15,
        emission_weight: float = 1.2,
        word_score_weight: float = 1.0,
        use_morphology_fallback: bool = True,
    ):
        ...

Parameters

Parameter	Default	Description
`provider`	Required	Dictionary provider for word lookups
`pos_bigram_probs`	Required	P(tag \| prev_tag) transitions
`pos_trigram_probs`	Required	P(tag \| prev2, prev1) trigrams
`pos_unigram_probs`	None	P(tag) priors for fallback
`word_tag_probs`	None	P(tag \| word) emissions
`min_prob`	1e-10	Minimum probability for smoothing
`max_word_length`	20	Maximum word length in chars
`beam_width`	15	Beam size for pruning
`emission_weight`	1.2	Weight for emission scores
`word_score_weight`	1.0	Weight for word N-gram scores
`use_morphology_fallback`	True	Use morphology for OOV words

Usage

Basic Segmentation and Tagging

# Create tagger
tagger = JointSegmentTagger(
    provider=provider,
    pos_bigram_probs=bigram_probs,
    pos_trigram_probs=trigram_probs,
)

# Process text
words, tags = tagger.segment_and_tag("မြန်မာနိုင်ငံသည်အရှေ့တောင်အာရှတွင်တည်ရှိသည်")

for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
# မြန်မာ     N
# နိုင်ငံ    N
# သည်       P_SENT
# အရှေ့တောင် N
# အာရှ      N
# တွင်      PPM
# တည်ရှိ    V
# သည်       P_SENT

State Space

The Viterbi algorithm operates on states:

@dataclass
class JointState:
    """State in the joint segmentation-tagging lattice."""

    word_start: int       # Character index where current word starts
    current_tag: str      # POS tag for current word
    prev_tag: str         # POS tag for previous word
    score: float          # Log probability score
    backpointer: Optional["JointState"]  # Previous state

State space: (position, word_start, current_tag, prev_tag)

Scoring Functions

Word Score

def _get_word_score(self, word: str, prev_word: str) -> float:
    """Get word n-gram score: log P(word | prev_word)."""
    # Try bigram first
    bigram_prob = self.provider.get_bigram_probability(prev_word, word)
    if bigram_prob > self.min_prob:
        return self.word_score_weight * math.log(bigram_prob)

    # Fallback to unigram
    freq = self.provider.get_word_frequency(word)
    if freq > 0:
        return self.word_score_weight * math.log(freq / 1e6)

    # Unknown word penalty
    return self.word_score_weight * (self.log_min_prob - len(word) * 0.5)

Tag Transition Score

def _get_tag_transition_score(self, tag: str, prev_tag: str, prev_prev_tag: str) -> float:
    """Get POS tag transition score: log P(tag | prev_prev_tag, prev_tag)."""
    # Try trigram
    trigram_prob = self.pos_trigram_probs.get((prev_prev_tag, prev_tag, tag), 0.0)
    if trigram_prob > self.min_prob:
        return math.log(trigram_prob)

    # Fallback to bigram
    bigram_prob = self.pos_bigram_probs.get((prev_tag, tag), 0.0)
    if bigram_prob > self.min_prob:
        return math.log(bigram_prob)

    # Fallback to unigram
    unigram_prob = self.pos_unigram_probs.get(tag, self.min_prob)
    return math.log(unigram_prob)

Emission Score

def _get_emission_score(self, word: str, tag: str) -> float:
    """Get emission score: log P(tag | word)."""
    if word in self.word_tag_probs:
        prob = self.word_tag_probs[word].get(tag, self.min_prob)
        return self.emission_weight * math.log(prob)

    # Fallback to tag prior
    if self.pos_unigram_probs:
        prob = self.pos_unigram_probs.get(tag, self.min_prob)
        return self.emission_weight * math.log(prob)

    return 0.0

Beam Pruning

To manage the large state space, beam pruning keeps only top-k states:

def _prune_beam(self, dp, end_pos):
    """Apply beam pruning to keep only top-k states."""
    if len(dp[end_pos]) > self.beam_width:
        top_states = nlargest(
            self.beam_width,
            dp[end_pos].items(),
            key=lambda x: x[1][0],  # Sort by score
        )
        dp[end_pos] = dict(top_states)

OOV Handling

For out-of-vocabulary words, the tagger uses morphological analysis:

def _get_valid_tags_for_word(self, word: str) -> Set[str]:
    """Get valid POS tags for a word."""
    tags = set()

    # From database
    pos_str = self.provider.get_word_pos(word)
    if pos_str:
        tags.update(pos_str.split("|"))

    # From word-tag probabilities
    if word in self.word_tag_probs:
        tags.update(self.word_tag_probs[word].keys())

    # Morphological fallback for OOV
    if not tags and self.morphology_analyzer:
        morpho_tags = self.morphology_analyzer.guess_pos(word)
        if morpho_tags:
            tags.update(morpho_tags)

    # Final fallback
    if not tags:
        return {self.UNKNOWN_TAG}

    return tags

Performance

Complexity

Time: O(n × W × T²) where n=length, W=max_word_length, T=num_tags
Space: O(n × beam_width)

Benchmarks

Text Length	Sequential	Joint	Speedup
50 chars	5ms	8ms	0.6x
200 chars	20ms	25ms	0.8x
1000 chars	100ms	90ms	1.1x

Joint is slightly slower for short texts but comparable for longer texts, with better accuracy.

Cache Management

# Clear cache when done
tagger.clear_cache()

Integration

With SpellChecker

from myspellchecker import SpellChecker

class SpellChecker:
    def __init__(self, joint_tagger: JointSegmentTagger = None):
        self.tagger = joint_tagger

    def check(self, text: str):
        # Use joint segmentation and tagging
        words, tags = self.tagger.segment_and_tag(text)

        # Check spelling with POS context
        errors = []
        for i, (word, tag) in enumerate(zip(words, tags)):
            if not self._is_valid(word, tag):
                errors.append(self._create_error(word, tag, i))

        return errors

​Overview

​The Joint Optimization Problem

​Benefits

​Mathematical Formulation

​JointSegmentTagger Class

​Parameters

​Usage

​Basic Segmentation and Tagging

​State Space

​Scoring Functions

​Word Score

​Tag Transition Score

​Emission Score

​Beam Pruning

​OOV Handling

​Performance

​Complexity

​Benchmarks

​Cache Management

​Integration

​With SpellChecker

​See Also