Overview
The Joint Optimization Problem
Traditional approach (sequential): segment the text into words first, then tag the resulting words in a second pass.
Benefits
| Aspect | Sequential | Joint |
|---|---|---|
| Optimization | Local | Global |
| Error propagation | Yes | Minimal |
| Passes | Multiple | Single |
| Ambiguity handling | Limited | Better |
Mathematical Formulation
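The tagger finds the segmentation and tag sequence that maximize a combined score built from three terms:
- P(word_i): word n-gram probability
- P(tag_i | tags): tag transition probability given the preceding tags (HMM)
- P(tag_i | word_i): emission probability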
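One way to write the objective implied by the three terms above, in log space. This is a sketch under stated assumptions, not the library's documented formula: the weights \lambda_w and \lambda_e are assumed to correspond to the word_score_weight and emission_weight parameters, the transition term is written in its trigram form, and m is the number of words in a candidate segmentation.

```latex
(\hat{w}, \hat{t}) =
  \arg\max_{w_{1:m},\, t_{1:m}}
  \sum_{i=1}^{m} \Big[
      \lambda_w \log P(w_i)
    + \log P(t_i \mid t_{i-2}, t_{i-1})
    + \lambda_e \log P(t_i \mid w_i)
  \Big]
```

Here w_{1:m} ranges over the segmentations of the input whose words concatenate back to the original text, so the maximization covers word boundaries and tags at the same time.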
JointSegmentTagger Class
Parameters
| Parameter | Default | Description |
|---|---|---|
| provider | Required | Dictionary provider for word lookups |
| pos_bigram_probs | Required | P(tag \| prev_tag) transitions |
| pos_trigram_probs | Required | P(tag \| prev2, prev1) trigrams |
| pos_unigram_probs | None | P(tag) priors for fallback |
| word_tag_probs | None | P(tag \| word) emissions |
| min_prob | 1e-10 | Minimum probability for smoothing |
| max_word_length | 20 | Maximum word length in characters |
| beam_width | 15 | Beam size for pruning |
| emission_weight | 1.2 | Weight for emission scores |
| word_score_weight | 1.0 | Weight for word n-gram scores |
| use_morphology_fallback | True | Use morphology for OOV words |
Usage
Basic Segmentation and Tagging
Batch Processing
State Space
The Viterbi algorithm operates on states of the form (position, word_start, current_tag, prev_tag).
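To make the state tuple concrete, here is a minimal, self-contained sketch of joint decoding over states of that shape. It is not the JointSegmentTagger implementation: the toy probability tables, the `<s>` start tag, the `State` class, and the `decode` function are all hypothetical names introduced for illustration. The sketch uses only bigram transitions and applies no beam pruning or weighting.

```python
import math
from typing import NamedTuple

class State(NamedTuple):
    position: int     # index just past the last consumed character
    word_start: int   # index where the most recent word begins
    current_tag: str  # tag of the most recent word
    prev_tag: str     # tag of the word before it

# Toy model: every number below is invented for illustration.
WORD_PROBS = {"the": 0.4, "cat": 0.2, "cats": 0.1, "sat": 0.2, "at": 0.1}
EMIT = {"the": {"DET": 1.0}, "cat": {"NOUN": 1.0}, "cats": {"NOUN": 1.0},
        "sat": {"VERB": 1.0}, "at": {"PREP": 1.0}}
TRANS = {("<s>", "DET"): 0.8, ("DET", "NOUN"): 0.9,
         ("NOUN", "VERB"): 0.6, ("NOUN", "PREP"): 0.2}

def lp(p, min_prob=1e-10):
    """Log probability, floored so unseen events stay finite."""
    return math.log(max(p, min_prob))

def decode(text, max_word_length=20):
    start = State(0, 0, "<s>", "<s>")
    best = {start: (0.0, None)}  # state -> (best score, predecessor state)
    by_pos = {0: [start]}        # states grouped by their end position
    for pos in range(len(text)):
        for state in by_pos.get(pos, []):
            score = best[state][0]
            last_end = min(len(text), pos + max_word_length)
            for end in range(pos + 1, last_end + 1):
                word = text[pos:end]
                # Only dictionary words are proposed in this sketch.
                for tag, p_emit in EMIT.get(word, {}).items():
                    p_trans = TRANS.get((state.current_tag, tag), 0.0)
                    new_score = (score + lp(WORD_PROBS.get(word, 0.0))
                                 + lp(p_trans) + lp(p_emit))
                    nxt = State(end, pos, tag, state.current_tag)
                    if nxt not in best:
                        by_pos.setdefault(end, []).append(nxt)
                        best[nxt] = (new_score, state)
                    elif new_score > best[nxt][0]:
                        best[nxt] = (new_score, state)
    finals = by_pos.get(len(text), [])
    if not finals:
        return []  # no segmentation covers the whole input
    state = max(finals, key=lambda s: best[s][0])
    words = []
    while best[state][1] is not None:  # follow back-pointers to the start
        words.append((text[state.word_start:state.position], state.current_tag))
        state = best[state][1]
    return words[::-1]

print(decode("thecatsat"))
# [('the', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB')]
```

The input is ambiguous between "the cat sat" and "the cats at". Because word and tag scores are summed along the same path, the higher NOUN to VERB transition and the higher word probabilities select the first reading in one pass.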
Scoring Functions
Word Score
Tag Transition Score
Emission Score
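The word, tag transition, and emission scores named in this section can be sketched together as weighted log probabilities. This is an illustrative sketch, not the library's code: the function names, the dictionary layouts (bigrams keyed by `(prev_tag, tag)`, emissions keyed by word then tag), and the choice to apply the weights multiplicatively in log space are assumptions. The defaults mirror the documented min_prob, word_score_weight, and emission_weight values.

```python
import math

def log_prob(p, min_prob=1e-10):
    """Natural log of p, floored at min_prob so unseen events stay finite."""
    return math.log(max(p, min_prob))

def word_score(word, word_probs, weight=1.0):
    # weight plays the role of word_score_weight (assumed usage).
    return weight * log_prob(word_probs.get(word, 0.0))

def transition_score(tag, prev_tag, bigram_probs):
    # bigram_probs is assumed to be keyed by (prev_tag, tag).
    return log_prob(bigram_probs.get((prev_tag, tag), 0.0))

def emission_score(word, tag, word_tag_probs, weight=1.2):
    # weight plays the role of emission_weight (assumed usage).
    return weight * log_prob(word_tag_probs.get(word, {}).get(tag, 0.0))

def step_score(word, tag, prev_tag, word_probs, bigram_probs, word_tag_probs):
    """Score for appending (word, tag) to a path whose last tag is prev_tag."""
    return (word_score(word, word_probs)
            + transition_score(tag, prev_tag, bigram_probs)
            + emission_score(word, tag, word_tag_probs))
```

Working in log space turns the product of probabilities into a sum, which avoids underflow on long inputs and lets the weights act as simple multipliers on each term.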
Beam Pruning
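A minimal sketch of the pruning step, assuming candidates are held as `(score, state)` pairs with higher scores being better. The `prune` function is a hypothetical helper, not part of the documented API; its default mirrors the documented beam_width of 15.

```python
import heapq

def prune(candidates, beam_width=15):
    """Keep the beam_width highest-scoring (score, state) pairs, best first."""
    return heapq.nlargest(beam_width, candidates, key=lambda pair: pair[0])

candidates = [(-3.2, "a"), (-1.1, "b"), (-7.5, "c"), (-2.0, "d")]
print(prune(candidates, beam_width=2))
# [(-1.1, 'b'), (-2.0, 'd')]
```

A narrower beam is faster but can discard a path that would have scored best later on, so the width trades accuracy for speed.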
To manage the large state space, beam pruning keeps only the top-k states, where k is beam_width.
OOV Handling
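As a rough illustration of a fallback for words missing from the dictionary, the sketch below guesses tags from a word's suffix and otherwise falls back to tag priors, in the spirit of pos_unigram_probs. Everything here is hypothetical: the English suffix rules, the `guess_tags` function, and the tag names are invented for the example and do not describe the library's morphological analyzer.

```python
# Illustrative suffix rules only; a real analyzer would be language-specific.
SUFFIX_TAGS = [("ness", "NOUN"), ("ing", "VERB"), ("ly", "ADV")]

def guess_tags(word, known_tags, tag_priors):
    """Return a tag distribution for word, guessing when it is out of vocabulary."""
    if word in known_tags:
        return known_tags[word]  # in-vocabulary: use the dictionary entry
    for suffix, tag in SUFFIX_TAGS:
        # Require a non-empty stem so a bare suffix is not treated as a word form.
        if word.endswith(suffix) and len(word) > len(suffix):
            return {tag: 1.0}
    return dict(tag_priors)  # no rule matched: fall back to P(tag) priors

priors = {"NOUN": 0.6, "VERB": 0.4}
print(guess_tags("running", {}, priors))
# {'VERB': 1.0}
```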
For out-of-vocabulary words, the tagger uses morphological analysis.
Performance
Complexity
- Time: O(n × W × T²) where n=length, W=max_word_length, T=num_tags
- Space: O(n × beam_width)
Benchmarks
| Text Length | Sequential | Joint | Speedup |
|---|---|---|---|
| 50 chars | 5ms | 8ms | 0.6x |
| 200 chars | 20ms | 25ms | 0.8x |
| 1000 chars | 100ms | 90ms | 1.1x |
Cache Management
Integration
With SpellChecker
See Also
- Segmenters - Text segmentation
- POS Tagging - POS tagging overview
- POS Disambiguator - Disambiguation rules
- Morphology Analysis - Word structure