Corpus-derived dictionaries often leave many words without POS tags — they were not in the seed data, are domain-specific, or are newly encountered. This module fills that gap by applying suffix patterns, prefix patterns, numeral detection, and an ambiguous-word registry to infer POS tags with confidence scores.
Overview
from myspellchecker.data_pipeline.pos_inference_manager import POSInferenceManager
# conn and cursor come from an open sqlite3 connection to the dictionary database;
# console is an optional PipelineConsole.
manager = POSInferenceManager(conn, cursor, console)
# Apply POS inference to untagged words
stats = manager.apply_inferred_pos(min_frequency=5)
print(f"Inferred POS for {stats['inferred']} words")
Purpose
During dictionary building, many words lack POS tags because:
- They weren’t in the POS seed data
- They’re domain-specific terms
- They’re newly encountered words
The POSInferenceManager fills this gap using morphological rules.
POSInferenceManager Class
class POSInferenceManager:
    """Manages POS inference for the database.

    Responsibilities:
    - Apply rule-based POS inference to words
    - Track POS coverage statistics
    - Report inference progress
    """

    def __init__(
        self,
        conn: sqlite3.Connection,
        cursor: sqlite3.Cursor,
        console: Optional[PipelineConsole] = None,
    ):
        self.conn = conn
        self.cursor = cursor
        self.console = console or PipelineConsole()
Applying POS Inference
Basic Usage
manager = POSInferenceManager(conn, cursor)
# Apply inference to all untagged words
stats = manager.apply_inferred_pos()
With Options
stats = manager.apply_inferred_pos(
    min_frequency=5,       # Only infer for words with freq >= 5
    skip_tagged=True,      # Skip words that already have pos_tag
    min_confidence=0.6,    # Only apply if confidence >= 0.6
    in_transaction=False,  # Commit after updates
)
Parameters
| Parameter | Default | Description |
|---|---|---|
| min_frequency | 0 | Minimum word frequency threshold |
| skip_tagged | True | Skip words with existing pos_tag |
| min_confidence | 0.0 | Minimum confidence for inference |
| in_transaction | False | If True, do not commit (the caller manages the transaction); if False, commit after updates |
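The way these four options interact can be illustrated with a small standalone sketch. This is not the library's implementation: the function name `apply_inferred_pos_sketch`, the `infer` callable, and the `word` and `frequency` column names are assumptions made for illustration, and the real method may count or filter differently.

```python
import sqlite3
from typing import Callable, Optional, Tuple

# Hypothetical rule signature: word -> (pos, confidence, source) or None.
Rule = Callable[[str], Optional[Tuple[str, float, str]]]


def apply_inferred_pos_sketch(
    conn: sqlite3.Connection,
    infer: Rule,
    min_frequency: int = 0,
    skip_tagged: bool = True,
    min_confidence: float = 0.0,
    in_transaction: bool = False,
) -> dict:
    stats = {"total_words": 0, "inferred": 0, "skipped_tagged": 0, "skipped_low_conf": 0}
    # min_frequency filters candidates before any rule is evaluated.
    rows = conn.execute(
        "SELECT word, pos_tag FROM words WHERE frequency >= ?", (min_frequency,)
    ).fetchall()
    for word, pos_tag in rows:
        stats["total_words"] += 1
        if skip_tagged and pos_tag:
            stats["skipped_tagged"] += 1
            continue
        result = infer(word)
        if result is None:
            continue  # no rule matched; the word stays untagged
        pos, confidence, source = result
        if confidence < min_confidence:
            stats["skipped_low_conf"] += 1
            continue
        conn.execute(
            "UPDATE words SET inferred_pos = ?, inferred_confidence = ?, "
            "inferred_source = ? WHERE word = ?",
            (pos, confidence, source, word),
        )
        stats["inferred"] += 1
    # With in_transaction=True the caller is responsible for committing.
    if not in_transaction:
        conn.commit()
    return stats
```

Passing `in_transaction=True` is useful when inference is one step of a larger build that should commit or roll back as a unit.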
Return Statistics
stats = manager.apply_inferred_pos()
print(stats)
# {
#     "total_words": 50000,
#     "inferred": 35000,
#     "skipped_tagged": 10000,
#     "skipped_low_conf": 2000,
#     "ambiguous": 5000,
#     "by_source": {
#         "suffix_pattern": 20000,
#         "prefix_pattern": 5000,
#         "numeral_detection": 1000,
#         "proper_noun_suffix": 3000,
#         "ambiguous_registry": 6000,
#     }
# }
Statistics Fields
| Field | Description |
|---|---|
| total_words | Total words processed |
| inferred | Words with successful inference |
| skipped_tagged | Words skipped (already had pos_tag) |
| skipped_low_conf | Words skipped due to low confidence |
| ambiguous | Words assigned more than one POS tag (e.g., "N\|V") |
| by_source | Breakdown of inferred words by inference source |
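A quick way to read the `by_source` breakdown is as percentages of the inferred total. The helper below is a hypothetical convenience function (`source_shares` is not part of the library); it only assumes the dictionary shape shown above.

```python
def source_shares(stats: dict) -> dict:
    """Return each inference source's share of inferred words, in percent."""
    inferred = stats.get("inferred", 0)
    if inferred == 0:
        return {}
    return {
        source: round(100 * count / inferred, 1)
        for source, count in stats.get("by_source", {}).items()
    }


# Example values copied from the statistics shown above.
example_stats = {
    "total_words": 50000,
    "inferred": 35000,
    "by_source": {
        "suffix_pattern": 20000,
        "prefix_pattern": 5000,
        "numeral_detection": 1000,
        "proper_noun_suffix": 3000,
        "ambiguous_registry": 6000,
    },
}

for source, pct in source_shares(example_stats).items():
    print(f"{source}: {pct}%")
```

A single source dominating the breakdown can be a hint to review that rule set before trusting the resulting tags.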
Inference Sources
The POSInferenceEngine uses multiple strategies:
1. Suffix Patterns
# Words ending in common suffixes
"စားခဲ့သည်" → "V" # Verb ending -သည်
"ကျောင်းသား" → "N" # Noun ending -သား
2. Prefix Patterns
# Words starting with common prefixes
"အလုပ်" → "N" # အ- prefix (nominalization)
"မသွား" → "V" # မ- prefix (negation)
3. Numeral Detection
# Numeric patterns
"၁၂၃" → "NUM"
"တစ်ရာ" → "NUM"
4. Proper Noun Patterns
# Naming patterns (Myanmar script has no letter case)
"ကိုမောင်" → "N" # Title + name
5. Ambiguous Words Registry
# Known multi-POS words
"ကြီး" → "ADJ|N|V" # Registered as ambiguous
POS Coverage Statistics
Check POS tag coverage in the database:
stats = manager.get_pos_coverage_stats()
print(stats)
# {
#     "total_words": 100000,
#     "with_pos_tag": 30000,       # From seed data
#     "with_inferred_pos": 45000,  # From inference
#     "combined_coverage": 65000,  # Either source
#     "no_pos": 35000,             # No POS info
#     "ambiguous": 5000,           # Multi-POS words
# }
Coverage Calculation
coverage_pct = (stats["combined_coverage"] / stats["total_words"]) * 100
print(f"POS Coverage: {coverage_pct:.1f}%")
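Note that `combined_coverage` is a union, not a sum: a word with both a seed tag and an inferred tag is counted once. The sketch below shows one way such counts could be computed directly in SQLite. It is an illustration under assumptions, not the manager's actual query: the function name `coverage_stats_sketch` is hypothetical, and it treats both NULL and the empty string as "no tag".

```python
import sqlite3


def coverage_stats_sketch(conn: sqlite3.Connection) -> dict:
    has_tag = "(pos_tag IS NOT NULL AND pos_tag != '')"
    has_inferred = "(inferred_pos IS NOT NULL AND inferred_pos != '')"
    total, tagged, inferred, combined = conn.execute(
        f"SELECT COUNT(*), SUM({has_tag}), SUM({has_inferred}), "
        f"SUM({has_tag} OR {has_inferred}) FROM words"
    ).fetchone()
    # SUM() yields NULL on an empty table, so normalise to 0.
    tagged, inferred, combined = tagged or 0, inferred or 0, combined or 0
    return {
        "total_words": total,
        "with_pos_tag": tagged,
        "with_inferred_pos": inferred,
        "combined_coverage": combined,
        "no_pos": total - combined,
    }


def coverage_percent(stats: dict) -> float:
    """Percentage of words with a POS tag from either source (0.0 if empty)."""
    total = stats["total_words"]
    return 100 * stats["combined_coverage"] / total if total else 0.0
```

Guarding against an empty table avoids a division by zero when the check runs before any words are loaded.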
Database Schema
The manager updates these columns:
-- Words table columns for inferred POS
ALTER TABLE words ADD COLUMN inferred_pos TEXT;
ALTER TABLE words ADD COLUMN inferred_confidence REAL;
ALTER TABLE words ADD COLUMN inferred_source TEXT;
Column Usage
| Column | Description | Example |
|---|---|---|
| pos_tag | From seed data | "N" |
| inferred_pos | From inference | "N\|V" |
| inferred_confidence | Confidence score | 0.85 |
| inferred_source | Inference method | "suffix_pattern" |
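SQLite raises an error if `ALTER TABLE ... ADD COLUMN` names a column that already exists, so a migration that may run more than once should check first. The following is a minimal sketch of that check using `PRAGMA table_info`; the helper name `ensure_inferred_columns` is hypothetical and is not how the library necessarily manages its schema.

```python
import sqlite3

# Column names and types taken from the schema above.
INFERRED_COLUMNS = {
    "inferred_pos": "TEXT",
    "inferred_confidence": "REAL",
    "inferred_source": "TEXT",
}


def ensure_inferred_columns(conn: sqlite3.Connection) -> list:
    """Add any missing inferred-POS columns to `words`; return the names added."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(words)")}
    added = []
    for name, sql_type in INFERRED_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE words ADD COLUMN {name} {sql_type}")
            added.append(name)
    return added
```

Calling the helper a second time is a no-op, which makes it safe to run at the start of every build.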
Integration with Pipeline
# Pipeline delegates POS inference to DatabasePackager,
# which internally creates and manages POSInferenceManager.
# The Pipeline does NOT access conn/cursor directly.
from myspellchecker.data_pipeline import Pipeline
# During pipeline.run(), the packager stage handles POS inference:
# packager.apply_inferred_pos() is called internally
# which creates POSInferenceManager with the packager's own connection
Best Practices
1. Run After Data Loading
Recommended order:
1. Seed data loading (syllables, words, n-grams)
2. Corpus data loading and frequency counting
3. POS inference on indexed words
The Pipeline.build_database() method handles this order automatically. For manual control, use POSInferenceManager directly after loading data via DatabasePackager.
2. Use Appropriate Thresholds
# High-frequency words: more reliable inference
manager.apply_inferred_pos(
    min_frequency=10,
    min_confidence=0.7,
)
# Low-frequency words: lower thresholds
manager.apply_inferred_pos(
    min_frequency=2,
    min_confidence=0.5,
)
3. Check Coverage After Inference
import logging
logger = logging.getLogger(__name__)
stats = manager.get_pos_coverage_stats()
if stats["no_pos"] > stats["total_words"] * 0.5:
    logger.warning("More than 50% of words have no POS tag")
See Also