Corpus-derived dictionaries often leave many words without POS tags because those words were absent from the seed data, are domain-specific, or are newly encountered. This module fills that gap by applying suffix patterns, prefix patterns, numeral detection, and an ambiguous-word registry to infer POS tags with confidence scores.

Overview

from myspellchecker.data_pipeline.pos_inference_manager import POSInferenceManager

manager = POSInferenceManager(conn, cursor, console)

# Apply POS inference to untagged words
stats = manager.apply_inferred_pos(min_frequency=5)
print(f"Inferred POS for {stats['inferred']} words")

Purpose

During dictionary building, many words lack POS tags because:
  • They weren’t in the POS seed data
  • They’re domain-specific terms
  • They’re newly encountered words
The POSInferenceManager fills this gap using morphological rules.

POSInferenceManager Class

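import sqlite3
from typing import Optional

# PipelineConsole is the pipeline's console helper; its import is omitted here.
class POSInferenceManager: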
    """Manages POS inference for the database.

    Responsibilities:
    - Apply rule-based POS inference to words
    - Track POS coverage statistics
    - Report inference progress
    """

    def __init__(
        self,
        conn: sqlite3.Connection,
        cursor: sqlite3.Cursor,
        console: Optional[PipelineConsole] = None,
    ):
        self.conn = conn
        self.cursor = cursor
        self.console = console or PipelineConsole()

Applying POS Inference

Basic Usage

manager = POSInferenceManager(conn, cursor)

# Apply inference to all untagged words
stats = manager.apply_inferred_pos()

With Options

stats = manager.apply_inferred_pos(
    min_frequency=5,        # Only infer for words with freq >= 5
    skip_tagged=True,       # Skip words that already have pos_tag
    min_confidence=0.6,     # Only apply if confidence >= 0.6
    in_transaction=False,   # Commit after updates
)

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| min_frequency | 0 | Minimum word frequency threshold |
| skip_tagged | True | Skip words with existing pos_tag |
| min_confidence | 0.0 | Minimum confidence for inference |
| in_transaction | False | If True, do not commit; the caller manages the transaction |
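The way these thresholds interact can be sketched as a single decision function. This is a minimal illustration, not the library's code: the function name `should_apply`, the order of the checks, and the `"skipped_low_freq"` label are assumptions made for the example.

```python
from typing import Optional


def should_apply(
    frequency: int,
    existing_pos_tag: Optional[str],
    confidence: float,
    *,
    min_frequency: int = 0,
    skip_tagged: bool = True,
    min_confidence: float = 0.0,
) -> str:
    """Return "apply" or a skip reason for one word (hypothetical helper)."""
    # Words below the frequency threshold are never considered.
    if frequency < min_frequency:
        return "skipped_low_freq"  # assumed label, not a documented stats key
    # Words that already carry a seed tag are left alone when skip_tagged is set.
    if skip_tagged and existing_pos_tag:
        return "skipped_tagged"
    # An inference below the confidence floor is discarded.
    if confidence < min_confidence:
        return "skipped_low_conf"
    return "apply"


print(should_apply(10, None, 0.8, min_frequency=5, min_confidence=0.6))
```

With the defaults (`min_frequency=0`, `min_confidence=0.0`), only the `skip_tagged` check can reject a word in this sketch.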

Return Statistics

stats = manager.apply_inferred_pos()

print(stats)
# {
#     "total_words": 50000,
#     "inferred": 35000,
#     "skipped_tagged": 10000,
#     "skipped_low_conf": 2000,
#     "ambiguous": 5000,
#     "by_source": {
#         "suffix_pattern": 20000,
#         "prefix_pattern": 5000,
#         "numeral_detection": 1000,
#         "proper_noun_suffix": 3000,
#         "ambiguous_registry": 6000,
#     }
# }

Statistics Fields

| Field | Description |
| --- | --- |
| total_words | Total words processed |
| inferred | Words with successful inference |
| skipped_tagged | Words skipped (already had pos_tag) |
| skipped_low_conf | Words skipped due to low confidence |
| ambiguous | Words with multi-POS (e.g., "NV") |
| by_source | Breakdown by inference source |
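A small helper can turn the `by_source` breakdown into shares of all inferred words, which makes it easier to see which strategy carries most of the load. The helper name `source_shares` is hypothetical, and the sample dictionary simply reuses the illustrative numbers shown above.

```python
def source_shares(stats: dict) -> dict:
    """Return each inference source's share of all inferred words (0.0 to 1.0)."""
    inferred = stats.get("inferred", 0)
    if inferred == 0:
        return {}  # nothing was inferred, so there is nothing to break down
    return {
        source: count / inferred
        for source, count in stats.get("by_source", {}).items()
    }


# Illustrative numbers copied from the example output above
sample_stats = {
    "inferred": 35000,
    "by_source": {
        "suffix_pattern": 20000,
        "prefix_pattern": 5000,
        "numeral_detection": 1000,
        "proper_noun_suffix": 3000,
        "ambiguous_registry": 6000,
    },
}

shares = source_shares(sample_stats)
for source, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{source}: {share:.1%}")
```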

Inference Sources

The POSInferenceEngine uses multiple strategies:

1. Suffix Patterns

# Words ending in common suffixes
"စားခဲ့သည်" → "V"  # Verb ending -သည်
"ကျောင်းသား" → "N" # Noun ending -သား

2. Prefix Patterns

# Words starting with common prefixes
"အလုပ်" → "N"  # အ- prefix (nominalization)
"မသွား" → "V"  # မ- prefix (negation)

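Suffix and prefix matching of this kind can be sketched with plain string tests. The rule tables, confidence values, and the function name `infer_by_affix` below are assumptions for illustration; the actual rules and scores used by POSInferenceEngine are not shown in this document.

```python
from typing import Optional, Tuple

# Hypothetical rule tables: (affix, POS tag, confidence)
SUFFIX_RULES = [
    ("သည်", "V", 0.8),  # verb ending
    ("သား", "N", 0.7),  # noun ending
]
PREFIX_RULES = [
    ("အ", "N", 0.6),  # nominalizing prefix
    ("မ", "V", 0.6),  # negation prefix
]


def infer_by_affix(word: str) -> Optional[Tuple[str, float, str]]:
    """Return (tag, confidence, source) for the first matching rule, else None."""
    # Suffix rules are tried first in this sketch; the real priority is unknown.
    for suffix, tag, confidence in SUFFIX_RULES:
        if len(word) > len(suffix) and word.endswith(suffix):
            return tag, confidence, "suffix_pattern"
    for prefix, tag, confidence in PREFIX_RULES:
        if len(word) > len(prefix) and word.startswith(prefix):
            return tag, confidence, "prefix_pattern"
    return None


for example in ("စားခဲ့သည်", "ကျောင်းသား", "အလုပ်", "မသွား"):
    print(example, infer_by_affix(example))
```

The length check keeps a bare affix from being tagged as if it were a full word.
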
3. Numeral Detection

# Numeric patterns
"၁၂၃" → "NUM"
"တစ်ရာ" → "NUM"

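One way to approximate numeral detection is to accept either a run of Myanmar digits (U+1040 to U+1049) or a word built entirely from known numeral morphemes. The morpheme list and the function name `is_numeral` are assumptions for this sketch and are far from exhaustive.

```python
import re

# Myanmar digits occupy U+1040 (zero) through U+1049 (nine).
MYANMAR_DIGITS = re.compile(r"[\u1040-\u1049]+")

# Small, hypothetical list of spelled-out numeral morphemes
NUMERAL_MORPHEMES = ["တစ်", "နှစ်", "သုံး", "ဆယ်", "ရာ", "ထောင်"]
NUMERAL_WORD = re.compile(
    "(?:" + "|".join(re.escape(m) for m in NUMERAL_MORPHEMES) + ")+"
)


def is_numeral(word: str) -> bool:
    """True if the word is all Myanmar digits or all known numeral morphemes."""
    return bool(MYANMAR_DIGITS.fullmatch(word) or NUMERAL_WORD.fullmatch(word))


for example in ("၁၂၃", "တစ်ရာ", "အလုပ်"):
    print(example, is_numeral(example))
```
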
4. Proper Noun Patterns

# Naming patterns, such as an honorific title followed by a name
"ကိုမောင်" → "N"  # Title + name

5. Ambiguous Words Registry

# Known multi-POS words
"ကြီး" → "ADJ|N|V"  # Registered as ambiguous

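A registry lookup like this can be modeled as a dictionary from word to its candidate tags. The registry contents, the `|` separator (taken from the example above), and the function name `lookup_ambiguous` are illustrative assumptions rather than the library's data structures.

```python
from typing import Optional

# Hypothetical registry: word -> tuple of candidate POS tags
AMBIGUOUS_REGISTRY = {
    "ကြီး": ("ADJ", "N", "V"),
}


def lookup_ambiguous(word: str) -> Optional[str]:
    """Return the joined multi-POS tag for a registered word, else None."""
    tags = AMBIGUOUS_REGISTRY.get(word)
    return "|".join(tags) if tags else None


print(lookup_ambiguous("ကြီး"))
```
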
POS Coverage Statistics

Check POS tag coverage in the database:
stats = manager.get_pos_coverage_stats()

print(stats)
# {
#     "total_words": 100000,
#     "with_pos_tag": 30000,      # From seed data
#     "with_inferred_pos": 45000, # From inference
#     "combined_coverage": 65000, # Either source
#     "no_pos": 35000,            # No POS info
#     "ambiguous": 5000,          # Multi-POS words
# }

Coverage Calculation

coverage_pct = (stats["combined_coverage"] / stats["total_words"]) * 100
print(f"POS Coverage: {coverage_pct:.1f}%")
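On a freshly created or empty database `total_words` may be zero, in which case the division above raises `ZeroDivisionError`. A guarded variant might look like the following; the helper name `coverage_percent` is hypothetical and the sample numbers are the illustrative ones from the output above.

```python
def coverage_percent(stats: dict) -> float:
    """Return combined POS coverage as a percentage, or 0.0 for an empty table."""
    total = stats.get("total_words", 0)
    if total <= 0:
        return 0.0  # avoid dividing by zero when no words have been loaded
    return stats.get("combined_coverage", 0) / total * 100


sample_stats = {"total_words": 100000, "combined_coverage": 65000}
print(f"POS Coverage: {coverage_percent(sample_stats):.1f}%")
```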

Database Schema

The manager updates these columns:
-- Words table columns for inferred POS
ALTER TABLE words ADD COLUMN inferred_pos TEXT;
ALTER TABLE words ADD COLUMN inferred_confidence REAL;
ALTER TABLE words ADD COLUMN inferred_source TEXT;
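To see how these columns sit alongside `pos_tag`, the sketch below builds a throwaway in-memory table, adds the three columns, writes one inferred tag, and reads back an effective tag that prefers the seed value. The base table layout (`word`, `frequency`, `pos_tag`) and the sample rows are assumptions; the real schema may contain more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Assumed minimal base table; only pos_tag and the inferred_* columns come from the docs.
cursor.execute(
    "CREATE TABLE words (word TEXT PRIMARY KEY, frequency INTEGER, pos_tag TEXT)"
)
for column, column_type in (
    ("inferred_pos", "TEXT"),
    ("inferred_confidence", "REAL"),
    ("inferred_source", "TEXT"),
):
    cursor.execute(f"ALTER TABLE words ADD COLUMN {column} {column_type}")

cursor.executemany(
    "INSERT INTO words (word, frequency, pos_tag) VALUES (?, ?, ?)",
    [("alpha", 12, "N"), ("beta", 7, None)],  # placeholder words
)

# Store an inferred tag only for the word that has no seed tag.
cursor.execute(
    "UPDATE words SET inferred_pos = ?, inferred_confidence = ?, inferred_source = ? "
    "WHERE word = ? AND pos_tag IS NULL",
    ("V", 0.85, "suffix_pattern", "beta"),
)
conn.commit()

# Prefer the seed tag and fall back to the inferred one.
rows = cursor.execute(
    "SELECT word, COALESCE(pos_tag, inferred_pos) FROM words ORDER BY word"
).fetchall()
print(rows)
```

Keeping inferred values in separate columns means the seed tags are never overwritten, so inference can be re-run with different thresholds.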

Column Usage

| Column | Description | Example |
| --- | --- | --- |
| pos_tag | From seed data | "N" |
| inferred_pos | From inference | "NV" |
| inferred_confidence | Confidence score | 0.85 |
| inferred_source | Inference method | "suffix_pattern" |

Integration with Pipeline

# Pipeline delegates POS inference to DatabasePackager,
# which internally creates and manages POSInferenceManager.
# The Pipeline does NOT access conn/cursor directly.

from myspellchecker.data_pipeline import Pipeline

# During pipeline.run(), the packager stage handles POS inference:
# packager.apply_inferred_pos() is called internally
# which creates POSInferenceManager with the packager's own connection

Best Practices

1. Run After Data Loading

# Correct order in pipeline
pipeline.load_seed_data()      # Load POS seed first
pipeline.load_corpus_data()    # Load corpus
pipeline.apply_pos_inference() # Then infer missing POS

2. Use Appropriate Thresholds

# High-frequency words: more reliable inference
manager.apply_inferred_pos(
    min_frequency=10,
    min_confidence=0.7,
)

# Low-frequency words: lower thresholds
manager.apply_inferred_pos(
    min_frequency=2,
    min_confidence=0.5,
)

3. Check Coverage After Inference

import logging

logger = logging.getLogger(__name__)
stats = manager.get_pos_coverage_stats()

if stats["no_pos"] > stats["total_words"] * 0.5:
    logger.warning("More than 50% of words have no POS tag")
