POS Inference Manager - mySpellChecker

Corpus-derived dictionaries often leave many words without POS tags because they were not in the seed data, are domain-specific, or are newly encountered. This module fills that gap by applying suffix patterns, prefix patterns, numeral detection, proper-noun patterns, and an ambiguous-word registry to infer POS tags with confidence scores.

Overview

from myspellchecker.data_pipeline.pos_inference_manager import POSInferenceManager

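# conn and cursor: an open sqlite3 connection and cursor for the dictionary database
# console: an optional PipelineConsole used for progress reporting
manager = POSInferenceManager(conn, cursor, console)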

# Apply POS inference to untagged words
stats = manager.apply_inferred_pos(min_frequency=5)
print(f"Inferred POS for {stats['inferred']} words")

Purpose

During dictionary building, many words lack POS tags because:

- They weren’t in the POS seed data
- They’re domain-specific terms
- They’re newly encountered words

The POSInferenceManager fills this gap using morphological rules.

POSInferenceManager Class

import sqlite3
from typing import Optional


class POSInferenceManager:
    """Manages POS inference for the database.

    Responsibilities:
    - Apply rule-based POS inference to words
    - Track POS coverage statistics
    - Report inference progress
    """

    def __init__(
        self,
        conn: sqlite3.Connection,
        cursor: sqlite3.Cursor,
        console: Optional[PipelineConsole] = None,
    ):
        self.conn = conn
        self.cursor = cursor
        self.console = console or PipelineConsole()

Applying POS Inference

Basic Usage

manager = POSInferenceManager(conn, cursor)

# Apply inference to all untagged words
stats = manager.apply_inferred_pos()

With Options

stats = manager.apply_inferred_pos(
    min_frequency=5,        # Only infer for words with freq >= 5
    skip_tagged=True,       # Skip words that already have pos_tag
    min_confidence=0.6,     # Only apply if confidence >= 0.6
    in_transaction=False,   # Commit after updates
)

Parameters

Parameter	Default	Description
`min_frequency`	0	Minimum word frequency threshold
`skip_tagged`	True	Skip words with existing pos_tag
`min_confidence`	0.0	Minimum confidence for inference
`in_transaction`	False	If True, skip the commit (the caller manages the transaction)
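
How the three filtering options interact can be sketched in plain Python. This is a hypothetical illustration, not the library's implementation: `filter_and_infer`, the row shape, and the `infer` callback are assumptions made for the example, as is the choice to leave words below `min_frequency` out of `total_words`.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical row shape: (word, frequency, existing pos_tag or None)
Row = Tuple[str, int, Optional[str]]
# Hypothetical inference callback: returns (tag, confidence), or None if no rule matches
Infer = Callable[[str], Optional[Tuple[str, float]]]


def filter_and_infer(
    rows: Iterable[Row],
    infer: Infer,
    min_frequency: int = 0,
    skip_tagged: bool = True,
    min_confidence: float = 0.0,
) -> Tuple[Dict[str, Tuple[str, float]], Dict[str, int]]:
    """Apply the three filters in order and collect simple counters."""
    stats = {"total_words": 0, "inferred": 0, "skipped_tagged": 0, "skipped_low_conf": 0}
    results: Dict[str, Tuple[str, float]] = {}
    for word, freq, pos_tag in rows:
        if freq < min_frequency:
            continue  # assumption: rare words are not processed or counted
        stats["total_words"] += 1
        if skip_tagged and pos_tag:
            stats["skipped_tagged"] += 1
            continue
        guess = infer(word)
        if guess is None:
            continue  # no rule matched this word
        tag, confidence = guess
        if confidence < min_confidence:
            stats["skipped_low_conf"] += 1
            continue
        results[word] = (tag, confidence)
        stats["inferred"] += 1
    return results, stats
```

Under these assumptions, raising `min_frequency` shrinks the set of words considered, while raising `min_confidence` only moves words from `inferred` to `skipped_low_conf`.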

Return Statistics

stats = manager.apply_inferred_pos()

print(stats)
# {
#     "total_words": 50000,
#     "inferred": 35000,
#     "skipped_tagged": 10000,
#     "skipped_low_conf": 2000,
#     "ambiguous": 5000,
#     "by_source": {
#         "suffix_pattern": 20000,
#         "prefix_pattern": 5000,
#         "numeral_detection": 1000,
#         "proper_noun_suffix": 3000,
#         "ambiguous_registry": 6000,
#     }
# }

Statistics Fields

Field	Description
`total_words`	Total words processed
`inferred`	Words with successful inference
`skipped_tagged`	Words skipped (already had pos_tag)
`skipped_low_conf`	Words skipped due to low confidence
`ambiguous`	Words with multiple POS tags (e.g., "N|V")
`by_source`	Breakdown by inference source
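
To see which strategy contributes most, the `by_source` breakdown can be turned into percentages. The helper below is a small sketch; `source_shares` is a hypothetical name, and it assumes only the dictionary shape shown in the example output above.

```python
from typing import Dict


def source_shares(stats: Dict) -> Dict[str, float]:
    """Return each inference source's share of all inferred words, in percent."""
    by_source = stats.get("by_source", {})
    total = sum(by_source.values())
    if total == 0:
        return {}
    # Largest contributor first
    ordered = sorted(by_source.items(), key=lambda item: item[1], reverse=True)
    return {source: round(100 * count / total, 1) for source, count in ordered}


example_stats = {
    "inferred": 35000,
    "by_source": {
        "suffix_pattern": 20000,
        "prefix_pattern": 5000,
        "numeral_detection": 1000,
        "proper_noun_suffix": 3000,
        "ambiguous_registry": 6000,
    },
}

for source, pct in source_shares(example_stats).items():
    print(f"{source}: {pct}%")
```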

Inference Sources

The POSInferenceEngine uses multiple strategies:

1. Suffix Patterns

# Words ending in common suffixes
"စားခဲ့သည်" → "V"  # Verb ending -သည်
"ကျောင်းသား" → "N" # Noun ending -သား

2. Prefix Patterns

# Words starting with common prefixes
"အလုပ်" → "N"  # အ- prefix (nominalization)
"မသွား" → "V"  # မ- prefix (negation)

3. Numeral Detection

# Numeric patterns
"၁၂၃" → "NUM"
"တစ်ရာ" → "NUM"

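Digit-only numerals can be recognized from the Unicode range alone, since the Myanmar digits occupy U+1040 through U+1049. The sketch below is an assumption about how such a check might look, not the engine's actual code; `infer_numeral` and `NUMBER_WORDS` are hypothetical names, and the one-entry lexicon simply reuses the spelled-out example above.

```python
from typing import Optional

# Myanmar digits occupy U+1040 (zero) through U+1049 (nine).
MYANMAR_DIGITS = {chr(cp) for cp in range(0x1040, 0x104A)}

# Hypothetical lexicon of spelled-out numerals; a real one would be much larger.
NUMBER_WORDS = {"တစ်ရာ"}


def infer_numeral(word: str) -> Optional[str]:
    """Return "NUM" for digit strings or known number words, else None."""
    if word and all(ch in MYANMAR_DIGITS for ch in word):
        return "NUM"
    if word in NUMBER_WORDS:
        return "NUM"
    return None
```

Spelled-out numerals cannot be detected by code point, which is why this sketch falls back to a lookup set for them.
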
4. Proper Noun Patterns

# Naming patterns (Burmese script has no capitalization)
"ကိုမောင်" → "N"  # Title + name

5. Ambiguous Words Registry

# Known multi-POS words
"ကြီး" → "ADJ|N|V"  # Registered as ambiguous

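The strategies above can be combined into a simple rule cascade that returns a tag, a confidence, and the source that produced it. This is a minimal sketch under stated assumptions: the rule tables contain only the examples from this section, the confidence values are invented for illustration, and the lookup order (registry, then suffixes, then prefixes) is a guess rather than the documented behavior of POSInferenceEngine.

```python
from typing import Optional, Tuple

# Illustrative rules taken from the examples above; confidences are made up.
AMBIGUOUS = {"ကြီး": "ADJ|N|V"}
SUFFIX_RULES = [("သည်", "V", 0.9), ("သား", "N", 0.8)]
PREFIX_RULES = [("အ", "N", 0.7), ("မ", "V", 0.7)]


def infer_pos(word: str) -> Optional[Tuple[str, float, str]]:
    """Return (tag, confidence, source) from the first matching rule, else None."""
    if word in AMBIGUOUS:
        return AMBIGUOUS[word], 1.0, "ambiguous_registry"
    for suffix, tag, confidence in SUFFIX_RULES:
        # Require a non-empty stem so a bare suffix is not tagged
        if word.endswith(suffix) and len(word) > len(suffix):
            return tag, confidence, "suffix_pattern"
    for prefix, tag, confidence in PREFIX_RULES:
        if word.startswith(prefix) and len(word) > len(prefix):
            return tag, confidence, "prefix_pattern"
    return None
```

Returning the source alongside the tag is what makes a per-source breakdown such as `by_source` possible.
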
POS Coverage Statistics

Check POS tag coverage in the database:

stats = manager.get_pos_coverage_stats()

print(stats)
# {
#     "total_words": 100000,
#     "with_pos_tag": 30000,      # From seed data
#     "with_inferred_pos": 45000, # From inference
#     "combined_coverage": 65000, # Either source
#     "no_pos": 35000,            # No POS info
#     "ambiguous": 5000,          # Multi-POS words
# }

Coverage Calculation

coverage_pct = (stats["combined_coverage"] / stats["total_words"]) * 100
print(f"POS Coverage: {coverage_pct:.1f}%")

Database Schema

The manager updates these columns:

-- Words table columns for inferred POS
ALTER TABLE words ADD COLUMN inferred_pos TEXT;
ALTER TABLE words ADD COLUMN inferred_confidence REAL;
ALTER TABLE words ADD COLUMN inferred_source TEXT;

Column Usage

Column	Description	Example
`pos_tag`	From seed data	"N"
`inferred_pos`	From inference	"N|V"
`inferred_confidence`	Confidence score	0.85
`inferred_source`	Inference method	"suffix_pattern"
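
The sketch below shows how these columns might be written and read together, using an in-memory SQLite database. The `word` and `frequency` columns and the sample rows are assumptions for illustration; only `pos_tag` and the three `inferred_*` columns come from the schema above. Preferring the seed tag over the inferred one via `COALESCE` is also an assumption about how a consumer might combine them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical minimal words table carrying the documented POS columns
cur.execute(
    "CREATE TABLE words ("
    "word TEXT PRIMARY KEY, frequency INTEGER, pos_tag TEXT, "
    "inferred_pos TEXT, inferred_confidence REAL, inferred_source TEXT)"
)
cur.executemany(
    "INSERT INTO words (word, frequency, pos_tag) VALUES (?, ?, ?)",
    [("w1", 12, "N"), ("w2", 7, None), ("w3", 3, None)],
)

# Record one inferred tag together with its confidence and source
cur.execute(
    "UPDATE words SET inferred_pos = ?, inferred_confidence = ?, inferred_source = ? "
    "WHERE word = ?",
    ("N|V", 0.85, "suffix_pattern", "w2"),
)
conn.commit()

# Effective tag: seed tag if present, otherwise the inferred one
effective = dict(
    cur.execute(
        "SELECT word, COALESCE(pos_tag, inferred_pos) FROM words ORDER BY word"
    ).fetchall()
)

# Coverage counts in the spirit of get_pos_coverage_stats()
total, with_tag, with_inferred, combined, no_pos = cur.execute(
    "SELECT COUNT(*), "
    "SUM(pos_tag IS NOT NULL), "
    "SUM(inferred_pos IS NOT NULL), "
    "SUM(pos_tag IS NOT NULL OR inferred_pos IS NOT NULL), "
    "SUM(pos_tag IS NULL AND inferred_pos IS NULL) "
    "FROM words"
).fetchone()
```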

Integration with Pipeline

# Pipeline delegates POS inference to DatabasePackager,
# which internally creates and manages POSInferenceManager.
# The Pipeline does NOT access conn/cursor directly.

from myspellchecker.data_pipeline import Pipeline

# During pipeline.run(), the packager stage handles POS inference:
# packager.apply_inferred_pos() is called internally
# which creates POSInferenceManager with the packager's own connection

Best Practices

1. Run After Data Loading

Run POS inference last, after both loading stages:

1. Seed data loading (syllables, words, n-grams)
2. Corpus data loading and frequency counting
3. POS inference on indexed words

The Pipeline.build_database() method handles this order automatically. For manual control, use POSInferenceManager directly after loading data via DatabasePackager.

2. Use Appropriate Thresholds

# High-frequency words: more reliable inference
manager.apply_inferred_pos(
    min_frequency=10,
    min_confidence=0.7,
)

# Low-frequency words: lower thresholds
manager.apply_inferred_pos(
    min_frequency=2,
    min_confidence=0.5,
)

3. Check Coverage After Inference

import logging

logger = logging.getLogger(__name__)

stats = manager.get_pos_coverage_stats()

if stats["no_pos"] > stats["total_words"] * 0.5:
    logger.warning("More than 50% of words have no POS tag")
