Morphology Analysis - mySpellChecker

When the spell checker encounters an unknown word, the morphology module decomposes it into roots and affixes to infer likely POS tags and generate better suggestions.

Overview

Myanmar words are often formed by combining roots with suffixes and prefixes. The MorphologyAnalyzer can:

Guess POS tags based on word morphology
Decompose words into roots and suffixes
Identify numeral words
Handle ambiguous words with multiple POS possibilities
Extract morphological patterns for OOV recovery

MorphologyAnalyzer

The main class for morphological analysis.

Basic POS Guessing

from myspellchecker.text.morphology import MorphologyAnalyzer

analyzer = MorphologyAnalyzer()

# Get all possible POS tags (set)
tags = analyzer.guess_pos("စားပြီ")
print(tags)  # {'V', 'P_SENT'}

# Get ranked POS guesses with confidence
guesses = analyzer.guess_pos_ranked("စားပြီ")
for guess in guesses:
    print(f"{guess.tag}: {guess.confidence:.2f} ({guess.reason})")
# Output:
# P_SENT: 0.67 (particle suffix "ပြီ")
# V: 0.45 (verb suffix "ပြီ")

# Get the most likely POS tag
best = analyzer.guess_pos_best("စားပြီ")
print(best)  # "P_SENT"

Multi-POS Support

Handle words with multiple possible parts of speech:

# Get all possible tags for ambiguous words
tags, confidence, source = analyzer.guess_pos_multi("ကြီး")
print(tags)       # frozenset({'N', 'V', 'ADJ'})
print(confidence) # 0.70 (medium - needs context)
print(source)     # "ambiguous_registry"

# Non-ambiguous words
tags, confidence, source = analyzer.guess_pos_multi("အလုပ်")
print(tags)       # frozenset({'N'})
print(confidence) # 0.75
print(source)     # "morphological_inference"

Word Analysis

Decompose words into roots and suffixes:

from myspellchecker.text.morphology import analyze_word, MorphologyAnalyzer

# Analyze word structure
result = analyze_word("စားခဲ့သည်")

print(result.original)    # "စားခဲ့သည်"
print(result.root)        # "စား"
print(result.suffixes)    # ['ခဲ့', 'သည်']
print(result.confidence)  # 0.85
print(result.is_compound) # False
print(result.pos_guesses) # List of POSGuess for the original word

With Dictionary Validation

Validate extracted roots against a dictionary:

# Define a dictionary check function
def is_valid_word(word):
    valid_words = {"စား", "သွား", "လာ", "ကြည့်"}
    return word in valid_words

# Analyze with dictionary validation
result = analyze_word("စားခဲ့သည်", dictionary_check=is_valid_word)
print(result.root)        # "စား" (validated)
print(result.confidence)  # Higher due to dictionary validation

Using Cached Analyzer

For better performance on repeated calls:

from myspellchecker.text.morphology import get_cached_analyzer

# Get analyzer with Stemmer integration (LRU caching)
analyzer = get_cached_analyzer()

# Or use convenience function with caching
result = analyze_word("စားခဲ့သည်", use_cache=True)

Numeral Detection

Detect Myanmar numerals (digits and words):

from myspellchecker.text.morphology import is_numeral_word, get_numeral_pos_guess

# Check for numerals
is_numeral_word("၁၂၃")   # True (Myanmar digits)
is_numeral_word("သုံး")   # True (numeral word "three")
is_numeral_word("စား")   # False

# Get POS guess for numerals
guess = get_numeral_pos_guess("၁၂၃")
print(guess.tag)         # "NUM"
print(guess.confidence)  # 0.99 (very high for digits)

guess = get_numeral_pos_guess("သုံး")
print(guess.confidence)  # 0.95 (slightly lower for words)
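
As a rough illustration of what digit-versus-word detection involves, the sketch below checks for Myanmar digits (U+1040 through U+1049) and then falls back to a word list. The names `sketch_is_numeral`, `sketch_numeral_confidence`, and `NUMERAL_WORDS` are hypothetical and are not part of myspellchecker; the word list holds only the single numeral word used in the example above, and the two confidence values are the ones quoted there.

```python
# Myanmar digits occupy U+1040 through U+1049.
MYANMAR_DIGITS = frozenset(chr(cp) for cp in range(0x1040, 0x104A))

# Hypothetical, deliberately tiny word list; a real one would be much longer.
NUMERAL_WORDS = frozenset({"သုံး"})


def sketch_is_numeral(word):
    """Return True for all-digit strings or known numeral words."""
    if not word:
        return False
    if all(ch in MYANMAR_DIGITS for ch in word):
        return True
    return word in NUMERAL_WORDS


def sketch_numeral_confidence(word):
    """Return 0.99 for digit strings, 0.95 for numeral words, else None."""
    if not word:
        return None
    if all(ch in MYANMAR_DIGITS for ch in word):
        return 0.99
    if word in NUMERAL_WORDS:
        return 0.95
    return None


print(sketch_is_numeral("၁၂၃"))          # True
print(sketch_numeral_confidence("၁၂၃"))  # 0.99
```

Digits can be recognized from their code points alone, which is one plausible reason they earn a slightly higher confidence than numeral words that must be looked up.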

POSGuess Data Structure

Results include detailed reasoning:

from myspellchecker.text.morphology import POSGuess

# POSGuess attributes
guess = POSGuess(
    tag="V",
    confidence=0.85,
    reason='verb suffix "ခဲ့"'
)

print(guess.tag)        # "V"
print(guess.confidence) # 0.85
print(guess.reason)     # 'verb suffix "ခဲ့"'

WordAnalysis Data Structure

Complete word decomposition:

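from myspellchecker.text.morphology import POSGuess, WordAnalysis

# WordAnalysis attributes
analysis = WordAnalysis(
    original="စားခဲ့သည်",
    root="စား",
    suffixes=["ခဲ့", "သည်"],
    # Illustrative guess, reusing the values from the POSGuess example
    pos_guesses=[POSGuess(tag="V", confidence=0.85, reason='verb suffix "ခဲ့"')],
    confidence=0.85,
    is_compound=False
)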

POS Tag Priority

When multiple suffixes match, tags are prioritized in the order below (a lower number means a higher priority):

Priority	Tag	Description
0	NUM	Numerals (highest)
1	P_SENT	Sentence particles
2	P_MOD	Modifier particles
3	P_LOC	Location particles
4	P_SUBJ	Subject particles
5	P_OBJ	Object particles
6	V	Verbs
7	N	Nouns
8	ADJ	Adjectives
9	ADV	Adverbs
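
The ordering in this table can be pictured as a secondary sort key. The sketch below is a minimal illustration, not the library's implementation: `Guess` is a hypothetical stand-in for `POSGuess`, `rank_guesses` is an invented helper, and rounding confidence to two decimals is an assumed way of deciding when two scores count as "similar".

```python
from dataclasses import dataclass

# Priority values copied from the table; a lower number means higher priority.
TAG_PRIORITY = {
    "NUM": 0, "P_SENT": 1, "P_MOD": 2, "P_LOC": 3, "P_SUBJ": 4,
    "P_OBJ": 5, "V": 6, "N": 7, "ADJ": 8, "ADV": 9,
}


@dataclass(frozen=True)
class Guess:
    """Hypothetical stand-in for POSGuess (tag and confidence only)."""
    tag: str
    confidence: float


def rank_guesses(guesses):
    """Sort by confidence (descending), then by tag priority (ascending)."""
    return sorted(
        guesses,
        key=lambda g: (
            -round(g.confidence, 2),                    # assumed tie tolerance
            TAG_PRIORITY.get(g.tag, len(TAG_PRIORITY)),  # unknown tags go last
        ),
    )


ranked = rank_guesses([
    Guess("V", 0.60), Guess("ADV", 0.60), Guess("P_SENT", 0.60), Guess("N", 0.80),
])
print([g.tag for g in ranked])  # ['N', 'P_SENT', 'V', 'ADV']
```

Here the noun guess wins on confidence alone, while the three guesses tied at 0.60 fall back to the priority order.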

Confidence Scoring

Confidence is based on:

Numeral detection: 0.95-0.99 (highest)
Proper noun suffixes: 0.85-0.95
Prefix patterns (e.g., အ → Noun): 0.60-0.75
Suffix length ratio: Longer suffix matches = higher confidence
Tag priority: Tie-breaker for similar confidence

# Example confidence calculation
analyzer = MorphologyAnalyzer()

# Long suffix = higher confidence
guesses = analyzer.guess_pos_ranked("စားခဲ့သည်")
# "သည်" is longer relative to word, higher confidence

# Prefix-based inference
guesses = analyzer.guess_pos_ranked("အလုပ်")
# "အ" prefix indicates noun, ~0.60-0.75 confidence

Integration with Stemmer

For performance-critical applications, integrate with the Stemmer:

from myspellchecker.text.morphology import MorphologyAnalyzer
from myspellchecker.text.stemmer import Stemmer

# Create analyzer with Stemmer (has LRU cache)
stemmer = Stemmer()
analyzer = MorphologyAnalyzer(stemmer=stemmer)

# Suffix stripping is now cached
result1 = analyzer.analyze_word("စားခဲ့သည်")  # Computes
result2 = analyzer.analyze_word("စားခဲ့သည်")  # Returns cached result

Configuration

The MorphologyConfig controls confidence values used in OOV recovery and suffix-based POS guessing. Pass it to MorphologyAnalyzer to tune morphological analysis behavior.

from myspellchecker.core.config import MorphologyConfig
from myspellchecker.text.morphology import MorphologyAnalyzer

morph_config = MorphologyConfig(
    # POS guessing confidence multipliers
    particle_confidence_boost=1.2,  # Multiplier for particle suffix confidence
    particle_confidence_cap=1.0,    # Max confidence after particle boost
    verb_suffix_weight=0.9,         # Weight for verb suffix scoring
    noun_suffix_weight=0.85,        # Weight for noun suffix scoring
    adverb_suffix_weight=0.8,       # Weight for adverb suffix scoring

    # OOV recovery confidence
    oov_base_confidence=0.3,        # Base confidence for suffix analysis
    oov_scale_factor=0.7,           # Scale factor for suffix ratio contribution
    oov_cap=0.95,                   # Max confidence from suffix analysis alone
    dictionary_boost=0.2,           # Boost when dictionary confirms root
    dictionary_cap=0.98,            # Max confidence after dictionary boost
    fallback_with_dict=0.5,         # Confidence: no suffixes but root in dict
    fallback_without_dict=0.2,      # Confidence: no suffixes, unknown root

    # Safety limit
    max_suffix_strip_iterations=5,  # Max iterations for suffix stripping
)

analyzer = MorphologyAnalyzer(morphology_config=morph_config)

Field	Default	Description
`particle_confidence_boost`	`1.2`	Multiplier for particle suffix confidence (particles are reliable indicators)
`particle_confidence_cap`	`1.0`	Maximum confidence after particle boost
`verb_suffix_weight`	`0.9`	Weight for verb suffix confidence scoring
`noun_suffix_weight`	`0.85`	Weight for noun suffix confidence scoring
`adverb_suffix_weight`	`0.8`	Weight for adverb suffix confidence scoring
`oov_base_confidence`	`0.3`	Base confidence for OOV suffix analysis
`oov_scale_factor`	`0.7`	Scale factor for suffix ratio contribution
`oov_cap`	`0.95`	Maximum confidence from suffix analysis alone
`dictionary_boost`	`0.2`	Confidence boost when dictionary confirms the extracted root
`dictionary_cap`	`0.98`	Maximum confidence after dictionary boost
`fallback_with_dict`	`0.5`	Confidence when no suffixes found but root is in dictionary
`fallback_without_dict`	`0.2`	Confidence when no suffixes found and root is unknown
`max_suffix_strip_iterations`	`5`	Maximum iterations for suffix stripping (prevents infinite loops)
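
To make the OOV fields more concrete, here is one plausible way they could combine, inferred only from the field descriptions in the table. The formula, the `OOVConfidenceParams` class, and the `sketch_oov_confidence` function are all assumptions made for illustration; the library's actual calculation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OOVConfidenceParams:
    """Hypothetical container; defaults copied from the table."""
    oov_base_confidence: float = 0.3
    oov_scale_factor: float = 0.7
    oov_cap: float = 0.95
    dictionary_boost: float = 0.2
    dictionary_cap: float = 0.98
    fallback_with_dict: float = 0.5
    fallback_without_dict: float = 0.2


def sketch_oov_confidence(word, suffixes, root_in_dict, p=OOVConfidenceParams()):
    """Assumed formula: base + scale * suffix_ratio, capped, then boosted."""
    if not suffixes:
        # No suffixes stripped: use one of the two fallback values.
        return p.fallback_with_dict if root_in_dict else p.fallback_without_dict
    # Share of the word (in characters) covered by the stripped suffixes.
    suffix_ratio = sum(len(s) for s in suffixes) / len(word)
    confidence = min(p.oov_base_confidence + p.oov_scale_factor * suffix_ratio,
                     p.oov_cap)
    if root_in_dict:
        confidence = min(confidence + p.dictionary_boost, p.dictionary_cap)
    return confidence


# A 10-character word whose suffixes cover 4 characters (ratio 0.4).
print(f"{sketch_oov_confidence('abcdefghij', ['ghij'], False):.2f}")  # 0.58
print(f"{sketch_oov_confidence('abcdefghij', ['ghij'], True):.2f}")   # 0.78
```

Under this reading, the two caps keep suffix evidence alone below 0.95 and dictionary-confirmed results below 0.98, so no inferred analysis ever reaches full certainty.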

Custom Grammar Config Path

Load morphology rules from a custom config directory:

# Use custom config path
analyzer = MorphologyAnalyzer(config_path="/path/to/grammar/config")

Suffix Categories

The analyzer recognizes these suffix types:

Verb Suffixes

ခဲ့ (past tense)
နေ (progressive)
မည် (future)
ပြီ (completion)
ရ (potential)

Noun Suffixes

များ (plural)
ချင်း (comparative)
လောက် (approximation)

Particle Suffixes

သည် (formal ending)
တယ် (colloquial ending)
မှာ (location)
ကို (object marker)

Adverb Suffixes

စွာ (manner)
အောင် (result)
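
The suffix lists above lend themselves to a simple iterative stripping loop. The sketch below is a minimal illustration under stated assumptions: `strip_suffixes` and `SUFFIX_CATEGORIES` are hypothetical names, longest-match-first is an assumed strategy, and the suffix inventory contains only the entries listed in this section. The iteration limit mirrors the `max_suffix_strip_iterations` setting.

```python
# Suffixes taken from the lists above; a real rule set would be far larger.
SUFFIX_CATEGORIES = {
    "ခဲ့": "verb", "နေ": "verb", "မည်": "verb", "ပြီ": "verb", "ရ": "verb",
    "များ": "noun", "ချင်း": "noun", "လောက်": "noun",
    "သည်": "particle", "တယ်": "particle", "မှာ": "particle", "ကို": "particle",
    "စွာ": "adverb", "အောင်": "adverb",
}


def strip_suffixes(word, max_iterations=5):
    """Peel known suffixes off the end of `word`, longest match first.

    Returns (root, suffixes) with suffixes in their original left-to-right
    order. Stops when nothing matches or the iteration limit is reached.
    """
    by_length = sorted(SUFFIX_CATEGORIES, key=len, reverse=True)
    root = word
    suffixes = []
    for _ in range(max_iterations):
        # Never strip the whole remaining string; a root must be left over.
        match = next(
            (s for s in by_length if root.endswith(s) and len(root) > len(s)),
            None,
        )
        if match is None:
            break
        suffixes.insert(0, match)
        root = root[: -len(match)]
    return root, suffixes


root, suffixes = strip_suffixes("စားခဲ့သည်")
print(root, suffixes)
print([SUFFIX_CATEGORIES[s] for s in suffixes])
```

Because suffixes are removed from the end but inserted at the front of the list, the root and the returned suffixes always concatenate back to the original word.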

Morphological Synthesis

In addition to morphological analysis (decomposition), the library provides morphological synthesis (validation of productive word formation). These modules validate OOV words formed through compounding and reduplication.

ReduplicationEngine

Validates words formed by repeating known dictionary words:

from myspellchecker.text.reduplication import ReduplicationEngine

# `segmenter`, `provider`, and `get_pos` are assumed to be defined elsewhere in your application
engine = ReduplicationEngine(
    segmenter=segmenter,
    min_base_frequency=5,       # Minimum frequency for base word
    cache_size=1024,
    allowed_base_pos=frozenset({"V", "ADJ", "ADV", "N"}),
)

result = engine.analyze(
    "ကောင်းကောင်း",
    dictionary_check=lambda w: provider.is_valid_word(w),
    frequency_check=lambda w: provider.get_word_frequency(w),
    pos_check=lambda w: get_pos(w),
)

if result and result.is_valid:
    print(f"Pattern: {result.pattern}")     # "AA" (simple repetition)
    print(f"Base: {result.base_word}")       # "ကောင်း"
    print(f"Confidence: {result.confidence}") # 0.90+

Supported patterns:

Pattern	Example	Description
AA	ကောင်းကောင်း	Simple repetition
AABB	သေသေချာချာ	Each syllable doubles (A-A-B-B)
ABAB	ခဏခဏ	Whole word repeats (AB-AB)
RHYME	ရှုပ်ယှက်	Known rhyme pairs
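
The three structural patterns in the table can be told apart by comparing syllables. The sketch below is an illustration only, assuming the word has already been segmented into a list of syllables; `classify_reduplication` is a hypothetical helper, not part of `ReduplicationEngine`, and it does not cover RHYME, which relies on a list of known pairs.

```python
def classify_reduplication(syllables):
    """Return (pattern, base) for AA, ABAB, or AABB, else None.

    `syllables` is a pre-segmented list of syllable strings.
    """
    n = len(syllables)
    if n < 2 or n % 2:
        return None
    half = n // 2
    # Whole base repeated: one syllable gives AA, more gives ABAB.
    if syllables[:half] == syllables[half:]:
        pattern = "AA" if half == 1 else "ABAB"
        return pattern, "".join(syllables[:half])
    # Each syllable of a two-syllable base doubled in place.
    if n == 4 and syllables[0] == syllables[1] and syllables[2] == syllables[3]:
        return "AABB", syllables[0] + syllables[2]
    return None


print(classify_reduplication(["ကောင်း", "ကောင်း"])[0])        # AA
print(classify_reduplication(["သေ", "သေ", "ချာ", "ချာ"])[0])  # AABB
print(classify_reduplication(["ခ", "ဏ", "ခ", "ဏ"])[0])        # ABAB
```

A structural match is only the first step: the recovered base would still need to pass the dictionary, frequency, and POS checks that `analyze` receives as callbacks.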

CompoundResolver

Validates compound words by splitting into known dictionary morphemes using dynamic programming:

from myspellchecker.text.compound_resolver import CompoundResolver

# `segmenter`, `provider`, and `get_pos` are assumed to be defined elsewhere in your application
resolver = CompoundResolver(
    segmenter=segmenter,
    min_morpheme_frequency=10,  # Minimum frequency per morpheme
    max_parts=4,                # Maximum parts in compound
    cache_size=1024,
)

result = resolver.resolve(
    "ကျောင်းသား",
    dictionary_check=lambda w: provider.is_valid_word(w),
    frequency_check=lambda w: provider.get_word_frequency(w),
    pos_check=lambda w: get_pos(w),
)

if result and result.is_valid:
    print(f"Parts: {result.parts}")      # ["ကျောင်း", "သား"]
    print(f"Pattern: {result.pattern}")   # "N+N"
    print(f"Confidence: {result.confidence}")

Allowed compound patterns (13 patterns):

Pattern	Description
N+N	Noun + Noun (e.g., ကျောင်းသား student)
V+V	Verb + Verb (e.g., စားသောက် eat & drink)
N+V	Noun + Verb (e.g., ရေချိုး bathe)
V+N	Verb + Noun (e.g., စားခန်း dining room)
ADJ+N	Adjective + Noun (e.g., ကြီးမား big city)
N+ADJ	Noun + Adjective
ADJ+ADJ	Adjective + Adjective
V+ADJ	Verb + Adjective
ADJ+V	Adjective + Verb
ADV+V	Adverb + Verb
ADV+N	Adverb + Noun
ADV+ADV	Adverb + Adverb
TN+N	Temporal Noun + Noun

Each pattern has a morphotactic bonus applied during DP scoring (e.g., N+N: +0.10, ADJ+N: +0.08) that favors linguistically common combinations. The DP algorithm penalizes additional parts beyond 2 via a configurable parts_penalty_multiplier (default: 2.0).
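
The scoring idea can be sketched as a memoized search over split points. Everything below is a simplified illustration, not the `CompoundResolver` implementation: `resolve_compound` and the `lexicon` mapping are hypothetical, only the two bonus values quoted above are included (the other eleven patterns are rejected here for lack of known values), and `EXTRA_PART_PENALTY` is an invented unit penalty that the stated multiplier scales.

```python
from functools import lru_cache

# Only these two bonuses are stated in the text; other patterns are omitted.
PATTERN_BONUS = {("N", "N"): 0.10, ("ADJ", "N"): 0.08}
PARTS_PENALTY_MULTIPLIER = 2.0  # default quoted in the text
EXTRA_PART_PENALTY = 0.1        # hypothetical unit penalty, not from the library


def resolve_compound(syllables, lexicon, max_parts=4):
    """Best split of `syllables` into 2..max_parts known morphemes, or None.

    `lexicon` maps morpheme -> POS tag, standing in for the dictionary
    and POS callbacks. Returns (parts, score).
    """
    syllables = tuple(syllables)
    n = len(syllables)
    if n < 2:
        return None

    @lru_cache(maxsize=None)
    def best(start, parts_used, prev_pos):
        # Best (score, parts) for syllables[start:], or None if unsplittable.
        if start == n:
            return (0.0, ())
        if parts_used == max_parts:
            return None
        winner = None
        for end in range(start + 1, n + 1):
            if start == 0 and end == n:
                continue  # a single part is not a compound
            morpheme = "".join(syllables[start:end])
            pos = lexicon.get(morpheme)
            if pos is None:
                continue
            if prev_pos is None:
                bonus = 0.0
            elif (prev_pos, pos) in PATTERN_BONUS:
                bonus = PATTERN_BONUS[(prev_pos, pos)]
            else:
                continue  # adjacent pattern not allowed in this sketch
            # Every part beyond the second is penalized.
            penalty = (EXTRA_PART_PENALTY * PARTS_PENALTY_MULTIPLIER
                       if parts_used >= 2 else 0.0)
            rest = best(end, parts_used + 1, pos)
            if rest is None:
                continue
            score = bonus - penalty + rest[0]
            if winner is None or score > winner[0]:
                winner = (score, (morpheme,) + rest[1])
        return winner

    result = best(0, 0, None)
    if result is None:
        return None
    score, parts = result
    return list(parts), score


print(resolve_compound(["ကျောင်း", "သား"], {"ကျောင်း": "N", "သား": "N"}))
```

With these assumed numbers, a two-part split outranks a three-part split of the same word whenever both are valid, which matches the stated intent of penalizing parts beyond two.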

Integration

Both engines are automatically integrated into WordValidator when enabled via config:

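# Import path assumed to match the MorphologyConfig import shown earlier; adjust if your version differs
from myspellchecker.core.config import SpellCheckerConfig, ValidationConfig

config = SpellCheckerConfig(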
    validation=ValidationConfig(
        use_reduplication_validation=True,   # Enabled by default
        use_compound_synthesis=True,         # Enabled by default
    )
)

Relationship to MorphologyAnalyzer

Module	Purpose	When Used
`text/morphology.py`	POS guessing, suffix stripping, OOV decomposition	Suggestion generation (analyzing unknown words)
`text/reduplication.py`	Validate productive reduplications	Before error creation (suppress false positives)
`text/compound_resolver.py`	Validate productive compounds	Before error creation (suppress false positives)
