When the spell checker encounters an unknown word, the morphology module decomposes it into roots and affixes to infer likely POS tags and generate better suggestions.

Overview

Myanmar words are often formed by combining roots with suffixes and prefixes. The MorphologyAnalyzer can:
  • Guess POS tags based on word morphology
  • Decompose words into roots and suffixes
  • Identify numeral words
  • Handle ambiguous words with multiple POS possibilities
  • Extract morphological patterns for OOV recovery

MorphologyAnalyzer

The main class for morphological analysis.

Basic POS Guessing

from myspellchecker.text.morphology import MorphologyAnalyzer

analyzer = MorphologyAnalyzer()

# Get all possible POS tags (set)
tags = analyzer.guess_pos("စားပြီ")
print(tags)  # {'V', 'P_SENT'}

# Get ranked POS guesses with confidence
guesses = analyzer.guess_pos_ranked("စားပြီ")
for guess in guesses:
    print(f"{guess.tag}: {guess.confidence:.2f} ({guess.reason})")
# Output:
# P_SENT: 0.67 (particle suffix "ပြီ")
# V: 0.45 (verb suffix "ပြီ")

# Get the most likely POS tag
best = analyzer.guess_pos_best("စားပြီ")
print(best)  # "P_SENT"

Multi-POS Support

Handle words with multiple possible parts of speech:
# Get all possible tags for ambiguous words
tags, confidence, source = analyzer.guess_pos_multi("ကြီး")
print(tags)       # frozenset({'N', 'V', 'ADJ'})
print(confidence) # 0.70 (medium - needs context)
print(source)     # "ambiguous_registry"

# Non-ambiguous words
tags, confidence, source = analyzer.guess_pos_multi("အလုပ်")
print(tags)       # frozenset({'N'})
print(confidence) # 0.75
print(source)     # "morphological_inference"

Word Analysis

Decompose words into roots and suffixes:
from myspellchecker.text.morphology import analyze_word, MorphologyAnalyzer

# Analyze word structure
result = analyze_word("စားခဲ့သည်")

print(result.original)    # "စားခဲ့သည်"
print(result.root)        # "စား"
print(result.suffixes)    # ['ခဲ့', 'သည်']
print(result.confidence)  # 0.85
print(result.is_compound) # False
print(result.pos_guesses) # List of POSGuess for the original word

With Dictionary Validation

Validate extracted roots against a dictionary:
# Define a dictionary check function
def is_valid_word(word):
    valid_words = {"စား", "သွား", "လာ", "ကြည့်"}
    return word in valid_words

# Analyze with dictionary validation
result = analyze_word("စားခဲ့သည်", dictionary_check=is_valid_word)
print(result.root)        # "စား" (validated)
print(result.confidence)  # Higher due to dictionary validation

Using Cached Analyzer

For better performance on repeated calls:
from myspellchecker.text.morphology import get_cached_analyzer

# Get analyzer with Stemmer integration (LRU caching)
analyzer = get_cached_analyzer()

# Or use convenience function with caching
result = analyze_word("စားခဲ့သည်", use_cache=True)

Numeral Detection

Detect Myanmar numerals (digits and words):
from myspellchecker.text.morphology import is_numeral_word, get_numeral_pos_guess

# Check for numerals
is_numeral_word("၁၂၃")   # True (Myanmar digits)
is_numeral_word("သုံး")   # True (numeral word "three")
is_numeral_word("စား")   # False

# Get POS guess for numerals
guess = get_numeral_pos_guess("၁၂၃")
print(guess.tag)         # "NUM"
print(guess.confidence)  # 0.99 (very high for digits)

guess = get_numeral_pos_guess("သုံး")
print(guess.confidence)  # 0.95 (slightly lower for words)

POSGuess Data Structure

Results include detailed reasoning:
from myspellchecker.text.morphology import POSGuess

# POSGuess attributes
guess = POSGuess(
    tag="V",
    confidence=0.85,
    reason='verb suffix "ခဲ့"'
)

print(guess.tag)        # "V"
print(guess.confidence) # 0.85
print(guess.reason)     # 'verb suffix "ခဲ့"'

WordAnalysis Data Structure

Complete word decomposition:
from myspellchecker.text.morphology import WordAnalysis

# WordAnalysis attributes
analysis = WordAnalysis(
    original="စားခဲ့သည်",
    root="စား",
    suffixes=["ခဲ့", "သည်"],
    pos_guesses=[POSGuess(...)],
    confidence=0.85,
    is_compound=False
)

POS Tag Priority

When multiple suffixes match, tags are prioritized:
Priority | Tag    | Description
-------- | ------ | -------------------
0        | NUM    | Numerals (highest)
1        | P_SENT | Sentence particles
2        | P_MOD  | Modifier particles
3        | P_LOC  | Location particles
4        | P_SUBJ | Subject particles
5        | P_OBJ  | Object particles
6        | V      | Verbs
7        | N      | Nouns
8        | ADJ    | Adjectives
9        | ADV    | Adverbs
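
As an illustration of how a priority order like this can break ties between guesses of similar confidence, here is a minimal, self-contained sketch. The `Guess`, `TAG_PRIORITY`, and `rank_guesses` names are hypothetical, not part of the library API:

```python
from dataclasses import dataclass

# Priority table from above: lower number = higher priority.
TAG_PRIORITY = {
    "NUM": 0, "P_SENT": 1, "P_MOD": 2, "P_LOC": 3, "P_SUBJ": 4,
    "P_OBJ": 5, "V": 6, "N": 7, "ADJ": 8, "ADV": 9,
}

@dataclass
class Guess:
    tag: str
    confidence: float

def rank_guesses(guesses):
    # Sort by confidence (descending); on a confidence tie, the lower
    # priority number wins, so particles outrank verbs and nouns.
    return sorted(guesses, key=lambda g: (-g.confidence, TAG_PRIORITY[g.tag]))

ranked = rank_guesses([Guess("V", 0.67), Guess("P_SENT", 0.67)])
print([g.tag for g in ranked])  # ['P_SENT', 'V']
```

At equal confidence, a sentence-particle reading beats a verb reading, matching the ranked output shown earlier for "စားပြီ".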

Confidence Scoring

Confidence is based on:
  1. Numeral detection: 0.95-0.99 (highest)
  2. Proper noun suffixes: 0.85-0.95
  3. Prefix patterns (e.g., အ → Noun): 0.60-0.75
  4. Suffix length ratio: Longer suffix matches = higher confidence
  5. Tag priority: Tie-breaker for similar confidence
# Example confidence calculation
analyzer = MorphologyAnalyzer()

# Long suffix = higher confidence
guesses = analyzer.guess_pos_ranked("စားခဲ့သည်")
# "သည်" is longer relative to word, higher confidence

# Prefix-based inference
guesses = analyzer.guess_pos_ranked("အလုပ်")
# "အ" prefix indicates noun, ~0.60-0.75 confidence

Integration with Stemmer

For performance-critical applications, integrate with the Stemmer:
from myspellchecker.text.morphology import MorphologyAnalyzer
from myspellchecker.text.stemmer import Stemmer

# Create analyzer with Stemmer (has LRU cache)
stemmer = Stemmer()
analyzer = MorphologyAnalyzer(stemmer=stemmer)

# Suffix stripping is now cached
result1 = analyzer.analyze_word("စားခဲ့သည်")  # Computes
result2 = analyzer.analyze_word("စားခဲ့သည်")  # Returns cached result

Configuration

The MorphologyConfig controls confidence values used in OOV recovery and suffix-based POS guessing. Pass it to MorphologyAnalyzer to tune morphological analysis behavior.
from myspellchecker.core.config import MorphologyConfig
from myspellchecker.text.morphology import MorphologyAnalyzer

morph_config = MorphologyConfig(
    # POS guessing confidence multipliers
    particle_confidence_boost=1.2,  # Multiplier for particle suffix confidence
    particle_confidence_cap=1.0,    # Max confidence after particle boost
    verb_suffix_weight=0.9,         # Weight for verb suffix scoring
    noun_suffix_weight=0.85,        # Weight for noun suffix scoring
    adverb_suffix_weight=0.8,       # Weight for adverb suffix scoring

    # OOV recovery confidence
    oov_base_confidence=0.3,        # Base confidence for suffix analysis
    oov_scale_factor=0.7,           # Scale factor for suffix ratio contribution
    oov_cap=0.95,                   # Max confidence from suffix analysis alone
    dictionary_boost=0.2,           # Boost when dictionary confirms root
    dictionary_cap=0.98,            # Max confidence after dictionary boost
    fallback_with_dict=0.5,         # Confidence: no suffixes but root in dict
    fallback_without_dict=0.2,      # Confidence: no suffixes, unknown root

    # Safety limit
    max_suffix_strip_iterations=5,  # Max iterations for suffix stripping
)

analyzer = MorphologyAnalyzer(morphology_config=morph_config)

Field                       | Default | Description
--------------------------- | ------- | -----------
particle_confidence_boost   | 1.2     | Multiplier for particle suffix confidence (particles are reliable indicators)
particle_confidence_cap     | 1.0     | Maximum confidence after particle boost
verb_suffix_weight          | 0.9     | Weight for verb suffix confidence scoring
noun_suffix_weight          | 0.85    | Weight for noun suffix confidence scoring
adverb_suffix_weight        | 0.8     | Weight for adverb suffix confidence scoring
oov_base_confidence         | 0.3     | Base confidence for OOV suffix analysis
oov_scale_factor            | 0.7     | Scale factor for suffix ratio contribution
oov_cap                     | 0.95    | Maximum confidence from suffix analysis alone
dictionary_boost            | 0.2     | Confidence boost when dictionary confirms the extracted root
dictionary_cap              | 0.98    | Maximum confidence after dictionary boost
fallback_with_dict          | 0.5     | Confidence when no suffixes found but root is in dictionary
fallback_without_dict       | 0.2     | Confidence when no suffixes found and root is unknown
max_suffix_strip_iterations | 5       | Maximum iterations for suffix stripping (prevents infinite loops)
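
To make the OOV fields concrete, here is a hedged sketch of how they could combine. The exact formula is an assumption inferred from the field names and defaults, not the library's actual computation:

```python
def oov_confidence(suffix_ratio, root_in_dict,
                   base=0.3, scale=0.7, cap=0.95,
                   dict_boost=0.2, dict_cap=0.98):
    # Suffix evidence: base confidence plus a scaled share of how much
    # of the word the stripped suffixes cover, capped at oov_cap.
    conf = min(cap, base + scale * suffix_ratio)
    # Dictionary confirmation of the extracted root adds a capped boost.
    if root_in_dict:
        conf = min(dict_cap, conf + dict_boost)
    return conf

print(round(oov_confidence(0.5, root_in_dict=False), 2))  # 0.65
print(round(oov_confidence(0.5, root_in_dict=True), 2))   # 0.85
```

Whatever the precise formula, the caps guarantee that suffix evidence alone never exceeds 0.95 and that even a dictionary-confirmed root stays below 0.98.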

Custom Grammar Config Path

Load morphology rules from a custom config directory:
# Use custom config path
analyzer = MorphologyAnalyzer(config_path="/path/to/grammar/config")

Suffix Categories

The analyzer recognizes these suffix types:

Verb Suffixes

  • ခဲ့ (past tense)
  • နေ (progressive)
  • မည် (future)
  • ပြီ (completion)
  • ရ (potential)

Noun Suffixes

  • များ (plural)
  • ချင်း (comparative)
  • လောက် (approximation)

Particle Suffixes

  • သည် (formal ending)
  • တယ် (colloquial ending)
  • မှာ (location)
  • ကို (object marker)

Adverb Suffixes

  • စွာ (manner)
  • အောင် (result)
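
The lists above suggest a longest-match lookup over the known suffixes. A minimal sketch of that idea follows; the suffix-to-tag mapping is assembled from the lists above and is illustrative only (the real analyzer can also return multiple tags per suffix, e.g. ပြီ as both V and P_SENT):

```python
# Illustrative suffix table built from the categories above; not the
# library's actual rule data.
SUFFIX_TAGS = {
    "ခဲ့": "V", "နေ": "V", "မည်": "V", "ပြီ": "V", "ရ": "V",
    "များ": "N", "ချင်း": "N", "လောက်": "N",
    "သည်": "P_SENT", "တယ်": "P_SENT", "မှာ": "P_LOC", "ကို": "P_OBJ",
    "စွာ": "ADV", "အောင်": "ADV",
}

def match_suffix(word):
    # Prefer the longest matching suffix: longer matches cover more of
    # the word and carry more evidence about its class.
    for suffix in sorted(SUFFIX_TAGS, key=len, reverse=True):
        if word.endswith(suffix):
            return suffix, SUFFIX_TAGS[suffix]
    return None, None

print(match_suffix("စားခဲ့သည်"))  # ('သည်', 'P_SENT')
```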

Morphological Synthesis

In addition to morphological analysis (decomposition), the library provides morphological synthesis (validation of productive word formation). These modules validate OOV words formed through compounding and reduplication.

ReduplicationEngine

Validates words formed by repeating known dictionary words:
from myspellchecker.text.reduplication import ReduplicationEngine

engine = ReduplicationEngine(
    segmenter=segmenter,
    min_base_frequency=5,       # Minimum frequency for base word
    cache_size=1024,
    allowed_base_pos=frozenset({"V", "ADJ", "ADV", "N"}),
)

result = engine.analyze(
    "ကောင်းကောင်း",
    dictionary_check=lambda w: provider.is_valid_word(w),
    frequency_check=lambda w: provider.get_word_frequency(w),
    pos_check=lambda w: get_pos(w),
)

if result and result.is_valid:
    print(f"Pattern: {result.pattern}")       # "AA" (simple repetition)
    print(f"Base: {result.base_word}")        # "ကောင်း"
    print(f"Confidence: {result.confidence}") # 0.90+
Supported patterns:
Pattern | Example      | Description
------- | ------------ | --------------------------------
AA      | ကောင်းကောင်း | Simple repetition
AABB    | သေသေချာချာ   | Each syllable doubles (A-A-B-B)
ABAB    | ခဏခဏ         | Whole word repeats (AB-AB)
RHYME   | ရှုပ်ယှက်      | Known rhyme pairs
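
The pattern checks in the table can be sketched as follows. This is illustrative only: the engine itself works on dictionary-validated, frequency-checked bases, while this sketch assumes the word is already split into syllables:

```python
def reduplication_pattern(syllables):
    # Detect the structural patterns from the table above on a list of
    # syllables; returns None when no pattern matches.
    n = len(syllables)
    if n == 2 and syllables[0] == syllables[1]:
        return "AA"              # simple repetition, e.g. ကောင်း-ကောင်း
    if n == 4:
        a, b, c, d = syllables
        if a == b and c == d:
            return "AABB"        # each syllable doubles (A-A-B-B)
        if (a, b) == (c, d):
            return "ABAB"        # whole word repeats (AB-AB)
    return None

print(reduplication_pattern(["ကောင်း", "ကောင်း"]))        # AA
print(reduplication_pattern(["သေ", "သေ", "ချာ", "ချာ"]))  # AABB
print(reduplication_pattern(["ခ", "ဏ", "ခ", "ဏ"]))        # ABAB
```

RHYME pairs are omitted here because they rely on a curated list of known pairs rather than a structural check.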

CompoundResolver

Validates compound words by splitting into known dictionary morphemes using dynamic programming:
from myspellchecker.text.compound_resolver import CompoundResolver

resolver = CompoundResolver(
    segmenter=segmenter,
    min_morpheme_frequency=10,  # Minimum frequency per morpheme
    max_parts=4,                # Maximum parts in compound
    cache_size=1024,
)

result = resolver.resolve(
    "ကျောင်းသား",
    dictionary_check=lambda w: provider.is_valid_word(w),
    frequency_check=lambda w: provider.get_word_frequency(w),
    pos_check=lambda w: get_pos(w),
)

if result and result.is_valid:
    print(f"Parts: {result.parts}")      # ["ကျောင်း", "သား"]
    print(f"Pattern: {result.pattern}")   # "N+N"
    print(f"Confidence: {result.confidence}")
Allowed compound patterns (13 patterns):
Pattern | Description
------- | ---------------------------------------------
N+N     | Noun + Noun (e.g., ကျောင်းသား student)
V+V     | Verb + Verb (e.g., စားသောက် eat & drink)
N+V     | Noun + Verb (e.g., ရေချိုး bathe)
V+N     | Verb + Noun (e.g., စားခန်း dining room)
ADJ+N   | Adjective + Noun (e.g., ကြီးမား big city)
N+ADJ   | Noun + Adjective
ADJ+ADJ | Adjective + Adjective
V+ADJ   | Verb + Adjective
ADJ+V   | Adjective + Verb
ADV+V   | Adverb + Verb
ADV+N   | Adverb + Noun
ADV+ADV | Adverb + Adverb
TN+N    | Temporal Noun + Noun
Each pattern has a morphotactic bonus applied during DP scoring (e.g., N+N: +0.10, ADJ+N: +0.08) that favors linguistically common combinations. The DP algorithm penalizes additional parts beyond 2 via a configurable parts_penalty_multiplier (default: 2.0).
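
The scoring idea can be sketched as follows. The bonus values come from the examples above, but the penalty shape, scale, and function name are assumptions for illustration, not the resolver's actual code:

```python
# Subset of morphotactic bonuses quoted above, for illustration.
PATTERN_BONUS = {"N+N": 0.10, "ADJ+N": 0.08}

def split_score(base_score, pattern, num_parts, parts_penalty_multiplier=2.0):
    # Common patterns like N+N get a small bonus during DP scoring.
    bonus = PATTERN_BONUS.get(pattern, 0.0)
    # Assumed penalty shape: each part beyond 2 costs a small,
    # multiplier-scaled amount, discouraging over-segmentation.
    penalty = 0.01 * parts_penalty_multiplier * max(0, num_parts - 2)
    return base_score + bonus - penalty
```

Under this sketch, a two-part N+N split outscores a three-part split of the same base quality, which is the behavior the parts penalty is meant to enforce.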

Integration

Both engines are automatically integrated into WordValidator when enabled via config:
config = SpellCheckerConfig(
    validation=ValidationConfig(
        use_reduplication_validation=True,   # Enabled by default
        use_compound_synthesis=True,         # Enabled by default
    )
)

Relationship to MorphologyAnalyzer

Module                    | Purpose                                           | When Used
------------------------- | ------------------------------------------------- | ------------------------------------------------
text/morphology.py        | POS guessing, suffix stripping, OOV decomposition | Suggestion generation (analyzing unknown words)
text/reduplication.py     | Validate productive reduplications                | Before error creation (suppress false positives)
text/compound_resolver.py | Validate productive compounds                     | Before error creation (suppress false positives)

See Also