The morphology module provides word-structure analysis for Myanmar text, enabling part-of-speech (POS) inference for out-of-vocabulary (OOV) words and decomposition of words into roots and suffixes.

Overview

Myanmar words are often formed by combining roots with suffixes and prefixes. The MorphologyAnalyzer can:
  • Guess POS tags based on word morphology
  • Decompose words into roots and suffixes
  • Identify numeral words
  • Handle ambiguous words with multiple POS possibilities
  • Extract morphological patterns for OOV recovery

MorphologyAnalyzer

The main class for morphological analysis.

Basic POS Guessing

from myspellchecker.text.morphology import MorphologyAnalyzer

analyzer = MorphologyAnalyzer()

# Get all possible POS tags (set)
tags = analyzer.guess_pos("စားပြီ")
print(tags)  # {'V', 'P_SENT'}

# Get ranked POS guesses with confidence
guesses = analyzer.guess_pos_ranked("စားပြီ")
for guess in guesses:
    print(f"{guess.tag}: {guess.confidence:.2f} ({guess.reason})")
# Output:
# P_SENT: 0.67 (particle suffix "ပြီ")
# V: 0.45 (verb suffix "ပြီ")

# Get the most likely POS tag
best = analyzer.guess_pos_best("စားပြီ")
print(best)  # "P_SENT"

Multi-POS Support

Handle words with multiple possible parts of speech:
# Get all possible tags for ambiguous words
tags, confidence, source = analyzer.guess_pos_multi("ကြီး")
print(tags)       # frozenset({'N', 'V', 'ADJ'})
print(confidence) # 0.70 (medium - needs context)
print(source)     # "ambiguous_registry"

# Non-ambiguous words
tags, confidence, source = analyzer.guess_pos_multi("အလုပ်")
print(tags)       # frozenset({'N'})
print(confidence) # 0.75
print(source)     # "morphological_inference"

Comprehensive POS Inference

# Get detailed inference results
result = analyzer.infer_pos_with_details("ကြီး")

print(result["word"])                    # "ကြီး"
print(result["tags"])                    # frozenset({'N', 'V', 'ADJ'})
print(result["best_tag"])                # "N" or most likely
print(result["confidence"])              # 0.70
print(result["source"])                  # "ambiguous_registry"
print(result["is_ambiguous"])            # True
print(result["requires_disambiguation"]) # True
print(result["all_guesses"])             # List of POSGuess objects

Word Analysis

Decompose words into roots and suffixes:
from myspellchecker.text.morphology import analyze_word, MorphologyAnalyzer

# Analyze word structure
result = analyze_word("စားခဲ့သည်")

print(result.original)    # "စားခဲ့သည်"
print(result.root)        # "စား"
print(result.suffixes)    # ['ခဲ့', 'သည်']
print(result.confidence)  # 0.85
print(result.is_compound) # False
print(result.pos_guesses) # List of POSGuess for the original word

With Dictionary Validation

Validate extracted roots against a dictionary:
# Define a dictionary check function
def is_valid_word(word):
    valid_words = {"စား", "သွား", "လာ", "ကြည့်"}
    return word in valid_words

# Analyze with dictionary validation
result = analyze_word("စားခဲ့သည်", dictionary_check=is_valid_word)
print(result.root)        # "စား" (validated)
print(result.confidence)  # Higher due to dictionary validation

Using Cached Analyzer

For better performance on repeated calls:
from myspellchecker.text.morphology import get_cached_analyzer

# Get analyzer with Stemmer integration (LRU caching)
analyzer = get_cached_analyzer()

# Or use convenience function with caching
result = analyze_word("စားခဲ့သည်", use_cache=True)

Numeral Detection

Detect Myanmar numerals (digits and words):
from myspellchecker.text.morphology import is_numeral_word, get_numeral_pos_guess

# Check for numerals
is_numeral_word("၁၂၃")   # True (Myanmar digits)
is_numeral_word("သုံး")   # True (numeral word "three")
is_numeral_word("စား")   # False

# Get POS guess for numerals
guess = get_numeral_pos_guess("၁၂၃")
print(guess.tag)         # "NUM"
print(guess.confidence)  # 0.99 (very high for digits)

guess = get_numeral_pos_guess("သုံး")
print(guess.confidence)  # 0.95 (slightly lower for words)

POSGuess Data Structure

Results include detailed reasoning:
from myspellchecker.text.morphology import POSGuess

# POSGuess attributes
guess = POSGuess(
    tag="V",
    confidence=0.85,
    reason='verb suffix "ခဲ့"'
)

print(guess.tag)        # "V"
print(guess.confidence) # 0.85
print(guess.reason)     # 'verb suffix "ခဲ့"'

WordAnalysis Data Structure

Complete word decomposition:
from myspellchecker.text.morphology import WordAnalysis

# WordAnalysis attributes
analysis = WordAnalysis(
    original="စားခဲ့သည်",
    root="စား",
    suffixes=["ခဲ့", "သည်"],
    pos_guesses=[POSGuess(...)],
    confidence=0.85,
    is_compound=False
)

POS Tag Priority

When multiple suffixes match, tags are prioritized:
Priority | Tag    | Description
---------|--------|------------------
0        | NUM    | Numerals (highest)
1        | P_SENT | Sentence particles
2        | P_MOD  | Modifier particles
3        | P_LOC  | Location particles
4        | P_SUBJ | Subject particles
5        | P_OBJ  | Object particles
6        | V      | Verbs
7        | N      | Nouns
8        | ADJ    | Adjectives
9        | ADV    | Adverbs
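The priority order amounts to a tie-breaker applied after sorting by confidence. Here is a minimal, self-contained sketch of that idea; the `TAG_PRIORITY` table and `rank_guesses` helper are illustrative only, not part of the library API:

```python
# Illustrative priority table mirroring the order above (lower = higher priority)
TAG_PRIORITY = {
    "NUM": 0, "P_SENT": 1, "P_MOD": 2, "P_LOC": 3, "P_SUBJ": 4,
    "P_OBJ": 5, "V": 6, "N": 7, "ADJ": 8, "ADV": 9,
}

def rank_guesses(guesses):
    """Sort (tag, confidence) pairs by confidence, breaking ties by tag priority."""
    return sorted(guesses, key=lambda g: (-g[1], TAG_PRIORITY.get(g[0], 99)))

# Two guesses with equal confidence: P_SENT wins the tie over V
print(rank_guesses([("V", 0.67), ("P_SENT", 0.67)]))
# [('P_SENT', 0.67), ('V', 0.67)]
```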

Confidence Scoring

Confidence is based on:
  1. Numeral detection: 0.95-0.99 (highest)
  2. Proper noun suffixes: 0.85-0.95
  3. Prefix patterns (e.g., အ → Noun): 0.60-0.75
  4. Suffix length ratio: Longer suffix matches = higher confidence
  5. Tag priority: Tie-breaker for similar confidence
# Example confidence calculation
analyzer = MorphologyAnalyzer()

# Long suffix = higher confidence
guesses = analyzer.guess_pos_ranked("စားခဲ့သည်")
# "သည်" is longer relative to word, higher confidence

# Prefix-based inference
guesses = analyzer.guess_pos_ranked("အလုပ်")
# "အ" prefix indicates noun, ~0.60-0.75 confidence

Integration with Stemmer

For performance-critical applications, integrate with the Stemmer:
from myspellchecker.text.morphology import MorphologyAnalyzer
from myspellchecker.text.stemmer import Stemmer

# Create analyzer with Stemmer (has LRU cache)
stemmer = Stemmer()
analyzer = MorphologyAnalyzer(stemmer=stemmer)

# Suffix stripping is now cached
result1 = analyzer.analyze_word("စားခဲ့သည်")  # Computes
result2 = analyzer.analyze_word("စားခဲ့သည်")  # Returns cached result

Custom Configuration

Load morphology rules from custom config:
# Use custom config path
analyzer = MorphologyAnalyzer(config_path="/path/to/grammar/config")

Suffix Categories

The analyzer recognizes these suffix types:

Verb Suffixes

  • ခဲ့ (past tense)
  • နေ (progressive)
  • မည် (future)
  • ပြီ (completion)
  • ရ (potential)

Noun Suffixes

  • များ (plural)
  • ချင်း (comparative)
  • လောက် (approximation)

Particle Suffixes

  • သည် (formal ending)
  • တယ် (colloquial ending)
  • မှာ (location)
  • ကို (object marker)

Adverb Suffixes

  • စွာ (manner)
  • အောင် (result)
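As a rough illustration of how these categories drive suffix-based POS guessing, the sketch below does a longest-suffix match against a small table. The `SUFFIX_TAGS` dict and `match_suffix` helper are simplified stand-ins drawn from the lists above, not the library's internal data:

```python
# Simplified suffix-to-tag table drawn from the categories above
SUFFIX_TAGS = {
    "ခဲ့": "V", "နေ": "V", "မည်": "V", "ပြီ": "V", "ရ": "V",
    "များ": "N", "ချင်း": "N", "လောက်": "N",
    "သည်": "P_SENT", "တယ်": "P_SENT", "မှာ": "P_LOC", "ကို": "P_OBJ",
    "စွာ": "ADV", "အောင်": "ADV",
}

def match_suffix(word):
    """Return the longest known suffix and its tag, or None if nothing matches."""
    best = None
    for suffix, tag in SUFFIX_TAGS.items():
        if word.endswith(suffix) and (best is None or len(suffix) > len(best[0])):
            best = (suffix, tag)
    return best

print(match_suffix("စားခဲ့သည်"))   # ('သည်', 'P_SENT') — final suffix wins
print(match_suffix("ကောင်းစွာ"))  # ('စွာ', 'ADV')
```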

Morphological Synthesis

In addition to morphological analysis (decomposition), the library provides morphological synthesis (validation of productive word formation). These modules validate OOV words formed through compounding and reduplication.

ReduplicationEngine

Validates words formed by repeating known dictionary words:
from myspellchecker.text.reduplication import ReduplicationEngine

engine = ReduplicationEngine(
    segmenter=segmenter,
    min_base_frequency=5,       # Minimum frequency for base word
    cache_size=1024,
    allowed_base_pos=frozenset({"V", "ADJ", "ADV", "N"}),
)

result = engine.analyze(
    "ကောင်းကောင်း",
    dictionary_check=lambda w: provider.is_valid_word(w),
    frequency_check=lambda w: provider.get_word_frequency(w),
    pos_check=lambda w: get_pos(w),
)

if result and result.is_valid:
    print(f"Pattern: {result.pattern}")       # "AA" (simple repetition)
    print(f"Base: {result.base_word}")        # "ကောင်း"
    print(f"Confidence: {result.confidence}") # 0.90+
Supported patterns:
Pattern | Example      | Description
--------|--------------|---------------------------------
AA      | ကောင်းကောင်း | Simple repetition
AABB    | သေသေချာချာ   | Each syllable doubles (A-A-B-B)
ABAB    | ခဏခဏ         | Whole word repeats (AB-AB)
RHYME   | ရှုပ်ယှက်      | Known rhyme pairs

CompoundResolver

Validates compound words by splitting into known dictionary morphemes using dynamic programming:
from myspellchecker.text.compound_resolver import CompoundResolver

resolver = CompoundResolver(
    segmenter=segmenter,
    min_morpheme_frequency=10,  # Minimum frequency per morpheme
    max_parts=4,                # Maximum parts in compound
    cache_size=1024,
)

result = resolver.resolve(
    "ကျောင်းသား",
    dictionary_check=lambda w: provider.is_valid_word(w),
    frequency_check=lambda w: provider.get_word_frequency(w),
    pos_check=lambda w: get_pos(w),
)

if result and result.is_valid:
    print(f"Parts: {result.parts}")      # ["ကျောင်း", "သား"]
    print(f"Pattern: {result.pattern}")   # "N+N"
    print(f"Confidence: {result.confidence}")
Allowed compound patterns:
Pattern | Example                | Description
--------|------------------------|------------------
N+N     | ကျောင်းသား (student)    | Noun + Noun
V+V     | စားသောက် (eat & drink) | Verb + Verb
N+V     | ရေချိုး (bathe)         | Noun + Verb
V+N     | စားခန်း (dining room)   | Verb + Noun
N+ADJ   | မြို့ကြီး (big city)      | Noun + Adjective
Blocked patterns: P+P, P+N, N+P (particle combinations never form compounds).
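The allowed/blocked rules behave like a whitelist over POS pairs. The following is a minimal sketch of that check; the `ALLOWED_PATTERNS` set and `is_allowed_compound` helper are illustrative only, not the resolver's actual implementation:

```python
# POS-pair whitelist mirroring the allowed patterns above
ALLOWED_PATTERNS = {("N", "N"), ("V", "V"), ("N", "V"), ("V", "N"), ("N", "ADJ")}

def is_allowed_compound(pos_a, pos_b):
    """Accept only whitelisted POS pairs; particle combinations never qualify."""
    if "P" in (pos_a, pos_b) or pos_a.startswith("P_") or pos_b.startswith("P_"):
        return False
    return (pos_a, pos_b) in ALLOWED_PATTERNS

print(is_allowed_compound("N", "N"))      # True  (e.g. ကျောင်း + သား)
print(is_allowed_compound("P_OBJ", "N"))  # False (particles are blocked)
```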

Integration

Both engines are automatically integrated into WordValidator when enabled via config:
config = SpellCheckerConfig(
    validation=ValidationConfig(
        use_reduplication_validation=True,   # Enabled by default
        use_compound_synthesis=True,         # Enabled by default
    )
)

Relationship to MorphologyAnalyzer

Module                    | Purpose                                           | When Used
--------------------------|---------------------------------------------------|------------------------------------------------
text/morphology.py        | POS guessing, suffix stripping, OOV decomposition | Suggestion generation (analyzing unknown words)
text/reduplication.py     | Validate productive reduplications                | Before error creation (suppress false positives)
text/compound_resolver.py | Validate productive compounds                     | Before error creation (suppress false positives)

See Also