When the spell checker encounters an unknown word, the morphology module decomposes it into roots and affixes to infer likely POS tags and generate better suggestions.
Overview
Myanmar words are often formed by combining roots with suffixes and prefixes. The MorphologyAnalyzer can:
- Guess POS tags based on word morphology
- Decompose words into roots and suffixes
- Identify numeral words
- Handle ambiguous words with multiple POS possibilities
- Extract morphological patterns for OOV recovery
MorphologyAnalyzer
The main class for morphological analysis.
Basic POS Guessing
from myspellchecker.text.morphology import MorphologyAnalyzer
analyzer = MorphologyAnalyzer()
# Get all possible POS tags (set)
tags = analyzer.guess_pos("စားပြီ")
print(tags) # {'V', 'P_SENT'}
# Get ranked POS guesses with confidence
guesses = analyzer.guess_pos_ranked("စားပြီ")
for guess in guesses:
    print(f"{guess.tag}: {guess.confidence:.2f} ({guess.reason})")
# Output:
# P_SENT: 0.67 (particle suffix "ပြီ")
# V: 0.45 (verb suffix "ပြီ")
# Get the most likely POS tag
best = analyzer.guess_pos_best("စားပြီ")
print(best) # "P_SENT"
Multi-POS Support
Handle words with multiple possible parts of speech:
# Get all possible tags for ambiguous words
tags, confidence, source = analyzer.guess_pos_multi("ကြီး")
print(tags) # frozenset({'N', 'V', 'ADJ'})
print(confidence) # 0.70 (medium - needs context)
print(source) # "ambiguous_registry"
# Non-ambiguous words
tags, confidence, source = analyzer.guess_pos_multi("အလုပ်")
print(tags) # frozenset({'N'})
print(confidence) # 0.75
print(source) # "morphological_inference"
Word Analysis
Decompose words into roots and suffixes:
from myspellchecker.text.morphology import analyze_word, MorphologyAnalyzer
# Analyze word structure
result = analyze_word("စားခဲ့သည်")
print(result.original) # "စားခဲ့သည်"
print(result.root) # "စား"
print(result.suffixes) # ['ခဲ့', 'သည်']
print(result.confidence) # 0.85
print(result.is_compound) # False
print(result.pos_guesses) # List of POSGuess for the original word
With Dictionary Validation
Validate extracted roots against a dictionary:
# Define a dictionary check function
def is_valid_word(word):
    valid_words = {"စား", "သွား", "လာ", "ကြည့်"}
    return word in valid_words
# Analyze with dictionary validation
result = analyze_word("စားခဲ့သည်", dictionary_check=is_valid_word)
print(result.root) # "စား" (validated)
print(result.confidence) # Higher due to dictionary validation
Using Cached Analyzer
For better performance on repeated calls:
from myspellchecker.text.morphology import get_cached_analyzer
# Get analyzer with Stemmer integration (LRU caching)
analyzer = get_cached_analyzer()
# Or use convenience function with caching
result = analyze_word("စားခဲ့သည်", use_cache=True)
Numeral Detection
Detect Myanmar numerals (digits and words):
from myspellchecker.text.morphology import is_numeral_word, get_numeral_pos_guess
# Check for numerals
is_numeral_word("၁၂၃") # True (Myanmar digits)
is_numeral_word("သုံး") # True (numeral word "three")
is_numeral_word("စား") # False
# Get POS guess for numerals
guess = get_numeral_pos_guess("၁၂၃")
print(guess.tag) # "NUM"
print(guess.confidence) # 0.99 (very high for digits)
guess = get_numeral_pos_guess("သုံး")
print(guess.confidence) # 0.95 (slightly lower for words)
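The digit case can be illustrated without the library. The sketch below is an assumption about how such a check might work, not the library's implementation: it treats a string made entirely of code points U+1040 to U+1049 (the Myanmar digits in Unicode) as a digit numeral and falls back to a small, hypothetical word list. The names `sketch_is_numeral`, `sketch_numeral_guess`, and `NUMERAL_WORDS` are invented for illustration; the confidence values mirror the ones shown above.

```python
from typing import Optional, Tuple

# Myanmar digits occupy U+1040 (zero) through U+1049 (nine).
MYANMAR_DIGITS = {chr(cp) for cp in range(0x1040, 0x104A)}

# Hypothetical, deliberately tiny word list for illustration only.
NUMERAL_WORDS = {"တစ်", "နှစ်", "သုံး", "လေး", "ငါး"}


def _all_digits(word: str) -> bool:
    """True when the word is non-empty and contains only Myanmar digits."""
    return bool(word) and all(ch in MYANMAR_DIGITS for ch in word)


def sketch_is_numeral(word: str) -> bool:
    """Return True for all-digit strings or known numeral words."""
    return _all_digits(word) or word in NUMERAL_WORDS


def sketch_numeral_guess(word: str) -> Optional[Tuple[str, float]]:
    """Return ("NUM", confidence), or None when the word is not a numeral."""
    if _all_digits(word):
        return ("NUM", 0.99)  # digit strings are unambiguous
    if word in NUMERAL_WORDS:
        return ("NUM", 0.95)  # slightly lower for numeral words
    return None
```

A real implementation would need a much larger numeral lexicon and handling for compound numerals, which this sketch leaves out.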
POSGuess Data Structure
Results include detailed reasoning:
from myspellchecker.text.morphology import POSGuess
# POSGuess attributes
guess = POSGuess(
    tag="V",
    confidence=0.85,
    reason='verb suffix "ခဲ့"',
)
print(guess.tag) # "V"
print(guess.confidence) # 0.85
print(guess.reason) # 'verb suffix "ခဲ့"'
WordAnalysis Data Structure
Complete word decomposition:
from myspellchecker.text.morphology import POSGuess, WordAnalysis
# WordAnalysis attributes
analysis = WordAnalysis(
    original="စားခဲ့သည်",
    root="စား",
    suffixes=["ခဲ့", "သည်"],
    pos_guesses=[POSGuess(...)],  # arguments elided; see POSGuess above
    confidence=0.85,
    is_compound=False,
)
POS Tag Priority
When multiple suffixes match, tags are ordered by the following priority (0 is highest):
| Priority | Tag | Description |
|---|---|---|
| 0 | NUM | Numerals (highest) |
| 1 | P_SENT | Sentence particles |
| 2 | P_MOD | Modifier particles |
| 3 | P_LOC | Location particles |
| 4 | P_SUBJ | Subject particles |
| 5 | P_OBJ | Object particles |
| 6 | V | Verbs |
| 7 | N | Nouns |
| 8 | ADJ | Adjectives |
| 9 | ADV | Adverbs |
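As a rough illustration of how a priority table like this can act as a tie-breaker, the sketch below sorts guesses by confidence first and by tag priority second. It is not the library's ranking code: `Guess` is a hypothetical stand-in for `POSGuess`, `rank_guesses` is an invented helper, and only exact confidence ties are broken here, whereas the library may treat "similar" confidences more loosely.

```python
from dataclasses import dataclass

# Priority values copied from the table above (lower number = higher priority).
TAG_PRIORITY = {
    "NUM": 0, "P_SENT": 1, "P_MOD": 2, "P_LOC": 3, "P_SUBJ": 4,
    "P_OBJ": 5, "V": 6, "N": 7, "ADJ": 8, "ADV": 9,
}


@dataclass(frozen=True)
class Guess:  # hypothetical stand-in for POSGuess
    tag: str
    confidence: float


def rank_guesses(guesses):
    """Sort by confidence (descending), then by tag priority (ascending)."""
    unknown = len(TAG_PRIORITY)  # unlisted tags sort after every listed tag
    return sorted(
        guesses,
        key=lambda g: (-g.confidence, TAG_PRIORITY.get(g.tag, unknown)),
    )


ranked = rank_guesses(
    [Guess("V", 0.6), Guess("N", 0.6), Guess("P_SENT", 0.6), Guess("ADV", 0.8)]
)
print([g.tag for g in ranked])  # ['ADV', 'P_SENT', 'V', 'N']
```

The highest-confidence guess always comes first; priority only decides the order among the three guesses tied at 0.6.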
Confidence Scoring
Confidence is based on:
- Numeral detection: 0.95-0.99 (highest)
- Proper noun suffixes: 0.85-0.95
- Prefix patterns (e.g., အ → Noun): 0.60-0.75
- Suffix length ratio: Longer suffix matches = higher confidence
- Tag priority: Tie-breaker for similar confidence
# Example confidence calculation
analyzer = MorphologyAnalyzer()
# Long suffix = higher confidence
guesses = analyzer.guess_pos_ranked("စားခဲ့သည်")
# "သည်" is longer relative to word, higher confidence
# Prefix-based inference
guesses = analyzer.guess_pos_ranked("အလုပ်")
# "အ" prefix indicates noun, ~0.60-0.75 confidence
Integration with Stemmer
For performance-critical applications, integrate with the Stemmer:
from myspellchecker.text.morphology import MorphologyAnalyzer
from myspellchecker.text.stemmer import Stemmer
# Create analyzer with Stemmer (has LRU cache)
stemmer = Stemmer()
analyzer = MorphologyAnalyzer(stemmer=stemmer)
# Suffix stripping is now cached
result1 = analyzer.analyze_word("စားခဲ့သည်") # Computes
result2 = analyzer.analyze_word("စားခဲ့သည်") # Returns cached result
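The effect of an LRU cache on repeated calls can be seen with the standard library alone. The sketch below is not the Stemmer's code; `strip_once` is a hypothetical function that stands in for an expensive suffix-stripping step, wrapped in `functools.lru_cache`.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def strip_once(word: str, suffix: str) -> str:
    """Hypothetical costly step: remove one trailing suffix if present."""
    if suffix and word.endswith(suffix) and len(word) > len(suffix):
        return word[: -len(suffix)]
    return word


strip_once("စားခဲ့သည်", "သည်")  # first call: computed (cache miss)
strip_once("စားခဲ့သည်", "သည်")  # second call: served from the cache (hit)

info = strip_once.cache_info()
print(info.hits, info.misses)  # 1 1
```

Because the cache is keyed on the arguments, the benefit depends on how often the same words recur in the input.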
Configuration
The MorphologyConfig controls confidence values used in OOV recovery and suffix-based POS guessing. Pass it to MorphologyAnalyzer to tune morphological analysis behavior.
from myspellchecker.core.config import MorphologyConfig
from myspellchecker.text.morphology import MorphologyAnalyzer
morph_config = MorphologyConfig(
    # POS guessing confidence multipliers
    particle_confidence_boost=1.2,  # Multiplier for particle suffix confidence
    particle_confidence_cap=1.0,    # Max confidence after particle boost
    verb_suffix_weight=0.9,         # Weight for verb suffix scoring
    noun_suffix_weight=0.85,        # Weight for noun suffix scoring
    adverb_suffix_weight=0.8,       # Weight for adverb suffix scoring
    # OOV recovery confidence
    oov_base_confidence=0.3,        # Base confidence for suffix analysis
    oov_scale_factor=0.7,           # Scale factor for suffix ratio contribution
    oov_cap=0.95,                   # Max confidence from suffix analysis alone
    dictionary_boost=0.2,           # Boost when dictionary confirms root
    dictionary_cap=0.98,            # Max confidence after dictionary boost
    fallback_with_dict=0.5,         # Confidence: no suffixes but root in dict
    fallback_without_dict=0.2,      # Confidence: no suffixes, unknown root
    # Safety limit
    max_suffix_strip_iterations=5,  # Max iterations for suffix stripping
)
analyzer = MorphologyAnalyzer(morphology_config=morph_config)
| Field | Default | Description |
|---|---|---|
| particle_confidence_boost | 1.2 | Multiplier for particle suffix confidence (particles are reliable indicators) |
| particle_confidence_cap | 1.0 | Maximum confidence after particle boost |
| verb_suffix_weight | 0.9 | Weight for verb suffix confidence scoring |
| noun_suffix_weight | 0.85 | Weight for noun suffix confidence scoring |
| adverb_suffix_weight | 0.8 | Weight for adverb suffix confidence scoring |
| oov_base_confidence | 0.3 | Base confidence for OOV suffix analysis |
| oov_scale_factor | 0.7 | Scale factor for suffix ratio contribution |
| oov_cap | 0.95 | Maximum confidence from suffix analysis alone |
| dictionary_boost | 0.2 | Confidence boost when dictionary confirms the extracted root |
| dictionary_cap | 0.98 | Maximum confidence after dictionary boost |
| fallback_with_dict | 0.5 | Confidence when no suffixes found but root is in dictionary |
| fallback_without_dict | 0.2 | Confidence when no suffixes found and root is unknown |
| max_suffix_strip_iterations | 5 | Maximum iterations for suffix stripping (prevents infinite loops) |
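To make the OOV fields more concrete, here is one plausible way they could combine. This is an assumption based only on the field descriptions above, not the library's actual formula: the hypothetical `sketch_oov_confidence` adds a scaled suffix-to-word length ratio to the base value, caps it, then applies the dictionary boost with its own cap, and uses the fallback values when no suffix was found.

```python
def sketch_oov_confidence(
    word_len: int,
    suffix_len: int,
    root_in_dict: bool,
    *,
    oov_base_confidence: float = 0.3,
    oov_scale_factor: float = 0.7,
    oov_cap: float = 0.95,
    dictionary_boost: float = 0.2,
    dictionary_cap: float = 0.98,
    fallback_with_dict: float = 0.5,
    fallback_without_dict: float = 0.2,
) -> float:
    """Assumed combination of the OOV fields; the real formula may differ."""
    if word_len <= 0:
        raise ValueError("word_len must be positive")
    if suffix_len == 0:
        # No suffix was stripped: use the fallback values directly.
        return fallback_with_dict if root_in_dict else fallback_without_dict
    ratio = suffix_len / word_len
    confidence = min(oov_base_confidence + oov_scale_factor * ratio, oov_cap)
    if root_in_dict:
        confidence = min(confidence + dictionary_boost, dictionary_cap)
    return confidence


print(round(sketch_oov_confidence(10, 5, root_in_dict=False), 2))  # 0.65
print(round(sketch_oov_confidence(10, 5, root_in_dict=True), 2))   # 0.85
```

Under this reading, the two caps keep suffix evidence alone below 0.95 and dictionary-confirmed results below 0.98, so morphological inference never reports full certainty.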
Custom Grammar Config Path
Load morphology rules from a custom config directory:
# Use custom config path
analyzer = MorphologyAnalyzer(config_path="/path/to/grammar/config")
Suffix Categories
The analyzer recognizes these suffix types:
Verb Suffixes
- ခဲ့ (past tense)
- နေ (progressive)
- မည် (future)
- ပြီ (completion)
- ရ (potential)
Noun Suffixes
- များ (plural)
- ချင်း (comparative)
- လောက် (approximation)
Particle Suffixes
- သည် (formal ending)
- တယ် (colloquial ending)
- မှာ (location)
- ကို (object marker)
Adverb Suffixes
- စွာ (manner)
- အောင် (result)
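The suffix lists above are enough to sketch iterative suffix stripping. The code below is a simplified illustration, not the analyzer's algorithm: `strip_suffixes` is a hypothetical helper that matches the longest suffix first, never strips a word down to an empty root, and stops after a fixed number of iterations in the spirit of `max_suffix_strip_iterations`. The dictionary keys are category labels chosen for this example.

```python
# Suffix inventory copied from the lists above, grouped by category.
SUFFIXES = {
    "V": ["ခဲ့", "နေ", "မည်", "ပြီ", "ရ"],
    "N": ["များ", "ချင်း", "လောက်"],
    "P": ["သည်", "တယ်", "မှာ", "ကို"],
    "ADV": ["စွာ", "အောင်"],
}

# Try longer suffixes first so a short suffix never shadows a longer one.
_ALL = sorted(
    (s for group in SUFFIXES.values() for s in group), key=len, reverse=True
)


def strip_suffixes(word: str, max_iterations: int = 5):
    """Peel known suffixes off the end of `word`, outermost first.

    Returns (root, suffixes) with the suffixes in surface order.
    """
    root, found = word, []
    for _ in range(max_iterations):  # hard stop against runaway loops
        match = next(
            (s for s in _ALL if root.endswith(s) and len(root) > len(s)), None
        )
        if match is None:
            break
        root = root[: -len(match)]
        found.insert(0, match)  # keep left-to-right order
    return root, found


root, suffixes = strip_suffixes("စားခဲ့သည်")
print(root, suffixes)
```

Plain string matching like this over-strips words that merely end in a suffix-like sequence, which is one reason validating the extracted root against a dictionary raises confidence.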
Morphological Synthesis
In addition to morphological analysis (decomposition), the library provides morphological synthesis (validation of productive word formation). These modules validate OOV words formed through compounding and reduplication.
ReduplicationEngine
Validates words formed by repeating known dictionary words:
from myspellchecker.text.reduplication import ReduplicationEngine
# segmenter, provider, and get_pos are assumed to be defined elsewhere
engine = ReduplicationEngine(
    segmenter=segmenter,
    min_base_frequency=5,  # Minimum frequency for base word
    cache_size=1024,
    allowed_base_pos=frozenset({"V", "ADJ", "ADV", "N"}),
)
result = engine.analyze(
    "ကောင်းကောင်း",
    dictionary_check=lambda w: provider.is_valid_word(w),
    frequency_check=lambda w: provider.get_word_frequency(w),
    pos_check=lambda w: get_pos(w),
)
if result and result.is_valid:
    print(f"Pattern: {result.pattern}")        # "AA" (simple repetition)
    print(f"Base: {result.base_word}")         # "ကောင်း"
    print(f"Confidence: {result.confidence}")  # 0.90+
Supported patterns:
| Pattern | Example | Description |
|---|---|---|
| AA | ကောင်းကောင်း | Simple repetition |
| AABB | သေသေချာချာ | Each syllable doubles (A-A-B-B) |
| ABAB | ခဏခဏ | Whole word repeats (AB-AB) |
| RHYME | ရှုပ်ယှက် | Known rhyme pairs |
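The three structural patterns in the table can be recognized from a syllable list alone. The sketch below assumes the word has already been segmented into syllables and uses a hypothetical helper, `classify_reduplication`; it omits RHYME, which needs a lookup table of known pairs, and it performs none of the dictionary, frequency, or POS checks that the engine applies.

```python
from typing import List, Optional


def classify_reduplication(syllables: List[str]) -> Optional[str]:
    """Return "AA", "AABB", "ABAB", or None for a pre-segmented word."""
    if len(syllables) == 2 and syllables[0] == syllables[1]:
        return "AA"
    if len(syllables) == 4:
        a, b, c, d = syllables
        if a == b and c == d and a != c:
            return "AABB"  # each syllable doubles
        if (a, b) == (c, d) and a != b:
            return "ABAB"  # the two-syllable word repeats
    return None


print(classify_reduplication(["ကောင်း", "ကောင်း"]))        # AA
print(classify_reduplication(["သေ", "သေ", "ချာ", "ချာ"]))  # AABB
print(classify_reduplication(["ခ", "ဏ", "ခ", "ဏ"]))        # ABAB
```

A shape match alone is weak evidence, so a validator would still confirm that the base is a known word with sufficient frequency and an allowed POS.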
CompoundResolver
Validates compound words by splitting into known dictionary morphemes using dynamic programming:
from myspellchecker.text.compound_resolver import CompoundResolver
# segmenter, provider, and get_pos are assumed to be defined elsewhere
resolver = CompoundResolver(
    segmenter=segmenter,
    min_morpheme_frequency=10,  # Minimum frequency per morpheme
    max_parts=4,                # Maximum parts in compound
    cache_size=1024,
)
result = resolver.resolve(
    "ကျောင်းသား",
    dictionary_check=lambda w: provider.is_valid_word(w),
    frequency_check=lambda w: provider.get_word_frequency(w),
    pos_check=lambda w: get_pos(w),
)
if result and result.is_valid:
    print(f"Parts: {result.parts}")      # ["ကျောင်း", "သား"]
    print(f"Pattern: {result.pattern}")  # "N+N"
    print(f"Confidence: {result.confidence}")
Allowed compound patterns (13 patterns):
| Pattern | Description |
|---|---|
| N+N | Noun + Noun (e.g., ကျောင်းသား student) |
| V+V | Verb + Verb (e.g., စားသောက် eat & drink) |
| N+V | Noun + Verb (e.g., ရေချိုး bathe) |
| V+N | Verb + Noun (e.g., စားခန်း dining room) |
| ADJ+N | Adjective + Noun (e.g., ကြီးမား big city) |
| N+ADJ | Noun + Adjective |
| ADJ+ADJ | Adjective + Adjective |
| V+ADJ | Verb + Adjective |
| ADJ+V | Adjective + Verb |
| ADV+V | Adverb + Verb |
| ADV+N | Adverb + Noun |
| ADV+ADV | Adverb + Adverb |
| TN+N | Temporal Noun + Noun |
Each pattern has a morphotactic bonus applied during DP scoring (e.g., N+N: +0.10, ADJ+N: +0.08) that favors linguistically common combinations. The DP algorithm penalizes additional parts beyond 2 via a configurable parts_penalty_multiplier (default: 2.0).
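The dynamic-programming idea can be sketched in a few lines. The version below is a deliberate simplification and not the resolver's scoring: the hypothetical `split_compound` takes a pre-segmented syllable list and a dictionary predicate, and simply prefers the split with the fewest parts, which stands in for the parts penalty. Frequency thresholds, POS patterns, and morphotactic bonuses are left out.

```python
from typing import Callable, List, Optional


def split_compound(
    syllables: List[str],
    is_word: Callable[[str], bool],
    max_parts: int = 4,
) -> Optional[List[str]]:
    """Split into the fewest dictionary morphemes (2..max_parts), else None."""
    n = len(syllables)
    if n < 2:
        return None  # a compound needs at least two syllables
    # best[i] holds the fewest-parts split of syllables[:i], or None.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            if prefix is None or len(prefix) >= max_parts:
                continue
            piece = "".join(syllables[start:end])
            if (start, end) == (0, n) or not is_word(piece):
                continue  # skip the unsplit word and unknown pieces
            current = best[end]
            if current is None or len(prefix) + 1 < len(current):
                best[end] = prefix + [piece]
    return best[n]


lexicon = {"ကျောင်း", "သား"}
print(split_compound(["ကျောင်း", "သား"], lexicon.__contains__))  # two parts, as in the N+N example
```

Replacing the part count with a numeric score per split is the natural place to add pattern bonuses and a penalty multiplier.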
Integration
Both engines are automatically integrated into WordValidator when enabled via config:
config = SpellCheckerConfig(
    validation=ValidationConfig(
        use_reduplication_validation=True,  # Enabled by default
        use_compound_synthesis=True,        # Enabled by default
    )
)
Relationship to MorphologyAnalyzer
| Module | Purpose | When Used |
|---|---|---|
| text/morphology.py | POS guessing, suffix stripping, OOV decomposition | Suggestion generation (analyzing unknown words) |
| text/reduplication.py | Validate productive reduplications | Before error creation (suppress false positives) |
| text/compound_resolver.py | Validate productive compounds | Before error creation (suppress false positives) |