Myanmar (Burmese) frequently forms valid words through compounding (joining morphemes) and reduplication (repeating syllables or whole words). Because these processes are productive, they generate words that are perfectly valid yet may not appear in any dictionary. mySpellChecker validates such out-of-vocabulary (OOV) words so that they are not flagged as false-positive spelling errors.
Compound Resolution
The CompoundResolver validates OOV words by splitting them into known dictionary morphemes using dynamic programming for optimal segmentation.
Common Compound Patterns
| Pattern | Example | Meaning |
|---|---|---|
| N+N | ကျောင်း + သား → ကျောင်းသား | student |
| V+V | စား + သောက် → စားသောက် | eat and drink |
| ADJ+N | ကောင်း + ကျိုး → ကောင်းကျိုး | benefit |
| N+V | လက် + ခံ → လက်ခံ | accept |
| ADV+V | သေ + ချာ → သေချာ | careful |
How It Works
- Segment the OOV word into syllables
- Use dynamic programming to find optimal splits into dictionary morphemes
- Look up POS tags for each morpheme
- Validate the POS pattern against allowed compound patterns (from morphotactics.yaml)
- Score based on morpheme frequencies and pattern bonuses
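The segmentation step above can be sketched as a small dynamic program. This is a minimal illustration, not CompoundResolver's actual implementation: the name `split_compound` is hypothetical, the input is assumed to be already segmented into syllables, and the sketch prefers the split with the fewest parts as a stand-in for the frequency-based scoring described in the last step.

```python
from typing import Callable, Optional


def split_compound(
    syllables: list[str],
    is_word: Callable[[str], bool],
    max_parts: int = 4,
) -> Optional[list[str]]:
    """Return the split into the fewest dictionary parts, or None (illustrative only)."""
    n = len(syllables)
    # best[i] is the fewest-parts split of syllables[:i], or None if unreachable.
    best: list[Optional[list[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            if prefix is None or len(prefix) >= max_parts:
                continue
            candidate = "".join(syllables[start:end])
            if not is_word(candidate):
                continue
            current = best[end]
            if current is None or len(prefix) + 1 < len(current):
                best[end] = prefix + [candidate]
    result = best[n]
    # A compound needs at least two parts; one part is just a dictionary word.
    if result is None or len(result) < 2:
        return None
    return result


# Toy dictionary standing in for a real provider lookup.
toy_dictionary = {"ကျောင်း", "သား"}
print(split_compound(["ကျောင်း", "သား"], toy_dictionary.__contains__))
```

Keeping only the best split per prefix keeps the table small; a scorer that ranks by morpheme frequency would store a score alongside each entry instead of comparing part counts.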
Usage
from myspellchecker.text.compound_resolver import CompoundResolver

# `segmenter` and `provider` are assumed to have been created elsewhere.
resolver = CompoundResolver(
    segmenter=segmenter,
    min_morpheme_frequency=10,
    max_parts=4,
)

result = resolver.resolve(
    word="ကျောင်းသား",
    dictionary_check=provider.is_valid_word,
    frequency_check=provider.get_frequency,
    pos_check=provider.get_pos,
)

if result and result.is_valid:
    print(f"Parts: {result.parts}")      # ["ကျောင်း", "သား"]
    print(f"Pattern: {result.pattern}")  # "N+N"
    print(f"Confidence: {result.confidence}")
Configuration
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.algorithm_configs import CompoundResolverConfig

config = SpellCheckerConfig(
    compound_resolver=CompoundResolverConfig(
        min_morpheme_frequency=10,  # Minimum frequency per morpheme
        max_parts=4,                # Maximum compound parts
        cache_size=1024,            # LRU cache size
    )
)
Parameters
| Parameter | Default | Description |
|---|---|---|
| min_morpheme_frequency | 10 | Minimum corpus frequency for each morpheme |
| max_parts | 4 | Maximum number of parts in a compound |
| cache_size | 1024 | LRU cache entries for resolved compounds |
Morphotactic Rules
Compound POS patterns are defined in rules/morphotactics.yaml:
compound_patterns:
  - pattern: "N+N"
    enabled: true
  - pattern: "V+V"
    enabled: true
  - pattern: "ADJ+N"
    enabled: true
  # ...
blocked_patterns:
  - "PART+PART"  # Particles don't compound
morphotactic_bonuses:
  "N+N": 0.10    # Highest bonus: most common pattern
  "ADJ+N": 0.08
  "V+V": 0.05
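To make the role of these rules concrete, the following sketch shows one way a candidate split could be gated by pattern and then given its bonus. It is an assumption-laden illustration, not the library's scoring code: `apply_morphotactics` and the three constants are hypothetical names, the base score is supplied by the caller because the source does not specify how it is computed, and clamping the result to 1.0 is also an assumption.

```python
from typing import Optional, Sequence

# Values mirror the YAML excerpt above; the constant names are hypothetical.
ALLOWED_PATTERNS = {"N+N", "V+V", "ADJ+N"}
BLOCKED_PATTERNS = {"PART+PART"}
PATTERN_BONUSES = {"N+N": 0.10, "ADJ+N": 0.08, "V+V": 0.05}


def apply_morphotactics(
    pos_tags: Sequence[Optional[str]],
    base_score: float,
) -> Optional[float]:
    """Return the bonus-adjusted score, or None if the POS pattern is rejected."""
    # A morpheme without a POS tag cannot be matched against any pattern.
    if any(tag is None for tag in pos_tags):
        return None
    pattern = "+".join(tag for tag in pos_tags if tag is not None)
    if pattern in BLOCKED_PATTERNS or pattern not in ALLOWED_PATTERNS:
        return None
    # Assumed behaviour: add the pattern bonus and clamp to 1.0.
    return min(1.0, base_score + PATTERN_BONUSES.get(pattern, 0.0))


print(apply_morphotactics(["N", "N"], 0.5))
print(apply_morphotactics(["PART", "PART"], 0.5))
```

Rejected patterns return `None` rather than a low score, so a caller can tell "not a compound" apart from "a weak compound".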
CompoundSplit Result
from myspellchecker.text.compound_resolver import CompoundSplit

# Returned by resolver.resolve()
@dataclass(frozen=True)
class CompoundSplit:
    word: str                    # Original compound word
    parts: list[str]             # Morpheme strings
    part_pos: list[str | None]   # POS tag per part
    pattern: str                 # e.g., "N+N"
    confidence: float            # Split confidence score
    is_valid: bool               # Whether split is valid
Reduplication
The ReduplicationEngine validates OOV words formed by reduplicating known dictionary words — a productive morphological process in Myanmar.
Reduplication Patterns
| Pattern | Structure | Example | Meaning |
|---|---|---|---|
| AA | Syllable repeats | ကောင်းကောင်း | “well” (from ကောင်း “good”) |
| AABB | Each syllable doubles | သေသေချာချာ | “very carefully” |
| ABAB | Whole word repeats | ခဏခဏ | “frequently” |
| RHYME | Known rhyme pairs | From grammar patterns | Fixed expressions |
How It Works
- Check against known rhyme reduplication patterns (fast path)
- Segment into syllables and detect the reduplication pattern
- Extract the base word from the pattern
- Validate: base must be in dictionary with sufficient frequency
- Check POS: only V, ADJ, ADV, N can productively reduplicate
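The detection and validation steps above can be sketched as follows. This is a simplified illustration rather than ReduplicationEngine's implementation: the function names are hypothetical, the input is assumed to be already segmented into syllables, only the AA, AABB, and ABAB shapes with one- or two-syllable bases are handled, and the RHYME fast path is omitted.

```python
from typing import Callable, Optional

# POS tags that can productively reduplicate, per the list above.
REDUPLICABLE_POS = {"V", "ADJ", "ADV", "N"}


def detect_reduplication(syllables: list[str]) -> Optional[tuple[str, str]]:
    """Return (pattern, base_word) for AA, AABB, or ABAB shapes, else None."""
    if len(syllables) == 2 and syllables[0] == syllables[1]:
        return "AA", syllables[0]
    if len(syllables) == 4:
        a, b, c, d = syllables
        if a == b and c == d and a != c:
            return "AABB", a + c  # base is the undoubled word A+B
        if a == c and b == d and a != b:
            return "ABAB", a + b  # base is the repeated word A+B
    return None


def is_valid_reduplication(
    syllables: list[str],
    is_word: Callable[[str], bool],
    get_frequency: Callable[[str], int],
    get_pos: Callable[[str], Optional[str]],
    min_base_frequency: int = 5,
) -> bool:
    """Check the base word against dictionary, frequency, and POS constraints."""
    detected = detect_reduplication(syllables)
    if detected is None:
        return False
    _, base = detected
    return (
        is_word(base)
        and get_frequency(base) >= min_base_frequency
        and get_pos(base) in REDUPLICABLE_POS
    )


print(detect_reduplication(["ကောင်း", "ကောင်း"]))
```

Separating shape detection from base-word validation keeps the dictionary lookups out of the pattern logic, so each part can be tested independently.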
Usage
from myspellchecker.text.reduplication import ReduplicationEngine

# `segmenter` and `provider` are assumed to have been created elsewhere.
engine = ReduplicationEngine(
    segmenter=segmenter,
    min_base_frequency=5,
)

result = engine.analyze(
    word="ကောင်းကောင်း",
    dictionary_check=provider.is_valid_word,
    frequency_check=provider.get_frequency,
    pos_check=provider.get_pos,
)

if result and result.is_valid:
    print(f"Pattern: {result.pattern}")  # "AA"
    print(f"Base: {result.base_word}")   # "ကောင်း"
    print(f"POS: {result.pos_tag}")      # "ADJ"
Configuration
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.algorithm_configs import ReduplicationConfig

config = SpellCheckerConfig(
    reduplication=ReduplicationConfig(
        min_base_frequency=5,  # Minimum base word frequency
        cache_size=1024,       # LRU cache size
    )
)
ReduplicationResult
@dataclass(frozen=True)
class ReduplicationResult:
    word: str            # Original word
    pattern: str         # AA, AABB, ABAB, or RHYME
    base_word: str       # Base word
    is_valid: bool       # Whether valid reduplication
    pos_tag: str | None  # POS of base word
    confidence: float    # Analysis confidence
Integration with Word Validation
Both engines are integrated into the word validation pipeline. When WordValidator encounters an OOV word, it checks compound resolution and reduplication before flagging a spelling error:
from myspellchecker.core import SpellCheckerBuilder

checker = (
    SpellCheckerBuilder()
    .with_compound_resolver(True)
    .with_reduplication(True)
    .build()
)

# "ကျောင်းသား" won't be flagged as a spelling error
# because CompoundResolver validates it as N+N compound
result = checker.check("ကျောင်းသားတွေ")
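The order of checks described above can be pictured with a small sketch. It does not reproduce WordValidator's internals, which are not shown here: `is_spelling_error` is a hypothetical helper, and the validators are plain callables standing in for the compound and reduplication engines.

```python
from typing import Callable, Iterable


def is_spelling_error(
    word: str,
    in_dictionary: Callable[[str], bool],
    oov_validators: Iterable[Callable[[str], bool]],
) -> bool:
    """Flag a word only if it is OOV and no morphological validator accepts it."""
    if in_dictionary(word):
        return False
    # Each validator stands in for an engine such as compound resolution
    # or reduplication analysis; the first one that accepts the word wins.
    return not any(validator(word) for validator in oov_validators)


# Toy stand-ins for the dictionary and the two engines.
toy_dictionary = {"cat", "dog"}
accepts_compound = lambda w: w == "catdog"
accepts_reduplication = lambda w: w == "catcat"

print(is_spelling_error("catdog", toy_dictionary.__contains__,
                        [accepts_compound, accepts_reduplication]))
```

Because `any` short-circuits, later validators run only when earlier ones reject the word.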
See Also