Myanmar frequently forms valid words through compounding (joining morphemes) and reduplication (repeating syllables). These productive processes generate words that may appear in no dictionary yet are perfectly valid. mySpellChecker validates such out-of-vocabulary (OOV) words to suppress false-positive spelling errors.

Compound Resolution

The CompoundResolver validates OOV words by splitting them into known dictionary morphemes using dynamic programming for optimal segmentation.

Common Compound Patterns

| Pattern | Example | Meaning |
|---------|---------|---------|
| N+N | ကျောင်း + သား → ကျောင်းသား | student |
| V+V | စား + သောက် → စားသောက် | eat and drink |
| ADJ+N | ကောင်း + ကျိုး → ကောင်းကျိုး | benefit |
| N+V | လက် + ခံ → လက်ခံ | accept |
| ADV+V | သေ + ချာ → သေချာ | careful |

How It Works

  1. Segment the OOV word into syllables
  2. Use dynamic programming to find optimal splits into dictionary morphemes
  3. Look up POS tags for each morpheme
  4. Validate the POS pattern against allowed compound patterns (from morphotactics.yaml)
  5. Score based on morpheme frequencies and pattern bonuses
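The dynamic-programming step above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: `best_split`, the toy Latin-script "syllables", and the `lexicon`/`freq` dicts are hypothetical stand-ins for the real segmenter and dictionary provider.

```python
def best_split(syllables, lexicon, freq, max_parts=4):
    """Return the highest-scoring split of `syllables` into lexicon morphemes."""
    n = len(syllables)
    # best[i] = (score, parts) for the prefix syllables[:i]
    best = {0: (0.0, [])}
    for i in range(1, n + 1):
        for j in range(i):
            if j not in best:
                continue  # prefix syllables[:j] has no valid split
            candidate = "".join(syllables[j:i])
            if candidate in lexicon and len(best[j][1]) < max_parts:
                score = best[j][0] + freq.get(candidate, 0)
                if i not in best or score > best[i][0]:
                    best[i] = (score, best[j][1] + [candidate])
    # None if no split covers the whole word
    return best.get(n, (0.0, None))[1]

parts = best_split(["note", "book"], {"note", "book"}, {"note": 50, "book": 80})
# parts == ["note", "book"]
```

Because `best` only keeps the top-scoring split per prefix, the search stays linear in the number of syllable boundaries rather than exponential in the number of possible splits.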

Usage

from myspellchecker.text.compound_resolver import CompoundResolver

resolver = CompoundResolver(
    segmenter=segmenter,
    min_morpheme_frequency=10,
    max_parts=4,
)

result = resolver.resolve(
    word="ကျောင်းသား",
    dictionary_check=provider.is_valid_word,
    frequency_check=provider.get_frequency,
    pos_check=provider.get_pos,
)

if result and result.is_valid:
    print(f"Parts: {result.parts}")        # ["ကျောင်း", "သား"]
    print(f"Pattern: {result.pattern}")    # "N+N"
    print(f"Confidence: {result.confidence}")

Configuration

from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.algorithm_configs import CompoundResolverConfig

config = SpellCheckerConfig(
    compound_resolver=CompoundResolverConfig(
        min_morpheme_frequency=10,  # Minimum frequency per morpheme
        max_parts=4,                # Maximum compound parts
        cache_size=1024,            # LRU cache size
    )
)

Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| min_morpheme_frequency | 10 | Minimum corpus frequency for each morpheme |
| max_parts | 4 | Maximum number of parts in a compound |
| cache_size | 1024 | LRU cache entries for resolved compounds |

Morphotactic Rules

Compound POS patterns are defined in rules/morphotactics.yaml:
compound_patterns:
  - pattern: "N+N"
    enabled: true
  - pattern: "V+V"
    enabled: true
  - pattern: "ADJ+N"
    enabled: true
  # ...

blocked_patterns:
  - "PART+PART"  # Particles don't compound

morphotactic_bonuses:
  "N+N": 0.10    # Highest bonus — most common pattern
  "ADJ+N": 0.08
  "V+V": 0.05
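As a rough illustration of how such rules could be consulted during scoring: the `rules` dict below mirrors the YAML structure above, but `pattern_score` and its `base_score` parameter are hypothetical names for this sketch, not the library's API.

```python
# Rule tables mirroring rules/morphotactics.yaml
rules = {
    "compound_patterns": {"N+N", "V+V", "ADJ+N"},
    "blocked_patterns": {"PART+PART"},
    "morphotactic_bonuses": {"N+N": 0.10, "ADJ+N": 0.08, "V+V": 0.05},
}

def pattern_score(part_pos, base_score=0.5):
    """Join per-part POS tags into a pattern and apply the rule bonuses."""
    pattern = "+".join(part_pos)
    if pattern in rules["blocked_patterns"]:
        return pattern, None  # explicitly disallowed (e.g. particles)
    if pattern not in rules["compound_patterns"]:
        return pattern, None  # pattern not whitelisted
    bonus = rules["morphotactic_bonuses"].get(pattern, 0.0)
    return pattern, base_score + bonus

pattern, score = pattern_score(["N", "N"])   # "N+N" gets the 0.10 bonus
```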

CompoundSplit Result

from myspellchecker.text.compound_resolver import CompoundSplit

# Returned by resolver.resolve()
@dataclass(frozen=True)
class CompoundSplit:
    word: str                    # Original compound word
    parts: list[str]             # Morpheme strings
    part_pos: list[str | None]   # POS tag per part
    pattern: str                 # e.g., "N+N"
    confidence: float            # Split confidence score
    is_valid: bool               # Whether split is valid

Reduplication

The ReduplicationEngine validates OOV words formed by reduplicating known dictionary words — a productive morphological process in Myanmar.

Reduplication Patterns

| Pattern | Structure | Example | Meaning |
|---------|-----------|---------|---------|
| AA | Syllable repeats | ကောင်းကောင်း | "well" (from ကောင်း "good") |
| AABB | Each syllable doubles | သေသေချာချာ | "very carefully" |
| ABAB | Whole word repeats | ခဏခဏ | "frequently" |
| RHYME | Known rhyme pairs | From grammar patterns | Fixed expressions |

How It Works

  1. Check against known rhyme reduplication patterns (fast path)
  2. Segment into syllables and detect the reduplication pattern
  3. Extract the base word from the pattern
  4. Validate: base must be in dictionary with sufficient frequency
  5. Check POS: only V, ADJ, ADV, N can productively reduplicate
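The pattern-detection step (step 2) can be sketched over a syllable list like this. The function name, the toy Latin-script syllables, and the simplified logic are illustrative assumptions; the real engine also handles rhyme pairs and applies the dictionary, frequency, and POS checks described above.

```python
def detect_pattern(syllables):
    """Return (pattern, base_syllables), or (None, None) if no match."""
    n = len(syllables)
    if n == 2 and syllables[0] == syllables[1]:
        return "AA", syllables[:1]            # [A, A]       -> base A
    if n == 4:
        a, b, c, d = syllables
        if a == b and c == d and a != c:
            return "AABB", [a, c]             # [A, A, B, B] -> base A B
        if (a, b) == (c, d):
            return "ABAB", [a, b]             # [A, B, A, B] -> base A B
    return None, None

detect_pattern(["ka", "ka"])                  # ("AA", ["ka"])
detect_pattern(["se", "se", "cha", "cha"])    # ("AABB", ["se", "cha"])
```

The extracted base syllables are then rejoined into a candidate base word for the dictionary and frequency checks in steps 4-5.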

Usage

from myspellchecker.text.reduplication import ReduplicationEngine

engine = ReduplicationEngine(
    segmenter=segmenter,
    min_base_frequency=5,
)

result = engine.analyze(
    word="ကောင်းကောင်း",
    dictionary_check=provider.is_valid_word,
    frequency_check=provider.get_frequency,
    pos_check=provider.get_pos,
)

if result and result.is_valid:
    print(f"Pattern: {result.pattern}")      # "AA"
    print(f"Base: {result.base_word}")       # "ကောင်း"
    print(f"POS: {result.pos_tag}")          # "ADJ"

Configuration

from myspellchecker.core.config.algorithm_configs import ReduplicationConfig

config = SpellCheckerConfig(
    reduplication=ReduplicationConfig(
        min_base_frequency=5,   # Minimum base word frequency
        cache_size=1024,        # LRU cache size
    )
)

ReduplicationResult

@dataclass(frozen=True)
class ReduplicationResult:
    word: str           # Original word
    pattern: str        # AA, AABB, ABAB, or RHYME
    base_word: str      # Base word
    is_valid: bool      # Whether valid reduplication
    pos_tag: str | None # POS of base word
    confidence: float   # Analysis confidence

Integration with Word Validation

Both engines are integrated into the word validation pipeline. When WordValidator encounters an OOV word, it checks compound resolution and reduplication before flagging a spelling error:
from myspellchecker.core import SpellCheckerBuilder

checker = (
    SpellCheckerBuilder()
    .with_compound_resolver(True)
    .with_reduplication(True)
    .build()
)

# "ကျောင်းသား" won't be flagged as a spelling error
# because CompoundResolver validates it as N+N compound
result = checker.check("ကျောင်းသားတွေ")

See Also