Myanmar frequently forms valid words through compounding (joining morphemes) and reduplication (repeating syllables). These productive processes generate words that may appear in no dictionary yet are perfectly valid. mySpellChecker validates such out-of-vocabulary (OOV) words to suppress false-positive spelling errors.

Compound Resolution

The CompoundResolver validates OOV words by splitting them into known dictionary morphemes using dynamic programming for optimal segmentation.

Common Compound Patterns

| Pattern | Example | Meaning |
|---------|---------|---------|
| N+N | ကျောင်း + သား → ကျောင်းသား | student |
| V+V | စား + သောက် → စားသောက် | eat and drink |
| ADJ+N | ကောင်း + ကျိုး → ကောင်းကျိုး | benefit |
| N+V | လက် + ခံ → လက်ခံ | accept |
| ADV+V | သေ + ချာ → သေချာ | careful |

How It Works

  1. Segment the OOV word into syllables
  2. Use dynamic programming to find optimal splits into dictionary morphemes
  3. Look up POS tags for each morpheme
  4. Validate the POS pattern against allowed compound patterns (from morphotactics.yaml)
  5. Score based on morpheme frequencies and pattern bonuses
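The dynamic-programming step above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: `best_split`, the toy Latin-script "syllables", and the `lexicon`/`freq` dicts are hypothetical stand-ins for the real segmenter and dictionary provider.

```python
def best_split(syllables, lexicon, freq, max_parts=4):
    """Return the highest-scoring split of `syllables` into lexicon morphemes."""
    n = len(syllables)
    # best[i] = (score, parts) for the prefix syllables[:i]
    best = {0: (0.0, [])}
    for i in range(1, n + 1):
        for j in range(i):
            if j not in best:
                continue  # prefix syllables[:j] has no valid split
            candidate = "".join(syllables[j:i])
            if candidate in lexicon and len(best[j][1]) < max_parts:
                score = best[j][0] + freq.get(candidate, 0)
                if i not in best or score > best[i][0]:
                    best[i] = (score, best[j][1] + [candidate])
    # None if no split covers the whole word
    return best.get(n, (0.0, None))[1]

parts = best_split(["note", "book"], {"note", "book"}, {"note": 50, "book": 80})
# parts == ["note", "book"]
```

Because `best` only keeps the top-scoring split per prefix, the search stays linear in the number of syllable boundaries rather than exponential in the number of possible splits.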

Usage

from myspellchecker.text.compound_resolver import CompoundResolver

resolver = CompoundResolver(
    segmenter=segmenter,
    min_morpheme_frequency=10,
    max_parts=4,
)

result = resolver.resolve(
    word="ကျောင်းသား",
    dictionary_check=provider.is_valid_word,
    frequency_check=provider.get_frequency,
    pos_check=provider.get_pos,
)

if result and result.is_valid:
    print(f"Parts: {result.parts}")        # ["ကျောင်း", "သား"]
    print(f"Pattern: {result.pattern}")    # "N+N"
    print(f"Confidence: {result.confidence}")

Configuration

from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.algorithm_configs import CompoundResolverConfig

config = SpellCheckerConfig(
    compound_resolver=CompoundResolverConfig(
        min_morpheme_frequency=10,  # Minimum frequency per morpheme
        max_parts=4,                # Maximum compound parts
        cache_size=1024,            # LRU cache size
    )
)

Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| min_morpheme_frequency | 10 | Minimum corpus frequency for each morpheme |
| max_parts | 4 | Maximum number of parts in a compound |
| cache_size | 1024 | LRU cache entries for resolved compounds |

Morphotactic Rules

Compound POS patterns are defined in rules/morphotactics.yaml:
compound_patterns:
  - pattern: "N+N"
    enabled: true
  - pattern: "V+V"
    enabled: true
  - pattern: "ADJ+N"
    enabled: true
  # ...

blocked_patterns:
  - "PART+PART"  # Particles don't compound

morphotactic_bonuses:
  "N+N": 0.10    # Highest bonus — most common pattern
  "ADJ+N": 0.08
  "V+V": 0.05
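As a rough illustration of how such rules could be consulted during scoring: the `rules` dict below mirrors the YAML structure above, but `pattern_score` and its `base_score` parameter are hypothetical names for this sketch, not the library's API.

```python
# Rule tables mirroring rules/morphotactics.yaml
rules = {
    "compound_patterns": {"N+N", "V+V", "ADJ+N"},
    "blocked_patterns": {"PART+PART"},
    "morphotactic_bonuses": {"N+N": 0.10, "ADJ+N": 0.08, "V+V": 0.05},
}

def pattern_score(part_pos, base_score=0.5):
    """Join per-part POS tags into a pattern and apply the rule bonuses."""
    pattern = "+".join(part_pos)
    if pattern in rules["blocked_patterns"]:
        return pattern, None  # explicitly disallowed (e.g. particles)
    if pattern not in rules["compound_patterns"]:
        return pattern, None  # pattern not whitelisted
    bonus = rules["morphotactic_bonuses"].get(pattern, 0.0)
    return pattern, base_score + bonus

pattern, score = pattern_score(["N", "N"])   # "N+N" gets the 0.10 bonus
```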

CompoundSplit Result

from myspellchecker.text.compound_resolver import CompoundSplit

# Returned by resolver.resolve()
@dataclass(frozen=True)
class CompoundSplit:
    word: str                    # Original compound word
    parts: list[str]             # Morpheme strings
    part_pos: list[str | None]   # POS tag per part
    pattern: str                 # e.g., "N+N"
    confidence: float            # Split confidence score
    is_valid: bool               # Whether split is valid

Reduplication

The ReduplicationEngine validates OOV words formed by reduplicating known dictionary words — a productive morphological process in Myanmar.

Reduplication Patterns

| Pattern | Structure | Example | Meaning |
|---------|-----------|---------|---------|
| AA | Syllable repeats | ကောင်းကောင်း | "well" (from ကောင်း "good") |
| AABB | Each syllable doubles | သေသေချာချာ | "very carefully" |
| ABAB | Whole word repeats | ခဏခဏ | "frequently" |
| RHYME | Known rhyme pairs | From grammar patterns | Fixed expressions |

How It Works

  1. Check against known rhyme reduplication patterns (fast path)
  2. Segment into syllables and detect the reduplication pattern
  3. Extract the base word from the pattern
  4. Validate: base must be in dictionary with sufficient frequency
  5. Check POS: only V, ADJ, ADV, N can productively reduplicate
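The pattern-detection step (step 2) can be sketched over a syllable list like this. The function name, the toy Latin-script syllables, and the simplified logic are illustrative assumptions; the real engine also handles rhyme pairs and applies the dictionary, frequency, and POS checks described above.

```python
def detect_pattern(syllables):
    """Return (pattern, base_syllables), or (None, None) if no match."""
    n = len(syllables)
    if n == 2 and syllables[0] == syllables[1]:
        return "AA", syllables[:1]            # [A, A]       -> base A
    if n == 4:
        a, b, c, d = syllables
        if a == b and c == d and a != c:
            return "AABB", [a, c]             # [A, A, B, B] -> base A B
        if (a, b) == (c, d):
            return "ABAB", [a, b]             # [A, B, A, B] -> base A B
    return None, None

detect_pattern(["ka", "ka"])                  # ("AA", ["ka"])
detect_pattern(["se", "se", "cha", "cha"])    # ("AABB", ["se", "cha"])
```

The extracted base syllables are then rejoined into a candidate base word for the dictionary and frequency checks in steps 4-5.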

Usage

from myspellchecker.text.reduplication import ReduplicationEngine

engine = ReduplicationEngine(
    segmenter=segmenter,
    min_base_frequency=5,
)

result = engine.analyze(
    word="ကောင်းကောင်း",
    dictionary_check=provider.is_valid_word,
    frequency_check=provider.get_frequency,
    pos_check=provider.get_pos,
)

if result and result.is_valid:
    print(f"Pattern: {result.pattern}")      # "AA"
    print(f"Base: {result.base_word}")       # "ကောင်း"
    print(f"POS: {result.pos_tag}")          # "ADJ"

Configuration

from myspellchecker.core.config.algorithm_configs import ReduplicationConfig

config = SpellCheckerConfig(
    reduplication=ReduplicationConfig(
        min_base_frequency=5,   # Minimum base word frequency
        cache_size=1024,        # LRU cache size
    )
)

ReduplicationResult

@dataclass(frozen=True)
class ReduplicationResult:
    word: str           # Original word
    pattern: str        # AA, AABB, ABAB, or RHYME
    base_word: str      # Base word
    is_valid: bool      # Whether valid reduplication
    pos_tag: str | None # POS of base word
    confidence: float   # Analysis confidence

Integration with Word Validation

Both engines are integrated into the word validation pipeline. When WordValidator encounters an OOV word, it checks compound resolution and reduplication before flagging a spelling error:
from myspellchecker.core import SpellCheckerBuilder

checker = (
    SpellCheckerBuilder()
    .with_compound_resolver(True)
    .with_reduplication(True)
    .build()
)

# "ကျောင်းသား" won't be flagged as a spelling error
# because CompoundResolver validates it as N+N compound
result = checker.check("ကျောင်းသားတွေ")

See Also