In Myanmar, many common errors involve swapping visually or phonetically similar words that are both valid dictionary entries. The homophone checker uses N-gram context probabilities to flag these “real-word errors” and suggest the contextually correct alternative.

Overview

Myanmar homophones often arise from:
| Confusion Type | Example | Description |
| --- | --- | --- |
| Medials | ျ vs ြ | Ya-pin vs Ya-yit |
| Finals | န် vs ံ vs မ် | Na-that vs Thay-thay-tin vs Ma-that |
| Vowels | ိ vs ည် | Similar sounds in context |
| Tone marks | ကား vs ကာ | Different meanings |
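
Because both members of such a pair are valid dictionary entries, a dictionary-membership check alone can never flag the swap. A minimal sketch of why context is required, using a hypothetical toy dictionary:

```python
# Toy dictionary: both medial variants are valid words, so a
# membership test alone cannot catch the swap.
# (Hypothetical data for illustration only.)
dictionary = {"ကျောင်း", "ကြောင်း", "သွား", "တယ်"}

def dictionary_only_check(words):
    """Return words that are NOT in the dictionary (non-word errors)."""
    return [w for w in words if w not in dictionary]

# "ကြောင်း" (reason) was typed where "ကျောင်း" (school) was intended,
# but both are real words, so nothing is flagged:
print(dictionary_only_check(["ကြောင်း", "သွား", "တယ်"]))  # []
```

This is exactly the gap the N-gram-based homophone checker fills: it scores each candidate against its surrounding words rather than against the dictionary alone.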

HomophoneChecker

Constructor

HomophoneChecker(
    config_path=None,      # Path to grammar rules config
    homophone_map=None,    # Optional override map (dict of word → set of homophones)
    provider=None,         # DictionaryProvider for DB confusable_pairs lookup
)

Basic Usage

from myspellchecker.core.homophones import HomophoneChecker

checker = HomophoneChecker()

# Get homophones for a word
homophones = checker.get_homophones("ကား")
print(homophones)  # ["ကာ"]

# Check if word has homophones (use get_homophones)
has_homophones = len(checker.get_homophones("ကား")) > 0
print(has_homophones)  # True

Common Homophone Pairs

| Word 1 | Word 2 | Meanings |
| --- | --- | --- |
| ကား | ကာ | car vs to protect |
| ကျောင်း | ကြောင်း | school vs reason |
| ကျွန် | ကြွန် | servant vs (medial confusion) |

Custom Homophone Map

# Use custom homophone map
custom_map = {
    "ကား": ["ကာ"],
    "ကာ": ["ကား"],
    "ကျောင်း": ["ကြောင်း"],
    "ကြောင်း": ["ကျောင်း"],
}

checker = HomophoneChecker(homophone_map=custom_map)

With Provider (DB Confusable Pairs)

from myspellchecker.providers import SQLiteProvider

# Merge YAML homophones with DB confusable_pairs table
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = HomophoneChecker(provider=provider)
The provider parameter enables DB-driven confusable pair lookup via get_confusable_pairs(). The DB source provides ~21K pairs (aspiration, medial, nasal, tone swaps) and is the primary source. The YAML source is a curated fallback for pairs that corpus mining cannot discover.
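
A rough sketch of how the two sources could be combined into one symmetric lookup map. The `(word_a, word_b)` tuple shape for DB rows is an assumption for illustration; the real return shape of `get_confusable_pairs()` may differ:

```python
from collections import defaultdict

def merge_homophone_sources(yaml_map, db_pairs):
    """Merge a YAML word -> list map with DB (word_a, word_b) confusable
    pairs into a single symmetric word -> set-of-homophones map."""
    merged = defaultdict(set)
    for word, homophones in yaml_map.items():
        merged[word].update(homophones)
    for a, b in db_pairs:
        merged[a].add(b)
        merged[b].add(a)  # keep the map symmetric in both directions
    return dict(merged)

yaml_map = {"ကား": ["ကာ"], "ကာ": ["ကား"]}
db_pairs = [("ကျောင်း", "ကြောင်း")]  # hypothetical rows from confusable_pairs
merged = merge_homophone_sources(yaml_map, db_pairs)
print(merged["ကျောင်း"])  # {'ကြောင်း'}
```

Keeping the merged map symmetric means a lookup works regardless of which member of a pair the user typed.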

Load from Config

# Load from grammar config
checker = HomophoneChecker(config_path="/path/to/config")

# Or use default config (loads from rules/homophones.yaml)
checker = HomophoneChecker()

Homophone Validation Strategy

The HomophoneValidationStrategy uses context to detect homophone errors:
from myspellchecker.core.validation_strategies.homophone_strategy import (
    HomophoneValidationStrategy
)

strategy = HomophoneValidationStrategy(
    homophone_checker=checker,
    provider=ngram_provider,
    context_checker=context_checker,  # NgramContextChecker instance
    confidence=0.80,
)

Configuration Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| homophone_checker | Required | HomophoneChecker instance for homophone lookups. If None, strategy is disabled. |
| provider | Required | DictionaryProvider for word frequency lookups |
| context_checker | None | NgramContextChecker that performs N-gram comparison via check_word_in_context() |
| confidence | 0.8 | Confidence score assigned to homophone errors |
Improvement ratios and probability thresholds are managed internally by NgramContextChecker.compute_required_ratio(), not passed directly to the strategy constructor.

How It Works

  1. For each word, check if it has homophones
  2. Analyze surrounding context (N-gram probabilities)
  3. If a homophone has higher probability in context, flag as error
  4. Suggest the contextually appropriate homophone
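
The four steps above can be sketched with a toy bigram model. The probability values, the `improvement_ratio` default, and the bigram key shape are all illustrative assumptions, not the library's internals:

```python
# Toy bigram model: probability of a word followed by the next word.
# (Hypothetical numbers for illustration.)
bigram_prob = {
    ("ကား", "သွား"): 0.02,    # "went by car": common
    ("ကာ", "သွား"): 0.0001,   # "shield went": rare / nonsensical
}

homophone_map = {"ကာ": ["ကား"], "ကား": ["ကာ"]}

def suggest_in_context(word, next_word, improvement_ratio=10.0):
    """Steps 1-4: look up homophones, score the word and each alternative
    in context, and return an alternative only if it is markedly more
    probable than the original."""
    current_p = bigram_prob.get((word, next_word), 0.0)
    best_word, best_p = None, current_p * improvement_ratio
    for alt in homophone_map.get(word, []):
        alt_p = bigram_prob.get((alt, next_word), 0.0)
        if alt_p > best_p:
            best_word, best_p = alt, alt_p
    return best_word  # None means no homophone error detected

print(suggest_in_context("ကာ", "သွား"))   # suggests "ကား"
print(suggest_in_context("ကား", "သွား"))  # None: original fits context
```

The ratio requirement is what keeps the strategy conservative: an alternative must be substantially, not marginally, more probable before it is suggested.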

Minimum Probability Threshold

The NgramContextChecker applies a minimum probability threshold internally to prevent false positives from infrequent N-gram occurrences:
# When current word has zero probability (unseen n-gram):
# - Without threshold: ANY positive probability triggers suggestion
# - With threshold: Only probabilities above the minimum trigger suggestion

# For example, with a threshold of 0.001:
# Homophone with prob 0.01   → suggested (above threshold)
# Homophone with prob 0.0001 → NOT suggested (below threshold)
This prevents false suggestions when a homophone appears rarely in the training data.
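
The guard described above can be sketched as a small predicate. The threshold value is illustrative, not the library default:

```python
MIN_PROB_THRESHOLD = 0.001  # illustrative value only

def passes_threshold(current_prob, homophone_prob,
                     min_prob=MIN_PROB_THRESHOLD):
    """Decide whether a homophone's context probability justifies a
    suggestion. When the current word is unseen (zero probability),
    apply a floor instead of accepting any positive probability."""
    if current_prob > 0.0:
        return homophone_prob > current_prob  # ordinary comparison
    return homophone_prob > min_prob          # unseen n-gram: apply floor

print(passes_threshold(0.0, 0.01))    # True  (above threshold)
print(passes_threshold(0.0, 0.0001))  # False (below threshold)
```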

Example Detection

from myspellchecker.core.validation_strategies.base import ValidationContext

# "ကား သွား" (went by car) vs "ကာ သွား" (shield went — nonsensical)
# Context suggests "ကား" (car) is the correct word here

words = ["ကာ", "သွား", "တယ်"]
context = ValidationContext(
    sentence="ကာ သွား တယ်",
    words=words,
    word_positions=[0, 3, 8]  # Unicode code point offsets
)

errors = strategy.validate(context)
# May suggest "ကား" instead of "ကာ" based on context

Integration with SpellChecker

Homophone checking is automatically enabled with context validation:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

config = SpellCheckerConfig(
    use_context_checker=True  # Enables homophone detection
)

provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)
result = checker.check("ကာ သွား တယ်")

# Homophone errors have type "homophone_error"
for error in result.errors:
    if error.error_type == "homophone_error":
        print(f"{error.text} → {error.suggestions[0]}")

Homophones YAML Configuration

Homophones are defined in rules/homophones.yaml:
version: "1.1.0"
category: "homophones"

homophones:
  # Each entry maps a word to its homophones (simple list format)
  "ကား": ["ကာ"]            # car vs protect/shield
  "ကာ": ["ကား"]
  "ကျောင်း": ["ကြောင်း"]  # school vs reason
  "ကြောင်း": ["ကျောင်း"]
  "ကံ": ["ကန်", "ကင်"]    # luck vs kick vs (rare)
  "ကန်": ["ကံ", "ကင်"]
Context disambiguation is handled automatically via N-gram probabilities at the strategy level, so no per-entry disambiguation context is needed in the YAML.
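
Since lookups go one direction at a time, each pair should normally be listed under both words. A small sketch that checks this symmetry once the YAML has been parsed into a dict (the parsing step itself is omitted; the data below mirrors the example entries):

```python
def find_asymmetric_entries(homophones):
    """Return (word, homophone) pairs whose reverse mapping is missing."""
    missing = []
    for word, alts in homophones.items():
        for alt in alts:
            if word not in homophones.get(alt, []):
                missing.append((word, alt))
    return missing

parsed = {
    "ကား": ["ကာ"],
    "ကာ": ["ကား"],
    "ကံ": ["ကန်", "ကင်"],
    "ကန်": ["ကံ", "ကင်"],
}
print(find_asymmetric_entries(parsed))
# [('ကံ', 'ကင်'), ('ကန်', 'ကင်')]  ("ကင်" has no entry of its own)
```

A check like this is a cheap safeguard to run when extending `homophones.yaml` with domain-specific pairs.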

Structure

| Field | Description |
| --- | --- |
| homophones | Map of word → list of homophones |
| version | Schema version |
| metadata | Entry count, dates, source notes |

Best Practices

  1. Enable with context: Homophones need context for accurate detection
  2. Review suggestions: Homophone detection has moderate confidence
  3. Add domain-specific pairs: Extend homophones.yaml for your domain
  4. Use with N-grams: N-gram probabilities improve accuracy

Performance

  • Homophone lookup: O(1) hash table
  • Context analysis: Depends on N-gram checker
  • Memory: Minimal (homophone map is small)

See Also