In Myanmar text, many common errors swap visually or phonetically similar words that are both valid dictionary entries, so a dictionary lookup alone cannot catch them. The homophone checker uses N-gram context probabilities to flag these “real-word errors” and suggest the contextually correct alternative.

Overview

Myanmar homophones often arise from:
| Confusion Type | Example | Description |
| --- | --- | --- |
| Medials | ျ vs ြ | Ya-pin vs Ya-yit |
| Finals | န် vs ံ vs မ် | Na-that vs Thay-thay-tin vs Ma-that |
| Vowels | ိ vs ည် | Similar sounds in context |
| Tone marks | ကား vs ကာ | Different meanings |
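
To make the medial row of the table concrete, here is a minimal sketch that generates Ya-pin/Ya-yit swap candidates for a word. It is an illustration only, not how the library builds its homophone map; the function name `medial_swap_variants` is hypothetical. It relies on the Unicode code points U+103B (medial ya, ျ) and U+103C (medial ra, ြ).

```python
MEDIAL_YA = "\u103b"  # Ya-pin (ျ)
MEDIAL_RA = "\u103c"  # Ya-yit (ြ)


def medial_swap_variants(word):
    """Return the words obtained by swapping one Ya-pin/Ya-yit medial.

    Hypothetical helper for illustration; it does not check that the
    variants are valid dictionary words.
    """
    variants = set()
    for i, ch in enumerate(word):
        if ch == MEDIAL_YA:
            variants.add(word[:i] + MEDIAL_RA + word[i + 1:])
        elif ch == MEDIAL_RA:
            variants.add(word[:i] + MEDIAL_YA + word[i + 1:])
    return variants


# "school" has a Ya-pin medial; swapping it yields the word for "reason"
print(medial_swap_variants("ကျောင်း"))
```

A real pipeline would still need to filter such candidates against the dictionary, since most swaps do not produce valid words.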

HomophoneChecker

Constructor

HomophoneChecker(
    config_path=None,      # Path to grammar rules config
    homophone_map=None,    # Optional override map (dict of word → set of homophones)
    provider=None,         # DictionaryProvider for DB confusable_pairs lookup
)

Basic Usage

from myspellchecker.core.homophones import HomophoneChecker

checker = HomophoneChecker()

# Get homophones for a word
homophones = checker.get_homophones("ကား")
print(homophones)  # ["ကာ"]

# Check whether a word has homophones (an empty result means none are known)
has_homophones = bool(checker.get_homophones("ကား"))
print(has_homophones)  # True

Common Homophone Pairs

| Word 1 | Word 2 | Meanings |
| --- | --- | --- |
| ကား | ကာ | car vs to protect |
| ကျောင်း | ကြောင်း | school vs reason |
| ကျွန် | ကြွန် | servant vs (medial confusion) |

Custom Homophone Map

# Use custom homophone map
custom_map = {
    "ကား": ["ကာ"],
    "ကာ": ["ကား"],
    "ကျောင်း": ["ကြောင်း"],
    "ကြောင်း": ["ကျောင်း"],
}

checker = HomophoneChecker(homophone_map=custom_map)
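
Writing both directions of every pair by hand is easy to get wrong. The sketch below builds a symmetric map from groups of mutually confusable words. `build_symmetric_map` is a hypothetical helper, not part of the library, and it produces sets as values on the assumption that the `homophone_map` parameter accepts the "dict of word → set of homophones" shape described in the constructor comment.

```python
from collections import defaultdict


def build_symmetric_map(groups):
    """Expand groups of confusable words into a word -> set-of-others map.

    Hypothetical helper: every word in a group is mapped to all the
    other words in that group, so both directions are always present.
    """
    mapping = defaultdict(set)
    for group in groups:
        for word in group:
            mapping[word].update(other for other in group if other != word)
    return dict(mapping)


custom_map = build_symmetric_map([
    ("ကား", "ကာ"),              # car vs to protect
    ("ကျောင်း", "ကြောင်း"),  # school vs reason
])

# Assumed usage, based on the constructor signature shown earlier:
# checker = HomophoneChecker(homophone_map=custom_map)
```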

With Provider (DB Confusable Pairs)

from myspellchecker.providers import SQLiteProvider

# Merge YAML homophones with DB confusable_pairs table
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = HomophoneChecker(provider=provider)
The provider parameter enables DB-driven confusable pair lookup via get_confusable_pairs(). The DB source provides ~21K pairs (aspiration, medial, nasal, tone swaps) and is the primary source. The YAML source is a curated fallback for pairs that corpus mining cannot discover.

Load from Config

# Load from grammar config
checker = HomophoneChecker(config_path="/path/to/config")

# Or use default config (loads from rules/homophones.yaml)
checker = HomophoneChecker()

Homophone Validation Strategy

The HomophoneValidationStrategy uses context to detect homophone errors:
from myspellchecker.core.validation_strategies.homophone_strategy import (
    HomophoneValidationStrategy
)

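# `checker` is the HomophoneChecker created above; `ngram_provider` and
# `context_checker` are assumed to have been constructed elsewhere.
strategy = HomophoneValidationStrategy(
    homophone_checker=checker,
    provider=ngram_provider,          # DictionaryProvider for word frequency lookups
    context_checker=context_checker,  # NgramContextChecker instance
    confidence=0.80,
)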

Configuration Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| homophone_checker | Required | HomophoneChecker instance for homophone lookups. If None, strategy is disabled. |
| provider | Required | DictionaryProvider for word frequency lookups |
| context_checker | None | NgramContextChecker that performs N-gram comparison via check_word_in_context() |
| confidence | 0.8 | Confidence score assigned to homophone errors |
Improvement ratios and probability thresholds are managed internally by NgramContextChecker.compute_required_ratio(), not passed directly to the strategy constructor.

How It Works

  1. For each word, check if it has homophones
  2. Analyze surrounding context (N-gram probabilities)
  3. If a homophone has higher probability in context, flag as error
  4. Suggest the contextually appropriate homophone
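
The four steps above can be sketched with a toy bigram table. This is a simplified illustration under stated assumptions, not the library's implementation: `context_score`, `find_homophone_errors`, the scoring rule (sum of the left and right bigram probabilities), and the probability values are all hypothetical.

```python
def context_score(bigram_probs, words, index, candidate):
    """Score `candidate` at `index` by its bigrams with both neighbours."""
    score = 0.0
    if index > 0:
        score += bigram_probs.get((words[index - 1], candidate), 0.0)
    if index + 1 < len(words):
        score += bigram_probs.get((candidate, words[index + 1]), 0.0)
    return score


def find_homophone_errors(words, homophones, bigram_probs):
    """Return (index, word, suggestion) for words whose homophone fits better."""
    errors = []
    for i, word in enumerate(words):
        candidates = sorted(homophones.get(word, ()))  # step 1
        if not candidates:
            continue
        current = context_score(bigram_probs, words, i, word)  # step 2
        best = max(candidates, key=lambda c: context_score(bigram_probs, words, i, c))
        if context_score(bigram_probs, words, i, best) > current:  # step 3
            errors.append((i, word, best))  # step 4
    return errors


# Toy data: made-up probabilities, for illustration only
homophones = {"ကာ": {"ကား"}, "ကား": {"ကာ"}}
bigram_probs = {("ကား", "သွား"): 0.02, ("ကာ", "သွား"): 0.001}

print(find_homophone_errors(["ကာ", "သွား", "တယ်"], homophones, bigram_probs))
```

The real strategy delegates the comparison to NgramContextChecker, which also applies improvement ratios and a minimum probability threshold rather than a plain "greater than" test.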

Minimum Probability Threshold

The NgramContextChecker applies a minimum probability threshold internally to prevent false positives from infrequent N-gram occurrences:
# When current word has zero probability (unseen n-gram):
# - Without threshold: ANY positive probability triggers suggestion
# - With threshold: Only probabilities above the minimum trigger suggestion

# For example, with a threshold of 0.001:
# Homophone with prob 0.01   → suggested (above threshold)
# Homophone with prob 0.0001 → NOT suggested (below threshold)
This prevents false suggestions when a homophone appears rarely in the training data.
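
The gating logic described above can be sketched as a small predicate. This is an assumption-laden illustration, not the code inside NgramContextChecker: the function `should_suggest`, the default `min_prob` of 0.001 (taken from the example above), and the `required_ratio` default of 2.0 are hypothetical.

```python
def should_suggest(current_prob, candidate_prob, min_prob=0.001, required_ratio=2.0):
    """Decide whether a homophone is a plausible replacement.

    Hypothetical sketch: the candidate must exceed the minimum
    probability, and must beat the current word by `required_ratio`
    unless the current word's n-gram was never seen.
    """
    if candidate_prob <= min_prob:
        return False  # too rare in the training data to trust
    if current_prob == 0.0:
        return True   # unseen n-gram: any candidate above the floor wins
    return candidate_prob / current_prob >= required_ratio


print(should_suggest(0.0, 0.01))    # candidate above the threshold
print(should_suggest(0.0, 0.0001))  # candidate below the threshold
```

Without the first check, a single rare occurrence of the homophone in the corpus would be enough to override any unseen word.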

Example Detection

from myspellchecker.core.validation_strategies.base import ValidationContext

# "ကား သွား" (went by car) vs "ကာ သွား" (shield went, which is nonsensical)
# Context suggests "ကား" (car) is the correct word here

words = ["ကာ", "သွား", "တယ်"]
context = ValidationContext(
    sentence="ကာ သွား တယ်",
    words=words,
    word_positions=[0, 3, 8]  # Unicode code point offsets
)

errors = strategy.validate(context)
# May suggest "ကား" instead of "ကာ" based on context

Integration with SpellChecker

Homophone checking is automatically enabled with context validation:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

config = SpellCheckerConfig(
    use_context_checker=True  # Enables homophone detection
)

provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)
result = checker.check("ကာ သွား တယ်")

# Homophone errors have type "homophone_error"
for error in result.errors:
    if error.error_type == "homophone_error" and error.suggestions:
        print(f"{error.text} → {error.suggestions[0]}")

Homophones YAML Configuration

Homophones are defined in rules/homophones.yaml:
version: "1.1.0"
category: "homophones"

homophones:
  # Each entry maps a word to its homophones (simple list format)
  "ကား": ["ကာ"]            # car vs protect/shield
  "ကာ": ["ကား"]
  "ကျောင်း": ["ကြောင်း"]  # school vs reason
  "ကြောင်း": ["ကျောင်း"]
  "ကံ": ["ကန်", "ကင်"]    # luck vs kick vs (rare)
  "ကန်": ["ကံ", "ကင်"]
Context disambiguation is handled automatically via N-gram probabilities at the strategy level, so no per-entry disambiguation context is needed in the YAML.
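
Because each direction is listed as a separate entry, it can be useful to lint a homophone map for one-way entries before shipping it. The sketch below works on the plain dict you would get after loading the `homophones` section. `find_asymmetric_entries` is a hypothetical helper, and whether the checker actually requires reverse entries is not stated here, so treat the result as a review aid rather than a list of errors.

```python
def find_asymmetric_entries(homophones):
    """Return (word, missing) pairs where `word` does not list `missing`.

    Hypothetical lint: for every word -> alternative link, check that
    the alternative has its own entry pointing back at the word.
    """
    missing = []
    for word, alternatives in homophones.items():
        for alt in alternatives:
            if word not in homophones.get(alt, ()):
                missing.append((alt, word))
    return missing


# Shape of the data after loading the `homophones` section
loaded = {
    "ကား": ["ကာ"],
    "ကာ": ["ကား"],
}
print(find_asymmetric_entries(loaded))  # no one-way entries in this sample
```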

Structure

| Field | Description |
| --- | --- |
| homophones | Map of word → list of homophones |
| version | Schema version |
| metadata | Entry count, dates, source notes |

Best Practices

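  1. Enable with context: Homophones are valid words, so they can only be detected from the surrounding context (use_context_checker=True)
  2. Review suggestions: Homophone errors carry a default confidence of 0.80, so treat them as candidates rather than certain corrections
  3. Add domain-specific pairs: Extend homophones.yaml or pass a homophone_map for your domain
  4. Use with N-grams: Detection relies on N-gram probabilities, so supply a provider and context checker backed by N-gram data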

Performance

  • Homophone lookup: O(1) average-case hash table lookup
  • Context analysis: Depends on N-gram checker
  • Memory: Minimal (homophone map is small)
