mySpellChecker includes a homophone checker to detect “Real-Word Errors”: words that are spelled correctly but confused with similar-sounding words.

Overview

Myanmar homophones often arise from:
| Confusion Type | Example | Description |
| --- | --- | --- |
| Medials | ျ vs ြ | Ya-pin vs Ya-yit |
| Finals | န် vs ံ vs မ် | Na-that vs Thay-thay-tin vs Ma-that |
| Vowels | ိ vs ည် | Similar sounds in context |
| Tone marks | ကား vs ကာ | Different meanings |

HomophoneChecker

Basic Usage

from myspellchecker.core.homophones import HomophoneChecker

checker = HomophoneChecker()

# Get homophones for a word
homophones = checker.get_homophones("ကား")
print(homophones)  # {"ကာ"}

# Check if word has homophones
has_homophones = checker.has_homophone("ကား")
print(has_homophones)  # True

Common Homophone Pairs

| Word 1 | Word 2 | Meanings |
| --- | --- | --- |
| ကား | ကာ | car vs shield/screen |
| သာ | သား | merely/pleasant vs son |
| ကျောင်း | ကြောင်း | school vs reason |
| မြန် | မျန် | fast vs (incorrect) |
| ငါ | ငါး | I/me vs five/fish |

Custom Homophone Map

# Use custom homophone map
custom_map = {
    "ကား": {"ကာ"},
    "ကာ": {"ကား"},
    "သာ": {"သား"},
    "သား": {"သာ"}
}

checker = HomophoneChecker(homophone_map=custom_map)
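
The map above lists every pair in both directions. Assuming lookups are done per key (so a one-directional entry would only be found from one side), a small helper can keep a hand-written map symmetric. This is a minimal sketch; `build_homophone_map` is a hypothetical name and not part of mySpellChecker.

```python
from collections import defaultdict


def build_homophone_map(groups):
    """Return a word -> set-of-homophones map that is symmetric within each group."""
    mapping = defaultdict(set)
    for group in groups:
        for word in group:
            # Every other member of the group is a homophone of `word`.
            mapping[word].update(w for w in group if w != word)
    return dict(mapping)


# Each tuple is one group of words that sound alike.
custom_map = build_homophone_map([
    ("ကား", "ကာ"),
    ("သာ", "သား"),
])
```

The resulting `custom_map` has the same shape as the literal above and could be passed as `homophone_map=custom_map`.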

Load from Config

# Load from grammar config
checker = HomophoneChecker(config_path="/path/to/config")

# Or use default config (loads from rules/homophones.yaml)
checker = HomophoneChecker()

Homophone Validation Strategy

The HomophoneValidationStrategy uses context to detect homophone errors:
from myspellchecker.core.validation_strategies.homophone_strategy import (
    HomophoneValidationStrategy
)

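# `checker` is the HomophoneChecker created earlier; `ngram_provider` is assumed
# to be an existing n-gram provider instance (not constructed in this snippet).
strategy = HomophoneValidationStrategy(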
    homophone_checker=checker,
    provider=ngram_provider,
    confidence=0.80,
    improvement_ratio=5.0,   # Require 5x better probability
    min_probability=0.001    # Minimum threshold to prevent false positives
)

Configuration Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| homophone_checker | Required | HomophoneChecker instance for lookups |
| provider | Required | NgramRepository for n-gram probabilities |
| confidence | 0.8 | Confidence score for homophone errors |
| improvement_ratio | 5.0 | Minimum probability improvement ratio (5x) |
| min_probability | 0.001 | Minimum probability threshold |
| high_freq_threshold | 1000 | Word frequency above which stricter ratio applies |
| high_freq_improvement_ratio | 50.0 | Improvement ratio for high-frequency words (50x) |
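
One way to read the last two rows: a word whose frequency exceeds `high_freq_threshold` must be beaten by a much larger factor before it is flagged. The sketch below is an assumption about how the two ratios combine; the function name `required_ratio` and the strict `>` comparison are illustrative and not taken from the library.

```python
def required_ratio(word_frequency, improvement_ratio=5.0,
                   high_freq_threshold=1000, high_freq_improvement_ratio=50.0):
    """Pick the improvement ratio a homophone must reach for a word of this frequency."""
    if word_frequency > high_freq_threshold:
        # Very common words are rarely typos, so demand stronger evidence.
        return high_freq_improvement_ratio
    return improvement_ratio


ratio_rare = required_ratio(120)       # below the threshold: default ratio
ratio_common = required_ratio(25_000)  # above the threshold: stricter ratio
```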

How It Works

  1. For each word, check if it has homophones
  2. Analyze surrounding context (N-gram probabilities)
  3. If a homophone has higher probability in context, flag as error
  4. Suggest the contextually appropriate homophone
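
The four steps can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the bigram table, its probabilities, and the names `BIGRAM_PROB`, `context_prob`, and `find_homophone_errors` are invented here, only the following word is used as context, and the `min_probability` and high-frequency refinements are left out.

```python
# Toy bigram probabilities for (word, next_word); the numbers are invented.
BIGRAM_PROB = {
    ("ကား", "သွား"): 0.02,
    ("ကာ", "သွား"): 0.001,
}
HOMOPHONES = {"ကား": {"ကာ"}, "ကာ": {"ကား"}}


def context_prob(word, next_word):
    """Probability of `word` followed by `next_word`; 0.0 if the pair is unseen."""
    return BIGRAM_PROB.get((word, next_word), 0.0)


def find_homophone_errors(words, improvement_ratio=5.0):
    """Return (index, word, suggestion) for words beaten by a homophone in context."""
    errors = []
    for i, word in enumerate(words[:-1]):
        candidates = HOMOPHONES.get(word, set())      # step 1: any homophones?
        if not candidates:
            continue
        next_word = words[i + 1]
        current = context_prob(word, next_word)       # step 2: score in context
        best = max(candidates, key=lambda c: context_prob(c, next_word))
        best_prob = context_prob(best, next_word)
        if best_prob > current * improvement_ratio:   # step 3: clearly better?
            errors.append((i, word, best))            # step 4: suggest it
    return errors


errors = find_homophone_errors(["ကာ", "သွား", "တယ်"])
```

With these toy numbers the homophone scores 0.02 against 0.001 for the written word, which clears the 5x ratio, so the first word is flagged.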

Minimum Probability Threshold

The min_probability parameter prevents false positives from infrequent n-gram occurrences:
# When current word has zero probability (unseen n-gram):
# - Without threshold: ANY positive probability triggers suggestion
# - With threshold: Only probabilities >= min_probability trigger suggestion

# Example with min_probability=0.001:
# Homophone with prob 0.01   → suggested (0.01 >= 0.001)
# Homophone with prob 0.0001 → NOT suggested (0.0001 < 0.001)
This prevents false suggestions when a homophone appears rarely in the training data.
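
A small sketch of why the floor matters: when the current word's probability is zero, any ratio test passes trivially, so an absolute minimum is applied instead. The function name `should_suggest` and the exact comparison operators are assumptions for illustration, not the library's code.

```python
def should_suggest(current_prob, homophone_prob,
                   improvement_ratio=5.0, min_probability=0.001):
    """Decide whether a homophone is a credible replacement in this context."""
    if current_prob == 0.0:
        # A ratio against zero is unbounded, so fall back to an absolute floor.
        return homophone_prob >= min_probability
    return homophone_prob >= current_prob * improvement_ratio


unseen_strong = should_suggest(0.0, 0.01)    # clears the floor
unseen_weak = should_suggest(0.0, 0.0001)    # too rare to trust
```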

Example Detection

# "ကား သွား" (went by car) vs "ကာ သွား" (shield went, which is nonsensical)
# Context suggests "ကား" (car) is the correct word here

words = ["ကာ", "သွား", "တယ်"]
context = ValidationContext(
    sentence="ကာ သွား တယ်",
    words=words,
    word_positions=[0, 3, 8]  # Unicode code point offsets
)

errors = strategy.validate(context)
# May suggest "ကား" instead of "ကာ" based on context
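
The `word_positions` values count Unicode code points, not bytes or visible glyphs, which is easy to get wrong by hand for Myanmar text. A minimal sketch for deriving them from the word list, assuming the sentence is the words joined by single spaces; `code_point_offsets` is a hypothetical helper, not a library function.

```python
def code_point_offsets(words, separator=" "):
    """Start offset of each word, in code points, when joined by `separator`."""
    offsets = []
    position = 0
    for word in words:
        offsets.append(position)
        # len() on a Python str counts code points, including combining marks.
        position += len(word) + len(separator)
    return offsets


words = ["ကာ", "သွား", "တယ်"]
positions = code_point_offsets(words)
sentence = " ".join(words)
```

For the three words in the example this gives offsets 0, 3, and 8, matching the hand-written list.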

Integration with SpellChecker

Homophone checking is automatically enabled with context validation:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

config = SpellCheckerConfig(
    use_context_checker=True  # Enables homophone detection
)

provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)
result = checker.check("ကာ သွား တယ်")

# Homophone errors have type "homophone_error"
for error in result.errors:
    if error.error_type == "homophone_error":
        print(f"{error.text} → {error.suggestions[0]}")

Homophones YAML Configuration

Homophones are defined in rules/homophones.yaml:
version: "1.1.0"
category: "homophones"

homophones:
  # Each entry maps a word to its homophones (simple list format)
  "ကား": ["ကာ"]            # car vs protect/shield
  "ကာ": ["ကား"]
  "ကျောင်း": ["ကြောင်း"]  # school vs reason
  "ကြောင်း": ["ကျောင်း"]
  "ကံ": ["ကန်", "ကင်"]    # luck vs kick vs (rare)
  "ကန်": ["ကံ", "ကင်"]
Context disambiguation is handled automatically via n-gram probabilities at the strategy level; no per-entry disambiguation context is needed in the YAML.

Structure

| Field | Description |
| --- | --- |
| homophones | Map of word → list of homophones |
| version | Schema version |
| metadata | Entry count, dates, source notes |
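
When extending the `homophones` map by hand, it is easy to add a word in one direction only. The check below is a sketch that works on the already-parsed mapping (a plain dict of lists, as a YAML loader would produce); `missing_reverse_entries` is a hypothetical helper, and whether the library requires symmetric entries is an assumption.

```python
def missing_reverse_entries(homophones):
    """List (entry, missing_word) pairs where `entry` does not list `missing_word` back."""
    missing = []
    for word, alternatives in homophones.items():
        for alt in alternatives:
            # `alt` should have its own entry that includes `word`.
            if word not in homophones.get(alt, []):
                missing.append((alt, word))
    return missing


# Same shape as the last two entries of the YAML excerpt.
homophones = {
    "ကံ": ["ကန်", "ကင်"],
    "ကန်": ["ကံ", "ကင်"],
}
missing = missing_reverse_entries(homophones)
```

Run on this two-entry subset, the check reports the third word twice, because its own entry is not part of the excerpt.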

Best Practices

  1. Enable with context: Homophones need context for accurate detection
  2. Review suggestions: Homophone errors are reported with moderate confidence (0.8 by default), so treat them as candidates rather than certain corrections
  3. Add domain-specific pairs: Extend homophones.yaml for your domain
  4. Use with N-grams: N-gram probabilities improve accuracy

Performance

  • Homophone lookup: O(1) hash table
  • Context analysis: Depends on N-gram checker
  • Memory: Minimal (homophone map is small)

See Also