In Myanmar, many common errors involve swapping visually or phonetically similar words that are both valid dictionary entries. The homophone checker uses N-gram context probabilities to flag these “real-word errors” and suggest the contextually correct alternative.
Overview
Myanmar homophones often arise from:
| Confusion Type | Example | Description |
|---|---|---|
| Medials | ျ vs ြ | Ya-pin vs Ya-yit |
| Finals | န် vs ံ vs မ် | Na-that vs Thay-thay-tin vs Ma-that |
| Vowels | ိ vs ည် | Similar sounds in context |
| Tone marks | ကား vs ကာ | Different meanings |
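The confusion types above can be modeled as character (or cluster) substitutions. As an illustrative sketch, candidate confusables can be generated by applying one swap anywhere in a word; the swap table below is hypothetical, not the library's actual data:

```python
# Illustrative swap table: visually/phonetically similar Myanmar characters.
SWAPS = {
    "ျ": "ြ",   # ya-pin → ya-yit
    "ြ": "ျ",   # ya-yit → ya-pin
    "န်": "ံ",   # na-that → thay-thay-tin
    "ံ": "န်",   # thay-thay-tin → na-that
}

def confusable_variants(word: str) -> set[str]:
    """Return words produced by applying exactly one swap anywhere in `word`."""
    variants = set()
    for src, dst in SWAPS.items():
        start = 0
        while (i := word.find(src, start)) != -1:
            variants.add(word[:i] + dst + word[i + len(src):])
            start = i + 1
    return variants

print(confusable_variants("ကန်"))  # {'ကံ'}
```

In practice the library's confusable pairs come from its YAML rules and the DB `confusable_pairs` table rather than generative rules like this.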
HomophoneChecker
Constructor
```python
HomophoneChecker(
    config_path=None,    # Path to grammar rules config
    homophone_map=None,  # Optional override map (dict of word → set of homophones)
    provider=None,       # DictionaryProvider for DB confusable_pairs lookup
)
```
Basic Usage
```python
from myspellchecker.core.homophones import HomophoneChecker

checker = HomophoneChecker()

# Get homophones for a word
homophones = checker.get_homophones("ကား")
print(homophones)  # ["ကာ"]

# Check if a word has homophones (via get_homophones)
has_homophones = len(checker.get_homophones("ကား")) > 0
print(has_homophones)  # True
```
Common Homophone Pairs
| Word 1 | Word 2 | Meanings |
|---|---|---|
| ကား | ကာ | car vs to protect |
| ကျောင်း | ကြောင်း | school vs reason |
| ကျွန် | ကြွန် | servant vs (medial confusion) |
Custom Homophone Map
```python
# Use a custom homophone map
custom_map = {
    "ကား": ["ကာ"],
    "ကာ": ["ကား"],
    "ကျောင်း": ["ကြောင်း"],
    "ကြောင်း": ["ကျောင်း"],
}
checker = HomophoneChecker(homophone_map=custom_map)
```
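Homophone relations are symmetric, so every pair needs a reverse entry. A small helper (hypothetical, not part of the library) can build the reverse entries automatically so only forward pairs need to be written by hand:

```python
def symmetrize(pairs: dict[str, list[str]]) -> dict[str, list[str]]:
    """Add the reverse entry for every homophone pair."""
    out: dict[str, set[str]] = {}
    for word, homs in pairs.items():
        for h in homs:
            out.setdefault(word, set()).add(h)
            out.setdefault(h, set()).add(word)
    return {w: sorted(hs) for w, hs in out.items()}

# Only forward entries are written; reverse entries are derived.
custom_map = symmetrize({"ကား": ["ကာ"], "ကျောင်း": ["ကြောင်း"]})
```

The result can then be passed as `homophone_map=custom_map` as shown above.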
With Provider (DB Confusable Pairs)
```python
from myspellchecker.providers import SQLiteProvider

# Merge YAML homophones with the DB confusable_pairs table
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = HomophoneChecker(provider=provider)
```
The provider parameter enables DB-driven confusable pair lookup via get_confusable_pairs(). The DB source provides ~21K pairs (aspiration, medial, nasal, tone swaps) and is the primary source. The YAML source is a curated fallback for pairs that corpus mining cannot discover.
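As a sketch of that merge (the function below is illustrative; the real merging lives inside HomophoneChecker), the DB pairs form the base table and the curated YAML entries fill the gaps:

```python
def merge_sources(db_pairs: list[tuple[str, str]],
                  yaml_map: dict[str, list[str]]) -> dict[str, set[str]]:
    """Merge DB confusable pairs (primary) with curated YAML entries (fallback)."""
    merged: dict[str, set[str]] = {}
    for a, b in db_pairs:                # each DB pair is symmetric
        merged.setdefault(a, set()).add(b)
        merged.setdefault(b, set()).add(a)
    for word, homs in yaml_map.items():  # YAML covers pairs mining missed
        merged.setdefault(word, set()).update(homs)
    return merged

table = merge_sources([("ကံ", "ကန်")], {"ကား": ["ကာ"]})
```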
Load from Config
```python
# Load from a grammar config directory
checker = HomophoneChecker(config_path="/path/to/config")

# Or use the default config (loads from rules/homophones.yaml)
checker = HomophoneChecker()
```
Homophone Validation Strategy
The HomophoneValidationStrategy uses context to detect homophone errors:
```python
from myspellchecker.core.validation_strategies.homophone_strategy import (
    HomophoneValidationStrategy,
)

strategy = HomophoneValidationStrategy(
    homophone_checker=checker,
    provider=ngram_provider,
    context_checker=context_checker,  # NgramContextChecker instance
    confidence=0.80,
)
```
Configuration Parameters
| Parameter | Default | Description |
|---|---|---|
| homophone_checker | Required | HomophoneChecker instance for homophone lookups. If None, the strategy is disabled. |
| provider | Required | DictionaryProvider for word frequency lookups |
| context_checker | None | NgramContextChecker that performs N-gram comparison via check_word_in_context() |
| confidence | 0.8 | Confidence score assigned to homophone errors |
Improvement ratios and probability thresholds are managed internally by NgramContextChecker.compute_required_ratio(), not passed directly to the strategy constructor.
How It Works
- For each word, check if it has homophones
- Analyze surrounding context (N-gram probabilities)
- If a homophone has higher probability in context, flag as error
- Suggest the contextually appropriate homophone
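The steps above can be sketched as a single decision function. The `required_ratio` value here is illustrative; in the library it comes from `NgramContextChecker.compute_required_ratio()`:

```python
def pick_better_homophone(word, homophones, context_prob, required_ratio=5.0):
    """Return the homophone that beats `word` in context, or None.

    `context_prob` maps a word to its N-gram probability in the current context.
    """
    p_word = context_prob(word)
    best, best_p = None, p_word
    for h in homophones:
        p = context_prob(h)
        # Flag only when the homophone clearly improves on the current word.
        if p > best_p and (p_word == 0.0 or p / p_word >= required_ratio):
            best, best_p = h, p
    return best

probs = {"ကာ": 0.0001, "ကား": 0.01}
suggestion = pick_better_homophone("ကာ", ["ကား"], lambda w: probs.get(w, 0.0))
```

Note that when the current word has zero probability, any positive-probability homophone wins; the minimum probability threshold described below exists to rein in exactly that case.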
Minimum Probability Threshold
The NgramContextChecker applies a minimum probability threshold internally to prevent false positives from infrequent N-gram occurrences:
When the current word has zero probability (an unseen N-gram):

- Without a threshold, ANY positive probability triggers a suggestion.
- With a threshold, only probabilities above the minimum trigger a suggestion.

For example, with a threshold of 0.001, a homophone with probability 0.01 is suggested (above threshold), while one with probability 0.0001 is not (below threshold).
This prevents false suggestions when a homophone appears rarely in the training data.
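A minimal sketch of that floor, using the illustrative 0.001 threshold from the example above (the real value is internal to NgramContextChecker):

```python
MIN_PROB = 0.001  # illustrative floor, matching the example above

def should_suggest(p_current: float, p_homophone: float,
                   min_prob: float = MIN_PROB) -> bool:
    """Suggest only when the homophone's probability clears the floor."""
    if p_homophone < min_prob:      # too rare in training data: stay silent
        return False
    return p_homophone > p_current  # otherwise require an actual improvement

print(should_suggest(0.0, 0.01))    # True: above threshold
print(should_suggest(0.0, 0.0001))  # False: below threshold
```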
Example Detection
```python
from myspellchecker.core.validation_strategies.base import ValidationContext

# "ကား သွား" (went by car) vs "ကာ သွား" (shield went, nonsensical)
# Context suggests "ကား" (car) is the correct word here
words = ["ကာ", "သွား", "တယ်"]
context = ValidationContext(
    sentence="ကာ သွား တယ်",
    words=words,
    word_positions=[0, 3, 8],  # Unicode code point offsets
)
errors = strategy.validate(context)
# May suggest "ကား" instead of "ကာ" based on context
```
Integration with SpellChecker
Homophone checking is automatically enabled with context validation:
```python
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

config = SpellCheckerConfig(
    use_context_checker=True,  # Enables homophone detection
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

result = checker.check("ကာ သွား တယ်")

# Homophone errors have type "homophone_error"
for error in result.errors:
    if error.error_type == "homophone_error":
        print(f"{error.text} → {error.suggestions[0]}")
```
Homophones YAML Configuration
Homophones are defined in rules/homophones.yaml:
```yaml
version: "1.1.0"
category: "homophones"
homophones:
  # Each entry maps a word to its homophones (simple list format)
  "ကား": ["ကာ"]            # car vs protect/shield
  "ကာ": ["ကား"]
  "ကျောင်း": ["ကြောင်း"]   # school vs reason
  "ကြောင်း": ["ကျောင်း"]
  "ကံ": ["ကန်", "ကင်"]     # luck vs kick vs (rare)
  "ကန်": ["ကံ", "ကင်"]
```
Context disambiguation is handled automatically via N-gram probabilities at the strategy level, so no per-entry disambiguation context is needed in the YAML.
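Because every pair needs a reverse entry, a small consistency check on the parsed config can catch omissions. A hypothetical validator (not part of the library), operating on the dict a YAML parser would produce from the file above:

```python
def validate_homophone_config(data: dict) -> list[str]:
    """Return a list of problems found in a parsed homophones.yaml mapping."""
    problems = []
    homs = data.get("homophones", {})
    for word, variants in homs.items():
        if word in variants:
            problems.append(f"{word}: listed as its own homophone")
        for v in variants:
            if word not in homs.get(v, []):
                problems.append(f"{word} -> {v}: reverse entry missing")
    return problems

ok = {"homophones": {"ကား": ["ကာ"], "ကာ": ["ကား"]}}
bad = {"homophones": {"ကား": ["ကာ"]}}
```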
Structure
| Field | Description |
|---|---|
| homophones | Map of word → list of homophones |
| version | Schema version |
| metadata | Entry count, dates, source notes |
Best Practices
- Enable with context: Homophones need context for accurate detection
- Review suggestions: Homophone detection has moderate confidence
- Add domain-specific pairs: Extend homophones.yaml for your domain
- Use with N-grams: N-gram probabilities improve accuracy
Performance
- Homophone lookup: O(1) hash table
- Context analysis: depends on the N-gram checker
- Memory: minimal (the homophone map is small)
See Also