In Myanmar, many common errors involve swapping visually or phonetically similar words that are both valid dictionary entries. The homophone checker uses N-gram context probabilities to flag these “real-word errors” and suggest the contextually correct alternative.
Overview
Myanmar homophones often arise from:
| Confusion Type | Example | Description |
|---|---|---|
| Medials | ျ vs ြ | Ya-pin vs Ya-yit |
| Finals | န် vs ံ vs မ် | Na-that vs Thay-thay-tin vs Ma-that |
| Vowels | ိ vs ည် | Similar sounds in context |
| Tone marks | ကား vs ကာ | Different meanings |
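The confusion types above can be modeled as character (or cluster) substitutions. As an illustrative sketch, candidate confusables can be generated by applying one swap anywhere in a word; the swap table below is hypothetical, not the library's actual data:

```python
# Illustrative swap table: visually/phonetically similar Myanmar characters.
SWAPS = {
    "ျ": "ြ",   # ya-pin → ya-yit
    "ြ": "ျ",   # ya-yit → ya-pin
    "န်": "ံ",   # na-that → thay-thay-tin
    "ံ": "န်",   # thay-thay-tin → na-that
}

def confusable_variants(word: str) -> set[str]:
    """Return words produced by applying exactly one swap anywhere in `word`."""
    variants = set()
    for src, dst in SWAPS.items():
        start = 0
        while (i := word.find(src, start)) != -1:
            variants.add(word[:i] + dst + word[i + len(src):])
            start = i + 1
    return variants

print(confusable_variants("ကန်"))  # {'ကံ'}
```

In practice the library's confusable pairs come from its YAML rules and the DB `confusable_pairs` table rather than generative rules like this.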
HomophoneChecker
Constructor
```python
HomophoneChecker(
    config_path=None,    # Path to grammar rules config
    homophone_map=None,  # Optional override map (dict of word → set of homophones)
    provider=None,       # DictionaryProvider for DB confusable_pairs lookup
)
```
Basic Usage
```python
from myspellchecker.core.homophones import HomophoneChecker

checker = HomophoneChecker()

# Get homophones for a word
homophones = checker.get_homophones("ကား")
print(homophones)  # ["ကာ"]

# Check if a word has homophones (via get_homophones)
has_homophones = len(checker.get_homophones("ကား")) > 0
print(has_homophones)  # True
```
Common Homophone Pairs
| Word 1 | Word 2 | Meanings |
|---|---|---|
| ကား | ကာ | car vs to protect |
| ကျောင်း | ကြောင်း | school vs reason |
| ကျွန် | ကြွန် | servant vs (medial confusion) |
Custom Homophone Map
```python
# Use a custom homophone map
custom_map = {
    "ကား": ["ကာ"],
    "ကာ": ["ကား"],
    "ကျောင်း": ["ကြောင်း"],
    "ကြောင်း": ["ကျောင်း"],
}
checker = HomophoneChecker(homophone_map=custom_map)
```
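Homophone relations are symmetric, so every pair needs a reverse entry. A small helper (hypothetical, not part of the library) can build the reverse entries automatically so only forward pairs need to be written by hand:

```python
def symmetrize(pairs: dict[str, list[str]]) -> dict[str, list[str]]:
    """Add the reverse entry for every homophone pair."""
    out: dict[str, set[str]] = {}
    for word, homs in pairs.items():
        for h in homs:
            out.setdefault(word, set()).add(h)
            out.setdefault(h, set()).add(word)
    return {w: sorted(hs) for w, hs in out.items()}

# Only forward entries are written; reverse entries are derived.
custom_map = symmetrize({"ကား": ["ကာ"], "ကျောင်း": ["ကြောင်း"]})
```

The result can then be passed as `homophone_map=custom_map` as shown above.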
With Provider (DB Confusable Pairs)
```python
from myspellchecker.providers import SQLiteProvider

# Merge YAML homophones with the DB confusable_pairs table
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = HomophoneChecker(provider=provider)
```
The provider parameter enables DB-driven confusable pair lookup via get_confusable_pairs(). The DB source provides ~21K pairs (aspiration, medial, nasal, tone swaps) and is the primary source. The YAML source is a curated fallback for pairs that corpus mining cannot discover.
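As a sketch of that merge (the function below is illustrative; the real merging lives inside HomophoneChecker), the DB pairs form the base table and the curated YAML entries fill the gaps:

```python
def merge_sources(db_pairs: list[tuple[str, str]],
                  yaml_map: dict[str, list[str]]) -> dict[str, set[str]]:
    """Merge DB confusable pairs (primary) with curated YAML entries (fallback)."""
    merged: dict[str, set[str]] = {}
    for a, b in db_pairs:                # each DB pair is symmetric
        merged.setdefault(a, set()).add(b)
        merged.setdefault(b, set()).add(a)
    for word, homs in yaml_map.items():  # YAML covers pairs mining missed
        merged.setdefault(word, set()).update(homs)
    return merged

table = merge_sources([("ကံ", "ကန်")], {"ကား": ["ကာ"]})
```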
Load from Config
```python
# Load from a grammar config directory
checker = HomophoneChecker(config_path="/path/to/config")

# Or use the default config (loads from rules/homophones.yaml)
checker = HomophoneChecker()
```
Homophone Validation Strategy
The HomophoneValidationStrategy uses context to detect homophone errors:
```python
from myspellchecker.core.validation_strategies.homophone_strategy import (
    HomophoneValidationStrategy,
)

strategy = HomophoneValidationStrategy(
    homophone_checker=checker,
    provider=ngram_provider,
    context_checker=context_checker,  # NgramContextChecker instance
    confidence=0.80,
)
```
Configuration Parameters
| Parameter | Default | Description |
|---|---|---|
| homophone_checker | Required | HomophoneChecker instance for homophone lookups. If None, the strategy is disabled. |
| provider | Required | DictionaryProvider for word frequency lookups |
| context_checker | None | NgramContextChecker that performs N-gram comparison via check_word_in_context() |
| confidence | 0.8 | Confidence score assigned to homophone errors |
Improvement ratios and probability thresholds are managed internally by NgramContextChecker.compute_required_ratio(), not passed directly to the strategy constructor.
How It Works
- For each word, check if it has homophones
- Analyze surrounding context (N-gram probabilities)
- If a homophone has higher probability in context, flag as error
- Suggest the contextually appropriate homophone
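The steps above can be sketched as a single decision function. The `required_ratio` value here is illustrative; in the library it comes from `NgramContextChecker.compute_required_ratio()`:

```python
def pick_better_homophone(word, homophones, context_prob, required_ratio=5.0):
    """Return the homophone that beats `word` in context, or None.

    `context_prob` maps a word to its N-gram probability in the current context.
    """
    p_word = context_prob(word)
    best, best_p = None, p_word
    for h in homophones:
        p = context_prob(h)
        # Flag only when the homophone clearly improves on the current word.
        if p > best_p and (p_word == 0.0 or p / p_word >= required_ratio):
            best, best_p = h, p
    return best

probs = {"ကာ": 0.0001, "ကား": 0.01}
suggestion = pick_better_homophone("ကာ", ["ကား"], lambda w: probs.get(w, 0.0))
```

Note that when the current word has zero probability, any positive-probability homophone wins; the minimum probability threshold described below exists to rein in exactly that case.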
Minimum Probability Threshold
The NgramContextChecker applies a minimum probability threshold internally to prevent false positives from infrequent N-gram occurrences:
When the current word has zero probability (an unseen N-gram):

- Without a threshold, ANY positive probability triggers a suggestion.
- With a threshold, only probabilities above the minimum trigger a suggestion.

For example, with a threshold of 0.001, a homophone with probability 0.01 is suggested (above threshold), while one with probability 0.0001 is not (below threshold).
This prevents false suggestions when a homophone appears rarely in the training data.
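A minimal sketch of that floor, using the illustrative 0.001 threshold from the example above (the real value is internal to NgramContextChecker):

```python
MIN_PROB = 0.001  # illustrative floor, matching the example above

def should_suggest(p_current: float, p_homophone: float,
                   min_prob: float = MIN_PROB) -> bool:
    """Suggest only when the homophone's probability clears the floor."""
    if p_homophone < min_prob:      # too rare in training data: stay silent
        return False
    return p_homophone > p_current  # otherwise require an actual improvement

print(should_suggest(0.0, 0.01))    # True: above threshold
print(should_suggest(0.0, 0.0001))  # False: below threshold
```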
Example Detection
```python
from myspellchecker.core.validation_strategies.base import ValidationContext

# "ကား သွား" (went by car) vs "ကာ သွား" (shield went, nonsensical)
# Context suggests "ကား" (car) is the correct word here
words = ["ကာ", "သွား", "တယ်"]
context = ValidationContext(
    sentence="ကာ သွား တယ်",
    words=words,
    word_positions=[0, 3, 8],  # Unicode code point offsets
)
errors = strategy.validate(context)
# May suggest "ကား" instead of "ကာ" based on context
```
Integration with SpellChecker
Homophone checking is automatically enabled with context validation:
```python
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

config = SpellCheckerConfig(
    use_context_checker=True,  # Enables homophone detection
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

result = checker.check("ကာ သွား တယ်")

# Homophone errors have type "homophone_error"
for error in result.errors:
    if error.error_type == "homophone_error":
        print(f"{error.text} → {error.suggestions[0]}")
```
Homophones YAML Configuration
Homophones are defined in rules/homophones.yaml:
```yaml
version: "1.1.0"
category: "homophones"
homophones:
  # Each entry maps a word to its homophones (simple list format)
  "ကား": ["ကာ"]            # car vs protect/shield
  "ကာ": ["ကား"]
  "ကျောင်း": ["ကြောင်း"]   # school vs reason
  "ကြောင်း": ["ကျောင်း"]
  "ကံ": ["ကန်", "ကင်"]     # luck vs kick vs (rare)
  "ကန်": ["ကံ", "ကင်"]
```
Context disambiguation is handled automatically via N-gram probabilities at the strategy level, so no per-entry disambiguation context is needed in the YAML.
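Because every pair needs a reverse entry, a small consistency check on the parsed config can catch omissions. A hypothetical validator (not part of the library), operating on the dict a YAML parser would produce from the file above:

```python
def validate_homophone_config(data: dict) -> list[str]:
    """Return a list of problems found in a parsed homophones.yaml mapping."""
    problems = []
    homs = data.get("homophones", {})
    for word, variants in homs.items():
        if word in variants:
            problems.append(f"{word}: listed as its own homophone")
        for v in variants:
            if word not in homs.get(v, []):
                problems.append(f"{word} -> {v}: reverse entry missing")
    return problems

ok = {"homophones": {"ကား": ["ကာ"], "ကာ": ["ကား"]}}
bad = {"homophones": {"ကား": ["ကာ"]}}
```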
Structure
| Field | Description |
|---|---|
| homophones | Map of word → list of homophones |
| version | Schema version |
| metadata | Entry count, dates, source notes |
Best Practices
- Enable with context: Homophones need context for accurate detection
- Review suggestions: Homophone detection has moderate confidence
- Add domain-specific pairs: Extend homophones.yaml for your domain
- Use with N-grams: N-gram probabilities improve accuracy
Performance
- Homophone lookup: O(1) hash table
- Context analysis: depends on the N-gram checker
- Memory: minimal (the homophone map is small)
See Also