In Myanmar text, many common errors involve swapping visually or phonetically similar words that are both valid dictionary entries. Because each word is spelled correctly on its own, a dictionary lookup cannot catch the mistake. The homophone checker uses N-gram context probabilities to flag these “real-word errors” and suggest the contextually correct alternative.
Overview
Myanmar homophones often arise from:
| Confusion Type | Example | Description |
|---|---|---|
| Medials | ျ vs ြ | Ya-pin vs Ya-yit |
| Finals | န် vs ံ vs မ် | Na-that vs Thay-thay-tin vs Ma-that |
| Vowels | ိ vs ည် | Similar sounds in context |
| Tone marks | ကား vs ကာ | Different meanings |
HomophoneChecker
Constructor
HomophoneChecker(
    config_path=None,    # Path to grammar rules config
    homophone_map=None,  # Optional override map (dict of word → set of homophones)
    provider=None,       # DictionaryProvider for DB confusable_pairs lookup
)
Basic Usage
from myspellchecker.core.homophones import HomophoneChecker
checker = HomophoneChecker()
# Get homophones for a word
homophones = checker.get_homophones("ကား")
print(homophones) # ["ကာ"]
# Check if word has homophones (use get_homophones)
has_homophones = len(checker.get_homophones("ကား")) > 0
print(has_homophones) # True
Common Homophone Pairs
| Word 1 | Word 2 | Meanings |
|---|---|---|
| ကား | ကာ | car vs to protect |
| ကျောင်း | ကြောင်း | school vs reason |
| ကျွန် | ကြွန် | servant vs (medial confusion) |
Custom Homophone Map
# Use custom homophone map
custom_map = {
    "ကား": ["ကာ"],
    "ကာ": ["ကား"],
    "ကျောင်း": ["ကြောင်း"],
    "ကြောင်း": ["ကျောင်း"],
}
checker = HomophoneChecker(homophone_map=custom_map)
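The custom map above lists every pair in both directions. The source does not say whether the checker mirrors entries automatically, so the following is a minimal sketch that assumes it does not: it expands a one-way map into a symmetric one before it is passed in. `make_symmetric` is a hypothetical helper name, not part of the library.

```python
from collections import defaultdict


def make_symmetric(pairs: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return a copy of `pairs` in which every listed homophone also maps back."""
    merged: dict[str, set[str]] = defaultdict(set)
    for word, homophones in pairs.items():
        for other in homophones:
            if other == word:
                continue  # a word is not its own homophone
            merged[word].add(other)
            merged[other].add(word)
    # Sort for deterministic output; lists match the format used above.
    return {word: sorted(others) for word, others in merged.items()}


# One direction per pair; the helper fills in the reverse entries.
one_way = {"ကား": ["ကာ"], "ကျောင်း": ["ကြောင်း"]}
custom_map = make_symmetric(one_way)
```

The resulting `custom_map` has four keys, one per word, and can be passed as `homophone_map` in the same way as the hand-written map.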
With Provider (DB Confusable Pairs)
from myspellchecker.providers import SQLiteProvider
# Merge YAML homophones with DB confusable_pairs table
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = HomophoneChecker(provider=provider)
The provider parameter enables DB-driven confusable pair lookup via get_confusable_pairs(). The DB source provides ~21K pairs (aspiration, medial, nasal, tone swaps) and is the primary source. The YAML source is a curated fallback for pairs that corpus mining cannot discover.
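How the two sources are combined internally is not documented here. As an illustration only, the sketch below assumes a plain union of two word-to-homophones maps, so that curated YAML pairs extend whatever the database already supplies. `merge_sources` and the sample data are hypothetical.

```python
def merge_sources(
    db_pairs: dict[str, set[str]],
    yaml_pairs: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Union two word -> homophones maps without mutating either input."""
    merged = {word: set(others) for word, others in db_pairs.items()}
    for word, others in yaml_pairs.items():
        merged.setdefault(word, set()).update(others)
    return merged


# Toy data standing in for the two sources (not real table contents).
db_pairs = {"ကား": {"ကာ"}}
yaml_pairs = {"ကား": {"ကာ"}, "ကံ": {"ကန်", "ကင်"}}
merged = merge_sources(db_pairs, yaml_pairs)
```

Copying each set before merging keeps the inputs unchanged, which matters if the same maps are reused to build more than one checker.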
Load from Config
# Load from grammar config
checker = HomophoneChecker(config_path="/path/to/config")
# Or use default config (loads from rules/homophones.yaml)
checker = HomophoneChecker()
Homophone Validation Strategy
The HomophoneValidationStrategy uses context to detect homophone errors:
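from myspellchecker.core.validation_strategies.homophone_strategy import (
    HomophoneValidationStrategy
)
# `checker` is the HomophoneChecker created earlier; `ngram_provider` and
# `context_checker` are assumed to have been constructed elsewhere.
strategy = HomophoneValidationStrategy(
    homophone_checker=checker,
    provider=ngram_provider,
    context_checker=context_checker,  # NgramContextChecker instance
    confidence=0.80,
)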
Configuration Parameters
| Parameter | Default | Description |
|---|---|---|
| homophone_checker | Required | HomophoneChecker instance for homophone lookups. If None, strategy is disabled. |
| provider | Required | DictionaryProvider for word frequency lookups |
| context_checker | None | NgramContextChecker that performs N-gram comparison via check_word_in_context() |
| confidence | 0.8 | Confidence score assigned to homophone errors |
Improvement ratios and probability thresholds are managed internally by NgramContextChecker.compute_required_ratio(), not passed directly to the strategy constructor.
How It Works
- For each word, check if it has homophones
- Analyze surrounding context (N-gram probabilities)
- If a homophone has higher probability in context, flag as error
- Suggest the contextually appropriate homophone
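The steps above can be sketched in a few lines. This is a simplified illustration, not the library's implementation: it scores only the bigram formed with the following word, uses made-up probabilities, and skips the improvement ratio that NgramContextChecker applies internally. `suggest_homophone`, `toy_score`, and `BIGRAMS` are hypothetical names.

```python
from typing import Callable, Optional


def suggest_homophone(
    words: list[str],
    index: int,
    homophones: dict[str, list[str]],
    score: Callable[[str, str], float],
) -> Optional[str]:
    """Return a homophone that fits the context better than words[index], if any."""
    current = words[index]
    candidates = homophones.get(current, [])
    if not candidates or index + 1 >= len(words):
        return None  # nothing to compare, or no right-hand context
    next_word = words[index + 1]
    best, best_prob = current, score(current, next_word)
    for candidate in candidates:
        prob = score(candidate, next_word)
        if prob > best_prob:
            best, best_prob = candidate, prob
    return best if best != current else None


# Toy bigram table with made-up probabilities, for illustration only.
BIGRAMS = {("ကား", "သွား"): 0.02, ("ကာ", "သွား"): 0.0005}


def toy_score(word: str, next_word: str) -> float:
    return BIGRAMS.get((word, next_word), 0.0)


suggestion = suggest_homophone(
    ["ကာ", "သွား", "တယ်"], 0, {"ကာ": ["ကား"]}, toy_score
)
print(suggestion)
```

With this toy table the first word is replaced by its homophone, because the alternative bigram scores higher. When the current word already has the highest score, the function returns `None` and nothing is flagged.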
Minimum Probability Threshold
The NgramContextChecker applies a minimum probability threshold internally to prevent false positives from infrequent N-gram occurrences:
# When current word has zero probability (unseen n-gram):
# - Without threshold: ANY positive probability triggers suggestion
# - With threshold: Only probabilities above the minimum trigger suggestion
# For example, with a threshold of 0.001:
# Homophone with prob 0.01 → suggested (above threshold)
# Homophone with prob 0.0001 → NOT suggested (below threshold)
This prevents false suggestions when a homophone appears rarely in the training data.
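The rule described above amounts to a guard placed in front of the probability comparison. The sketch below assumes a strict "above the minimum" test and reuses the illustrative threshold of 0.001 from the example; the value actually used by the library is internal and may differ. `should_suggest` is a hypothetical name.

```python
def should_suggest(
    current_prob: float,
    candidate_prob: float,
    min_prob: float = 0.001,
) -> bool:
    """Decide whether a candidate homophone is worth suggesting."""
    if candidate_prob <= min_prob:
        return False  # too rare in the training data to trust
    return candidate_prob > current_prob


print(should_suggest(0.0, 0.01))    # True
print(should_suggest(0.0, 0.0001))  # False
```

Without the guard, the second call would return `True`, since any positive probability beats an unseen N-gram with probability zero.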
Example Detection
from myspellchecker.core.validation_strategies.base import ValidationContext
# "ကား သွား" (went by car) vs "ကာ သွား" (shield went — nonsensical)
# Context suggests "ကား" (car) is the correct word here
words = ["ကာ", "သွား", "တယ်"]
context = ValidationContext(
    sentence="ကာ သွား တယ်",
    words=words,
    word_positions=[0, 3, 8]  # Unicode code point offsets
)
errors = strategy.validate(context)
# May suggest "ကား" instead of "ကာ" based on context
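The `word_positions` values in the example are code point offsets, which are easy to get wrong by hand for Myanmar text because one visible syllable spans several code points. A small helper can derive them instead, assuming the words are joined by a single separator. `word_offsets` is a hypothetical name, not a library function.

```python
def word_offsets(words: list[str], separator: str = " ") -> list[int]:
    """Code point offset of each word in separator.join(words)."""
    offsets = []
    cursor = 0
    for word in words:
        offsets.append(cursor)
        # len() on a Python str counts code points, not bytes or grapheme clusters.
        cursor += len(word) + len(separator)
    return offsets


words = ["ကာ", "သွား", "တယ်"]
sentence = " ".join(words)
positions = word_offsets(words)
# Each offset should point at the start of its word in the sentence.
assert all(sentence.startswith(w, p) for w, p in zip(words, positions))
```

For the example sentence this reproduces the offsets used above, since the three words are 2, 4, and 3 code points long.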
Integration with SpellChecker
Homophone checking is automatically enabled with context validation:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider
config = SpellCheckerConfig(
    use_context_checker=True  # Enables homophone detection
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)
result = checker.check("ကာ သွား တယ်")
# Homophone errors have type "homophone_error"
for error in result.errors:
    # Skip errors without suggestions so that indexing [0] cannot fail
    if error.error_type == "homophone_error" and error.suggestions:
        print(f"{error.text} → {error.suggestions[0]}")
Homophones YAML Configuration
Homophones are defined in rules/homophones.yaml:
version: "1.1.0"
category: "homophones"
homophones:
  # Each entry maps a word to its homophones (simple list format)
  "ကား": ["ကာ"]  # car vs protect/shield
  "ကာ": ["ကား"]
  "ကျောင်း": ["ကြောင်း"]  # school vs reason
  "ကြောင်း": ["ကျောင်း"]
  "ကံ": ["ကန်", "ကင်"]  # luck vs kick vs (rare)
  "ကန်": ["ကံ", "ကင်"]
Context disambiguation is handled automatically via N-gram probabilities at the strategy level, so no per-entry disambiguation context is needed in the YAML.
Structure
| Field | Description |
|---|---|
| homophones | Map of word → list of homophones |
| version | Schema version |
| metadata | Entry count, dates, source notes |
Best Practices
- Enable with context: Homophones need context for accurate detection
- Review suggestions: Homophone errors carry a default confidence of 0.8, so treat each suggestion as a candidate to review, not a certain fix
- Add domain-specific pairs: Extend homophones.yaml for your domain
- Use with N-grams: N-gram probabilities improve accuracy
Performance
- Homophone lookup: O(1) hash table
- Context analysis: Depends on N-gram checker
- Memory: Minimal (homophone map is small)