Word Validation - mySpellChecker

After syllable-level checks pass, assembled syllable sequences are looked up against the dictionary. Unknown words get correction suggestions generated via the SymSpell algorithm, ranked by edit distance and frequency.

How It Works

Syllable Assembly

After syllable validation, valid syllables are assembled into potential words:

syllables = ["မြန်", "မာ", "နိုင်", "ငံ"]
# Assembled to: ["မြန်မာ", "နိုင်ငံ"]
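
One way to picture the assembly step is a greedy longest-match over the syllable list. The sketch below is an illustration only: `assemble_words`, `toy_dictionary`, and the `max_syllables` window are hypothetical names, and the library's actual assembly strategy may differ.

```python
def assemble_words(syllables, dictionary, max_syllables=4):
    """Greedy longest-match: join as many syllables as form a dictionary word."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest window first and shrink until a dictionary word is found.
        for size in range(min(max_syllables, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + size])
            # A single syllable is always emitted so the scan keeps moving.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


toy_dictionary = {"မြန်မာ", "နိုင်ငံ"}  # hypothetical two-word dictionary
syllables = ["မြန်", "မာ", "နိုင်", "ငံ"]
words = assemble_words(syllables, toy_dictionary)
print(words)
```

Syllables that do not combine into any known word fall through as single-syllable candidates, which the dictionary lookup can then accept or flag.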

Dictionary Lookup

Assembled words are checked against the word dictionary:

"မြန်မာ" → Valid (in dictionary)
"နိုင်ငံ" → Valid (in dictionary)
"xyz" → Invalid (not in dictionary)

Suggestion Generation

For invalid words, SymSpell generates suggestions with an average lookup cost of O(1) with respect to dictionary size:

"နိူင်ငံ" → Suggestions: ["နိုင်ငံ"] (edit distance 1)

SymSpell Algorithm

mySpellChecker uses the Symmetric Delete algorithm for fast suggestions:

Traditional Approach (Slow)

For each dictionary word:
    Calculate edit distance to input
    If distance ≤ max_distance:
        Add to suggestions
# Complexity: O(n * m) where n=dictionary size, m=word length
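
The scan described above can be sketched in plain Python. This is a generic illustration of the brute-force baseline, not code from mySpellChecker; `levenshtein` and `brute_force_suggestions` are hypothetical helper names, and distances are counted over Unicode code points.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over code points."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def brute_force_suggestions(word, dictionary, max_distance=2):
    """Compare the input against every entry; work grows with dictionary size."""
    scored = [(levenshtein(word, entry), entry) for entry in dictionary]
    return sorted((d, e) for d, e in scored if d <= max_distance)


print(brute_force_suggestions("နိူင်ငံ", ["နိုင်ငံ", "မြန်မာ"], max_distance=1))
```

Every lookup touches every dictionary entry, which is the cost the symmetric delete index is designed to avoid.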

SymSpell Approach (Fast)

Pre-compute all delete variants of dictionary words
Store in hash table

For lookup:
    Generate delete variants of input
    Look up in hash table
    Return matches
# Complexity: O(1) average lookup
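
A minimal sketch of the symmetric delete idea follows. It is a teaching example under simplifying assumptions, not the library's `SymSpell` class: the function names are hypothetical, no prefix optimization is applied, and the final step a full implementation needs (verifying each candidate's true edit distance and ranking the survivors) is left out.

```python
from collections import defaultdict
from itertools import combinations


def delete_variants(word: str, max_distance: int) -> set[str]:
    """Return the word plus every string made by deleting up to max_distance characters."""
    variants = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for positions in combinations(range(len(word)), k):
            skip = set(positions)
            variants.add("".join(c for i, c in enumerate(word) if i not in skip))
    return variants


def build_delete_index(dictionary, max_distance=2):
    """Pre-compute: map each delete variant to the dictionary words that produce it."""
    index = defaultdict(set)
    for entry in dictionary:
        for variant in delete_variants(entry, max_distance):
            index[variant].add(entry)
    return index


def lookup_candidates(word, index, max_distance=2):
    """Lookup: hash each delete variant of the input and collect the matches."""
    candidates = set()
    for variant in delete_variants(word, max_distance):
        candidates |= index.get(variant, set())
    return candidates


index = build_delete_index(["နိုင်ငံ", "မြန်မာ"], max_distance=1)
print(lookup_candidates("နိူင်ငံ", index, max_distance=1))
```

The number of hash probes depends on the input length and the maximum edit distance, not on how many words the dictionary holds, which is where the constant-time behaviour comes from.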

Why It’s Fast

Operation	Traditional	SymSpell
Single lookup	O(n × m)	O(1) average
Scaling with dictionary size	Linear	Roughly constant

Configuration

Enable Word Validation

from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider
from myspellchecker.core.constants import ValidationLevel

# Create spell checker
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(provider=provider)

# Word-level validation (which includes syllable validation) is selected per check
text = "မြန်မာနိုင်ငံ"
result = checker.check(text, level=ValidationLevel.WORD)

Suggestion Settings

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

config = SpellCheckerConfig(
    # Maximum suggestions per error
    max_suggestions=10,

    # Maximum edit distance for suggestions
    max_edit_distance=2,

    # Include phonetically similar suggestions
    use_phonetic=True,
)

provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

SymSpell Configuration

from myspellchecker.algorithms.symspell import SymSpell
from myspellchecker.providers import SQLiteProvider

provider = SQLiteProvider("dictionary.db")
symspell = SymSpell(
    provider,
    max_edit_distance=2,  # Max edit distance for suggestions
    prefix_length=10,  # Prefix length for optimization (default: 10)
    count_threshold=1,  # Min frequency threshold
)
symspell.build_index(["syllable", "word"])  # Build the index

Word Error Types

Unknown Word

Word not found in dictionary:

result = checker.check("အသစ်စက်စက်")
# Error: WordError for unrecognized compound

Misspelled Word

Word is close to a valid dictionary entry:

result = checker.check("နိူင်ငံ")  # Typo
# Error: WordError with suggestion "နိုင်ငံ"

Compound Error

Several syllable-level errors combine into an invalid word:

result = checker.check("မယ်နမာ")  # Multiple errors
# Error: WordError with suggestions based on similar compounds

Morphological Synthesis

Before reporting an error, word validation checks whether an out-of-vocabulary (OOV) word is a productive formation from known dictionary words. This suppresses false positives on valid compounds and reduplications.

Reduplication Validation

Myanmar creates valid words through reduplication (repeating syllables for emphasis):

# These OOV words are accepted as valid reduplications:
"ကောင်းကောင်း"  # AA: ကောင်း + ကောင်း ("well", from "good")
"သေသေချာချာ"    # AABB: သေ + သေ + ချာ + ချာ ("carefully")

Supported patterns: AA, AABB, ABAB, RHYME (known pairs). Safeguards: base must be in dictionary, frequency >= 5, POS must be V/ADJ/ADV/N.
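
The pattern check and the frequency safeguard might look roughly like the sketch below. It assumes the word is already split into syllables, treats the base of an AABB or ABAB form as A + B, and leaves out the RHYME pattern and the POS check; `reduplication_pattern` and `is_valid_reduplication` are hypothetical names, not library functions.

```python
def reduplication_pattern(syllables):
    """Return "AA", "AABB", "ABAB", or None for a list of syllables."""
    if len(syllables) == 2 and syllables[0] == syllables[1]:
        return "AA"
    if len(syllables) == 4:
        a, b, c, d = syllables
        if a == b and c == d and a != c:
            return "AABB"
        if a == c and b == d and a != b:
            return "ABAB"
    return None


def is_valid_reduplication(syllables, frequencies, min_base_frequency=5):
    """Accept only when a pattern matches and the base word is frequent enough."""
    pattern = reduplication_pattern(syllables)
    if pattern is None:
        return False
    if pattern == "AA":
        base = syllables[0]
    elif pattern == "AABB":
        base = syllables[0] + syllables[2]  # assumed base: A + B
    else:  # ABAB
        base = syllables[0] + syllables[1]  # assumed base: A + B
    # A base missing from the frequency table counts as frequency 0.
    return frequencies.get(base, 0) >= min_base_frequency


toy_frequencies = {"ကောင်း": 1200, "သေချာ": 300}  # hypothetical counts
print(is_valid_reduplication(["ကောင်း", "ကောင်း"], toy_frequencies))
print(is_valid_reduplication(["သေ", "သေ", "ချာ", "ချာ"], toy_frequencies))
```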

Compound Word Synthesis

Myanmar forms compounds by joining morphemes:

# These OOV words are accepted as valid compounds:
"ကျောင်းသား"    # N+N: ကျောင်း (school) + သား (child) = "student"
"စားသောက်"       # V+V: စား (eat) + သောက် (drink) = "eating and drinking"

Uses dynamic programming to find optimal splits. Allowed patterns: N+N, V+V, N+V, V+N, ADJ+N. Blocked: P+P, P+N, N+P.
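
As a rough illustration of the dynamic-programming split, the sketch below finds the decomposition with the fewest known morphemes and rejects adjacent POS pairs outside the allowed list. It makes several assumptions: "optimal" is taken to mean fewest parts, the lexicon is a plain dict from morpheme to POS tag, and only one split is remembered per prefix, so some valid splits can be missed. `split_compound` is a hypothetical name.

```python
ALLOWED_PATTERNS = {("N", "N"), ("V", "V"), ("N", "V"), ("V", "N"), ("ADJ", "N")}


def split_compound(word, lexicon, max_parts=4):
    """Return the shortest split of word into known morphemes, or None."""
    best = {0: []}  # best[i] = shortest valid split of word[:i] found so far
    for end in range(1, len(word) + 1):
        for start in range(end):
            if start not in best:
                continue
            piece = word[start:end]
            pos = lexicon.get(piece)
            if pos is None:
                continue
            previous = best[start]
            # Adjacent morphemes must form an allowed POS pattern.
            if previous and (lexicon[previous[-1]], pos) not in ALLOWED_PATTERNS:
                continue
            candidate = previous + [piece]
            if len(candidate) > max_parts:
                continue
            if end not in best or len(candidate) < len(best[end]):
                best[end] = candidate
    parts = best.get(len(word))
    # A single known morpheme is a plain word, not a compound.
    return parts if parts and len(parts) >= 2 else None


toy_lexicon = {"ကျောင်း": "N", "သား": "N", "စား": "V", "သောက်": "V"}
print(split_compound("ကျောင်းသား", toy_lexicon))
print(split_compound("စားသောက်", toy_lexicon))
```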

Morpheme-Level Suggestions

When a compound word has a typo in one morpheme, the suggestion engine corrects that specific morpheme instead of suggesting unrelated words:

# Input: "ကျောင်းသာ" (typo: သာ should be သား)
# Morpheme strategy detects: ကျောင်း is valid, သာ is not
# Corrects: သာ → သား via SymSpell
# Suggests: "ကျောင်းသား"
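
The morpheme strategy can be approximated as follows. In this sketch the compound is assumed to be already segmented, and the standard library's `difflib.get_close_matches` stands in for the SymSpell lookup purely to keep the example self-contained; `correct_compound` is a hypothetical helper, not part of the library.

```python
from difflib import get_close_matches


def correct_compound(morphemes, known_morphemes):
    """Replace each unknown morpheme with its closest known one.

    Returns the re-joined word, or None if some morpheme has no close match.
    """
    corrected = []
    for morpheme in morphemes:
        if morpheme in known_morphemes:
            corrected.append(morpheme)  # valid morphemes are kept untouched
            continue
        matches = get_close_matches(morpheme, known_morphemes, n=1, cutoff=0.6)
        if not matches:
            return None
        corrected.append(matches[0])
    return "".join(corrected)


known = ["ကျောင်း", "သား", "စား"]  # hypothetical morpheme list
print(correct_compound(["ကျောင်း", "သာ"], known))
```

Because only the invalid morpheme is searched, the valid part of the compound constrains the suggestion and unrelated whole-word candidates never enter the list.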

Configuration

Enable/disable morphological synthesis in ValidationConfig. Tune algorithm parameters in the dedicated CompoundResolverConfig and ReduplicationConfig:

from myspellchecker.core.config import SpellCheckerConfig, ValidationConfig
from myspellchecker.core.config.algorithm_configs import (
    CompoundResolverConfig,
    ReduplicationConfig,
)

config = SpellCheckerConfig(
    validation=ValidationConfig(
        use_reduplication_validation=True,   # Default: True
        use_compound_synthesis=True,         # Default: True
    ),
    # Algorithm-level tuning for compound resolution
    compound_resolver=CompoundResolverConfig(
        min_morpheme_frequency=10,           # Min frequency per morpheme
        max_parts=4,                         # Max compound parts
    ),
    # Algorithm-level tuning for reduplication detection
    reduplication=ReduplicationConfig(
        min_base_frequency=5,                # Min base word frequency
    ),
)

Suggestion Ranking

The DefaultRanker scores suggestions using a multi-factor formula where lower scores indicate better suggestions:

score = (edit_distance × plausibility) - freq_bonus - phonetic_bonus
        - nasal_bonus - same_nasal_bonus - pos_bonus - span_bonus

The base score starts at the edit distance, optionally scaled by a plausibility multiplier derived from Myanmar-weighted substitution costs (e.g., aspirated pairs and medial confusions get lower costs). Then several bonuses are subtracted:

Factor	Effect	Description
Frequency bonus	Up to configurable ceiling	Asymptotic bonus based on corpus frequency
Phonetic bonus	Configurable weight	Rewards phonetically similar suggestions
Nasal bonus	Fixed weight	Rewards nasal variant matches (န် / ံ)
Same nasal bonus	Fixed weight	Rewards same nasal ending as input
POS fit bonus	Configurable weight	Rewards grammatically fitting suggestions (via POS bigrams)
Span bonus	Length-scaled	Prefers suggestions matching the error span length

All weights are configurable via RankerConfig. Alternative rankers (FrequencyFirstRanker, PhoneticFirstRanker, EditDistanceOnlyRanker) emphasize different factors. See Suggestion Ranking for the full algorithm details.
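
To make the shape of the formula concrete, here is a reduced sketch that keeps only the edit distance, plausibility, frequency, and phonetic terms. The `Candidate` class, the weights, and the particular asymptotic curve used for the frequency bonus are assumptions made for illustration; they are not the values or the implementation of `DefaultRanker`.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    term: str
    edit_distance: int
    frequency: int
    plausibility: float = 1.0          # multiplier on the edit distance
    phonetic_similarity: float = 0.0   # 0.0 (unrelated) to 1.0 (same sound)


def frequency_bonus(frequency, ceiling=0.5, half_point=1000):
    """Asymptotic bonus: grows with frequency but never reaches `ceiling`."""
    return ceiling * frequency / (frequency + half_point)


def score(candidate, phonetic_weight=0.3):
    """Lower is better: weighted distance minus the bonuses."""
    return (
        candidate.edit_distance * candidate.plausibility
        - frequency_bonus(candidate.frequency)
        - phonetic_weight * candidate.phonetic_similarity
    )


def rank(candidates):
    """Sort candidates from best (lowest score) to worst."""
    return sorted(candidates, key=score)


ranked = rank([
    Candidate("နှိုင်ငံ", edit_distance=1, frequency=100),
    Candidate("နိုင်ငံ", edit_distance=1, frequency=50000),
])
print([c.term for c in ranked])
```

With equal edit distances, the capped frequency bonus decides the order, while the ceiling prevents a very common word from outranking a candidate that is a full edit closer.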

Frequency-Based Ranking

# Higher frequency words rank higher
"နိုင်ငံ" (freq: 50000) → Higher rank
"နှိုင်ငံ" (freq: 100) → Lower rank

Edit Distance Ranking

# Lower edit distance ranks higher
"နိုင်ငံ" (distance: 1) → Higher rank
"နိမ်ငံ" (distance: 2) → Lower rank

Performance Characteristics

Metric	Value
Speed	Fast
Lookup Complexity	O(1) average
Suggestion Generation	O(k) where k = candidates

Word validation is fast thanks to SymSpell’s pre-computed delete index. Memory usage scales with dictionary size.
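
The memory cost can be estimated with a little combinatorics: a word of length n contributes at most the sum of C(n, k) for k from 0 to the maximum edit distance to the delete index. The helper below is a hypothetical back-of-the-envelope calculation, not a library function, and it gives an upper bound because duplicate variants collapse into one key.

```python
from math import comb


def max_delete_variants(word_length, max_edit_distance):
    """Upper bound on index keys generated by one word (the word itself included)."""
    upper = min(word_length, max_edit_distance)
    return sum(comb(word_length, k) for k in range(upper + 1))


# A 7-code-point word at increasing maximum edit distances
for distance in (1, 2, 3):
    print(distance, max_delete_variants(7, distance))
```

The growth from one edit distance to the next explains why raising `max_edit_distance` increases both index size and lookup work, and why capping the indexed prefix length helps keep the index bounded for long words.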

API Reference

Using SpellChecker for Word Validation

from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider
from myspellchecker.core.constants import ValidationLevel

provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(provider=provider)

# Validate text at word level
result = checker.check("မြန်မာနိုင်ငံ", level=ValidationLevel.WORD)

# Check for word-level errors
for error in result.errors:
    print(f"Error: {error.text}")
    print(f"Suggestions: {error.suggestions}")
    print(f"Confidence: {error.confidence}")

# Check if text is valid
print(f"Has errors: {result.has_errors}")

Note: Direct instantiation of WordValidator requires a DI container setup. For most use cases, use SpellChecker.check() with level=ValidationLevel.WORD.

SymSpell Interface

from myspellchecker.algorithms.symspell import SymSpell
from myspellchecker.providers import SQLiteProvider

# Initialize with provider
provider = SQLiteProvider("dictionary.db")
symspell = SymSpell(provider, max_edit_distance=2)

# Build index before lookups
symspell.build_index(["word"])

# Lookup word
suggestions = symspell.lookup("နိူင်ငံ", level="word", max_suggestions=5)
for suggestion in suggestions:
    print(f"{suggestion.term}: distance={suggestion.edit_distance}, freq={suggestion.frequency}")

# Compound lookup (for multi-word)
results = symspell.lookup_compound("နိူင် ငံ", max_edit_distance=2)

Common Patterns

Custom Word List

from myspellchecker import SpellChecker
from myspellchecker.providers import MemoryProvider

# Use MemoryProvider to add custom words
provider = MemoryProvider()

# Add domain-specific words
custom_words = ["အိုင်တီ", "ဆော့ဖ်ဝဲ", "ဒေတာဘေ့စ်"]
for word in custom_words:
    provider.add_word(word, frequency=1000)

# Create checker with custom provider
checker = SpellChecker(provider=provider)

Ignore Unknown Words

def check_with_ignore_list(text: str, ignore_words: set[str]) -> list:
    """Check text, ignoring specified words."""
    # Relies on a module-level `checker` (a SpellChecker instance) created beforehand
    result = checker.check(text)

    return [
        error for error in result.errors
        if error.text not in ignore_words
    ]

# Usage
ignore = {"အိုင်တီ", "API", "HTTP"}
errors = check_with_ignore_list("API ကို သုံး", ignore)

Get Top Suggestions Only

def get_best_suggestion(word: str) -> str | None:
    """Get the single best suggestion for a word."""
    # Relies on a module-level `checker` (a SpellChecker instance) created beforehand
    result = checker.check(word)

    if result.has_errors and result.errors[0].suggestions:
        return result.errors[0].suggestions[0]
    return None

Troubleshooting

Issue: Valid words marked as errors

Cause: Word not in dictionary

Solution: Add to dictionary:

myspellchecker build --input new_words.txt --output dictionary.db --incremental

Issue: Poor suggestions

Cause: Low corpus frequency or missing similar words

Solution: Improve corpus quality or adjust settings:

from myspellchecker.core.config import SpellCheckerConfig

config = SpellCheckerConfig(
    max_edit_distance=3,  # Allow more distance
    use_phonetic=True,  # Enable phonetic matching
)

Issue: Slow suggestion generation

Cause: Large edit distance or dictionary

Solution: Reduce max_edit_distance:

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

config = SpellCheckerConfig(max_edit_distance=1)  # Faster
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

Next Steps

Context Checking - Detect real-word errors
SymSpell Algorithm - Deep dive into SymSpell
Performance Tuning - Optimization strategies
