The suggestion ranking system determines how spelling corrections are scored and ordered. Multiple ranking strategies are available, from simple edit distance to sophisticated multi-factor ranking.
Overview
from myspellchecker.algorithms.ranker import DefaultRanker, SuggestionData
ranker = DefaultRanker()
# Score a suggestion
data = SuggestionData(
    term="မြန်မာ",
    edit_distance=1,
    frequency=10000,
    phonetic_score=0.8,
)
score = ranker.score(data)
print(f"Score: {score}") # Lower is better
SuggestionData
All ranking input is encapsulated in SuggestionData:
from myspellchecker.algorithms.ranker import SuggestionData
data = SuggestionData(
    term="မြန်မာ",  # Suggested correction
    edit_distance=1,  # Damerau-Levenshtein distance
    frequency=10000,  # Corpus frequency
    phonetic_score=0.8,  # Phonetic similarity (0-1)
    syllable_distance=0.5,  # Myanmar syllable-aware distance
    weighted_distance=0.8,  # Myanmar-weighted edit distance
    is_nasal_variant=False,  # True if nasal ending difference
    has_same_nasal_ending=True,  # Same nasal consonant
    source="symspell",  # Origin: symspell, particle_typo, etc.
    confidence=1.0,  # Source-specific confidence
    strategy_score=None,  # Strategy-level score (optional)
    score_breakdown=None,  # Debug info with component scores (optional)
)
Data Fields
| Field | Type | Description |
|---|---|---|
| term | str | The suggested word |
| edit_distance | int | Damerau-Levenshtein distance |
| frequency | int | Word frequency in corpus |
| phonetic_score | float | Phonetic similarity (0.0-1.0) |
| syllable_distance | float | Myanmar syllable-aware distance |
| weighted_distance | float | Myanmar-weighted edit distance using substitution costs |
| is_nasal_variant | bool | Nasal ending variant (န်↔ံ) |
| has_same_nasal_ending | bool | Same nasal consonant ending |
| source | str | Suggestion origin |
| confidence | float | Source confidence (0.0-1.0) |
| strategy_score | float | Strategy-level score for blending (optional) |
| score_breakdown | dict | Debug info with component scores (optional) |
| pos_fit_score | float | POS bigram fit score from context analysis (optional) |
| error_length | int | Character length of original error span for span-length bonus (optional) |
Ranking Strategies
DefaultRanker
Balanced ranking considering multiple factors:
from myspellchecker.algorithms.ranker import DefaultRanker
ranker = DefaultRanker()
Scoring Formula:
score = edit_distance * plausibility
- freq_bonus - phonetic_bonus - nasal_bonus
- same_nasal_bonus - pos_bonus - span_bonus
Bonuses:
| Bonus | Range | Description |
|---|---|---|
| freq_bonus | 0.0-0.8 | Higher frequency reduces score: 0.8 * (1 - 1/(1 + freq/denominator)) |
| phonetic_bonus | 0.0-0.4 | Phonetic similarity bonus (weight=0.4) |
| nasal_bonus | 0.0-0.15 | Nasal variant matching (weight=0.15) |
| same_nasal_bonus | 0.0-0.25 | Same nasal ending (weight=0.25) |
| pos_bonus | 0.0-0.25 | POS bigram fit score from context (weight=0.25) |
| span_bonus | 0.1-1.4 | Length-scaled bonus for error-span matching with tiered scoring |
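To see how the frequency bonus saturates, here is a minimal standalone sketch of the formula from the table. The function name `freq_bonus` and its parameter names are illustrative and not taken from the library's source; the 0.8 cap comes from the table, and the 10000.0 default mirrors the `frequency_denominator` value shown under RankerConfig.

```python
def freq_bonus(frequency: int, denominator: float = 10000.0, cap: float = 0.8) -> float:
    """Saturating bonus: 0 for unseen words, approaching `cap` for very common ones."""
    return cap * (1.0 - 1.0 / (1.0 + frequency / denominator))


# Print the bonus for a few corpus frequencies to show the saturation curve.
for freq in (0, 100, 10_000, 1_000_000):
    print(f"freq={freq:>9,} -> bonus={freq_bonus(freq):.3f}")
```

A word whose frequency equals the denominator receives half of the cap (0.4), and the bonus never reaches 0.8 for any finite frequency, so very common words cannot fully cancel a large edit distance through frequency alone.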
FrequencyFirstRanker
Prioritizes common words over edit distance:
from myspellchecker.algorithms.ranker import FrequencyFirstRanker
ranker = FrequencyFirstRanker()
Scoring Formula:
score = edit_distance * edit_weight - log1p(frequency) * freq_scale
Use Case: Autocomplete-style suggestions where common words are preferred.
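The effect of this formula can be checked by hand. The sketch below re-implements the expression as a plain function, assuming the defaults `edit_weight=0.5` and `freq_scale=0.1` listed under RankerConfig; `frequency_first_score` is a hypothetical helper, not the library's method.

```python
import math


def frequency_first_score(
    edit_distance: int,
    frequency: int,
    edit_weight: float = 0.5,
    freq_scale: float = 0.1,
) -> float:
    """Lower is better: edit distance raises the score, log-frequency lowers it."""
    return edit_distance * edit_weight - math.log1p(frequency) * freq_scale


# A rare word one edit away versus a very common word two edits away.
rare_close = frequency_first_score(edit_distance=1, frequency=10)
common_far = frequency_first_score(edit_distance=2, frequency=1_000_000)
print(f"rare_close={rare_close:.3f}, common_far={common_far:.3f}")
```

With these weights the common word ends up with the lower (better) score even though it is one edit further away, which is the behavior wanted for autocomplete-style suggestions.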
EditDistanceOnlyRanker
Simple ranking by edit distance only:
from myspellchecker.algorithms.ranker import EditDistanceOnlyRanker
ranker = EditDistanceOnlyRanker()
score = ranker.score(data) # Returns edit_distance directly
Use Case: Testing, debugging, or when frequency data is unavailable.
PhoneticFirstRanker
Prioritizes phonetically similar words:
from myspellchecker.algorithms.ranker import PhoneticFirstRanker
ranker = PhoneticFirstRanker()
Scoring Formula:
score = edit_distance * edit_weight - phonetic_score * phonetic_weight
Use Case: Myanmar text with common phonetic confusions (medial swaps).
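As a rough illustration of the trade-off, the sketch below evaluates the same expression with the defaults `phonetic_first_edit_weight=0.3` and `phonetic_first_weight=1.0` listed under RankerConfig. The helper name `phonetic_first_score` is an assumption made for this example only.

```python
def phonetic_first_score(
    edit_distance: int,
    phonetic_score: float,
    edit_weight: float = 0.3,
    phonetic_weight: float = 1.0,
) -> float:
    """Lower is better: phonetic similarity (0.0-1.0) outweighs edit distance."""
    return edit_distance * edit_weight - phonetic_score * phonetic_weight


# One edit away but phonetically distant, versus two edits away but near-homophone.
close_but_different_sound = phonetic_first_score(edit_distance=1, phonetic_score=0.2)
far_but_similar_sound = phonetic_first_score(edit_distance=2, phonetic_score=0.9)
print(far_but_similar_sound < close_but_different_sound)
```

Because the phonetic weight is more than three times the edit weight, a near-homophone two edits away is preferred over a one-edit candidate that sounds different.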
UnifiedRanker
Consolidates suggestions from multiple sources:
from myspellchecker.algorithms.ranker import SuggestionData, UnifiedRanker
ranker = UnifiedRanker()
# Score with source awareness
data = SuggestionData(
    term="ကြောင်း",
    edit_distance=1,
    frequency=5000,
    source="medial_confusion",  # High-priority source
    confidence=0.95,
)
score = ranker.score(data)  # Boosted by source weight
Source Weights:
| Source | Default Weight | Description |
|---|---|---|
| particle_typo | 1.2 | Grammar rule match |
| semantic | 1.15 | Semantic model |
| context | 1.15 | Context-aware re-ranking |
| medial_confusion | 1.1 | Ya-pin/Ya-yit swap |
| symspell | 1.0 | Statistical (baseline) |
| question_structure | 1.0 | Question structure |
| compound | 0.95 | Compound word splitting |
| morphology | 0.9 | Morphological analysis |
| morpheme | 0.85 | Morpheme-level correction |
| pos_sequence | 0.85 | POS sequence |
| medial_swap | 1.0 | Medial swap variants |
Configuration
RankerConfig
from myspellchecker.algorithms.ranker import DefaultRanker
from myspellchecker.core.config import RankerConfig
config = RankerConfig(
    # DefaultRanker parameters
    frequency_denominator=10000.0,
    phonetic_bonus_weight=0.4,
    syllable_bonus_weight=0.3,
    nasal_bonus_weight=0.15,
    same_nasal_bonus_weight=0.25,
    weighted_distance_bonus_weight=0.35,
    # FrequencyFirstRanker parameters
    frequency_first_edit_weight=0.5,
    frequency_first_scale=0.1,
    # PhoneticFirstRanker parameters
    phonetic_first_weight=1.0,
    phonetic_first_edit_weight=0.3,
    # UnifiedRanker source weights
    source_weight_particle_typo=1.2,
    source_weight_medial_confusion=1.1,
    source_weight_semantic=1.15,
    source_weight_symspell=1.0,
    source_weight_morphology=0.9,
    source_weight_compound=0.95,
    source_weight_context=1.15,
    source_weight_question_structure=1.0,
    source_weight_pos_sequence=0.85,
    source_weight_morpheme=0.85,
    source_weight_medial_swap=1.0,
    # Strategy score blending
    strategy_score_weight=0.5,
)
ranker = DefaultRanker(ranker_config=config)
Integration with SymSpell
from myspellchecker.algorithms.symspell import SymSpell
from myspellchecker.algorithms.ranker import FrequencyFirstRanker
# Use custom ranker with SymSpell; `provider` is assumed to be created elsewhere (not shown)
ranker = FrequencyFirstRanker()
symspell = SymSpell(provider, ranker=ranker)
# Suggestions are ranked by the custom ranker
suggestions = symspell.lookup("မျန်မာ", level='word')
UnifiedRanker Features
Deduplication
from myspellchecker.algorithms.ranker import SuggestionData, UnifiedRanker
ranker = UnifiedRanker()
suggestions = [
    SuggestionData(term="ကြောင်း", source="symspell", confidence=0.8),
    SuggestionData(term="ကြောင်း", source="medial_confusion", confidence=0.95),
]
# Keeps the highest-confidence version of each term
ranked = ranker.rank_suggestions(suggestions, deduplicate=True)
# Result: [SuggestionData(term="ကြောင်း", source="medial_confusion")]
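The keep-the-highest-confidence rule can be sketched in isolation. `Candidate` and `dedupe_by_confidence` are hypothetical names used only for illustration; this shows the idea and is not necessarily how `rank_suggestions` is implemented.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for SuggestionData with only the fields needed here."""
    term: str
    source: str
    confidence: float


def dedupe_by_confidence(candidates: list) -> list:
    """Keep one candidate per term, preferring the highest confidence."""
    best = {}
    for cand in candidates:
        kept = best.get(cand.term)
        if kept is None or cand.confidence > kept.confidence:
            best[cand.term] = cand
    # dict preserves the order in which each term was first seen
    return list(best.values())


candidates = [
    Candidate(term="ကြောင်း", source="symspell", confidence=0.8),
    Candidate(term="ကြောင်း", source="medial_confusion", confidence=0.95),
]
deduped = dedupe_by_confidence(candidates)
print([(c.source, c.confidence) for c in deduped])
```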
Batch Ranking
suggestions = [
    SuggestionData(term="word1", ...),
    SuggestionData(term="word2", ...),
    SuggestionData(term="word3", ...),
]
# Rank and sort all suggestions
ranked = ranker.rank_suggestions(suggestions)
# Returns: Sorted list, best first
Nasal Variant Handling
Myanmar has multiple nasal endings that are often confused:
| Ending | Phonetic | Example |
|---|---|---|
| န် | /n/ | ကန် |
| ံ | /n/ (anusvara) | ကံ |
| မ် | /m/ | ကမ် |
| င် | /ŋ/ | ကင် |
# Nasal variants get bonus
data1 = SuggestionData(
    term="ကန်",
    edit_distance=1,
    is_nasal_variant=True,  # True for န် ↔ ံ
)
data2 = SuggestionData(
    term="ကမ်",
    edit_distance=1,
    is_nasal_variant=False,  # Different nasal
)
# data1 gets nasal_bonus, scores lower (better)
Custom Rankers
Implement custom ranking strategy:
from myspellchecker.algorithms.ranker import SuggestionRanker, SuggestionData
from myspellchecker.algorithms.symspell import SymSpell
class CustomRanker(SuggestionRanker):
    @property
    def name(self) -> str:
        return "custom"
    def score(self, data: SuggestionData) -> float:
        # Start from the raw edit distance (lower is better)
        base = float(data.edit_distance)
        # Boost exact syllable structure matches
        if data.syllable_distance == 0:
            base -= 0.5
        # Penalize rare words
        if data.frequency < 100:
            base += 0.3
        return base
# Use custom ranker; `provider` is assumed to be created elsewhere (not shown)
symspell = SymSpell(provider, ranker=CustomRanker())
Neural Reranker
After the primary ranker scores suggestions, an optional neural reranker (MLP) can reorder them based on learned patterns. This is configured via NeuralRerankerConfig.
from myspellchecker.core.config import SpellCheckerConfig, NeuralRerankerConfig
config = SpellCheckerConfig(
    neural_reranker=NeuralRerankerConfig(
        enabled=True,
        model_path="./reranker/model.onnx",
        stats_path="./reranker/stats.json",
        confidence_gap_threshold=0.15,
        max_candidates=20,
    ),
)
The neural reranker:
- Uses an MLP (19→64→1, matching the 19-dimensional feature vector) trained with cross-entropy loss on suggestion quality signals
- Runs ONNX inference to score each candidate
- Skips reranking when the confidence gap between the top two suggestions exceeds confidence_gap_threshold (the top suggestion is already clearly best)
- Caps candidates at max_candidates per error for performance
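The skip rule and the candidate cap can be sketched as two small functions. This is an assumption-laden illustration: it treats confidence as a higher-is-better value already sorted best-first, and the names `should_rerank` and `select_for_rerank` are hypothetical rather than part of the library.

```python
def should_rerank(confidences: list, gap_threshold: float = 0.15) -> bool:
    """Rerank only when the top two candidates are close in confidence.

    `confidences` is assumed to be sorted best-first, higher meaning better.
    """
    if len(confidences) < 2:
        return False  # nothing to reorder
    gap = confidences[0] - confidences[1]
    return gap <= gap_threshold


def select_for_rerank(candidates: list, max_candidates: int = 20) -> list:
    """Cap the number of candidates passed to the model."""
    return candidates[:max_candidates]


print(should_rerank([0.90, 0.50]))  # clear winner, gap exceeds the threshold
print(should_rerank([0.62, 0.58]))  # close call, worth a second opinion
print(len(select_for_rerank(list(range(50)))))
```

Gating on the gap keeps the roughly 50μs inference cost off the common path where the primary ranker is already decisive.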
Feature Vector (19 dimensions)
Each candidate is represented by a 19-dimensional feature vector:
| Index | Feature | Description |
|---|---|---|
| 0 | edit_distance | Raw Damerau-Levenshtein distance |
| 1 | weighted_distance | Myanmar-weighted edit distance |
| 2 | log_frequency | log1p(word_frequency) |
| 3 | phonetic_score | Phonetic similarity [0, 1] |
| 4 | syllable_count_diff | Absolute syllable count difference |
| 5 | plausibility_ratio | weighted_dist / raw_dist |
| 6 | span_length_ratio | len(candidate) / len(error) |
| 7 | mlm_logit | MLM logit score (0 if unavailable) |
| 8 | ngram_left_prob | Left context N-gram probability |
| 9 | ngram_right_prob | Right context N-gram probability |
| 10 | is_confusable | 1.0 if Myanmar confusable variant |
| 11 | relative_log_freq | log_freq / max(log_freq) across candidates |
| 12 | char_length_diff | len(candidate) - len(error), signed |
| 13 | is_substring | 1.0 if candidate contains error or vice versa |
| 14 | original_rank | 1/(1+rank), a prior ranking signal |
| 15 | ngram_improvement_ratio | log(P_cand_ctx / P_error_ctx) |
| 16 | edit_type_subst | 1.0 if primary edit is substitution |
| 17 | edit_type_delete | 1.0 if primary edit is deletion/insertion |
| 18 | char_dice_coeff | Character bigram Dice coefficient |
See Neural Reranker for model types, inference, and training details.
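Several of the features in the table are defined precisely enough to compute by hand. The sketch below derives six of them from an error string and a candidate. It is illustrative only: the function names are hypothetical, lengths are counted in Unicode code points (the library may count differently for Myanmar text), the Dice coefficient is computed over sets of character bigrams, and `rank` is assumed to be zero-based.

```python
import math


def char_bigrams(text: str) -> set:
    """Set of adjacent character pairs in `text`."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice_coefficient(a: str, b: str) -> float:
    """2 * |X ∩ Y| / (|X| + |Y|) over character-bigram sets."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x and not y:
        return 1.0 if a == b else 0.0
    return 2.0 * len(x & y) / (len(x) + len(y))


def simple_features(error: str, candidate: str, frequency: int, rank: int) -> dict:
    """Compute a subset of the feature vector; `error` must be non-empty."""
    return {
        "log_frequency": math.log1p(frequency),
        "span_length_ratio": len(candidate) / len(error),
        "char_length_diff": float(len(candidate) - len(error)),
        "is_substring": 1.0 if (error in candidate or candidate in error) else 0.0,
        "original_rank": 1.0 / (1.0 + rank),
        "char_dice_coeff": dice_coefficient(error, candidate),
    }


features = simple_features(error="nacht", candidate="night", frequency=0, rank=0)
print(features)
```

The remaining features (MLM logits, N-gram probabilities, confusable flags, edit types) depend on models and tables that are not described here, so they are left out of the sketch.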
Integration Flow
The neural reranker runs as the final step in the suggestion pipeline:
- SymSpell generates initial candidates
- N-gram context rescores using left/right probabilities
- Targeted rerank rules apply heuristic promotions/injections
- Neural reranker extracts 19 features, runs ONNX MLP, reorders by score
Training a Reranker
from myspellchecker.training.reranker_data import RerankerDataGenerator
from myspellchecker.training.reranker_trainer import RerankerTrainer
# Step 1: Generate training data from corpus + database
generator = RerankerDataGenerator(
    db_path="mySpellChecker.db",
    arrow_corpus_path="corpus.arrow",
)
generator.generate(num_examples=100_000, output_path="data/reranker_train.jsonl")
# Step 2: Train the MLP
trainer = RerankerTrainer("data/reranker_train.jsonl")
metrics = trainer.train(epochs=20)
# Step 3: Export to ONNX
trainer.export_onnx("models/reranker.onnx")
# Outputs: reranker.onnx + reranker.onnx.stats.json
Performance
| Ranker | Score Time | Notes |
|---|---|---|
| EditDistanceOnly | ~0.1μs | Fastest |
| DefaultRanker | ~1μs | Balanced |
| FrequencyFirst | ~0.5μs | Log calculation |
| PhoneticFirst | ~0.5μs | Simple formula |
| UnifiedRanker | ~2μs | Source lookup + base score |
| NeuralReranker | ~50μs | ONNX MLP inference (optional second pass) |
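The timings in the table are approximate and depend on hardware. One simple way to take a comparable per-call measurement is the standard library's `timeit`. The sketch below times a stand-in scoring function (`stand_in_score` is hypothetical, with the FrequencyFirstRanker formula hard-coded); to measure a real ranker, the lambda would call its `score` method instead.

```python
import math
import timeit


def stand_in_score(edit_distance: int, frequency: int) -> float:
    """Placeholder scorer so the example runs without the library installed."""
    return edit_distance * 0.5 - math.log1p(frequency) * 0.1


calls = 100_000
total_seconds = timeit.timeit(lambda: stand_in_score(1, 10_000), number=calls)
per_call_us = total_seconds / calls * 1e6
print(f"{per_call_us:.3f} µs per call (includes lambda call overhead)")
```

Pure-Python call overhead is included in the result, so numbers obtained this way are best used for relative comparisons between rankers on the same machine.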