The suggestion ranking system determines how spelling corrections are scored and ordered. Multiple ranking strategies are available, from simple edit distance to sophisticated multi-factor ranking.

Overview

from myspellchecker.algorithms.ranker import DefaultRanker, SuggestionData

ranker = DefaultRanker()

# Score a suggestion
data = SuggestionData(
    term="မြန်မာ",
    edit_distance=1,
    frequency=10000,
    phonetic_score=0.8,
)
score = ranker.score(data)
print(f"Score: {score}")  # Lower is better

SuggestionData

All ranking input is encapsulated in SuggestionData:
from myspellchecker.algorithms.ranker import SuggestionData

data = SuggestionData(
    term="မြန်မာ",               # Suggested correction
    edit_distance=1,            # Damerau-Levenshtein distance
    frequency=10000,            # Corpus frequency
    phonetic_score=0.8,         # Phonetic similarity (0-1)
    syllable_distance=0.5,      # Myanmar syllable-aware distance
    weighted_distance=0.8,      # Myanmar-weighted edit distance
    is_nasal_variant=False,     # True if nasal ending difference
    has_same_nasal_ending=True, # Same nasal consonant ending
    source="symspell",          # Origin: symspell, particle_typo, etc.
    confidence=1.0,             # Source-specific confidence
    strategy_score=None,        # Strategy-level score (optional)
    score_breakdown=None,       # Debug info with component scores (optional)
)

Data Fields

| Field | Type | Description |
|---|---|---|
| term | str | The suggested word |
| edit_distance | int | Damerau-Levenshtein distance |
| frequency | int | Word frequency in corpus |
| phonetic_score | float | Phonetic similarity (0.0-1.0) |
| syllable_distance | float | Myanmar syllable-aware distance |
| weighted_distance | float | Myanmar-weighted edit distance using substitution costs |
| is_nasal_variant | bool | Nasal ending variant (န်↔ံ) |
| has_same_nasal_ending | bool | Same nasal consonant ending |
| source | str | Suggestion origin |
| confidence | float | Source confidence (0.0-1.0) |
| strategy_score | float | Strategy-level score for blending (optional) |
| score_breakdown | dict | Debug info with component scores (optional) |
| pos_fit_score | float | POS bigram fit score from context analysis (optional) |
| error_length | int | Character length of original error span for span-length bonus (optional) |

Ranking Strategies

DefaultRanker

Balanced ranking considering multiple factors:
from myspellchecker.algorithms.ranker import DefaultRanker

ranker = DefaultRanker()
Scoring Formula:
score = edit_distance * plausibility
        - freq_bonus - phonetic_bonus - nasal_bonus
        - same_nasal_bonus - pos_bonus - span_bonus
Bonuses:
| Bonus | Range | Description |
|---|---|---|
| freq_bonus | 0.0-0.8 | Higher frequency reduces score: 0.8 * (1 - 1/(1 + freq/denominator)) |
| phonetic_bonus | 0.0-0.4 | Phonetic similarity bonus (weight=0.4) |
| nasal_bonus | 0.0-0.15 | Nasal variant matching (weight=0.15) |
| same_nasal_bonus | 0.0-0.25 | Same nasal ending (weight=0.25) |
| pos_bonus | 0.0-0.25 | POS bigram fit score from context (weight=0.25) |
| span_bonus | 0.1-1.4 | Length-scaled bonus for error-span matching with tiered scoring |
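The formula can be sketched as a plain function. This is illustrative only, not the library's implementation: it applies just the frequency and phonetic bonuses, and the `plausibility` multiplier and 10,000 frequency denominator are assumptions taken from the config defaults shown later in this page.

```python
def default_score(edit_distance, frequency, phonetic_score=0.0,
                  plausibility=1.0, freq_denominator=10000.0):
    # Sketch of the DefaultRanker formula (frequency + phonetic bonuses
    # only); lower scores rank higher.
    freq_bonus = 0.8 * (1 - 1 / (1 + frequency / freq_denominator))
    phonetic_bonus = 0.4 * phonetic_score
    return edit_distance * plausibility - freq_bonus - phonetic_bonus

# A frequent candidate outranks an equally distant rare one
common = default_score(1, 10000, phonetic_score=0.8)  # 1 - 0.4 - 0.32 = 0.28
rare = default_score(1, 50, phonetic_score=0.8)
```

With both candidates at distance 1, the frequency bonus alone separates them; the real ranker layers the nasal, POS, and span bonuses on top of this.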

FrequencyFirstRanker

Prioritizes common words over edit distance:
from myspellchecker.algorithms.ranker import FrequencyFirstRanker

ranker = FrequencyFirstRanker()
Scoring Formula:
score = edit_distance * edit_weight - log1p(frequency) * freq_scale
Use Case: Autocomplete-style suggestions where common words are preferred.
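As a rough sketch (using the documented defaults `edit_weight=0.5` and `freq_scale=0.1`, not the library code), the log-frequency term lets a distance-2 common word outrank a distance-1 rare one:

```python
import math

def frequency_first_score(edit_distance, frequency,
                          edit_weight=0.5, freq_scale=0.1):
    # Sketch of the FrequencyFirstRanker formula; lower is better
    return edit_distance * edit_weight - math.log1p(frequency) * freq_scale

common_d2 = frequency_first_score(2, 100_000)  # ≈ -0.151
rare_d1 = frequency_first_score(1, 10)         # ≈ 0.260
```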

EditDistanceOnlyRanker

Simple ranking by edit distance only:
from myspellchecker.algorithms.ranker import EditDistanceOnlyRanker

ranker = EditDistanceOnlyRanker()
score = ranker.score(data)  # Returns edit_distance directly
Use Case: Testing, debugging, or when frequency data is unavailable.

PhoneticFirstRanker

Prioritizes phonetically similar words:
from myspellchecker.algorithms.ranker import PhoneticFirstRanker

ranker = PhoneticFirstRanker()
Scoring Formula:
score = edit_distance * edit_weight - phonetic_score * phonetic_weight
Use Case: Myanmar text with common phonetic confusions (medial swaps).
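A minimal sketch of the formula, assuming the config defaults `phonetic_first_edit_weight=0.3` and `phonetic_first_weight=1.0` (not the library's implementation):

```python
def phonetic_first_score(edit_distance, phonetic_score,
                         edit_weight=0.3, phonetic_weight=1.0):
    # Sketch of the PhoneticFirstRanker formula; lower is better
    return edit_distance * edit_weight - phonetic_score * phonetic_weight

# A phonetically close distance-2 candidate beats a dissimilar distance-1 one
close = phonetic_first_score(2, 0.9)       # 0.6 - 0.9 = -0.3
dissimilar = phonetic_first_score(1, 0.1)  # 0.3 - 0.1 = 0.2
```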

UnifiedRanker

Consolidates suggestions from multiple sources:
from myspellchecker.algorithms.ranker import UnifiedRanker

ranker = UnifiedRanker()

# Score with source awareness
data = SuggestionData(
    term="ကြောင်း",
    edit_distance=1,
    frequency=5000,
    source="medial_confusion",  # High-priority source
    confidence=0.95,
)
score = ranker.score(data)  # Boosted by source weight
Source Weights:
| Source | Default Weight | Description |
|---|---|---|
| particle_typo | 1.2 | Grammar rule match |
| semantic | 1.15 | Semantic model |
| context | 1.15 | Context-aware re-ranking |
| medial_confusion | 1.1 | Ya-pin/Ya-yit swap |
| symspell | 1.0 | Statistical (baseline) |
| question_structure | 1.0 | Question structure |
| medial_swap | 1.0 | Medial swap variants |
| compound | 0.95 | Compound word splitting |
| morphology | 0.9 | Morphological analysis |
| morpheme | 0.85 | Morpheme-level correction |
| pos_sequence | 0.85 | POS sequence |

Configuration

RankerConfig

from myspellchecker.core.config import RankerConfig

config = RankerConfig(
    # DefaultRanker parameters
    frequency_denominator=10000.0,
    phonetic_bonus_weight=0.4,
    syllable_bonus_weight=0.3,
    nasal_bonus_weight=0.15,
    same_nasal_bonus_weight=0.25,
    weighted_distance_bonus_weight=0.35,

    # FrequencyFirstRanker parameters
    frequency_first_edit_weight=0.5,
    frequency_first_scale=0.1,

    # PhoneticFirstRanker parameters
    phonetic_first_weight=1.0,
    phonetic_first_edit_weight=0.3,

    # UnifiedRanker source weights
    source_weight_particle_typo=1.2,
    source_weight_medial_confusion=1.1,
    source_weight_semantic=1.15,
    source_weight_symspell=1.0,
    source_weight_morphology=0.9,
    source_weight_compound=0.95,
    source_weight_context=1.15,
    source_weight_question_structure=1.0,
    source_weight_pos_sequence=0.85,
    source_weight_morpheme=0.85,
    source_weight_medial_swap=1.0,

    # Strategy score blending
    strategy_score_weight=0.5,
)

ranker = DefaultRanker(ranker_config=config)

Integration with SymSpell

from myspellchecker.algorithms.symspell import SymSpell
from myspellchecker.algorithms.ranker import FrequencyFirstRanker

# Use custom ranker with SymSpell
ranker = FrequencyFirstRanker()
symspell = SymSpell(provider, ranker=ranker)

# Suggestions are ranked by the custom ranker
suggestions = symspell.lookup("မျန်မာ", level='word')

UnifiedRanker Features

Deduplication

ranker = UnifiedRanker()

suggestions = [
    SuggestionData(term="ကြောင်း", source="symspell", confidence=0.8),
    SuggestionData(term="ကြောင်း", source="medial_confusion", confidence=0.95),
]

# Keeps highest-confidence version
ranked = ranker.rank_suggestions(suggestions, deduplicate=True)
# Result: [SuggestionData(term="ကြောင်း", source="medial_confusion")]

Batch Ranking

suggestions = [
    SuggestionData(term="word1", ...),
    SuggestionData(term="word2", ...),
    SuggestionData(term="word3", ...),
]

# Rank and sort all suggestions
ranked = ranker.rank_suggestions(suggestions)
# Returns: Sorted list, best first

Nasal Variant Handling

Myanmar has multiple nasal endings that are often confused:
| Ending | Phonetic | Example |
|---|---|---|
| န် | /n/ | ကန် |
| ံ | /n/ (anusvara) | ကံ |
| မ် | /m/ | ကမ် |
| င် | /ŋ/ | ကင် |
# Nasal variants get bonus
data1 = SuggestionData(
    term="ကန်",
    edit_distance=1,
    is_nasal_variant=True,  # True for န် ↔ ံ
)

data2 = SuggestionData(
    term="ကမ်",
    edit_distance=1,
    is_nasal_variant=False,  # Different nasal
)

# data1 gets nasal_bonus, scores lower (better)

Custom Rankers

Implement a custom ranking strategy by subclassing SuggestionRanker:
from myspellchecker.algorithms.ranker import SuggestionRanker, SuggestionData

class CustomRanker(SuggestionRanker):
    @property
    def name(self) -> str:
        return "custom"

    def score(self, data: SuggestionData) -> float:
        # Custom scoring logic
        base = float(data.edit_distance)

        # Boost exact syllable structure matches
        if data.syllable_distance == 0:
            base -= 0.5

        # Heavy frequency penalty for rare words
        if data.frequency < 100:
            base += 0.3

        return base

# Use custom ranker
symspell = SymSpell(provider, ranker=CustomRanker())

Neural Reranker

After the primary ranker scores suggestions, an optional neural reranker (MLP) can reorder them based on learned patterns. This is configured via NeuralRerankerConfig.
from myspellchecker.core.config import SpellCheckerConfig, NeuralRerankerConfig

config = SpellCheckerConfig(
    neural_reranker=NeuralRerankerConfig(
        enabled=True,
        model_path="./reranker/model.onnx",
        stats_path="./reranker/stats.json",
        confidence_gap_threshold=0.15,
        max_candidates=20,
    ),
)
The neural reranker:
  • Uses an MLP (20→64→1) trained with cross-entropy loss on suggestion quality signals
  • Runs ONNX inference to score each candidate
  • Skips reranking when the confidence gap between top-2 suggestions exceeds confidence_gap_threshold (the top suggestion is already clearly best)
  • Caps candidates at max_candidates per error for performance
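The gating behavior described above can be sketched as follows. This is a simplification under two assumptions: scores arrive sorted best-first (lower is better), and "gap" means the score difference between the top two candidates; the helper names are hypothetical, not the library's API.

```python
def should_neural_rerank(scores, confidence_gap_threshold=0.15):
    # Skip the neural pass when the top suggestion already leads the
    # runner-up by more than the threshold, or when there is no runner-up.
    if len(scores) < 2:
        return False
    return scores[1] - scores[0] <= confidence_gap_threshold

def cap_candidates(candidates, max_candidates=20):
    # Candidates passed on to the reranker are capped for performance
    return candidates[:max_candidates]
```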

Feature Vector (19 dimensions)

Each candidate is represented by a 19-dimensional feature vector:
| Index | Feature | Description |
|---|---|---|
| 0 | edit_distance | Raw Damerau-Levenshtein distance |
| 1 | weighted_distance | Myanmar-weighted edit distance |
| 2 | log_frequency | log1p(word_frequency) |
| 3 | phonetic_score | Phonetic similarity [0, 1] |
| 4 | syllable_count_diff | Absolute syllable count difference |
| 5 | plausibility_ratio | weighted_dist / raw_dist |
| 6 | span_length_ratio | len(candidate) / len(error) |
| 7 | mlm_logit | MLM logit score (0 if unavailable) |
| 8 | ngram_left_prob | Left context N-gram probability |
| 9 | ngram_right_prob | Right context N-gram probability |
| 10 | is_confusable | 1.0 if Myanmar confusable variant |
| 11 | relative_log_freq | log_freq / max(log_freq) across candidates |
| 12 | char_length_diff | len(candidate) - len(error), signed |
| 13 | is_substring | 1.0 if candidate contains error or vice versa |
| 14 | original_rank | 1/(1+rank), a prior ranking signal |
| 15 | ngram_improvement_ratio | log(P_cand_ctx / P_error_ctx) |
| 16 | edit_type_subst | 1.0 if primary edit is substitution |
| 17 | edit_type_delete | 1.0 if primary edit is deletion/insertion |
| 18 | char_dice_coeff | Character bigram Dice coefficient |
See Neural Reranker for model types, inference, and training details.
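Several of these features are direct arithmetic on quantities the ranker already has. The sketch below illustrates indices 0-3, 6, and 14 with a hypothetical helper; it is not the library's extractor, which produces all 19 dimensions.

```python
import math

def partial_features(candidate, error, edit_distance, weighted_distance,
                     frequency, phonetic_score, rank):
    # Illustrative subset of the documented feature vector
    return [
        float(edit_distance),                 # 0: raw Damerau-Levenshtein
        float(weighted_distance),             # 1: Myanmar-weighted distance
        math.log1p(frequency),                # 2: log_frequency
        phonetic_score,                       # 3: phonetic similarity
        len(candidate) / max(len(error), 1),  # 6: span_length_ratio
        1.0 / (1 + rank),                     # 14: original_rank prior
    ]

feats = partial_features("ကြောင်း", "ကြောငး", 1, 0.8, 5000, 0.9, rank=0)
```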

Integration Flow

The neural reranker runs as the final step in the suggestion pipeline:
  1. SymSpell generates initial candidates
  2. N-gram context rescores using left/right probabilities
  3. Targeted rerank rules apply heuristic promotions/injections
  4. Neural reranker extracts 19 features, runs ONNX MLP, reorders by score
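Schematically, the four stages compose as a left-to-right pipeline over an ordered candidate list (stage names below are placeholders, not real API):

```python
def suggestion_pipeline(candidates, stages):
    # Each stage takes an ordered candidate list and returns a
    # (possibly reordered) list; the neural reranker is the last stage.
    for stage in stages:
        candidates = stage(candidates)
    return candidates

# Toy run with a single sorting stage (lower score first)
ranked = suggestion_pipeline(
    [("b", 0.9), ("a", 0.2)],
    stages=[lambda cands: sorted(cands, key=lambda c: c[1])],
)
```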

Training a Reranker

from myspellchecker.training.reranker_data import RerankerDataGenerator
from myspellchecker.training.reranker_trainer import RerankerTrainer

# Step 1: Generate training data from corpus + database
generator = RerankerDataGenerator(
    db_path="mySpellChecker.db",
    arrow_corpus_path="corpus.arrow",
)
generator.generate(num_examples=100_000, output_path="data/reranker_train.jsonl")

# Step 2: Train the MLP
trainer = RerankerTrainer("data/reranker_train.jsonl")
metrics = trainer.train(epochs=20)

# Step 3: Export to ONNX
trainer.export_onnx("models/reranker.onnx")
# Outputs: reranker.onnx + reranker.onnx.stats.json

Performance

| Ranker | Score Time | Notes |
|---|---|---|
| EditDistanceOnly | ~0.1 μs | Fastest |
| DefaultRanker | ~1 μs | Balanced |
| FrequencyFirst | ~0.5 μs | Log calculation |
| PhoneticFirst | ~0.5 μs | Simple formula |
| UnifiedRanker | ~2 μs | Source lookup + base score |
| NeuralReranker | ~50 μs | ONNX MLP inference (optional second pass) |

See Also