Suggestion Ranking - mySpellChecker

The suggestion ranking system determines how spelling corrections are scored and ordered. Multiple ranking strategies are available, from simple edit distance to sophisticated multi-factor ranking.

Overview

from myspellchecker.algorithms.ranker import DefaultRanker, SuggestionData

ranker = DefaultRanker()

# Score a suggestion
data = SuggestionData(
    term="မြန်မာ",
    edit_distance=1,
    frequency=10000,
    phonetic_score=0.8,
)
score = ranker.score(data)
print(f"Score: {score}")  # Lower is better

SuggestionData

All ranking input is encapsulated in SuggestionData:

from myspellchecker.algorithms.ranker import SuggestionData

data = SuggestionData(
    term="မြန်မာ",              # Suggested correction
    edit_distance=1,           # Damerau-Levenshtein distance
    frequency=10000,           # Corpus frequency
    phonetic_score=0.8,        # Phonetic similarity (0-1)
    syllable_distance=0.5,     # Myanmar syllable-aware distance
    weighted_distance=0.8,     # Myanmar-weighted edit distance
    is_nasal_variant=False,    # True if nasal ending difference
    has_same_nasal_ending=True,  # Same nasal consonant
    source="symspell",         # Origin: symspell, particle_typo, etc.
    confidence=1.0,            # Source-specific confidence
    strategy_score=None,       # Strategy-level score (optional)
    score_breakdown=None,      # Debug info with component scores (optional)
)

Data Fields

Field	Type	Description
`term`	str	The suggested word
`edit_distance`	int	Damerau-Levenshtein distance
`frequency`	int	Word frequency in corpus
`phonetic_score`	float	Phonetic similarity (0.0-1.0)
`syllable_distance`	float	Myanmar syllable-aware distance
`weighted_distance`	float	Myanmar-weighted edit distance using substitution costs
`is_nasal_variant`	bool	Nasal ending variant (န်↔ံ)
`has_same_nasal_ending`	bool	Same nasal consonant ending
`source`	str	Suggestion origin
`confidence`	float	Source confidence (0.0-1.0)
`strategy_score`	float	Strategy-level score for blending (optional)
`score_breakdown`	dict	Debug info with component scores (optional)
`pos_fit_score`	float	POS bigram fit score from context analysis (optional)
`error_length`	int	Character length of original error span for span-length bonus (optional)

Ranking Strategies

DefaultRanker

Balanced ranking considering multiple factors:

from myspellchecker.algorithms.ranker import DefaultRanker

ranker = DefaultRanker()

Scoring Formula:

score = edit_distance * plausibility
        - freq_bonus - phonetic_bonus - nasal_bonus
        - same_nasal_bonus - pos_bonus - span_bonus

Bonuses:

Bonus	Range	Description
`freq_bonus`	0.0-0.8	Higher frequency reduces score: `0.8 * (1 - 1/(1 + freq/denominator))`
`phonetic_bonus`	0.0-0.4	Phonetic similarity bonus (weight=0.4)
`nasal_bonus`	0.0-0.15	Nasal variant matching (weight=0.15)
`same_nasal_bonus`	0.0-0.25	Same nasal ending (weight=0.25)
`pos_bonus`	0.0-0.25	POS bigram fit score from context (weight=0.25)
`span_bonus`	0.1-1.4	Length-scaled bonus for error-span matching with tiered scoring
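The frequency bonus formula in the table can be tried out in isolation. The following is a minimal sketch, not the library's implementation: `freq_bonus` is a hypothetical standalone function, and the default denominator of 10000.0 is taken from the `frequency_denominator` value shown in the RankerConfig example on this page.

```python
def freq_bonus(frequency: int, denominator: float = 10000.0,
               max_bonus: float = 0.8) -> float:
    """Saturating bonus: 0 at frequency 0, approaching max_bonus as frequency grows.

    Hypothetical helper that mirrors the documented formula
    0.8 * (1 - 1/(1 + freq/denominator)).
    """
    return max_bonus * (1.0 - 1.0 / (1.0 + frequency / denominator))


for freq in (0, 10_000, 90_000):
    print(freq, round(freq_bonus(freq), 3))
# 0 0.0
# 10000 0.4
# 90000 0.72
```

Under this formula the bonus reaches half of its maximum when the frequency equals the denominator, so very common words cannot outweigh more than 0.8 of edit distance.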

FrequencyFirstRanker

Prioritizes common words over edit distance:

from myspellchecker.algorithms.ranker import FrequencyFirstRanker

ranker = FrequencyFirstRanker()

Scoring Formula:

score = edit_distance * edit_weight - log1p(frequency) * freq_scale

Use Case: Autocomplete-style suggestions where common words are preferred.
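To see how this formula trades edit distance against frequency, here is a small standalone sketch. `frequency_first_score` is a hypothetical function written for illustration; the default weights (0.5 and 0.1) are assumed from the `frequency_first_edit_weight` and `frequency_first_scale` values in the RankerConfig example on this page.

```python
import math


def frequency_first_score(edit_distance: int, frequency: int,
                          edit_weight: float = 0.5,
                          freq_scale: float = 0.1) -> float:
    """Hypothetical re-statement of the documented formula; lower is better."""
    return edit_distance * edit_weight - math.log1p(frequency) * freq_scale


# A rare word one edit away versus a very common word two edits away.
rare_close = frequency_first_score(edit_distance=1, frequency=10)
common_far = frequency_first_score(edit_distance=2, frequency=100_000)

# With these weights the common word wins despite the larger edit distance.
print(common_far < rare_close)  # True
```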

EditDistanceOnlyRanker

Simple ranking by edit distance only:

from myspellchecker.algorithms.ranker import EditDistanceOnlyRanker

ranker = EditDistanceOnlyRanker()
score = ranker.score(data)  # Returns edit_distance directly

Use Case: Testing, debugging, or when frequency data is unavailable.

PhoneticFirstRanker

Prioritizes phonetically similar words:

from myspellchecker.algorithms.ranker import PhoneticFirstRanker

ranker = PhoneticFirstRanker()

Scoring Formula:

score = edit_distance * edit_weight - phonetic_score * phonetic_weight

Use Case: Myanmar text with common phonetic confusions (medial swaps).
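The effect of the phonetic weighting can be illustrated with a standalone sketch. `phonetic_first_score` is a hypothetical function, and the default weights (0.3 for edit distance, 1.0 for phonetic similarity) are assumed from the `phonetic_first_edit_weight` and `phonetic_first_weight` values in the RankerConfig example on this page.

```python
def phonetic_first_score(edit_distance: int, phonetic_score: float,
                         edit_weight: float = 0.3,
                         phonetic_weight: float = 1.0) -> float:
    """Hypothetical re-statement of the documented formula; lower is better."""
    return edit_distance * edit_weight - phonetic_score * phonetic_weight


# One edit but phonetically distant, versus two edits but phonetically close.
close_spelling = phonetic_first_score(edit_distance=1, phonetic_score=0.2)
close_sound = phonetic_first_score(edit_distance=2, phonetic_score=0.9)

# The phonetically similar candidate ranks first under these weights.
print(close_sound < close_spelling)  # True
```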

UnifiedRanker

Consolidates suggestions from multiple sources:

from myspellchecker.algorithms.ranker import UnifiedRanker

ranker = UnifiedRanker()

# Score with source awareness
data = SuggestionData(
    term="ကြောင်း",
    edit_distance=1,
    frequency=5000,
    source="medial_confusion",  # High-priority source
    confidence=0.95,
)
score = ranker.score(data)  # Boosted by source weight

Source Weights:

Source	Default Weight	Description
`particle_typo`	1.2	Grammar rule match
`semantic`	1.15	Semantic model
`context`	1.15	Context-aware re-ranking
`medial_confusion`	1.1	Ya-pin/Ya-yit swap
`symspell`	1.0	Statistical (baseline)
`question_structure`	1.0	Question structure
`compound`	0.95	Compound word splitting
`morphology`	0.9	Morphological analysis
`morpheme`	0.85	Morpheme-level correction
`pos_sequence`	0.85	POS sequence
`medial_swap`	1.0	Medial swap variants

Configuration

RankerConfig

from myspellchecker.algorithms.ranker import DefaultRanker
from myspellchecker.core.config import RankerConfig

config = RankerConfig(
    # DefaultRanker parameters
    frequency_denominator=10000.0,
    phonetic_bonus_weight=0.4,
    syllable_bonus_weight=0.3,
    nasal_bonus_weight=0.15,
    same_nasal_bonus_weight=0.25,
    weighted_distance_bonus_weight=0.35,

    # FrequencyFirstRanker parameters
    frequency_first_edit_weight=0.5,
    frequency_first_scale=0.1,

    # PhoneticFirstRanker parameters
    phonetic_first_weight=1.0,
    phonetic_first_edit_weight=0.3,

    # UnifiedRanker source weights
    source_weight_particle_typo=1.2,
    source_weight_medial_confusion=1.1,
    source_weight_semantic=1.15,
    source_weight_symspell=1.0,
    source_weight_morphology=0.9,
    source_weight_compound=0.95,
    source_weight_context=1.15,
    source_weight_question_structure=1.0,
    source_weight_pos_sequence=0.85,
    source_weight_morpheme=0.85,
    source_weight_medial_swap=1.0,

    # Strategy score blending
    strategy_score_weight=0.5,
)

ranker = DefaultRanker(ranker_config=config)

Integration with SymSpell

from myspellchecker.algorithms.symspell import SymSpell
from myspellchecker.algorithms.ranker import FrequencyFirstRanker

# Use a custom ranker with SymSpell.
# `provider` is an already-constructed provider object (setup not shown here).
ranker = FrequencyFirstRanker()
symspell = SymSpell(provider, ranker=ranker)

# Suggestions are ranked by the custom ranker
suggestions = symspell.lookup("မျန်မာ", level='word')

UnifiedRanker Features

Deduplication

from myspellchecker.algorithms.ranker import SuggestionData, UnifiedRanker

ranker = UnifiedRanker()

suggestions = [
    SuggestionData(term="ကြောင်း", source="symspell", confidence=0.8),
    SuggestionData(term="ကြောင်း", source="medial_confusion", confidence=0.95),
]

# Keeps highest-confidence version
ranked = ranker.rank_suggestions(suggestions, deduplicate=True)
# Result: [SuggestionData(term="ကြောင်း", source="medial_confusion")]
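The "keep the highest-confidence version" behavior can be sketched without the library. The code below is an illustration under assumptions: `Candidate` is a hypothetical stand-in for SuggestionData with only three fields, and `deduplicate` is not the library's function; how UnifiedRanker breaks ties internally is not stated here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    """Hypothetical, reduced stand-in for SuggestionData."""
    term: str
    source: str
    confidence: float


def deduplicate(candidates: List[Candidate]) -> List[Candidate]:
    """Keep one candidate per term, preferring the highest confidence."""
    best: Dict[str, Candidate] = {}
    for cand in candidates:
        kept = best.get(cand.term)
        # Replace only on strictly higher confidence, so ties keep the first seen.
        if kept is None or cand.confidence > kept.confidence:
            best[cand.term] = cand
    return list(best.values())


unique = deduplicate([
    Candidate("ကြောင်း", "symspell", 0.8),
    Candidate("ကြောင်း", "medial_confusion", 0.95),
])
print(len(unique), unique[0].source)  # 1 medial_confusion
```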

Batch Ranking

suggestions = [
    SuggestionData(term="word1", ...),
    SuggestionData(term="word2", ...),
    SuggestionData(term="word3", ...),
]

# Rank and sort all suggestions
ranked = ranker.rank_suggestions(suggestions)
# Returns: Sorted list, best first
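Because lower scores are better, "best first" amounts to an ascending sort on the score. A minimal sketch of that idea follows; `rank_candidates` and the `(term, edit_distance)` tuple shape are hypothetical, and the scoring function here simply returns the edit distance.

```python
from typing import Callable, List, Tuple

# Hypothetical minimal candidate shape: (term, edit_distance).
Candidate = Tuple[str, int]


def rank_candidates(candidates: List[Candidate],
                    score: Callable[[Candidate], float]) -> List[Candidate]:
    """Return candidates sorted ascending by score, so the lowest score comes first."""
    return sorted(candidates, key=score)


ranked_terms = rank_candidates(
    [("word1", 2), ("word2", 1), ("word3", 3)],
    score=lambda cand: float(cand[1]),
)
print([term for term, _ in ranked_terms])  # ['word2', 'word1', 'word3']
```

Python's `sorted` is stable, so candidates with equal scores keep their input order in this sketch.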

Nasal Variant Handling

Myanmar has multiple nasal endings that are often confused:

Ending	Phonetic	Example
န်	/n/	ကန်
ံ	/n/ (anusvara)	ကံ
မ်	/m/	ကမ်
င်	/ŋ/	ကင်

# Nasal variants receive a bonus
data1 = SuggestionData(
    term="ကန်",
    edit_distance=1,
    is_nasal_variant=True,  # True for န် ↔ ံ
)

data2 = SuggestionData(
    term="ကမ်",
    edit_distance=1,
    is_nasal_variant=False,  # Different nasal
)

# data1 gets nasal_bonus, scores lower (better)
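The comparison above can be made concrete with a simplified sketch that isolates the two nasal terms of the DefaultRanker formula. `nasal_adjusted_score` is a hypothetical function that ignores every other bonus, and it assumes the full weight (0.15 and 0.25, as listed in the Bonuses table) is applied whenever the corresponding flag is true.

```python
def nasal_adjusted_score(edit_distance: int,
                         is_nasal_variant: bool,
                         has_same_nasal_ending: bool,
                         nasal_weight: float = 0.15,
                         same_nasal_weight: float = 0.25) -> float:
    """Edit distance minus the nasal bonuses only; lower is better.

    Simplified illustration: frequency, phonetic, POS and span bonuses are omitted.
    """
    score = float(edit_distance)
    if is_nasal_variant:
        score -= nasal_weight
    if has_same_nasal_ending:
        score -= same_nasal_weight
    return score


variant = nasal_adjusted_score(1, is_nasal_variant=True, has_same_nasal_ending=False)
other = nasal_adjusted_score(1, is_nasal_variant=False, has_same_nasal_ending=False)
print(variant < other)  # True: the nasal variant ranks first
```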

Custom Rankers

Implement a custom ranking strategy by subclassing SuggestionRanker:

from myspellchecker.algorithms.ranker import SuggestionRanker, SuggestionData
from myspellchecker.algorithms.symspell import SymSpell

class CustomRanker(SuggestionRanker):
    @property
    def name(self) -> str:
        return "custom"

    def score(self, data: SuggestionData) -> float:
        # Custom scoring logic
        base = float(data.edit_distance)

        # Boost exact syllable structure matches
        if data.syllable_distance == 0:
            base -= 0.5

        # Penalize rare words (frequency below 100)
        if data.frequency < 100:
            base += 0.3

        return base

# Use the custom ranker (`provider` setup not shown here)
symspell = SymSpell(provider, ranker=CustomRanker())

Neural Reranker

After the primary ranker scores suggestions, an optional neural reranker (MLP) can reorder them based on learned patterns. This is configured via NeuralRerankerConfig.

from myspellchecker.core.config import SpellCheckerConfig, NeuralRerankerConfig

config = SpellCheckerConfig(
    neural_reranker=NeuralRerankerConfig(
        enabled=True,
        model_path="./reranker/model.onnx",
        stats_path="./reranker/stats.json",
        confidence_gap_threshold=0.15,
        max_candidates=20,
    ),
)

The neural reranker:

Uses an MLP (19→64→1) trained with cross-entropy loss on suggestion quality signals
Runs ONNX inference to score each candidate
Skips reranking when the confidence gap between top-2 suggestions exceeds confidence_gap_threshold (the top suggestion is already clearly best)
Caps candidates at max_candidates per error for performance
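The gating and capping behavior described above can be sketched as two small functions. This is an illustration under stated assumptions, not the library's code: `should_rerank` and `cap_candidates` are hypothetical names, and the sketch assumes confidences are sorted best-first with higher values meaning more confident.

```python
from typing import List, Sequence


def should_rerank(confidences: Sequence[float], gap_threshold: float = 0.15) -> bool:
    """Rerank only when the top two candidates are close.

    Assumes `confidences` is sorted best-first and higher means more confident.
    """
    if len(confidences) < 2:
        return False  # Nothing to reorder with fewer than two candidates.
    gap = confidences[0] - confidences[1]
    return gap <= gap_threshold


def cap_candidates(candidates: List[str], max_candidates: int = 20) -> List[str]:
    """Limit the number of candidates passed to the model."""
    return candidates[:max_candidates]


print(should_rerank([0.90, 0.50]))  # False: the top suggestion is clearly best
print(should_rerank([0.60, 0.55]))  # True: the top two are close
```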

Feature Vector (19 dimensions)

Each candidate is represented by a 19-dimensional feature vector:

Index	Feature	Description
0	`edit_distance`	Raw Damerau-Levenshtein distance
1	`weighted_distance`	Myanmar-weighted edit distance
2	`log_frequency`	`log1p(word_frequency)`
3	`phonetic_score`	Phonetic similarity [0, 1]
4	`syllable_count_diff`	Absolute syllable count difference
5	`plausibility_ratio`	`weighted_dist / raw_dist`
6	`span_length_ratio`	`len(candidate) / len(error)`
7	`mlm_logit`	MLM logit score (0 if unavailable)
8	`ngram_left_prob`	Left context N-gram probability
9	`ngram_right_prob`	Right context N-gram probability
10	`is_confusable`	1.0 if Myanmar confusable variant
11	`relative_log_freq`	`log_freq / max(log_freq)` across candidates
12	`char_length_diff`	`len(candidate) - len(error)`, signed
13	`is_substring`	1.0 if candidate contains error or vice versa
14	`original_rank`	`1/(1+rank)`, a prior ranking signal
15	`ngram_improvement_ratio`	`log(P_cand_ctx / P_error_ctx)`
16	`edit_type_subst`	1.0 if primary edit is substitution
17	`edit_type_delete`	1.0 if primary edit is deletion/insertion
18	`char_dice_coeff`	Character bigram Dice coefficient

See Neural Reranker for model types, inference, and training details.
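Several of the features in the table depend only on the error string, the candidate string, its frequency, and its original rank. The sketch below computes those under assumptions: `surface_features` and `char_dice` are hypothetical helpers, lengths are counted in Unicode code points (the library may count differently), and the Dice coefficient is computed over sets of character bigrams.

```python
import math
from typing import Dict, Set


def char_bigrams(text: str) -> Set[str]:
    """Set of adjacent code-point pairs in `text`."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def char_dice(a: str, b: str) -> float:
    """Dice coefficient over character-bigram sets (set-based variant)."""
    bigrams_a, bigrams_b = char_bigrams(a), char_bigrams(b)
    if not bigrams_a and not bigrams_b:
        return 1.0 if a == b else 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


def surface_features(error: str, candidate: str,
                     frequency: int, rank: int) -> Dict[str, float]:
    """Compute a subset of the documented features (indices 2, 6, 12, 13, 14, 18)."""
    error_len = max(len(error), 1)  # Guard against an empty error span.
    return {
        "log_frequency": math.log1p(frequency),
        "span_length_ratio": len(candidate) / error_len,
        "char_length_diff": float(len(candidate) - len(error)),
        "is_substring": 1.0 if (error in candidate or candidate in error) else 0.0,
        "original_rank": 1.0 / (1.0 + rank),
        "char_dice_coeff": char_dice(error, candidate),
    }


features = surface_features("မျန်မာ", "မြန်မာ", frequency=10000, rank=0)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

The context-dependent features (MLM logit, N-gram probabilities, confusable flags) need models and dictionaries and are left out of this sketch.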

Integration Flow

The neural reranker runs as the final step in the suggestion pipeline:

1. SymSpell generates initial candidates
2. N-gram context rescores using left/right probabilities
3. Targeted rerank rules apply heuristic promotions/injections
4. Neural reranker extracts 19 features, runs ONNX MLP, reorders by score

Training a Reranker

from myspellchecker.training.reranker_data import RerankerDataGenerator
from myspellchecker.training.reranker_trainer import RerankerTrainer

# Step 1: Generate training data from corpus + database
generator = RerankerDataGenerator(
    db_path="mySpellChecker.db",
    arrow_corpus_path="corpus.arrow",
)
generator.generate(num_examples=100_000, output_path="data/reranker_train.jsonl")

# Step 2: Train the MLP
trainer = RerankerTrainer("data/reranker_train.jsonl")
metrics = trainer.train(epochs=20)

# Step 3: Export to ONNX
trainer.export_onnx("models/reranker.onnx")
# Outputs: reranker.onnx + reranker.onnx.stats.json

Performance

Ranker	Score Time	Notes
EditDistanceOnly	~0.1μs	Fastest
DefaultRanker	~1μs	Balanced
FrequencyFirst	~0.5μs	Log calculation
PhoneticFirst	~0.5μs	Simple formula
UnifiedRanker	~2μs	Source lookup + base score
NeuralReranker	~50μs	ONNX MLP inference (optional second pass)
