Overview
SuggestionData
All ranking input is encapsulated in SuggestionData.
Data Fields
| Field | Type | Description |
|---|---|---|
| term | str | The suggested word |
| edit_distance | int | Damerau-Levenshtein distance |
| frequency | int | Word frequency in corpus |
| phonetic_score | float | Phonetic similarity (0.0-1.0) |
| syllable_distance | float | Myanmar syllable-aware distance |
| weighted_distance | float | Myanmar-weighted edit distance using substitution costs |
| is_nasal_variant | bool | Nasal ending variant (န်↔ံ) |
| has_same_nasal_ending | bool | Same nasal consonant ending |
| source | str | Suggestion origin |
| confidence | float | Source confidence (0.0-1.0) |
| strategy_score | float | Strategy-level score for blending (optional) |
| score_breakdown | dict | Debug info with component scores (optional) |
| pos_fit_score | float | POS bigram fit score from context analysis (optional) |
| error_length | int | Character length of original error span for span-length bonus (optional) |
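The fields above map naturally onto a dataclass. A minimal sketch (the actual class definition and field defaults may differ):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SuggestionData:
    """Container for all ranking inputs (sketch of the fields listed above)."""
    term: str                           # the suggested word
    edit_distance: int                  # Damerau-Levenshtein distance
    frequency: int                      # word frequency in corpus
    phonetic_score: float = 0.0         # phonetic similarity, 0.0-1.0
    syllable_distance: float = 0.0      # Myanmar syllable-aware distance
    weighted_distance: float = 0.0      # Myanmar-weighted edit distance
    is_nasal_variant: bool = False      # nasal ending variant (န် ↔ ံ)
    has_same_nasal_ending: bool = False
    source: str = "symspell"            # suggestion origin
    confidence: float = 1.0             # source confidence, 0.0-1.0
    strategy_score: Optional[float] = None   # strategy-level blend score
    score_breakdown: Optional[dict] = None   # per-component debug scores
    pos_fit_score: Optional[float] = None    # POS bigram fit from context
    error_length: Optional[int] = None       # original error span length
```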
Ranking Strategies
DefaultRanker
Balanced ranking considering multiple factors:

| Bonus | Range | Description |
|---|---|---|
| freq_bonus | 0.0-0.8 | Higher frequency reduces score: 0.8 * (1 - 1/(1 + freq/denominator)) |
| phonetic_bonus | 0.0-0.4 | Phonetic similarity bonus (weight=0.4) |
| nasal_bonus | 0.0-0.15 | Nasal variant matching (weight=0.15) |
| same_nasal_bonus | 0.0-0.25 | Same nasal ending (weight=0.25) |
| pos_bonus | 0.0-0.25 | POS bigram fit score from context (weight=0.25) |
| span_bonus | 0.1-1.4 | Length-scaled bonus for error-span matching with tiered scoring |
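The freq_bonus formula from the table can be worked through numerically. A minimal sketch, assuming a denominator of 1000 (the real value comes from configuration), where the bonuses are subtracted from an edit-distance base so lower scores rank first:

```python
def freq_bonus(freq: int, denominator: float = 1000.0, weight: float = 0.8) -> float:
    """Frequency bonus per the table: saturates at `weight` (0.8) as freq grows."""
    return weight * (1 - 1 / (1 + freq / denominator))

def default_score(edit_distance: int, freq: int, phonetic: float,
                  denominator: float = 1000.0) -> float:
    """Sketch of a balanced score: start from edit distance, subtract bonuses.
    Lower is better. Only two of the six bonuses are shown for brevity."""
    score = float(edit_distance)
    score -= freq_bonus(freq, denominator)  # 0.0-0.8, frequency
    score -= 0.4 * phonetic                 # 0.0-0.4, phonetic similarity
    return score
```

For example, a candidate at edit distance 1 with corpus frequency 1000 gets a freq_bonus of 0.8 * (1 - 1/2) = 0.4, halfway to the cap.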
FrequencyFirstRanker
Prioritizes common words over edit distance.
EditDistanceOnlyRanker
Simple ranking by edit distance only.
PhoneticFirstRanker
Prioritizes phonetically similar words.
UnifiedRanker
Consolidates suggestions from multiple sources:

| Source | Default Weight | Description |
|---|---|---|
| particle_typo | 1.2 | Grammar rule match |
| semantic | 1.15 | Semantic model |
| context | 1.15 | Context-aware re-ranking |
| medial_confusion | 1.1 | Ya-pin/Ya-yit swap |
| symspell | 1.0 | Statistical (baseline) |
| question_structure | 1.0 | Question structure |
| medial_swap | 1.0 | Medial swap variants |
| compound | 0.95 | Compound word splitting |
| morphology | 0.9 | Morphological analysis |
| morpheme | 0.85 | Morpheme-level correction |
| pos_sequence | 0.85 | POS sequence |
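A minimal sketch of how the source weights might blend into a unified score. The weight table mirrors the one above; the combination formula itself is an assumption (here a base quality score, higher is better, scaled by source weight and confidence):

```python
# Default source weights, mirroring the table above
SOURCE_WEIGHTS = {
    "particle_typo": 1.2, "semantic": 1.15, "context": 1.15,
    "medial_confusion": 1.1, "symspell": 1.0, "question_structure": 1.0,
    "medial_swap": 1.0, "compound": 0.95, "morphology": 0.9,
    "morpheme": 0.85, "pos_sequence": 0.85,
}

def unified_score(base_score: float, source: str, confidence: float) -> float:
    """Scale a base quality score (higher = better) by the source's weight
    and its own confidence. Unknown sources fall back to 1.0."""
    return base_score * SOURCE_WEIGHTS.get(source, 1.0) * confidence
```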
Configuration
RankerConfig
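RankerConfig presumably exposes the bonus weights listed under DefaultRanker. A hedged sketch, where the field names and the denominator default are assumptions; only the default values taken from the bonus table are grounded:

```python
from dataclasses import dataclass

@dataclass
class RankerConfig:
    """Tunable ranking weights (sketch; defaults mirror the DefaultRanker table)."""
    freq_weight: float = 0.8          # cap of the frequency bonus
    freq_denominator: float = 1000.0  # hypothetical: controls frequency saturation
    phonetic_weight: float = 0.4
    nasal_weight: float = 0.15
    same_nasal_weight: float = 0.25
    pos_weight: float = 0.25
```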
Integration with SymSpell
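A sketch of the hand-off from SymSpell to the ranker: a SymSpell lookup typically yields (term, edit_distance, frequency) tuples, which are then sorted by a ranking strategy. The helper below is illustrative, not the library's actual API:

```python
def rank_candidates(candidates, max_results=5):
    """candidates: iterable of (term, edit_distance, frequency) tuples,
    the shape a SymSpell lookup typically returns.
    Lower edit distance wins; higher frequency breaks ties."""
    scored = sorted(candidates, key=lambda c: (c[1], -c[2]))
    return [term for term, _, _ in scored[:max_results]]

# Hypothetical candidates for a misspelled Myanmar word
cands = [("ကျောင်း", 1, 5000), ("ကောင်း", 1, 9000), ("ကြောင်း", 2, 7000)]
print(rank_candidates(cands))  # → ['ကောင်း', 'ကျောင်း', 'ကြောင်း']
```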
UnifiedRanker Features
Deduplication
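When several sources propose the same term, a plausible deduplication policy is to keep the occurrence with the highest confidence. A sketch; the actual tie-breaking rule is an assumption:

```python
def deduplicate(suggestions):
    """suggestions: list of (term, source, confidence) tuples.
    Keep one entry per term — the one with the highest confidence."""
    best = {}
    for term, source, conf in suggestions:
        if term not in best or conf > best[term][2]:
            best[term] = (term, source, conf)
    return list(best.values())
```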
Batch Ranking
Nasal Variant Handling
Myanmar has multiple nasal endings that are often confused:

| Ending | Phonetic | Example |
|---|---|---|
| န် | /n/ | ကန် |
| ံ | /n/ (anusvara) | ကံ |
| မ် | /m/ | ကမ် |
| င် | /ŋ/ | ကင် |
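The table above suggests a simple ending check. A sketch: only the န် ↔ ံ pair is flagged, matching the is_nasal_variant field definition; the phonetic classes in the mapping follow the table:

```python
from typing import Optional

# Nasal endings and their phonetic class, per the table above
NASAL_ENDINGS = {"န်": "n", "ံ": "n", "မ်": "m", "င်": "ŋ"}

def nasal_ending(word: str) -> Optional[str]:
    """Return the word's nasal ending, if any."""
    for ending in NASAL_ENDINGS:
        if word.endswith(ending):
            return ending
    return None

def is_nasal_variant(a: str, b: str) -> bool:
    """True for the commonly-confused န် ↔ ံ pair, e.g. ကန် vs ကံ."""
    return {nasal_ending(a), nasal_ending(b)} == {"န်", "ံ"}
```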
Custom Rankers
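A custom strategy can follow the same shape as the built-in rankers. A sketch under assumed conventions: the class name, method names, and the scoring formula below are all illustrative, matching only the lower-is-better convention of the edit-distance-based rankers:

```python
from types import SimpleNamespace

class SyllableFirstRanker:
    """Hypothetical custom strategy: syllable-aware distance first,
    then raw edit distance, with a small frequency bonus."""

    def score(self, s) -> float:
        # Lower is better, matching the built-in rankers above.
        return (s.syllable_distance * 2.0 + s.edit_distance
                - 0.1 * min(s.frequency / 1000.0, 1.0))

    def rank(self, suggestions):
        return sorted(suggestions, key=self.score)

# Example: a zero-syllable-distance candidate outranks a distant one
a = SimpleNamespace(syllable_distance=0.0, edit_distance=1, frequency=100)
b = SimpleNamespace(syllable_distance=1.0, edit_distance=1, frequency=100)
ranked = SyllableFirstRanker().rank([b, a])
```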
Implement a custom ranking strategy by defining a scoring function and a rank method.
Neural Reranker
After the primary ranker scores suggestions, an optional neural reranker (MLP) can reorder them based on learned patterns. This is configured via NeuralRerankerConfig.
- Uses an MLP (20→64→1) trained with cross-entropy loss on suggestion quality signals
- Runs ONNX inference to score each candidate
- Skips reranking when the confidence gap between the top-2 suggestions exceeds confidence_gap_threshold (the top suggestion is already clearly best)
- Caps candidates at max_candidates per error for performance
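The skip conditions can be sketched as a gate in front of the ONNX call. A minimal sketch; the default values below are assumptions standing in for NeuralRerankerConfig:

```python
def should_rerank(confidences, confidence_gap_threshold: float = 0.3) -> bool:
    """Skip the neural pass when the top suggestion is already clearly best,
    i.e. when the top-2 confidence gap exceeds the threshold."""
    if len(confidences) < 2:
        return False
    ranked = sorted(confidences, reverse=True)
    return (ranked[0] - ranked[1]) <= confidence_gap_threshold

def cap_candidates(candidates, max_candidates: int = 10):
    """Limit how many candidates per error reach ONNX inference."""
    return candidates[:max_candidates]
```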
Feature Vector (19 dimensions)
Each candidate is represented by a 19-dimensional feature vector:

| Index | Feature | Description |
|---|---|---|
| 0 | edit_distance | Raw Damerau-Levenshtein distance |
| 1 | weighted_distance | Myanmar-weighted edit distance |
| 2 | log_frequency | log1p(word_frequency) |
| 3 | phonetic_score | Phonetic similarity [0, 1] |
| 4 | syllable_count_diff | Absolute syllable count difference |
| 5 | plausibility_ratio | weighted_dist / raw_dist |
| 6 | span_length_ratio | len(candidate) / len(error) |
| 7 | mlm_logit | MLM logit score (0 if unavailable) |
| 8 | ngram_left_prob | Left context N-gram probability |
| 9 | ngram_right_prob | Right context N-gram probability |
| 10 | is_confusable | 1.0 if Myanmar confusable variant |
| 11 | relative_log_freq | log_freq / max(log_freq) across candidates |
| 12 | char_length_diff | len(candidate) - len(error), signed |
| 13 | is_substring | 1.0 if candidate contains error or vice versa |
| 14 | original_rank | 1/(1+rank), a prior ranking signal |
| 15 | ngram_improvement_ratio | log(P_cand_ctx / P_error_ctx) |
| 16 | edit_type_subst | 1.0 if primary edit is substitution |
| 17 | edit_type_delete | 1.0 if primary edit is deletion/insertion |
| 18 | char_dice_coeff | Character bigram Dice coefficient |
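Two of the self-contained features can be computed directly from the table's definitions: log_frequency (index 2) is log1p of the raw count, and char_dice_coeff (index 18) is a Dice coefficient over character bigrams. A sketch, using a multiset intersection for repeated bigrams (an implementation detail the table does not specify):

```python
import math
from collections import Counter

def log_frequency(freq: int) -> float:
    """Feature 2: log1p of the raw corpus frequency."""
    return math.log1p(freq)

def char_bigrams(s: str) -> list:
    return [s[i:i + 2] for i in range(len(s) - 1)]

def char_dice_coeff(a: str, b: str) -> float:
    """Feature 18: Dice coefficient over character bigrams,
    2*|A ∩ B| / (|A| + |B|)."""
    ba, bb = char_bigrams(a), char_bigrams(b)
    if not ba and not bb:
        return 1.0
    overlap = sum((Counter(ba) & Counter(bb)).values())
    return 2.0 * overlap / (len(ba) + len(bb))
```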
Integration Flow
The neural reranker runs as the final step in the suggestion pipeline:

- SymSpell generates initial candidates
- N-gram context rescores using left/right probabilities
- Targeted rerank rules apply heuristic promotions/injections
- Neural reranker extracts 19 features, runs ONNX MLP, reorders by score
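The four steps above can be sketched as a single orchestration function. The stage callables are illustrative stand-ins for the components described; each takes and returns a ranked candidate list:

```python
def suggest(error, symspell_lookup, ngram_rescore, apply_rules, neural_rerank):
    """Orchestration sketch of the suggestion pipeline."""
    candidates = symspell_lookup(error)     # 1. initial candidates
    candidates = ngram_rescore(candidates)  # 2. left/right n-gram rescoring
    candidates = apply_rules(candidates)    # 3. heuristic promotions/injections
    return neural_rerank(candidates)        # 4. optional ONNX MLP reorder
```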
Training a Reranker
Performance
| Ranker | Score Time | Notes |
|---|---|---|
| EditDistanceOnly | ~0.1μs | Fastest |
| DefaultRanker | ~1μs | Balanced |
| FrequencyFirst | ~0.5μs | Log calculation |
| PhoneticFirst | ~0.5μs | Simple formula |
| UnifiedRanker | ~2μs | Source lookup + base score |
| NeuralReranker | ~50μs | ONNX MLP inference (optional second pass) |
See Also
- SymSpell Algorithm - Suggestion generation
- Edit Distance - Distance calculations
- Phonetic Matching - Phonetic scoring
- Configuration Guide - RankerConfig options