The neural reranker is the final stage of the suggestion pipeline, scoring candidates using a trained ONNX model after rule-based and N-gram ranking. It supports both MLP (multi-layer perceptron) and GBT (gradient-boosted tree) model types, auto-detected from the ONNX model’s input shape.
Overview
Suggestion Pipeline:
SymSpell candidates → Rule-based ranking → N-gram reranking → Neural reranking (you are here)
The reranker takes the candidate list with extracted features and produces a score for each candidate, reordering them by the model’s prediction of which correction is most likely correct.
Model Types
| Type | Input Shape | Normalization | Best For |
|---|---|---|---|
| MLP | (batch, candidates, features) | Z-score (requires stats file) | Production deployment |
| GBT | (N, features) | None (scale-invariant) | Experimentation |
Model type is auto-detected from the ONNX input shape, so no manual configuration is needed.
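As a rough illustration of shape-based detection, the sketch below maps the rank of the input tensor to a model type. The function name `detect_model_type` is hypothetical and is not part of the library; the actual detection logic inside `NeuralReranker` may differ.

```python
from typing import Sequence, Union

# ONNX dimensions may be concrete (int), symbolic (str), or unknown (None).
Dim = Union[int, str, None]


def detect_model_type(input_shape: Sequence[Dim]) -> str:
    """Guess the reranker type from the rank of the model's input tensor.

    Hypothetical helper: assumes rank 3 means MLP and rank 2 means GBT,
    following the Model Types table.
    """
    rank = len(input_shape)
    if rank == 3:  # (batch, candidates, features)
        return "mlp"
    if rank == 2:  # (N, features)
        return "gbt"
    raise ValueError(f"unsupported reranker input shape: {list(input_shape)}")


# With onnxruntime, the shape could be read from
# session.get_inputs()[0].shape on an InferenceSession.
print(detect_model_type(["batch", "candidates", 19]))
print(detect_model_type([None, 19]))
```

Only the number of dimensions is inspected, so symbolic or unknown dimension sizes do not affect the result.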
Feature Vector
The reranker uses 19 features extracted from each candidate:
| # | Feature | Description |
|---|---|---|
| 0 | edit_distance | Raw Damerau-Levenshtein distance |
| 1 | weighted_distance | Myanmar-weighted edit distance |
| 2 | log_frequency | log1p(word_frequency) |
| 3 | phonetic_score | Phonetic similarity [0, 1] |
| 4 | syllable_count_diff | Absolute syllable count difference |
| 5 | plausibility_ratio | weighted_dist / raw_dist |
| 6 | span_length_ratio | len(candidate) / len(error) |
| 7 | mlm_logit | MLM logit from semantic checker |
| 8 | ngram_left_prob | Left context N-gram probability |
| 9 | ngram_right_prob | Right context N-gram probability |
| 10 | is_confusable | 1.0 if Myanmar confusable variant |
| 11 | relative_log_freq | log_freq / max(log_freq) within candidates |
| 12 | char_length_diff | len(candidate) - len(error), signed |
| 13 | is_substring | 1.0 if substring relationship exists |
| 14 | original_rank | 1/(1+rank) prior ranking signal |
| 15 | ngram_improvement_ratio | log(P_cand_ctx / P_error_ctx) |
| 16 | edit_type_subst | 1.0 if primary edit is substitution |
| 17 | edit_type_delete | 1.0 if primary edit is deletion/insertion |
| 18 | char_dice_coeff | Character bigram Dice coefficient |
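Several of the features in the table follow directly from their formulas. The sketch below computes a handful of them in plain Python. All function names are hypothetical, and a few details are assumptions: `rank` is taken to be zero-based, the Dice coefficient is computed over sets (not multisets) of bigrams, and lengths are measured in code points, which may not match how the library measures Myanmar text.

```python
import math


def char_bigrams(text: str) -> set:
    """Set of adjacent character pairs in the text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def char_dice_coeff(a: str, b: str) -> float:
    """Dice coefficient over character bigram sets: 2|A & B| / (|A| + |B|)."""
    ba, bb = char_bigrams(a), char_bigrams(b)
    if not ba and not bb:
        # Strings too short to have bigrams: fall back to exact match.
        return 1.0 if a == b else 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))


def relative_log_freqs(frequencies: list) -> list:
    """log_freq / max(log_freq) across one candidate list."""
    logs = [math.log1p(f) for f in frequencies]
    top = max(logs)
    return [value / top if top > 0 else 0.0 for value in logs]


def simple_features(error: str, candidate: str, frequency: int, rank: int) -> dict:
    """A subset of the feature table; assumes a non-empty error string."""
    return {
        "log_frequency": math.log1p(frequency),
        "span_length_ratio": len(candidate) / len(error),
        "char_length_diff": len(candidate) - len(error),
        "is_substring": 1.0 if candidate in error or error in candidate else 0.0,
        "original_rank": 1 / (1 + rank),  # assumes zero-based rank
        "char_dice_coeff": char_dice_coeff(error, candidate),
    }


print(simple_features("nigth", "night", frequency=0, rank=0))
```

Features that depend on other components, such as `mlm_logit` or the N-gram probabilities, are left out because they cannot be derived from the strings alone.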
For MLP models with feature_schema == "mlp_v3", the reranker automatically applies transforms at inference time:
- Drops the original_rank feature (index 14) to prevent ranking leakage
- Computes cross-features as configured in the stats file
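The two transforms can be pictured with a small sketch. Everything here is illustrative: the function name is hypothetical, the definition of `freq_x_edit` as `log_frequency * edit_distance` is a guess based on its name, and the order in which the library drops and appends features is not documented here.

```python
EDIT_DISTANCE_INDEX = 0   # edit_distance in the 19-feature vector
LOG_FREQUENCY_INDEX = 2   # log_frequency
ORIGINAL_RANK_INDEX = 14  # original_rank


def apply_mlp_v3_transforms(features, drop_original_rank=True, cross_features=()):
    """Return a transformed copy of one candidate's feature vector."""
    # Cross-features are computed from the full vector before anything is dropped
    # (an assumed ordering).
    extras = []
    for name in cross_features:
        if name == "freq_x_edit":
            # ASSUMPTION: product of log_frequency and edit_distance.
            extras.append(features[LOG_FREQUENCY_INDEX] * features[EDIT_DISTANCE_INDEX])
        else:
            raise ValueError(f"unknown cross-feature: {name}")

    base = [
        value
        for index, value in enumerate(features)
        if not (drop_original_rank and index == ORIGINAL_RANK_INDEX)
    ]
    return base + extras


vector = [float(i + 1) for i in range(19)]
print(len(apply_mlp_v3_transforms(vector, cross_features=["freq_x_edit"])))
```

Dropping `original_rank` keeps the model from simply echoing the previous stage's ordering, which is the leakage the list above refers to.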
Usage
Basic Usage
from myspellchecker.algorithms.neural_reranker import NeuralReranker
reranker = NeuralReranker(
    model_path="path/to/reranker.onnx",
    stats_path="path/to/reranker_stats.json",  # Required for MLP
)
# Score candidates
features = [
    [1.0, 0.8, 9.2, 0.9, 0, ...],  # candidate 1 features (19 values)
    [2.0, 1.5, 7.1, 0.6, 1, ...],  # candidate 2 features
]
scores = reranker.score_candidates(features)
# Returns: [0.92, 0.45]; higher is better
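Once scores are available, reordering the candidates is a plain sort. The helper below is a hypothetical sketch of that step and does not call the library; it assumes `scores[i]` belongs to `candidates[i]` and that higher scores are better.

```python
def rerank(candidates, scores):
    """Return candidates ordered by descending score.

    Python's sort is stable, so candidates with equal scores
    keep their original relative order.
    """
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]


print(rerank(["first", "second"], [0.45, 0.92]))
```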
With SpellCheckerBuilder
from myspellchecker.core import SpellCheckerBuilder
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.algorithm_configs import RankerConfig
config = SpellCheckerConfig(
    ranker=RankerConfig(
        reranker_model_path="path/to/reranker.onnx",
        reranker_stats_path="path/to/reranker_stats.json",
    )
)
checker = (
    SpellCheckerBuilder()
    .with_config(config)
    .build()
)
The JSON stats file contains normalization parameters for MLP models:
{
  "feature_schema": "mlp_v3",
  "feature_means": [1.2, 0.9, 8.5, ...],
  "feature_stds": [0.8, 0.4, 2.1, ...],
  "drop_original_rank": true,
  "cross_features": ["freq_x_edit"]
}
| Field | Description |
|---|---|
| feature_schema | Schema version (e.g., “mlp_v3”) |
| feature_means | Per-feature means for z-score normalization |
| feature_stds | Per-feature standard deviations |
| drop_original_rank | Whether to drop the original_rank feature |
| cross_features | List of cross-features to compute |
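Z-score normalization with these fields amounts to `(x - mean) / std` per feature. The sketch below shows this on a toy two-feature stats document. The helper name and the guard against zero standard deviations are assumptions made for illustration; the library may handle that case differently.

```python
import json

# Toy stats with two features; a real file would have one entry per model input.
STATS_JSON = '{"feature_means": [1.0, 8.0], "feature_stds": [0.5, 0.0]}'


def zscore(features, means, stds, eps=1e-8):
    """Normalize one feature vector as (x - mean) / std."""
    if not (len(features) == len(means) == len(stds)):
        raise ValueError("features, means and stds must have the same length")
    # ASSUMPTION: a near-zero std is replaced by 1.0 to avoid division by zero.
    return [
        (x - m) / (s if abs(s) > eps else 1.0)
        for x, m, s in zip(features, means, stds)
    ]


stats = json.loads(STATS_JSON)
print(zscore([2.0, 8.0], stats["feature_means"], stats["feature_stds"]))
```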
Training
Reranker models are trained offline using the training pipeline:
- Generate training data (training/reranker_data.py): Extracts feature vectors from benchmark examples
- Train model (training/reranker_trainer.py): Trains MLP and exports to ONNX with quantization
- Evaluate: Compare MRR and Top-1 accuracy against the baseline ranker
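For the evaluation step, MRR (mean reciprocal rank) and Top-1 accuracy can be computed from ranked candidate lists and their gold corrections. This is a minimal sketch with hypothetical function names, independent of the training scripts; it assumes a gold correction missing from the list contributes a reciprocal rank of 0.

```python
def reciprocal_rank(ranked, gold):
    """1 / position of the gold correction (1-based), or 0.0 if absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == gold:
            return 1.0 / position
    return 0.0


def evaluate(examples):
    """Compute MRR and Top-1 accuracy over (ranked_candidates, gold) pairs."""
    if not examples:
        raise ValueError("no examples to evaluate")
    total = len(examples)
    mrr = sum(reciprocal_rank(ranked, gold) for ranked, gold in examples) / total
    top1 = sum(1 for ranked, gold in examples if ranked and ranked[0] == gold) / total
    return {"mrr": mrr, "top1": top1}


examples = [
    (["a", "b"], "a"),  # gold at rank 1
    (["a", "b"], "b"),  # gold at rank 2
    (["a"], "z"),       # gold missing
]
print(evaluate(examples))
```

Running the same function on the baseline ranker's output and on the reranked output gives the side-by-side comparison described in the last step.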
See the Training Guide for details on the training pipeline.
- Latency: ~0.5ms per candidate batch (ONNX optimized)
- Memory: ~2-5MB model size (quantized)
- Dependencies: Requires onnxruntime (pip install myspellchecker[ai])
See Also