Neural Reranker - mySpellChecker

The neural reranker is the final stage of the suggestion pipeline, scoring candidates using a trained ONNX model after rule-based and N-gram ranking. It supports both MLP (multi-layer perceptron) and GBT (gradient-boosted tree) model types, auto-detected from the ONNX model’s input shape.

Overview

Suggestion Pipeline:
  SymSpell candidates → Rule-based ranking → N-gram reranking → Neural reranking
                                                                  ↑ you are here

The reranker takes the candidate list with extracted features and produces a score for each candidate, reordering them by the model’s prediction of which correction is most likely correct.

Model Types

Type	Input Shape	Normalization	Best For
MLP	(batch, candidates, features)	Z-score (requires stats file)	Production deployment
GBT	(N, features)	None (scale-invariant)	Experimentation

Model type is auto-detected from the ONNX input shape — no manual configuration needed.

Feature Vector

The reranker uses 19 features extracted from each candidate:

#	Feature	Description
0	`edit_distance`	Raw Damerau-Levenshtein distance
1	`weighted_distance`	Myanmar-weighted edit distance
2	`log_frequency`	log1p(word_frequency)
3	`phonetic_score`	Phonetic similarity [0, 1]
4	`syllable_count_diff`	Absolute syllable count difference
5	`plausibility_ratio`	weighted_dist / raw_dist
6	`span_length_ratio`	len(candidate) / len(error)
7	`mlm_logit`	MLM logit from semantic checker
8	`ngram_left_prob`	Left context N-gram probability
9	`ngram_right_prob`	Right context N-gram probability
10	`is_confusable`	1.0 if Myanmar confusable variant
11	`relative_log_freq`	log_freq / max(log_freq) within candidates
12	`char_length_diff`	len(candidate) - len(error), signed
13	`is_substring`	1.0 if substring relationship exists
14	`original_rank`	1/(1+rank) prior ranking signal
15	`ngram_improvement_ratio`	log(P_cand_ctx / P_error_ctx)
16	`edit_type_subst`	1.0 if primary edit is substitution
17	`edit_type_delete`	1.0 if primary edit is deletion/insertion
18	`char_dice_coeff`	Character bigram Dice coefficient

MLP v3 Feature Transforms

For MLP models with feature_schema == "mlp_v3", the reranker automatically applies transforms at inference time:

Drops original_rank feature (index 14) to prevent ranking leakage
Computes cross-features as configured in the stats file

Usage

Basic Usage

from myspellchecker.algorithms.neural_reranker import NeuralReranker

reranker = NeuralReranker(
    model_path="path/to/reranker.onnx",
    stats_path="path/to/reranker_stats.json",  # Required for MLP
)

# Score candidates
features = [
    [1.0, 0.8, 9.2, 0.9, 0, ...],  # candidate 1 features (19 values)
    [2.0, 1.5, 7.1, 0.6, 1, ...],  # candidate 2 features
]

scores = reranker.score_candidates(features)
# Returns: [0.92, 0.45] — higher is better

With SpellCheckerBuilder

from myspellchecker.core import SpellCheckerBuilder
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.algorithm_configs import RankerConfig

config = SpellCheckerConfig(
    ranker=RankerConfig(
        reranker_model_path="path/to/reranker.onnx",
        reranker_stats_path="path/to/reranker_stats.json",
    )
)

checker = (
    SpellCheckerBuilder()
    .with_config(config)
    .build()
)

Stats File Format

The JSON stats file contains normalization parameters for MLP models:

{
  "feature_schema": "mlp_v3",
  "feature_means": [1.2, 0.9, 8.5, ...],
  "feature_stds": [0.8, 0.4, 2.1, ...],
  "drop_original_rank": true,
  "cross_features": ["freq_x_edit"]
}

Field	Description
`feature_schema`	Schema version (e.g., “mlp_v3”)
`feature_means`	Per-feature means for z-score normalization
`feature_stds`	Per-feature standard deviations
`drop_original_rank`	Whether to drop the original_rank feature
`cross_features`	List of cross-features to compute

Training

Reranker models are trained offline using the training pipeline:

Generate training data (training/reranker_data.py): Extracts feature vectors from benchmark examples
Train model (training/reranker_trainer.py): Trains MLP and exports to ONNX with quantization
Evaluate: Compare MRR and Top-1 accuracy against the baseline ranker

See the Training Guide for details on the training pipeline.

Performance

Latency: ~0.5ms per candidate batch (ONNX optimized)
Memory: ~2-5MB model size (quantized)
Dependencies: Requires onnxruntime (pip install myspellchecker[ai])

​Overview

​Model Types

​Feature Vector

​MLP v3 Feature Transforms

​Usage

​Basic Usage

​With SpellCheckerBuilder

​Stats File Format

​Training

​Performance

​See Also