## Overview

The reranker re-scores spelling-correction candidates with a learned model (MLP or GBT) over a 19-dimensional feature vector extracted per candidate, and is evaluated against the rule-based baseline ranker.
## Model Types
| Type | Input Shape | Normalization | Best For |
|---|---|---|---|
| MLP | (batch, candidates, features) | Z-score (requires stats file) | Production deployment |
| GBT | (N, features) | None (scale-invariant) | Experimentation |
## Feature Vector
The reranker uses 19 features extracted from each candidate:

| # | Feature | Description |
|---|---|---|
| 0 | edit_distance | Raw Damerau-Levenshtein distance |
| 1 | weighted_distance | Myanmar-weighted edit distance |
| 2 | log_frequency | log1p(word_frequency) |
| 3 | phonetic_score | Phonetic similarity [0, 1] |
| 4 | syllable_count_diff | Absolute syllable count difference |
| 5 | plausibility_ratio | weighted_dist / raw_dist |
| 6 | span_length_ratio | len(candidate) / len(error) |
| 7 | mlm_logit | MLM logit from semantic checker |
| 8 | ngram_left_prob | Left context N-gram probability |
| 9 | ngram_right_prob | Right context N-gram probability |
| 10 | is_confusable | 1.0 if Myanmar confusable variant |
| 11 | relative_log_freq | log_freq / max(log_freq) within candidates |
| 12 | char_length_diff | len(candidate) - len(error), signed |
| 13 | is_substring | 1.0 if substring relationship exists |
| 14 | original_rank | 1/(1+rank) prior ranking signal |
| 15 | ngram_improvement_ratio | log(P_cand_ctx / P_error_ctx) |
| 16 | edit_type_subst | 1.0 if primary edit is substitution |
| 17 | edit_type_delete | 1.0 if primary edit is deletion/insertion |
| 18 | char_dice_coeff | Character bigram Dice coefficient |
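Two of the simpler features have direct closed forms. As an illustrative sketch (not the library's actual implementation), the character-bigram Dice coefficient (feature 18) and the prior-rank signal (feature 14) can be computed as:

```python
def char_dice_coeff(a: str, b: str) -> float:
    """Feature 18: Dice coefficient over character-bigram sets,
    2*|A ∩ B| / (|A| + |B|)."""
    ba = {a[i:i + 2] for i in range(len(a) - 1)}
    bb = {b[i:i + 2] for i in range(len(b) - 1)}
    if not ba and not bb:
        return 1.0  # both strings too short to form bigrams
    return 2 * len(ba & bb) / (len(ba) + len(bb))

def rank_prior(rank: int) -> float:
    """Feature 14: 1/(1+rank) for a zero-based prior rank."""
    return 1.0 / (1.0 + rank)
```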
## MLP v3 Feature Transforms
For MLP models with `feature_schema == "mlp_v3"`, the reranker automatically applies transforms at inference time:

- Drops the `original_rank` feature (index 14) to prevent ranking leakage
- Computes cross-features as configured in the stats file
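The drop-and-normalize part of this preprocessing can be sketched as follows (a minimal NumPy illustration, not the library's code; the cross-feature step is omitted, and `means`/`stds` are assumed to be the post-drop, 18-element vectors from the stats file):

```python
import numpy as np

def apply_mlp_v3_transforms(X, means, stds, drop_original_rank=True):
    """Illustrative mlp_v3 preprocessing: drop the rank feature, then
    z-score with the stats-file means/stds (cross-features omitted)."""
    X = np.asarray(X, dtype=np.float32)
    if drop_original_rank:
        X = np.delete(X, 14, axis=1)  # index 14 = original_rank
    return (X - means) / np.maximum(stds, 1e-8)  # guard zero stds
```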
## Usage
### Basic Usage
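The original snippet for this section is not reproduced here. As a self-contained sketch of what reranking amounts to (the linear "model", weights, and candidate names below are dummies, not the real ONNX model or API): normalize the `(candidates, features)` matrix, score each row, and reorder candidates best-first.

```python
import numpy as np

def rerank(candidates, features, means, stds, score_fn):
    """Toy rerank loop: z-score the features, score, sort best-first."""
    X = (np.asarray(features, dtype=np.float32) - means) / stds
    scores = score_fn(X)          # one score per candidate
    order = np.argsort(-scores)   # descending score
    return [candidates[i] for i in order]

# Dummy linear "model": lower edit_distance (feature 0) scores higher.
w = np.zeros(19, dtype=np.float32)
w[0] = -1.0
cands = ["candidate_a", "candidate_b"]
feats = np.array([[0.9] + [0.0] * 18,   # larger edit distance
                  [0.2] + [0.0] * 18])  # smaller edit distance
best_first = rerank(cands, feats, np.zeros(19), np.ones(19), lambda X: X @ w)
# best_first == ["candidate_b", "candidate_a"]
```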
### With SpellCheckerBuilder
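The builder wiring for this section is likewise not shown in the source. The sketch below uses a stand-in `SpellCheckerBuilder` with a hypothetical `with_reranker(model_path, stats_path)` method purely to illustrate the expected shape of the call chain; the real class's method names and signatures may differ.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpellChecker:
    """Minimal stand-in for the built checker object."""
    reranker_model: Optional[str] = None
    reranker_stats: Optional[str] = None

class SpellCheckerBuilder:
    """Stand-in builder; the real API may differ."""
    def __init__(self):
        self._model = None
        self._stats = None

    def with_reranker(self, model_path: str, stats_path: str):
        # hypothetical fluent setter: ONNX model + JSON stats file
        self._model = model_path
        self._stats = stats_path
        return self

    def build(self) -> SpellChecker:
        return SpellChecker(self._model, self._stats)

checker = (SpellCheckerBuilder()
           .with_reranker("reranker_mlp.onnx", "reranker_stats.json")
           .build())
```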
## Stats File Format

The JSON stats file contains normalization parameters for MLP models:

| Field | Description |
|---|---|
| `feature_schema` | Schema version (e.g., `"mlp_v3"`) |
| `feature_means` | Per-feature means for z-score normalization |
| `feature_stds` | Per-feature standard deviations |
| `drop_original_rank` | Whether to drop the `original_rank` feature |
| `cross_features` | List of cross-features to compute |
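An illustrative stats file matching the fields above (all values are made up for the example, the arrays are truncated to three entries where the real file has one per retained feature, and the `cross_features` entry format is an assumption):

```json
{
  "feature_schema": "mlp_v3",
  "feature_means": [1.42, 0.97, 8.31],
  "feature_stds": [0.88, 0.64, 2.10],
  "drop_original_rank": true,
  "cross_features": [["log_frequency", "phonetic_score"]]
}
```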
## Training

Reranker models are trained offline using the training pipeline:

1. Generate training data (`training/reranker_data.py`): extracts feature vectors from benchmark examples
2. Train model (`training/reranker_trainer.py`): trains the MLP and exports it to ONNX with quantization
3. Evaluate: compare MRR and Top-1 accuracy against the baseline ranker
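For the evaluation step, MRR and Top-1 accuracy over ranked suggestion lists can be computed as follows (an illustrative helper, not part of the library):

```python
def mrr_and_top1(ranked_lists, gold):
    """ranked_lists[i] is the suggestion list for example i; gold[i] is the
    correct word. MRR averages 1/rank of the gold item (0 if absent)."""
    rr, top1 = [], 0
    for suggestions, answer in zip(ranked_lists, gold):
        if suggestions and suggestions[0] == answer:
            top1 += 1
        rr.append(1.0 / (suggestions.index(answer) + 1)
                  if answer in suggestions else 0.0)
    n = len(gold)
    return sum(rr) / n, top1 / n
```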
## Performance
- Latency: ~0.5 ms per candidate batch (ONNX optimized)
- Memory: ~2-5 MB model size (quantized)
- Dependencies: requires `onnxruntime` (`pip install myspellchecker[ai]`)
## See Also
- Suggestion Ranking — Rule-based ranking pipeline
- Suggestion Strategy — Candidate generation
- Semantic Algorithm — MLM inference
- Training Guide — Model training pipeline