Current Results
Tested with mySpellChecker_production.db (565 MB, 601K words, full POS + enrichment tables) on macOS (Apple Silicon) at validation level word.
The dictionary database and semantic model (v2.3) used in these benchmarks are not included in the library. They were built from our own proprietary corpus using the data pipeline and the training pipeline, respectively. Your results will vary depending on the dictionary database you build and the semantic model you train. See Building Dictionaries and Training Models to create your own.
Overall Metrics (no semantic)
| Metric | Value |
|---|---|
| F1 | 96.2% |
| Precision | 97.8% |
| Recall | 94.7% |
| True Positives | 445 |
| False Positives | 10 |
| False Negatives | 25 |
| FPR (clean sentences) | 0.0% |
| Top-1 Suggestion Accuracy | 85.2% |
| MRR | 0.8731 |
Overall Metrics (with semantic v2.3)
| Metric | Value |
|---|---|
| F1 | 98.3% |
| Precision | 97.1% |
| Recall | 99.6% |
| True Positives | 468 |
| False Positives | 14 |
| False Negatives | 2 |
| FPR (clean sentences) | 0.0% |
| Top-1 Suggestion Accuracy | 81.2% |
| MRR | 0.8395 |
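The detection metrics above follow the standard definitions. As an illustrative sketch (not part of the library API), the following reproduces the no-semantic numbers from the raw counts in the first table:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard detection metrics from raw true/false positive/negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def mean_reciprocal_rank(ranks: list[int]) -> float:
    """MRR over the rank of the gold correction in each suggestion list."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# No-semantic run: TP=445, FP=10, FN=25 (from the table above)
p, r, f1 = precision_recall_f1(445, 10, 25)
print(f"precision={p:.1%} recall={r:.1%} f1={f1:.1%}")
# → precision=97.8% recall=94.7% f1=96.2%
```

The same arithmetic applied to the with-semantic counts (468, 14, 2) yields the second table's 97.1% / 99.6% / 98.3%.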
Per-Tier Breakdown (no semantic)
| Tier | Errors | TP | FP | FN | Prec | Rec | F1 | Top-1 | MRR |
|---|---|---|---|---|---|---|---|---|---|
| Tier 1 (Easy) | 160 | 152 | 3 | 8 | 98.1% | 95.0% | 96.5% | 85.5% | 0.876 |
| Tier 2 (Medium) | 164 | 157 | 1 | 7 | 99.4% | 95.7% | 97.5% | 86.0% | 0.883 |
| Tier 3 (Hard) | 146 | 136 | 6 | 10 | 95.8% | 93.2% | 94.4% | 83.8% | 0.859 |
Benchmark Dataset
Sentence Distribution
| Tier | Sentences | What It Tests |
|---|---|---|
| Tier 1 (Easy) | 182 | Invalid syllable structure |
| Tier 2 (Medium) | 379 | Valid syllable, wrong word |
| Tier 3 (Hard) | 133 | Valid word, wrong in context |
| Clean | 444 | False positive resistance |
The benchmark lives at benchmarks/myspellchecker_benchmark.yaml (1,138 sentences, 564 error spans) and covers 6 domains: conversational, academic, technical, news, religious, and literary.
Composite Score Formula
The latency term is normalized against a 500 ms p95 budget: latency_normalized = min(p95 / 500 ms, 1.0).
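The normalization above is a simple clamp; a minimal sketch (the function name is illustrative, not a library API):

```python
def latency_normalized(p95_ms: float, budget_ms: float = 500.0) -> float:
    """Clamp p95 latency to [0, 1] against the 500 ms budget."""
    return min(p95_ms / budget_ms, 1.0)

print(latency_normalized(125.0))  # well under budget → 0.25
print(latency_normalized(900.0))  # over budget clamps to → 1.0
```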
Running Benchmarks
Basic Run
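The entry-point script is not shown in this section; assuming a hypothetical `run_benchmark.py` driver, a basic run with the flags documented under Key Flags would look like:

```shell
# Hypothetical entry point -- substitute your actual benchmark script.
# --db is required; --benchmark and --level fall back to their defaults
# (myspellchecker_benchmark.yaml, word).
python run_benchmark.py \
  --db mySpellChecker_production.db \
  --level word

# With the semantic model and JSON-only output for CI:
python run_benchmark.py \
  --db mySpellChecker_production.db \
  --semantic path/to/semantic_v2.3/ \
  --json-only
```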
Key Flags
| Flag | Description |
|---|---|
| `--db` | Path to spell checker database (required) |
| `--benchmark` | Benchmark YAML path (default: myspellchecker_benchmark.yaml) |
| `--level` | Validation level: syllable or word (default: word) |
| `--semantic` | Path to ONNX semantic model directory |
| `--reranker` | Path to neural MLP reranker directory |
| `--ner` | Enable NER-based FP suppression |
| `--json-only` | Output JSON only, no human-readable summary |
| `--debug-strategy-gates` | Enable per-strategy gate telemetry |
Ablation Runs
Disable targeted rule groups to measure their impact.
Utility Scripts
Run Comparison
Compare two benchmark run artifacts to track regressions.
Rule Auditing
Audit targeted rerank rules from telemetry data.
Ablation Matrix
Run full ablation study (default + each group off + all off).
Semantic Model Evaluation
Head-to-head model comparison (confusable discrimination, logit analysis, perplexity).
DB Query Profiling
Instrument SQLiteProvider to count and time every database call per sentence.
Known Limitations
- 10 residual FPs: false positives on edge-case constructions, documented and accepted.
- 25 FNs without semantic: context-dependent errors requiring MLM; semantic model rescues 23 of 25.
- Suggestion quality plateau: the remaining rank > 1 cases involve inherent morpheme/compound ambiguity, where the same surface error pattern has conflicting gold corrections.
See Also
- Testing Guide - Unit, integration, and e2e tests
- Performance Tuning - Runtime optimization strategies
- Training Guide - Training semantic MLM and neural reranker models