The benchmark suite evaluates mySpellChecker's end-to-end performance (detection recall, suggestion quality, false positive rate, and latency) on 1,138 hand-annotated Myanmar sentences, organized into three difficulty tiers plus a clean set and spanning 6 domains.
Current Results
Tested with mySpellChecker_production.db (565 MB, 601K words, full POS + enrichment tables) on macOS Apple Silicon, with the validation level set to word.
The dictionary database and semantic model (v2.3) used in these benchmarks are not included in the library. They were built from our own proprietary corpus using the data pipeline and training pipeline respectively. Your results will vary depending on the dictionary database you build and the semantic model you train. See Building Dictionaries and Training Models to create your own.
Overall Metrics (no semantic)
| Metric | Value |
|---|---|
| F1 | 96.2% |
| Precision | 97.8% |
| Recall | 94.7% |
| True Positives | 445 |
| False Positives | 10 |
| False Negatives | 25 |
| FPR (clean sentences) | 0.0% |
| Top-1 Suggestion Accuracy | 85.2% |
| MRR | 0.8731 |
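The precision, recall, and F1 values in this table follow from the true positive, false positive, and false negative counts using the standard definitions. The sketch below recomputes them; the function name `detection_metrics` is illustrative and not part of the library.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall, and F1 from detection counts (standard definitions)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Counts taken from the no-semantic table.
m = detection_metrics(tp=445, fp=10, fn=25)
print({name: f"{value:.1%}" for name, value in m.items()})
# → {'precision': '97.8%', 'recall': '94.7%', 'f1': '96.2%'}
```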
Overall Metrics (with semantic v2.3)
| Metric | Value |
|---|---|
| F1 | 98.3% |
| Precision | 97.1% |
| Recall | 99.6% |
| True Positives | 468 |
| False Positives | 14 |
| False Negatives | 2 |
| FPR (clean sentences) | 0.0% |
| Top-1 Suggestion Accuracy | 81.2% |
| MRR | 0.8395 |
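Top-1 accuracy and MRR (mean reciprocal rank) both summarize where the gold correction lands in the suggestion list. A minimal sketch of the usual definitions follows. The function name, the input shape, and the choice to score a missing gold correction as 0 are assumptions for illustration; the benchmark runner may handle these cases differently.

```python
from typing import Dict, List, Optional


def suggestion_metrics(gold_ranks: List[Optional[int]]) -> Dict[str, float]:
    """Top-1 accuracy and MRR over detected errors.

    gold_ranks[i] is the 1-based rank of the gold correction among the
    suggestions for error i, or None when it does not appear at all
    (assumed here to contribute 0 to both metrics).
    """
    n = len(gold_ranks)
    if n == 0:
        return {"top1": 0.0, "mrr": 0.0}
    top1 = sum(1 for rank in gold_ranks if rank == 1) / n
    mrr = sum(1.0 / rank for rank in gold_ranks if rank) / n
    return {"top1": top1, "mrr": mrr}


# Hypothetical ranks: two first-place hits, one second-place hit, one miss.
example = suggestion_metrics([1, 1, 2, None])
print(example)  # → {'top1': 0.5, 'mrr': 0.625}
```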
Per-Tier Breakdown (no semantic)
| Tier | Errors | TP | FP | FN | Precision | Recall | F1 | Top-1 | MRR |
|---|---|---|---|---|---|---|---|---|---|
| Tier 1 (Easy) | 160 | 152 | 3 | 8 | 98.1% | 95.0% | 96.5% | 85.5% | 0.876 |
| Tier 2 (Medium) | 164 | 157 | 1 | 7 | 99.4% | 95.7% | 97.5% | 86.0% | 0.883 |
| Tier 3 (Hard) | 146 | 136 | 6 | 10 | 95.8% | 93.2% | 94.4% | 83.8% | 0.859 |
Benchmark Dataset
Sentence Distribution
| Tier | Sentences | What It Tests |
|---|---|---|
| Tier 1 (Easy) | 182 | Invalid syllable structure |
| Tier 2 (Medium) | 379 | Valid syllable, wrong word |
| Tier 3 (Hard) | 133 | Valid word, wrong in context |
| Clean | 444 | False positive resistance |
The benchmark is defined in benchmarks/myspellchecker_benchmark.yaml (1,138 sentences, 564 error spans) covering 6 domains: conversational, academic, technical, news, religious, and literary.
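The composite score is a weighted sum of the metrics reported above:
composite = 0.30 * F1
          + 0.25 * MRR
          + 0.20 * (1 - FPR)
          + 0.15 * Top1_Accuracy
          + 0.10 * (1 - latency_normalized)
where latency_normalized = min(p95 / 500 ms, 1.0) and p95 is the 95th-percentile latency in milliseconds.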
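The composite formula can be written as a small function. This is a sketch under two assumptions: every metric is passed as a fraction between 0 and 1, and the p95 latency is given in milliseconds. The function name is illustrative, and the 100 ms latency in the example is a made-up value, since this page does not report latency figures.

```python
def composite_score(
    f1: float,
    mrr: float,
    fpr: float,
    top1_accuracy: float,
    p95_ms: float,
    latency_cap_ms: float = 500.0,
) -> float:
    """Weighted composite as defined in the formula above."""
    # Latency is capped so that anything at or beyond 500 ms scores zero.
    latency_normalized = min(p95_ms / latency_cap_ms, 1.0)
    return (
        0.30 * f1
        + 0.25 * mrr
        + 0.20 * (1.0 - fpr)
        + 0.15 * top1_accuracy
        + 0.10 * (1.0 - latency_normalized)
    )


# No-semantic metrics from the tables above; p95_ms=100.0 is hypothetical.
score = composite_score(f1=0.962, mrr=0.8731, fpr=0.0, top1_accuracy=0.852, p95_ms=100.0)
print(f"{score:.4f}")
```

Because of the cap, a run with a 500 ms p95 and a run with a 2,000 ms p95 receive the same latency term, so the composite cannot distinguish between them.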
Running Benchmarks
Basic Run
# Run with production database
python benchmarks/run_benchmark.py \
--db data/mySpellChecker_production.db
# Run with semantic model
python benchmarks/run_benchmark.py \
--db data/mySpellChecker_production.db \
--semantic /path/to/semantic-model/
# JSON output only (for automation)
python benchmarks/run_benchmark.py \
--db data/mySpellChecker_production.db \
--json-only
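For automation, the same invocations can be assembled and run from Python. The sketch below uses only flags documented on this page. The helper names are hypothetical, and `run_benchmark_json` assumes that `--json-only` writes a single JSON document to stdout, which should be verified against the script before relying on it.

```python
import json
import shlex
import subprocess
from typing import List, Optional


def build_benchmark_command(
    db: str,
    semantic: Optional[str] = None,
    level: Optional[str] = None,
    json_only: bool = False,
) -> List[str]:
    """Build the argument list for benchmarks/run_benchmark.py."""
    cmd = ["python", "benchmarks/run_benchmark.py", "--db", db]
    if level:
        cmd += ["--level", level]
    if semantic:
        cmd += ["--semantic", semantic]
    if json_only:
        cmd.append("--json-only")
    return cmd


def run_benchmark_json(cmd: List[str]) -> dict:
    """Run the benchmark and parse stdout as JSON (assumed output format)."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


cmd = build_benchmark_command("data/mySpellChecker_production.db", json_only=True)
print(shlex.join(cmd))
```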
Key Flags
| Flag | Description |
|---|---|
| --db | Path to spell checker database (required) |
| --benchmark | Benchmark YAML path (default: myspellchecker_benchmark.yaml) |
| --level | Validation level: syllable or word (default: word) |
| --semantic | Path to ONNX semantic model directory |
| --reranker | Path to neural MLP reranker directory |
| --ner | Enable NER-based FP suppression |
| --json-only | Output JSON only, no human-readable summary |
| --debug-strategy-gates | Enable per-strategy gate telemetry |
Ablation Runs
Disable targeted rule groups to measure their impact:
python benchmarks/run_benchmark.py \
--db data/mySpellChecker_production.db \
--disable-targeted-rerank-hints \
--disable-targeted-candidate-injections \
--disable-targeted-grammar-completion-templates \
--json-only
Utility Scripts
Run Comparison
Compare two benchmark run artifacts to track regressions:
python benchmarks/compare_runs.py \
--baseline run_a.json \
--current run_b.json \
--output-json comparison.json \
--output-md comparison.md
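The idea behind a regression check can be illustrated with a few lines of Python. This is not the logic of compare_runs.py: the flat metric dictionaries, the key names, and the tolerance are all assumptions for illustration, and the real run artifacts may be structured differently. The example values are the overall metrics reported on this page, expressed as fractions.

```python
from typing import Dict


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.005,
) -> Dict[str, float]:
    """Return metrics (higher is better) that dropped by more than `tolerance`."""
    regressions = {}
    for name in sorted(baseline.keys() & current.keys()):
        delta = current[name] - baseline[name]
        if delta < -tolerance:
            regressions[name] = round(delta, 4)
    return regressions


# Hypothetical flat layout, filled with the overall metrics from this page.
no_semantic = {"f1": 0.962, "precision": 0.978, "recall": 0.947, "top1": 0.852, "mrr": 0.8731}
with_semantic = {"f1": 0.983, "precision": 0.971, "recall": 0.996, "top1": 0.812, "mrr": 0.8395}

print(find_regressions(no_semantic, with_semantic))
```

With these inputs the check flags precision, Top-1 accuracy, and MRR, which matches the tables: the semantic model raises recall and F1 but lowers suggestion ranking quality.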
Rule Auditing
Audit targeted rerank rules from telemetry data:
python benchmarks/audit_targeted_rules.py \
--reports run.json \
--output-json audit.json \
--output-md audit.md
Ablation Matrix
Run the full ablation study (default configuration, each rule group disabled in turn, and all groups disabled):
python benchmarks/run_ablation.py \
--db data/mySpellChecker_production.db \
--level word \
--semantic /path/to/semantic-model/ \
--output-dir ablation_results/
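The shape of the matrix (default + each group off + all off) can be enumerated as follows. This sketch assumes the groups correspond to the three `--disable-targeted-*` flags shown under Ablation Runs; run_ablation.py may define its groups differently, and the configuration labels are made up for illustration.

```python
from typing import Dict, List

# Assumption: these three flags are the rule groups covered by the ablation study.
GROUP_FLAGS = [
    "--disable-targeted-rerank-hints",
    "--disable-targeted-candidate-injections",
    "--disable-targeted-grammar-completion-templates",
]


def ablation_matrix(group_flags: List[str]) -> Dict[str, List[str]]:
    """Map a configuration label to the extra flags that configuration adds."""
    prefix = "--disable-"
    configs: Dict[str, List[str]] = {"default": []}
    for flag in group_flags:
        label = "off:" + (flag[len(prefix):] if flag.startswith(prefix) else flag)
        configs[label] = [flag]
    configs["all_off"] = list(group_flags)
    return configs


matrix = ablation_matrix(GROUP_FLAGS)
for label, flags in matrix.items():
    print(label, flags)
```

For N rule groups this yields N + 2 runs, so three groups produce five benchmark runs.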
Semantic Model Evaluation
Head-to-head model comparison (confusable discrimination, logit analysis, perplexity):
python benchmarks/semantic_model_eval.py \
--models v2.3=/path/to/v2.3-final \
--db data/mySpellChecker_production.db
DB Query Profiling
Instrument SQLiteProvider to count and time every database call per sentence:
python benchmarks/profile_db_queries.py \
--db data/mySpellChecker_production.db \
--output profile_report.json
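The general technique, counting and timing each query by wrapping the connection, can be sketched with the standard-library `sqlite3` module. This is not the code in profile_db_queries.py and does not touch SQLiteProvider: the wrapper class, the query labels, and the `words` table are all hypothetical.

```python
import sqlite3
import time
from collections import defaultdict
from typing import Any, Dict, List, Sequence


class CountingConnection:
    """Wraps a sqlite3 connection and records call counts and elapsed time per label."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self.stats: Dict[str, Dict[str, float]] = defaultdict(
            lambda: {"calls": 0, "seconds": 0.0}
        )

    def execute(self, label: str, sql: str, params: Sequence[Any] = ()) -> List[tuple]:
        start = time.perf_counter()
        rows = self._conn.execute(sql, params).fetchall()
        elapsed = time.perf_counter() - start
        entry = self.stats[label]
        entry["calls"] += 1
        entry["seconds"] += elapsed
        return rows


# Hypothetical in-memory schema, used only to exercise the wrapper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, freq INTEGER)")
conn.executemany("INSERT INTO words VALUES (?, ?)", [("alpha", 10), ("beta", 3)])

profiler = CountingConnection(conn)
for word in ("alpha", "gamma"):
    profiler.execute("lookup", "SELECT freq FROM words WHERE word = ?", (word,))

print(dict(profiler.stats))
```

Resetting `stats` between sentences would give the per-sentence breakdown described above.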
Known Limitations
- 10 residual FPs: false positives on edge-case constructions, documented and accepted.
- 25 FNs without the semantic model: context-dependent errors that require a masked language model (MLM); the semantic model recovers 23 of the 25.
- Suggestion quality plateau: the remaining rank>1 cases stem from inherent morpheme/compound ambiguity, where the same error pattern has conflicting gold corrections.
See Also