The benchmark suite evaluates mySpellChecker’s end-to-end accuracy (detection recall, suggestion quality, false positive rate, and latency) across 1,138 hand-annotated Myanmar sentences organized into difficulty tiers and 6 domains.

Current Results

Tested with `mySpellChecker_production.db` (565 MB, 601K words, full POS + enrichment tables) on macOS Apple Silicon, validation level `word`.
The dictionary database and semantic model (v2.3) used in these benchmarks are not included in the library. They were built from our own proprietary corpus using the data pipeline and training pipeline respectively. Your results will vary depending on the dictionary database you build and the semantic model you train. See Building Dictionaries and Training Models to create your own.

Overall Metrics (no semantic)

| Metric | Value |
| --- | --- |
| F1 | 96.2% |
| Precision | 97.8% |
| Recall | 94.7% |
| True Positives | 445 |
| False Positives | 10 |
| False Negatives | 25 |
| FPR (clean sentences) | 0.0% |
| Top-1 Suggestion Accuracy | 85.2% |
| MRR | 0.8731 |

Overall Metrics (with semantic v2.3)

| Metric | Value |
| --- | --- |
| F1 | 98.3% |
| Precision | 97.1% |
| Recall | 99.6% |
| True Positives | 468 |
| False Positives | 14 |
| False Negatives | 2 |
| FPR (clean sentences) | 0.0% |
| Top-1 Suggestion Accuracy | 81.2% |
| MRR | 0.8395 |
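
The detection metrics in these tables follow the standard definitions. As a quick sketch in Python (counts taken from the no-semantic table above; the toy rank list in the MRR check is illustrative only):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Detection precision, recall, and F1 from raw span counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def mean_reciprocal_rank(gold_ranks):
    """MRR over suggestion lists: 1/rank of the gold correction,
    contributing 0 when the gold correction is absent (rank None)."""
    return sum(0.0 if r is None else 1.0 / r for r in gold_ranks) / len(gold_ranks)

# No-semantic run: TP=445, FP=10, FN=25
p, r, f1 = precision_recall_f1(445, 10, 25)
print(f"P={p:.1%} R={r:.1%} F1={f1:.1%}")  # P=97.8% R=94.7% F1=96.2%
```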

Per-Tier Breakdown (no semantic)

| Tier | Errors | TP | FP | FN | Prec | Rec | F1 | Top-1 | MRR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tier 1 (Easy) | 160 | 152 | 3 | 8 | 98.1% | 95.0% | 96.5% | 85.5% | 0.876 |
| Tier 2 (Medium) | 164 | 157 | 1 | 7 | 99.4% | 95.7% | 97.5% | 86.0% | 0.883 |
| Tier 3 (Hard) | 146 | 136 | 6 | 10 | 95.8% | 93.2% | 94.4% | 83.8% | 0.859 |
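
As a sanity check, the overall no-semantic counts are the micro-level sums of the per-tier counts; a minimal sketch:

```python
# Per-tier (TP, FP, FN) counts from the breakdown table above.
tiers = {
    "Tier 1": (152, 3, 8),
    "Tier 2": (157, 1, 7),
    "Tier 3": (136, 6, 10),
}
tp = sum(t[0] for t in tiers.values())
fp = sum(t[1] for t in tiers.values())
fn = sum(t[2] for t in tiers.values())
print(tp, fp, fn)  # 445 10 25 — matches the overall no-semantic table
```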

Benchmark Dataset

Sentence Distribution

| Tier | Sentences | What It Tests |
| --- | --- | --- |
| Tier 1 (Easy) | 182 | Invalid syllable structure |
| Tier 2 (Medium) | 379 | Valid syllable, wrong word |
| Tier 3 (Hard) | 133 | Valid word, wrong in context |
| Clean | 444 | False positive resistance |

The benchmark is defined in `benchmarks/myspellchecker_benchmark.yaml` (1,138 sentences, 564 error spans) covering 6 domains: conversational, academic, technical, news, religious, and literary.

Composite Score Formula

```
composite = 0.30 * F1
          + 0.25 * MRR
          + 0.20 * (1 - FPR)
          + 0.15 * Top1_Accuracy
          + 0.10 * (1 - latency_normalized)
```

where `latency_normalized = min(p95 / 500ms, 1.0)`.
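
A worked example of the formula, using the no-semantic metrics above. The 120 ms p95 latency is a hypothetical placeholder, since per-run latency figures are not listed on this page:

```python
def composite_score(f1, mrr, fpr, top1, p95_ms, budget_ms=500.0):
    """Composite benchmark score per the weighted formula above."""
    latency_normalized = min(p95_ms / budget_ms, 1.0)
    return (0.30 * f1
            + 0.25 * mrr
            + 0.20 * (1 - fpr)
            + 0.15 * top1
            + 0.10 * (1 - latency_normalized))

# No-semantic run; p95_ms=120.0 is an assumed value for illustration.
score = composite_score(f1=0.962, mrr=0.8731, fpr=0.0, top1=0.852, p95_ms=120.0)
print(f"{score:.4f}")  # 0.9107
```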

Running Benchmarks

Basic Run

```bash
# Run with production database
python benchmarks/run_benchmark.py \
  --db data/mySpellChecker_production.db

# Run with semantic model
python benchmarks/run_benchmark.py \
  --db data/mySpellChecker_production.db \
  --semantic /path/to/semantic-model/

# JSON output only (for automation)
python benchmarks/run_benchmark.py \
  --db data/mySpellChecker_production.db \
  --json-only
```

Key Flags

| Flag | Description |
| --- | --- |
| `--db` | Path to spell checker database (required) |
| `--benchmark` | Benchmark YAML path (default: `myspellchecker_benchmark.yaml`) |
| `--level` | Validation level: `syllable` or `word` (default: `word`) |
| `--semantic` | Path to ONNX semantic model directory |
| `--reranker` | Path to neural MLP reranker directory |
| `--ner` | Enable NER-based FP suppression |
| `--json-only` | Output JSON only, no human-readable summary |
| `--debug-strategy-gates` | Enable per-strategy gate telemetry |

Ablation Runs

Disable targeted rule groups to measure their impact:

```bash
python benchmarks/run_benchmark.py \
  --db data/mySpellChecker_production.db \
  --disable-targeted-rerank-hints \
  --disable-targeted-candidate-injections \
  --disable-targeted-grammar-completion-templates \
  --json-only
```

Utility Scripts

Run Comparison

Compare two benchmark run artifacts to track regressions:

```bash
python benchmarks/compare_runs.py \
  --baseline run_a.json \
  --current run_b.json \
  --output-json comparison.json \
  --output-md comparison.md
```

Rule Auditing

Audit targeted rerank rules from telemetry data:

```bash
python benchmarks/audit_targeted_rules.py \
  --reports run.json \
  --output-json audit.json \
  --output-md audit.md
```

Ablation Matrix

Run the full ablation study (default + each group off + all off):

```bash
python benchmarks/run_ablation.py \
  --db data/mySpellChecker_production.db \
  --level word \
  --semantic /path/to/semantic-model/ \
  --output-dir ablation_results/
```

Semantic Model Evaluation

Head-to-head model comparison (confusable discrimination, logit analysis, perplexity):

```bash
python benchmarks/semantic_model_eval.py \
  --models v2.3=/path/to/v2.3-final \
  --db data/mySpellChecker_production.db
```

DB Query Profiling

Instrument `SQLiteProvider` to count and time every database call per sentence:

```bash
python benchmarks/profile_db_queries.py \
  --db data/mySpellChecker_production.db \
  --output profile_report.json
```

Known Limitations

  1. **10 residual FPs**: false positives on edge-case constructions, documented and accepted.
  2. **25 FNs without semantic**: context-dependent errors that require masked-language-model (MLM) scoring; the semantic model rescues 23 of the 25.
  3. **Suggestion quality plateau**: the remaining rank>1 cases are inherent morpheme/compound ambiguities where the same error pattern has conflicting gold corrections.

See Also