The benchmark suite evaluates mySpellChecker’s end-to-end accuracy (detection recall, suggestion quality, false positive rate, and latency) across 1,138 hand-annotated Myanmar sentences organized into difficulty tiers and 6 domains.

Current Results

Tested with `mySpellChecker_production.db` (565 MB, 601K words, full POS + enrichment tables) on macOS Apple Silicon, validation level `word`.
The dictionary database and semantic model (v2.3) used in these benchmarks are not included in the library. They were built from our own proprietary corpus using the data pipeline and training pipeline respectively. Your results will vary depending on the dictionary database you build and the semantic model you train. See Building Dictionaries and Training Models to create your own.

Overall Metrics (no semantic)

| Metric | Value |
| --- | --- |
| F1 | 96.2% |
| Precision | 97.8% |
| Recall | 94.7% |
| True Positives | 445 |
| False Positives | 10 |
| False Negatives | 25 |
| FPR (clean sentences) | 0.0% |
| Top-1 Suggestion Accuracy | 85.2% |
| MRR | 0.8731 |

Overall Metrics (with semantic v2.3)

| Metric | Value |
| --- | --- |
| F1 | 98.3% |
| Precision | 97.1% |
| Recall | 99.6% |
| True Positives | 468 |
| False Positives | 14 |
| False Negatives | 2 |
| FPR (clean sentences) | 0.0% |
| Top-1 Suggestion Accuracy | 81.2% |
| MRR | 0.8395 |
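
The detection metrics in these tables follow the standard definitions. As a quick sketch in Python (counts taken from the no-semantic table above; the toy rank list in the MRR check is illustrative only):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Detection precision, recall, and F1 from raw span counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def mean_reciprocal_rank(gold_ranks):
    """MRR over suggestion lists: 1/rank of the gold correction,
    contributing 0 when the gold correction is absent (rank None)."""
    return sum(0.0 if r is None else 1.0 / r for r in gold_ranks) / len(gold_ranks)

# No-semantic run: TP=445, FP=10, FN=25
p, r, f1 = precision_recall_f1(445, 10, 25)
print(f"P={p:.1%} R={r:.1%} F1={f1:.1%}")  # P=97.8% R=94.7% F1=96.2%
```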

Per-Tier Breakdown (no semantic)

| Tier | Errors | TP | FP | FN | Prec | Rec | F1 | Top-1 | MRR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tier 1 (Easy) | 160 | 152 | 3 | 8 | 98.1% | 95.0% | 96.5% | 85.5% | 0.876 |
| Tier 2 (Medium) | 164 | 157 | 1 | 7 | 99.4% | 95.7% | 97.5% | 86.0% | 0.883 |
| Tier 3 (Hard) | 146 | 136 | 6 | 10 | 95.8% | 93.2% | 94.4% | 83.8% | 0.859 |
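
As a sanity check, the overall no-semantic counts are the micro-level sums of the per-tier counts; a minimal sketch:

```python
# Per-tier (TP, FP, FN) counts from the breakdown table above.
tiers = {
    "Tier 1": (152, 3, 8),
    "Tier 2": (157, 1, 7),
    "Tier 3": (136, 6, 10),
}
tp = sum(t[0] for t in tiers.values())
fp = sum(t[1] for t in tiers.values())
fn = sum(t[2] for t in tiers.values())
print(tp, fp, fn)  # 445 10 25 — matches the overall no-semantic table
```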

Benchmark Dataset

Sentence Distribution

| Tier | Sentences | What It Tests |
| --- | --- | --- |
| Tier 1 (Easy) | 182 | Invalid syllable structure |
| Tier 2 (Medium) | 379 | Valid syllable, wrong word |
| Tier 3 (Hard) | 133 | Valid word, wrong in context |
| Clean | 444 | False positive resistance |

The benchmark is defined in `benchmarks/myspellchecker_benchmark.yaml` (1,138 sentences, 564 error spans) covering 6 domains: conversational, academic, technical, news, religious, and literary.

Composite Score Formula

```
composite = 0.30 * F1
          + 0.25 * MRR
          + 0.20 * (1 - FPR)
          + 0.15 * Top1_Accuracy
          + 0.10 * (1 - latency_normalized)
```

where `latency_normalized = min(p95 / 500ms, 1.0)`.
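
A worked example of the formula, using the no-semantic metrics above. The 120 ms p95 latency is a hypothetical placeholder, since per-run latency figures are not listed on this page:

```python
def composite_score(f1, mrr, fpr, top1, p95_ms, budget_ms=500.0):
    """Composite benchmark score per the weighted formula above."""
    latency_normalized = min(p95_ms / budget_ms, 1.0)
    return (0.30 * f1
            + 0.25 * mrr
            + 0.20 * (1 - fpr)
            + 0.15 * top1
            + 0.10 * (1 - latency_normalized))

# No-semantic run; p95_ms=120.0 is an assumed value for illustration.
score = composite_score(f1=0.962, mrr=0.8731, fpr=0.0, top1=0.852, p95_ms=120.0)
print(f"{score:.4f}")  # 0.9107
```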

Running Benchmarks

Basic Run

```bash
# Run with production database
python benchmarks/run_benchmark.py \
  --db data/mySpellChecker_production.db

# Run with semantic model
python benchmarks/run_benchmark.py \
  --db data/mySpellChecker_production.db \
  --semantic /path/to/semantic-model/

# JSON output only (for automation)
python benchmarks/run_benchmark.py \
  --db data/mySpellChecker_production.db \
  --json-only
```

Key Flags

| Flag | Description |
| --- | --- |
| `--db` | Path to spell checker database (required) |
| `--benchmark` | Benchmark YAML path (default: `myspellchecker_benchmark.yaml`) |
| `--level` | Validation level: `syllable` or `word` (default: `word`) |
| `--semantic` | Path to ONNX semantic model directory |
| `--reranker` | Path to neural MLP reranker directory |
| `--ner` | Enable NER-based FP suppression |
| `--json-only` | Output JSON only, no human-readable summary |
| `--debug-strategy-gates` | Enable per-strategy gate telemetry |

Ablation Runs

Disable targeted rule groups to measure their impact:

```bash
python benchmarks/run_benchmark.py \
  --db data/mySpellChecker_production.db \
  --disable-targeted-rerank-hints \
  --disable-targeted-candidate-injections \
  --disable-targeted-grammar-completion-templates \
  --json-only
```

Utility Scripts

Run Comparison

Compare two benchmark run artifacts to track regressions:

```bash
python benchmarks/compare_runs.py \
  --baseline run_a.json \
  --current run_b.json \
  --output-json comparison.json \
  --output-md comparison.md
```

Rule Auditing

Audit targeted rerank rules from telemetry data:

```bash
python benchmarks/audit_targeted_rules.py \
  --reports run.json \
  --output-json audit.json \
  --output-md audit.md
```

Ablation Matrix

Run the full ablation study (default + each group off + all off):

```bash
python benchmarks/run_ablation.py \
  --db data/mySpellChecker_production.db \
  --level word \
  --semantic /path/to/semantic-model/ \
  --output-dir ablation_results/
```

Semantic Model Evaluation

Head-to-head model comparison (confusable discrimination, logit analysis, perplexity):

```bash
python benchmarks/semantic_model_eval.py \
  --models v2.3=/path/to/v2.3-final \
  --db data/mySpellChecker_production.db
```

DB Query Profiling

Instrument `SQLiteProvider` to count and time every database call per sentence:

```bash
python benchmarks/profile_db_queries.py \
  --db data/mySpellChecker_production.db \
  --output profile_report.json
```

Known Limitations

  1. **10 residual FPs**: false positives on edge-case constructions, documented and accepted.
  2. **25 FNs without semantic**: context-dependent errors that require masked-language-model (MLM) scoring; the semantic model rescues 23 of the 25.
  3. **Suggestion quality plateau**: the remaining rank>1 cases are inherent morpheme/compound ambiguities where the same error pattern has conflicting gold corrections.

See Also