Semantic checking goes beyond N-gram statistics by using BERT/RoBERTa masked language models to evaluate whether each word fits its surrounding context. This is an opt-in feature — mySpellChecker does not ship with pre-trained models. You train your own model on a Myanmar corpus using the training pipeline, then point the config at your exported ONNX file.
mySpellChecker offers two AI strategies:
- Semantic Checking (this page): MLM-based, masks each word → N forward passes (~200ms). Provides suggestions.
- Error Detection: Token classification, single forward pass (~10ms). Detects errors only.
Use Error Detection for speed, Semantic Checking for suggestions, or both together.
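The latency gap between the two follows directly from the number of forward passes. A back-of-the-envelope sketch (the per-pass costs below are illustrative assumptions, not measured figures):

```python
def mlm_latency_ms(n_words: int, per_pass_ms: float = 20.0) -> float:
    # Semantic checking masks each word separately: one forward pass per word.
    return n_words * per_pass_ms

def detector_latency_ms(per_pass_ms: float = 10.0) -> float:
    # Token classification labels every token in a single forward pass.
    return per_pass_ms

print(mlm_latency_ms(10))     # 200.0 ms for a 10-word sentence
print(detector_latency_ms())  # 10.0 ms regardless of length
```

At an assumed ~20ms per pass, a 10-word sentence costs ~200ms with masking but a single ~10ms pass with the detector, which is where the figures above come from.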
How It Works
Masked Language Modeling
Semantic checking uses BERT/RoBERTa-style masked language modeling:
# Input: "ထမင်း [MASK] ပြီ"
# Model predicts: {"စား": 0.85, "သွား": 0.03, "ရေး": 0.02, ...}
# If actual word is "သွား" but model strongly prefers "စား":
# → Flag as potential context error
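The decision sketched in the comments above can be written as a small rule. The margin test and the 0.5 threshold are illustrative assumptions, not the library's documented internals:

```python
def flag_context_error(predictions: dict[str, float], actual: str,
                       threshold: float = 0.5) -> bool:
    # predictions: MLM probabilities for candidate words at the masked position.
    best_word, best_prob = max(predictions.items(), key=lambda kv: kv[1])
    actual_prob = predictions.get(actual, 0.0)
    # Flag only when the model strongly prefers a *different* word.
    return best_word != actual and best_prob - actual_prob >= threshold

preds = {"စား": 0.85, "သွား": 0.03, "ရေး": 0.02}
print(flag_context_error(preds, "သွား"))  # True: model strongly prefers "စား"
```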
Confidence Scoring
The model provides confidence scores for:
- Original word fit: How well the word fits the context
- Alternative suggestions: Better-fitting words
- Error probability: Likelihood that original is wrong
result = {
    "original": "သွား",
    "original_prob": 0.03,
    "best_alternative": "စား",
    "best_prob": 0.85,
    "error_probability": 0.82,
}
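The example values are consistent with a simple probability margin (0.85 − 0.03 = 0.82). A reconstruction for illustration only, not necessarily the formula the library uses internally:

```python
def error_probability(original_prob: float, best_prob: float) -> float:
    # How much more the model believes the best alternative than the original.
    return max(0.0, best_prob - original_prob)

print(round(error_probability(0.03, 0.85), 2))  # 0.82
```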
Architecture
Model Architecture
Input Text → Tokenizer → BERT/RoBERTa → MLM Head → Predictions
                              ↓
                       Context Vectors
Components:
- Tokenizer: Myanmar-optimized WordPiece/BPE
- Encoder: BERT/RoBERTa transformer (6-12 layers)
- MLM Head: Masked language model prediction head
ONNX Runtime
For production, models are exported to ONNX for:
- Faster inference: Optimized runtime
- CPU efficiency: INT8 quantization
- Portability: No PyTorch dependency
# Model sizes:
# Full precision: ~400MB
# INT8 quantized: ~100MB
# INT4 quantized: ~50MB
Configuration
Enable Semantic Checking
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig
config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model_path="/path/to/model.onnx",
        tokenizer_path="/path/to/tokenizer",
    )
)
checker = SpellChecker(config=config)
Advanced Configuration
config = SpellCheckerConfig(
    semantic=SemanticConfig(
        # Model paths
        model_path="/path/to/model.onnx",
        tokenizer_path="/path/to/tokenizer",

        # Or provide pre-loaded instances
        model=loaded_model,
        tokenizer=loaded_tokenizer,

        # Inference settings
        num_threads=1,   # ONNX threads (CPU only, default: 1)
        predict_top_k=5,  # Top K predictions for suggestions
        check_top_k=10,   # Top K tokens to check for errors

        # Behavior settings
        use_semantic_refinement=True,   # Refine N-gram detection with AI
        use_proactive_scanning=False,   # Scan for errors independently (requires good model)
        proactive_confidence_threshold=0.5,  # Confidence for proactive detection
        scoring_confidence_threshold=0.3,    # Confidence for suggestion ranking

        # Backend settings
        use_pytorch=False,  # Force PyTorch backend instead of ONNX
        device="cpu",       # Device for inference ("cpu" or "cuda:0")
        logit_scale=None,   # Override auto-detected logit scale (None = auto)

        # Validation settings
        validate_model_architecture=True,    # Verify model architecture on load
        myanmar_text_ratio_threshold=0.5,    # Minimum Myanmar text ratio for processing
        word_alignment_enabled=True,         # Align subword predictions to word boundaries
    )
)
Graceful Degradation
If semantic checking fails to initialize, the checker continues without it:
# If model loading fails:
# - Logs warning
# - Returns None for semantic checker
# - Spell checking continues with other layers
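In your own wiring, the same warn-and-continue behavior can be reproduced with a plain try/except; `load_semantic_checker` below is a hypothetical stand-in for the real loader, shown only to illustrate the pattern:

```python
import logging
from typing import Any, Optional

logger = logging.getLogger("myspellchecker")

def load_semantic_checker(config: dict) -> Any:
    # Hypothetical loader: raises when model files are missing.
    raise FileNotFoundError(config.get("model_path", "<missing>"))

def init_semantic(config: dict) -> Optional[Any]:
    try:
        return load_semantic_checker(config)
    except Exception as exc:
        logger.warning("Semantic checker unavailable: %s", exc)
        return None  # spell checking continues with the other layers

checker = init_semantic({"model_path": "/no/such/model.onnx"})
print(checker)  # None
```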
Using Semantic Checking
Basic Usage
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig
from myspellchecker.core.constants import ValidationLevel
config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model_path="./my-model/model.onnx",          # Your trained model
        tokenizer_path="./my-model/tokenizer.json",  # Your trained tokenizer
    )
)
checker = SpellChecker(config=config)
# Enable semantic checking per-call
result = checker.check("ထမင်းသွား", level=ValidationLevel.WORD, use_semantic=True)
# Semantic checker provides deep context analysis
for error in result.errors:
    if error.error_type == "semantic_error":
        print(f"Semantic error: {error.text}")
        print(f"Confidence: {error.confidence}")
Direct API Access
from myspellchecker.algorithms.semantic_checker import SemanticChecker
semantic = SemanticChecker(
    model_path="./my-model/model.onnx",
    tokenizer_path="./my-model/tokenizer.json",
)
# Get predictions for a word position (mask and predict)
sentence = "ထမင်း သွား ပြီ"
target_word = "သွား"
predictions = semantic.predict_mask(sentence, target_word, top_k=5)
for word, prob in predictions:
    print(f"{word}: {prob:.3f}")
# Check if a word is a semantic error
neighbors = ["စား", "လာ", "ထွက်"] # Phonetically similar words
suggestion = semantic.is_semantic_error(sentence, target_word, neighbors)
if suggestion:
    print(f"Semantic error! Suggested: {suggestion}")
else:
    print("Word appears correct in context")
Sentence Scanning
# Scan entire sentence for semantic errors
sentence = "ထမင်း သွား ပြီ"
words = ["ထမင်း", "သွား", "ပြီ"]
errors = semantic.scan_sentence(
    sentence=sentence,
    words=words,
    min_word_len=2,
    confidence_threshold=0.3,
)
for idx, word, suggestions, confidence in errors:
    print(f"Word '{word}' at index {idx}: suggestions={suggestions}, confidence={confidence:.2f}")
Training Your Model
You need to train your own model before using semantic checking. The quickest path:
# Train semantic model on your Myanmar corpus
myspellchecker train-model \
--input corpus.txt \
--output ./my-model/ \
--architecture roberta \
--hidden-size 256 \
--layers 6 \
--epochs 10
This trains a tokenizer, model, and exports to ONNX in one step. See the Training Guide for full configuration options, corpus requirements, and GPU setup.
Manual ONNX Export
The train-model command automatically exports to ONNX format. For manual export
or re-quantization, use the Python API:
from myspellchecker.training import ONNXExporter
exporter = ONNXExporter()
exporter.export(
    model_dir="./my-model/",
    output_dir="./my-model/onnx/",
    quantize=True,  # Enable INT8 quantization (default: True)
)
| Metric | Value |
|---|---|
| Inference Time (CPU) | ~200ms per text |
| Inference Time (GPU) | ~20ms per text |
| Model Size | 100-400MB |
| Memory Usage | 200-500MB |
| Accuracy | ~95% for context errors |
Benchmarks
These numbers are representative of a model trained with the default configuration on a ~50K sentence corpus:
Configuration: 6 layers, 256 hidden, INT8 quantized
Hardware: CPU (8 cores)
Single text inference: 180ms
Batch (32 texts): 850ms (26.5ms/text)
Throughput: 37.6 texts/second
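The batch figures are internally consistent, as a quick check shows:

```python
batch_ms, batch_size = 850.0, 32
per_text_ms = batch_ms / batch_size          # 26.56 ms/text (~26.5 as reported)
throughput = batch_size / (batch_ms / 1000)  # ~37.6 texts/second
print(round(per_text_ms, 1), round(throughput, 1))  # 26.6 37.6
```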
Actual accuracy depends on your corpus quality, domain coverage, and model size. Train and evaluate on your own data.
Model Size Guide
These are example training configurations you can use with TrainingConfig. Results vary depending on corpus size and domain.
| Configuration | Layers | Hidden | Approx. Size | Speed | Use Case |
|---|---|---|---|---|---|
| Tiny | 3 | 128 | ~25MB | ~50ms | Mobile / edge deployment |
| Small (default) | 4 | 256 | ~100MB | ~180ms | General-purpose |
| Base | 12 | 512 | ~400MB | ~350ms | Maximum accuracy |
Common Patterns
Conditional Semantic Checking
def check_with_semantic_fallback(text: str, threshold: float = 0.5) -> dict:
    """Use semantic checking only for uncertain cases."""
    from myspellchecker.core.constants import ValidationLevel

    # First pass: word-level context checking
    checker = SpellChecker()
    result = checker.check(text, level=ValidationLevel.WORD)

    # Second pass: semantic for low-confidence errors
    uncertain_errors = [e for e in result.errors if e.confidence < threshold]
    if uncertain_errors and semantic_checker:  # a SemanticChecker initialized as shown earlier
        sentence = text
        words = [e.text for e in uncertain_errors]
        semantic_results = semantic_checker.scan_sentence(sentence, words)
        # Update confidence based on semantic analysis
    return result
Caching Predictions
from functools import lru_cache

@lru_cache(maxsize=10000)
def cached_semantic_check(context_tuple: tuple) -> dict:
    """Cache semantic predictions for common contexts."""
    left, word, right = context_tuple
    sentence = f"{left} {word} {right}"
    return semantic.is_semantic_error(sentence, word, [])
GPU Batch Processing
def process_large_corpus(sentences: list[tuple[str, list[str]]], batch_size: int = 64) -> list:
    """Scan a large corpus in fixed-size chunks; pair with a GPU-backed
    checker (device="cuda:0") for best throughput.

    Args:
        sentences: List of (sentence, words) tuples for scanning.
        batch_size: Number of sentences per chunk.
    """
    all_errors = []
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        for sentence, words in batch:
            errors = semantic.scan_sentence(sentence, words)
            all_errors.extend(errors)
    return all_errors
Troubleshooting
Issue: Model loading fails
Cause: Missing model files or incompatible version
Solution:
# Check model files exist
import os
assert os.path.exists("./my-model/model.onnx")
assert os.path.exists("./my-model/tokenizer.json")
# Verify ONNX Runtime version
import onnxruntime
print(onnxruntime.__version__)
Issue: Slow inference
Cause: Unquantized model or insufficient threads
Solution:
# Use quantized model (train-model exports quantized by default)
config = SemanticConfig(
    model_path="./my-model/model.onnx",  # Quantized ONNX
    num_threads=8,                       # More CPU threads
)
Issue: Poor accuracy
Cause: Model not trained on similar data or corpus too small
Solution: Retrain with a larger or more domain-specific corpus, or use a bigger architecture:
myspellchecker train-model \
--input domain_corpus.txt \
--output ./my-model-v2/ \
--hidden-size 512 \
--layers 6 \
--epochs 10
Issue: High memory usage
Cause: Large model architecture
Solution: Train a smaller model:
# Train a tiny model for constrained environments
myspellchecker train-model \
--input corpus.txt \
--output ./my-model-tiny/ \
--hidden-size 128 \
--layers 3
Next Steps