Semantic checking goes beyond N-gram statistics by using BERT/RoBERTa masked language models to evaluate whether each word fits its surrounding context. This is an opt-in feature — mySpellChecker does not ship with pre-trained models. You train your own model on a Myanmar corpus using the training pipeline, then point the config at your exported ONNX file.
mySpellChecker offers two AI strategies:
- Semantic Checking (this page): MLM-based, masks each word → N forward passes (~200ms). Provides suggestions.
- Error Detection: Token classification, single forward pass (~10ms). Detects errors only.
Use Error Detection for speed, Semantic Checking for suggestions, or both together.
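The latency gap between the two follows directly from the number of forward passes. A back-of-the-envelope sketch (the per-pass costs below are illustrative assumptions, not measured figures):

```python
def mlm_latency_ms(n_words: int, per_pass_ms: float = 20.0) -> float:
    # Semantic checking masks each word separately: one forward pass per word.
    return n_words * per_pass_ms

def detector_latency_ms(per_pass_ms: float = 10.0) -> float:
    # Token classification labels every token in a single forward pass.
    return per_pass_ms

print(mlm_latency_ms(10))     # 200.0 ms for a 10-word sentence
print(detector_latency_ms())  # 10.0 ms regardless of length
```

At an assumed ~20ms per pass, a 10-word sentence costs ~200ms with masking but a single ~10ms pass with the detector, which is where the figures above come from.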
How It Works
Masked Language Modeling
Semantic checking uses BERT/RoBERTa-style masked language modeling:
# Input: "ထမင်း [MASK] ပြီ"
# Model predicts: {"စား": 0.85, "သွား": 0.03, "ရေး": 0.02, ...}
# If actual word is "သွား" but model strongly prefers "စား":
# → Flag as potential context error
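The decision sketched in the comments above can be written as a small rule. The margin test and the 0.5 threshold are illustrative assumptions, not the library's documented internals:

```python
def flag_context_error(predictions: dict[str, float], actual: str,
                       threshold: float = 0.5) -> bool:
    # predictions: MLM probabilities for candidate words at the masked position.
    best_word, best_prob = max(predictions.items(), key=lambda kv: kv[1])
    actual_prob = predictions.get(actual, 0.0)
    # Flag only when the model strongly prefers a *different* word.
    return best_word != actual and best_prob - actual_prob >= threshold

preds = {"စား": 0.85, "သွား": 0.03, "ရေး": 0.02}
print(flag_context_error(preds, "သွား"))  # True: model strongly prefers "စား"
```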
Confidence Scoring
The model provides confidence scores for:
- Original word fit: How well the word fits the context
- Alternative suggestions: Better-fitting words
- Error probability: Likelihood that original is wrong
result = {
    "original": "သွား",
    "original_prob": 0.03,
    "best_alternative": "စား",
    "best_prob": 0.85,
    "error_probability": 0.82,
}
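The example values are consistent with a simple probability margin (0.85 − 0.03 = 0.82). A reconstruction for illustration only, not necessarily the formula the library uses internally:

```python
def error_probability(original_prob: float, best_prob: float) -> float:
    # How much more the model believes the best alternative than the original.
    return max(0.0, best_prob - original_prob)

print(round(error_probability(0.03, 0.85), 2))  # 0.82
```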
Architecture
Model Architecture
Input Text → Tokenizer → BERT/RoBERTa → MLM Head → Predictions
                              ↓
                       Context Vectors
Components:
- Tokenizer: Myanmar-optimized WordPiece/BPE
- Encoder: BERT/RoBERTa transformer (6-12 layers)
- MLM Head: Masked language model prediction head
ONNX Runtime
For production, models are exported to ONNX for:
- Faster inference: Optimized runtime
- CPU efficiency: INT8 quantization
- Portability: No PyTorch dependency
# Model sizes:
# Full precision: ~400MB
# INT8 quantized: ~100MB
# INT4 quantized: ~50MB
Configuration
Enable Semantic Checking
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig
config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model_path="/path/to/model.onnx",
        tokenizer_path="/path/to/tokenizer",
    )
)
checker = SpellChecker(config=config)
Advanced Configuration
config = SpellCheckerConfig(
    semantic=SemanticConfig(
        # Model paths
        model_path="/path/to/model.onnx",
        tokenizer_path="/path/to/tokenizer",

        # Or provide pre-loaded instances
        model=loaded_model,
        tokenizer=loaded_tokenizer,

        # Inference settings
        num_threads=1,   # ONNX threads (CPU only, default: 1)
        predict_top_k=5,  # Top K predictions for suggestions
        check_top_k=10,   # Top K tokens to check for errors

        # Behavior settings
        use_semantic_refinement=True,   # Refine N-gram detection with AI
        use_proactive_scanning=False,   # Scan for errors independently (requires good model)
        proactive_confidence_threshold=0.5,  # Confidence for proactive detection
        scoring_confidence_threshold=0.3,    # Confidence for suggestion ranking

        # Backend settings
        use_pytorch=False,  # Force PyTorch backend instead of ONNX
        device="cpu",       # Device for inference ("cpu" or "cuda:0")
        logit_scale=None,   # Override auto-detected logit scale (None = auto)

        # Validation settings
        validate_model_architecture=True,    # Verify model architecture on load
        myanmar_text_ratio_threshold=0.5,    # Minimum Myanmar text ratio for processing
        word_alignment_enabled=True,         # Align subword predictions to word boundaries
    )
)
Graceful Degradation
If semantic checking fails to initialize, the checker continues without it:
# If model loading fails:
# - Logs warning
# - Returns None for semantic checker
# - Spell checking continues with other layers
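In your own wiring, the same warn-and-continue behavior can be reproduced with a plain try/except; `load_semantic_checker` below is a hypothetical stand-in for the real loader, shown only to illustrate the pattern:

```python
import logging
from typing import Any, Optional

logger = logging.getLogger("myspellchecker")

def load_semantic_checker(config: dict) -> Any:
    # Hypothetical loader: raises when model files are missing.
    raise FileNotFoundError(config.get("model_path", "<missing>"))

def init_semantic(config: dict) -> Optional[Any]:
    try:
        return load_semantic_checker(config)
    except Exception as exc:
        logger.warning("Semantic checker unavailable: %s", exc)
        return None  # spell checking continues with the other layers

checker = init_semantic({"model_path": "/no/such/model.onnx"})
print(checker)  # None
```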
Using Semantic Checking
Basic Usage
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig
from myspellchecker.core.constants import ValidationLevel
config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model_path="./my-model/model.onnx",          # Your trained model
        tokenizer_path="./my-model/tokenizer.json",  # Your trained tokenizer
    )
)
checker = SpellChecker(config=config)
# Enable semantic checking per-call
result = checker.check("ထမင်းသွား", level=ValidationLevel.WORD, use_semantic=True)
# Semantic checker provides deep context analysis
for error in result.errors:
    if error.error_type == "semantic_error":
        print(f"Semantic error: {error.text}")
        print(f"Confidence: {error.confidence}")
Direct API Access
from myspellchecker.algorithms.semantic_checker import SemanticChecker
semantic = SemanticChecker(
    model_path="./my-model/model.onnx",
    tokenizer_path="./my-model/tokenizer.json",
)
# Get predictions for a word position (mask and predict)
sentence = "ထမင်း သွား ပြီ"
target_word = "သွား"
predictions = semantic.predict_mask(sentence, target_word, top_k=5)
for word, prob in predictions:
    print(f"{word}: {prob:.3f}")
# Check if a word is a semantic error
neighbors = ["စား", "လာ", "ထွက်"] # Phonetically similar words
suggestion = semantic.is_semantic_error(sentence, target_word, neighbors)
if suggestion:
    print(f"Semantic error! Suggested: {suggestion}")
else:
    print("Word appears correct in context")
Sentence Scanning
# Scan entire sentence for semantic errors
sentence = "ထမင်း သွား ပြီ"
words = ["ထမင်း", "သွား", "ပြီ"]
errors = semantic.scan_sentence(
    sentence=sentence,
    words=words,
    min_word_len=2,
    confidence_threshold=0.3,
)
for idx, word, suggestions, confidence in errors:
    print(f"Word '{word}' at index {idx}: suggestions={suggestions}, confidence={confidence:.2f}")
Training Your Model
You need to train your own model before using semantic checking. The quickest path:
# Train semantic model on your Myanmar corpus
myspellchecker train-model \
--input corpus.txt \
--output ./my-model/ \
--architecture roberta \
--hidden-size 256 \
--layers 6 \
--epochs 10
This trains a tokenizer, model, and exports to ONNX in one step. See the Training Guide for full configuration options, corpus requirements, and GPU setup.
Manual ONNX Export
The train-model command automatically exports to ONNX format. For manual export
or re-quantization, use the Python API:
from myspellchecker.training import ONNXExporter
exporter = ONNXExporter()
exporter.export(
    model_dir="./my-model/",
    output_dir="./my-model/onnx/",
    quantize=True,  # Enable INT8 quantization (default: True)
)
| Metric | Value |
|---|---|
| Inference Time (CPU) | ~200ms per text |
| Inference Time (GPU) | ~20ms per text |
| Model Size | 100-400MB |
| Memory Usage | 200-500MB |
| Accuracy | ~95% for context errors |
Benchmarks
These numbers are representative of a model trained with the default configuration on a ~50K sentence corpus:
Configuration: 6 layers, 256 hidden, INT8 quantized
Hardware: CPU (8 cores)
Single text inference: 180ms
Batch (32 texts): 850ms (26.5ms/text)
Throughput: 37.6 texts/second
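The batch figures are internally consistent, as a quick check shows:

```python
batch_ms, batch_size = 850.0, 32
per_text_ms = batch_ms / batch_size          # 26.56 ms/text (~26.5 as reported)
throughput = batch_size / (batch_ms / 1000)  # ~37.6 texts/second
print(round(per_text_ms, 1), round(throughput, 1))  # 26.6 37.6
```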
Actual accuracy depends on your corpus quality, domain coverage, and model size. Train and evaluate on your own data.
Model Size Guide
These are example training configurations you can use with TrainingConfig. Results vary depending on corpus size and domain.
| Configuration | Layers | Hidden | Approx. Size | Speed | Use Case |
|---|---|---|---|---|---|
| Tiny | 3 | 128 | ~25MB | ~50ms | Mobile / edge deployment |
| Small (default) | 4 | 256 | ~100MB | ~180ms | General-purpose |
| Base | 12 | 512 | ~400MB | ~350ms | Maximum accuracy |
Common Patterns
Conditional Semantic Checking
def check_with_semantic_fallback(text: str, threshold: float = 0.5) -> dict:
    """Use semantic checking only for uncertain cases."""
    from myspellchecker.core.constants import ValidationLevel

    # First pass: word-level context checking
    checker = SpellChecker()
    result = checker.check(text, level=ValidationLevel.WORD)

    # Second pass: semantic for low-confidence errors
    uncertain_errors = [e for e in result.errors if e.confidence < threshold]
    if uncertain_errors and semantic_checker:  # a SemanticChecker initialized as shown earlier
        sentence = text
        words = [e.text for e in uncertain_errors]
        semantic_results = semantic_checker.scan_sentence(sentence, words)
        # Update confidence based on semantic analysis
    return result
Caching Predictions
from functools import lru_cache

@lru_cache(maxsize=10000)
def cached_semantic_check(context_tuple: tuple) -> dict:
    """Cache semantic predictions for common contexts."""
    left, word, right = context_tuple
    sentence = f"{left} {word} {right}"
    return semantic.is_semantic_error(sentence, word, [])
GPU Batch Processing
def process_large_corpus(sentences: list[tuple[str, list[str]]], batch_size: int = 64) -> list:
    """Scan a large corpus in fixed-size chunks; pair with a GPU-backed
    checker (device="cuda:0") for best throughput.

    Args:
        sentences: List of (sentence, words) tuples for scanning.
        batch_size: Number of sentences per chunk.
    """
    all_errors = []
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        for sentence, words in batch:
            errors = semantic.scan_sentence(sentence, words)
            all_errors.extend(errors)
    return all_errors
Troubleshooting
Issue: Model loading fails
Cause: Missing model files or incompatible version
Solution:
# Check model files exist
import os
assert os.path.exists("./my-model/model.onnx")
assert os.path.exists("./my-model/tokenizer.json")
# Verify ONNX Runtime version
import onnxruntime
print(onnxruntime.__version__)
Issue: Slow inference
Cause: Unquantized model or insufficient threads
Solution:
# Use quantized model (train-model exports quantized by default)
config = SemanticConfig(
    model_path="./my-model/model.onnx",  # Quantized ONNX
    num_threads=8,                       # More CPU threads
)
Issue: Poor accuracy
Cause: Model not trained on similar data or corpus too small
Solution: Retrain with a larger or more domain-specific corpus, or use a bigger architecture:
myspellchecker train-model \
--input domain_corpus.txt \
--output ./my-model-v2/ \
--hidden-size 512 \
--layers 6 \
--epochs 10
Issue: High memory usage
Cause: Large model architecture
Solution: Train a smaller model:
# Train a tiny model for constrained environments
myspellchecker train-model \
--input corpus.txt \
--output ./my-model-tiny/ \
--hidden-size 128 \
--layers 3
Next Steps