While N-gram context checking is fast and effective, it is limited by a short context window and data sparsity. Semantic Validation uses a Transformer-based Masked Language Model to take the full sentence into account: it masks each word and compares the model's predictions against the original to detect deep context errors. This is the AI-powered strategy at priority 70 in the validation pipeline.
How It Works
mySpellChecker uses a Masked Language Model (MLM) approach, similar to BERT or RoBERTa.

- Masking: The system takes a sentence and hides the suspicious word.
  - Sentence: “မောင်မောင်က အောင်အောင်ကို လှမ်းပျော်လိုက်သည်။” (Maung Maung [ပျော် = happy, should be ပြော = speak] called out to Aung Aung).
  - Masked: “မောင်မောင်က အောင်အောင်ကို [MASK]လိုက်သည်။”
- Prediction: The AI model predicts the most likely words to fill the hole based on the entire sentence context.
  - Predictions: “ပြော” (Say - 99%), “ကြည့်” (Look - 0.5%)…
- Comparison:
  - The original word “ပျော်” (Happy) is contextually nonsense here (very low probability).
  - A phonetically similar neighbor “ပြော” (Say) has high probability.
  - The system flags this as a semantic error and suggests “ပြော”.
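The comparison step can be pictured as a small decision function. This is only a sketch: the name `pick_suggestion`, the `min_ratio` threshold, and the probability values are assumptions made for illustration, not the library's actual logic or API.

```python
from typing import Dict, Iterable, Optional


def pick_suggestion(
    predictions: Dict[str, float],
    original: str,
    neighbors: Iterable[str],
    min_ratio: float = 10.0,  # hypothetical threshold, not a library default
) -> Optional[str]:
    """Return a neighbor that is far more likely than the original word, else None."""
    # Probability the model assigns to the word actually written (0 if absent).
    original_prob = predictions.get(original, 0.0)

    # Among the phonetically similar neighbors, keep the one the model prefers.
    best = max(neighbors, key=lambda w: predictions.get(w, 0.0), default=None)
    if best is None:
        return None
    best_prob = predictions.get(best, 0.0)

    # Flag only when the neighbor clearly beats the original word.
    if best_prob > 0.0 and best_prob >= min_ratio * max(original_prob, 1e-9):
        return best
    return None


# Illustrative probabilities modelled on the example above.
predictions = {"ပြော": 0.99, "ကြည့်": 0.005}
print(pick_suggestion(predictions, "ပျော်", ["ပြော"]))
```

Requiring a ratio rather than a fixed probability cut-off is one possible design choice; it keeps the check from firing when the original word is itself a plausible fill for the mask.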
Architecture
- Model Format: ONNX (Open Neural Network Exchange) for high-performance inference on CPU.
- Tokenizer: `HFTokenizerWrapper` (adapts HuggingFace tokenizers such as XLM-RoBERTa, mBERT) or a custom `tokenizer.json` loaded with the `tokenizers` library through `RawTokenizersWrapper`.
- Optimization: The model is quantized (int8) to reduce size and increase speed.
Word-Aligned Multi-Token Masking
Myanmar words are frequently split into multiple BPE subword tokens. Standard BERT-style masking (one token at a time) fails for these words. The semantic checker implements word-aligned masking:
The Problem
Alignment Algorithm
- Tokenize the full sentence
- Map each token to its character offset range using the tokenizer’s offset mapping
- Find all tokens whose character offsets overlap with the target word’s span
- Mask all those tokens simultaneously
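The alignment steps above can be sketched in plain Python. The function names (`tokens_for_word`, `mask_word`), the toy token ids, and the hand-written offsets are assumptions for illustration; in practice the offsets would come from the tokenizer's offset mapping.

```python
from typing import List, Tuple

Offset = Tuple[int, int]  # (start, end) character offsets, end exclusive


def tokens_for_word(offsets: List[Offset], word_start: int, word_end: int) -> List[int]:
    """Return indices of tokens whose character span overlaps [word_start, word_end)."""
    return [
        i
        for i, (tok_start, tok_end) in enumerate(offsets)
        # Skip zero-width entries (commonly used for special tokens).
        if tok_end > tok_start and tok_start < word_end and tok_end > word_start
    ]


def mask_word(
    token_ids: List[int],
    offsets: List[Offset],
    sentence: str,
    word: str,
    mask_id: int,
) -> List[int]:
    """Replace every token covering the first occurrence of `word` with `mask_id`."""
    word_start = sentence.find(word)
    if word_start < 0:
        raise ValueError("word not found in sentence")
    covered = set(tokens_for_word(offsets, word_start, word_start + len(word)))
    return [mask_id if i in covered else tid for i, tid in enumerate(token_ids)]


# Toy example: "abc defgh" tokenized as ["abc", " de", "fgh"].
sentence = "abc defgh"
token_ids = [11, 12, 13]
offsets = [(0, 3), (3, 6), (6, 9)]
print(mask_word(token_ids, offsets, sentence, "defgh", mask_id=0))  # [11, 0, 0]
```

Here the word "defgh" spans two tokens, so both are masked together, while the unrelated first token is left untouched.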
Beam Search for Multi-Token Prediction
When multiple tokens are masked, the checker uses beam search to find the most likely complete word.
Confidence Calibration
Different transformer architectures produce logits at different scales. The checker auto-detects the model family and applies appropriate scaling:

| Model Family | Logit Scale | Detection |
|---|---|---|
| XLM-RoBERTa | 10.0 | Model name contains “xlm-roberta” |
| XLM | 10.0 | Model name contains “xlm” |
| DistilBERT | 30.0 | Model name contains “distil” |
| ALBERT | 40.0 | Model name contains “albert” (matched before “bert”) |
| BERT / mBERT | 50.0 | Model name contains “bert” |
| RoBERTa (non-XLM) | 15.0 | Model name contains “roberta” (matched after “xlm-roberta”) |
| Other | 10.0 | Default fallback |
Override the detected value with `logit_scale` in `SemanticConfig` if needed.
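The name-based detection in the table could look roughly like the sketch below. The function `detect_logit_scale`, its `override` parameter, and the exact rule order are assumptions; the order simply makes sure more specific names win. In particular, "roberta" is checked before "bert" here because the string "roberta" itself contains "bert".

```python
from typing import List, Optional, Tuple

# (substring, scale) pairs taken from the table; more specific names come first.
# The ordering is an assumption chosen so that substring matches do not collide.
SCALE_RULES: List[Tuple[str, float]] = [
    ("xlm-roberta", 10.0),
    ("xlm", 10.0),
    ("distil", 30.0),
    ("albert", 40.0),
    ("roberta", 15.0),
    ("bert", 50.0),
]
DEFAULT_SCALE = 10.0  # "Other" row in the table


def detect_logit_scale(model_name: str, override: Optional[float] = None) -> float:
    """Pick a logit scale from the model name, unless an explicit value is given."""
    if override is not None:
        return override
    name = model_name.lower()
    for needle, scale in SCALE_RULES:
        if needle in name:
            return scale
    return DEFAULT_SCALE


print(detect_logit_scale("xlm-roberta-base"))
print(detect_logit_scale("bert-base-multilingual-cased"))
print(detect_logit_scale("my-custom-mlm", override=20.0))
```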
Inference Backends
The semantic checker supports two backends through the `inference_backends.py` adapter:
ONNX Runtime (Default)
PyTorch Fallback
Note: GPU device selection is configured via `SemanticConfig.device` (e.g., `"cuda:0"`), not on the `SemanticChecker` constructor directly.
Tokenizer Wrappers
Two tokenizer adapters provide a unified interface:

- HFTokenizerWrapper: Wraps HuggingFace `AutoTokenizer` (for models like XLM-RoBERTa, mBERT)
- RawTokenizersWrapper: Wraps the `tokenizers` library format (for custom `tokenizer.json` files)
Both wrappers expose the same methods: `encode(text)`, `decode(ids)`, `token_to_id(token)`, `get_offsets(text)`.
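One way to picture the unified interface is as a structural type that any adapter can satisfy. The sketch below is an assumption-laden illustration: the `TokenizerLike` protocol, the return types, and the toy `WhitespaceTokenizer` are hypothetical and are not the library's own classes; only the four method names come from the list above.

```python
from typing import List, Protocol, Tuple


class TokenizerLike(Protocol):
    """Hypothetical shape of the shared interface (return types are assumed)."""

    def encode(self, text: str) -> List[int]: ...
    def decode(self, ids: List[int]) -> str: ...
    def token_to_id(self, token: str) -> int: ...
    def get_offsets(self, text: str) -> List[Tuple[int, int]]: ...


class WhitespaceTokenizer:
    """Toy stand-in that splits on whitespace; for illustration only."""

    def __init__(self, vocab: List[str]) -> None:
        self._vocab = list(vocab)
        self._ids = {tok: i for i, tok in enumerate(self._vocab)}

    def encode(self, text: str) -> List[int]:
        return [self._ids[tok] for tok in text.split()]

    def decode(self, ids: List[int]) -> str:
        return " ".join(self._vocab[i] for i in ids)

    def token_to_id(self, token: str) -> int:
        return self._ids[token]

    def get_offsets(self, text: str) -> List[Tuple[int, int]]:
        offsets, pos = [], 0
        for tok in text.split():
            start = text.index(tok, pos)  # locate the token at or after `pos`
            offsets.append((start, start + len(tok)))
            pos = start + len(tok)
        return offsets


def roundtrip(tokenizer: TokenizerLike, text: str) -> str:
    """Code written against the protocol works with any conforming adapter."""
    return tokenizer.decode(tokenizer.encode(text))


tok = WhitespaceTokenizer(["[MASK]", "hello", "world"])
print(roundtrip(tok, "hello world"))
print(tok.get_offsets("hello  world"))
```

Because the calling code depends only on the four shared methods, swapping one adapter for another does not require changes elsewhere.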
Training Your Own Model
Since generic models may not cover your specific domain (e.g., medical, legal), mySpellChecker provides a built-in training pipeline. You can train a custom model on your own text corpus without needing a GPU cluster or cloud API.
Train
Use the `train-model` CLI command. This handles tokenization, training (RoBERTa), and ONNX export automatically.
Usage
Prerequisites
Configuration
You can load the model using file paths or pass pre-loaded objects.
Option A: File Paths (Simple)
SemanticChecker Methods
| Method | Description |
|---|---|
| `predict_mask(sentence, target_word, top_k, occurrence)` | Predict most likely words at a masked position |
| `is_semantic_error(sentence, word, neighbors)` | Check if word is a semantic error; returns suggestion or None |
| `scan_sentence(sentence, words)` | Proactively scan all words in sentence for semantic anomalies |
| `batch_get_mask_logits(sentence, target_words)` | Batch inference for multiple masked positions in one forward pass |
| `score_mask_candidates(sentence, target_word, candidates)` | Score specific candidate tokens at a masked position |
| `score_candidates(sentence, target_word, candidates)` | Combined scoring with logits + frequency |
| `clear_inference_cache()` | Clear cached inference results |
| `has_cached_logits(sentence, target_word)` | Check if logits are already cached |
| `cache_stats()` | Return encoding/alignment/logit cache statistics |
| `close()` | Release ONNX session resources |
Performance Considerations
- Latency: Neural network inference is slower than N-gram lookup (roughly 50-150 ms on CPU).
- Strategy: Use Semantic Validation when accuracy is paramount (e.g., final proofreading, offline batch processing).
See Also
- Semantic Checking Feature: Feature-level documentation and configuration
- Training Guide: Training semantic MLM and neural reranker models