How It Works
mySpellChecker uses a Masked Language Model (MLM) approach, similar to BERT or RoBERTa.
- Masking: The system takes a sentence and hides the suspicious word.
- Sentence: “မောင်မောင်က အောင်အောင်ကို လှမ်းပျော်လိုက်သည်။” (Maung Maung [ပျော် = happy, should be ပြော = speak] called out to Aung Aung).
- Masked: “မောင်မောင်က အောင်အောင်ကို [MASK]လိုက်သည်။”
- Prediction: The model predicts the most likely words to fill the masked position based on the entire sentence context.
- Predictions: “ပြော” (Say - 99%), “ကြည့်” (Look - 0.5%)…
- Comparison:
- The original word “ပျော်” (Happy) is contextually nonsensical here (very low probability).
- A phonetically similar neighbor “ပြော” (Say) has high probability.
- The system flags this as a semantic error and suggests “ပြော”.
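The comparison step can be sketched in Python. This is an illustrative sketch with made-up thresholds and a hypothetical `flag_semantic_error` helper, not the library's actual logic:

```python
# Hypothetical probabilities the MLM might return for the masked slot.
predictions = {"ပြော": 0.99, "ကြည့်": 0.005}

def flag_semantic_error(original, neighbors, predictions,
                        low=0.01, high=0.5):
    """Flag `original` as a semantic error when it is unlikely in
    context while a phonetically similar neighbor is highly likely."""
    p_orig = predictions.get(original, 0.0)
    if p_orig >= low:
        return None  # the original word fits the context well enough
    # Pick the highest-probability phonetic neighbor as the suggestion.
    best = max(neighbors, key=lambda w: predictions.get(w, 0.0))
    if predictions.get(best, 0.0) >= high:
        return best
    return None

suggestion = flag_semantic_error("ပျော်", ["ပြော"], predictions)
# → "ပြော"
```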
Architecture
- Model Format: ONNX (Open Neural Network Exchange) for high-performance inference on CPU.
- Tokenizer: HFTokenizerWrapper (adapts HuggingFace tokenizers like XLM-RoBERTa, mBERT) or a custom tokenizer.json (loaded with the tokenizers library through RawTokenizersWrapper).
- Optimization: The model is quantized (int8) to reduce size and increase speed.
Word-Aligned Multi-Token Masking
The Problem
Myanmar words are frequently split into multiple BPE subword tokens, so standard BERT-style masking (one token at a time) fails for these words. To handle this, the semantic checker implements word-aligned masking.
Alignment Algorithm
- Tokenize the full sentence
- Map each token to its character offset range using the tokenizer’s offset mapping
- Find all tokens whose character offsets overlap with the target word’s span
- Mask all those tokens simultaneously
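The overlap test in steps 3-4 can be sketched as follows (`tokens_for_word` is a hypothetical helper; `offsets` stands in for the tokenizer's offset mapping):

```python
def tokens_for_word(offsets, word_start, word_end):
    """Return indices of all tokens whose character span overlaps
    the target word's [word_start, word_end) character range."""
    return [i for i, (s, e) in enumerate(offsets)
            if s < word_end and e > word_start]

# Hypothetical offset mapping for a 5-token sentence; the target word
# occupies characters 10..16 and is split across tokens 2 and 3.
offsets = [(0, 4), (4, 10), (10, 13), (13, 16), (16, 20)]
assert tokens_for_word(offsets, 10, 16) == [2, 3]
```

Both overlapping tokens would then be replaced with [MASK] in the same pass.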
Beam Search for Multi-Token Prediction
When multiple tokens are masked, the checker uses beam search to find the most likely complete word.
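A minimal sketch of beam search over per-slot token probabilities (illustrative only; the real checker works on model logits rather than hand-written probability tables):

```python
import math

def beam_search(position_logprobs, beam_width=3):
    """position_logprobs: one {token: log_prob} dict per masked slot.
    Returns (tokens, total_log_prob) for the best joint fill."""
    beams = [([], 0.0)]
    for logprobs in position_logprobs:
        # Extend every surviving beam with every candidate token.
        candidates = [(seq + [tok], score + lp)
                      for seq, score in beams
                      for tok, lp in logprobs.items()]
        # Keep only the top `beam_width` partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Two masked subword slots with hypothetical probabilities.
slots = [{"ပြ": math.log(0.7), "ကြ": math.log(0.3)},
         {"ော": math.log(0.9), "ည့်": math.log(0.1)}]
best_tokens, best_score = beam_search(slots)
# → best_tokens == ["ပြ", "ော"]
```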
Confidence Calibration
Different transformer architectures produce logits at different scales. The checker auto-detects the model family and applies the appropriate scaling:
| Model Family | Logit Scale | Detection |
|---|---|---|
| XLM-RoBERTa | 10.0 | Model name contains “xlm-roberta” |
| XLM | 10.0 | Model name contains “xlm” |
| DistilBERT | 30.0 | Model name contains “distil” |
| ALBERT | 40.0 | Model name contains “albert” (matched before “bert”) |
| BERT / mBERT | 50.0 | Model name contains “bert” |
| RoBERTa (non-XLM) | 15.0 | Model name contains “roberta” (matched after “xlm-roberta”) |
| Other | 10.0 | Default fallback |
You can override logit_scale in SemanticConfig if needed.
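The detection and scaling can be sketched as follows. The scale values mirror the table above; `detect_scale` and `calibrated_softmax` are illustrative names, not the library's API:

```python
import math

# Ordered so that more specific names win: "xlm-roberta" before
# "roberta"/"xlm", and "albert"/"distil" before "bert".
SCALES = [("xlm-roberta", 10.0), ("albert", 40.0), ("distil", 30.0),
          ("roberta", 15.0), ("xlm", 10.0), ("bert", 50.0)]

def detect_scale(model_name, default=10.0):
    """Pick a logit scale from the model name (first match wins)."""
    name = model_name.lower()
    for key, scale in SCALES:
        if key in name:
            return scale
    return default

def calibrated_softmax(logits, scale):
    """Temperature-style scaling before a numerically stable softmax."""
    scaled = [x / scale for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```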
Inference Backends
The semantic checker supports two backends through the inference_backends.py adapter:
ONNX Runtime (Default)
PyTorch Fallback
Note: GPU device selection is configured via SemanticConfig.device (e.g., "cuda:0"), not on the SemanticChecker constructor directly.
Tokenizer Wrappers
Two tokenizer adapters provide a unified interface:
- HFTokenizerWrapper: Wraps HuggingFace AutoTokenizer (for models like XLM-RoBERTa, mBERT)
- RawTokenizersWrapper: Wraps the tokenizers library format (for custom tokenizer.json files)
Both expose the same methods: encode(text), decode(ids), token_to_id(token), get_offsets(text)
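An illustrative toy implementation of that interface using whitespace tokenization (not one of the library's actual wrappers, which delegate to HuggingFace tokenizers):

```python
class ToyTokenizerWrapper:
    """Toy wrapper implementing encode / decode / token_to_id /
    get_offsets over a whitespace tokenizer, to show the interface."""

    def __init__(self, vocab):
        self.vocab = vocab                         # token -> id
        self.inv = {i: t for t, i in vocab.items()}  # id -> token

    def encode(self, text):
        return [self.vocab[t] for t in text.split()]

    def decode(self, ids):
        return " ".join(self.inv[i] for i in ids)

    def token_to_id(self, token):
        return self.vocab.get(token)

    def get_offsets(self, text):
        """Character (start, end) span for each token, as the real
        wrappers report via the tokenizer's offset mapping."""
        offsets, pos = [], 0
        for tok in text.split():
            start = text.index(tok, pos)
            offsets.append((start, start + len(tok)))
            pos = start + len(tok)
        return offsets
```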
Training Your Own Model
Since generic models may not cover your specific domain (e.g., medical, legal), mySpellChecker provides a built-in training pipeline. You can train a custom model on your own text corpus without needing a GPU cluster or cloud API.
Train
Use the train-model CLI command. This handles tokenization, training (RoBERTa), and ONNX export automatically.
Usage
Prerequisites
Configuration
You can load the model using file paths or pass pre-loaded objects.
Option A: File Paths (Simple)
SemanticChecker Methods
| Method | Description |
|---|---|
| predict_mask(sentence, target_word, top_k, occurrence) | Predict the most likely words at a masked position |
| is_semantic_error(sentence, word, neighbors) | Check whether a word is a semantic error; returns a suggestion or None |
| scan_sentence(sentence, words) | Proactively scan all words in a sentence for semantic anomalies |
| batch_get_mask_logits(sentence, target_words) | Batch inference for multiple masked positions in one forward pass |
| score_mask_candidates(sentence, target_word, candidates) | Score specific candidate tokens at a masked position |
| score_candidates(sentence, target_word, candidates) | Combined scoring with logits + frequency |
| clear_inference_cache() | Clear cached inference results |
| has_cached_logits(sentence, target_word) | Check whether logits are already cached |
| cache_stats() | Return encoding/alignment/logit cache statistics |
| close() | Release ONNX session resources |
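To illustrate the caching methods, here is a minimal sketch of what an inference cache behind has_cached_logits / clear_inference_cache / cache_stats might look like (illustrative only, not the library's implementation):

```python
class LogitCache:
    """Minimal (sentence, target_word) -> logits cache sketch."""

    def __init__(self):
        self._logits = {}
        self.hits = self.misses = 0

    def get(self, sentence, target_word, compute):
        """Return cached logits, running `compute` only on a miss."""
        key = (sentence, target_word)
        if key in self._logits:
            self.hits += 1
        else:
            self.misses += 1
            self._logits[key] = compute(sentence, target_word)
        return self._logits[key]

    def has(self, sentence, target_word):
        return (sentence, target_word) in self._logits

    def clear(self):
        self._logits.clear()

    def stats(self):
        return {"entries": len(self._logits),
                "hits": self.hits, "misses": self.misses}
```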
Performance Considerations
- Latency: Neural network inference is slower than N-gram lookup (~50–150 ms on CPU).
- Strategy: Use Semantic Validation when accuracy is paramount (e.g., final proofreading, offline batch processing).
See Also
- Semantic Checking Feature: Feature-level documentation and configuration
- Training Guide: Training semantic MLM and neural reranker models