While N-gram context checking is fast and effective, it has limitations (a short context window and data sparsity). Semantic Validation uses a deep-learning Masked Language Model (a Transformer) to understand the meaning of the full sentence: it masks each word in turn and compares the model's predictions against the original, catching deep context errors. This is the AI-powered strategy at priority 70 in the validation pipeline.

How It Works

mySpellChecker uses a Masked Language Model (MLM) approach, similar to BERT or RoBERTa.
  1. Masking: The system takes a sentence and hides the suspicious word.
    • Sentence: “မောင်မောင်က အောင်အောင်ကို လှမ်းပျော်လိုက်သည်။” (Maung Maung [ပျော် = happy, should be ပြော = speak] called out to Aung Aung).
    • Masked: “မောင်မောင်က အောင်အောင်ကို [MASK]လိုက်သည်။”
  2. Prediction: The AI model predicts the most likely words to fill the hole based on the entire sentence context.
    • Predictions: “ပြော” (Say - 99%), “ကြည့်” (Look - 0.5%)…
  3. Comparison:
    • The original word “ပျော်” (Happy) is contextually nonsense here (very low probability).
    • A phonetically similar neighbor “ပြော” (Say) has high probability.
    • The system flags this as a semantic error and suggests “ပြော”.
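The mask-predict-compare loop above can be sketched in a few lines. This is an illustrative sketch only: `detect_semantic_error`, its thresholds, and the probability numbers are assumptions for the example, not the library's API.

```python
# Illustrative sketch of the comparison step. `predictions` stands in
# for the mask-fill probabilities a real MLM would return; the `low`
# and `high` thresholds are assumed values, not library defaults.

def detect_semantic_error(original, predictions, neighbors,
                          low=0.01, high=0.5):
    """Flag `original` if the model finds it implausible in context
    but a phonetically similar neighbor is highly plausible."""
    if predictions.get(original, 0.0) >= low:
        return None  # the original word fits the context well enough
    best = max(neighbors, key=lambda w: predictions.get(w, 0.0), default=None)
    if best is not None and predictions.get(best, 0.0) >= high:
        return best  # suggest the high-probability neighbor
    return None

# Mask-fill probabilities as in the example above (hypothetical numbers)
preds = {"ပြော": 0.99, "ကြည့်": 0.005, "ပျော်": 0.0001}
suggestion = detect_semantic_error("ပျော်", preds, neighbors=["ပြော"])
# suggestion is "ပြော" (Say), replacing the contextually nonsensical "ပျော်"
```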

Architecture

  • Model Format: ONNX (Open Neural Network Exchange) for high-performance inference on CPU.
  • Tokenizer: HFTokenizerWrapper (adapts HuggingFace tokenizers such as XLM-RoBERTa and mBERT) or a custom tokenizer.json loaded with the tokenizers library via RawTokenizersWrapper.
  • Optimization: The model is quantized (int8) to reduce size and increase speed.

Word-Aligned Multi-Token Masking

Myanmar words are frequently split into multiple BPE subword tokens. Standard BERT-style masking (one token at a time) fails for these words. The semantic checker implements word-aligned masking:

The Problem

Word: "ကျွန်တော်" → BPE tokens: ["▁ကျွန်", "တော်"]

Standard masking:  "... [MASK] တော် ..."
→ Model sees half the word, predictions are biased

Word-aligned masking: "... [MASK] [MASK] ..."
→ Model sees full context without the word

Alignment Algorithm

  1. Tokenize the full sentence
  2. Map each token to its character offset range using the tokenizer’s offset mapping
  3. Find all tokens whose character offsets overlap with the target word’s span
  4. Mask all those tokens simultaneously

The alignment results are cached (LRU, 256 entries) since the same sentence/word combinations are often checked repeatedly.
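The overlap test in steps 2-4 can be sketched as follows. The character offsets are supplied directly here for illustration; in practice a tokenizer's offset mapping would provide them, and the cache size mirrors the 256-entry LRU mentioned above.

```python
from functools import lru_cache

# Sketch of word-aligned masking: select every token whose character
# span overlaps the target word's span, so all of them can be masked
# simultaneously. Offsets are passed in as a tuple of (start, end)
# character ranges, one per token (hashable, so lru_cache works).

@lru_cache(maxsize=256)  # same sentence/word pairs recur, as noted above
def aligned_token_indices(offsets, word_start, word_end):
    return tuple(
        i for i, (s, e) in enumerate(offsets)
        if s < word_end and e > word_start  # half-open ranges overlap
    )

# "ကျွန်တော်" spans characters 0..9 and was split into two subword tokens
offsets = ((0, 5), (5, 9), (10, 12))
print(aligned_token_indices(offsets, 0, 9))  # → (0, 1): mask both tokens
```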

Beam Search for Multi-Token Prediction

When multiple tokens are masked, the checker uses beam search to find the most likely complete word:
Masked positions: [pos_5, pos_6]

Step 1: Get top-K predictions for pos_5
        → [("▁ကျွန်", 0.8), ("▁အောင်", 0.1), ...]

Step 2: For each candidate at pos_5, get top-K at pos_6
        → ("▁ကျွန်", 0.8) × ("တော်", 0.9) = 0.72
        → ("▁ကျွန်", 0.8) × ("မ", 0.05) = 0.04
        → ...

Step 3: Select highest combined score
        → "ကျွန်တော်" with score 0.72

Decode combined tokens → final word prediction

This avoids the “diagonal selection” bug, where independently picking the best token at each masked position produces invalid word combinations.
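A minimal beam search over the masked positions looks like this. The candidate lists and probabilities reuse the numbers from the walkthrough above; the function itself is a generic sketch, not the checker's internal implementation.

```python
# Minimal beam search over masked positions. Each position has a list
# of (token, probability) candidates; we keep the top `beam` partial
# sequences ranked by the product of probabilities, rather than
# independently picking each position's best token.

def beam_search(position_candidates, beam=3):
    beams = [((), 1.0)]
    for candidates in position_candidates:
        expanded = [
            (tokens + (tok,), score * p)
            for tokens, score in beams
            for tok, p in candidates
        ]
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = expanded[:beam]  # prune to the top `beam` sequences
    return beams[0]

pos5 = [("▁ကျွန်", 0.8), ("▁အောင်", 0.1)]
pos6 = [("တော်", 0.9), ("မ", 0.05)]
tokens, score = beam_search([pos5, pos6])
print("".join(tokens).replace("▁", ""), round(score, 2))  # → ကျွန်တော် 0.72
```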

Confidence Calibration

Different transformer architectures produce logits at different scales. The checker auto-detects the model family and applies appropriate scaling:
| Model Family | Logit Scale | Detection |
| --- | --- | --- |
| XLM-RoBERTa | 10.0 | Model name contains “xlm-roberta” |
| XLM | 10.0 | Model name contains “xlm” |
| DistilBERT | 30.0 | Model name contains “distil” |
| ALBERT | 40.0 | Model name contains “albert” (matched before “bert”) |
| BERT / mBERT | 50.0 | Model name contains “bert” |
| RoBERTa (non-XLM) | 15.0 | Model name contains “roberta” (matched after “xlm-roberta”) |
| Other | 10.0 | Default fallback |
The logit scale converts raw model logits to a [0, 1] confidence score. Override with logit_scale in SemanticConfig if needed.
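The name-based detection can be sketched as an ordered rule list, where order encodes the “matched before/after” notes in the table (e.g. “albert” before “bert”, “xlm-roberta” before “roberta”). The rule list mirrors the table; the clamping in `logit_to_confidence` is an assumed mapping for illustration, since the exact logit-to-confidence formula is not specified here.

```python
# Ordered substring rules: more specific names must be checked first,
# since "distilbert" also contains "bert" and "roberta" contains "bert".
SCALE_RULES = [
    ("xlm-roberta", 10.0),
    ("xlm", 10.0),
    ("distil", 30.0),
    ("albert", 40.0),
    ("roberta", 15.0),
    ("bert", 50.0),
]

def detect_logit_scale(model_name, default=10.0):
    name = model_name.lower()
    for key, scale in SCALE_RULES:
        if key in name:
            return scale
    return default

def logit_to_confidence(logit, scale):
    # Assumed mapping: divide by the family scale and clamp into [0, 1].
    return max(0.0, min(1.0, logit / scale))

print(detect_logit_scale("xlm-roberta-base"))  # → 10.0
print(detect_logit_scale("distilbert-base"))   # → 30.0
```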

Inference Backends

The semantic checker supports two backends through the inference_backends.py adapter:

ONNX Runtime (Default)

semantic = SemanticChecker(
    model_path="./model/model.onnx",
    tokenizer_path="./model/tokenizer.json",
    num_threads=4,
)

PyTorch Fallback

semantic = SemanticChecker(
    model_path="xlm-roberta-base",
    tokenizer_path="xlm-roberta-base",
    use_pytorch=True,
)
Note: GPU device selection is configured via SemanticConfig.device (e.g., "cuda:0"), not on the SemanticChecker constructor directly.

Tokenizer Wrappers

Two tokenizer adapters provide a unified interface:
  • HFTokenizerWrapper: Wraps HuggingFace AutoTokenizer (for models like XLM-RoBERTa, mBERT)
  • RawTokenizersWrapper: Wraps the tokenizers library format (for custom tokenizer.json files)
Both expose: encode(text), decode(ids), token_to_id(token), get_offsets(text)
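A toy implementation shows the shape of that shared interface. It splits on whitespace purely for illustration; the real wrappers delegate to HuggingFace AutoTokenizer or the tokenizers library, and the class name here is hypothetical.

```python
import re

class ToyTokenizerWrapper:
    """Whitespace tokenizer exposing the same four methods as the
    real wrappers: encode, decode, token_to_id, get_offsets."""

    def __init__(self, vocab):
        self.vocab = vocab                       # token -> id
        self.ids = {i: t for t, i in vocab.items()}

    def encode(self, text):
        return [self.vocab[t] for t in text.split()]

    def decode(self, ids):
        return " ".join(self.ids[i] for i in ids)

    def token_to_id(self, token):
        return self.vocab.get(token)

    def get_offsets(self, text):
        # Character (start, end) span for each token, as needed by
        # the word-aligned masking step.
        return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

tok = ToyTokenizerWrapper({"hello": 0, "world": 1})
print(tok.encode("hello world"))       # → [0, 1]
print(tok.get_offsets("hello world"))  # → [(0, 5), (6, 11)]
```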

Training Your Own Model

Since generic models may not cover your specific domain (e.g., medical, legal), mySpellChecker provides a built-in training pipeline. You can train a custom model on your own text corpus without needing a GPU cluster or cloud API.

Step 1: Install Training Tools

pip install "myspellchecker[train]"

Step 2: Prepare Data

Create a simple text file (corpus.txt) with one sentence per line.
မောင်မောင် ကျောင်းသွားသည်။
သူ စာကြိုးစားသည်။
...

Step 3: Train

Use the train-model CLI command. This handles tokenization, training (RoBERTa), and ONNX export automatically.
myspellchecker train-model \
    --input corpus.txt \
    --output ./my_semantic_model \
    --epochs 5

Step 4: Result

The ./my_semantic_model folder will contain:
  • model.onnx: The optimized AI model.
  • tokenizer.json: The custom vocabulary.

Usage

Prerequisites

pip install "myspellchecker[ai]"

Configuration

You can load the model using file paths or pass pre-loaded objects.

Option A: File Paths (Simple)
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig

config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model_path="./my_semantic_model/model.onnx",
        tokenizer_path="./my_semantic_model/tokenizer.json",
    ),
    use_context_checker=True
)
checker = SpellChecker(config=config)
Option B: Pre-loaded Objects (Advanced)
import onnxruntime as ort
from tokenizers import Tokenizer
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig

model = ort.InferenceSession("./my_semantic_model/model.onnx")
tokenizer = Tokenizer.from_file("./my_semantic_model/tokenizer.json")

config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model=model,
        tokenizer=tokenizer,
    ),
    use_context_checker=True
)
checker = SpellChecker(config=config)

SemanticChecker Methods

| Method | Description |
| --- | --- |
| predict_mask(sentence, target_word, top_k, occurrence) | Predict the most likely words at a masked position |
| is_semantic_error(sentence, word, neighbors) | Check whether a word is a semantic error; returns a suggestion or None |
| scan_sentence(sentence, words) | Proactively scan all words in a sentence for semantic anomalies |
| batch_get_mask_logits(sentence, target_words) | Batch inference for multiple masked positions in one forward pass |
| score_mask_candidates(sentence, target_word, candidates) | Score specific candidate tokens at a masked position |
| score_candidates(sentence, target_word, candidates) | Combined scoring with logits + frequency |
| clear_inference_cache() | Clear cached inference results |
| has_cached_logits(sentence, target_word) | Check whether logits are already cached |
| cache_stats() | Return encoding/alignment/logit cache statistics |
| close() | Release ONNX session resources |

Performance Considerations

  • Latency: Neural network inference is slower than N-gram lookup (~50-150 ms on CPU).
  • Strategy: Use Semantic Validation when accuracy is paramount (e.g., final proofreading, offline batch processing).

See Also