While N-gram context checking is fast and effective, it has limitations (a short context window and data sparsity). Semantic Validation uses a deep-learning Masked Language Model (a Transformer) to understand the meaning of the full sentence: it masks each word in turn and compares the model's predictions against the original, catching deep context errors. This is the AI-powered strategy at priority 70 in the validation pipeline.

How It Works

mySpellChecker uses a Masked Language Model (MLM) approach, similar to BERT or RoBERTa.
  1. Masking: The system takes a sentence and hides the suspicious word.
    • Sentence: “မောင်မောင်က အောင်အောင်ကို လှမ်းပျော်လိုက်သည်။” (Maung Maung [ပျော် = happy, should be ပြော = speak] called out to Aung Aung).
    • Masked: “မောင်မောင်က အောင်အောင်ကို [MASK]လိုက်သည်။”
  2. Prediction: The AI model predicts the most likely words to fill the hole based on the entire sentence context.
    • Predictions: “ပြော” (Say - 99%), “ကြည့်” (Look - 0.5%)…
  3. Comparison:
    • The original word “ပျော်” (Happy) is contextually nonsense here (very low probability).
    • A phonetically similar neighbor “ပြော” (Say) has high probability.
    • The system flags this as a semantic error and suggests “ပြော”.
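The mask-predict-compare loop above can be sketched in a few lines. This is an illustrative sketch only: `detect_semantic_error`, its thresholds, and the probability numbers are assumptions for the example, not the library's API.

```python
# Illustrative sketch of the comparison step. `predictions` stands in
# for the mask-fill probabilities a real MLM would return; the `low`
# and `high` thresholds are assumed values, not library defaults.

def detect_semantic_error(original, predictions, neighbors,
                          low=0.01, high=0.5):
    """Flag `original` if the model finds it implausible in context
    but a phonetically similar neighbor is highly plausible."""
    if predictions.get(original, 0.0) >= low:
        return None  # the original word fits the context well enough
    best = max(neighbors, key=lambda w: predictions.get(w, 0.0), default=None)
    if best is not None and predictions.get(best, 0.0) >= high:
        return best  # suggest the high-probability neighbor
    return None

# Mask-fill probabilities as in the example above (hypothetical numbers)
preds = {"ပြော": 0.99, "ကြည့်": 0.005, "ပျော်": 0.0001}
suggestion = detect_semantic_error("ပျော်", preds, neighbors=["ပြော"])
# suggestion is "ပြော" (Say), replacing the contextually nonsensical "ပျော်"
```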

Architecture

  • Model Format: ONNX (Open Neural Network Exchange) for high-performance inference on CPU.
  • Tokenizer: HFTokenizerWrapper (adapts HuggingFace tokenizers such as XLM-RoBERTa and mBERT) or a custom tokenizer.json loaded with the tokenizers library via RawTokenizersWrapper.
  • Optimization: The model is quantized (int8) to reduce size and increase speed.

Word-Aligned Multi-Token Masking

Myanmar words are frequently split into multiple BPE subword tokens. Standard BERT-style masking (one token at a time) fails for these words. The semantic checker implements word-aligned masking:

The Problem

Word: "ကျွန်တော်" → BPE tokens: ["▁ကျွန်", "တော်"]

Standard masking:  "... [MASK] တော် ..."
→ Model sees half the word, predictions are biased

Word-aligned masking: "... [MASK] [MASK] ..."
→ Model sees full context without the word

Alignment Algorithm

  1. Tokenize the full sentence
  2. Map each token to its character offset range using the tokenizer’s offset mapping
  3. Find all tokens whose character offsets overlap with the target word’s span
  4. Mask all those tokens simultaneously

The alignment results are cached (LRU, 256 entries) since the same sentence/word combinations are often checked repeatedly.
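The overlap test in steps 2-4 can be sketched as follows. The character offsets are supplied directly here for illustration; in practice a tokenizer's offset mapping would provide them, and the cache size mirrors the 256-entry LRU mentioned above.

```python
from functools import lru_cache

# Sketch of word-aligned masking: select every token whose character
# span overlaps the target word's span, so all of them can be masked
# simultaneously. Offsets are passed in as a tuple of (start, end)
# character ranges, one per token (hashable, so lru_cache works).

@lru_cache(maxsize=256)  # same sentence/word pairs recur, as noted above
def aligned_token_indices(offsets, word_start, word_end):
    return tuple(
        i for i, (s, e) in enumerate(offsets)
        if s < word_end and e > word_start  # half-open ranges overlap
    )

# "ကျွန်တော်" spans characters 0..9 and was split into two subword tokens
offsets = ((0, 5), (5, 9), (10, 12))
print(aligned_token_indices(offsets, 0, 9))  # → (0, 1): mask both tokens
```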

Beam Search for Multi-Token Prediction

When multiple tokens are masked, the checker uses beam search to find the most likely complete word:
Masked positions: [pos_5, pos_6]

Step 1: Get top-K predictions for pos_5
        → [("▁ကျွန်", 0.8), ("▁အောင်", 0.1), ...]

Step 2: For each candidate at pos_5, get top-K at pos_6
        → ("▁ကျွန်", 0.8) × ("တော်", 0.9) = 0.72
        → ("▁ကျွန်", 0.8) × ("မ", 0.05) = 0.04
        → ...

Step 3: Select highest combined score
        → "ကျွန်တော်" with score 0.72

Decode combined tokens → final word prediction

This avoids the “diagonal selection” bug, where independently picking the best token at each masked position produces invalid word combinations.
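A minimal beam search over the masked positions looks like this. The candidate lists and probabilities reuse the numbers from the walkthrough above; the function itself is a generic sketch, not the checker's internal implementation.

```python
# Minimal beam search over masked positions. Each position has a list
# of (token, probability) candidates; we keep the top `beam` partial
# sequences ranked by the product of probabilities, rather than
# independently picking each position's best token.

def beam_search(position_candidates, beam=3):
    beams = [((), 1.0)]
    for candidates in position_candidates:
        expanded = [
            (tokens + (tok,), score * p)
            for tokens, score in beams
            for tok, p in candidates
        ]
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = expanded[:beam]  # prune to the top `beam` sequences
    return beams[0]

pos5 = [("▁ကျွန်", 0.8), ("▁အောင်", 0.1)]
pos6 = [("တော်", 0.9), ("မ", 0.05)]
tokens, score = beam_search([pos5, pos6])
print("".join(tokens).replace("▁", ""), round(score, 2))  # → ကျွန်တော် 0.72
```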

Confidence Calibration

Different transformer architectures produce logits at different scales. The checker auto-detects the model family and applies appropriate scaling:
| Model Family | Logit Scale | Detection |
| --- | --- | --- |
| XLM-RoBERTa | 10.0 | Model name contains “xlm-roberta” |
| XLM | 10.0 | Model name contains “xlm” |
| DistilBERT | 30.0 | Model name contains “distil” |
| ALBERT | 40.0 | Model name contains “albert” (matched before “bert”) |
| BERT / mBERT | 50.0 | Model name contains “bert” |
| RoBERTa (non-XLM) | 15.0 | Model name contains “roberta” (matched after “xlm-roberta”) |
| Other | 10.0 | Default fallback |
The logit scale converts raw model logits to a [0, 1] confidence score. Override with logit_scale in SemanticConfig if needed.
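The name-based detection can be sketched as an ordered rule list, where order encodes the “matched before/after” notes in the table (e.g. “albert” before “bert”, “xlm-roberta” before “roberta”). The rule list mirrors the table; the clamping in `logit_to_confidence` is an assumed mapping for illustration, since the exact logit-to-confidence formula is not specified here.

```python
# Ordered substring rules: more specific names must be checked first,
# since "distilbert" also contains "bert" and "roberta" contains "bert".
SCALE_RULES = [
    ("xlm-roberta", 10.0),
    ("xlm", 10.0),
    ("distil", 30.0),
    ("albert", 40.0),
    ("roberta", 15.0),
    ("bert", 50.0),
]

def detect_logit_scale(model_name, default=10.0):
    name = model_name.lower()
    for key, scale in SCALE_RULES:
        if key in name:
            return scale
    return default

def logit_to_confidence(logit, scale):
    # Assumed mapping: divide by the family scale and clamp into [0, 1].
    return max(0.0, min(1.0, logit / scale))

print(detect_logit_scale("xlm-roberta-base"))  # → 10.0
print(detect_logit_scale("distilbert-base"))   # → 30.0
```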

Inference Backends

The semantic checker supports two backends through the inference_backends.py adapter:

ONNX Runtime (Default)

semantic = SemanticChecker(
    model_path="./model/model.onnx",
    tokenizer_path="./model/tokenizer.json",
    num_threads=4,
)

PyTorch Fallback

semantic = SemanticChecker(
    model_path="xlm-roberta-base",
    tokenizer_path="xlm-roberta-base",
    use_pytorch=True,
)
Note: GPU device selection is configured via SemanticConfig.device (e.g., "cuda:0"), not on the SemanticChecker constructor directly.

Tokenizer Wrappers

Two tokenizer adapters provide a unified interface:
  • HFTokenizerWrapper: Wraps HuggingFace AutoTokenizer (for models like XLM-RoBERTa, mBERT)
  • RawTokenizersWrapper: Wraps the tokenizers library format (for custom tokenizer.json files)
Both expose: encode(text), decode(ids), token_to_id(token), get_offsets(text)
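A toy implementation shows the shape of that shared interface. It splits on whitespace purely for illustration; the real wrappers delegate to HuggingFace AutoTokenizer or the tokenizers library, and the class name here is hypothetical.

```python
import re

class ToyTokenizerWrapper:
    """Whitespace tokenizer exposing the same four methods as the
    real wrappers: encode, decode, token_to_id, get_offsets."""

    def __init__(self, vocab):
        self.vocab = vocab                       # token -> id
        self.ids = {i: t for t, i in vocab.items()}

    def encode(self, text):
        return [self.vocab[t] for t in text.split()]

    def decode(self, ids):
        return " ".join(self.ids[i] for i in ids)

    def token_to_id(self, token):
        return self.vocab.get(token)

    def get_offsets(self, text):
        # Character (start, end) span for each token, as needed by
        # the word-aligned masking step.
        return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

tok = ToyTokenizerWrapper({"hello": 0, "world": 1})
print(tok.encode("hello world"))       # → [0, 1]
print(tok.get_offsets("hello world"))  # → [(0, 5), (6, 11)]
```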

Training Your Own Model

Since generic models may not cover your specific domain (e.g., medical, legal), mySpellChecker provides a built-in training pipeline. You can train a custom model on your own text corpus without needing a GPU cluster or cloud API.

Step 1: Install Training Tools

pip install "myspellchecker[train]"

Step 2: Prepare Data

Create a simple text file (corpus.txt) with one sentence per line.
မောင်မောင် ကျောင်းသွားသည်။
သူ စာကြိုးစားသည်။
...

Step 3: Train

Use the train-model CLI command. This handles tokenization, training (RoBERTa), and ONNX export automatically.
myspellchecker train-model \
    --input corpus.txt \
    --output ./my_semantic_model \
    --epochs 5

Step 4: Result

The ./my_semantic_model folder will contain:
  • model.onnx: The optimized AI model.
  • tokenizer.json: The custom vocabulary.

Usage

Prerequisites

pip install "myspellchecker[ai]"

Configuration

You can load the model using file paths or pass pre-loaded objects.

Option A: File Paths (Simple)
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig

config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model_path="./my_semantic_model/model.onnx",
        tokenizer_path="./my_semantic_model/tokenizer.json",
    ),
    use_context_checker=True
)
checker = SpellChecker(config=config)
Option B: Pre-loaded Objects (Advanced)
import onnxruntime as ort
from tokenizers import Tokenizer
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig

model = ort.InferenceSession("./my_semantic_model/model.onnx")
tokenizer = Tokenizer.from_file("./my_semantic_model/tokenizer.json")

config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model=model,
        tokenizer=tokenizer,
    ),
    use_context_checker=True
)
checker = SpellChecker(config=config)

SemanticChecker Methods

| Method | Description |
| --- | --- |
| predict_mask(sentence, target_word, top_k, occurrence) | Predict the most likely words at a masked position |
| is_semantic_error(sentence, word, neighbors) | Check whether a word is a semantic error; returns a suggestion or None |
| scan_sentence(sentence, words) | Proactively scan all words in a sentence for semantic anomalies |
| batch_get_mask_logits(sentence, target_words) | Batch inference for multiple masked positions in one forward pass |
| score_mask_candidates(sentence, target_word, candidates) | Score specific candidate tokens at a masked position |
| score_candidates(sentence, target_word, candidates) | Combined scoring with logits + frequency |
| clear_inference_cache() | Clear cached inference results |
| has_cached_logits(sentence, target_word) | Check whether logits are already cached |
| cache_stats() | Return encoding/alignment/logit cache statistics |
| close() | Release ONNX session resources |

Performance Considerations

  • Latency: Neural network inference is slower than N-gram lookup (~50-150 ms on CPU).
  • Strategy: Use Semantic Validation when accuracy is paramount (e.g., final proofreading, offline batch processing).

See Also