While N-gram context checking is fast and effective, it has limitations (short context window, data sparsity). Semantic Validation uses deep learning (Transformers) to understand the meaning of the full sentence.
Two AI approaches: mySpellChecker offers two AI-powered strategies:
  • Semantic Validation (this page): MLM-based, masks each word, provides suggestions (~200ms)
  • Error Detection: Token classification, single forward pass, detects errors only (~10ms)
Semantic Validation is more thorough; Error Detection is faster. They can be used together.

How It Works

mySpellChecker uses a Masked Language Model (MLM) approach, similar to BERT or RoBERTa.
  1. Masking: The system takes a sentence and hides the suspicious word.
    • Sentence: “မောင်မောင်က အောင်အောင်ကို လှမ်းပျော်လိုက်သည်။” (literally “Maung Maung [happy]-ed at Aung Aung” — nonsensical as written).
    • Masked: “မောင်မောင်က အောင်အောင်ကို [MASK]လိုက်သည်။”
  2. Prediction: The AI model predicts the most likely words to fill the hole based on the entire sentence context.
    • Predictions: “ပြော” (Say - 99%), “ကြည့်” (Look - 0.5%)…
  3. Comparison:
    • The original word “ပျော်” (Happy) is contextually nonsense here (very low probability).
    • A phonetically similar neighbor “ပြော” (Say) has high probability.
    • The system flags this as a semantic error and suggests “ပြော”.
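The comparison step above can be sketched in a few lines. The MLM output and the phonetic-neighbor list are stubbed here with the probabilities from the example; the function name and thresholds are illustrative, not the library's actual API.

```python
def flag_semantic_error(original, predictions, neighbors,
                        low_prob=0.01, high_prob=0.5):
    """Flag `original` if the MLM finds it unlikely in context while a
    phonetically similar neighbor is highly likely.

    predictions: dict mapping candidate word -> probability at the [MASK].
    neighbors:   words phonetically similar to `original`.
    """
    if predictions.get(original, 0.0) >= low_prob:
        return None  # original word fits the context; nothing to flag
    # Pick the most probable phonetic neighbor, if any clears the bar
    p_best, best = max(
        ((predictions.get(w, 0.0), w) for w in neighbors),
        default=(0.0, None),
    )
    return best if best is not None and p_best >= high_prob else None

# Stubbed MLM output for the masked sentence above
predictions = {"ပြော": 0.99, "ကြည့်": 0.005, "ပျော်": 0.0001}
suggestion = flag_semantic_error("ပျော်", predictions, neighbors=["ပြော"])
print(suggestion)  # → ပြော
```

The two thresholds capture the logic in words: the original must be contextually improbable *and* a confusable neighbor must be strongly preferred before the system flags anything.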

Architecture

  • Model Format: ONNX (Open Neural Network Exchange) for high-performance inference on CPU.
  • Tokenizer: HFTokenizerWrapper (adapts HuggingFace tokenizers like XLM-RoBERTa, mBERT) or custom tokenizer.json (via tokenizers library).
  • Optimization: The model is quantized (int8) to reduce size and increase speed.
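For intuition on the int8 optimization, here is a toy sketch of symmetric int8 quantization in pure Python. The real conversion is performed by ONNX tooling on the model weights, not by code like this; it only illustrates the size/precision trade-off.

```python
def quantize_int8(weights):
    """Map float weights onto int8 values [-127, 127] plus one scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate floats from the int8 values."""
    return [v * scale for v in quantized]

weights = [0.5, -1.27, 0.031, 0.9]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Each weight now fits in one byte instead of four; the reconstruction
# error per weight is bounded by scale / 2.
```

Storing one byte per weight (plus a shared scale) is what cuts the model to roughly a quarter of its float32 size, and int8 arithmetic is what speeds up CPU inference.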

Training Your Own Model

Since generic models may not cover your specific domain (e.g., medical, legal), mySpellChecker provides a built-in training pipeline. You can train a custom model on your own text corpus without needing a GPU cluster or cloud API.
1. Install Training Tools

pip install "myspellchecker[train]"
2. Prepare Data

Create a simple text file (corpus.txt) with one sentence per line.
မောင်မောင် ကျောင်းသွားသည်။
သူ စာကြိုးစားသည်။
...
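If your corpus lives in Python already, a few lines produce the expected shape: plain UTF-8 text, one sentence per line. The sentences here are just the samples from above.

```python
sentences = [
    "မောင်မောင် ကျောင်းသွားသည်။",
    "သူ စာကြိုးစားသည်။",
]

# One sentence per line, UTF-8, trailing newline
with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences) + "\n")
```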
3. Train

Use the train-model CLI command. It handles tokenization, model training (RoBERTa architecture), and ONNX export automatically.
myspellchecker train-model \
    --input corpus.txt \
    --output ./my_semantic_model \
    --epochs 5
4. Result

The ./my_semantic_model folder will contain:
  • model.onnx: The optimized AI model.
  • tokenizer.json: The custom vocabulary.

Usage

Prerequisites

pip install "myspellchecker[ai]"

Configuration

You can load the model using file paths or pass pre-loaded objects.

Option A: File Paths (Simple)
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig

config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model_path="./my_semantic_model/model.onnx",
        tokenizer_path="./my_semantic_model/tokenizer.json",
    ),
    use_context_checker=True
)
checker = SpellChecker(config=config)
Option B: Pre-loaded Objects (Advanced)
import onnxruntime as ort
from tokenizers import Tokenizer
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig

model = ort.InferenceSession("./my_semantic_model/model.onnx")
tokenizer = Tokenizer.from_file("./my_semantic_model/tokenizer.json")

config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model=model,
        tokenizer=tokenizer,
    ),
    use_context_checker=True
)
checker = SpellChecker(config=config)

Performance Considerations

  • Latency: Neural network inference is slower than N-gram lookup (~50–150 ms on CPU).
  • Strategy: Use Semantic Validation when accuracy is paramount (e.g., final proofreading, offline batch processing).
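Since the two strategies can be used together (as noted at the top of this page), one natural pattern is to let the fast error detector narrow down suspects and run the expensive MLM pass only on those. The sketch below stubs out both checkers; the function names and behavior are hypothetical, not the library's API.

```python
def detect_errors_fast(words):
    """Stub for token-classification error detection (the ~10ms pass):
    returns indices of suspicious words."""
    return [i for i, w in enumerate(words) if w == "ပျော်"]

def validate_semantic(words, index):
    """Stub for the MLM mask-and-predict pass on one word (the slow pass):
    returns a suggestion, or None if the word actually fits."""
    return "ပြော" if words[index] == "ပျော်" else None

def check_combined(words):
    """Cheap pass over all words, expensive pass on suspects only."""
    suggestions = {}
    for i in detect_errors_fast(words):
        fix = validate_semantic(words, i)
        if fix:
            suggestions[i] = fix
    return suggestions

words = ["မောင်မောင်က", "အောင်အောင်ကို", "ပျော်", "လိုက်သည်။"]
print(check_combined(words))  # → {2: 'ပြော'}
```

With this split, the per-sentence cost of the MLM is paid only for the handful of words the detector flags, keeping interactive latency close to the fast path.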
While we don’t have a standalone Semantic Model demo (as it requires training a model first), the Context Aware Demo illustrates the principles of context checking. To adapt that example for Semantic Validation:
  1. Train your model using myspellchecker train-model.
  2. Update the SpellCheckerConfig in the example script to include a SemanticConfig with model_path and tokenizer_path.
  3. The check() call remains exactly the same (level="word"), but the results will now include AI-powered suggestions!

See Also