## Overview

mySpellChecker provides two training pipelines:

### 1. Semantic Model (MLM) Training
Trains a custom Masked Language Model for semantic validation:

| Stage | Output | Purpose |
|---|---|---|
| Tokenizer | tokenizer.json | Byte-Level BPE tokenizer for Myanmar |
| Model Training | PyTorch checkpoint | Masked Language Model |
| ONNX Export | model.onnx | Optimized inference model |
### 2. Neural Reranker Training
Trains a small MLP to re-rank spell checker suggestions using learned feature weights:

| Stage | Output | Purpose |
|---|---|---|
| Data Generation | reranker_training.jsonl | 19-feature vectors per candidate with gold labels |
| MLP Training | PyTorch checkpoint | Listwise cross-entropy (ListMLE) scorer |
| ONNX Export | reranker.onnx + stats.json | Quantized model + feature normalization stats |
## Prerequisites

Install the training dependencies:

- `torch` - PyTorch for model training
- `transformers` - HuggingFace Transformers for model architectures
- `tokenizers` - Fast tokenizer library
- `onnx` - ONNX export support
- `onnxruntime` - ONNX inference runtime
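All five are published on PyPI under those names, so a plain pip install covers them:

```shell
pip install torch transformers tokenizers onnx onnxruntime
```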
## Quick Start
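A minimal run might look like the following sketch. `TrainingConfig` and its fields come from the parameter table below; the `train_model` entry point and the import path are assumptions, so check the package's API for the exact names.

```python
# Hypothetical quick start; train_model and the import path are assumptions.
from myspellchecker.train import TrainingConfig, train_model  # assumed import path

config = TrainingConfig(
    input_file="corpus.txt",   # one sentence per line, UTF-8
    output_dir="./my_model",   # tokenizer.json, checkpoints, model.onnx land here
    vocab_size=30_000,
    epochs=5,
    batch_size=16,
    architecture="roberta",    # or "bert"
)
train_model(config)
```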
The simplest way to train a model is to construct a `TrainingConfig` and run the training entry point.

## Model Architectures
The training pipeline supports two transformer architectures:

### RoBERTa (Default)

RoBERTa (Robustly Optimized BERT Pretraining Approach) is recommended for most use cases:

- Dynamic masking during training
- No Next Sentence Prediction (NSP) objective
- Larger batch sizes and more training data typically improve results
### BERT

BERT (Bidirectional Encoder Representations from Transformers):

- Static masking
- Includes NSP objective capability
- Well-suited for tasks requiring sentence-pair understanding
## Configuration Options

### TrainingConfig Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `input_file` | str | Required | Path to training corpus (one sentence per line) |
| `output_dir` | str | Required | Directory to save model and artifacts |
| `vocab_size` | int | 30,000 | Vocabulary size for BPE tokenizer |
| `min_frequency` | int | 2 | Minimum frequency for token inclusion |
| `epochs` | int | 5 | Number of training epochs |
| `batch_size` | int | 16 | Batch size per device |
| `learning_rate` | float | 5e-5 | Peak learning rate |
| `hidden_size` | int | 256 | Size of hidden layers |
| `num_layers` | int | 4 | Number of transformer layers |
| `num_heads` | int | 4 | Number of attention heads |
| `max_length` | int | 128 | Maximum sequence length |
| `architecture` | str | "roberta" | Model architecture ("roberta" or "bert") |
| `resume_from_checkpoint` | str | None | Path to checkpoint directory to resume from |
| `warmup_ratio` | float | 0.1 | Ratio of steps for learning rate warmup |
| `weight_decay` | float | 0.01 | Weight decay for optimizer |
| `save_metrics` | bool | True | Save training metrics to JSON file |
| `keep_checkpoints` | bool | False | Keep intermediate checkpoints |
| `streaming` | bool | False | Use streaming dataset for large corpora |
| `checkpoint_dir` | str | None | Persistent checkpoint directory (e.g., /opt/ml/checkpoints on SageMaker) |
| `max_steps` | int | None | Cap total training steps (overrides epoch-based training) |
| `word_boundary_aware` | bool | False | Use word-boundary-aware masking |
| `whole_word_masking` | bool | False | Mask entire words instead of subwords |
| `pos_file` | str | None | POS tag file for POS-aware masking |
| `denoising_ratio` | float | 0.0 | Ratio of denoising corruption (0 = disabled) |
| `fp16` | bool | False | Use mixed-precision (FP16) training |
| `gradient_accumulation_steps` | int | 1 | Steps to accumulate before optimizer step |
| `lr_scheduler_type` | str | "linear" | Learning rate scheduler type |
| `corruption_ratio` | float | 0.0 | Ratio of input corruption for denoising |
| `confusable_masking` | bool | False | Use confusable-aware masking (requires `whole_word_masking=True`) |
| `confusable_mask_ratio` | float | 0.3 | Ratio of masks replaced with confusable words |
| `confusable_words_file` | str | None | Path to confusable words list |
| `embedding_surgery` | bool | False | Enable embedding surgery for domain adaptation |
| `embedding_warmup_steps` | int | 25,000 | Warmup steps for embedding surgery |
| `embedding_lr` | float | 1e-3 | Learning rate for embedding layers during surgery |
### Architecture Constraints

The `hidden_size` must be divisible by `num_heads`. Valid combinations include:

- `hidden_size=256`, `num_heads=4` (64 per head)
- `hidden_size=256`, `num_heads=8` (32 per head)
- `hidden_size=512`, `num_heads=8` (64 per head)
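The constraint is easy to check before launching a run; a minimal sketch:

```python
def validate_heads(hidden_size: int, num_heads: int) -> int:
    """Return the per-head dimension, or raise if the split is invalid."""
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size={hidden_size} is not divisible by num_heads={num_heads}"
        )
    return hidden_size // num_heads

print(validate_heads(256, 4))  # 64 dimensions per head
```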
## Learning Rate Scheduling

The training pipeline uses linear learning rate scheduling with warmup:

- Starts at 0
- Linearly increases to `learning_rate` over `warmup_ratio * total_steps` steps
- Linearly decreases to 0 over the remaining steps
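The schedule above can be expressed as a small function (a sketch for intuition; the actual scheduler lives inside the trainer):

```python
def linear_warmup_lr(step: int, total_steps: int, peak_lr: float,
                     warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then linear decay back to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: fraction of post-warmup steps still remaining
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

print(linear_warmup_lr(100, 1000, 5e-5))   # end of warmup: peak LR
print(linear_warmup_lr(1000, 1000, 5e-5))  # final step: 0.0
```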
## Resume Training from Checkpoint

Training can be resumed from an interrupted run by pointing `resume_from_checkpoint` at the saved checkpoint directory.

## Training Metrics
When `save_metrics=True` (the default), training metrics are saved to `training_metrics.json`:

- `step`: Global training step
- `epoch`: Current epoch (fractional)
- `loss`: Training loss
- `learning_rate`: Current learning rate
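A quick way to inspect such a file (a sketch that assumes the file holds a JSON list of records with the fields above; a synthetic file stands in for a real run):

```python
import json

# Synthetic stand-in for a real training_metrics.json
records = [
    {"step": 100, "epoch": 0.5, "loss": 4.2, "learning_rate": 2.5e-5},
    {"step": 200, "epoch": 1.0, "loss": 3.1, "learning_rate": 5e-5},
]
with open("training_metrics.json", "w", encoding="utf-8") as f:
    json.dump(records, f)

with open("training_metrics.json", encoding="utf-8") as f:
    metrics = json.load(f)

final = metrics[-1]
print(f"final step={final['step']} loss={final['loss']:.2f}")
```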
## Low-Level API

For more control, use `ModelTrainer` directly.
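For illustration only — the constructor and method names below are assumptions about `ModelTrainer`'s API (modeled on the three pipeline stages from the overview), so consult the package source for the real signatures:

```python
# Hypothetical low-level usage; ModelTrainer's actual API may differ.
from myspellchecker.train import ModelTrainer, TrainingConfig  # assumed import path

config = TrainingConfig(input_file="corpus.txt", output_dir="./my_model")
trainer = ModelTrainer(config)
trainer.train_tokenizer()   # stage 1: tokenizer.json
trainer.train()             # stage 2: MLM checkpoint
trainer.export_onnx()       # stage 3: model.onnx
```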
## ONNX Export

Models are automatically exported to ONNX format with INT8 quantization; the exported model is what `SemanticChecker` loads for context-aware validation.

Output files:

- `model.onnx` - Quantized model (default)
- `model.base.onnx` - Original FP32 model
- `tokenizer.json` - Copied for convenience
## Using Trained Models

### With SemanticChecker
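A sketch of wiring the training output into the checker (the import path and constructor argument are assumptions):

```python
# Hypothetical usage; the import path and argument name are assumptions.
from myspellchecker import SemanticChecker  # assumed import path

# Point the checker at the training output directory (model.onnx + tokenizer.json)
checker = SemanticChecker(model_dir="./my_model")
```

See the Semantic Checking page (linked under See Also) for the validation API itself.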
### Standalone Inference
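Outside the library, the exported model can be driven directly with `onnxruntime` and `tokenizers`. A sketch — the ONNX input names follow the usual HuggingFace export convention here, which is an assumption; adjust if the exported graph differs:

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("./my_model/tokenizer.json")
session = ort.InferenceSession("./my_model/model.onnx")

text = "..."  # a segmented Myanmar sentence
enc = tokenizer.encode(text)
inputs = {
    "input_ids": np.array([enc.ids], dtype=np.int64),
    "attention_mask": np.array([enc.attention_mask], dtype=np.int64),
}
# MLM head output: (1, seq_len, vocab_size) logits
logits = session.run(None, inputs)[0]
print(logits.shape)
```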
## CLI Usage
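For example — the flag names below are guesses, so check the CLI Reference for the real `train-model` options:

```shell
# Hypothetical invocation; run `train-model --help` for the actual flags.
myspellchecker train-model \
  --input-file corpus.txt \
  --output-dir ./my_model \
  --epochs 5 \
  --batch-size 16
```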
A model can also be trained from the command line with the `train-model` command.

## Corpus Format
The training corpus should be a text file with one sentence per line:

- UTF-8 encoding
- One sentence per line
- Minimum 100 lines (recommended: 10,000+ lines)
- Segmented text (spaces between words) works best
## GPU Support

Training automatically uses a GPU if one is available.

### Batch Size by GPU VRAM
| GPU VRAM | Recommended batch_size |
|---|---|
| 4GB | 8 |
| 8GB | 16 |
| 16GB | 32 |
| 24GB+ | 64 |
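Whether a GPU will be picked up, and how much VRAM it has, can be checked with plain PyTorch before choosing a batch size from the table above:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"training on {device}")
if device == "cuda":
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"VRAM: {vram_gb:.1f} GB")  # match against the batch_size table
```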
## Model Size vs Quality
| Configuration | Parameters | Quality | Speed |
|---|---|---|---|
| Small (default) | ~5M | Good | Fast |
| Medium | ~20M | Better | Medium |
| Large | ~100M | Best | Slow |
## Best Practices
- Corpus Size: Use at least 10,000 sentences for meaningful results
- Batch Size: Larger batches (16-32) generally train faster on GPU
- Hidden Size: Start with 256 for small models, 512 for larger ones
- Epochs: 5-10 epochs is usually sufficient; monitor loss for overfitting
- Warmup: 10% warmup (0.1) helps training stability
- Checkpoints: Enable `keep_checkpoints=True` for long training runs
- Metrics: Always save metrics to monitor training progress
## Troubleshooting

### Memory Issues

Reduce `batch_size`, raise `gradient_accumulation_steps` to keep the effective batch size, enable `fp16`, or lower `max_length`.

### Slow Training

Confirm training is running on a GPU, enable `fp16`, and increase `batch_size` if VRAM allows. For very large corpora, `streaming=True` avoids loading everything into memory.

### Invalid hidden_size/num_heads

Choose values where `hidden_size` is evenly divisible by `num_heads` (see Architecture Constraints above).
## Neural Reranker Training
The neural reranker is a small MLP (`Linear(19→64) → ReLU → Dropout → Linear(64→1)`, ~5K parameters) that learns to re-rank spell checker suggestions using 19 extracted features. It runs as the final step in the suggestion pipeline, after N-gram and semantic reranking. See Neural Reranker for the full feature vector layout.

### Prerequisites
Requires:

- A production SQLite database (built by the data pipeline)
- A segmented Arrow IPC corpus (produced during pipeline ingestion)
- PyTorch: `pip install myspellchecker[train]`
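For reference, the MLP described above (`Linear(19→64) → ReLU → Dropout → Linear(64→1)`) can be sketched in PyTorch:

```python
import torch
import torch.nn as nn

class RerankerMLP(nn.Module):
    """Scores each candidate from its 19-dim feature vector (sketch of the described shape)."""
    def __init__(self, in_dim: int = 19, hidden_dim: int = 64, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_candidates, 19) -> (num_candidates,) raw scores
        return self.net(x).squeeze(-1)

model = RerankerMLP()
scores = model(torch.randn(5, 19))  # 5 candidates
print(scores.shape)  # torch.Size([5])
```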
### Step 1: Generate Training Data

The `RerankerDataGenerator` creates labeled training data by corrupting clean sentences and collecting spell checker candidates:
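A sketch of what that might look like — the constructor arguments, method name, and file names here are assumptions, so check the package for the real API:

```python
# Hypothetical usage; RerankerDataGenerator's actual signature may differ.
from myspellchecker.train import RerankerDataGenerator  # assumed import path

generator = RerankerDataGenerator(
    db_path="myspellchecker.db",    # production SQLite database
    corpus_path="segmented.arrow",  # segmented Arrow IPC corpus
)
generator.generate("reranker_training.jsonl")  # 19-feature vectors + gold labels
```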
### Step 2: Train the MLP
| Parameter | Default | Description |
|---|---|---|
| `epochs` | 20 | Maximum training epochs |
| `lr` | 1e-3 | Learning rate |
| `batch_size` | 64 | Batch size |
| `patience` | 5 | Early stopping patience (on validation Top-1 accuracy) |
| `val_ratio` | 0.2 | Validation split ratio |
| `hidden_dim` | 64 | MLP hidden layer dimension |
| `dropout` | 0.1 | Dropout rate |
| `max_candidates` | 20 | Maximum candidates per example |
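With a single gold candidate per list, the top-one case of the listwise (ListMLE-style) objective coincides with a softmax cross-entropy over the candidate scores. A sketch in plain PyTorch — this mirrors, but is not, the package's training loop:

```python
import torch
import torch.nn.functional as F

def listwise_loss(scores: torch.Tensor, gold_index: int) -> torch.Tensor:
    """Top-one listwise loss: negative log-softmax of the gold candidate's score."""
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([gold_index]))

scores = torch.tensor([2.0, 0.5, -1.0])  # one score per suggestion candidate
loss = listwise_loss(scores, gold_index=0)
print(float(loss))
```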
### Step 3: Use the Trained Model
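At inference time the exported artifacts are `reranker.onnx` plus `stats.json`. A sketch with `onnxruntime` — the layout of `stats.json` (per-feature mean/std) and the feature-matrix shape are assumptions:

```python
import json
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("reranker.onnx")
with open("stats.json", encoding="utf-8") as f:
    stats = json.load(f)  # assumed to hold per-feature "mean" and "std" lists

# One row of 19 features per candidate, up to max_candidates rows
features = np.zeros((20, 19), dtype=np.float32)
mean = np.array(stats["mean"], dtype=np.float32)
std = np.array(stats["std"], dtype=np.float32)
normed = (features - mean) / std

input_name = session.get_inputs()[0].name
scores = session.run(None, {input_name: normed})[0]
best = int(np.argmax(scores))  # index of the top-ranked candidate
```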
## See Also

- Semantic Checking: Using trained models for context validation
- Semantic Algorithm: Deep dive into the MLM approach
- Suggestion Ranking: Neural reranker integration
- CLI Reference: `train-model` command details
- Configuration Guide: SemanticConfig and NeuralRerankerConfig options