Pre-trained Myanmar language models are scarce and rarely cover domain-specific vocabulary. Rather than ship generic models that underperform, mySpellChecker provides training pipelines so you can build models tuned to your own corpus. The pipelines handle tokenizer creation, model training, and ONNX export end to end.
Overview
mySpellChecker provides two training pipelines:
1. Semantic Model (MLM) Training
Trains a custom Masked Language Model for semantic validation:
Raw Text → Tokenizer Training → Model Training → ONNX Export
| Stage | Output | Purpose |
|---|---|---|
| Tokenizer | tokenizer.json | Byte-Level BPE tokenizer for Myanmar |
| Model Training | PyTorch checkpoint | Masked Language Model |
| ONNX Export | model.onnx | Optimized inference model |
2. Neural Reranker Training
Trains a small MLP to re-rank spell checker suggestions using learned feature weights:
Arrow Corpus → Synthetic Errors → Candidate Collection → Feature Extraction → Train MLP → ONNX Export
| Stage | Output | Purpose |
|---|---|---|
| Data Generation | reranker_training.jsonl | 19-feature vectors per candidate with gold labels |
| MLP Training | PyTorch checkpoint | Listwise cross-entropy (ListMLE) scorer |
| ONNX Export | reranker.onnx + reranker.onnx.stats.json | Quantized model + feature normalization stats |
Prerequisites
Install the training dependencies:
pip install myspellchecker[train]
This installs:
- torch: PyTorch for model training
- transformers: HuggingFace Transformers for model architectures
- tokenizers: Fast tokenizer library
- onnx: ONNX export support
- onnxruntime: ONNX inference runtime
Quick Start
The simplest way to train a model:
from myspellchecker.training import TrainingPipeline, TrainingConfig
# Configure training
config = TrainingConfig(
input_file="corpus.txt", # One sentence per line
output_dir="./my_model",
epochs=5,
)
# Run training
pipeline = TrainingPipeline()
model_path = pipeline.run(config)
print(f"Model saved to: {model_path}")
Model Architectures
The training pipeline supports two transformer architectures:
RoBERTa (Default)
RoBERTa (Robustly Optimized BERT Pretraining Approach) is recommended for most use cases:
config = TrainingConfig(
input_file="corpus.txt",
output_dir="./roberta_model",
architecture="roberta", # Default
)
Key characteristics:
- Dynamic masking during training
- No Next Sentence Prediction (NSP) objective
- Larger batch sizes and more training data typically improve results
BERT
BERT (Bidirectional Encoder Representations from Transformers):
config = TrainingConfig(
input_file="corpus.txt",
output_dir="./bert_model",
architecture="bert",
)
Key characteristics:
- Static masking
- Includes NSP objective capability
- Well-suited for tasks requiring sentence-pair understanding
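The difference between the static and dynamic masking listed above can be illustrated with a small, library-free sketch. It is a simplification under stated assumptions: the helper names, the mask string, and the 20% ratio are hypothetical, and real MLM training also applies random-replacement and keep rules that are omitted here. It is not the masking code used by the training pipeline.

```python
import random

MASK = "<mask>"  # placeholder mask string, chosen for illustration


def mask_tokens(tokens, rng, ratio=0.2):
    """Replace a random subset of tokens with the mask string."""
    count = max(1, round(len(tokens) * ratio))
    positions = set(rng.sample(range(len(tokens)), count))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def static_masking(tokens, epochs, seed=0):
    """Mask once up front and reuse the same masked copy every epoch."""
    masked = mask_tokens(tokens, random.Random(seed))
    return [masked for _ in range(epochs)]


def dynamic_masking(tokens, epochs, seed=0):
    """Draw a fresh mask each time the sentence is served to the model."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng) for _ in range(epochs)]


tokens = [f"w{i}" for i in range(10)]
for epoch, view in enumerate(dynamic_masking(tokens, epochs=3)):
    print(epoch, " ".join(view))
```

With dynamic masking the model usually sees different masked positions for the same sentence across epochs, which is why it tends to help more on small corpora.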
Configuration Options
TrainingConfig Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_file | str | Required | Path to training corpus (one sentence per line) |
| output_dir | str | Required | Directory to save model and artifacts |
| vocab_size | int | 30,000 | Vocabulary size for BPE tokenizer |
| min_frequency | int | 2 | Minimum frequency for token inclusion |
| epochs | int | 5 | Number of training epochs |
| batch_size | int | 16 | Batch size per device |
| learning_rate | float | 5e-5 | Peak learning rate |
| hidden_size | int | 256 | Size of hidden layers |
| num_layers | int | 4 | Number of transformer layers |
| num_heads | int | 4 | Number of attention heads |
| max_length | int | 128 | Maximum sequence length |
| architecture | str | "roberta" | Model architecture ("roberta" or "bert") |
| resume_from_checkpoint | str | None | Path to checkpoint directory to resume from |
| warmup_ratio | float | 0.1 | Ratio of steps for learning rate warmup |
| weight_decay | float | 0.01 | Weight decay for optimizer |
| save_metrics | bool | True | Save training metrics to JSON file |
| keep_checkpoints | bool | False | Keep intermediate checkpoints |
| streaming | bool | False | Use streaming dataset for large corpora |
| checkpoint_dir | str | None | Persistent checkpoint directory (e.g., /opt/ml/checkpoints on SageMaker) |
| max_steps | int | None | Cap total training steps (overrides epoch-based) |
| word_boundary_aware | bool | False | Use word-boundary-aware masking |
| whole_word_masking | bool | False | Mask entire words instead of subwords |
| pos_file | str | None | POS tag file for POS-aware masking |
| denoising_ratio | float | 0.0 | Ratio of denoising corruption (0 = disabled) |
| fp16 | bool | False | Use mixed-precision (FP16) training |
| gradient_accumulation_steps | int | 1 | Steps to accumulate before optimizer step |
| lr_scheduler_type | str | "linear" | Learning rate scheduler type |
| corruption_ratio | float | 0.0 | Ratio of input corruption for denoising |
| confusable_masking | bool | False | Use confusable-aware masking (requires whole_word_masking=True) |
| confusable_mask_ratio | float | 0.3 | Ratio of masks replaced with confusable words |
| confusable_words_file | str | None | Path to confusable words list |
| embedding_surgery | bool | False | Enable embedding surgery for domain adaptation |
| embedding_warmup_steps | int | 25,000 | Warmup steps for embedding surgery |
| embedding_lr | float | 1e-3 | Learning rate for embedding layers during surgery |
Architecture Constraints
The hidden_size must be divisible by num_heads. Valid combinations include:
- hidden_size=256, num_heads=4 (64 per head)
- hidden_size=256, num_heads=8 (32 per head)
- hidden_size=512, num_heads=8 (64 per head)
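The divisibility rule can be checked before a configuration is built. The helper below is a hypothetical pre-flight check, not part of the mySpellChecker API; it only mirrors the constraint stated above and returns the per-head dimension.

```python
def attention_head_size(hidden_size: int, num_heads: int) -> int:
    """Return the per-head dimension, or raise if the split is uneven."""
    if hidden_size <= 0 or num_heads <= 0:
        raise ValueError("hidden_size and num_heads must be positive")
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size={hidden_size} is not divisible by num_heads={num_heads}"
        )
    return hidden_size // num_heads


# The three combinations listed above.
for hidden_size, num_heads in [(256, 4), (256, 8), (512, 8)]:
    print(hidden_size, num_heads, attention_head_size(hidden_size, num_heads))
```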
Learning Rate Scheduling
The training pipeline uses linear learning rate scheduling with warmup:
config = TrainingConfig(
input_file="corpus.txt",
output_dir="./model",
learning_rate=5e-5, # Peak learning rate
warmup_ratio=0.1, # 10% of steps for warmup
weight_decay=0.01, # AdamW weight decay
)
The learning rate:
- Starts at 0
- Linearly increases to learning_rate over warmup_ratio * total_steps
- Linearly decreases to 0 over remaining steps
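The schedule described above can be written as a small function for reasoning about the learning rate at a given step. This is a sketch, not the pipeline's scheduler: the function name is hypothetical, and rounding the warmup length down with int() is an assumption (the actual scheduler may round differently).

```python
def linear_schedule_lr(step, total_steps, peak_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)  # assumed rounding
    if step < warmup_steps:
        # Warmup: ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: ramp from peak_lr down to 0 over the remaining steps.
    remaining = max(total_steps - warmup_steps, 1)
    return peak_lr * max(total_steps - step, 0) / remaining


for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule_lr(step, total_steps=1000))
```

With 1,000 total steps and warmup_ratio=0.1, the rate peaks at step 100 and is back to half the peak at step 550, the midpoint of the decay phase.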
Resume Training from Checkpoint
Training can be resumed from a checkpoint if interrupted:
# Initial training
config = TrainingConfig(
input_file="corpus.txt",
output_dir="./model",
epochs=10,
keep_checkpoints=True, # Keep checkpoints for resume
)
pipeline = TrainingPipeline()
pipeline.run(config) # Interrupted at epoch 5
# Resume training
config = TrainingConfig(
input_file="corpus.txt",
output_dir="./model",
epochs=10,
resume_from_checkpoint="./model/checkpoints/checkpoint-500",
)
pipeline.run(config) # Continues from checkpoint
Checkpoints are saved every 500 steps by default.
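When resuming, it is usually the most recent checkpoint that is wanted. The sketch below picks it by step number. It assumes checkpoints are directories named checkpoint-<step> under a checkpoints folder, as in the example path above. Both helper names are hypothetical and not part of the library.

```python
import re
from pathlib import Path

_CHECKPOINT = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint_name(names):
    """Return the name with the highest step number, or None."""
    best_step, best_name = -1, None
    for name in names:
        match = _CHECKPOINT.match(name)
        if match and int(match.group(1)) > best_step:
            best_step, best_name = int(match.group(1)), name
    return best_name


def latest_checkpoint(checkpoints_dir):
    """Return the path of the newest checkpoint directory, or None."""
    root = Path(checkpoints_dir)
    if not root.is_dir():
        return None
    name = latest_checkpoint_name(p.name for p in root.iterdir() if p.is_dir())
    return str(root / name) if name else None


print(latest_checkpoint("./model/checkpoints"))
```

Steps are compared as integers because a plain string sort would rank checkpoint-500 above checkpoint-1500. The returned path can be passed as resume_from_checkpoint.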
Training Metrics
When save_metrics=True (default), training metrics are saved to training_metrics.json:
[
{"step": 50, "epoch": 0.5, "loss": 8.234, "learning_rate": 2.5e-5},
{"step": 100, "epoch": 1.0, "loss": 6.891, "learning_rate": 5e-5},
...
]
Metrics include:
- step: Global training step
- epoch: Current epoch (fractional)
- loss: Training loss
- learning_rate: Current learning rate
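A quick way to inspect a run is to load the metrics file and compare the first, last, and best loss. The sketch below assumes the JSON layout shown above (a list of records with step and loss keys) and that the file sits in the output directory; the function names are illustrative.

```python
import json


def summarize_metrics(records):
    """Summarize the loss curve from a list of metric records."""
    losses = [(r["loss"], r["step"]) for r in records if "loss" in r]
    if not losses:
        return None
    best_loss, best_step = min(losses)
    return {
        "first_loss": losses[0][0],
        "last_loss": losses[-1][0],
        "best_loss": best_loss,
        "best_step": best_step,
    }


def load_metrics(path):
    """Read a metrics file written with save_metrics=True."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


sample = [
    {"step": 50, "epoch": 0.5, "loss": 8.234, "learning_rate": 2.5e-5},
    {"step": 100, "epoch": 1.0, "loss": 6.891, "learning_rate": 5e-5},
]
print(summarize_metrics(sample))
```

The metrics file records training loss only. If the best loss is far from the last step, the run may have diverged or the learning rate may be too high, and the loss is worth plotting.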
Low-Level API
For more control, use ModelTrainer directly:
from myspellchecker.training import ModelTrainer, ModelArchitecture
trainer = ModelTrainer()
# Step 1: Train tokenizer
tokenizer_path = trainer.train_tokenizer(
corpus_path="corpus.txt",
output_dir="./tokenizer",
vocab_size=30000,
)
# Step 2: Train model
model_path = trainer.train_model(
corpus_path="corpus.txt",
tokenizer_path=tokenizer_path,
output_dir="./model",
architecture=ModelArchitecture.ROBERTA,
epochs=5,
warmup_ratio=0.1,
save_metrics=True,
)
ONNX Export
The training pipeline exports models to ONNX format with INT8 quantization automatically. To export an existing PyTorch checkpoint yourself, use ONNXExporter:
from myspellchecker.training import ONNXExporter
exporter = ONNXExporter()
exporter.export(
model_dir="./pytorch_model",
output_dir="./onnx_model",
quantize=True, # INT8 quantization
)
The exported ONNX model can be used with SemanticChecker for context-aware validation.
Output Files:
- model.onnx: Quantized model (default)
- model.base.onnx: Original FP32 model
- tokenizer.json: Copied for convenience
Using Trained Models
With SemanticChecker
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig
config = SpellCheckerConfig(
semantic=SemanticConfig(
model_path="./models/model.onnx",
tokenizer_path="./models/tokenizer.json",
)
)
checker = SpellChecker(config=config)
result = checker.check("မြန်မာစာ")
Standalone Inference
import onnxruntime as ort
from transformers import PreTrainedTokenizerFast
# Load model and tokenizer
session = ort.InferenceSession("./models/model.onnx")
tokenizer = PreTrainedTokenizerFast(tokenizer_file="./models/tokenizer.json")
# Prepare input
text = "မြန်မာ<mask>သည်"
inputs = tokenizer(text, return_tensors="np")
# Run inference
outputs = session.run(
["logits"],
{
"input_ids": inputs["input_ids"],
"attention_mask": inputs["attention_mask"],
}
)
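The inference call above returns raw logits. To turn them into ranked predictions for the masked position, the scores for that position can be normalized with a softmax and sorted. The helper below is a plain-Python sketch. Its name is hypothetical, and the commented lines that connect it to the session assume the logits have shape (batch, sequence, vocabulary) and that <mask> is the tokenizer's mask token, as in the example text.

```python
import math


def top_k_predictions(scores, k=5):
    """Return the k highest-scoring (token_id, probability) pairs."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, exps[i] / total) for i in ranked[:k]]


# Toy vocabulary of three tokens.
print(top_k_predictions([0.0, 2.0, 1.0], k=2))

# With the session above (assumptions noted in the lead-in):
# mask_id = tokenizer.convert_tokens_to_ids("<mask>")
# mask_index = inputs["input_ids"][0].tolist().index(mask_id)
# scores = outputs[0][0, mask_index].tolist()
# for token_id, prob in top_k_predictions(scores):
#     print(tokenizer.decode([token_id]), round(prob, 4))
```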
CLI Usage
Train a model via CLI:
# Basic training
myspellchecker train-model -i corpus.txt -o ./models/
# With custom parameters
myspellchecker train-model -i corpus.txt -o ./models/ \
--architecture roberta \
--epochs 10 \
--hidden-size 512 \
--layers 6 \
--heads 8 \
--learning-rate 3e-5
# Resume from checkpoint
myspellchecker train-model -i corpus.txt -o ./models/ \
--resume ./models/checkpoints/checkpoint-500
The training corpus should be a text file with one sentence per line:
ကျွန်တော် မြန်မာ စာ လေ့လာ နေ ပါ တယ်
သူမ က စာအုပ် ဖတ် နေ တယ်
ဒီ နေ့ ရာသီ ဥတု ကောင်း တယ်
Requirements:
- UTF-8 encoding
- One sentence per line
- Minimum 100 lines (recommended: 10,000+ lines)
- Segmented text (spaces between words) works best
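The requirements above can be checked before a long training run. The following is a hypothetical pre-flight script, not a mySpellChecker command. It counts non-empty lines, flags lines without spaces as possibly unsegmented, and relies on strict UTF-8 decoding to surface encoding problems.

```python
def check_corpus_lines(lines, min_lines=100, recommended_lines=10_000):
    """Count sentences and flag lines that look unsegmented."""
    total = empty = unsegmented = 0
    for line in lines:
        text = line.strip()
        if not text:
            empty += 1
            continue
        total += 1
        if " " not in text:
            unsegmented += 1  # no spaces: probably not word-segmented
    return {
        "sentences": total,
        "empty_lines": empty,
        "unsegmented": unsegmented,
        "meets_minimum": total >= min_lines,
        "meets_recommended": total >= recommended_lines,
    }


def check_corpus(path, **limits):
    """Check a corpus file; invalid UTF-8 raises UnicodeDecodeError."""
    with open(path, encoding="utf-8", errors="strict") as handle:
        return check_corpus_lines(handle, **limits)


print(check_corpus_lines(["a b c\n", "\n", "abc\n"]))
```

Calling check_corpus("corpus.txt") runs the same checks on a file. A high unsegmented count suggests running word segmentation first.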
GPU Support
Training automatically uses GPU if available:
import torch

print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
Batch Size by GPU VRAM
| GPU VRAM | Recommended batch_size |
|---|---|
| 4GB | 8 |
| 8GB | 16 |
| 16GB | 32 |
| 24GB+ | 64 |
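The table can be encoded as a lookup so that a script picks a starting batch size from the detected memory. This is a sketch: the function name is hypothetical, and the fallback of 4 for cards below 4GB is an assumption borrowed from the memory troubleshooting example, not a documented recommendation.

```python
# (minimum VRAM in GB, recommended batch_size), largest tier first.
VRAM_TO_BATCH_SIZE = [(24, 64), (16, 32), (8, 16), (4, 8)]


def recommended_batch_size(vram_gb, fallback=4):
    """Map available VRAM to the batch size tier from the table above."""
    for min_vram, batch_size in VRAM_TO_BATCH_SIZE:
        if vram_gb >= min_vram:
            return batch_size
    return fallback  # assumed value for cards smaller than 4GB


for vram in (4, 8, 12, 16, 24, 48):
    print(vram, recommended_batch_size(vram))
```

With PyTorch, the total memory can be read from torch.cuda.get_device_properties(0).total_memory. The reported value is typically a little below the marketed size, so round it to the nearest GB before the lookup.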
For CPU-only training:
# Training will automatically fall back to CPU if no GPU available
config = TrainingConfig(
input_file="corpus.txt",
output_dir="./model",
batch_size=8, # Reduce batch size for CPU
)
Model Size vs Quality
| Configuration | Parameters | Quality | Speed |
|---|---|---|---|
| Small (default) | ~5M | Good | Fast |
| Medium | ~20M | Better | Medium |
| Large | ~100M | Best | Slow |
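# Small (default)
config = TrainingConfig(input_file="corpus.txt", output_dir="./model", hidden_size=256, num_layers=4, num_heads=4)
# Medium
config = TrainingConfig(input_file="corpus.txt", output_dir="./model", hidden_size=512, num_layers=6, num_heads=8)
# Large
config = TrainingConfig(input_file="corpus.txt", output_dir="./model", hidden_size=768, num_layers=12, num_heads=12)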
Best Practices
- Corpus Size: Use at least 10,000 sentences for meaningful results
- Batch Size: Larger batches (16-32) generally train faster on GPU
- Hidden Size: Start with 256 for small models, 512 for larger ones
- Epochs: 5-10 epochs is usually sufficient; monitor loss for overfitting
- Warmup: 10% warmup (0.1) helps training stability
- Checkpoints: Enable keep_checkpoints=True for long training runs
- Metrics: Always save metrics to monitor training progress
Troubleshooting
Memory Issues
# Reduce batch size and max_length
config = TrainingConfig(
input_file="corpus.txt",
output_dir="./model",
batch_size=4,
max_length=64,
)
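Cutting batch_size to fit in memory also shrinks the number of examples behind each optimizer step. The gradient_accumulation_steps parameter from the configuration table can compensate for this. The sketch below computes how many accumulation steps restore a target effective batch size, using the usual relation effective batch = batch_size × accumulation steps × devices. The helper name is hypothetical.

```python
import math


def accumulation_steps(target_batch_size, per_device_batch_size, num_devices=1):
    """Accumulation steps needed to reach at least the target batch size."""
    if min(target_batch_size, per_device_batch_size, num_devices) <= 0:
        raise ValueError("all arguments must be positive")
    per_step = per_device_batch_size * num_devices
    return math.ceil(target_batch_size / per_step)


# Keep the default effective batch of 16 while fitting only 4 per step.
print(accumulation_steps(target_batch_size=16, per_device_batch_size=4))
```

For example, batch_size=4 with gradient_accumulation_steps=4 keeps the effective batch at the default of 16 while holding only four sequences in memory at a time.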
Slow Training
# Check GPU availability
import torch
print(torch.cuda.is_available())
# Reduce model complexity
config = TrainingConfig(
input_file="corpus.txt",
output_dir="./model",
hidden_size=128,
num_layers=2,
)
Invalid hidden_size/num_heads
# hidden_size must be divisible by num_heads
# This will raise ValueError:
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./model",
    hidden_size=256,
    num_heads=3,  # Error: 256 not divisible by 3
)
# Valid configuration:
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./model",
    hidden_size=256,
    num_heads=4,  # OK: 256 / 4 = 64
)
Neural Reranker Training
The neural reranker is a small MLP (Linear(19→64)→ReLU→Dropout→Linear(64→1), ~5K parameters) that learns to re-rank spell checker suggestions using 19 extracted features. It runs as the final step in the suggestion pipeline after N-gram and semantic reranking. See Neural Reranker for the full feature vector layout.
Prerequisites
Requires:
- A production SQLite database (built by the data pipeline)
- A segmented Arrow IPC corpus (produced during pipeline ingestion)
- PyTorch:
pip install myspellchecker[train]
Step 1: Generate Training Data
The RerankerDataGenerator creates labeled training data by corrupting clean sentences and collecting spell checker candidates:
from myspellchecker.training.reranker_data import RerankerDataGenerator
generator = RerankerDataGenerator(
db_path="data/mySpellChecker_production.db",
arrow_corpus_path="data/segmented_corpus.arrow",
)
# Generate training data (single-threaded)
generator.generate(
num_examples=100_000,
output_path="data/reranker_training.jsonl",
)
For large-scale generation, use the threaded entry point:
from myspellchecker.training.reranker_data import generate_threaded
stats = generate_threaded(
db_path="data/mySpellChecker_production.db",
arrow_corpus_path="data/segmented_corpus.arrow",
output_path="data/reranker_training_100k.jsonl",
num_examples=100_000,
)
Each JSONL line contains 19 features per candidate (edit distance, frequency, phonetic similarity, N-gram context, confusable status, source indicators, etc.) plus the gold correction index. See Neural Reranker for the full feature layout.
Step 2: Train the MLP
from myspellchecker.training.reranker_trainer import RerankerTrainer
trainer = RerankerTrainer("data/reranker_training.jsonl")
metrics = trainer.train(epochs=20)
# Export to ONNX
trainer.export_onnx("models/reranker-v1/reranker.onnx")
# Outputs: reranker.onnx + reranker.onnx.stats.json
Training parameters:
| Parameter | Default | Description |
|---|---|---|
| epochs | 20 | Maximum training epochs |
| lr | 1e-3 | Learning rate |
| batch_size | 64 | Batch size |
| patience | 5 | Early stopping patience (on validation Top-1 accuracy) |
| val_ratio | 0.2 | Validation split ratio |
| hidden_dim | 64 | MLP hidden layer dimension |
| dropout | 0.1 | Dropout rate |
| max_candidates | 20 | Maximum candidates per example |
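To make the training objective and the early-stopping metric concrete, here is a plain-Python sketch of a listwise cross-entropy over one example's candidate scores, together with Top-1 accuracy. It assumes a softmax over the candidates and a single gold index per example. The function names are illustrative, and the trainer's actual loss implementation may differ.

```python
import math


def listwise_cross_entropy(scores, gold_index):
    """Negative log-probability of the gold candidate under a softmax."""
    peak = max(scores)  # log-sum-exp with max subtraction for stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[gold_index]


def top1_accuracy(examples):
    """Fraction of examples whose highest-scoring candidate is the gold one."""
    hits = 0
    for scores, gold_index in examples:
        predicted = max(range(len(scores)), key=scores.__getitem__)
        hits += predicted == gold_index
    return hits / len(examples)


# The loss is low when the gold candidate scores highest, high otherwise.
print(listwise_cross_entropy([5.0, 0.0, 0.0], gold_index=0))
print(listwise_cross_entropy([0.0, 5.0, 0.0], gold_index=0))
print(top1_accuracy([([0.1, 0.9], 1), ([0.8, 0.2], 1)]))
```

Early stopping with patience=5 would then halt training once validation Top-1 accuracy has not improved for five consecutive epochs.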
CLI alternative:
python -m myspellchecker.training.reranker_trainer \
--train data/reranker_training_100k.jsonl \
--output models/reranker-v1/ \
--epochs 20 --lr 1e-3 --batch-size 64
Step 3: Use the Trained Model
from myspellchecker.core.config import SpellCheckerConfig, NeuralRerankerConfig
config = SpellCheckerConfig(
neural_reranker=NeuralRerankerConfig(
enabled=True,
model_path="models/reranker-v1/reranker.onnx",
stats_path="models/reranker-v1/reranker.onnx.stats.json",
),
)
See Neural Reranker for inference details and the feature vector specification.
See Also