## Overview

mySpellChecker provides two training pipelines:

### 1. Semantic Model (MLM) Training
Trains a custom Masked Language Model for semantic validation:

| Stage | Output | Purpose |
|---|---|---|
| Tokenizer | tokenizer.json | Byte-Level BPE tokenizer for Myanmar |
| Model Training | PyTorch checkpoint | Masked Language Model |
| ONNX Export | model.onnx | Optimized inference model |
### 2. Neural Reranker Training
Trains a small MLP to re-rank spell checker suggestions using learned feature weights:

| Stage | Output | Purpose |
|---|---|---|
| Data Generation | reranker_training.jsonl | 19-feature vectors per candidate with gold labels |
| MLP Training | PyTorch checkpoint | Listwise cross-entropy (ListMLE) scorer |
| ONNX Export | reranker.onnx + stats.json | Quantized model + feature normalization stats |
## Prerequisites

Install the training dependencies:

- `torch` - PyTorch for model training
- `transformers` - HuggingFace Transformers for model architectures
- `tokenizers` - Fast tokenizer library
- `onnx` - ONNX export support
- `onnxruntime` - ONNX inference runtime
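All five are published on PyPI under those names, so a plain pip install covers them:

```shell
pip install torch transformers tokenizers onnx onnxruntime
```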
## Quick Start
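A minimal run might look like the following sketch. `TrainingConfig` and its fields come from the parameter table below; the `train_model` entry point and the import path are assumptions, so check the package's API for the exact names.

```python
# Hypothetical quick start; train_model and the import path are assumptions.
from myspellchecker.train import TrainingConfig, train_model  # assumed import path

config = TrainingConfig(
    input_file="corpus.txt",   # one sentence per line, UTF-8
    output_dir="./my_model",   # tokenizer.json, checkpoints, model.onnx land here
    vocab_size=30_000,
    epochs=5,
    batch_size=16,
    architecture="roberta",    # or "bert"
)
train_model(config)
```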
The simplest way to train a model is to construct a `TrainingConfig` and run the training entry point.

## Model Architectures
The training pipeline supports two transformer architectures:

### RoBERTa (Default)

RoBERTa (Robustly Optimized BERT Pretraining Approach) is recommended for most use cases:

- Dynamic masking during training
- No Next Sentence Prediction (NSP) objective
- Larger batch sizes and more training data typically improve results
### BERT

BERT (Bidirectional Encoder Representations from Transformers):

- Static masking
- Includes NSP objective capability
- Well-suited for tasks requiring sentence-pair understanding
## Configuration Options

### TrainingConfig Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `input_file` | str | Required | Path to training corpus (one sentence per line) |
| `output_dir` | str | Required | Directory to save model and artifacts |
| `vocab_size` | int | 30,000 | Vocabulary size for BPE tokenizer |
| `min_frequency` | int | 2 | Minimum frequency for token inclusion |
| `epochs` | int | 5 | Number of training epochs |
| `batch_size` | int | 16 | Batch size per device |
| `learning_rate` | float | 5e-5 | Peak learning rate |
| `hidden_size` | int | 256 | Size of hidden layers |
| `num_layers` | int | 4 | Number of transformer layers |
| `num_heads` | int | 4 | Number of attention heads |
| `max_length` | int | 128 | Maximum sequence length |
| `architecture` | str | "roberta" | Model architecture ("roberta" or "bert") |
| `resume_from_checkpoint` | str | None | Path to checkpoint directory to resume from |
| `warmup_ratio` | float | 0.1 | Ratio of steps for learning rate warmup |
| `weight_decay` | float | 0.01 | Weight decay for optimizer |
| `save_metrics` | bool | True | Save training metrics to JSON file |
| `keep_checkpoints` | bool | False | Keep intermediate checkpoints |
| `streaming` | bool | False | Use streaming dataset for large corpora |
| `checkpoint_dir` | str | None | Persistent checkpoint directory (e.g., /opt/ml/checkpoints on SageMaker) |
| `max_steps` | int | None | Cap total training steps (overrides epoch-based training) |
| `word_boundary_aware` | bool | False | Use word-boundary-aware masking |
| `whole_word_masking` | bool | False | Mask entire words instead of subwords |
| `pos_file` | str | None | POS tag file for POS-aware masking |
| `denoising_ratio` | float | 0.0 | Ratio of denoising corruption (0 = disabled) |
| `fp16` | bool | False | Use mixed-precision (FP16) training |
| `gradient_accumulation_steps` | int | 1 | Steps to accumulate before optimizer step |
| `lr_scheduler_type` | str | "linear" | Learning rate scheduler type |
| `corruption_ratio` | float | 0.0 | Ratio of input corruption for denoising |
| `confusable_masking` | bool | False | Use confusable-aware masking (requires `whole_word_masking=True`) |
| `confusable_mask_ratio` | float | 0.3 | Ratio of masks replaced with confusable words |
| `confusable_words_file` | str | None | Path to confusable words list |
| `embedding_surgery` | bool | False | Enable embedding surgery for domain adaptation |
| `embedding_warmup_steps` | int | 25,000 | Warmup steps for embedding surgery |
| `embedding_lr` | float | 1e-3 | Learning rate for embedding layers during surgery |
### Architecture Constraints

The `hidden_size` must be divisible by `num_heads`. Valid combinations include:

- `hidden_size=256`, `num_heads=4` (64 per head)
- `hidden_size=256`, `num_heads=8` (32 per head)
- `hidden_size=512`, `num_heads=8` (64 per head)
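The constraint is easy to check before launching a run; a minimal sketch:

```python
def validate_heads(hidden_size: int, num_heads: int) -> int:
    """Return the per-head dimension, or raise if the split is invalid."""
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size={hidden_size} is not divisible by num_heads={num_heads}"
        )
    return hidden_size // num_heads

print(validate_heads(256, 4))  # 64 dimensions per head
```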
## Learning Rate Scheduling

The training pipeline uses linear learning rate scheduling with warmup:

- Starts at 0
- Linearly increases to `learning_rate` over `warmup_ratio * total_steps` steps
- Linearly decreases to 0 over the remaining steps
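The schedule above can be expressed as a small function (a sketch for intuition; the actual scheduler lives inside the trainer):

```python
def linear_warmup_lr(step: int, total_steps: int, peak_lr: float,
                     warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then linear decay back to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: fraction of post-warmup steps still remaining
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

print(linear_warmup_lr(100, 1000, 5e-5))   # end of warmup: peak LR
print(linear_warmup_lr(1000, 1000, 5e-5))  # final step: 0.0
```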
## Resume Training from Checkpoint

Training can be resumed from an interrupted run by pointing `resume_from_checkpoint` at the saved checkpoint directory.

## Training Metrics
When `save_metrics=True` (the default), training metrics are saved to `training_metrics.json`:

- `step`: Global training step
- `epoch`: Current epoch (fractional)
- `loss`: Training loss
- `learning_rate`: Current learning rate
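A quick way to inspect such a file (a sketch that assumes the file holds a JSON list of records with the fields above; a synthetic file stands in for a real run):

```python
import json

# Synthetic stand-in for a real training_metrics.json
records = [
    {"step": 100, "epoch": 0.5, "loss": 4.2, "learning_rate": 2.5e-5},
    {"step": 200, "epoch": 1.0, "loss": 3.1, "learning_rate": 5e-5},
]
with open("training_metrics.json", "w", encoding="utf-8") as f:
    json.dump(records, f)

with open("training_metrics.json", encoding="utf-8") as f:
    metrics = json.load(f)

final = metrics[-1]
print(f"final step={final['step']} loss={final['loss']:.2f}")
```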
## Low-Level API

For more control, use `ModelTrainer` directly.
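For illustration only — the constructor and method names below are assumptions about `ModelTrainer`'s API (modeled on the three pipeline stages from the overview), so consult the package source for the real signatures:

```python
# Hypothetical low-level usage; ModelTrainer's actual API may differ.
from myspellchecker.train import ModelTrainer, TrainingConfig  # assumed import path

config = TrainingConfig(input_file="corpus.txt", output_dir="./my_model")
trainer = ModelTrainer(config)
trainer.train_tokenizer()   # stage 1: tokenizer.json
trainer.train()             # stage 2: MLM checkpoint
trainer.export_onnx()       # stage 3: model.onnx
```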
## ONNX Export

Models are automatically exported to ONNX format with INT8 quantization; the exported model is what `SemanticChecker` loads for context-aware validation.

Output files:

- `model.onnx` - Quantized model (default)
- `model.base.onnx` - Original FP32 model
- `tokenizer.json` - Copied for convenience
## Using Trained Models

### With SemanticChecker
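A sketch of wiring the training output into the checker (the import path and constructor argument are assumptions):

```python
# Hypothetical usage; the import path and argument name are assumptions.
from myspellchecker import SemanticChecker  # assumed import path

# Point the checker at the training output directory (model.onnx + tokenizer.json)
checker = SemanticChecker(model_dir="./my_model")
```

See the Semantic Checking page (linked under See Also) for the validation API itself.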
### Standalone Inference
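Outside the library, the exported model can be driven directly with `onnxruntime` and `tokenizers`. A sketch — the ONNX input names follow the usual HuggingFace export convention here, which is an assumption; adjust if the exported graph differs:

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("./my_model/tokenizer.json")
session = ort.InferenceSession("./my_model/model.onnx")

text = "..."  # a segmented Myanmar sentence
enc = tokenizer.encode(text)
inputs = {
    "input_ids": np.array([enc.ids], dtype=np.int64),
    "attention_mask": np.array([enc.attention_mask], dtype=np.int64),
}
# MLM head output: (1, seq_len, vocab_size) logits
logits = session.run(None, inputs)[0]
print(logits.shape)
```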
## CLI Usage
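For example — the flag names below are guesses, so check the CLI Reference for the real `train-model` options:

```shell
# Hypothetical invocation; run `train-model --help` for the actual flags.
myspellchecker train-model \
  --input-file corpus.txt \
  --output-dir ./my_model \
  --epochs 5 \
  --batch-size 16
```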
A model can also be trained from the command line with the `train-model` command.

## Corpus Format
The training corpus should be a text file with one sentence per line:

- UTF-8 encoding
- One sentence per line
- Minimum 100 lines (recommended: 10,000+ lines)
- Segmented text (spaces between words) works best
## GPU Support

Training automatically uses a GPU if one is available.

### Batch Size by GPU VRAM
| GPU VRAM | Recommended batch_size |
|---|---|
| 4GB | 8 |
| 8GB | 16 |
| 16GB | 32 |
| 24GB+ | 64 |
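Whether a GPU will be picked up, and how much VRAM it has, can be checked with plain PyTorch before choosing a batch size from the table above:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"training on {device}")
if device == "cuda":
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"VRAM: {vram_gb:.1f} GB")  # match against the batch_size table
```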
## Model Size vs Quality
| Configuration | Parameters | Quality | Speed |
|---|---|---|---|
| Small (default) | ~5M | Good | Fast |
| Medium | ~20M | Better | Medium |
| Large | ~100M | Best | Slow |
## Best Practices
- Corpus Size: Use at least 10,000 sentences for meaningful results
- Batch Size: Larger batches (16-32) generally train faster on GPU
- Hidden Size: Start with 256 for small models, 512 for larger ones
- Epochs: 5-10 epochs is usually sufficient; monitor loss for overfitting
- Warmup: 10% warmup (0.1) helps training stability
- Checkpoints: Enable `keep_checkpoints=True` for long training runs
- Metrics: Always save metrics to monitor training progress
## Troubleshooting

### Memory Issues

Reduce `batch_size`, raise `gradient_accumulation_steps` to keep the effective batch size, enable `fp16`, or lower `max_length`.

### Slow Training

Confirm training is running on a GPU, enable `fp16`, and increase `batch_size` if VRAM allows. For very large corpora, `streaming=True` avoids loading everything into memory.

### Invalid hidden_size/num_heads

Choose values where `hidden_size` is evenly divisible by `num_heads` (see Architecture Constraints above).
## Neural Reranker Training
The neural reranker is a small MLP (`Linear(19→64) → ReLU → Dropout → Linear(64→1)`, ~5K parameters) that learns to re-rank spell checker suggestions using 19 extracted features. It runs as the final step in the suggestion pipeline, after N-gram and semantic reranking. See Neural Reranker for the full feature vector layout.

### Prerequisites
Requires:

- A production SQLite database (built by the data pipeline)
- A segmented Arrow IPC corpus (produced during pipeline ingestion)
- PyTorch: `pip install myspellchecker[train]`
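For reference, the MLP described above (`Linear(19→64) → ReLU → Dropout → Linear(64→1)`) can be sketched in PyTorch:

```python
import torch
import torch.nn as nn

class RerankerMLP(nn.Module):
    """Scores each candidate from its 19-dim feature vector (sketch of the described shape)."""
    def __init__(self, in_dim: int = 19, hidden_dim: int = 64, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_candidates, 19) -> (num_candidates,) raw scores
        return self.net(x).squeeze(-1)

model = RerankerMLP()
scores = model(torch.randn(5, 19))  # 5 candidates
print(scores.shape)  # torch.Size([5])
```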
### Step 1: Generate Training Data

The `RerankerDataGenerator` creates labeled training data by corrupting clean sentences and collecting spell checker candidates:
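A sketch of what that might look like — the constructor arguments, method name, and file names here are assumptions, so check the package for the real API:

```python
# Hypothetical usage; RerankerDataGenerator's actual signature may differ.
from myspellchecker.train import RerankerDataGenerator  # assumed import path

generator = RerankerDataGenerator(
    db_path="myspellchecker.db",    # production SQLite database
    corpus_path="segmented.arrow",  # segmented Arrow IPC corpus
)
generator.generate("reranker_training.jsonl")  # 19-feature vectors + gold labels
```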
### Step 2: Train the MLP
| Parameter | Default | Description |
|---|---|---|
| `epochs` | 20 | Maximum training epochs |
| `lr` | 1e-3 | Learning rate |
| `batch_size` | 64 | Batch size |
| `patience` | 5 | Early stopping patience (on validation Top-1 accuracy) |
| `val_ratio` | 0.2 | Validation split ratio |
| `hidden_dim` | 64 | MLP hidden layer dimension |
| `dropout` | 0.1 | Dropout rate |
| `max_candidates` | 20 | Maximum candidates per example |
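With a single gold candidate per list, the top-one case of the listwise (ListMLE-style) objective coincides with a softmax cross-entropy over the candidate scores. A sketch in plain PyTorch — this mirrors, but is not, the package's training loop:

```python
import torch
import torch.nn.functional as F

def listwise_loss(scores: torch.Tensor, gold_index: int) -> torch.Tensor:
    """Top-one listwise loss: negative log-softmax of the gold candidate's score."""
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([gold_index]))

scores = torch.tensor([2.0, 0.5, -1.0])  # one score per suggestion candidate
loss = listwise_loss(scores, gold_index=0)
print(float(loss))
```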
### Step 3: Use the Trained Model
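At inference time the exported artifacts are `reranker.onnx` plus `stats.json`. A sketch with `onnxruntime` — the layout of `stats.json` (per-feature mean/std) and the feature-matrix shape are assumptions:

```python
import json
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("reranker.onnx")
with open("stats.json", encoding="utf-8") as f:
    stats = json.load(f)  # assumed to hold per-feature "mean" and "std" lists

# One row of 19 features per candidate, up to max_candidates rows
features = np.zeros((20, 19), dtype=np.float32)
mean = np.array(stats["mean"], dtype=np.float32)
std = np.array(stats["std"], dtype=np.float32)
normed = (features - mean) / std

input_name = session.get_inputs()[0].name
scores = session.run(None, {input_name: normed})[0]
best = int(np.argmax(scores))  # index of the top-ranked candidate
```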
## See Also

- Semantic Checking: Using trained models for context validation
- Semantic Algorithm: Deep dive into the MLM approach
- Suggestion Ranking: Neural reranker integration
- CLI Reference: `train-model` command details
- Configuration Guide: SemanticConfig and NeuralRerankerConfig options