Pre-trained Myanmar language models are scarce and rarely cover domain-specific vocabulary. Instead of shipping generic models that underperform, mySpellChecker provides training pipelines so you can build models tuned to your exact corpus, handling tokenizer creation, model training, and ONNX export end-to-end.

Overview

mySpellChecker provides two training pipelines:

1. Semantic Model (MLM) Training

Trains a custom Masked Language Model for semantic validation:
Raw Text → Tokenizer Training → Model Training → ONNX Export
| Stage | Output | Purpose |
|---|---|---|
| Tokenizer | tokenizer.json | Byte-Level BPE tokenizer for Myanmar |
| Model Training | PyTorch checkpoint | Masked Language Model |
| ONNX Export | model.onnx | Optimized inference model |

2. Neural Reranker Training

Trains a small MLP to re-rank spell checker suggestions using learned feature weights:
Arrow Corpus → Synthetic Errors → Candidate Collection → Feature Extraction → Train MLP → ONNX Export
| Stage | Output | Purpose |
|---|---|---|
| Data Generation | reranker_training.jsonl | 19-feature vectors per candidate with gold labels |
| MLP Training | PyTorch checkpoint | Listwise cross-entropy (ListMLE) scorer |
| ONNX Export | reranker.onnx + stats.json | Quantized model + feature normalization stats |

Prerequisites

Install the training dependencies:
pip install myspellchecker[train]
This installs:
  • torch - PyTorch for model training
  • transformers - HuggingFace Transformers for model architectures
  • tokenizers - Fast tokenizer library
  • onnx - ONNX export support
  • onnxruntime - ONNX inference runtime

Quick Start

The simplest way to train a model:
from myspellchecker.training import TrainingPipeline, TrainingConfig

# Configure training
config = TrainingConfig(
    input_file="corpus.txt",  # One sentence per line
    output_dir="./my_model",
    epochs=5,
)

# Run training
pipeline = TrainingPipeline()
model_path = pipeline.run(config)
print(f"Model saved to: {model_path}")

Model Architectures

The training pipeline supports two transformer architectures:

RoBERTa (Default)

RoBERTa (Robustly Optimized BERT Pretraining Approach) is recommended for most use cases:
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./roberta_model",
    architecture="roberta",  # Default
)
Key characteristics:
  • Dynamic masking during training
  • No Next Sentence Prediction (NSP) objective
  • Larger batch sizes and more training data typically improve results

BERT

BERT (Bidirectional Encoder Representations from Transformers):
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./bert_model",
    architecture="bert",
)
Key characteristics:
  • Static masking
  • Includes NSP objective capability
  • Well-suited for tasks requiring sentence-pair understanding

Configuration Options

TrainingConfig Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| input_file | str | Required | Path to training corpus (one sentence per line) |
| output_dir | str | Required | Directory to save model and artifacts |
| vocab_size | int | 30,000 | Vocabulary size for BPE tokenizer |
| min_frequency | int | 2 | Minimum frequency for token inclusion |
| epochs | int | 5 | Number of training epochs |
| batch_size | int | 16 | Batch size per device |
| learning_rate | float | 5e-5 | Peak learning rate |
| hidden_size | int | 256 | Size of hidden layers |
| num_layers | int | 4 | Number of transformer layers |
| num_heads | int | 4 | Number of attention heads |
| max_length | int | 128 | Maximum sequence length |
| architecture | str | "roberta" | Model architecture ("roberta" or "bert") |
| resume_from_checkpoint | str | None | Path to checkpoint directory to resume from |
| warmup_ratio | float | 0.1 | Ratio of steps for learning rate warmup |
| weight_decay | float | 0.01 | Weight decay for optimizer |
| save_metrics | bool | True | Save training metrics to JSON file |
| keep_checkpoints | bool | False | Keep intermediate checkpoints |
| streaming | bool | False | Use streaming dataset for large corpora |
| checkpoint_dir | str | None | Persistent checkpoint directory (e.g., /opt/ml/checkpoints on SageMaker) |
| max_steps | int | None | Cap total training steps (overrides epoch-based) |
| word_boundary_aware | bool | False | Use word-boundary-aware masking |
| whole_word_masking | bool | False | Mask entire words instead of subwords |
| pos_file | str | None | POS tag file for POS-aware masking |
| denoising_ratio | float | 0.0 | Ratio of denoising corruption (0 = disabled) |
| fp16 | bool | False | Use mixed-precision (FP16) training |
| gradient_accumulation_steps | int | 1 | Steps to accumulate before optimizer step |
| lr_scheduler_type | str | "linear" | Learning rate scheduler type |
| corruption_ratio | float | 0.0 | Ratio of input corruption for denoising |
| confusable_masking | bool | False | Use confusable-aware masking (requires whole_word_masking=True) |
| confusable_mask_ratio | float | 0.3 | Ratio of masks replaced with confusable words |
| confusable_words_file | str | None | Path to confusable words list |
| embedding_surgery | bool | False | Enable embedding surgery for domain adaptation |
| embedding_warmup_steps | int | 25,000 | Warmup steps for embedding surgery |
| embedding_lr | float | 1e-3 | Learning rate for embedding layers during surgery |

Architecture Constraints

The hidden_size must be divisible by num_heads. Valid combinations include:
  • hidden_size=256, num_heads=4 (64 per head)
  • hidden_size=256, num_heads=8 (32 per head)
  • hidden_size=512, num_heads=8 (64 per head)
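This constraint is cheap to verify before launching a run. A minimal sketch in plain Python (the helper and its error message are illustrative, not part of the library API):

```python
def check_head_dims(hidden_size: int, num_heads: int) -> int:
    """Validate that hidden_size divides evenly across attention heads.

    Returns the per-head dimension, mirroring the divisibility rule above.
    """
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size={hidden_size} is not divisible by num_heads={num_heads}"
        )
    return hidden_size // num_heads

print(check_head_dims(256, 4))  # 64 per head
print(check_head_dims(512, 8))  # 64 per head
```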

Learning Rate Scheduling

The training pipeline uses linear learning rate scheduling with warmup:
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./model",
    learning_rate=5e-5,     # Peak learning rate
    warmup_ratio=0.1,       # 10% of steps for warmup
    weight_decay=0.01,      # AdamW weight decay
)
The learning rate:
  1. Starts at 0
  2. Linearly increases to learning_rate over warmup_ratio * total_steps
  3. Linearly decreases to 0 over remaining steps
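The three phases above are easy to reproduce for inspection or plotting. A minimal sketch of the same linear warmup/decay curve, independent of the library:

```python
def linear_warmup_lr(step: int, total_steps: int, peak_lr: float = 5e-5,
                     warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given step under linear warmup + linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Phases 1-2: ramp linearly from 0 up to the peak learning rate
        return peak_lr * step / warmup_steps
    # Phase 3: decay linearly from the peak back down to 0
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

total = 1000
print(linear_warmup_lr(0, total))     # 0.0 at the start
print(linear_warmup_lr(100, total))   # peak (5e-05) at the end of warmup
print(linear_warmup_lr(1000, total))  # 0.0 at the end of training
```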

Resume Training from Checkpoint

Training can be resumed from a checkpoint if interrupted:
# Initial training
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./model",
    epochs=10,
    keep_checkpoints=True,  # Keep checkpoints for resume
)
pipeline = TrainingPipeline()
pipeline.run(config)  # Interrupted at epoch 5

# Resume training
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./model",
    epochs=10,
    resume_from_checkpoint="./model/checkpoints/checkpoint-500",
)
pipeline.run(config)  # Continues from checkpoint
Checkpoints are saved every 500 steps by default.

Training Metrics

When save_metrics=True (default), training metrics are saved to training_metrics.json:
[
  {"step": 50, "epoch": 0.5, "loss": 8.234, "learning_rate": 2.5e-5},
  {"step": 100, "epoch": 1.0, "loss": 6.891, "learning_rate": 5e-5},
  ...
]
Metrics include:
  • step: Global training step
  • epoch: Current epoch (fractional)
  • loss: Training loss
  • learning_rate: Current learning rate
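Because the metrics file is plain JSON, post-hoc analysis needs nothing beyond the standard library. A minimal sketch that summarizes records of the shape shown above (the data is inlined here instead of read from training_metrics.json, and the loss values are illustrative):

```python
import json

# Sample records mirroring the training_metrics.json format
metrics_json = '''[
  {"step": 50, "epoch": 0.5, "loss": 8.234, "learning_rate": 2.5e-5},
  {"step": 100, "epoch": 1.0, "loss": 6.891, "learning_rate": 5e-5},
  {"step": 150, "epoch": 1.5, "loss": 6.120, "learning_rate": 4.5e-5}
]'''

records = json.loads(metrics_json)
final = records[-1]
best = min(records, key=lambda r: r["loss"])
print(f"final loss {final['loss']} at step {final['step']}")
print(f"best loss {best['loss']} at step {best['step']}")
```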

Low-Level API

For more control, use ModelTrainer directly:
from myspellchecker.training import ModelTrainer, ModelArchitecture

trainer = ModelTrainer()

# Step 1: Train tokenizer
tokenizer_path = trainer.train_tokenizer(
    corpus_path="corpus.txt",
    output_dir="./tokenizer",
    vocab_size=30000,
)

# Step 2: Train model
model_path = trainer.train_model(
    corpus_path="corpus.txt",
    tokenizer_path=tokenizer_path,
    output_dir="./model",
    architecture=ModelArchitecture.ROBERTA,
    epochs=5,
    warmup_ratio=0.1,
    save_metrics=True,
)

ONNX Export

Models are automatically exported to ONNX format with INT8 quantization:
from myspellchecker.training import ONNXExporter

exporter = ONNXExporter()
exporter.export(
    model_dir="./pytorch_model",
    output_dir="./onnx_model",
    quantize=True,  # INT8 quantization
)
The exported ONNX model can be used with SemanticChecker for context-aware validation.

Output files:
  • model.onnx - Quantized model (default)
  • model.base.onnx - Original FP32 model
  • tokenizer.json - Copied for convenience

Using Trained Models

With SemanticChecker

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig

config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model_path="./models/model.onnx",
        tokenizer_path="./models/tokenizer.json",
    )
)

checker = SpellChecker(config=config)
result = checker.check("မြန်မာစာ")

Standalone Inference

import onnxruntime as ort
from transformers import PreTrainedTokenizerFast

# Load model and tokenizer
session = ort.InferenceSession("./models/model.onnx")
tokenizer = PreTrainedTokenizerFast(tokenizer_file="./models/tokenizer.json")

# Prepare input
text = "မြန်မာ<mask>သည်"
inputs = tokenizer(text, return_tensors="np")

# Run inference
outputs = session.run(
    ["logits"],
    {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
    }
)
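The logits returned above have shape (batch, sequence, vocab_size); turning them into suggestions is just a top-k over the vocabulary axis at the masked position. A library-independent sketch of that last step with plain Python lists (the logit values and vocabulary size are made up for illustration):

```python
def top_k_indices(logits_at_mask, k=5):
    """Return the k vocabulary indices with the highest logits."""
    ranked = sorted(range(len(logits_at_mask)),
                    key=lambda i: logits_at_mask[i], reverse=True)
    return ranked[:k]

# Fake logits for a tiny 8-token vocabulary at the <mask> position
fake_logits = [0.1, 2.3, -1.0, 4.2, 0.0, 3.1, -0.5, 1.7]
print(top_k_indices(fake_logits, k=3))  # [3, 5, 1]
```

In real inference you would slice `outputs[0][0][mask_position]` and map the resulting indices back to tokens with `tokenizer.convert_ids_to_tokens`.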

CLI Usage

Train a model via CLI:
# Basic training
myspellchecker train-model -i corpus.txt -o ./models/

# With custom parameters
myspellchecker train-model -i corpus.txt -o ./models/ \
  --architecture roberta \
  --epochs 10 \
  --hidden-size 512 \
  --layers 6 \
  --heads 8 \
  --learning-rate 3e-5

# Resume from checkpoint
myspellchecker train-model -i corpus.txt -o ./models/ \
  --resume ./models/checkpoints/checkpoint-500

Corpus Format

The training corpus should be a text file with one sentence per line:
ကျွန်တော် မြန်မာ စာ လေ့လာ နေ ပါ တယ်
သူမ က စာအုပ် ဖတ် နေ တယ်
ဒီ နေ့ ရာသီ ဥတု ကောင်း တယ်
Requirements:
  • UTF-8 encoding
  • One sentence per line
  • Minimum 100 lines (recommended: 10,000+ lines)
  • Segmented text (spaces between words) works best
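These requirements are worth sanity-checking before launching a long run. A minimal validation sketch (the thresholds mirror the list above; the helper itself is not part of the library):

```python
from pathlib import Path

def validate_corpus(path: str, min_lines: int = 100) -> int:
    """Check that a corpus is UTF-8 with one non-empty sentence per line.

    Returns the number of non-empty lines, or raises on failure.
    """
    # read_text raises UnicodeDecodeError if the file is not valid UTF-8
    text = Path(path).read_text(encoding="utf-8")
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < min_lines:
        raise ValueError(f"corpus has {len(lines)} sentences; need at least {min_lines}")
    return len(lines)

# Write a tiny demo corpus and validate it with a relaxed threshold
demo = Path("demo_corpus.txt")
demo.write_text("ကျွန်တော် မြန်မာ စာ လေ့လာ နေ ပါ တယ်\nသူမ က စာအုပ် ဖတ် နေ တယ်\n",
                encoding="utf-8")
print(validate_corpus("demo_corpus.txt", min_lines=2))  # 2
```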

GPU Support

Training automatically uses GPU if available:
import torch
print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

Batch Size by GPU VRAM

| GPU VRAM | Recommended batch_size |
|---|---|
| 4GB | 8 |
| 8GB | 16 |
| 16GB | 32 |
| 24GB+ | 64 |
For CPU-only training:
# Training will automatically fall back to CPU if no GPU available
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./model",
    batch_size=8,  # Reduce batch size for CPU
)

Model Size vs Quality

| Configuration | Parameters | Quality | Speed |
|---|---|---|---|
| Small (default) | ~5M | Good | Fast |
| Medium | ~20M | Better | Medium |
| Large | ~100M | Best | Slow |
# Small (default)
config = TrainingConfig(hidden_size=256, num_layers=4, num_heads=4)

# Medium
config = TrainingConfig(hidden_size=512, num_layers=6, num_heads=8)

# Large
config = TrainingConfig(hidden_size=768, num_layers=12, num_heads=12)

Best Practices

  1. Corpus Size: Use at least 10,000 sentences for meaningful results
  2. Batch Size: Larger batches (16-32) generally train faster on GPU
  3. Hidden Size: Start with 256 for small models, 512 for larger ones
  4. Epochs: 5-10 epochs is usually sufficient; monitor loss for overfitting
  5. Warmup: 10% warmup (0.1) helps training stability
  6. Checkpoints: Enable keep_checkpoints=True for long training runs
  7. Metrics: Always save metrics to monitor training progress

Troubleshooting

Memory Issues

# Reduce batch size and max_length
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./model",
    batch_size=4,
    max_length=64,
)

Slow Training

# Check GPU availability
import torch
print(torch.cuda.is_available())

# Reduce model complexity
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./model",
    hidden_size=128,
    num_layers=2,
)

Invalid hidden_size/num_heads

# hidden_size must be divisible by num_heads
# This will raise ValueError:
config = TrainingConfig(
    hidden_size=256,
    num_heads=3,  # Error: 256 not divisible by 3
)

# Valid configuration:
config = TrainingConfig(
    hidden_size=256,
    num_heads=4,  # OK: 256 / 4 = 64
)

Neural Reranker Training

The neural reranker is a small MLP (Linear(19→64)→ReLU→Dropout→Linear(64→1), ~5K parameters) that learns to re-rank spell checker suggestions using 19 extracted features. It runs as the final step in the suggestion pipeline after N-gram and semantic reranking. See Neural Reranker for the full feature vector layout.

Prerequisites

Requires:
  • A production SQLite database (built by the data pipeline)
  • A segmented Arrow IPC corpus (produced during pipeline ingestion)
  • PyTorch: pip install myspellchecker[train]

Step 1: Generate Training Data

The RerankerDataGenerator creates labeled training data by corrupting clean sentences and collecting spell checker candidates:
from myspellchecker.training.reranker_data import RerankerDataGenerator

generator = RerankerDataGenerator(
    db_path="data/mySpellChecker_production.db",
    arrow_corpus_path="data/segmented_corpus.arrow",
)

# Generate training data (single-threaded)
generator.generate(
    num_examples=100_000,
    output_path="data/reranker_training.jsonl",
)
For large-scale generation, use the threaded entry point:
from myspellchecker.training.reranker_data import generate_threaded

stats = generate_threaded(
    db_path="data/mySpellChecker_production.db",
    arrow_corpus_path="data/segmented_corpus.arrow",
    output_path="data/reranker_training_100k.jsonl",
    num_examples=100_000,
)
Each JSONL line contains 19 features per candidate (edit distance, frequency, phonetic similarity, N-gram context, confusable status, source indicators, etc.) plus the gold correction index. See Neural Reranker for the full feature layout.
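To make the record shape concrete, here is a hypothetical JSONL line parsed with the standard library. The field names (`features`, `gold_index`) are assumptions chosen for illustration, not the library's actual schema — see the Neural Reranker page for the real layout:

```python
import json

# Hypothetical training record: 2 candidates x 19 features each,
# plus the index of the gold correction (field names are illustrative)
line = json.dumps({
    "features": [[0.5] * 19, [0.1] * 19],
    "gold_index": 0,
})

record = json.loads(line)
num_candidates = len(record["features"])
assert all(len(f) == 19 for f in record["features"])
print(num_candidates, record["gold_index"])  # 2 0
```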

Step 2: Train the MLP

from myspellchecker.training.reranker_trainer import RerankerTrainer

trainer = RerankerTrainer("data/reranker_training.jsonl")
metrics = trainer.train(epochs=20)

# Export to ONNX
trainer.export_onnx("models/reranker-v1/reranker.onnx")
# Outputs: reranker.onnx + reranker.onnx.stats.json
Training parameters:
| Parameter | Default | Description |
|---|---|---|
| epochs | 20 | Maximum training epochs |
| lr | 1e-3 | Learning rate |
| batch_size | 64 | Batch size |
| patience | 5 | Early stopping patience (on validation Top-1 accuracy) |
| val_ratio | 0.2 | Validation split ratio |
| hidden_dim | 64 | MLP hidden layer dimension |
| dropout | 0.1 | Dropout rate |
| max_candidates | 20 | Maximum candidates per example |
CLI alternative:
python -m myspellchecker.training.reranker_trainer \
    --train data/reranker_training_100k.jsonl \
    --output models/reranker-v1/ \
    --epochs 20 --lr 1e-3 --batch-size 64
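The listwise objective used here scores the whole candidate list jointly rather than each candidate in isolation. A minimal sketch of listwise cross-entropy with a single gold candidate, in plain Python (the real trainer uses PyTorch and the ListMLE variant):

```python
import math

def listwise_ce(scores, gold_index):
    """Cross-entropy of a softmax over candidate scores vs. the gold candidate.

    Lower is better; the loss shrinks as the gold candidate's score
    rises relative to the rest of the list.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    prob_gold = exps[gold_index] / sum(exps)
    return -math.log(prob_gold)

good = listwise_ce([4.0, 1.0, 0.5], gold_index=0)  # gold scored highest
bad = listwise_ce([4.0, 1.0, 0.5], gold_index=2)   # gold scored lowest
print(good < bad)  # True
```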

Step 3: Use the Trained Model

from myspellchecker.core.config import SpellCheckerConfig, NeuralRerankerConfig

config = SpellCheckerConfig(
    neural_reranker=NeuralRerankerConfig(
        enabled=True,
        model_path="models/reranker-v1/reranker.onnx",
        stats_path="models/reranker-v1/reranker.onnx.stats.json",
    ),
)
See Neural Reranker for inference details and the feature vector specification.

See Also