The Training Pipeline allows you to create custom deep learning models for the SemanticChecker. While the standard N-gram checker works well for local context, these models (based on Transformers) capture long-range dependencies and semantic meaning.

Overview

The pipeline automates the entire process:
  1. Tokenizer Training: Creating a vocabulary from your specific corpus.
  2. Model Training: Pre-training a transformer model (Masked Language Modeling).
  3. Export: Converting the model to ONNX format for fast, dependency-light inference.

Usage

CLI Usage

```bash
# Train a model from a text corpus
myspellchecker train-model \
    --input corpus.txt \
    --output models/my_model \
    --epochs 5 \
    --vocab-size 30000
```

Python API Usage

```python
from myspellchecker.training import TrainingPipeline, TrainingConfig

# Configure training parameters
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="models/v1",
    vocab_size=30000,
    epochs=5,
    batch_size=16,
    learning_rate=5e-5,
    hidden_size=256,
    num_layers=4,
    num_heads=4,
    max_length=128,
    architecture="roberta",  # or "bert"
    fp16=True,  # Mixed precision for faster training
)

# Run the pipeline
pipeline = TrainingPipeline()
model_path = pipeline.run(config)
```

TrainingConfig Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| `input_file` | `str` | required | Path to training corpus (one sentence per line). |
| `output_dir` | `str` | required | Directory to save trained model and artifacts. |
| `vocab_size` | `int` | `30000` | Vocabulary size for BPE tokenizer. |
| `min_frequency` | `int` | `2` | Minimum frequency for token inclusion in vocabulary. |
| `epochs` | `int` | `5` | Number of training epochs. |
| `batch_size` | `int` | `16` | Batch size per device. |
| `learning_rate` | `float` | `5e-5` | Peak learning rate. |
| `hidden_size` | `int` | `256` | Size of hidden layers (must be divisible by `num_heads`). |
| `num_layers` | `int` | `4` | Number of transformer layers. |
| `num_heads` | `int` | `4` | Number of attention heads. |
| `max_length` | `int` | `128` | Maximum sequence length. |
| `architecture` | `str` | `"roberta"` | Model architecture (`"roberta"` or `"bert"`). See the `ModelArchitecture` enum. |
| `warmup_ratio` | `float` | `0.1` | Ratio of total steps used for learning-rate warmup. |
| `weight_decay` | `float` | `0.01` | Weight decay for the optimizer. |
| `fp16` | `bool` | `False` | Enable mixed-precision (FP16) training for faster GPU training. |
| `streaming` | `bool` | `False` | Use streaming mode for large corpora (constant memory). |
| `resume_from_checkpoint` | `str \| None` | `None` | Path to a checkpoint directory to resume training from. |
| `checkpoint_dir` | `str \| None` | `None` | Persistent directory for checkpoints (e.g., `/opt/ml/checkpoints`). Survives job restarts; completed steps are auto-skipped on resume. |
| `keep_checkpoints` | `bool` | `False` | Keep intermediate PyTorch checkpoints after ONNX export. |
| `save_metrics` | `bool` | `True` | Save training metrics to a JSON file. |

ModelArchitecture Enum

The architecture field accepts values from the ModelArchitecture enum:
| Value | Description |
|---|---|
| `"roberta"` | RoBERTa architecture (default). Dynamic masking, no NSP task. |
| `"bert"` | BERT architecture. Static masking with standard BERT pre-training. |
```python
from myspellchecker.training import ModelArchitecture

arch = ModelArchitecture.from_string("roberta")  # ModelArchitecture.ROBERTA
```

Additional Training Exports

The myspellchecker.training module also exports these utilities:
| Class | Module | Purpose |
|---|---|---|
| `CorpusPreprocessor` | `training.corpus_preprocessor` | Clean and prepare raw text corpora before training. No optional dependencies required. |
| `SyntheticErrorGenerator` | `training.generator` | Generate synthetic spelling errors for data augmentation and denoising training. |
| `RerankerTrainer` | `training.reranker_trainer` | Train a `RerankerMLP` model on JSONL data with early stopping. Import directly: `from myspellchecker.training.reranker_trainer import RerankerTrainer`. |
```python
from myspellchecker.training import CorpusPreprocessor, SyntheticErrorGenerator
```
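
The `SyntheticErrorGenerator` API is not documented here, so as an illustration of what synthetic error generation involves, here is a minimal, self-contained sketch. The `corrupt` helper and its three edit operations are hypothetical, not part of the library:

```python
import random

def corrupt(word: str, rng: random.Random) -> str:
    """Apply one random character edit (delete, swap, or duplicate) to a word."""
    if len(word) < 2:
        return word  # too short to corrupt meaningfully
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["delete", "swap", "duplicate"])
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word[:i + 1] + word[i] + word[i + 1:]  # duplicate character i

# Build (clean, corrupted) training pairs from a tiny word list
rng = random.Random(0)
pairs = [(w, corrupt(w, rng)) for w in ["spelling", "checker", "training"]]
```

Training a denoising model on (clean, corrupted) pairs like these teaches it to map misspellings back to valid words.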

Pipeline Stages

1. Tokenizer Training

  • Goal: Create a subword tokenizer optimized for Myanmar text.
  • Algorithm: Byte-Level BPE (Byte-Pair Encoding).
  • Output: tokenizer.json
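
Byte-Pair Encoding builds its vocabulary by repeatedly merging the most frequent adjacent symbol pair. A toy, pure-Python sketch of a single merge step, for intuition only (the real tokenizer operates on bytes and applies thousands of merges):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a word-frequency corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency
corpus = {tuple("hug"): 5, tuple("bug"): 3}
pair = most_frequent_pair(corpus)   # ("u", "g") occurs 8 times
corpus = merge_pair(corpus, pair)   # "ug" becomes a single vocabulary symbol
```

Each merge adds one subword to the vocabulary; training stops when `vocab_size` symbols have been created.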

2. Language Model Training

  • Goal: Learn the probability distribution of words in context.
  • Architecture: RoBERTa or BERT (Encoder-only transformer, selected via ModelArchitecture enum).
  • Task: Masked Language Modeling (MLM). Random words are masked, and the model attempts to predict them.
  • Hyperparameters:
    • hidden_size: Dimension of the embeddings (default: 256).
    • num_layers: Number of transformer blocks (default: 4).
    • num_heads: Attention heads (default: 4).
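
The MLM task above can be sketched in plain Python. This illustrates the objective only, not the pipeline's actual data collator; the 15% mask rate and the `-100` ignore-index convention are common MLM defaults, assumed here:

```python
import random

def mask_tokens(tokens, rng, mask_rate=0.15, mask_token="[MASK]"):
    """Randomly hide ~15% of tokens; return model inputs and loss labels.

    Labels are -100 (ignored by the loss) everywhere except masked
    positions, where the label is the original token to predict.
    """
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(mask_token)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(-100)
    return inputs, labels

orig = ["the", "quick", "brown", "fox", "jumps"]
inputs, labels = mask_tokens(orig, random.Random(42))
```

The model only receives `inputs`; learning to recover the hidden tokens from surrounding context is what gives the checker its semantic signal.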

3. ONNX Export & Quantization

  • Goal: Optimize the model for production use.
  • Process:
    • Converts the PyTorch dynamic graph to a static ONNX graph.
    • Quantization: Converts 32-bit floating point weights to 8-bit unsigned integers (QUInt8). This reduces model size by 4x and speeds up CPU inference significantly with minimal accuracy loss.
  • Output: model.onnx
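
The QUInt8 step can be pictured as affine quantization: each float32 weight is approximated as `scale * (q - zero_point)` where `q` is an unsigned 8-bit integer. A hand-rolled sketch for intuition only; the actual export would rely on dedicated ONNX quantization tooling, not code like this:

```python
def quantize_uint8(weights):
    """Affine-quantize a list of floats to uint8 values plus (scale, zero_point)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0       # guard against all-equal weights
    zero_point = round(-lo / scale)      # uint8 value that represents 0.0
    q = [min(255, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the quantized representation."""
    return [scale * (v - zero_point) for v in q]

w = [-1.0, -0.5, 0.0, 0.5, 1.0]
q, s, zp = quantize_uint8(w)
w_hat = dequantize(q, s, zp)  # close to w, within one quantization step
```

Storing one byte per weight instead of four is where the 4x size reduction comes from; the per-tensor `scale` and `zero_point` are the only extra metadata needed at inference time.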

Hardware Requirements

  • Training: A GPU (NVIDIA CUDA or Mac MPS) is highly recommended but not strictly required. The pipeline automatically detects available accelerators.
  • Inference: The resulting ONNX models are designed to run efficiently on standard CPUs.

Output Artifacts

After a successful run, the output directory will contain:
```
models/my_model/
├── model.onnx        # quantized ONNX model
├── tokenizer.json    # trained BPE tokenizer
├── config.json       # model configuration
└── pytorch_source/   # PyTorch artifacts
```