The Training Pipeline allows you to create custom deep learning models for the SemanticChecker. While the standard N-gram checker works well for local context, these models (based on Transformers) capture long-range dependencies and semantic meaning.

Overview

The pipeline automates the entire process:
  1. Tokenizer Training: Creating a vocabulary from your specific corpus.
  2. Model Training: Pre-training a transformer model (Masked Language Modeling).
  3. Export: Converting the model to ONNX format for fast, dependency-light inference.

Usage

CLI Usage

# Train a model from a text corpus
myspellchecker train-model \
    --input corpus.txt \
    --output models/my_model \
    --epochs 5 \
    --vocab-size 15000

Python API Usage

from myspellchecker.training import TrainingPipeline, TrainingConfig

# Configure training parameters
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="models/v1",
    vocab_size=15000,
    epochs=5,
    batch_size=16,  # Default batch size
    hidden_size=256,  # Smaller size for speed/efficiency
    num_layers=4,
    num_heads=4
)

# Run the pipeline
pipeline = TrainingPipeline()
model_path = pipeline.run(config)

Pipeline Stages

1. Tokenizer Training

  • Goal: Create a subword tokenizer optimized for Myanmar text.
  • Algorithm: Byte-Level BPE (Byte-Pair Encoding).
  • Output: tokenizer.json
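
The core of BPE training is a simple greedy loop: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new symbol, and repeat. The sketch below is a toy pure-Python stand-in for that loop (the pipeline itself uses a byte-level BPE trainer; `bpe_merges` and its inputs here are illustrative, not the pipeline's API):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a word-frequency dict.

    words: mapping of word -> corpus count; each word starts as a
    sequence of single characters. Returns the learned merges in order.
    """
    vocab = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

merges = bpe_merges({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
print(merges)  # the most frequent pairs merge first: ('l','o'), then ('lo','w')
```

A byte-level variant runs the same loop over UTF-8 bytes instead of characters, which is what makes it robust for scripts like Myanmar where no whitespace-based word segmentation exists.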

2. Language Model Training

  • Goal: Learn the probability distribution of words in context.
  • Architecture: RoBERTa or BERT (Encoder-only transformer, selected via ModelArchitecture enum).
  • Task: Masked Language Modeling (MLM). Random words are masked, and the model attempts to predict them.
  • Hyperparameters:
    • hidden_size: Dimension of the embeddings (default: 256).
    • num_layers: Number of transformer blocks (default: 4).
    • num_heads: Attention heads (default: 4).
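
To make the MLM objective concrete, here is a simplified illustration of how a training example is built: a fraction of tokens is replaced with a mask token, and the original tokens at those positions become the prediction targets. (The real pipeline uses a data collator that also substitutes random tokens some of the time; this sketch, with the hypothetical helper `mask_tokens`, only shows the basic masking.)

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Build one MLM example: mask ~mask_prob of the tokens and record
    the originals as labels; unmasked positions get None (ignored by the loss)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            labels.append(tok)      # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)     # position contributes nothing to the loss
    return inputs, labels

inputs, labels = mask_tokens(["the", "cat", "sat", "on", "the", "mat"], mask_prob=0.3)
print(inputs)
print(labels)
```

Because the model must use both left and right context to fill each mask, MLM pre-training is what gives the encoder its bidirectional view of a sentence, which the SemanticChecker later exploits to score words in context.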

3. ONNX Export & Quantization

  • Goal: Optimize the model for production use.
  • Process:
    • Converts the PyTorch dynamic graph to a static ONNX graph.
    • Quantization: Converts 32-bit floating point weights to 8-bit unsigned integers (QUInt8). This reduces model size by 4x and speeds up CPU inference significantly with minimal accuracy loss.
  • Output: model.onnx
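
The arithmetic behind 8-bit quantization can be shown in a few lines. Each float32 weight is mapped onto the integer range [0, 255] via a per-tensor scale and zero point, so it can be stored in one byte and approximately recovered at inference time. (ONNX Runtime's quantizer performs this per tensor during export; the functions below are a pure-Python illustration of the math, not the export code.)

```python
def quantize_uint8(weights):
    """Affine (asymmetric) quantization of floats onto [0, 255].

    Returns the quantized bytes plus the (scale, zero_point) needed to
    dequantize: w ~= scale * (q - zero_point).
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    zero_point = round(-lo / scale)
    q = bytes(min(255, max(0, round(w / scale) + zero_point)) for w in weights)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the quantized bytes."""
    return [scale * (b - zero_point) for b in q]

w = [-0.51, 0.0, 0.27, 1.02]
q, scale, zp = quantize_uint8(w)
# Each weight now occupies 1 byte instead of 4 (float32): the 4x size reduction
approx = dequantize(q, scale, zp)
print(max(abs(a - b) for a, b in zip(w, approx)))  # error bounded by the scale step
```

The maximum round-trip error is bounded by one quantization step (the scale), which is why accuracy loss stays small as long as the weight distribution is not dominated by extreme outliers.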

Hardware Requirements

  • Training: A GPU (NVIDIA CUDA or Mac MPS) is highly recommended but not strictly required. The pipeline automatically detects available accelerators.
  • Inference: The resulting ONNX models are designed to run efficiently on standard CPUs.

Output Artifacts

After a successful run, the output directory will contain:
/models/my_model/
├── model.onnx          # The quantized model for inference
├── tokenizer.json      # The tokenizer vocabulary
├── config.json         # Model configuration
└── pytorch_source/     # (Optional) Original PyTorch checkpoints