SemanticChecker

While the standard N-gram checker works well for local context, the transformer-based models behind the SemanticChecker capture long-range dependencies and semantic meaning.
Overview
The pipeline automates the entire process:
- Tokenizer Training: Creating a vocabulary from your specific corpus.
- Model Training: Pre-training a transformer model (Masked Language Modeling).
- Export: Converting the model to ONNX format for fast, dependency-light inference.
Usage
CLI Usage
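The exact console entry point is not documented here; as a sketch, assuming a myspellchecker-train command whose flags mirror the TrainingConfig fields in the reference table below, a run might look like:

```shell
# Hypothetical invocation -- the command name and flag names are assumptions
# that mirror the TrainingConfig parameters documented below.
myspellchecker-train \
  --input-file corpus.txt \
  --output-dir models/my_model \
  --vocab-size 30000 \
  --epochs 5 \
  --architecture roberta \
  --fp16
```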
Python API Usage
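As a sketch, assuming the import path and trainer entry point below (both are assumptions; only the TrainingConfig field names come from the reference table that follows):

```python
# Sketch only: the import path and TrainingPipeline entry point are
# assumptions; the TrainingConfig fields match the reference table below.
from myspellchecker.training import TrainingConfig, TrainingPipeline  # hypothetical

config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="models/my_model",
    vocab_size=30000,
    epochs=5,
    architecture="roberta",
)
TrainingPipeline(config).run()  # hypothetical entry point
```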
TrainingConfig Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_file | str | required | Path to training corpus (one sentence per line). |
| output_dir | str | required | Directory to save trained model and artifacts. |
| vocab_size | int | 30000 | Vocabulary size for the BPE tokenizer. |
| min_frequency | int | 2 | Minimum frequency for token inclusion in the vocabulary. |
| epochs | int | 5 | Number of training epochs. |
| batch_size | int | 16 | Batch size per device. |
| learning_rate | float | 5e-5 | Peak learning rate. |
| hidden_size | int | 256 | Size of hidden layers (must be divisible by num_heads). |
| num_layers | int | 4 | Number of transformer layers. |
| num_heads | int | 4 | Number of attention heads. |
| max_length | int | 128 | Maximum sequence length. |
| architecture | str | "roberta" | Model architecture ("roberta" or "bert"). See the ModelArchitecture enum. |
| warmup_ratio | float | 0.1 | Ratio of total steps used for learning-rate warmup. |
| weight_decay | float | 0.01 | Weight decay for the optimizer. |
| fp16 | bool | False | Enable mixed-precision (FP16) training for faster GPU training. |
| streaming | bool | False | Use streaming mode for large corpora (constant memory). |
| resume_from_checkpoint | str \| None | None | Path to a checkpoint directory to resume training from. |
| checkpoint_dir | str \| None | None | Persistent directory for checkpoints (e.g., /opt/ml/checkpoints). Survives job restarts; completed steps are auto-skipped on resume. |
| keep_checkpoints | bool | False | Keep intermediate PyTorch checkpoints after ONNX export. |
| save_metrics | bool | True | Save training metrics to a JSON file. |
ModelArchitecture Enum
The architecture field accepts values from the ModelArchitecture enum:
| Value | Description |
|---|---|
"roberta" | RoBERTa architecture (default). Dynamic masking, no NSP task. |
"bert" | BERT architecture. Static masking with standard BERT pre-training. |
Additional Training Exports
The myspellchecker.training module also exports these utilities:
| Class | Module | Purpose |
|---|---|---|
| CorpusPreprocessor | training.corpus_preprocessor | Clean and prepare raw text corpora before training. No optional dependencies required. |
| SyntheticErrorGenerator | training.generator | Generate synthetic spelling errors for data augmentation and denoising training. |
| RerankerTrainer | training.reranker_trainer | Train a RerankerMLP model on JSONL data with early stopping. Import directly: from myspellchecker.training.reranker_trainer import RerankerTrainer. |
Pipeline Stages
1. Tokenizer Training
- Goal: Create a subword tokenizer optimized for Myanmar text.
- Algorithm: Byte-Level BPE (Byte-Pair Encoding).
- Output:
tokenizer.json
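For illustration, this stage can be reproduced with the Hugging Face tokenizers library, which implements the same Byte-Level BPE algorithm. Whether the pipeline uses this library internally, and the exact special-token list, are assumptions:

```python
import tempfile

from tokenizers import ByteLevelBPETokenizer

# Tiny stand-in corpus (one sentence per line), just so the example runs;
# the real pipeline trains on the full input_file.
corpus = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
corpus.write("hello world\nhello there\nhello world again\n")
corpus.close()

# Byte-Level BPE, mirroring the documented vocab_size / min_frequency knobs
# (the special-token list below is an assumption).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=[corpus.name],
    vocab_size=300,  # the pipeline default is 30000; kept tiny here
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save("tokenizer.json")
```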
2. Language Model Training
- Goal: Learn the probability distribution of words in context.
- Architecture: RoBERTa or BERT (encoder-only transformer, selected via the ModelArchitecture enum).
- Task: Masked Language Modeling (MLM). Random words are masked, and the model attempts to predict them.
- Hyperparameters:
  - hidden_size: Dimension of the embeddings (default: 256).
  - num_layers: Number of transformer blocks (default: 4).
  - num_heads: Attention heads (default: 4).
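Expressed with the Hugging Face transformers library (whether the pipeline uses transformers internally is an assumption), the documented defaults correspond to a configuration like this:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Mirrors the documented defaults: hidden_size=256, num_layers=4, num_heads=4.
config = RobertaConfig(
    vocab_size=30000,
    hidden_size=256,                  # must be divisible by num_attention_heads
    num_hidden_layers=4,
    num_attention_heads=4,
    max_position_embeddings=128 + 2,  # RoBERTa reserves 2 extra position slots
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params}")
```

At these sizes the model is small enough to pre-train on a single GPU, which is why they are the defaults.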
3. ONNX Export & Quantization
- Goal: Optimize the model for production use.
- Process:
- Converts the PyTorch dynamic graph to a static ONNX graph.
- Quantization: Converts 32-bit floating point weights to 8-bit unsigned integers (QUInt8). This reduces model size by 4x and speeds up CPU inference significantly with minimal accuracy loss.
- Output:
model.onnx
Hardware Requirements
- Training: A GPU (NVIDIA CUDA or Mac MPS) is highly recommended but not strictly required. The pipeline automatically detects available accelerators.
- Inference: The resulting ONNX models are designed to run efficiently on standard CPUs.
Output Artifacts
After a successful run, the output directory (e.g., models/my_model) will contain:
- model.onnx
- tokenizer.json
- config.json
- pytorch_source