SemanticChecker

While the standard N-gram checker works well for local context, the transformer-based models behind the SemanticChecker capture long-range dependencies and semantic meaning.
Overview
The pipeline automates the entire process:
- Tokenizer Training: Creating a vocabulary from your specific corpus.
- Model Training: Pre-training a transformer model (Masked Language Modeling).
- Export: Converting the model to ONNX format for fast, dependency-light inference.
Usage
CLI Usage
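The exact console entry point is not documented here; as a sketch, assuming a myspellchecker-train command whose flags mirror the TrainingConfig fields in the reference table below, a run might look like:

```shell
# Hypothetical invocation -- the command name and flag names are assumptions
# that mirror the TrainingConfig parameters documented below.
myspellchecker-train \
  --input-file corpus.txt \
  --output-dir models/my_model \
  --vocab-size 30000 \
  --epochs 5 \
  --architecture roberta \
  --fp16
```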
Python API Usage
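As a sketch, assuming the import path and trainer entry point below (both are assumptions; only the TrainingConfig field names come from the reference table that follows):

```python
# Sketch only: the import path and TrainingPipeline entry point are
# assumptions; the TrainingConfig fields match the reference table below.
from myspellchecker.training import TrainingConfig, TrainingPipeline  # hypothetical

config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="models/my_model",
    vocab_size=30000,
    epochs=5,
    architecture="roberta",
)
TrainingPipeline(config).run()  # hypothetical entry point
```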
TrainingConfig Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_file | str | required | Path to training corpus (one sentence per line). |
| output_dir | str | required | Directory to save trained model and artifacts. |
| vocab_size | int | 30000 | Vocabulary size for the BPE tokenizer. |
| min_frequency | int | 2 | Minimum frequency for token inclusion in the vocabulary. |
| epochs | int | 5 | Number of training epochs. |
| batch_size | int | 16 | Batch size per device. |
| learning_rate | float | 5e-5 | Peak learning rate. |
| hidden_size | int | 256 | Size of hidden layers (must be divisible by num_heads). |
| num_layers | int | 4 | Number of transformer layers. |
| num_heads | int | 4 | Number of attention heads. |
| max_length | int | 128 | Maximum sequence length. |
| architecture | str | "roberta" | Model architecture ("roberta" or "bert"). See the ModelArchitecture enum. |
| warmup_ratio | float | 0.1 | Ratio of total steps used for learning-rate warmup. |
| weight_decay | float | 0.01 | Weight decay for the optimizer. |
| fp16 | bool | False | Enable mixed-precision (FP16) training for faster GPU training. |
| streaming | bool | False | Use streaming mode for large corpora (constant memory). |
| resume_from_checkpoint | str \| None | None | Path to a checkpoint directory to resume training from. |
| checkpoint_dir | str \| None | None | Persistent directory for checkpoints (e.g., /opt/ml/checkpoints). Survives job restarts; completed steps are auto-skipped on resume. |
| keep_checkpoints | bool | False | Keep intermediate PyTorch checkpoints after ONNX export. |
| save_metrics | bool | True | Save training metrics to a JSON file. |
ModelArchitecture Enum
The architecture field accepts values from the ModelArchitecture enum:
| Value | Description |
|---|---|
"roberta" | RoBERTa architecture (default). Dynamic masking, no NSP task. |
"bert" | BERT architecture. Static masking with standard BERT pre-training. |
Additional Training Exports
The myspellchecker.training module also exports these utilities:
| Class | Module | Purpose |
|---|---|---|
| CorpusPreprocessor | training.corpus_preprocessor | Clean and prepare raw text corpora before training. No optional dependencies required. |
| SyntheticErrorGenerator | training.generator | Generate synthetic spelling errors for data augmentation and denoising training. |
| RerankerTrainer | training.reranker_trainer | Train a RerankerMLP model on JSONL data with early stopping. Import directly: from myspellchecker.training.reranker_trainer import RerankerTrainer. |
Pipeline Stages
1. Tokenizer Training
- Goal: Create a subword tokenizer optimized for Myanmar text.
- Algorithm: Byte-Level BPE (Byte-Pair Encoding).
- Output:
tokenizer.json
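For illustration, this stage can be reproduced with the Hugging Face tokenizers library, which implements the same Byte-Level BPE algorithm. Whether the pipeline uses this library internally, and the exact special-token list, are assumptions:

```python
import tempfile

from tokenizers import ByteLevelBPETokenizer

# Tiny stand-in corpus (one sentence per line), just so the example runs;
# the real pipeline trains on the full input_file.
corpus = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
corpus.write("hello world\nhello there\nhello world again\n")
corpus.close()

# Byte-Level BPE, mirroring the documented vocab_size / min_frequency knobs
# (the special-token list below is an assumption).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=[corpus.name],
    vocab_size=300,  # the pipeline default is 30000; kept tiny here
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save("tokenizer.json")
```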
2. Language Model Training
- Goal: Learn the probability distribution of words in context.
- Architecture: RoBERTa or BERT (encoder-only transformer, selected via the ModelArchitecture enum).
- Task: Masked Language Modeling (MLM). Random words are masked, and the model attempts to predict them.
- Hyperparameters:
  - hidden_size: Dimension of the embeddings (default: 256).
  - num_layers: Number of transformer blocks (default: 4).
  - num_heads: Attention heads (default: 4).
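Expressed with the Hugging Face transformers library (whether the pipeline uses transformers internally is an assumption), the documented defaults correspond to a configuration like this:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Mirrors the documented defaults: hidden_size=256, num_layers=4, num_heads=4.
config = RobertaConfig(
    vocab_size=30000,
    hidden_size=256,                  # must be divisible by num_attention_heads
    num_hidden_layers=4,
    num_attention_heads=4,
    max_position_embeddings=128 + 2,  # RoBERTa reserves 2 extra position slots
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params}")
```

At these sizes the model is small enough to pre-train on a single GPU, which is why they are the defaults.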
3. ONNX Export & Quantization
- Goal: Optimize the model for production use.
- Process:
- Converts the PyTorch dynamic graph to a static ONNX graph.
- Quantization: Converts 32-bit floating point weights to 8-bit unsigned integers (QUInt8). This reduces model size by 4x and speeds up CPU inference significantly with minimal accuracy loss.
- Output:
model.onnx
Hardware Requirements
- Training: A GPU (NVIDIA CUDA or Mac MPS) is highly recommended but not strictly required. The pipeline automatically detects available accelerators.
- Inference: The resulting ONNX models are designed to run efficiently on standard CPUs.
Output Artifacts
After a successful run, the output directory (e.g., models/my_model) will contain:
- model.onnx
- tokenizer.json
- config.json
- pytorch_source