For large corpora (100MB+), the default Python pipeline can be slow. Install DuckDB to get 3-15x faster frequency counting automatically, and make sure the Cython extensions are compiled for parallel segmentation.
DuckDB Acceleration
When DuckDB is installed, the pipeline automatically uses it for faster frequency counting:
from myspellchecker.data_pipeline import FrequencyBuilder
# DuckDB is auto-enabled when the duckdb package is installed
builder = FrequencyBuilder(input_dir="/path/to/input", output_dir="/path/to/output")
builder.load_data() # Auto-detects and uses DuckDB if available
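To confirm ahead of a long build which mode you are likely to get, you can check whether the duckdb package is importable. This is a minimal sketch using only the standard library; `duckdb_available` is a hypothetical helper name, and the library's own detection logic may differ.

```python
import importlib.util


def duckdb_available() -> bool:
    """Return True if the duckdb package can be found in this environment."""
    # find_spec returns None when no importable module of that name exists.
    return importlib.util.find_spec("duckdb") is not None


mode = "DuckDB" if duckdb_available() else "pure Python"
print(f"Frequency counting is expected to run in {mode} mode")
```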
Performance Comparison:
| Corpus Size | Python Mode | DuckDB Mode | Speedup |
|---|---|---|---|
| 100MB | 10s | 3s | 3x |
| 500MB | 45s | 8s | 5x |
| 1GB | 120s | 12s | 10x |
| 5GB | 600s+ | 40s | 15x |
How It Works:
- Arrow file is memory-mapped with PyArrow (efficient streaming)
- Arrow table is registered with DuckDB (zero-copy when possible)
- Single-pass SQL queries replace Python loops for aggregation
- Disk-based temp storage handles datasets larger than RAM
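The idea behind replacing Python loops with SQL aggregation can be illustrated with a toy example. The sketch below uses the standard library's sqlite3 purely as a stand-in so it runs without extra packages; it is not DuckDB and not the pipeline's actual query. The table name `tokens` and the sample data are made up for illustration.

```python
import sqlite3
from collections import Counter

tokens = ["ka", "kha", "ka", "ga", "ka", "kha"]

# Python-loop baseline: one dictionary update per token.
loop_counts = Counter()
for tok in tokens:
    loop_counts[tok] += 1

# SQL version: the grouping is pushed into the engine as one query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (text TEXT)")
conn.executemany("INSERT INTO tokens VALUES (?)", [(t,) for t in tokens])
sql_counts = dict(
    conn.execute("SELECT text, COUNT(*) FROM tokens GROUP BY text")
)
conn.close()

# Both approaches produce the same frequencies.
assert sql_counts == dict(loop_counts)
print(sql_counts)
```

With a columnar engine such as DuckDB, the same `GROUP BY` shape runs over the Arrow data directly instead of iterating row by row in Python.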
Requirements:
# DuckDB is included in the build extra
pip install "myspellchecker[build]"
# Or install directly (quoted so the shell does not treat >= as a redirect)
pip install "duckdb>=1.0.0"
Resource Configuration:
DuckDB automatically configures itself for optimal performance:
- Uses all available CPU threads
- Memory limit: adaptive, calculated as min(max(total_ram * 0.25, 2GB), 8GB); it scales with system RAM and is clamped between 2GB and 8GB
- Temp storage: Uses work directory (not /tmp)
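The memory-limit formula is easier to see with a few concrete values. The following sketch simply evaluates the clamp described above; `duckdb_memory_limit` is a hypothetical helper, and it assumes "GB" means 1024^3 bytes, which the source does not specify.

```python
GIB = 1024 ** 3  # assumption: the formula's "GB" is treated as GiB


def duckdb_memory_limit(total_ram_bytes: int) -> int:
    """Evaluate min(max(total_ram * 0.25, 2GB), 8GB) and return bytes."""
    quarter = int(total_ram_bytes * 0.25)
    return min(max(quarter, 2 * GIB), 8 * GIB)


for ram_gib in (4, 16, 64):
    limit = duckdb_memory_limit(ram_gib * GIB)
    print(f"{ram_gib} GiB RAM -> {limit / GIB:.0f} GiB limit")
```

A 4 GiB machine hits the 2 GiB floor, a 16 GiB machine gets a quarter of its RAM, and anything at 32 GiB or above is capped at 8 GiB.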
Parallel Processing
Enable parallel processing for faster builds:
from myspellchecker.data_pipeline import PipelineConfig
config = PipelineConfig(
num_workers=8, # Use 8 CPU cores
batch_size=50000, # Records per batch
)
Optimal Worker Count
| CPU Cores | Recommended Workers |
|---|---|
| 2 | 2 |
| 4 | 4 |
| 8 | 6-8 |
| 16+ | 12-16 |
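If you prefer to derive the worker count at runtime, the table can be approximated with a small heuristic. This is one possible interpolation, not a rule the library applies: `recommended_workers` is a hypothetical helper that uses every core on small machines and roughly three quarters of the cores, capped at 16, on larger ones.

```python
import os


def recommended_workers(cpu_cores: int) -> int:
    """Approximate the table: all cores up to 4, then ~75% of cores, max 16."""
    if cpu_cores <= 4:
        return max(1, cpu_cores)
    return min(16, max(4, cpu_cores * 3 // 4))


cores = os.cpu_count() or 1  # os.cpu_count() can return None
print(f"{cores} cores -> {recommended_workers(cores)} workers")
# The result could then be passed as PipelineConfig(num_workers=...).
```

For 8 cores this gives 6 and for 16 cores it gives 12, the conservative end of each range in the table.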
Batch Size Tuning
Larger batches improve throughput but use more memory:
# Memory-constrained (4GB RAM)
config = PipelineConfig(batch_size=5000)
# Balanced (8-16GB RAM)
config = PipelineConfig(batch_size=20000)
# High-memory (32GB+ RAM)
config = PipelineConfig(batch_size=100000)
Memory Optimization
Sharding for Large Files
The pipeline automatically shards input files for memory-efficient processing:
config = PipelineConfig(
num_shards=50, # More shards for larger files
batch_size=10000, # Smaller batches for less memory
)
Use disk for intermediate Arrow files:
config = PipelineConfig(
work_dir="/tmp/pipeline",
keep_intermediate=False, # Clean up after
)
I/O Optimization
SSD Storage
Use SSD for both input and output:
# Place corpus and output on SSD
myspellchecker build --input /ssd/corpus.txt --output /ssd/dict.db
Sorted input improves compression:
# Sort corpus alphabetically
sort corpus.txt > sorted_corpus.txt
myspellchecker build --input sorted_corpus.txt --output dict.db
Sharding Large Corpora
Split large files for parallel ingestion:
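# Split into chunks of at most 100MB on line boundaries (GNU split);
# plain -b splits by bytes and can cut a line or multi-byte character in half
split -C 100M corpus.txt shard_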
# Process with glob pattern
myspellchecker build --input "shard_*" --output dict.db
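Where a line-aware `split` is not available, the same chunking can be done in Python. The sketch below is a standalone helper, not part of myspellchecker; `split_on_lines` and the zero-padded shard names are assumptions chosen so the output still matches the `shard_*` glob.

```python
import tempfile
from pathlib import Path


def split_on_lines(src, out_dir, max_bytes, prefix="shard_"):
    """Split src into files of at most max_bytes each without cutting a line.

    A single line longer than max_bytes is written to its own oversized shard.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    shards = []
    out = None
    written = 0
    with src.open("rb") as fh:
        for line in fh:
            # Start a new shard when the current one would overflow.
            if out is None or (written and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                path = out_dir / f"{prefix}{len(shards):04d}"
                out = path.open("wb")
                shards.append(path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return shards


# Toy demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "corpus.txt"
    data = "".join(f"line {i}\n" for i in range(100)).encode("utf-8")
    src.write_bytes(data)
    shards = split_on_lines(src, Path(tmp) / "shards", max_bytes=64)
    sizes = [p.stat().st_size for p in shards]
    rebuilt = b"".join(p.read_bytes() for p in shards)

print(f"{len(sizes)} shards, largest {max(sizes)} bytes")
```

Reading and writing in binary mode keeps the byte content identical to the source while still breaking only at newline characters.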
Quality Optimization
Frequency Thresholds
Balance coverage vs. noise:
# Include rare words (noisy)
config = PipelineConfig(min_frequency=1)
# Standard (balanced, default)
config = PipelineConfig(min_frequency=50)
# Only common words (clean)
config = PipelineConfig(min_frequency=100)
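To see what the threshold does to a vocabulary, here is a toy filter over made-up counts. It assumes the cutoff is inclusive (a word is kept when its count is at least `min_frequency`), which the source does not state; `apply_min_frequency` is a hypothetical helper, not the pipeline's code.

```python
from collections import Counter


def apply_min_frequency(counts, min_frequency):
    """Keep only words whose count is >= min_frequency (inclusive, assumed)."""
    return {word: n for word, n in counts.items() if n >= min_frequency}


# Made-up counts: two real words, one typo, one noise token.
counts = Counter({"the": 120, "spell": 60, "chek": 2, "xqz": 1})

for threshold in (1, 50, 100):
    kept = apply_min_frequency(counts, threshold)
    print(threshold, sorted(kept))
```

At a threshold of 1 the typo and noise token survive; raising it removes them but eventually starts dropping legitimate, less frequent words as well.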
Database Size
| min_frequency | DB Size (10M word corpus) |
|---|---|
| 1 | ~200MB |
| 50 | ~100MB |
| 100 | ~50MB |
Database Optimization
Index Strategy
Indexes are created automatically for fast lookups. The database includes:
- idx_syllables_text: Syllable text lookups
- idx_words_text: Word text lookups
- idx_bigrams_w1_w2: Bigram lookups
- idx_trigrams_w1_w2_w3: Trigram lookups
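One way to check that a lookup actually uses an index is to ask the database for its query plan. The sketch below assumes the output is a SQLite file, which the source does not state, and builds a throwaway in-memory table with a hypothetical two-column schema; only the index name `idx_words_text` comes from the list above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema for illustration; the real tables may differ.
conn.execute("CREATE TABLE words (text TEXT, frequency INTEGER)")
conn.execute("CREATE INDEX idx_words_text ON words (text)")
conn.executemany("INSERT INTO words VALUES (?, ?)", [("ka", 10), ("kha", 5)])

# The last column of each plan row is the human-readable detail string.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT frequency FROM words WHERE text = ?", ("ka",)
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
conn.close()

print(plan_text)
```

If the plan mentions the index name, the lookup is an index search rather than a full table scan.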
Vacuum
The database is automatically compacted after building:
from myspellchecker.data_pipeline import DatabasePackager
# DatabasePackager takes input_dir and database_path directly
packager = DatabasePackager(input_dir="/path/to/input", database_path="output.db")
# Vacuum is applied automatically during packaging
Segmentation Optimization
Segmenter Selection
Choose the segmenter based on your needs:
# Fastest (for quick builds)
config = PipelineConfig(word_engine="crf")
# Best accuracy (for production)
config = PipelineConfig(word_engine="myword")
Cython Acceleration
Ensure Cython extensions are compiled:
python setup.py build_ext --inplace
Benchmarking
Measure Build Time
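import time
# Assumes pipeline, input_files, and database_path are already defined
start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
pipeline.build_database(input_files, database_path)
elapsed = time.perf_counter() - start
print(f"Build completed in {elapsed:.1f}s")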
Profile Memory
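import tracemalloc
# tracemalloc only tracks allocations made through Python's allocator;
# memory used natively by DuckDB or Arrow may not be included
tracemalloc.start()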
pipeline.build_database(input_files, database_path)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"Peak memory: {peak / 1024 / 1024:.1f} MB")
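The two measurements can be combined into one reusable helper. This is a sketch with a hypothetical `measure` context manager; the list comprehension is a stand-in workload where a real run would call `pipeline.build_database(...)`.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(label, results):
    """Record wall-clock seconds and peak Python-level memory for a block."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results[label] = {"seconds": elapsed, "peak_mb": peak / 1024 / 1024}


results = {}
with measure("toy workload", results):
    squares = [i * i for i in range(100_000)]  # stand-in for the real build

for label, stats in results.items():
    print(f"{label}: {stats['seconds']:.3f}s, peak {stats['peak_mb']:.1f} MB")
```

Collecting results in a dictionary makes it straightforward to compare several configurations, for example different `batch_size` values, in a single script.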
Troubleshooting
Out of Memory
# Reduce batch size
config = PipelineConfig(batch_size=5000)
# Increase shards for better memory distribution
config = PipelineConfig(num_shards=50)
# Limit workers
config = PipelineConfig(num_workers=2)
Slow Build
# Increase workers
config = PipelineConfig(num_workers=8)
# Increase batch size
config = PipelineConfig(batch_size=50000)
# Use faster segmenter
config = PipelineConfig(word_engine="crf")
Install DuckDB for significant speedup on all corpus sizes:
pip install "duckdb>=1.0.0"
DuckDB is used by default when installed, providing 3-15x faster frequency counting via SQL-based aggregation.
Large Output Database
# Increase min_frequency
config = PipelineConfig(min_frequency=100)
Recommended Configurations
Small Corpus (<100MB)
config = PipelineConfig(
num_workers=4,
batch_size=10000,
min_frequency=10,
)
Large Corpus (1-10GB)
config = PipelineConfig(
num_shards=50,
num_workers=8,
batch_size=50000,
min_frequency=50,
)
Very Large Corpus (>10GB)
config = PipelineConfig(
work_dir="/ssd/tmp",
num_shards=100,
num_workers=12,
batch_size=100000,
min_frequency=100,
)
Recommended: Install DuckDB (pip install "duckdb>=1.0.0") for optimal performance.
The FrequencyBuilder automatically uses DuckDB when installed, providing 3-15x faster processing for large files.
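The recommended configurations can be selected programmatically from the corpus size. The sketch below is a hypothetical helper, not part of the library. The values are copied from the three configurations above, but the source gives no recommendation for corpora between 100MB and 1GB, so assigning that range to the large-corpus settings is an assumption. `work_dir` is left out because it depends on your machine.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def recommended_settings(corpus_bytes):
    """Return keyword arguments matching the recommended tiers by corpus size."""
    if corpus_bytes < 100 * MB:
        return {"num_workers": 4, "batch_size": 10000, "min_frequency": 10}
    if corpus_bytes <= 10 * GB:
        # Assumption: the unlisted 100MB-1GB range is treated as "large".
        return {"num_shards": 50, "num_workers": 8,
                "batch_size": 50000, "min_frequency": 50}
    return {"num_shards": 100, "num_workers": 12,
            "batch_size": 100000, "min_frequency": 100}


for size in (50 * MB, 5 * GB, 20 * GB):
    print(f"{size / GB:.2f} GB -> {recommended_settings(size)}")
# The resulting dict could be unpacked as PipelineConfig(**settings).
```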