# Performance Optimization

## DuckDB Acceleration

When DuckDB is installed, the pipeline automatically uses it for ultra-fast frequency counting:

| Corpus Size | Python Mode | DuckDB Mode | Speedup |
|---|---|---|---|
| 100MB | 10s | 3s | 3x |
| 500MB | 45s | 8s | 5x |
| 1GB | 120s | 12s | 10x |
| 5GB | 600s+ | 40s | 15x |
| 10GB+ | Hours | ~90s | 50x+ |
How it works:

- The Arrow file is memory-mapped with PyArrow (efficient streaming)
- The Arrow table is registered with DuckDB (zero-copy when possible)
- Single-pass SQL queries replace Python loops for aggregation
- Disk-based temp storage handles datasets larger than RAM
- All available CPU threads are used

Resource defaults:

- Memory limit: 6GB (configurable via DuckDB settings)
- Temp storage: the work directory (not `/tmp`)
## Parallel Processing

Enable parallel processing for faster builds.

### Optimal Worker Count

| CPU Cores | Recommended Workers |
|---|---|
| 2 | 2 |
| 4 | 4 |
| 8 | 6-8 |
| 16+ | 12-16 |
### Batch Size Tuning

Larger batches improve throughput but use more memory.

## Memory Optimization
### Sharding for Large Files

The pipeline automatically shards input files for memory-efficient processing.

### Intermediate Files

Use disk for intermediate Arrow files.

## I/O Optimization
### SSD Storage

Use SSD storage for both input and output.

### Pre-sorted Input

Sorted input improves compression.

### Sharding Large Corpora

Split large files for parallel ingestion.

## Quality Optimization
### Frequency Thresholds

Balance coverage vs. noise.

### Database Size

| min_frequency | DB Size (10M-word corpus) |
|---|---|
| 1 | ~200MB |
| 50 | ~100MB |
| 100 | ~50MB |
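A minimal sketch of threshold filtering, using a toy corpus; the `min_frequency` variable mirrors the option discussed above:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat".split()
counts = Counter(corpus)

# Raising min_frequency shrinks the database but drops rare words.
min_frequency = 2
kept = {word: n for word, n in counts.items() if n >= min_frequency}
print(sorted(kept))  # ['cat', 'the']
```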
## Database Optimization

### Index Strategy

Indexes are created automatically for fast lookups. The database includes:

- `idx_syllables_text` - syllable text lookups
- `idx_words_text` - word text lookups
- `idx_bigrams_w1_w2` - bigram lookups
- `idx_trigrams_w1_w2_w3` - trigram lookups
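How such an index serves a point lookup, sketched with the standard library's `sqlite3`; the table and column names are inferred from the index names above, not taken from the actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE words(text TEXT, frequency INTEGER)")
con.execute("CREATE INDEX idx_words_text ON words(text)")
con.executemany("INSERT INTO words VALUES (?, ?)",
                [("hello", 10), ("world", 7)])

# The index lets this exact-match lookup avoid a full table scan.
row = con.execute(
    "SELECT frequency FROM words WHERE text = ?", ("world",)
).fetchone()
print(row[0])  # 7
```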
### Vacuum

The database is automatically compacted after building.

## Segmentation Optimization

### Segmenter Selection

Choose a segmenter based on your needs.

### Cython Acceleration

Ensure the Cython extensions are compiled.

## Benchmarking
### Measure Build Time
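A minimal timing harness; `build` here is a placeholder for the real build entry point:

```python
import time

def build() -> int:
    # Stand-in workload; replace with the actual build call.
    return sum(range(100_000))

start = time.perf_counter()
build()
elapsed = time.perf_counter() - start
print(f"build took {elapsed:.3f}s")
```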
### Profile Memory
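Peak memory can be measured with the standard library's `tracemalloc`; the allocation below is a stand-in for a real build:

```python
import tracemalloc

tracemalloc.start()
data = [str(i) for i in range(10_000)]  # stand-in allocation
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak: {peak / 1024:.0f} KiB")
```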
## Troubleshooting

### Out of Memory

### Slow Build

### Large Output Database

## Recommended Configurations

### Small Corpus (<100MB)

### Large Corpus (1-10GB)

### Very Large Corpus (>10GB)
Install DuckDB (`pip install 'duckdb>=1.0.0'`) for optimal performance.
The `FrequencyBuilder` automatically uses DuckDB when installed, providing 10-50x faster processing for large files.
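The automatic detection can be sketched with the standard library; the builder itself is simplified here to a hypothetical backend chooser:

```python
import importlib.util

def counting_backend() -> str:
    # Prefer DuckDB when the package is importable.
    if importlib.util.find_spec("duckdb") is not None:
        return "duckdb"   # single-pass SQL aggregation, 10-50x faster
    return "python"       # portable fallback, no extra dependency

print(counting_backend())
```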
## See Also
- Building Stage - Build process details
- Performance Tuning - Runtime optimization
- Pipeline Index - Pipeline overview