Performance Optimization

For large corpora (100MB+), the default Python pipeline can be slow. Install DuckDB for an automatic 3-50x speedup in frequency counting, and ensure Cython extensions are compiled for parallel segmentation.

DuckDB Acceleration

When DuckDB is installed, the pipeline automatically uses it for ultra-fast frequency counting:
from myspellchecker.data_pipeline import FrequencyBuilder

# DuckDB is auto-enabled when the duckdb package is installed
builder = FrequencyBuilder(input_dir="/path/to/input", output_dir="/path/to/output")
builder.load_data()  # Auto-detects and uses DuckDB

# Explicitly disable DuckDB (use pure Python/Arrow)
builder.load_data(use_duckdb=False)
Performance Comparison:
Corpus Size | Python Mode | DuckDB Mode | Speedup
100MB       | 10s         | 3s          | 3x
500MB       | 45s         | 8s          | 5x
1GB         | 120s        | 12s         | 10x
5GB         | 600s+       | 40s         | 15x
10GB+       | Hours       | ~90s        | 50x+
How It Works:
  1. Arrow file is memory-mapped with PyArrow (efficient streaming)
  2. Arrow table is registered with DuckDB (zero-copy when possible)
  3. Single-pass SQL queries replace Python loops for aggregation
  4. Disk-based temp storage handles datasets larger than RAM
Requirements:
# DuckDB is an optional dependency
pip install "myspellchecker[duckdb]"

# Or install directly (quote the specifier so the shell
# does not treat ">" as a redirect)
pip install "duckdb>=1.0.0"
Resource Configuration: DuckDB automatically configures itself for optimal performance:
  • Uses all available CPU threads
  • Memory limit: 6GB (configurable via DuckDB settings)
  • Temp storage: Uses work directory (not /tmp)

Parallel Processing

Enable parallel processing for faster builds:
from myspellchecker.data_pipeline import PipelineConfig

config = PipelineConfig(
    num_workers=8,      # Use 8 CPU cores
    batch_size=50000,   # Records per batch
)

Optimal Worker Count

CPU Cores | Recommended Workers
2         | 2
4         | 4
8         | 6-8
16+       | 12-16
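The table above can be encoded as a small helper. This is a hypothetical convenience function, not part of the library API:

```python
import os

def recommended_workers(cores=None):
    """Map a core count to the recommended num_workers per the table above."""
    cores = cores or os.cpu_count() or 2
    if cores <= 4:
        return cores          # 2 -> 2, 4 -> 4
    if cores < 16:
        return min(cores, 8)  # 8 cores -> 6-8 band, cap at 8
    return min(cores, 16)     # 16+ cores -> 12-16 band, cap at 16

print(recommended_workers(8))   # 8
print(recommended_workers(32))  # 16
```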

Batch Size Tuning

Larger batches improve throughput but use more memory:
# Memory-constrained (4GB RAM)
config = PipelineConfig(batch_size=5000)

# Balanced (8-16GB RAM)
config = PipelineConfig(batch_size=20000)

# High-memory (32GB+ RAM)
config = PipelineConfig(batch_size=100000)

Memory Optimization

Sharding for Large Files

The pipeline automatically shards input files for memory-efficient processing:
config = PipelineConfig(
    num_shards=50,           # More shards for larger files
    batch_size=10000,        # Smaller batches for less memory
)

Intermediate Files

Use disk for intermediate Arrow files:
config = PipelineConfig(
    work_dir="/tmp/pipeline",
    keep_intermediate=False,  # Clean up after
)

I/O Optimization

SSD Storage

Use SSD for both input and output:
# Place corpus and output on SSD
myspellchecker build --input /ssd/corpus.txt --output /ssd/dict.db

Pre-sorted Input

Sorted input groups repeated words together, which improves compression of intermediate files:
# Sort corpus alphabetically
sort corpus.txt > sorted_corpus.txt
myspellchecker build --input sorted_corpus.txt --output dict.db

Sharding Large Corpora

Split large files for parallel ingestion:
# Split into 100MB chunks
split -b 100m corpus.txt shard_

# Process with glob pattern
myspellchecker build --input "shard_*" --output dict.db

Quality Optimization

Frequency Thresholds

Balance coverage vs. noise:
# Include rare words (noisy)
config = PipelineConfig(min_frequency=1)

# Standard (balanced, default)
config = PipelineConfig(min_frequency=50)

# Only common words (clean)
config = PipelineConfig(min_frequency=100)
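Conceptually, min_frequency is a cutoff applied to the aggregated counts. A minimal sketch with a plain Counter (the real pipeline applies the same cutoff during aggregation):

```python
from collections import Counter

counts = Counter({"the": 900, "cat": 60, "zxqv": 2})

min_frequency = 50
kept = {w, {w: c for w, c in counts.items()}} if False else \
    {w: c for w, c in counts.items() if c >= min_frequency}
print(kept)  # {'the': 900, 'cat': 60} -- the noisy token is dropped
```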

Database Size

min_frequency | DB Size (10M word corpus)
1             | ~200MB
50            | ~100MB
100           | ~50MB

Database Optimization

Index Strategy

Indexes are created automatically for fast lookups. The database includes:
  • idx_syllables_text - Syllable text lookups
  • idx_words_text - Word text lookups
  • idx_bigrams_w1_w2 - Bigram lookups
  • idx_trigrams_w1_w2_w3 - Trigram lookups
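The equivalent DDL looks roughly like the following. The index names come from the list above, but the table schemas here are assumptions for illustration; the actual columns may differ:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE syllables (text TEXT, freq INTEGER);
CREATE TABLE words     (text TEXT, freq INTEGER);
CREATE TABLE bigrams   (w1 TEXT, w2 TEXT, freq INTEGER);
CREATE TABLE trigrams  (w1 TEXT, w2 TEXT, w3 TEXT, freq INTEGER);

CREATE INDEX idx_syllables_text    ON syllables (text);
CREATE INDEX idx_words_text        ON words (text);
CREATE INDEX idx_bigrams_w1_w2     ON bigrams (w1, w2);
CREATE INDEX idx_trigrams_w1_w2_w3 ON trigrams (w1, w2, w3);
""")

# Confirm the indexes exist
names = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='index' ORDER BY name")]
print(names)
```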

Vacuum

The database is automatically compacted (VACUUM) after building:
from myspellchecker.data_pipeline import DatabasePackager

# DatabasePackager takes input_dir and database_path directly
packager = DatabasePackager(input_dir="/path/to/input", database_path="output.db")
# Vacuum is applied automatically during packaging
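If you ever need to compact a database by hand (e.g. after ad-hoc deletions), a plain SQLite VACUUM does the same thing the packager runs for you. A minimal sketch, assuming the output is a standard SQLite file (the path here is a temp file for illustration):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "output.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE words (text TEXT)")
con.executemany("INSERT INTO words VALUES (?)", [("w%d" % i,) for i in range(10_000)])
con.execute("DELETE FROM words")  # leaves free pages behind
con.commit()

before = os.path.getsize(path)
con.execute("VACUUM")             # rebuild the file, reclaiming free pages
after = os.path.getsize(path)
con.close()
print(before, after)  # file shrinks after VACUUM
```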

Segmentation Optimization

Segmenter Selection

Choose segmenter based on needs:
# Fastest (for quick builds)
config = PipelineConfig(word_engine="crf")

# Best accuracy (for production)
config = PipelineConfig(word_engine="myword")

Cython Acceleration

Ensure Cython extensions are compiled:
python setup.py build_ext --inplace

Benchmarking

Measure Build Time

import time

start = time.time()
pipeline.build_database(input_files, database_path)
elapsed = time.time() - start

print(f"Build completed in {elapsed:.1f}s")

Profile Memory

import tracemalloc

tracemalloc.start()
pipeline.build_database(input_files, database_path)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Peak memory: {peak / 1024 / 1024:.1f} MB")

Troubleshooting

Out of Memory

# Reduce batch size
config = PipelineConfig(batch_size=5000)

# Increase shards for better memory distribution
config = PipelineConfig(num_shards=50)

# Limit workers
config = PipelineConfig(num_workers=2)

Slow Build

# Increase workers
config = PipelineConfig(num_workers=8)

# Increase batch size
config = PipelineConfig(batch_size=50000)

# Use faster segmenter
config = PipelineConfig(word_engine="crf")
Install DuckDB for a significant speedup at any corpus size:
pip install "duckdb>=1.0.0"
DuckDB is used by default when installed, providing 3-50x faster frequency counting via SQL-based aggregation.

Large Output Database

# Increase min_frequency
config = PipelineConfig(min_frequency=100)

Small Corpus (<100MB)

config = PipelineConfig(
    num_workers=4,
    batch_size=10000,
    min_frequency=10,
)

Large Corpus (1-10GB)

config = PipelineConfig(
    num_shards=50,
    num_workers=8,
    batch_size=50000,
    min_frequency=50,
)

Very Large Corpus (>10GB)

config = PipelineConfig(
    work_dir="/ssd/tmp",
    num_shards=100,
    num_workers=12,
    batch_size=100000,
    min_frequency=100,
)
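The three presets above can be collected into one dispatch helper. This is a hypothetical convenience wrapper, not part of the library API; corpora between 100MB and 1GB are not covered explicitly above, so this sketch folds them into the large-corpus band:

```python
def preset_for_corpus(size_gb):
    """Return the recommended PipelineConfig kwargs for a corpus size in GB."""
    if size_gb < 0.1:  # small (<100MB)
        return dict(num_workers=4, batch_size=10_000, min_frequency=10)
    if size_gb <= 10:  # large (up to 10GB)
        return dict(num_shards=50, num_workers=8,
                    batch_size=50_000, min_frequency=50)
    # very large (>10GB): shard aggressively, keep only common words
    return dict(num_shards=100, num_workers=12,
                batch_size=100_000, min_frequency=100)

print(preset_for_corpus(5))
```

Usage would be config = PipelineConfig(**preset_for_corpus(5)), optionally overriding individual fields.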
Recommended: Install DuckDB (pip install "duckdb>=1.0.0") for optimal performance. The FrequencyBuilder automatically uses DuckDB when installed, providing 10-50x faster processing for large files.

See Also