## Overview

## Supported Formats

| Format | Extension | Use Case |
|---|---|---|
| Plain Text | .txt | Raw corpus text |
| CSV | .csv | Structured word lists |
| TSV | .tsv | Tab-separated word lists |
| JSON | .json | Complex data with metadata |
| JSON Lines | .jsonl | Streaming JSON |
| Parquet | .parquet | Columnar format for large datasets |

## Using the CorpusIngester

### Basic Usage

The `CorpusIngester` reads corpus files and outputs Arrow shards for efficient processing.
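The core batching behavior can be pictured with a stdlib-only sketch. This mirrors what the ingester does conceptually, not the library's actual API; a real shard would be written with PyArrow rather than kept as a Python list:

```python
from io import StringIO
from typing import Iterator, List


def iter_batches(lines: Iterator[str], batch_size: int = 10000,
                 skip_empty_lines: bool = True) -> Iterator[List[str]]:
    """Group corpus lines into fixed-size batches, as the ingester does
    before converting each batch into an Arrow record batch."""
    batch: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if skip_empty_lines and not line:
            continue
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch


# Five non-empty lines batched two at a time.
corpus = StringIO("alpha\n\nbeta\ngamma\ndelta\nepsilon\n")
batches = list(iter_batches(corpus, batch_size=2))
```

The final partial batch is flushed rather than dropped, so no records are lost at end of file.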

### With Pipeline (Recommended)

For most use cases, use the `Pipeline` class, which manages ingestion automatically.
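A toy stand-in illustrates the idea: the pipeline holds a config and runs its stages in order, with ingestion as the first stage. The constructor signature and stage interface here are assumptions, not the library's real API:

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PipelineConfig:
    """Minimal config sketch; the real PipelineConfig has more options."""
    batch_size: int = 10000
    encoding: str = "utf-8"


@dataclass
class Pipeline:
    """Toy stand-in for the real Pipeline: runs each stage in order,
    feeding the previous stage's output to the next."""
    config: PipelineConfig
    stages: List[Callable] = field(default_factory=list)

    def run(self, data):
        for stage in self.stages:
            data = stage(data, self.config)
        return data


def ingest(lines, config):        # stand-in ingestion stage
    return [line for line in lines if line.strip()]


def lowercase(records, config):   # stand-in processing stage
    return [r.lower() for r in records]


pipeline = Pipeline(PipelineConfig(), stages=[ingest, lowercase])
result = pipeline.run(["Alpha", "", "Beta"])
```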

## Configuration Options

Configuration is handled through `PipelineConfig` and `IngesterConfig`:
| Option | Default | Description |
|---|---|---|
| `batch_size` | `10000` | Records per batch |
| `encoding` | `"utf-8"` | File encoding |
| `skip_empty_lines` | `True` | Skip empty lines |
| `normalize_unicode` | `True` | Apply Unicode normalization |
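As a sketch, `IngesterConfig` can be modeled as a dataclass whose fields and defaults match the table above (the real class may carry additional options):

```python
from dataclasses import dataclass


@dataclass
class IngesterConfig:
    """Sketch of the documented ingester options; field names and
    defaults follow the table above."""
    batch_size: int = 10000
    encoding: str = "utf-8"
    skip_empty_lines: bool = True
    normalize_unicode: bool = True


# Override a single option; everything else keeps its default.
config = IngesterConfig(batch_size=50000)
```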

## Format-Specific Options

### Plain Text
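For plain text, each line is one record, and `normalize_unicode` makes visually identical strings compare equal. The sketch below assumes NFC composition; the library may use a different normalization form:

```python
import unicodedata


def normalize_line(line: str, normalize_unicode: bool = True) -> str:
    """Apply the normalize_unicode step: compose combining characters
    so equivalent spellings of the same text become byte-identical."""
    return unicodedata.normalize("NFC", line) if normalize_unicode else line


decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent
composed = "caf\u00e9"      # precomposed 'é'
```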

### CSV/TSV
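CSV and TSV differ only in delimiter. A stdlib sketch of reading a structured word list follows; the `text_column` option name is illustrative, not the library's actual parameter:

```python
import csv
from io import StringIO


def read_word_list(fileobj, delimiter: str = ",", text_column: str = "word"):
    """Pull the text column out of a delimited word list.
    TSV is the same code path with delimiter='\\t'."""
    reader = csv.DictReader(fileobj, delimiter=delimiter)
    return [row[text_column] for row in reader]


tsv_data = StringIO("word\tfreq\nhello\t12\nworld\t7\n")
words = read_word_list(tsv_data, delimiter="\t")
```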

### JSON
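JSON Lines keeps one JSON object per line, which is what makes `.jsonl` streamable for large corpora. A minimal sketch, assuming the text lives under a `"text"` field (the real field-selection option may differ):

```python
import json
from io import StringIO


def read_jsonl(fileobj, text_field: str = "text"):
    """Stream a .jsonl file one record at a time instead of loading
    the whole document, extracting the text field from each object."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)[text_field]


stream = StringIO(
    '{"text": "alpha", "lang": "en"}\n'
    '{"text": "beta", "lang": "de"}\n'
)
texts = list(read_jsonl(stream))
```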

### Parquet
Parquet files are read using PyArrow. The ingester automatically detects the text column:

- Looks for a column named `text`
- Falls back to the first string/large_string column
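The detection rule can be expressed over (column name, dtype) pairs; in the real ingester this would run against a PyArrow schema, but the logic is the same:

```python
from typing import List, Optional, Tuple


def detect_text_column(schema: List[Tuple[str, str]]) -> Optional[str]:
    """Mirror the documented rule: prefer a column literally named
    'text', else fall back to the first string/large_string column."""
    names = [name for name, _ in schema]
    if "text" in names:
        return "text"
    for name, dtype in schema:
        if dtype in ("string", "large_string"):
            return name
    return None  # no usable text column found


schema = [("id", "int64"), ("sentence", "large_string"), ("score", "double")]
column = detect_text_column(schema)
```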

## Streaming Large Files

The pipeline automatically handles large files by streaming and sharding.

## Error Handling
The ingester validates input files and raises `IngestionError` for missing files.
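A sketch of that fail-fast check, with a stand-in exception class (the real `IngestionError` lives in the library and its message format may differ):

```python
from pathlib import Path


class IngestionError(Exception):
    """Stand-in for the library's IngestionError."""


def validate_input(path: str) -> Path:
    """Check the input up front so a bad path fails with a clear
    IngestionError instead of a bare FileNotFoundError mid-run."""
    p = Path(path)
    if not p.is_file():
        raise IngestionError(f"input file not found: {path}")
    return p


try:
    validate_input("/no/such/corpus.txt")
except IngestionError as err:
    message = str(err)
```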

## Multiple Input Files

### Multiple Paths
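Passing several inputs simply chains them into one record stream. A stdlib sketch using in-memory files in place of real paths:

```python
from io import StringIO


def ingest_many(files):
    """Chain several corpus files into a single record stream — a
    sketch of how a list of input paths is handled (real code would
    open each path with the configured encoding)."""
    for fileobj in files:
        for line in fileobj:
            line = line.rstrip("\n")
            if line:
                yield line


records = list(ingest_many([StringIO("a\nb\n"), StringIO("c\n")]))
```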

### Glob Patterns (via CLI)
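Shell-style patterns are presumably expanded into a concrete file list before ingestion; the programmatic equivalent with the stdlib, sorted for a deterministic shard order, looks like this:

```python
import glob
import tempfile
from pathlib import Path


def expand_patterns(patterns):
    """Expand glob patterns into a sorted, de-duplicated path list —
    roughly what a CLI front end does before handing files to the
    ingester."""
    paths = set()
    for pattern in patterns:
        paths.update(glob.glob(pattern))
    return sorted(paths)


# Demonstrate against a throwaway directory of corpus files.
root = Path(tempfile.mkdtemp())
for name in ("a.txt", "b.txt", "notes.json"):
    (root / name).write_text("example\n", encoding="utf-8")

txt_files = expand_patterns([str(root / "*.txt")])
```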

## Performance Tips
- Use multiple workers for parallel ingestion
- Increase `num_shards` for large corpora
- Use SSD for faster I/O
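The multi-worker tip can be sketched with the stdlib's `concurrent.futures`; the per-file worker and the worker count here are assumptions (a real worker would write an Arrow shard rather than return a count):

```python
from concurrent.futures import ThreadPoolExecutor


def ingest_file(path: str) -> int:
    """Stand-in per-file worker: returns a record count. The counts
    below are simulated data purely for this sketch."""
    fake_corpus = {"a.txt": 3, "b.txt": 5, "c.txt": 2}
    return fake_corpus[path]


paths = ["a.txt", "b.txt", "c.txt"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves input order, so counts line up with paths.
    counts = list(pool.map(ingest_file, paths))
total_records = sum(counts)
```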

## Integration with Pipeline

## See Also
- Corpus Format - Input specifications
- Processing Stage - Next pipeline stage
- Pipeline Index - Pipeline overview