The first step in the dictionary-building pipeline converts raw corpus files (TXT, CSV/TSV, JSON/JSONL, Parquet) into Arrow shards for efficient parallel processing downstream.

Overview

Figure: Ingestion stage overview, showing input files flowing through the CorpusIngester to Arrow shards.

Supported Formats

| Format | Extension | Use Case |
| --- | --- | --- |
| Plain Text | .txt | Raw corpus text |
| CSV | .csv | Structured word lists |
| TSV | .tsv | Tab-separated word lists |
| JSON | .json | Complex data with metadata |
| JSON Lines | .jsonl | Streaming JSON |
| Parquet | .parquet | Columnar format for large datasets |

Using the CorpusIngester

Basic Usage

The CorpusIngester reads corpus files and outputs Arrow shards for efficient processing:
from myspellchecker.data_pipeline import CorpusIngester

ingester = CorpusIngester()

# Generate sample corpus for testing
sample_lines = ingester.generate_sample_corpus()

# Process files into Arrow shards (used internally by Pipeline)
# The ingester produces Arrow shard files in the work directory
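To exercise the ingester end to end, one option is to write the sample lines to a plain-text file and feed it to the pipeline shown next (a minimal sketch; it assumes generate_sample_corpus() returns an iterable of text lines, as the variable name above suggests):
from pathlib import Path

# Write the generated sample lines to a plain-text corpus file
# (assumption: each element of sample_lines is one line of text)
Path("sample_corpus.txt").write_text("\n".join(sample_lines), encoding="utf-8")
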
For most use cases, use the Pipeline class which manages ingestion automatically:
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    text_col="text",         # Column name for CSV/TSV
    json_key="text",         # Key name for JSON
)

pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)

Configuration Options

Configuration is handled through PipelineConfig and IngesterConfig:
from myspellchecker.data_pipeline import PipelineConfig, IngesterConfig

# Pipeline-level configuration
pipeline_config = PipelineConfig(
    text_col="text",         # Column name for CSV/TSV
    json_key="text",         # Key name for JSON
    num_shards=20,           # Number of Arrow shards to create
)

# Ingester-specific configuration
ingester_config = IngesterConfig(
    batch_size=10000,        # Records per batch for Arrow writing
    encoding="utf-8",        # File encoding
    skip_empty_lines=True,   # Skip empty lines during ingestion
    normalize_unicode=True,  # Apply Unicode normalization
)

| Option | Default | Description |
| --- | --- | --- |
| batch_size | 10000 | Records per batch |
| encoding | "utf-8" | File encoding |
| skip_empty_lines | True | Skip empty lines |
| normalize_unicode | True | Apply Unicode normalization |
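This page does not show how an IngesterConfig is attached to the ingestion stage. One plausible pattern, shown purely as an assumption (the config keyword below is hypothetical, not confirmed by this page), is passing it to the CorpusIngester constructor:
from myspellchecker.data_pipeline import CorpusIngester, IngesterConfig

ingester_config = IngesterConfig(
    batch_size=10000,
    skip_empty_lines=True,
)

# Hypothetical wiring: the "config" keyword is an assumption,
# not part of the documented API above.
ingester = CorpusIngester(config=ingester_config)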

Format-Specific Options

Plain Text

# Text files are read line by line
pipeline = Pipeline(config=PipelineConfig())
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)

CSV/TSV

config = PipelineConfig(
    text_col="text",  # Column name containing text
)
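For example, building from a small CSV with a text column works just like the plain-text case (a minimal sketch; the input file is created with the standard library):
import csv

from myspellchecker.data_pipeline import Pipeline, PipelineConfig

# Create a small CSV corpus with a "text" column
with open("corpus.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "text"])
    writer.writerow(["1", "မြန်မာစာ"])
    writer.writerow(["2", "ကောင်းပါတယ်"])

pipeline = Pipeline(config=PipelineConfig(text_col="text"))
pipeline.build_database(
    input_files=["corpus.csv"],
    database_path="dict.db",
)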

JSON

config = PipelineConfig(
    json_key="text",  # Key name containing text
)
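Both .json and .jsonl inputs use the same key. A minimal JSON Lines sketch, with the input file again created via the standard library:
import json

from myspellchecker.data_pipeline import Pipeline, PipelineConfig

# One JSON object per line, each carrying its text under the "text" key
records = [{"id": 1, "text": "မြန်မာစာ"}, {"id": 2, "text": "ကောင်းပါတယ်"}]
with open("corpus.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

pipeline = Pipeline(config=PipelineConfig(json_key="text"))
pipeline.build_database(
    input_files=["corpus.jsonl"],
    database_path="dict.db",
)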

Parquet

Parquet files are read using PyArrow. The ingester automatically detects the text column:
  1. Looks for a column named text
  2. Falls back to the first string/large_string column
# Create a Parquet corpus file
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"text": ["မြန်မာစာ", "ကောင်းပါတယ်"]})
pq.write_table(table, "corpus.parquet")

# Then use with pipeline
pipeline.build_database(
    input_files=["corpus.parquet"],
    database_path="dict.db",
)
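Because of the fallback in step 2, a Parquet file does not strictly need a column named text. A sketch relying on that documented fallback (the column name sentence here is arbitrary):
import pyarrow as pa
import pyarrow.parquet as pq

# No "text" column; per the detection rules above, the ingester
# should fall back to the first string column ("sentence").
table = pa.table({"sentence": ["မြန်မာစာ", "ကောင်းပါတယ်"]})
pq.write_table(table, "corpus_alt.parquet")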

Streaming Large Files

The pipeline automatically handles large files by streaming and sharding:
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

# Large files are automatically handled with parallel sharding
config = PipelineConfig(
    num_shards=50,       # More shards for better parallelization
    batch_size=50000,    # Larger batches for throughput
)

pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["large_corpus.txt"],
    database_path="dict.db",
)

Error Handling

The ingester validates input files and raises IngestionError for missing files:
from myspellchecker.data_pipeline import Pipeline, IngestionError

pipeline = Pipeline()

try:
    pipeline.build_database(
        input_files=["missing.txt"],
        database_path="dict.db",
    )
except IngestionError as e:
    print(f"Ingestion failed: {e}")
    print(f"Missing files: {e.missing_files}")

Multiple Input Files

Multiple Paths

pipeline = Pipeline()
pipeline.build_database(
    input_files=[
        "corpus1.txt",
        "corpus2.txt",
        "data/corpus3.csv",
    ],
    database_path="dict.db",
)

Glob Patterns (via CLI)

myspellchecker build --input "corpus/*.txt" --output dict.db
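
From Python, the standard library can expand the same pattern before handing explicit paths to the pipeline (a sketch; all Python examples on this page pass explicit file lists):
from glob import glob

from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()
pipeline.build_database(
    input_files=sorted(glob("corpus/*.txt")),  # expand the shell-style pattern
    database_path="dict.db",
)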

Performance Tips

  1. Use multiple workers for parallel ingestion
  2. Increase num_shards for large corpora
  3. Use SSD for faster I/O
# Optimized for large files
config = PipelineConfig(
    num_shards=50,      # More shards for parallelization
    num_workers=8,      # More workers
    batch_size=50000,   # Larger batches
)

Integration with Pipeline

from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    text_col="text",         # For CSV files
    json_key="text",         # For JSON files
)

pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)
