Ingestion is the first step in the dictionary building pipeline: it converts raw corpus files (plain text, CSV, TSV, JSON, JSON Lines, Parquet) into Arrow shards so that downstream stages can process them in parallel.
Overview
| Format | Extension | Use Case |
|---|---|---|
| Plain Text | .txt | Raw corpus text |
| CSV | .csv | Structured word lists |
| TSV | .tsv | Tab-separated word lists |
| JSON | .json | Complex data with metadata |
| JSON Lines | .jsonl | Streaming JSON |
| Parquet | .parquet | Columnar format for large datasets |
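As a rough illustration of how the extensions in the table could be mapped to a reader, here is a minimal sketch. The `detect_format` helper and its format labels are hypothetical and are not part of myspellchecker; the library's own detection logic may differ.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a format label.
# The labels are illustrative only; they are not myspellchecker identifiers.
EXTENSION_FORMATS = {
    ".txt": "text",
    ".csv": "csv",
    ".tsv": "tsv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".parquet": "parquet",
}


def detect_format(path: str) -> str:
    """Return a format label for a corpus file based on its extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive match
    try:
        return EXTENSION_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported corpus format: {suffix or path!r}") from None
```

Failing early on an unknown extension keeps a mistyped path from silently contributing nothing to the dictionary.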
Using the CorpusIngester
Basic Usage
The CorpusIngester reads corpus files and outputs Arrow shards for efficient processing:
from myspellchecker.data_pipeline import CorpusIngester
ingester = CorpusIngester()
# Generate sample corpus for testing
sample_lines = ingester.generate_sample_corpus()
# Process files into Arrow shards (used internally by Pipeline)
# The ingester produces Arrow shard files in the work directory
With Pipeline (Recommended)
For most use cases, use the Pipeline class which manages ingestion automatically:
from myspellchecker.data_pipeline import Pipeline, PipelineConfig
config = PipelineConfig(
text_col="text", # Column name for CSV/TSV
json_key="text", # Key name for JSON
)
pipeline = Pipeline(config=config)
pipeline.build_database(
input_files=["corpus.txt"],
database_path="dict.db",
)
Configuration Options
Configuration is handled through PipelineConfig and IngesterConfig:
from myspellchecker.data_pipeline import PipelineConfig, IngesterConfig
# Pipeline-level configuration
pipeline_config = PipelineConfig(
text_col="text", # Column name for CSV/TSV
json_key="text", # Key name for JSON
num_shards=20, # Number of Arrow shards to create
)
# Ingester-specific configuration
ingester_config = IngesterConfig(
batch_size=10000, # Records per batch for Arrow writing
encoding="utf-8", # File encoding
skip_empty_lines=True, # Skip empty lines during ingestion
normalize_unicode=True, # Apply Unicode normalization
)
| Option | Default | Description |
|---|---|---|
| batch_size | 10000 | Records per batch |
| encoding | "utf-8" | File encoding |
| skip_empty_lines | True | Skip empty lines |
| normalize_unicode | True | Apply Unicode normalization |
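To make the effect of skip_empty_lines and normalize_unicode concrete, the sketch below applies both to a stream of lines. It is a standalone illustration, not the ingester's code: the `clean_lines` helper is hypothetical, and the NFC normalization form is an assumption, since the options table does not say which form the library applies.

```python
import unicodedata
from typing import Iterable, Iterator


def clean_lines(
    lines: Iterable[str],
    skip_empty_lines: bool = True,
    normalize_unicode: bool = True,
    form: str = "NFC",  # Assumption: the actual normalization form is not documented here.
) -> Iterator[str]:
    """Yield lines with newlines stripped, optionally normalized and filtered."""
    for line in lines:
        text = line.rstrip("\r\n")
        if normalize_unicode:
            text = unicodedata.normalize(form, text)
        # Treat whitespace-only lines as empty.
        if skip_empty_lines and not text.strip():
            continue
        yield text
```

Normalizing before counting matters for scripts with combining marks, where the same visible word can otherwise be stored as several distinct code point sequences.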
Plain Text
# Text files are read line by line
pipeline = Pipeline(config=PipelineConfig())
pipeline.build_database(
input_files=["corpus.txt"],
database_path="dict.db",
)
CSV/TSV
config = PipelineConfig(
text_col="text", # Column name containing text
)
JSON
config = PipelineConfig(
json_key="text", # Key name containing text
)
Parquet
Parquet files are read using PyArrow. The ingester automatically detects the text column:
- Looks for a column named "text"
- Falls back to the first string/large_string column
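The two-step detection rule can be sketched without PyArrow by treating a schema as a list of (column name, type name) pairs. The `pick_text_column` helper is hypothetical, and returning None when no candidate exists is an assumption; the ingester's behavior in that case is not described here.

```python
from typing import Optional, Sequence, Tuple

# Type names considered textual, matching the fallback rule above.
STRING_TYPES = {"string", "large_string"}


def pick_text_column(
    schema: Sequence[Tuple[str, str]],
    preferred: str = "text",
) -> Optional[str]:
    """Pick the text column: the preferred name first, else the first string column."""
    names = [name for name, _ in schema]
    if preferred in names:
        return preferred
    for name, type_name in schema:
        if type_name in STRING_TYPES:
            return name
    return None  # Assumption: no suitable column found.
```

With PyArrow, the pairs could be built from a table's schema, for example `[(f.name, str(f.type)) for f in table.schema]`.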
# Create a Parquet corpus file
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({"text": ["မြန်မာစာ", "ကောင်းပါတယ်"]})
pq.write_table(table, "corpus.parquet")
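# Then build the database from the Parquet file
from myspellchecker.data_pipeline import Pipeline, PipelineConfig
pipeline = Pipeline(config=PipelineConfig())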
pipeline.build_database(
input_files=["corpus.parquet"],
database_path="dict.db",
)
Streaming Large Files
The pipeline automatically handles large files by streaming and sharding:
from myspellchecker.data_pipeline import Pipeline, PipelineConfig
# Large files are automatically handled with parallel sharding
config = PipelineConfig(
num_shards=50, # More shards for better parallelization
batch_size=50000, # Larger batches for throughput
)
pipeline = Pipeline(config=config)
pipeline.build_database(
input_files=["large_corpus.txt"],
database_path="dict.db",
)
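The idea behind streaming and sharding can be sketched with the standard library alone: read lines lazily, group them into fixed-size batches, and spread the batches across shards. Both helpers below are hypothetical, and the round-robin assignment is an assumption; the pipeline's actual sharding strategy is not specified here.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def iter_batches(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size lines without loading the whole input."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def assign_shards(
    batches: Iterable[List[str]],
    num_shards: int,
) -> Iterator[Tuple[int, List[str]]]:
    """Pair each batch with a shard index, cycling through shards (assumed strategy)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    for index, batch in enumerate(batches):
        yield index % num_shards, batch
```

Because an open file object is itself an iterator over lines, passing it to `iter_batches` keeps memory use bounded by batch_size rather than by file size.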
Error Handling
The ingester validates input files and raises IngestionError for missing files:
from myspellchecker.data_pipeline import Pipeline, IngestionError
pipeline = Pipeline()
try:
pipeline.build_database(
input_files=["missing.txt"],
database_path="dict.db",
)
except IngestionError as e:
print(f"Ingestion failed: {e}")
print(f"Missing files: {e.missing_files}")
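If you prefer to check paths before starting a build, a small pre-flight check can list missing inputs up front. This is a sketch using only the standard library; `find_missing_files` is a hypothetical helper and does not replace the validation the ingester performs.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_files(input_files: Iterable[str]) -> List[str]:
    """Return the input paths that do not point to an existing regular file."""
    return [path for path in input_files if not Path(path).is_file()]
```

An empty result means every path exists; otherwise the returned list can be reported to the user before any shards are written.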
Multiple Paths
pipeline = Pipeline()
pipeline.build_database(
input_files=[
"corpus1.txt",
"corpus2.txt",
"data/corpus3.csv",
],
database_path="dict.db",
)
Glob Patterns (via CLI)
myspellchecker build --input "corpus/*.txt" --output dict.db
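When calling the pipeline from Python rather than the CLI, patterns can be expanded with the standard glob module before being passed as input_files. The `expand_patterns` helper below is hypothetical; keeping an unmatched pattern as a literal path is a design choice of this sketch, made so that a later existence check can report it.

```python
import glob
from typing import Iterable, List


def expand_patterns(patterns: Iterable[str]) -> List[str]:
    """Expand glob patterns into a flat, deterministic list of file paths."""
    files: List[str] = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))  # sorted for a stable order
        # Keep the literal value when nothing matches so it is not silently dropped.
        files.extend(matches if matches else [pattern])
    return files
```

The resulting list can then be supplied as the input_files argument shown in the Multiple Paths example.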
Performance Tips
- Use multiple workers for parallel ingestion
- Increase num_shards for large corpora
- Use SSD for faster I/O
# Optimized for large files
config = PipelineConfig(
num_shards=50, # More shards for parallelization
num_workers=8, # More workers
batch_size=50000, # Larger batches
)
Integration with Pipeline
from myspellchecker.data_pipeline import Pipeline, PipelineConfig
config = PipelineConfig(
text_col="text", # For CSV files
json_key="text", # For JSON files
)
pipeline = Pipeline(config=config)
pipeline.build_database(
input_files=["corpus.txt"],
database_path="dict.db",
)
See Also