Supported Formats
| Format | Extension | Use Case |
|---|---|---|
| Plain Text | .txt | Simple text corpora |
| CSV | .csv | Structured data with metadata |
| TSV | .tsv | Tab-separated structured data |
| JSON | .json | Complex data with nested fields |
| JSON Lines | .jsonl | Streaming JSON (one object per line) |
| Parquet | .parquet | Columnar format for large datasets |
Plain Text (.txt)
Simple UTF-8 text file with Myanmar content.
Requirements
- Encoding: UTF-8 (required)
- Line endings: LF or CRLF
- Empty lines: Ignored
- Comments: Not supported (lines starting with `#` are processed as normal text)
Best Practices
- One sentence per line (recommended)
- No HTML or markup
- Pre-normalize Unicode (NFC form)
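The practices above can be sketched as a small pre-processing step using only the standard library (file names and the helper name are illustrative):

```python
import os
import tempfile
import unicodedata

def normalize_txt(path_in: str, path_out: str) -> int:
    """Rewrite a UTF-8 corpus in NFC form, one sentence per line, dropping empty lines."""
    written = 0
    with open(path_in, encoding="utf-8") as src, \
         open(path_out, "w", encoding="utf-8", newline="\n") as dst:
        for line in src:
            line = unicodedata.normalize("NFC", line.strip())
            if not line:  # empty lines are ignored by the ingester anyway
                continue
            dst.write(line + "\n")
            written += 1
    return written

# Demo on a throwaway file: the NFD sequence "a" + combining acute
# becomes a single precomposed character after NFC normalization.
work = tempfile.mkdtemp()
raw, clean = os.path.join(work, "raw.txt"), os.path.join(work, "clean.txt")
with open(raw, "w", encoding="utf-8") as f:
    f.write("ka\u0301\n\n   \nsecond line\n")
count = normalize_txt(raw, clean)
```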
CSV Format (.csv)
Structured data with optional metadata columns.
Column Specifications
| Column | Required | Type | Description |
|---|---|---|---|
| text | Yes | string | Myanmar text |
| frequency | No | integer | Corpus frequency |
| pos | No | string | Part-of-speech tag |
| syllables | No | string | Pre-segmented syllables |
| source | No | string | Corpus source identifier |
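A minimal round-trip using Python's `csv` module shows the expected layout (the row values, including the syllable segmentation, are purely illustrative):

```python
import csv
import io

# Write a tiny in-memory corpus sample using the columns above.
sample = io.StringIO()
writer = csv.DictWriter(sample, fieldnames=["text", "frequency", "pos", "syllables"])
writer.writeheader()
writer.writerow({"text": "မင်္ဂလာပါ", "frequency": 120, "pos": "intj",
                 "syllables": "မင်္ဂ|လာ|ပါ"})

# Read it back; csv returns every field as a string.
sample.seek(0)
rows = list(csv.DictReader(sample))
```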
Options
Configure column names via `PipelineConfig`:
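The field names below are assumptions for illustration only, not the confirmed API; consult the actual `PipelineConfig` definition for the real parameter names.

```python
# Hypothetical field names -- check the real PipelineConfig for the
# actual parameters. This only illustrates the idea of remapping columns
# when a CSV uses different headers than the defaults.
config = PipelineConfig(
    text_column="sentence",     # CSV column holding Myanmar text
    frequency_column="count",   # optional frequency column
    source_column="corpus_id",  # optional source identifier
)
```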
JSON Format (.json)
Raw Array Format
The JSON file must be a raw array (top-level list), not wrapped in an object like `{"entries": [...]}`:
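The distinction can be checked with the standard `json` module (entry values are illustrative):

```python
import json

# Valid: a top-level array of entries.
valid = '[{"text": "မင်္ဂလာပါ", "frequency": 120}]'
entries = json.loads(valid)
assert isinstance(entries, list)

# Invalid for this ingester: the same entries wrapped in an object.
wrapped = '{"entries": [{"text": "မင်္ဂလာပါ"}]}'
assert isinstance(json.loads(wrapped), dict)  # would be rejected
```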
JSON Lines (.jsonl)
One JSON object per line:
Options
Parquet Format (.parquet)
Apache Parquet is a columnar storage format, ideal for large datasets.
Column Detection
The ingester automatically detects the text column:
- Primary: Looks for a column named `text`
- Fallback: Uses the first string column in the schema
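The rule above can be sketched as follows (an illustrative stand-in, not the ingester's actual code; the schema is modeled as a simple name-to-type mapping):

```python
def detect_text_column(schema: dict[str, str]) -> str:
    """Pick the text column from a {name: type} schema.

    Mirrors the rule above: prefer a column literally named 'text',
    otherwise fall back to the first string-typed column.
    """
    if "text" in schema:  # primary: exact name match
        return "text"
    for name, dtype in schema.items():  # fallback: first string column
        if dtype == "string":
            return name
    raise ValueError("no text column found in schema")
```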
Advantages
- Compression: Efficient storage for large corpora
- Columnar: Fast reads for specific columns
- Type-safe: Schema enforcement
- Interoperability: Works with pandas, Spark, DuckDB
Example with pandas
Large Files
Automatic Sharding
The pipeline automatically shards large files for memory-efficient processing.
Manual Sharding
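A minimal manual-sharding sketch for line-oriented corpora, using only the standard library (the shard size and file naming are illustrative, not a pipeline convention):

```python
import os
import tempfile

def shard_file(path: str, out_dir: str, lines_per_shard: int = 100_000) -> list[str]:
    """Split a line-oriented corpus into fixed-size shard files."""
    os.makedirs(out_dir, exist_ok=True)
    shards, buf = [], []

    def flush() -> None:
        shard_path = os.path.join(out_dir, f"shard_{len(shards):04d}.txt")
        with open(shard_path, "w", encoding="utf-8") as dst:
            dst.writelines(buf)
        shards.append(shard_path)

    with open(path, encoding="utf-8") as src:
        for line in src:
            buf.append(line)
            if len(buf) == lines_per_shard:
                flush()
                buf = []
    if buf:
        flush()
    return shards

# Demo: 250 lines with 100-line shards -> shards of 100, 100, and 50 lines.
work = tempfile.mkdtemp()
src = os.path.join(work, "corpus.txt")
with open(src, "w", encoding="utf-8") as f:
    f.write("".join(f"line {i}\n" for i in range(250)))
shards = shard_file(src, os.path.join(work, "shards"), lines_per_shard=100)
```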
Splitting large corpora into shards keeps per-shard memory usage bounded.
Validation
Check Format Before Building
Use `build --validate` to validate input files before building (there is no standalone validate subcommand).
Common Validation Errors
| Error | Cause | Solution |
|---|---|---|
| Invalid encoding | Non-UTF-8 bytes | Convert to UTF-8 |
| Invalid characters | Control chars | Clean input |
| Empty lines | Missing content | Remove empty lines |
| Zawgyi detected | Legacy encoding | Convert to Unicode |
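The first three checks can be run as a quick pre-flight with the standard library (an illustrative sketch; Zawgyi detection needs a statistical detector and is omitted here):

```python
import os
import tempfile

def precheck(path: str) -> list[str]:
    """Report the common problems from the table above."""
    problems = []
    with open(path, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return ["Invalid encoding: non-UTF-8 bytes"]
    for i, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            problems.append(f"line {i}: empty line")
        elif any(ord(c) < 0x20 and c != "\t" for c in line):
            problems.append(f"line {i}: control characters")
    return problems

# Demo: a file with an empty line and an embedded NUL byte.
work = tempfile.mkdtemp()
sample = os.path.join(work, "sample.txt")
with open(sample, "wb") as f:
    f.write("မင်္ဂလာပါ\n\nbad\x00line\n".encode("utf-8"))
problems = precheck(sample)

# Demo: non-UTF-8 bytes are reported as an encoding error.
latin = os.path.join(work, "latin1.txt")
with open(latin, "wb") as f:
    f.write("café".encode("latin-1"))
assert precheck(latin) == ["Invalid encoding: non-UTF-8 bytes"]
```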
Encoding Conversion
Zawgyi to Unicode
For standalone Zawgyi-to-Unicode conversion:
Note: The data pipeline internally uses normalize_with_zawgyi_conversion(),
which combines Zawgyi conversion with full text normalization (Unicode NFC,
zero-width character removal, etc.). You do not need to pre-convert Zawgyi
files before feeding them to the pipeline.
See Also
- Data Pipeline Index - Pipeline overview
- Database Schema - Output format
- Building Guide - Build process