The data pipeline accepts six input formats. All files must be UTF-8 encoded. The simplest option is plain text with one sentence per line; structured formats (CSV, JSON, Parquet) support additional metadata like frequency and POS tags.

Supported Formats

| Format | Extension | Use Case |
|---|---|---|
| Plain Text | .txt | Simple text corpora |
| CSV | .csv | Structured data with metadata |
| TSV | .tsv | Tab-separated structured data |
| JSON | .json | Complex data with nested fields |
| JSON Lines | .jsonl | Streaming JSON (one object per line) |
| Parquet | .parquet | Columnar format for large datasets |

Plain Text (.txt)

Simple UTF-8 text file with Myanmar content:
မြန်မာနိုင်ငံသည် အရှေ့တောင်အာရှတွင် တည်ရှိသည်။
ကျေးဇူးတင်ပါသည်။
နေကောင်းလား။

Requirements

  • Encoding: UTF-8 (required)
  • Line endings: LF or CRLF
  • Empty lines: Ignored
  • Comments: Not supported (lines starting with # are processed as normal text)

Best Practices

  1. One sentence per line (recommended)
  2. No HTML or markup
  3. Pre-normalize Unicode (NFC form)
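The NFC recommendation above can be applied with Python's standard library before feeding files to the pipeline. This is a generic sketch, not part of the package's API; `prepare_lines` is a hypothetical helper name:

```python
import unicodedata

def prepare_lines(raw_text: str) -> list[str]:
    """Normalize raw corpus text: NFC form, stripped lines, empty lines dropped."""
    lines = []
    for line in raw_text.splitlines():
        line = unicodedata.normalize("NFC", line).strip()
        if line:  # the pipeline ignores empty lines anyway
            lines.append(line)
    return lines

print(prepare_lines("မြန်မာ\n\n  နေကောင်းလား။  \n"))
```

Myanmar script has no precomposed characters, so NFC leaves already-clean text untouched; the normalization mainly guards against mixed-form input copied from other sources.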

CSV Format (.csv)

Structured data with optional metadata columns:
text,frequency,pos
မြန်မာ,1000,N
နိုင်ငံ,800,N
သည်,5000,PART
ကြောင့်,2000,P_SUBJ

Column Specifications

| Column | Required | Type | Description |
|---|---|---|---|
| text | Yes | string | Myanmar text |
| frequency | No | integer | Corpus frequency |
| pos | No | string | Part-of-speech tag |
| syllables | No | string | Pre-segmented syllables |
| source | No | string | Corpus source identifier |

Options

Configure column names via PipelineConfig:
config = PipelineConfig(
    text_col="text",     # Column name containing text
)
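As a sanity check before ingestion, the CSV layout above can be read with the standard csv module. This sketch is independent of the pipeline and only illustrates the column conventions:

```python
import csv
import io

sample = """text,frequency,pos
မြန်မာ,1000,N
နိုင်ငံ,800,N
"""

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    # Only "text" is required; metadata columns may be absent
    print(row["text"], int(row.get("frequency", 0)), row.get("pos", ""))
```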

JSON Format (.json)

Raw Array Format

The JSON file must be a raw array (top-level list), not wrapped in an object like {"entries": [...]}:
[
  {
    "text": "မြန်မာ",
    "frequency": 1000,
    "pos": "N",
    "syllables": ["မြန်", "မာ"]
  },
  {
    "text": "နိုင်ငံ",
    "frequency": 800,
    "pos": "N"
  }
]
Items can also be plain strings:
["မြန်မာနိုင်ငံ", "ကျေးဇူးတင်ပါသည်"]
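A quick way to confirm a file follows the raw-array convention before building is a small check like the one below. `check_raw_array` is a hypothetical helper, not part of the package:

```python
import json

def check_raw_array(json_text: str) -> int:
    """Return the entry count, raising ValueError for wrapped objects."""
    data = json.loads(json_text)
    if not isinstance(data, list):
        raise ValueError(
            "expected a top-level JSON array, got %s" % type(data).__name__
        )
    for item in data:
        if isinstance(item, str):
            continue  # plain-string items are allowed
        if isinstance(item, dict) and "text" in item:
            continue  # object items must carry a "text" key
        raise ValueError("unsupported item: %r" % (item,))
    return len(data)

print(check_raw_array('["မြန်မာနိုင်ငံ", {"text": "မြန်မာ", "frequency": 1000}]'))
```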

JSON Lines (.jsonl)

One JSON object per line:
{"text": "မြန်မာ", "frequency": 1000, "pos": "N"}
{"text": "နိုင်ငံ", "frequency": 800, "pos": "N"}
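JSON Lines files are easy to generate from any record source with only the standard library. A minimal sketch, reusing the field names from the example above:

```python
import json

records = [
    {"text": "မြန်မာ", "frequency": 1000, "pos": "N"},
    {"text": "နိုင်ငံ", "frequency": 800, "pos": "N"},
]

# ensure_ascii=False keeps Myanmar text readable instead of \uXXXX escapes
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
print(jsonl)
```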

Options

config = PipelineConfig(
    json_key="text",     # Key name containing text
)

Parquet Format (.parquet)

Apache Parquet is a columnar storage format, ideal for large datasets:
import pyarrow as pa
import pyarrow.parquet as pq

# Create a Parquet file
table = pa.table({
    "text": ["မြန်မာ", "နိုင်ငံ", "သည်"],
    "frequency": [1000, 800, 5000],
    "pos": ["N", "N", "PART"],
})
pq.write_table(table, "corpus.parquet")

Column Detection

The ingester automatically detects the text column:
  1. Primary: Looks for a column named text
  2. Fallback: Uses the first string column in the schema
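The two-step rule above can be mirrored in plain Python. This sketch operates on a simple {column: type} mapping and illustrates the logic only; it is not the ingester's actual implementation:

```python
def detect_text_column(schema: dict[str, str]) -> str:
    """Pick the text column: prefer a column literally named "text",
    otherwise fall back to the first string-typed column in the schema."""
    if "text" in schema:
        return "text"
    for name, dtype in schema.items():
        if dtype == "string":
            return name
    raise ValueError("no string column found in schema")

print(detect_text_column({"text": "string", "frequency": "int64"}))  # text
print(detect_text_column({"word": "string", "frequency": "int64"}))  # word
```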

Advantages

  • Compression: Efficient storage for large corpora
  • Columnar: Fast reads for specific columns
  • Type-safe: Schema enforcement
  • Interoperability: Works with pandas, Spark, DuckDB

Example with pandas

import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    "text": ["မြန်မာစာ", "ကောင်းပါတယ်"],
    "source": ["wiki", "news"],
})

# Save as Parquet
df.to_parquet("corpus.parquet", index=False)

Large Files

Automatic Sharding

The pipeline automatically shards large files for memory-efficient processing:
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    num_shards=50,       # More shards for larger files
    batch_size=50000,
)

pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["large_corpus.txt"],
    database_path="dict.db",
)

Manual Sharding

Split large corpora into shards:
# Split into 100MB chunks
split -b 100m corpus.txt corpus_part_

# Process all parts
myspellchecker build --input "corpus_part_*" --output dict.db

Validation

Check Format Before Building

# Validate corpus using build command with --validate flag
myspellchecker build --input corpus.txt --validate

# Output:
# Lines: 1,000,000
# Valid: 999,500 (99.95%)
# Invalid: 500 (0.05%)
# Encoding: UTF-8
# Format: Plain text
Note: There is no standalone validate subcommand. Use build --validate to validate input files.

Common Validation Errors

| Error | Cause | Solution |
|---|---|---|
| Invalid encoding | Non-UTF-8 bytes | Convert to UTF-8 |
| Invalid characters | Control chars | Clean input |
| Empty lines | Missing content | Remove empty lines |
| Zawgyi detected | Legacy encoding | Convert to Unicode |
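The first three checks in the table can be approximated with a small per-line classifier. This is an illustrative sketch, not the CLI's actual validator, and it omits Zawgyi detection, which requires a statistical model:

```python
def classify_line(raw: bytes) -> str:
    """Classify one corpus line as "valid" or one of the error categories above."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid encoding"
    if not text.strip():
        return "empty line"
    # C0/C1 control characters (tab excluded) count as invalid characters
    if any((ord(c) < 32 and c != "\t") or 127 <= ord(c) < 160 for c in text):
        return "invalid characters"
    return "valid"

print(classify_line("မြန်မာ".encode("utf-8")))  # valid
print(classify_line(b"\xff\xfe"))               # invalid encoding
```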

Encoding Conversion

Zawgyi to Unicode

For standalone Zawgyi-to-Unicode conversion:
from myspellchecker.text.normalize import convert_zawgyi_to_unicode

# Read Zawgyi file
with open("zawgyi.txt", encoding="utf-8") as f:
    text = f.read()

# Convert
unicode_text = convert_zawgyi_to_unicode(text)

# Save as Unicode
with open("unicode.txt", "w", encoding="utf-8") as f:
    f.write(unicode_text)
Note: The data pipeline internally uses normalize_with_zawgyi_conversion(), which combines Zawgyi conversion with full text normalization (Unicode NFC, zero-width character removal, etc.). You do not need to pre-convert Zawgyi files before feeding them to the pipeline.

See Also