The data pipeline accepts six input formats. All files must be UTF-8 encoded. The simplest option is plain text with one sentence per line; structured formats (CSV, JSON, Parquet) support additional metadata like frequency and POS tags.

Supported Formats

| Format | Extension | Use Case |
|---|---|---|
| Plain Text | .txt | Simple text corpora |
| CSV | .csv | Structured data with metadata |
| TSV | .tsv | Tab-separated structured data |
| JSON | .json | Complex data with nested fields |
| JSON Lines | .jsonl | Streaming JSON (one object per line) |
| Parquet | .parquet | Columnar format for large datasets |

Plain Text (.txt)

Simple UTF-8 text file with Myanmar content:
မြန်မာနိုင်ငံသည် အရှေ့တောင်အာရှတွင် တည်ရှိသည်။
ကျေးဇူးတင်ပါသည်။
နေကောင်းလား။

Requirements

  • Encoding: UTF-8 (required)
  • Line endings: LF or CRLF
  • Empty lines: Ignored
  • Comments: Not supported (lines starting with # are processed as normal text)

Best Practices

  1. One sentence per line (recommended)
  2. No HTML or markup
  3. Pre-normalize Unicode (NFC form)
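The NFC recommendation above can be applied with Python's standard library before feeding files to the pipeline. This is a generic sketch, not part of the package's API; `prepare_lines` is a hypothetical helper name:

```python
import unicodedata

def prepare_lines(raw_text: str) -> list[str]:
    """Normalize raw corpus text: NFC form, stripped lines, empty lines dropped."""
    lines = []
    for line in raw_text.splitlines():
        line = unicodedata.normalize("NFC", line).strip()
        if line:  # the pipeline ignores empty lines anyway
            lines.append(line)
    return lines

print(prepare_lines("မြန်မာ\n\n  နေကောင်းလား။  \n"))
```

Myanmar script has no precomposed characters, so NFC leaves already-clean text untouched; the normalization mainly guards against mixed-form input copied from other sources.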

CSV Format (.csv)

Structured data with optional metadata columns:
text,frequency,pos
မြန်မာ,1000,N
နိုင်ငံ,800,N
သည်,5000,PART
ကြောင့်,2000,P_SUBJ

Column Specifications

| Column | Required | Type | Description |
|---|---|---|---|
| text | Yes | string | Myanmar text |
| frequency | No | integer | Corpus frequency |
| pos | No | string | Part-of-speech tag |
| syllables | No | string | Pre-segmented syllables |
| source | No | string | Corpus source identifier |

Options

Configure column names via PipelineConfig:
config = PipelineConfig(
    text_col="text",     # Column name containing text
)
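As a sanity check before ingestion, the CSV layout above can be read with the standard csv module. This sketch is independent of the pipeline and only illustrates the column conventions:

```python
import csv
import io

sample = """text,frequency,pos
မြန်မာ,1000,N
နိုင်ငံ,800,N
"""

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    # Only "text" is required; metadata columns may be absent
    print(row["text"], int(row.get("frequency", 0)), row.get("pos", ""))
```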

JSON Format (.json)

Raw Array Format

The JSON file must be a raw array (top-level list), not wrapped in an object like {"entries": [...]}:
[
  {
    "text": "မြန်မာ",
    "frequency": 1000,
    "pos": "N",
    "syllables": ["မြန်", "မာ"]
  },
  {
    "text": "နိုင်ငံ",
    "frequency": 800,
    "pos": "N"
  }
]
Items can also be plain strings:
["မြန်မာနိုင်ငံ", "ကျေးဇူးတင်ပါသည်"]
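A quick way to confirm a file follows the raw-array convention before building is a small check like the one below. `check_raw_array` is a hypothetical helper, not part of the package:

```python
import json

def check_raw_array(json_text: str) -> int:
    """Return the entry count, raising ValueError for wrapped objects."""
    data = json.loads(json_text)
    if not isinstance(data, list):
        raise ValueError(
            "expected a top-level JSON array, got %s" % type(data).__name__
        )
    for item in data:
        if isinstance(item, str):
            continue  # plain-string items are allowed
        if isinstance(item, dict) and "text" in item:
            continue  # object items must carry a "text" key
        raise ValueError("unsupported item: %r" % (item,))
    return len(data)

print(check_raw_array('["မြန်မာနိုင်ငံ", {"text": "မြန်မာ", "frequency": 1000}]'))
```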

JSON Lines (.jsonl)

One JSON object per line:
{"text": "မြန်မာ", "frequency": 1000, "pos": "N"}
{"text": "နိုင်ငံ", "frequency": 800, "pos": "N"}
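JSON Lines files are easy to generate from any record source with only the standard library. A minimal sketch, reusing the field names from the example above:

```python
import json

records = [
    {"text": "မြန်မာ", "frequency": 1000, "pos": "N"},
    {"text": "နိုင်ငံ", "frequency": 800, "pos": "N"},
]

# ensure_ascii=False keeps Myanmar text readable instead of \uXXXX escapes
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
print(jsonl)
```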

Options

config = PipelineConfig(
    json_key="text",     # Key name containing text
)

Parquet Format (.parquet)

Apache Parquet is a columnar storage format, ideal for large datasets:
import pyarrow as pa
import pyarrow.parquet as pq

# Create a Parquet file
table = pa.table({
    "text": ["မြန်မာ", "နိုင်ငံ", "သည်"],
    "frequency": [1000, 800, 5000],
    "pos": ["N", "N", "PART"],
})
pq.write_table(table, "corpus.parquet")

Column Detection

The ingester automatically detects the text column:
  1. Primary: Looks for a column named text
  2. Fallback: Uses the first string column in the schema
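The two-step rule above can be mirrored in plain Python. This sketch operates on a simple {column: type} mapping and illustrates the logic only; it is not the ingester's actual implementation:

```python
def detect_text_column(schema: dict[str, str]) -> str:
    """Pick the text column: prefer a column literally named "text",
    otherwise fall back to the first string-typed column in the schema."""
    if "text" in schema:
        return "text"
    for name, dtype in schema.items():
        if dtype == "string":
            return name
    raise ValueError("no string column found in schema")

print(detect_text_column({"text": "string", "frequency": "int64"}))  # text
print(detect_text_column({"word": "string", "frequency": "int64"}))  # word
```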

Advantages

  • Compression: Efficient storage for large corpora
  • Columnar: Fast reads for specific columns
  • Type-safe: Schema enforcement
  • Interoperability: Works with pandas, Spark, DuckDB

Example with pandas

import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    "text": ["မြန်မာစာ", "ကောင်းပါတယ်"],
    "source": ["wiki", "news"],
})

# Save as Parquet
df.to_parquet("corpus.parquet", index=False)

Large Files

Automatic Sharding

The pipeline automatically shards large files for memory-efficient processing:
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    num_shards=50,       # More shards for larger files
    batch_size=50000,
)

pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["large_corpus.txt"],
    database_path="dict.db",
)

Manual Sharding

Split large corpora into shards:
# Split into 100MB chunks
split -b 100m corpus.txt corpus_part_

# Process all parts
myspellchecker build --input "corpus_part_*" --output dict.db

Validation

Check Format Before Building

# Validate corpus using build command with --validate flag
myspellchecker build --input corpus.txt --validate

# Output:
# Lines: 1,000,000
# Valid: 999,500 (99.95%)
# Invalid: 500 (0.05%)
# Encoding: UTF-8
# Format: Plain text
Note: There is no standalone validate subcommand. Use build --validate to validate input files.

Common Validation Errors

| Error | Cause | Solution |
|---|---|---|
| Invalid encoding | Non-UTF-8 bytes | Convert to UTF-8 |
| Invalid characters | Control chars | Clean input |
| Empty lines | Missing content | Remove empty lines |
| Zawgyi detected | Legacy encoding | Convert to Unicode |
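The first three checks in the table can be approximated with a small per-line classifier. This is an illustrative sketch, not the CLI's actual validator, and it omits Zawgyi detection, which requires a statistical model:

```python
def classify_line(raw: bytes) -> str:
    """Classify one corpus line as "valid" or one of the error categories above."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid encoding"
    if not text.strip():
        return "empty line"
    # C0/C1 control characters (tab excluded) count as invalid characters
    if any((ord(c) < 32 and c != "\t") or 127 <= ord(c) < 160 for c in text):
        return "invalid characters"
    return "valid"

print(classify_line("မြန်မာ".encode("utf-8")))  # valid
print(classify_line(b"\xff\xfe"))               # invalid encoding
```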

Encoding Conversion

Zawgyi to Unicode

For standalone Zawgyi-to-Unicode conversion:
from myspellchecker.text.normalize import convert_zawgyi_to_unicode

# Read Zawgyi file
with open("zawgyi.txt", encoding="utf-8") as f:
    text = f.read()

# Convert
unicode_text = convert_zawgyi_to_unicode(text)

# Save as Unicode
with open("unicode.txt", "w", encoding="utf-8") as f:
    f.write(unicode_text)
Note: The data pipeline internally uses normalize_with_zawgyi_conversion(), which combines Zawgyi conversion with full text normalization (Unicode NFC, zero-width character removal, etc.). You do not need to pre-convert Zawgyi files before feeding them to the pipeline.

See Also