The data pipeline accepts six input formats. All files must be UTF-8 encoded. The simplest option is plain text with one sentence per line; structured formats (CSV, JSON, Parquet) support additional metadata like frequency and POS tags.
| Format | Extension | Use Case |
|---|---|---|
| Plain Text | .txt | Simple text corpora |
| CSV | .csv | Structured data with metadata |
| TSV | .tsv | Tab-separated structured data |
| JSON | .json | Complex data with nested fields |
| JSON Lines | .jsonl | Streaming JSON (one object per line) |
| Parquet | .parquet | Columnar format for large datasets |
Plain Text (.txt)
Simple UTF-8 text file with Myanmar content:
မြန်မာနိုင်ငံသည် အရှေ့တောင်အာရှတွင် တည်ရှိသည်။
ကျေးဇူးတင်ပါသည်။
နေကောင်းလား။
Requirements
- Encoding: UTF-8 (required)
- Line endings: LF or CRLF
- Empty lines: Ignored
- Comments: Not supported (lines starting with # are processed as normal text)
Best Practices
- One sentence per line (recommended)
- No HTML or markup
- Pre-normalize Unicode (NFC form)
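The requirements and best practices above can be applied in a small pre-processing step. The following is a minimal sketch using only the Python standard library; `clean_lines` is a hypothetical helper for illustration, not part of the myspellchecker API.

```python
import unicodedata


def clean_lines(raw_text: str) -> list[str]:
    """Return NFC-normalized, non-empty lines from raw corpus text."""
    cleaned = []
    for line in raw_text.splitlines():  # accepts both LF and CRLF endings
        line = unicodedata.normalize("NFC", line).strip()
        if line:  # drop empty lines; the pipeline would ignore them anyway
            cleaned.append(line)
    return cleaned


sample = "နေကောင်းလား။\r\n\r\nကျေးဇူးတင်ပါသည်။\n"
for line in clean_lines(sample):
    print(line)
```

Normalizing before ingestion is optional, but it keeps the corpus file itself consistent when it is shared with other tools.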
CSV (.csv)
Structured data with optional metadata columns:
text,frequency,pos
မြန်မာ,1000,N
နိုင်ငံ,800,N
သည်,5000,PART
ကြောင့်,2000,P_SUBJ
Column Specifications
| Column | Required | Type | Description |
|---|---|---|---|
| text | Yes | string | Myanmar text |
| frequency | No | integer | Corpus frequency |
| pos | No | string | Part-of-speech tag |
| syllables | No | string | Pre-segmented syllables |
| source | No | string | Corpus source identifier |
Options
Configure column names via PipelineConfig:
from myspellchecker.data_pipeline import PipelineConfig
config = PipelineConfig(
    text_col="text",  # Column name containing text
)
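To produce a CSV file with the columns described above, the standard `csv` module is enough. This is a sketch under the assumption that the header row uses the default column names (`text`, `frequency`, `pos`); `write_corpus_csv` is a hypothetical helper, not a library function.

```python
import csv

rows = [
    {"text": "မြန်မာ", "frequency": 1000, "pos": "N"},
    {"text": "နိုင်ငံ", "frequency": 800, "pos": "N"},
]


def write_corpus_csv(path, rows):
    """Write rows to a UTF-8 CSV file with a header line."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "frequency", "pos"])
        writer.writeheader()
        writer.writerows(rows)


write_corpus_csv("corpus.csv", rows)
```

Opening the file with an explicit `encoding="utf-8"` avoids platform-dependent defaults, which matters because the pipeline requires UTF-8 input.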
JSON (.json)
The JSON file must be a raw array (top-level list), not wrapped in an object such as {"entries": [...]}:
[
  {
    "text": "မြန်မာ",
    "frequency": 1000,
    "pos": "N",
    "syllables": ["မြန်", "မာ"]
  },
  {
    "text": "နိုင်ငံ",
    "frequency": 800,
    "pos": "N"
  }
]
Items can also be plain strings:
["မြန်မာနိုင်ငံ", "ကျေးဇူးတင်ပါသည်"]
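The shape rules above (top-level array, items that are either objects or plain strings) can be checked before running a build. The sketch below is an illustrative pre-check, not the pipeline's own validation; `check_json_corpus` is a hypothetical name, and the requirement that each object carries a string under the `text` key is an assumption based on the examples.

```python
import json


def check_json_corpus(raw: str, key: str = "text") -> int:
    """Return the number of entries, or raise ValueError on an unexpected shape."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("top-level JSON value must be an array")
    for i, item in enumerate(data):
        if isinstance(item, str):
            continue  # plain-string entry
        if isinstance(item, dict) and isinstance(item.get(key), str):
            continue  # object entry with a text field
        raise ValueError(f"entry {i} is neither a string nor an object with '{key}'")
    return len(data)


print(check_json_corpus('["မြန်မာနိုင်ငံ", {"text": "မြန်မာ", "frequency": 1000}]'))
```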
JSON Lines (.jsonl)
One JSON object per line:
{"text": "မြန်မာ", "frequency": 1000, "pos": "N"}
{"text": "နိုင်ငံ", "frequency": 800, "pos": "N"}
Options
from myspellchecker.data_pipeline import PipelineConfig
config = PipelineConfig(
    json_key="text",  # Key name containing text
)
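An existing JSON array can be rewritten as JSON Lines with the standard `json` module. This is a minimal sketch; `array_to_jsonl` is a hypothetical helper, and wrapping plain strings as `{"text": ...}` objects is an assumption made so that every output line is an object.

```python
import json


def array_to_jsonl(entries) -> str:
    """Serialize a list of entries as one JSON object per line."""
    lines = []
    for entry in entries:
        if isinstance(entry, str):
            entry = {"text": entry}  # assumption: wrap bare strings under "text"
        # ensure_ascii=False keeps Myanmar text readable instead of \u escapes
        lines.append(json.dumps(entry, ensure_ascii=False))
    return "\n".join(lines) + "\n"


print(array_to_jsonl(["မြန်မာ", {"text": "နိုင်ငံ", "frequency": 800}]), end="")
```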
Parquet (.parquet)
Apache Parquet is a columnar storage format, ideal for large datasets:
import pyarrow as pa
import pyarrow.parquet as pq

# Create a Parquet file
table = pa.table({
    "text": ["မြန်မာ", "နိုင်ငံ", "သည်"],
    "frequency": [1000, 800, 5000],
    "pos": ["N", "N", "PART"],
})
pq.write_table(table, "corpus.parquet")
Column Detection
The ingester automatically detects the text column:
- Primary: Looks for a column named text
- Fallback: Uses the first string column in the schema
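The two detection rules can be expressed as a short function. This sketch illustrates the described behavior only; it is not the ingester's actual code. It models the schema as an ordered mapping of column name to type name, and raising an error when no string column exists is an assumption.

```python
def pick_text_column(schema: dict[str, str]) -> str:
    """Pick the text column from a {column_name: type_name} mapping in schema order."""
    # Primary rule: a column literally named "text" wins
    if "text" in schema:
        return "text"
    # Fallback rule: the first column whose type is a string
    for name, type_name in schema.items():
        if type_name == "string":
            return name
    raise ValueError("no string column found")  # assumption: behavior not documented


print(pick_text_column({"id": "int64", "body": "string", "title": "string"}))
```

If a Parquet file has several string columns and none is named `text`, column order decides which one is read, so naming the column `text` explicitly is the safer choice.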
Advantages
- Compression: Efficient storage for large corpora
- Columnar: Fast reads for specific columns
- Type-safe: Schema enforcement
- Interoperability: Works with pandas, Spark, DuckDB
Example with pandas
import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    "text": ["မြန်မာစာ", "ကောင်းပါတယ်"],
    "source": ["wiki", "news"],
})

# Save as Parquet
df.to_parquet("corpus.parquet", index=False)
Large Files
Automatic Sharding
The pipeline automatically shards large files for memory-efficient processing:
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    num_shards=50,  # More shards for larger files
    batch_size=50000,
)
pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["large_corpus.txt"],
    database_path="dict.db",
)
Manual Sharding
Split large corpora into shards:
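# Split into chunks of at most 100MB without breaking lines
# (GNU split; -b would cut mid-line and could split a multi-byte UTF-8 character)
split -C 100m corpus.txt corpus_part_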
# Process all parts
myspellchecker build --input "corpus_part_*" --output dict.db
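When splitting by size, shard boundaries should fall on line breaks, because a cut in the middle of a multi-byte UTF-8 character leaves both neighboring files with invalid bytes. Where a suitable `split` is not available, the same idea can be sketched in Python; `shard_lines` is a hypothetical helper and the byte limit is illustrative.

```python
def shard_lines(lines, max_bytes):
    """Group lines into shards of at most max_bytes UTF-8 bytes each.

    A single line longer than max_bytes is kept whole in its own shard.
    """
    shards, current, size = [], [], 0
    for line in lines:
        n = len(line.encode("utf-8"))
        if current and size + n > max_bytes:
            shards.append(current)  # close the current shard at a line boundary
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        shards.append(current)
    return shards


lines = ["မြန်မာ\n", "နိုင်ငံ\n", "သည်\n"]
for i, shard in enumerate(shard_lines(lines, max_bytes=40)):
    print(i, len(shard))
```

Each shard can then be written to its own file with `encoding="utf-8"` and passed to the build command.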
Validation
# Validate corpus using build command with --validate flag
myspellchecker build --input corpus.txt --validate
# Output:
# Lines: 1,000,000
# Valid: 999,500 (99.95%)
# Invalid: 500 (0.05%)
# Encoding: UTF-8
# Format: Plain text
Note: There is no standalone validate subcommand. Use build --validate to validate input files.
Common Validation Errors
| Error | Cause | Solution |
|---|---|---|
| Invalid encoding | Non-UTF-8 bytes | Convert to UTF-8 |
| Invalid characters | Control chars | Clean input |
| Empty lines | Missing content | Remove empty lines |
| Zawgyi detected | Legacy encoding | Convert to Unicode |
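Several of these problems can be spotted before a build with a quick pre-check. The sketch below uses only the standard library and covers encoding, control characters, and empty lines; it is not the validator behind `--validate`, whose exact rules are not described here, and it does not attempt Zawgyi detection. The function name and report keys are hypothetical.

```python
import unicodedata


def precheck(data: bytes) -> dict:
    """Report basic problems in raw corpus bytes."""
    report = {"invalid_encoding": False, "empty_lines": 0, "control_char_lines": 0}
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        report["invalid_encoding"] = True
        return report
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a trailing newline is not an empty line
    for line in lines:
        line = line.rstrip("\r")  # tolerate CRLF endings
        if not line.strip():
            report["empty_lines"] += 1
        elif any(unicodedata.category(ch) == "Cc" and ch != "\t" for ch in line):
            report["control_char_lines"] += 1
    return report


print(precheck("မြန်မာ\n\nနိုင်ငံ\n".encode("utf-8")))
```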
Encoding Conversion
Zawgyi to Unicode
For standalone Zawgyi-to-Unicode conversion:
from myspellchecker.text.normalize import convert_zawgyi_to_unicode

# Read Zawgyi file
with open("zawgyi.txt", encoding="utf-8") as f:
    text = f.read()

# Convert
unicode_text = convert_zawgyi_to_unicode(text)

# Save as Unicode
with open("unicode.txt", "w", encoding="utf-8") as f:
    f.write(unicode_text)
Note: The data pipeline internally uses normalize_with_zawgyi_conversion(),
which combines Zawgyi conversion with full text normalization (Unicode NFC,
zero-width character removal, etc.). You do not need to pre-convert Zawgyi
files before feeding them to the pipeline.