Overview
The dictionary database contains tables for syllables, words, N-grams, and metadata.
Core Tables
syllables
Stores valid Myanmar syllables:
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| syllable | TEXT | Syllable text (unique) |
| frequency | INTEGER | Corpus frequency count |
words
Stores dictionary words:
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| word | TEXT | Word text (unique) |
| syllable_count | INTEGER | Number of syllables |
| frequency | INTEGER | Corpus frequency |
| pos_tag | TEXT | POS tag from corpus |
| is_curated | INTEGER | Whether word is curated (0/1) |
| inferred_pos | TEXT | POS tag from inference |
| inferred_confidence | REAL | Confidence of inferred POS |
| inferred_source | TEXT | Source of inference |
Words supplied via --curated-input are inserted directly with is_curated=1 before corpus processing. When corpus words are then loaded, frequency and is_curated are set as follows:
| Scenario | frequency | is_curated |
|---|---|---|
| Curated only | 0 | 1 |
| Curated + Corpus | corpus_freq | 1 |
| Corpus only | corpus_freq | 0 |
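
The merge rules in the table above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration rather than the pipeline's actual code: the helper names insert_curated and load_corpus_counts are hypothetical, the words are ASCII placeholders, and only a subset of the words columns is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE words (
           id INTEGER PRIMARY KEY,
           word TEXT UNIQUE,
           frequency INTEGER DEFAULT 0,
           is_curated INTEGER DEFAULT 0
       )"""
)

def insert_curated(conn, curated_words):
    # Hypothetical helper: curated words go in first, frequency 0, is_curated = 1.
    conn.executemany(
        "INSERT OR IGNORE INTO words (word, frequency, is_curated) VALUES (?, 0, 1)",
        [(w,) for w in curated_words],
    )

def load_corpus_counts(conn, corpus_counts):
    # Hypothetical helper: applies corpus frequencies without touching is_curated.
    for word, freq in corpus_counts.items():
        # Corpus-only words are created with is_curated = 0; existing rows are kept.
        conn.execute(
            "INSERT OR IGNORE INTO words (word, frequency, is_curated) VALUES (?, 0, 0)",
            (word,),
        )
        conn.execute("UPDATE words SET frequency = ? WHERE word = ?", (freq, word))

insert_curated(conn, ["alpha", "beta"])
load_corpus_counts(conn, {"beta": 12, "gamma": 7})
rows = conn.execute(
    "SELECT word, frequency, is_curated FROM words ORDER BY word"
).fetchall()
```

Here alpha is curated only, beta is curated and in the corpus, and gamma is corpus only, so rows covers all three scenarios.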
bigrams
Stores word bigram frequencies (using word IDs for efficiency):
| Column | Type | Description |
|---|---|---|
| word1_id | INTEGER | Foreign key to first word |
| word2_id | INTEGER | Foreign key to second word |
| probability | REAL | P(word2 \| word1) |
| count | INTEGER | Raw co-occurrence count |
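
As a rough sketch of how the probability column could relate to count, the snippet below fills probability with a plain maximum-likelihood estimate, count(word1, word2) divided by the total count for word1. Whether the real build uses this estimate or applies smoothing is not stated here, so treat the formula, the primary key, and the sample counts as assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bigrams (
        word1_id INTEGER,
        word2_id INTEGER,
        probability REAL,
        count INTEGER,
        PRIMARY KEY (word1_id, word2_id)  -- assumed constraint
    );
    INSERT INTO bigrams (word1_id, word2_id, count)
    VALUES (1, 2, 3), (1, 3, 1), (2, 3, 5);
    """
)

# Assumed estimate: P(word2 | word1) = count(word1, word2) / sum of counts for word1.
conn.execute(
    """
    UPDATE bigrams
    SET probability = CAST(count AS REAL) / (
        SELECT SUM(b.count) FROM bigrams AS b WHERE b.word1_id = bigrams.word1_id
    )
    """
)

probs = {
    (w1, w2): p
    for w1, w2, p in conn.execute(
        "SELECT word1_id, word2_id, probability FROM bigrams"
    )
}
```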
trigrams
Stores word trigram frequencies.
Higher-Order N-gram Tables
fourgrams
Stores 4-gram conditional probabilities for deeper context analysis.
fivegrams
Stores 5-gram conditional probabilities.
POS Probability Tables
pos_unigrams
Stores POS unigram probabilities.
pos_bigrams
Stores POS bigram probabilities.
pos_trigrams
Stores POS trigram probabilities.
File Tracking Table
processed_files
Tracks processed files for incremental builds:
| Column | Type | Description |
|---|---|---|
| path | TEXT | File path (unique) |
| mtime | REAL | File modification time |
| size | INTEGER | File size in bytes |
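
One plausible way to use this table for incremental builds is to skip a file when its recorded mtime and size still match what is on disk. The sketch below assumes that comparison rule; the helper names needs_processing and mark_processed are hypothetical and not part of the documented pipeline.

```python
import os
import sqlite3
import tempfile

def needs_processing(conn, path):
    """Return True if the file is new or its mtime/size differ from the record."""
    st = os.stat(path)
    row = conn.execute(
        "SELECT mtime, size FROM processed_files WHERE path = ?", (path,)
    ).fetchone()
    return row is None or row != (st.st_mtime, st.st_size)

def mark_processed(conn, path):
    """Record the file's current mtime and size, replacing any earlier record."""
    st = os.stat(path)
    conn.execute(
        "INSERT OR REPLACE INTO processed_files (path, mtime, size) VALUES (?, ?, ?)",
        (path, st.st_mtime, st.st_size),
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_files (path TEXT UNIQUE, mtime REAL, size INTEGER)"
)

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("sample corpus line\n")
    corpus_path = f.name

first_check = needs_processing(conn, corpus_path)   # never seen before
mark_processed(conn, corpus_path)
second_check = needs_processing(conn, corpus_path)  # unchanged since recording

with open(corpus_path, "a") as f:
    f.write("another line\n")
third_check = needs_processing(conn, corpus_path)   # size has changed
os.remove(corpus_path)
```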
Metadata Table
metadata
Stores key-value metadata about the database build.
Enrichment Tables
These tables are populated during the enrichment step (pass --no-enrich to skip it).
confusable_pairs
Stores phonetically or orthographically similar word pairs mined from the corpus:
| Column | Type | Description |
|---|---|---|
| word1 | TEXT | First word in the confusable pair |
| word2 | TEXT | Second word (the confusable variant) |
| confusion_type | TEXT | Type of confusion (aspiration, medial, tone, nasal) |
| context_overlap | REAL | Context overlap score between the two words |
| freq_ratio | REAL | Frequency ratio between the two words |
| suppress | INTEGER | Whether this pair is suppressed (0=active, 1=suppressed) |
| source | TEXT | Source of the pair (mined, curated) |
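
To show how the suppress flag might be used at query time, here is a small sketch that returns only active pairs for a word. The function name active_confusables, the ordering by context_overlap, and the placeholder rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE confusable_pairs (
           word1 TEXT, word2 TEXT, confusion_type TEXT,
           context_overlap REAL, freq_ratio REAL,
           suppress INTEGER DEFAULT 0, source TEXT
       )"""
)
# Placeholder rows, not real dictionary data.
conn.executemany(
    "INSERT INTO confusable_pairs VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("w_a", "w_b", "tone", 0.8, 1.5, 0, "mined"),
        ("w_a", "w_c", "medial", 0.6, 3.0, 1, "mined"),  # suppressed pair
        ("w_a", "w_d", "aspiration", 0.9, 0.7, 0, "curated"),
    ],
)

def active_confusables(conn, word):
    """Return (word2, confusion_type) for non-suppressed pairs, highest overlap first."""
    return conn.execute(
        """SELECT word2, confusion_type
           FROM confusable_pairs
           WHERE word1 = ? AND suppress = 0
           ORDER BY context_overlap DESC""",
        (word,),
    ).fetchall()

result = active_confusables(conn, "w_a")
```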
compound_confusions
Stores compound words that may be incorrectly split during segmentation.
collocations
Stores word collocations with PMI (Pointwise Mutual Information) scores.
register_tags
Stores formal/informal register classification for words:
| Column | Type | Description |
|---|---|---|
| word | TEXT | The word |
| register | TEXT | Register classification (formal, informal, neutral) |
| confidence | REAL | Classification confidence score |
| formal_count | INTEGER | Count of formal context occurrences |
| informal_count | INTEGER | Count of informal context occurrences |
Query Examples
Lookup Syllable
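
A minimal sketch of a syllable lookup against the syllables table, using Python's sqlite3 module. The function name lookup_syllable and the sample row (including its frequency) are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE syllables ("
    "id INTEGER PRIMARY KEY, syllable TEXT UNIQUE, frequency INTEGER)"
)
# Sample row with a made-up frequency.
conn.execute(
    "INSERT INTO syllables (syllable, frequency) VALUES (?, ?)", ("မြန်", 120)
)

def lookup_syllable(conn, syllable):
    """Return the stored frequency of a syllable, or None if it is not present."""
    row = conn.execute(
        "SELECT frequency FROM syllables WHERE syllable = ?", (syllable,)
    ).fetchone()
    return row[0] if row else None
```

A None result can be read as "not a known syllable", which lets callers tell that case apart from a valid syllable with a low frequency.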
Get Word with POS
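
The following sketch fetches a word together with a POS tag. It assumes a policy of preferring the corpus tag (pos_tag) and falling back to inferred_pos when the corpus tag is missing; that policy, the function name get_word_with_pos, and the sample tags are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access to columns by name
conn.execute(
    """CREATE TABLE words (
           id INTEGER PRIMARY KEY, word TEXT UNIQUE, syllable_count INTEGER,
           frequency INTEGER, pos_tag TEXT, is_curated INTEGER,
           inferred_pos TEXT, inferred_confidence REAL, inferred_source TEXT
       )"""
)
# Placeholder rows; the tags and source label are made up.
conn.executemany(
    "INSERT INTO words (word, syllable_count, frequency, pos_tag, is_curated,"
    " inferred_pos, inferred_confidence, inferred_source)"
    " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("word_a", 2, 50, "N", 1, None, None, None),
        ("word_b", 1, 8, None, 0, "V", 0.7, "example_rule"),
    ],
)

def get_word_with_pos(conn, word):
    """Return a row with word, frequency, and pos, or None if the word is unknown."""
    # COALESCE picks pos_tag when present, otherwise inferred_pos (assumed policy).
    return conn.execute(
        """SELECT word, frequency,
                  COALESCE(pos_tag, inferred_pos) AS pos,
                  inferred_confidence
           FROM words
           WHERE word = ?""",
        (word,),
    ).fetchone()
```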
Get Bigram Probability
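
Because bigrams stores word IDs rather than text, a lookup by word text needs two joins against words. A sketch with a hypothetical function name and placeholder data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE bigrams (
        word1_id INTEGER, word2_id INTEGER, probability REAL, count INTEGER
    );
    INSERT INTO words (id, word) VALUES (1, 'word_a'), (2, 'word_b');
    INSERT INTO bigrams VALUES (1, 2, 0.25, 5);
    """
)

def bigram_probability(conn, word1, word2):
    """Return P(word2 | word1), or None if the bigram is not stored."""
    row = conn.execute(
        """SELECT b.probability
           FROM bigrams AS b
           JOIN words AS w1 ON w1.id = b.word1_id
           JOIN words AS w2 ON w2.id = b.word2_id
           WHERE w1.word = ? AND w2.word = ?""",
        (word1, word2),
    ).fetchone()
    return row[0] if row else None
```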
Get Top Continuations
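
The most likely next words after a given word can be read from bigrams by ordering on probability. The function name top_continuations, the default limit, and the sample rows below are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE bigrams (
        word1_id INTEGER, word2_id INTEGER, probability REAL, count INTEGER
    );
    INSERT INTO words (id, word)
    VALUES (1, 'word_a'), (2, 'word_b'), (3, 'word_c'), (4, 'word_d');
    INSERT INTO bigrams VALUES (1, 2, 0.5, 10), (1, 3, 0.3, 6), (1, 4, 0.2, 4);
    """
)

def top_continuations(conn, word, limit=5):
    """Return up to `limit` (next_word, probability) pairs, most probable first."""
    return conn.execute(
        """SELECT w2.word, b.probability
           FROM bigrams AS b
           JOIN words AS w1 ON w1.id = b.word1_id
           JOIN words AS w2 ON w2.id = b.word2_id
           WHERE w1.word = ?
           ORDER BY b.probability DESC
           LIMIT ?""",
        (word, limit),
    ).fetchall()

top_two = top_continuations(conn, "word_a", limit=2)
```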
Get POS Transition Probability
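
The column layout of pos_bigrams is not shown in this document, so the sketch below assumes columns named pos1, pos2, and probability. Adjust the names to the actual schema before using it; the tags and values are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed layout; the real pos_bigrams columns may differ.
    CREATE TABLE pos_bigrams (pos1 TEXT, pos2 TEXT, probability REAL);
    INSERT INTO pos_bigrams VALUES ('N', 'V', 0.4), ('N', 'N', 0.3);
    """
)

def pos_transition(conn, prev_pos, next_pos, default=0.0):
    """Return P(next_pos | prev_pos), or `default` if the transition is not stored."""
    row = conn.execute(
        "SELECT probability FROM pos_bigrams WHERE pos1 = ? AND pos2 = ?",
        (prev_pos, next_pos),
    ).fetchone()
    return row[0] if row else default
```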
Database Optimization
Indexes
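
As one illustration of an index that helps the continuation lookups, the sketch below indexes bigrams on its word ID columns and checks the query plan. The index name idx_bigrams_word1 is hypothetical, and the set of indexes the real build creates is not listed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bigrams (
        word1_id INTEGER, word2_id INTEGER, probability REAL, count INTEGER
    );
    -- Hypothetical index name; supports lookups of all bigrams starting with a word.
    CREATE INDEX IF NOT EXISTS idx_bigrams_word1 ON bigrams (word1_id, word2_id);
    """
)

# EXPLAIN QUERY PLAN reports which index, if any, SQLite chooses for a query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT probability FROM bigrams WHERE word1_id = ?", (1,)
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
```

Checking plan_text for the index name is a quick way to confirm that a lookup is not falling back to a full table scan.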
Critical indexes for performance.
VACUUM
Compact database after building.
Page Size
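
The sketch below combines the two steps on a fresh database file: PRAGMA page_size is set before the first table is created, and VACUUM runs once loading is finished. The page size of 8192 and the function name build_compact_db are illustrative assumptions, not values taken from the actual build.

```python
import os
import sqlite3
import tempfile

def build_compact_db(path, page_size=8192):
    conn = sqlite3.connect(path)
    # On a new file, page_size must be set before the first table is created.
    conn.execute(f"PRAGMA page_size = {int(page_size)}")
    conn.execute(
        "CREATE TABLE syllables ("
        "id INTEGER PRIMARY KEY, syllable TEXT UNIQUE, frequency INTEGER)"
    )
    conn.executemany(
        "INSERT INTO syllables (syllable, frequency) VALUES (?, ?)",
        [(f"s{i}", i) for i in range(1000)],
    )
    conn.execute("DELETE FROM syllables WHERE id % 2 = 0")
    conn.commit()            # VACUUM cannot run inside an open transaction
    conn.execute("VACUUM")   # rebuilds the file and releases free pages
    return conn

tmp_dir = tempfile.mkdtemp()
db_path = os.path.join(tmp_dir, "dictionary.db")
conn = build_compact_db(db_path)
effective_page_size = conn.execute("PRAGMA page_size").fetchone()[0]
free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
conn.close()
os.remove(db_path)
os.rmdir(tmp_dir)
```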
Optimize for read performance.
Schema Migration
Version Tracking
Migration Example
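
A hedged sketch of version-tracked migration follows. It assumes the metadata table has key and value columns and stores the version under a schema_version key; both are assumptions, since the metadata layout is not shown here. The migration step itself (adding inferred_pos to an older words table) is a made-up example.

```python
import sqlite3

def get_version(conn):
    """Read the schema version from metadata; treat a missing entry as version 1."""
    row = conn.execute(
        "SELECT value FROM metadata WHERE key = 'schema_version'"
    ).fetchone()
    return int(row[0]) if row else 1

def migrate(conn):
    """Apply pending migrations and record the resulting version."""
    version = get_version(conn)
    if version < 2:
        # Hypothetical step: add inferred_pos only if it is not already there.
        columns = {r[1] for r in conn.execute("PRAGMA table_info(words)")}
        if "inferred_pos" not in columns:
            conn.execute("ALTER TABLE words ADD COLUMN inferred_pos TEXT")
        version = 2
    conn.execute(
        "INSERT OR REPLACE INTO metadata (key, value) VALUES ('schema_version', ?)",
        (str(version),),
    )
    conn.commit()
    return version

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE metadata (key TEXT PRIMARY KEY, value TEXT);
    INSERT INTO metadata VALUES ('schema_version', '1');
    """
)
new_version = migrate(conn)
word_columns = {r[1] for r in conn.execute("PRAGMA table_info(words)")}
```

Checking for the column before altering the table keeps the migration safe to run more than once.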
See Also
- Data Pipeline Index - Pipeline overview
- Corpus Format - Input formats
- API Reference - Provider API