Skip to main content
The dictionary database is a single SQLite file containing tables for syllables, words, N-gram probabilities, POS statistics, and file-tracking metadata. This page documents every table, column, index, and common query pattern.

Overview

The dictionary database contains tables for syllables, words, N-grams, and metadata.

Core Tables

syllables

Stores valid Myanmar syllables:
CREATE TABLE syllables (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    syllable TEXT UNIQUE NOT NULL,
    frequency INTEGER DEFAULT 0
);

CREATE INDEX idx_syllables_text ON syllables(syllable);
ColumnTypeDescription
idINTEGERPrimary key
syllableTEXTSyllable text (unique)
frequencyINTEGERCorpus frequency count

words

Stores dictionary words:
CREATE TABLE words (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word TEXT UNIQUE NOT NULL,
    syllable_count INTEGER,
    frequency INTEGER DEFAULT 0,
    pos_tag TEXT,
    is_curated INTEGER DEFAULT 0,
    inferred_pos TEXT,
    inferred_confidence REAL,
    inferred_source TEXT
);

CREATE INDEX idx_words_text ON words(word);
ColumnTypeDescription
idINTEGERPrimary key
wordTEXTWord text (unique)
syllable_countINTEGERNumber of syllables
frequencyINTEGERCorpus frequency
pos_tagTEXTPOS tag from corpus
is_curatedINTEGERWhether word is curated (0/1)
inferred_posTEXTPOS tag from inference
inferred_confidenceREALConfidence of inferred POS
inferred_sourceTEXTSource of inference
Curated Words: Words from --curated-input are inserted directly with is_curated=1 before corpus processing. When corpus words are loaded:
-- ON CONFLICT preserves is_curated=1 for curated words
INSERT INTO words (word, syllable_count, frequency, pos_tag, is_curated)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(word) DO UPDATE SET
    frequency=excluded.frequency,
    syllable_count=excluded.syllable_count,
    pos_tag=COALESCE(excluded.pos_tag, words.pos_tag),
    is_curated=MAX(words.is_curated, excluded.is_curated)
Scenariofrequencyis_curated
Curated only01
Curated + Corpuscorpus_freq1
Corpus onlycorpus_freq0

bigrams

Stores word bigram frequencies (using word IDs for efficiency):
CREATE TABLE bigrams (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word1_id INTEGER,
    word2_id INTEGER,
    probability REAL DEFAULT 0.0,
    count INTEGER DEFAULT 0,
    FOREIGN KEY(word1_id) REFERENCES words(id),
    FOREIGN KEY(word2_id) REFERENCES words(id),
    UNIQUE(word1_id, word2_id)
);

CREATE INDEX idx_bigrams_w1_w2 ON bigrams(word1_id, word2_id);
ColumnTypeDescription
word1_idINTEGERForeign key to first word
word2_idINTEGERForeign key to second word
probabilityREALP(word2 | word1)
countINTEGERRaw co-occurrence count

trigrams

Stores word trigram frequencies:
CREATE TABLE trigrams (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word1_id INTEGER,
    word2_id INTEGER,
    word3_id INTEGER,
    probability REAL DEFAULT 0.0,
    count INTEGER DEFAULT 0,
    FOREIGN KEY(word1_id) REFERENCES words(id),
    FOREIGN KEY(word2_id) REFERENCES words(id),
    FOREIGN KEY(word3_id) REFERENCES words(id),
    UNIQUE(word1_id, word2_id, word3_id)
);

CREATE INDEX idx_trigrams_w1_w2_w3 ON trigrams(word1_id, word2_id, word3_id);

Higher-Order N-gram Tables

fourgrams

Stores 4-gram conditional probabilities for deeper context analysis:
CREATE TABLE fourgrams (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word1_id INTEGER,
    word2_id INTEGER,
    word3_id INTEGER,
    word4_id INTEGER,
    probability REAL DEFAULT 0.0,
    count INTEGER DEFAULT 0,
    FOREIGN KEY(word1_id) REFERENCES words(id),
    FOREIGN KEY(word2_id) REFERENCES words(id),
    FOREIGN KEY(word3_id) REFERENCES words(id),
    FOREIGN KEY(word4_id) REFERENCES words(id),
    UNIQUE(word1_id, word2_id, word3_id, word4_id)
);

CREATE INDEX idx_fourgrams_w1_w2_w3_w4 ON fourgrams(word1_id, word2_id, word3_id, word4_id);

fivegrams

Stores 5-gram conditional probabilities:
CREATE TABLE fivegrams (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word1_id INTEGER,
    word2_id INTEGER,
    word3_id INTEGER,
    word4_id INTEGER,
    word5_id INTEGER,
    probability REAL DEFAULT 0.0,
    count INTEGER DEFAULT 0,
    FOREIGN KEY(word1_id) REFERENCES words(id),
    FOREIGN KEY(word2_id) REFERENCES words(id),
    FOREIGN KEY(word3_id) REFERENCES words(id),
    FOREIGN KEY(word4_id) REFERENCES words(id),
    FOREIGN KEY(word5_id) REFERENCES words(id),
    UNIQUE(word1_id, word2_id, word3_id, word4_id, word5_id)
);

CREATE INDEX idx_fivegrams_w1_w2_w3_w4_w5 ON fivegrams(word1_id, word2_id, word3_id, word4_id, word5_id);

POS Probability Tables

pos_unigrams

Stores POS unigram probabilities:
CREATE TABLE pos_unigrams (
    pos TEXT UNIQUE NOT NULL,
    probability REAL DEFAULT 0.0
);

pos_bigrams

Stores POS bigram probabilities:
CREATE TABLE pos_bigrams (
    pos1 TEXT NOT NULL,
    pos2 TEXT NOT NULL,
    probability REAL DEFAULT 0.0,
    UNIQUE(pos1, pos2)
);

pos_trigrams

Stores POS trigram probabilities:
CREATE TABLE pos_trigrams (
    pos1 TEXT NOT NULL,
    pos2 TEXT NOT NULL,
    pos3 TEXT NOT NULL,
    probability REAL DEFAULT 0.0,
    UNIQUE(pos1, pos2, pos3)
);

File Tracking Table

processed_files

Tracks processed files for incremental builds:
CREATE TABLE processed_files (
    path TEXT PRIMARY KEY,
    mtime REAL,
    size INTEGER
);
ColumnTypeDescription
pathTEXTFile path (unique)
mtimeREALFile modification time
sizeINTEGERFile size in bytes

Metadata Table

metadata

Stores key-value metadata about the database build:
CREATE TABLE metadata (
    key TEXT PRIMARY KEY,
    value TEXT NOT NULL
);

Enrichment Tables

These tables are populated during the enrichment step (--no-enrich to skip).

confusable_pairs

Stores phonetically or orthographically similar word pairs mined from the corpus:
CREATE TABLE confusable_pairs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word1 TEXT NOT NULL,
    word2 TEXT NOT NULL,
    confusion_type TEXT NOT NULL,
    context_overlap REAL DEFAULT 0.0,
    freq_ratio REAL,
    suppress INTEGER DEFAULT 0,
    source TEXT DEFAULT 'mined',
    UNIQUE(word1, word2, confusion_type)
);

CREATE INDEX idx_confusable_word1 ON confusable_pairs(word1);
CREATE INDEX idx_confusable_word2 ON confusable_pairs(word2);
ColumnTypeDescription
word1TEXTFirst word in the confusable pair
word2TEXTSecond word (the confusable variant)
confusion_typeTEXTType of confusion (aspiration, medial, tone, nasal)
context_overlapREALContext overlap score between the two words
freq_ratioREALFrequency ratio between the two words
suppressINTEGERWhether this pair is suppressed (0=active, 1=suppressed)
sourceTEXTSource of the pair (mined, curated)

compound_confusions

Stores compound words that may be incorrectly split during segmentation:
CREATE TABLE compound_confusions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    compound TEXT NOT NULL,
    part1 TEXT NOT NULL,
    part2 TEXT NOT NULL,
    compound_freq INTEGER DEFAULT 0,
    split_freq INTEGER DEFAULT 0,
    pmi REAL DEFAULT 0.0,
    UNIQUE(compound, part1, part2)
);

CREATE INDEX idx_compound_word ON compound_confusions(compound);
CREATE INDEX idx_compound_parts ON compound_confusions(part1, part2);

collocations

Stores word collocations with PMI (Pointwise Mutual Information) scores:
CREATE TABLE collocations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word1 TEXT NOT NULL,
    word2 TEXT NOT NULL,
    pmi REAL NOT NULL,
    npmi REAL,
    count INTEGER NOT NULL,
    UNIQUE(word1, word2)
);

CREATE INDEX idx_colloc_word1 ON collocations(word1);
CREATE INDEX idx_colloc_word2 ON collocations(word2);

register_tags

Stores formal/informal register classification for words:
CREATE TABLE register_tags (
    word TEXT PRIMARY KEY,
    register TEXT NOT NULL,
    confidence REAL DEFAULT 0.0,
    formal_count INTEGER DEFAULT 0,
    informal_count INTEGER DEFAULT 0
);
ColumnTypeDescription
wordTEXTThe word
registerTEXTRegister classification (formal, informal, neutral)
confidenceREALClassification confidence score
formal_countINTEGERCount of formal context occurrences
informal_countINTEGERCount of informal context occurrences

Query Examples

Lookup Syllable

cursor.execute("""
    SELECT frequency
    FROM syllables
    WHERE syllable = ?
""", ("မြန်",))

Get Word with POS

cursor.execute("""
    SELECT word, frequency, pos_tag, inferred_pos
    FROM words
    WHERE word = ?
""", ("မြန်မာ",))

Get Bigram Probability

cursor.execute("""
    SELECT b.probability
    FROM bigrams b
    JOIN words w1 ON b.word1_id = w1.id
    JOIN words w2 ON b.word2_id = w2.id
    WHERE w1.word = ? AND w2.word = ?
""", ("ထမင်း", "စား"))

Get Top Continuations

cursor.execute("""
    SELECT w2.word, b.probability
    FROM bigrams b
    JOIN words w1 ON b.word1_id = w1.id
    JOIN words w2 ON b.word2_id = w2.id
    WHERE w1.word = ?
    ORDER BY b.probability DESC
    LIMIT 10
""", ("ထမင်း",))

Get POS Transition Probability

cursor.execute("""
    SELECT probability
    FROM pos_bigrams
    WHERE pos1 = ? AND pos2 = ?
""", ("N", "V"))

Database Optimization

Indexes

Critical indexes for performance:
-- Text lookups
CREATE INDEX idx_syllables_text ON syllables(syllable);
CREATE INDEX idx_words_text ON words(word);

-- N-gram lookups (using word IDs)
CREATE INDEX idx_bigrams_w1_w2 ON bigrams(word1_id, word2_id);
CREATE INDEX idx_trigrams_w1_w2_w3 ON trigrams(word1_id, word2_id, word3_id);

VACUUM

Compact database after building:
VACUUM;
ANALYZE;

Page Size

Optimize for read performance:
PRAGMA page_size = 4096;
PRAGMA cache_size = 10000;

Schema Migration

Version Tracking

def get_schema_version(conn):
    cursor = conn.execute(
        "SELECT value FROM metadata WHERE key = 'version'"
    )
    row = cursor.fetchone()
    return row[0] if row else "1.0.0"

Migration Example

def migrate_v1_to_v2(conn):
    # Add new column
    conn.execute("ALTER TABLE words ADD COLUMN pos_tags TEXT")

    # Update metadata
    conn.execute(
        "UPDATE metadata SET value = '2.0.0' WHERE key = 'version'"
    )
    conn.commit()

See Also