Skip to main content
The dictionary database is a single SQLite file containing tables for syllables, words, N-gram probabilities, POS statistics, and file-tracking metadata. This page documents every table, column, index, and common query pattern.

Overview

The dictionary database contains tables for syllables, words, N-grams, and metadata.

Core Tables

syllables

Stores valid Myanmar syllables:
CREATE TABLE syllables (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    syllable TEXT UNIQUE NOT NULL,
    frequency INTEGER DEFAULT 0
);

CREATE INDEX idx_syllables_text ON syllables(syllable);
ColumnTypeDescription
idINTEGERPrimary key
syllableTEXTSyllable text (unique)
frequencyINTEGERCorpus frequency count

words

Stores dictionary words:
CREATE TABLE words (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word TEXT UNIQUE NOT NULL,
    syllable_count INTEGER,
    frequency INTEGER DEFAULT 0,
    pos_tag TEXT,
    is_curated INTEGER DEFAULT 0,
    inferred_pos TEXT,
    inferred_confidence REAL,
    inferred_source TEXT
);

CREATE INDEX idx_words_text ON words(word);
ColumnTypeDescription
idINTEGERPrimary key
wordTEXTWord text (unique)
syllable_countINTEGERNumber of syllables
frequencyINTEGERCorpus frequency
pos_tagTEXTPOS tag from corpus
is_curatedINTEGERWhether word is curated (0/1)
inferred_posTEXTPOS tag from inference
inferred_confidenceREALConfidence of inferred POS
inferred_sourceTEXTSource of inference
Curated Words: Words from --curated-input are inserted directly with is_curated=1 before corpus processing. When corpus words are loaded:
-- ON CONFLICT preserves is_curated=1 for curated words
INSERT INTO words (word, syllable_count, frequency, pos_tag, is_curated)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(word) DO UPDATE SET
    frequency=excluded.frequency,
    syllable_count=excluded.syllable_count,
    pos_tag=COALESCE(excluded.pos_tag, words.pos_tag),
    is_curated=MAX(words.is_curated, excluded.is_curated)
Scenariofrequencyis_curated
Curated only01
Curated + Corpuscorpus_freq1
Corpus onlycorpus_freq0

bigrams

Stores word bigram frequencies (using word IDs for efficiency):
CREATE TABLE bigrams (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word1_id INTEGER,
    word2_id INTEGER,
    probability REAL DEFAULT 0.0,
    count INTEGER DEFAULT 0,
    FOREIGN KEY(word1_id) REFERENCES words(id),
    FOREIGN KEY(word2_id) REFERENCES words(id),
    UNIQUE(word1_id, word2_id)
);

CREATE INDEX idx_bigrams_w1_w2 ON bigrams(word1_id, word2_id);
ColumnTypeDescription
word1_idINTEGERForeign key to first word
word2_idINTEGERForeign key to second word
probabilityREALP(word2 | word1)
countINTEGERRaw co-occurrence count

trigrams

Stores word trigram frequencies:
CREATE TABLE trigrams (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word1_id INTEGER,
    word2_id INTEGER,
    word3_id INTEGER,
    probability REAL DEFAULT 0.0,
    count INTEGER DEFAULT 0,
    FOREIGN KEY(word1_id) REFERENCES words(id),
    FOREIGN KEY(word2_id) REFERENCES words(id),
    FOREIGN KEY(word3_id) REFERENCES words(id),
    UNIQUE(word1_id, word2_id, word3_id)
);

CREATE INDEX idx_trigrams_w1_w2_w3 ON trigrams(word1_id, word2_id, word3_id);

POS Probability Tables

pos_unigrams

Stores POS unigram probabilities:
CREATE TABLE pos_unigrams (
    pos TEXT UNIQUE NOT NULL,
    probability REAL DEFAULT 0.0
);

pos_bigrams

Stores POS bigram probabilities:
CREATE TABLE pos_bigrams (
    pos1 TEXT NOT NULL,
    pos2 TEXT NOT NULL,
    probability REAL DEFAULT 0.0,
    UNIQUE(pos1, pos2)
);

pos_trigrams

Stores POS trigram probabilities:
CREATE TABLE pos_trigrams (
    pos1 TEXT NOT NULL,
    pos2 TEXT NOT NULL,
    pos3 TEXT NOT NULL,
    probability REAL DEFAULT 0.0,
    UNIQUE(pos1, pos2, pos3)
);

File Tracking Table

processed_files

Tracks processed files for incremental builds:
CREATE TABLE processed_files (
    path TEXT PRIMARY KEY,
    mtime REAL,
    size INTEGER
);
ColumnTypeDescription
pathTEXTFile path (unique)
mtimeREALFile modification time
sizeINTEGERFile size in bytes

Query Examples

Lookup Syllable

cursor.execute("""
    SELECT frequency
    FROM syllables
    WHERE syllable = ?
""", ("မြန်",))

Get Word with POS

cursor.execute("""
    SELECT word, frequency, pos_tag, inferred_pos
    FROM words
    WHERE word = ?
""", ("မြန်မာ",))

Get Bigram Probability

cursor.execute("""
    SELECT b.probability
    FROM bigrams b
    JOIN words w1 ON b.word1_id = w1.id
    JOIN words w2 ON b.word2_id = w2.id
    WHERE w1.word = ? AND w2.word = ?
""", ("ထမင်း", "စား"))

Get Top Continuations

cursor.execute("""
    SELECT w2.word, b.probability
    FROM bigrams b
    JOIN words w1 ON b.word1_id = w1.id
    JOIN words w2 ON b.word2_id = w2.id
    WHERE w1.word = ?
    ORDER BY b.probability DESC
    LIMIT 10
""", ("ထမင်း",))

Get POS Transition Probability

cursor.execute("""
    SELECT probability
    FROM pos_bigrams
    WHERE pos1 = ? AND pos2 = ?
""", ("N", "V"))

Database Optimization

Indexes

Critical indexes for performance:
-- Text lookups
CREATE INDEX idx_syllables_text ON syllables(syllable);
CREATE INDEX idx_words_text ON words(word);

-- N-gram lookups (using word IDs)
CREATE INDEX idx_bigrams_w1_w2 ON bigrams(word1_id, word2_id);
CREATE INDEX idx_trigrams_w1_w2_w3 ON trigrams(word1_id, word2_id, word3_id);

VACUUM

Compact database after building:
VACUUM;
ANALYZE;

Page Size

Optimize for read performance:
PRAGMA page_size = 4096;
PRAGMA cache_size = 10000;

Schema Migration

Version Tracking

def get_schema_version(conn):
    cursor = conn.execute(
        "SELECT value FROM schema_info WHERE key = 'version'"
    )
    row = cursor.fetchone()
    return row[0] if row else "1.0.0"

Migration Example

def migrate_v1_to_v2(conn):
    # Add new column
    conn.execute("ALTER TABLE words ADD COLUMN pos_tags TEXT")

    # Update metadata
    conn.execute(
        "UPDATE schema_info SET value = '2.0.0' WHERE key = 'version'"
    )
    conn.commit()

See Also