Database Schema

The dictionary database is a single SQLite file containing tables for syllables, words, N-gram probabilities, POS statistics, and file-tracking metadata. This page documents every table, column, index, and common query pattern.

Overview

The dictionary database contains tables for syllables, words, N-grams, and metadata.

Core Tables

syllables

Stores valid Myanmar syllables:

CREATE TABLE syllables (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    syllable TEXT UNIQUE NOT NULL,
    frequency INTEGER DEFAULT 0
);

CREATE INDEX idx_syllables_text ON syllables(syllable);

Column	Type	Description
`id`	INTEGER	Primary key
`syllable`	TEXT	Syllable text (unique)
`frequency`	INTEGER	Corpus frequency count

words

Stores dictionary words:

CREATE TABLE words (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word TEXT UNIQUE NOT NULL,
    syllable_count INTEGER,
    frequency INTEGER DEFAULT 0,
    pos_tag TEXT,
    is_curated INTEGER DEFAULT 0,
    inferred_pos TEXT,
    inferred_confidence REAL,
    inferred_source TEXT
);

CREATE INDEX idx_words_text ON words(word);

Column	Type	Description
`id`	INTEGER	Primary key
`word`	TEXT	Word text (unique)
`syllable_count`	INTEGER	Number of syllables
`frequency`	INTEGER	Corpus frequency
`pos_tag`	TEXT	POS tag from corpus
`is_curated`	INTEGER	Whether word is curated (0/1)
`inferred_pos`	TEXT	POS tag from inference
`inferred_confidence`	REAL	Confidence of inferred POS
`inferred_source`	TEXT	Source of inference

Curated Words: Words from --curated-input are inserted directly with is_curated=1 before corpus processing. When corpus words are loaded:

-- ON CONFLICT preserves is_curated=1 for curated words
INSERT INTO words (word, syllable_count, frequency, pos_tag, is_curated)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(word) DO UPDATE SET
    frequency=excluded.frequency,
    syllable_count=excluded.syllable_count,
    pos_tag=COALESCE(excluded.pos_tag, words.pos_tag),
    is_curated=MAX(words.is_curated, excluded.is_curated)

Scenario	frequency	is_curated
Curated only	0	1
Curated + Corpus	corpus_freq	1
Corpus only	corpus_freq	0

bigrams

Stores word bigram frequencies (using word IDs for efficiency):

CREATE TABLE bigrams (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word1_id INTEGER,
    word2_id INTEGER,
    probability REAL DEFAULT 0.0,
    count INTEGER DEFAULT 0,
    FOREIGN KEY(word1_id) REFERENCES words(id),
    FOREIGN KEY(word2_id) REFERENCES words(id),
    UNIQUE(word1_id, word2_id)
);

CREATE INDEX idx_bigrams_w1_w2 ON bigrams(word1_id, word2_id);

Column	Type	Description
`word1_id`	INTEGER	Foreign key to first word
`word2_id`	INTEGER	Foreign key to second word
`probability`	REAL	P(word2 \| word1)
`count`	INTEGER	Raw co-occurrence count

trigrams

Stores word trigram frequencies:

CREATE TABLE trigrams (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word1_id INTEGER,
    word2_id INTEGER,
    word3_id INTEGER,
    probability REAL DEFAULT 0.0,
    count INTEGER DEFAULT 0,
    FOREIGN KEY(word1_id) REFERENCES words(id),
    FOREIGN KEY(word2_id) REFERENCES words(id),
    FOREIGN KEY(word3_id) REFERENCES words(id),
    UNIQUE(word1_id, word2_id, word3_id)
);

CREATE INDEX idx_trigrams_w1_w2_w3 ON trigrams(word1_id, word2_id, word3_id);

POS Probability Tables

pos_unigrams

Stores POS unigram probabilities:

CREATE TABLE pos_unigrams (
    pos TEXT UNIQUE NOT NULL,
    probability REAL DEFAULT 0.0
);

pos_bigrams

Stores POS bigram probabilities:

CREATE TABLE pos_bigrams (
    pos1 TEXT NOT NULL,
    pos2 TEXT NOT NULL,
    probability REAL DEFAULT 0.0,
    UNIQUE(pos1, pos2)
);

pos_trigrams

Stores POS trigram probabilities:

CREATE TABLE pos_trigrams (
    pos1 TEXT NOT NULL,
    pos2 TEXT NOT NULL,
    pos3 TEXT NOT NULL,
    probability REAL DEFAULT 0.0,
    UNIQUE(pos1, pos2, pos3)
);

File Tracking Table

processed_files

Tracks processed files for incremental builds:

CREATE TABLE processed_files (
    path TEXT PRIMARY KEY,
    mtime REAL,
    size INTEGER
);

Column	Type	Description
`path`	TEXT	File path (unique)
`mtime`	REAL	File modification time
`size`	INTEGER	File size in bytes

Query Examples

Lookup Syllable

cursor.execute("""
    SELECT frequency
    FROM syllables
    WHERE syllable = ?
""", ("မြန်",))

Get Word with POS

cursor.execute("""
    SELECT word, frequency, pos_tag, inferred_pos
    FROM words
    WHERE word = ?
""", ("မြန်မာ",))

Get Bigram Probability

cursor.execute("""
    SELECT b.probability
    FROM bigrams b
    JOIN words w1 ON b.word1_id = w1.id
    JOIN words w2 ON b.word2_id = w2.id
    WHERE w1.word = ? AND w2.word = ?
""", ("ထမင်း", "စား"))

Get Top Continuations

cursor.execute("""
    SELECT w2.word, b.probability
    FROM bigrams b
    JOIN words w1 ON b.word1_id = w1.id
    JOIN words w2 ON b.word2_id = w2.id
    WHERE w1.word = ?
    ORDER BY b.probability DESC
    LIMIT 10
""", ("ထမင်း",))

Get POS Transition Probability

cursor.execute("""
    SELECT probability
    FROM pos_bigrams
    WHERE pos1 = ? AND pos2 = ?
""", ("N", "V"))

Database Optimization

Indexes

Critical indexes for performance:

-- Text lookups
CREATE INDEX idx_syllables_text ON syllables(syllable);
CREATE INDEX idx_words_text ON words(word);

-- N-gram lookups (using word IDs)
CREATE INDEX idx_bigrams_w1_w2 ON bigrams(word1_id, word2_id);
CREATE INDEX idx_trigrams_w1_w2_w3 ON trigrams(word1_id, word2_id, word3_id);

VACUUM

Compact database after building:

VACUUM;
ANALYZE;

Page Size

Optimize for read performance:

PRAGMA page_size = 4096;
PRAGMA cache_size = 10000;

Schema Migration

Version Tracking

def get_schema_version(conn):
    cursor = conn.execute(
        "SELECT value FROM schema_info WHERE key = 'version'"
    )
    row = cursor.fetchone()
    return row[0] if row else "1.0.0"

Migration Example

def migrate_v1_to_v2(conn):
    # Add new column
    conn.execute("ALTER TABLE words ADD COLUMN pos_tags TEXT")

    # Update metadata
    conn.execute(
        "UPDATE schema_info SET value = '2.0.0' WHERE key = 'version'"
    )
    conn.commit()

API Reference

Core Internals

Spelling Algorithms

Context & Grammar Algorithms

Segmentation & Tagging

Architecture

CLI Reference

Error & Rules Reference

Data Reference

Data Pipeline Internals

Overview

Core Tables

syllables

words

bigrams

trigrams

POS Probability Tables

pos_unigrams

pos_bigrams

pos_trigrams

File Tracking Table

processed_files

Query Examples

Lookup Syllable

Get Word with POS

Get Bigram Probability

Get Top Continuations

Get POS Transition Probability

Database Optimization

Indexes

VACUUM

Page Size

Schema Migration

Version Tracking

Migration Example

See Also

API Reference

Core Internals

Spelling Algorithms

Context & Grammar Algorithms

Segmentation & Tagging

Architecture

CLI Reference

Error & Rules Reference

Data Reference

Data Pipeline Internals

​Overview

​Core Tables

​syllables

​words

​bigrams

​trigrams

​POS Probability Tables

​pos_unigrams

​pos_bigrams

​pos_trigrams

​File Tracking Table

​processed_files

​Query Examples

​Lookup Syllable

​Get Word with POS

​Get Bigram Probability

​Get Top Continuations

​Get POS Transition Probability

​Database Optimization

​Indexes

​VACUUM

​Page Size

​Schema Migration

​Version Tracking

​Migration Example

​See Also

Overview

Core Tables

syllables

words

bigrams

trigrams

POS Probability Tables

pos_unigrams

pos_bigrams

pos_trigrams

File Tracking Table

processed_files

Query Examples

Lookup Syllable

Get Word with POS

Get Bigram Probability

Get Top Continuations

Get POS Transition Probability

Database Optimization

Indexes

VACUUM

Page Size

Schema Migration

Version Tracking

Migration Example

See Also