Documentation Index
Fetch the complete documentation index at: https://docs.myspellchecker.com/llms.txt
Use this file to discover all available pages before exploring further.
The dictionary database is a single SQLite file containing tables for syllables, words, N-gram probabilities, POS statistics, and file-tracking metadata. This page documents every table, column, index, and common query pattern.
Overview
The dictionary database contains tables for syllables, words, N-grams, and metadata.
Core Tables
syllables
Stores valid Myanmar syllables:
CREATE TABLE syllables (
id INTEGER PRIMARY KEY AUTOINCREMENT,
syllable TEXT UNIQUE NOT NULL,
frequency INTEGER DEFAULT 0
);
CREATE INDEX idx_syllables_text ON syllables(syllable);
| Column | Type | Description |
|---|
id | INTEGER | Primary key |
syllable | TEXT | Syllable text (unique) |
frequency | INTEGER | Corpus frequency count |
words
Stores dictionary words:
CREATE TABLE words (
id INTEGER PRIMARY KEY AUTOINCREMENT,
word TEXT UNIQUE NOT NULL,
syllable_count INTEGER,
frequency INTEGER DEFAULT 0,
pos_tag TEXT,
is_curated INTEGER DEFAULT 0,
inferred_pos TEXT,
inferred_confidence REAL,
inferred_source TEXT
);
CREATE INDEX idx_words_text ON words(word);
| Column | Type | Description |
|---|
id | INTEGER | Primary key |
word | TEXT | Word text (unique) |
syllable_count | INTEGER | Number of syllables |
frequency | INTEGER | Corpus frequency |
pos_tag | TEXT | POS tag from corpus |
is_curated | INTEGER | Whether word is curated (0/1) |
inferred_pos | TEXT | POS tag from inference |
inferred_confidence | REAL | Confidence of inferred POS |
inferred_source | TEXT | Source of inference |
Curated Words:
Words from --curated-input are inserted directly with is_curated=1 before corpus processing. When corpus words are loaded:
-- ON CONFLICT preserves is_curated=1 for curated words
INSERT INTO words (word, syllable_count, frequency, pos_tag, is_curated)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(word) DO UPDATE SET
frequency=excluded.frequency,
syllable_count=excluded.syllable_count,
pos_tag=COALESCE(excluded.pos_tag, words.pos_tag),
is_curated=MAX(words.is_curated, excluded.is_curated)
| Scenario | frequency | is_curated |
|---|
| Curated only | 0 | 1 |
| Curated + Corpus | corpus_freq | 1 |
| Corpus only | corpus_freq | 0 |
bigrams
Stores word bigram frequencies (using word IDs for efficiency):
CREATE TABLE bigrams (
id INTEGER PRIMARY KEY AUTOINCREMENT,
word1_id INTEGER,
word2_id INTEGER,
probability REAL DEFAULT 0.0,
count INTEGER DEFAULT 0,
FOREIGN KEY(word1_id) REFERENCES words(id),
FOREIGN KEY(word2_id) REFERENCES words(id),
UNIQUE(word1_id, word2_id)
);
CREATE INDEX idx_bigrams_w1_w2 ON bigrams(word1_id, word2_id);
| Column | Type | Description |
|---|
word1_id | INTEGER | Foreign key to first word |
word2_id | INTEGER | Foreign key to second word |
probability | REAL | P(word2 | word1) |
count | INTEGER | Raw co-occurrence count |
trigrams
Stores word trigram frequencies:
CREATE TABLE trigrams (
id INTEGER PRIMARY KEY AUTOINCREMENT,
word1_id INTEGER,
word2_id INTEGER,
word3_id INTEGER,
probability REAL DEFAULT 0.0,
count INTEGER DEFAULT 0,
FOREIGN KEY(word1_id) REFERENCES words(id),
FOREIGN KEY(word2_id) REFERENCES words(id),
FOREIGN KEY(word3_id) REFERENCES words(id),
UNIQUE(word1_id, word2_id, word3_id)
);
CREATE INDEX idx_trigrams_w1_w2_w3 ON trigrams(word1_id, word2_id, word3_id);
Higher-Order N-gram Tables
fourgrams
Stores 4-gram conditional probabilities for deeper context analysis:
CREATE TABLE fourgrams (
id INTEGER PRIMARY KEY AUTOINCREMENT,
word1_id INTEGER,
word2_id INTEGER,
word3_id INTEGER,
word4_id INTEGER,
probability REAL DEFAULT 0.0,
count INTEGER DEFAULT 0,
FOREIGN KEY(word1_id) REFERENCES words(id),
FOREIGN KEY(word2_id) REFERENCES words(id),
FOREIGN KEY(word3_id) REFERENCES words(id),
FOREIGN KEY(word4_id) REFERENCES words(id),
UNIQUE(word1_id, word2_id, word3_id, word4_id)
);
CREATE INDEX idx_fourgrams_w1_w2_w3_w4 ON fourgrams(word1_id, word2_id, word3_id, word4_id);
fivegrams
Stores 5-gram conditional probabilities:
CREATE TABLE fivegrams (
id INTEGER PRIMARY KEY AUTOINCREMENT,
word1_id INTEGER,
word2_id INTEGER,
word3_id INTEGER,
word4_id INTEGER,
word5_id INTEGER,
probability REAL DEFAULT 0.0,
count INTEGER DEFAULT 0,
FOREIGN KEY(word1_id) REFERENCES words(id),
FOREIGN KEY(word2_id) REFERENCES words(id),
FOREIGN KEY(word3_id) REFERENCES words(id),
FOREIGN KEY(word4_id) REFERENCES words(id),
FOREIGN KEY(word5_id) REFERENCES words(id),
UNIQUE(word1_id, word2_id, word3_id, word4_id, word5_id)
);
CREATE INDEX idx_fivegrams_w1_w2_w3_w4_w5 ON fivegrams(word1_id, word2_id, word3_id, word4_id, word5_id);
POS Probability Tables
pos_unigrams
Stores POS unigram probabilities:
CREATE TABLE pos_unigrams (
pos TEXT UNIQUE NOT NULL,
probability REAL DEFAULT 0.0
);
pos_bigrams
Stores POS bigram probabilities:
CREATE TABLE pos_bigrams (
pos1 TEXT NOT NULL,
pos2 TEXT NOT NULL,
probability REAL DEFAULT 0.0,
UNIQUE(pos1, pos2)
);
pos_trigrams
Stores POS trigram probabilities:
CREATE TABLE pos_trigrams (
pos1 TEXT NOT NULL,
pos2 TEXT NOT NULL,
pos3 TEXT NOT NULL,
probability REAL DEFAULT 0.0,
UNIQUE(pos1, pos2, pos3)
);
File Tracking Table
processed_files
Tracks processed files for incremental builds:
CREATE TABLE processed_files (
path TEXT PRIMARY KEY,
mtime REAL,
size INTEGER
);
| Column | Type | Description |
|---|
path | TEXT | File path (unique) |
mtime | REAL | File modification time |
size | INTEGER | File size in bytes |
Stores key-value metadata about the database build:
CREATE TABLE metadata (
key TEXT PRIMARY KEY,
value TEXT NOT NULL
);
Enrichment Tables
These tables are populated during the enrichment step (--no-enrich to skip).
confusable_pairs
Stores phonetically or orthographically similar word pairs mined from the corpus:
CREATE TABLE confusable_pairs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
word1 TEXT NOT NULL,
word2 TEXT NOT NULL,
confusion_type TEXT NOT NULL,
context_overlap REAL DEFAULT 0.0,
freq_ratio REAL,
suppress INTEGER DEFAULT 0,
source TEXT DEFAULT 'mined',
UNIQUE(word1, word2, confusion_type)
);
CREATE INDEX idx_confusable_word1 ON confusable_pairs(word1);
CREATE INDEX idx_confusable_word2 ON confusable_pairs(word2);
| Column | Type | Description |
|---|
word1 | TEXT | First word in the confusable pair |
word2 | TEXT | Second word (the confusable variant) |
confusion_type | TEXT | Type of confusion (aspiration, medial, tone, nasal) |
context_overlap | REAL | Context overlap score between the two words |
freq_ratio | REAL | Frequency ratio between the two words |
suppress | INTEGER | Whether this pair is suppressed (0=active, 1=suppressed) |
source | TEXT | Source of the pair (mined, curated) |
compound_confusions
Stores compound words that may be incorrectly split during segmentation:
CREATE TABLE compound_confusions (
id INTEGER PRIMARY KEY AUTOINCREMENT,
compound TEXT NOT NULL,
part1 TEXT NOT NULL,
part2 TEXT NOT NULL,
compound_freq INTEGER DEFAULT 0,
split_freq INTEGER DEFAULT 0,
pmi REAL DEFAULT 0.0,
UNIQUE(compound, part1, part2)
);
CREATE INDEX idx_compound_word ON compound_confusions(compound);
CREATE INDEX idx_compound_parts ON compound_confusions(part1, part2);
collocations
Stores word collocations with PMI (Pointwise Mutual Information) scores:
CREATE TABLE collocations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
word1 TEXT NOT NULL,
word2 TEXT NOT NULL,
pmi REAL NOT NULL,
npmi REAL,
count INTEGER NOT NULL,
UNIQUE(word1, word2)
);
CREATE INDEX idx_colloc_word1 ON collocations(word1);
CREATE INDEX idx_colloc_word2 ON collocations(word2);
Stores formal/informal register classification for words:
CREATE TABLE register_tags (
word TEXT PRIMARY KEY,
register TEXT NOT NULL,
confidence REAL DEFAULT 0.0,
formal_count INTEGER DEFAULT 0,
informal_count INTEGER DEFAULT 0
);
| Column | Type | Description |
|---|
word | TEXT | The word |
register | TEXT | Register classification (formal, informal, neutral) |
confidence | REAL | Classification confidence score |
formal_count | INTEGER | Count of formal context occurrences |
informal_count | INTEGER | Count of informal context occurrences |
Query Examples
Lookup Syllable
cursor.execute("""
SELECT frequency
FROM syllables
WHERE syllable = ?
""", ("မြန်",))
Get Word with POS
cursor.execute("""
SELECT word, frequency, pos_tag, inferred_pos
FROM words
WHERE word = ?
""", ("မြန်မာ",))
Get Bigram Probability
cursor.execute("""
SELECT b.probability
FROM bigrams b
JOIN words w1 ON b.word1_id = w1.id
JOIN words w2 ON b.word2_id = w2.id
WHERE w1.word = ? AND w2.word = ?
""", ("ထမင်း", "စား"))
Get Top Continuations
cursor.execute("""
SELECT w2.word, b.probability
FROM bigrams b
JOIN words w1 ON b.word1_id = w1.id
JOIN words w2 ON b.word2_id = w2.id
WHERE w1.word = ?
ORDER BY b.probability DESC
LIMIT 10
""", ("ထမင်း",))
Get POS Transition Probability
cursor.execute("""
SELECT probability
FROM pos_bigrams
WHERE pos1 = ? AND pos2 = ?
""", ("N", "V"))
Database Optimization
Indexes
Critical indexes for performance:
-- Text lookups
CREATE INDEX idx_syllables_text ON syllables(syllable);
CREATE INDEX idx_words_text ON words(word);
-- N-gram lookups (using word IDs)
CREATE INDEX idx_bigrams_w1_w2 ON bigrams(word1_id, word2_id);
CREATE INDEX idx_trigrams_w1_w2_w3 ON trigrams(word1_id, word2_id, word3_id);
VACUUM
Compact database after building:
Page Size
Optimize for read performance:
PRAGMA page_size = 4096;
PRAGMA cache_size = 10000;
Schema Migration
Version Tracking
def get_schema_version(conn):
cursor = conn.execute(
"SELECT value FROM metadata WHERE key = 'version'"
)
row = cursor.fetchone()
return row[0] if row else "1.0.0"
Migration Example
def migrate_v1_to_v2(conn):
# Add new column
conn.execute("ALTER TABLE words ADD COLUMN pos_tags TEXT")
# Update metadata
conn.execute(
"UPDATE metadata SET value = '2.0.0' WHERE key = 'version'"
)
conn.commit()
See Also