Overview - mySpellChecker

Each strategy in the pipeline can be toggled, tuned, or replaced independently. This page summarizes every feature and links to its dedicated guide.

Feature Matrix

Note: Speed ratings are relative comparisons, not validated benchmarks. Actual performance depends on your dictionary size, hardware, and configuration.

Feature	Description	Speed	Optional
Syllable Validation	Rule-based syllable structure checking	Very Fast	No
Word Validation	Dictionary lookup with SymSpell suggestions	Fast	No
Context Checking	N-gram based context validation	Moderate	Yes
Grammar Checking	POS-based syntactic validation	Fast	Yes
Semantic Checking	AI-powered deep context analysis	Slow	Yes
NER	Named entity recognition	Varies	Yes
Morphology	Word structure analysis	Very Fast	Yes
Morphological Synthesis	Compound/reduplication validation	Very Fast	Yes
Grammar Checkers	Aspect/Classifier/Compound/MergedWord/Negation/Particle/TenseAgreement/Register	Fast	Yes
Validation Strategies	Composable validation pipeline (12 strategies)	Varies	Yes
Normalization	Unified text normalization service	Very Fast	No
Batch Processing	Parallel multi-text processing	Varies	No
Async API	Non-blocking async operations	-	No
Streaming API	Memory-efficient large file processing	Varies	No
Segmenters	Syllable/word/sentence segmentation	Very Fast	No
Suggestion Ranking	Multi-factor suggestion scoring	Very Fast	No
Connection Pool	Thread-safe connection management	-	No
Homophones	Sound-alike word detection	Fast	Yes
Colloquial Variants	Informal/formal spelling detection	Very Fast	Yes
i18n (Localization)	Error messages in English/Myanmar	Very Fast	No

Core Features

Syllable Validation

The foundation of mySpellChecker. Validates Myanmar syllable structure using orthographic rules and dictionary lookup. Key capabilities:

Rule-based syllable structure validation
Consonant-medial-vowel pattern checking
Dictionary syllable lookup
O(1) validation performance

# Syllable validation catches ~90% of typos immediately
result = checker.check("မြန်မာ")  # Valid syllables

Word Validation

Validates complete words using dictionary lookup and the SymSpell algorithm for efficient suggestion generation. Key capabilities:

Dictionary word lookup
SymSpell O(1) suggestions
Edit distance calculation
Compound word handling

from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider
from myspellchecker.core.constants import ValidationLevel

# Get word-level suggestions (level specified per-check)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(provider=provider)
result = checker.check(text, level=ValidationLevel.WORD)
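
The delete-based lookup behind SymSpell-style suggestion can be sketched in plain Python. This is a minimal illustration and not the library's implementation: `deletes`, `build_index`, and `candidates` are hypothetical names, and Latin words are used only to keep the example easy to read. A full implementation would also verify each candidate with a true edit-distance check before ranking it.

```python
from collections import defaultdict
from itertools import combinations


def deletes(word: str, max_distance: int = 1) -> set[str]:
    """Return the word plus every variant with up to max_distance characters removed."""
    variants = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in keep))
    return variants


def build_index(dictionary: list[str], max_distance: int = 1) -> dict[str, set[str]]:
    """Precompute a map from each delete variant to the words that produce it."""
    index: dict[str, set[str]] = defaultdict(set)
    for word in dictionary:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index


def candidates(term: str, index: dict[str, set[str]], max_distance: int = 1) -> set[str]:
    """Collect dictionary words that share at least one delete variant with term."""
    found: set[str] = set()
    for variant in deletes(term, max_distance):
        found |= index.get(variant, set())
    return found


index = build_index(["hello", "help", "world"])
print(candidates("helo", index))
```

Because the index is built once, each lookup only generates the deletes of the input term and performs hash lookups, which is why the approach scales well to large dictionaries.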

Context Checking

Detects “real-word errors” where a word is spelled correctly but used incorrectly in context. Key capabilities:

Bigram probability analysis
Trigram context windows
Statistical language modeling
Real-word error detection

from myspellchecker.core.config import SpellCheckerConfig

# Detects unnatural word combinations (e.g., "rice go" vs "eat rice")
config = SpellCheckerConfig(use_context_checker=True)
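
To make the bigram idea concrete, the following sketch scores adjacent word pairs against a toy corpus. It is an assumption-laden illustration, not the library's context checker: the corpus, the add-one smoothing, the threshold, and the names `bigram_prob` and `unlikely_pairs` are all hypothetical.

```python
from collections import Counter

# Toy training corpus of pre-segmented sentences (illustrative only).
corpus = [
    ["i", "eat", "rice"],
    ["we", "eat", "rice"],
    ["i", "go", "home"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
vocab_size = len(unigrams)


def bigram_prob(prev: str, word: str) -> float:
    """Estimate P(word | prev) with add-one smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def unlikely_pairs(words: list[str], threshold: float) -> list[tuple[str, str]]:
    """Return adjacent pairs whose smoothed probability falls below threshold."""
    return [(a, b) for a, b in zip(words, words[1:]) if bigram_prob(a, b) < threshold]


print(unlikely_pairs(["rice", "go"], threshold=0.2))
print(unlikely_pairs(["eat", "rice"], threshold=0.2))
```

Every word here is correctly spelled, so only the pair statistics can reveal that "rice go" is less plausible than "eat rice".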

Advanced Features

POS Tagging

Part-of-Speech tagging with multiple backend options for different accuracy/speed trade-offs. Tagger options:

Type	Accuracy	Speed	Dependencies
Rule-based	~70%	Fast	None
Viterbi	~85%	Medium	None
Transformer	~93%	Slow	transformers, torch

from myspellchecker.core.config import POSTaggerConfig, SpellCheckerConfig

config = SpellCheckerConfig(
    pos_tagger=POSTaggerConfig(tagger_type="transformer")
)

Grammar Checking

Rule-based syntactic validation using POS tags to detect grammatical errors. Key capabilities:

Particle usage validation
Verb-modifier agreement
Sentence structure checking
Custom grammar rule support

from myspellchecker.core.config import SpellCheckerConfig

# Detects particle errors like မှာ vs မှ
config = SpellCheckerConfig(use_rule_based_validation=True)

Grammar Engine

Comprehensive syntactic rule checker coordinating eight specialized checkers. Key capabilities:

Particle typo detection
Medial confusion detection (ျ vs ြ)
POS sequence validation
Verb-particle agreement
Configurable confidence thresholds

from myspellchecker.grammar import SyntacticRuleChecker

checker = SyntacticRuleChecker(provider)
corrections = checker.check_sequence(["ကျွန်တော်", "ကျောင်း", "သွားတယ်"])

Semantic Checking

Deep learning-based context analysis using ONNX models for the highest accuracy. Key capabilities:

BERT/RoBERTa masked language modeling
Semantic context understanding
Confidence scoring
Quantized CPU inference

from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig

# Enable AI-powered checking
config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model_path="path/to/model.onnx",
        tokenizer_path="path/to/tokenizer"
    )
)

Performance Features

Batch Processing

Efficient processing of multiple texts with parallelization. Key capabilities:

Cython-optimized processing
OpenMP parallelization
Batch result aggregation
Memory-efficient streaming

# Process thousands of texts efficiently
results = checker.check_batch(texts)

Async API

Non-blocking async operations for web applications. Key capabilities:

Native async/await support
FastAPI/Starlette integration
Concurrent request handling
Async batch processing

# Non-blocking spell checking
result = await checker.check_async(text)
results = await checker.check_batch_async(texts)
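
One common way to expose a blocking checker through async/await is to run it in a worker thread. The sketch below shows that pattern with the standard library only. It does not describe how the library implements its async methods: `check_blocking`, `check_in_thread`, and `check_many` are hypothetical names, and the "teh" rule is a placeholder for real validation.

```python
import asyncio


def check_blocking(text: str) -> dict:
    """Hypothetical stand-in for a synchronous spell check."""
    return {"text": text, "has_errors": "teh" in text.split()}


async def check_in_thread(text: str) -> dict:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(check_blocking, text)


async def check_many(texts: list[str]) -> list[dict]:
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(check_in_thread(t) for t in texts))


results = asyncio.run(check_many(["teh cat", "the cat"]))
print([r["has_errors"] for r in results])
```

Inside a FastAPI or Starlette handler the event loop is already running, so the handler would simply `await` the coroutine instead of calling `asyncio.run`.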

Integration Features

Connection Pool

Thread-safe database connection management for high-concurrency scenarios. Key capabilities:

Configurable min/max pool size
Automatic connection health checks
Connection aging and recreation
Pool statistics and monitoring

from myspellchecker.providers.connection_pool import ConnectionPool
from myspellchecker.core.config import ConnectionPoolConfig

pool_config = ConnectionPoolConfig(min_size=2, max_size=10)
pool = ConnectionPool("/path/to/db.sqlite", pool_config=pool_config)
with pool.checkout() as conn:
    cursor = conn.cursor()
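
The checkout pattern above can be illustrated with a small fixed-size pool built on `queue.Queue` and `sqlite3`. This is a sketch under stated assumptions, not the library's `ConnectionPool`: `TinyPool` and `idle_count` are hypothetical, and health checks, connection aging, and min/max sizing are omitted.

```python
import queue
import sqlite3
from contextlib import contextmanager


class TinyPool:
    """Hypothetical fixed-size SQLite pool, for illustration only."""

    def __init__(self, database: str, size: int = 2) -> None:
        self._idle: queue.Queue = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False allows a connection to be used by another thread.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def checkout(self, timeout: float = 5.0):
        conn = self._idle.get(timeout=timeout)  # blocks until a connection is free
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back

    def idle_count(self) -> int:
        return self._idle.qsize()


pool = TinyPool(":memory:", size=2)
with pool.checkout() as conn:
    value = conn.execute("SELECT 1").fetchone()[0]
    idle_while_checked_out = pool.idle_count()
print(value, idle_while_checked_out, pool.idle_count())
```

The queue provides the thread safety: `get` and `put` are atomic, so two threads can never receive the same connection at the same time.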

Segmenters

Multiple text segmentation strategies for Myanmar text. Segmenter types:

Type	Description	Use Case
DefaultSegmenter	Production segmenter	General use
RegexSegmenter	Rule-based syllables	Lightweight

from myspellchecker.segmenters import DefaultSegmenter

segmenter = DefaultSegmenter(word_engine="myword")
syllables = segmenter.segment_syllables("မြန်မာစာ")

Homophones Detection

Detects sound-alike words that may be confused in context.

from myspellchecker.core.homophones import HomophoneChecker

checker = HomophoneChecker()
homophones = checker.get_homophones("ကျား")  # Returns set of homophones
has_match = len(checker.get_homophones("ကြား")) > 0  # Check if homophones exist

Colloquial Variant Handling

Detects colloquial (informal) spellings and suggests standard forms. Key capabilities:

Colloquial form detection
Standard form suggestion
Configurable strictness levels

from myspellchecker.text.phonetic_data import is_colloquial_variant, get_standard_forms

# Check if word is colloquial
is_colloquial_variant("ကျနော်")  # True

# Get standard form
get_standard_forms("ကျနော်")  # ["ကျွန်တော်"]

Configuration:

from myspellchecker.core.config.validation_configs import ValidationConfig

config = ValidationConfig(
    colloquial_strictness="lenient",  # "strict", "lenient", or "off"
    colloquial_info_confidence=0.3,
)

Strictness	Behavior
`strict`	Flag all colloquial variants as errors
`lenient`	Accept with informational note (default)
`off`	No special handling

Internationalization (i18n)

Localized error messages in English and Myanmar.

from myspellchecker.core.i18n import set_language, get_message

# Set language to Myanmar
set_language("my")

# Get localized message
get_message("invalid_syllable")
# Output: စာလုံးပေါင်း မမှန်ကန်ပါ

Supported languages: "en" (English), "my" (Myanmar)

Streaming API

Memory-efficient stream processing for large documents with progress callbacks. Key capabilities:

Generator-based synchronous streaming
Async iteration support
Progress callbacks and statistics
Memory limits with backpressure
Cross-sentence context validation

from myspellchecker.core.streaming import StreamingChecker

streaming = StreamingChecker(checker)
with open("large_file.txt") as f:
    for result in streaming.check_stream(f):
        if result.response.has_errors:
            process(result)
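
Generator-based streaming keeps memory use flat because only one line is held at a time. The following sketch shows the idea with a progress callback. It is not the `StreamingChecker` implementation: `stream_errors`, the `check` callable, and the "teh" rule are hypothetical placeholders.

```python
import io
from typing import Callable, Iterable, Iterator, Optional


def stream_errors(
    lines: Iterable[str],
    check: Callable[[str], bool],
    on_progress: Optional[Callable[[int], None]] = None,
) -> Iterator[tuple[int, str]]:
    """Lazily yield (line_number, text) for every line that fails check."""
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not check(text):
            yield number, text
        if on_progress is not None:
            on_progress(number)  # report how many lines have been processed


progress: list[int] = []
source = io.StringIO("good line\nbad teh line\ngood again\n")
flagged = list(stream_errors(source, lambda t: "teh" not in t.split(), progress.append))
print(flagged, progress)
```

Any iterable of lines works as the source, so an open file object can be passed directly without reading the whole file into memory.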

Custom Providers

Pluggable storage backends for different use cases.

from myspellchecker import SpellChecker
from myspellchecker.providers import MemoryProvider, SQLiteProvider

# High-speed in-memory
checker = SpellChecker(provider=MemoryProvider())

# Disk-based for large dictionaries
checker = SpellChecker(provider=SQLiteProvider())

Feature Comparison by Use Case

Real-Time Typing

from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider
from myspellchecker.core.constants import ValidationLevel

# Fastest: syllable-only validation (level specified per-check)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(provider=provider)
result = checker.check(text, level=ValidationLevel.SYLLABLE)

Document Checking

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider
from myspellchecker.core.constants import ValidationLevel

# Balanced: word validation + context checking + rule-based grammar
config = SpellCheckerConfig(
    use_context_checker=True,
    use_rule_based_validation=True,
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)
result = checker.check(text, level=ValidationLevel.WORD)

Quality Assurance

# Thorough: full validation with AI
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig
from myspellchecker.providers import SQLiteProvider
from myspellchecker.core.constants import ValidationLevel

config = SpellCheckerConfig(
    use_context_checker=True,
    semantic=SemanticConfig(model_path="..."),
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)
# Word-level validation with the semantic pass enabled for this call
result = checker.check(text, level=ValidationLevel.WORD, use_semantic=True)

High-Volume Processing

# Optimized for throughput
from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider
from myspellchecker.core.config import SpellCheckerConfig

config = SpellCheckerConfig(use_context_checker=False)  # Faster
provider = SQLiteProvider(pool_max_size=10)
checker = SpellChecker(config=config, provider=provider)
results = checker.check_batch(texts)

Feature Dependencies

  +------------------------+
  | Syllable Validation    |  [core]
  +-----------+------------+
              |
              v
  +------------------------+
  | Word Validation        |  [core]
  +-----------+------------+
              |
         +----+----+
         |         |
         v         v
  +-------------+  +------------------+
  | Context     |  | Grammar          |  [advanced]
  | Checking    |  | Checking         |
  +------+------+  +--------+---------+
         |                  |
         v                  v
  +-------------+  +------------------+
  | Semantic    |  | POS Tagging      |  [advanced]
  | Checking    |  |                  |
  +-------------+  +------------------+
    [ai]

Legend:

[core]: Core features (always available)
[advanced]: Advanced features (optional)
[ai]: AI features (requires extra dependencies)

Text Processing Features

Named Entity Recognition

Identifies names, locations, and organizations to reduce false positives. Key capabilities:

Heuristic-based NER (fast, ~70% accuracy)
Transformer-based NER (~93% accuracy)
Hybrid mode with automatic fallback
Entity filtering for spell checking

from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.text.ner_model import NERConfig

config = SpellCheckerConfig(
    ner=NERConfig(enabled=True, model_type="heuristic")
)

Morphology Analysis

Word structure analysis for POS inference and OOV recovery. Key capabilities:

Suffix-based POS guessing
Word decomposition (root + suffixes)
Multi-POS support for ambiguous words
Numeral detection
Productive reduplication validation (AA, AABB, ABAB patterns)
Compound word synthesis (DP-based splitting into known morphemes)
Morpheme-level suggestions (correct typos inside compounds)

from myspellchecker.text.morphology import MorphologyAnalyzer
from myspellchecker.text.reduplication import ReduplicationEngine
from myspellchecker.text.compound_resolver import CompoundResolver

# OOV analysis
analyzer = MorphologyAnalyzer()
result = analyzer.analyze_word("စားခဲ့သည်")
print(result.root)      # "စား"
print(result.suffixes)  # ["ခဲ့", "သည်"]

# Reduplication validation
engine = ReduplicationEngine(segmenter=segmenter)
result = engine.analyze("ကောင်းကောင်း", dict_check, freq_check, pos_check)
# Valid AA reduplication of "ကောင်း"

# Compound synthesis
resolver = CompoundResolver(segmenter=segmenter)
result = resolver.resolve("ကျောင်းသား", dict_check, freq_check, pos_check)
# Valid N+N compound: ["ကျောင်း", "သား"]
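
The DP-based splitting and AA reduplication checks listed above can be sketched over pre-segmented syllables. This is a simplified illustration, not the `CompoundResolver` or `ReduplicationEngine` logic: `split_compound`, `is_aa_reduplication`, and the three-word lexicon are hypothetical, and the POS and frequency checks are left out.

```python
from typing import Optional

# Tiny illustrative lexicon; a real one would come from the dictionary provider.
school, boy, good = "ကျောင်း", "သား", "ကောင်း"
lexicon = {school, boy, good}


def split_compound(syllables: list[str], lexicon: set[str]) -> Optional[list[str]]:
    """Split syllables into the fewest known morphemes, or return None if impossible."""
    n = len(syllables)
    best: list[Optional[list[str]]] = [None] * (n + 1)
    best[0] = []  # the empty prefix needs no morphemes
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            piece = "".join(syllables[start:end])
            if prefix is None or piece not in lexicon:
                continue
            candidate = prefix + [piece]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[n]


def is_aa_reduplication(syllables: list[str], lexicon: set[str]) -> bool:
    """True when the word is a known base repeated exactly twice (AA pattern)."""
    half, odd = divmod(len(syllables), 2)
    if odd or half == 0:
        return False
    base = syllables[:half]
    return base == syllables[half:] and "".join(base) in lexicon


print(split_compound([school, boy], lexicon))
print(is_aa_reduplication([good, good], lexicon))
```

Each `best[end]` entry depends only on shorter prefixes, so the split is found in quadratic time in the number of syllables rather than by enumerating every segmentation.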

Text Utilities

Specialized utilities for Myanmar text processing. Key capabilities:

Stemmer: Rule-based suffix stripping with caching
Phonetic Hasher: Sound-based fuzzy matching
Tone Disambiguator: Context-based tone resolution
Zawgyi Detection: Legacy encoding detection

from myspellchecker.text.stemmer import Stemmer
from myspellchecker.text.phonetic import PhoneticHasher

stemmer = Stemmer()
hasher = PhoneticHasher()

Grammar Features

Suggestion Ranking

Multi-factor ranking system for spelling suggestions. Ranker types:

Ranker	Primary Factor	Use Case
DefaultRanker	Edit distance + frequency	General use
FrequencyFirstRanker	Corpus frequency	Autocomplete
PhoneticFirstRanker	Phonetic similarity	Myanmar text
UnifiedRanker	Multi-source	Comprehensive

from myspellchecker.algorithms.ranker import FrequencyFirstRanker

ranker = FrequencyFirstRanker()
# SymSpell and provider are assumed to be imported and constructed elsewhere
symspell = SymSpell(provider, ranker=ranker)
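
A two-factor ranking of the kind the table attributes to DefaultRanker (edit distance plus frequency) might look like the sketch below. The scoring rule is an assumption, not the library's formula: `edit_distance` is a standard Levenshtein implementation, while `rank` and the frequency counts are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def rank(term: str, candidates: dict[str, int]) -> list[str]:
    """Order candidates by edit distance (ascending), then frequency (descending)."""
    return sorted(candidates, key=lambda w: (edit_distance(term, w), -candidates[w]))


ranked = rank("helo", {"help": 120, "hello": 300, "world": 500})
print(ranked)
```

A frequency-first ranker would swap the two elements of the sort key, which is the main design difference between the ranker types in the table.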

Grammar Checkers

Eight specialized checkers for Myanmar grammar validation.

Checker	Purpose
AspectChecker	Verb aspect markers
ClassifierChecker	Numeral classifiers
CompoundChecker	Compound words
MergedWordChecker	Merged particle+verb detection
NegationChecker	Negation patterns
ParticleChecker	Particle context validation
TenseAgreementChecker	Tense-time agreement
RegisterChecker	Formal/colloquial register

from myspellchecker.grammar.checkers.aspect import AspectChecker
from myspellchecker.grammar.checkers.register import RegisterChecker

aspect_checker = AspectChecker()
register_checker = RegisterChecker()

Text Normalization

Unified normalization service for consistent text processing. Key capabilities:

Purpose-specific normalization methods
Zawgyi detection and conversion
Unicode NFC normalization
Myanmar diacritic reordering

from myspellchecker.text.normalization_service import get_normalization_service

service = get_normalization_service()
normalized = service.for_spell_checking(text)
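
The Unicode NFC step can be reproduced with the standard library alone. The sketch below is illustrative and does not reflect what `for_spell_checking` does internally: `normalize_for_lookup` is a hypothetical helper, the choice of characters to strip is an assumption, and Zawgyi conversion and Myanmar diacritic reordering are not covered.

```python
import unicodedata

# Invisible characters that commonly leak in from copied web text (assumed set).
INVISIBLE = {"\u200b", "\ufeff"}


def normalize_for_lookup(text: str) -> str:
    """Drop invisible characters, compose to NFC, and trim surrounding whitespace."""
    cleaned = "".join(ch for ch in text if ch not in INVISIBLE)
    return unicodedata.normalize("NFC", cleaned).strip()


# "e" followed by a combining acute accent composes to a single code point.
print(normalize_for_lookup("cafe\u0301\u200b ") == "caf\u00e9")
```

Normalizing before dictionary lookup matters because two visually identical strings with different code point sequences would otherwise be treated as different words.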

Validation Strategies

Strategy-based validation pipeline for composable error detection. Strategies (by priority):

Strategy	Priority	Purpose
ToneValidation	10	Tone mark disambiguation
Orthography	15	Orthographic error detection
SyntacticRule	20	Grammar rule checking
BrokenCompound	25	Broken compound detection
POSSequence	30	POS sequence validation
Question	40	Question structure
Homophone	45	Sound-alike detection
ConfusableSemantic	48	AI confusable detection (opt-in)
NgramContext	50	N-gram probability
Semantic	70	AI-powered validation (opt-in)
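
A priority-ordered pipeline like the one in this table can be sketched with a sortable dataclass. The sketch assumes that lower priority numbers run first and that opt-in strategies are skipped unless enabled. `Strategy`, `run_pipeline`, and the lambda validators are hypothetical and do not mirror the library's classes.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(order=True)
class Strategy:
    priority: int  # the only field used for ordering
    name: str = field(compare=False)
    validate: Callable[[list[str]], list[str]] = field(compare=False)
    enabled: bool = field(default=True, compare=False)


def run_pipeline(words: list[str], strategies: list[Strategy]) -> list[tuple[str, str]]:
    """Run enabled strategies in ascending priority order and collect their findings."""
    findings: list[tuple[str, str]] = []
    for strategy in sorted(strategies):
        if strategy.enabled:
            findings.extend((strategy.name, msg) for msg in strategy.validate(words))
    return findings


strategies = [
    Strategy(50, "NgramContext", lambda ws: ["unlikely pair"] if ws[:2] == ["rice", "go"] else []),
    Strategy(20, "SyntacticRule", lambda ws: ["missing particle"] if len(ws) < 3 else []),
    Strategy(70, "Semantic", lambda ws: ["model flag"], enabled=False),  # opt-in, off here
]
findings = run_pipeline(["rice", "go"], strategies)
print(findings)
```

Because each strategy is just a value in a list, toggling, replacing, or reordering one does not require changes to the others.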

Architecture

Dependency Injection

Lightweight DI system for component management. Key components:

ServiceContainer for lazy initialization
Factory functions for component creation
Singleton and transient service support
Thread-safe service resolution
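
The four components above fit in a few lines of Python. The sketch below is a hypothetical miniature, not the library's `ServiceContainer`: `TinyContainer`, `register`, and `resolve` are assumed names chosen for illustration.

```python
import threading
from typing import Any, Callable


class TinyContainer:
    """Hypothetical minimal DI container with lazy singletons and transients."""

    def __init__(self) -> None:
        self._factories: dict[str, tuple[Callable[[], Any], bool]] = {}
        self._singletons: dict[str, Any] = {}
        self._lock = threading.RLock()  # guards both dictionaries

    def register(self, name: str, factory: Callable[[], Any], singleton: bool = True) -> None:
        with self._lock:
            self._factories[name] = (factory, singleton)

    def resolve(self, name: str) -> Any:
        with self._lock:
            factory, singleton = self._factories[name]
            if not singleton:
                return factory()  # transient: a new instance on every call
            if name not in self._singletons:
                self._singletons[name] = factory()  # lazy: built on first use
            return self._singletons[name]


build_count = 0


def make_segmenter() -> object:
    global build_count
    build_count += 1
    return object()


container = TinyContainer()
container.register("segmenter", make_segmenter)
container.register("request_state", dict, singleton=False)
print(container.resolve("segmenter") is container.resolve("segmenter"), build_count)
```

Registering factories rather than instances is what makes initialization lazy: nothing expensive is constructed until a component is first resolved.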

Reference

Rules System

YAML configuration files for linguistic rules. Key files:

particles.yaml - 91 linguistic particles
typo_corrections.yaml - Common typo patterns
morphology.yaml - Suffix/prefix patterns
morphotactics.yaml - Compound word POS pattern rules
aspects.yaml - Verb aspect markers
classifiers.yaml - Numeral classifiers
register.yaml - Formal/colloquial mappings

Guides

Configuration Guide

Comprehensive configuration options. Topics:

SpellCheckerConfig and nested configs
Pre-defined configuration profiles
Loading from files and environment

Logging Guide

Centralized logging system. Features:

Development and production modes
JSON structured logging
Module-specific log levels
get_logger() for consistent naming

Training Features

Training Pipeline

End-to-end pipeline for training custom semantic models. Pipeline stages:

Tokenizer Training (Byte-Level BPE)
Model Training (RoBERTa/BERT MLM)
ONNX Export (quantized)

from myspellchecker.training import TrainingPipeline, TrainingConfig

config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./models/",
    architecture="roberta",
    epochs=5,
)
pipeline = TrainingPipeline()
model_path = pipeline.run(config)

Text Validation

Comprehensive Myanmar text quality validation with 30+ validation categories. Key capabilities:

Structural validation (syllable structure, encoding)
Zawgyi artifact detection
Quality filtering (fragments, truncation)
Known invalid word detection

from myspellchecker.text.validator import validate_word

is_valid = validate_word("ကျောင်း")
if is_valid:
    print("Word is valid")

Next Steps

Explore individual feature documentation
Configuration Guide for enabling/disabling features
Performance Tuning for optimization
Rules System for customization
Dependency Injection for architecture
Extension Points for code patterns
