The system uses a layered architecture with pluggable components connected through factories, the Builder pattern, and strategy interfaces.
Component Architecture
Core Components
The core components are organized in layers: Configuration (Builder, Factory) → Validators (Syllable, Word, Context) → Algorithms (SymSpell, N-gram, POS, Semantic) → Infrastructure (Segmenter, Provider, Normalizer).
Component Responsibilities
| Component | Responsibility |
|---|---|
| SpellChecker | Orchestrate validation, manage lifecycle. Uses mixin decomposition: PreNormalizationDetectorsMixin, PostNormalizationDetectorsMixin, SentenceDetectorsMixin, SuggestionPipelineMixin, ErrorSuppressionMixin |
| Configuration | Store settings, validation levels, thresholds |
| Builder | Fluent construction of SpellChecker |
| ComponentFactory | Create and wire components |
| SyllableValidator | Layer 1: syllable-level validation |
| WordValidator | Layer 2: word-level validation |
| ContextValidator | Layer 3: context validation via strategy pattern (created by ContextValidatorFactory; orchestrates 12 strategies: Tone, Orthography, Syntactic, StatisticalConfusable, BrokenCompound, POS Sequence, Question, Homophone, ConfusableCompoundClassifier, ConfusableSemantic, N-gram Context, and Semantic) |
| SymSpell | Suggestion generation via symmetric-delete lookup (hash lookups whose cost does not grow with dictionary size) |
| NgramContextChecker | Context probability calculation (created by ContextCheckerFactory, distinct from ContextValidatorFactory) |
| POSTagger | Part-of-speech tagging |
| SemanticChecker | Deep context analysis |
| Segmenter | Text segmentation |
| Provider | Dictionary data access |
| Normalizer | Text normalization |
| TokenRefinement | Token boundary refinement that exposes hidden error spans in merged tokens (particle attachment, negation attachment) using a lattice-based scoring pass |
| NeuralReranker | Optional MLP-based suggestion re-ranking using ONNX (19-feature vector, runs as final pipeline step) |
| MedialSwapSuggestionStrategy | Generates medial swap candidates (ျ↔ြ, ွ↔ှ) that SymSpell’s delete-distance model cannot find |
| Response | Result dataclass (text, errors, metadata) |
Design Patterns
Builder Pattern
SpellCheckerBuilder provides fluent construction:
class SpellCheckerBuilder:
    def __init__(self):
        self._config = SpellCheckerConfig()
        self._provider = None
        self._segmenter = None

    def with_config(self, config: SpellCheckerConfig) -> "SpellCheckerBuilder":
        self._config = config
        return self

    def with_provider(self, provider: DictionaryProvider) -> "SpellCheckerBuilder":
        self._provider = provider
        return self

    def with_segmenter(self, segmenter: Segmenter) -> "SpellCheckerBuilder":
        self._segmenter = segmenter
        return self

    def build(self) -> SpellChecker:
        # Resolve provider (auto-detect SQLiteProvider or fallback to MemoryProvider)
        provider = self._provider
        if provider is None:
            try:
                provider = SQLiteProvider(
                    database_path=self._config.provider_config.database_path,
                    cache_size=self._config.provider_config.cache_size,
                )
            except MissingDatabaseError:
                if self._config.fallback_to_empty_provider:
                    provider = MemoryProvider()
                else:
                    raise
        # Resolve segmenter
        segmenter = self._segmenter or DefaultSegmenter(
            word_engine=self._config.word_engine,
        )
        # SpellChecker.__init__ uses ComponentFactory internally
        return SpellChecker(
            config=self._config,
            provider=provider,
            segmenter=segmenter,
        )
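The fluent style can be tried in isolation with a small stand-in. The sketch below is not the library's code: SketchConfig, SketchChecker, and SketchBuilder are hypothetical names, and the default provider and segmenter are placeholders chosen only to keep the example runnable.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Stand-in for SpellCheckerConfig (hypothetical field)."""
    word_engine: str = "default"


@dataclass
class SketchChecker:
    """Stand-in for SpellChecker; it only records what it was built with."""
    config: SketchConfig
    provider: object
    segmenter: object


class SketchBuilder:
    """Each with_* method stores a value and returns self so calls chain."""

    def __init__(self) -> None:
        self._config = SketchConfig()
        self._provider = None
        self._segmenter = None

    def with_config(self, config: SketchConfig) -> "SketchBuilder":
        self._config = config
        return self

    def with_provider(self, provider: object) -> "SketchBuilder":
        self._provider = provider
        return self

    def build(self) -> SketchChecker:
        # Defaults are resolved only at build() time, mirroring the excerpt.
        provider = self._provider if self._provider is not None else {}
        segmenter = self._segmenter or str.split  # placeholder segmenter
        return SketchChecker(self._config, provider, segmenter)


checker = (
    SketchBuilder()
    .with_config(SketchConfig(word_engine="custom"))
    .with_provider({"hello": 1})
    .build()
)
```

Because defaults are filled in by build(), callers only need to name the parts they want to replace.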
Factory Pattern
ComponentFactory creates configured components. Its constructor takes only the config, not the provider;
the provider and segmenter are passed to create_all():
class ComponentFactory:
    def __init__(self, config: SpellCheckerConfig):
        self.config = config

    def create_all(self, provider, segmenter) -> dict[str, Any]:
        """Create all SpellChecker components in proper dependency order.

        Returns a dict with keys:
            syllable_validator, word_validator, context_validator,
            viterbi_tagger, joint_segment_tagger, syntactic_rule_checker,
            semantic_checker, context_checker, name_heuristic, phonetic_hasher
        """
        # 1. Create algorithm components (SymSpell, ViterbiTagger, etc.)
        phonetic_hasher = self.create_phonetic_hasher()
        symspell = self.create_symspell(provider, phonetic_hasher)
        # ... POS probabilities, context checker, etc.

        # 2. Create validators with their actual constructor signatures
        syllable_validator = SyllableValidator(
            config=self.config,
            segmenter=segmenter,
            repository=provider,  # SyllableRepository interface
            symspell=symspell,
            syllable_rule_validator=SyllableRuleValidator(),
        )
        word_validator = WordValidator(
            config=self.config,
            segmenter=segmenter,
            word_repository=provider,  # WordRepository interface
            syllable_repository=provider,  # SyllableRepository interface
            symspell=symspell,
            context_checker=context_checker,
            suggestion_strategy=suggestion_strategy,
        )
        # ContextValidator uses the Strategy Pattern -- grammar checking
        # (SyntacticRuleChecker) is one of several strategies, not a
        # separate validator.
        context_validator = ContextValidator(
            config=self.config,
            segmenter=segmenter,
            strategies=[...],  # List of ValidationStrategy instances
            name_heuristic=name_heuristic,
            ner_model=ner_model,
            viterbi_tagger=viterbi_tagger,
        )
        return {
            "syllable_validator": syllable_validator,
            "word_validator": word_validator,
            "context_validator": context_validator,
            # ... other components
        }
Note: SyntacticRuleChecker is not a separate validator. It is wrapped as a
SyntacticValidationStrategy and passed into ContextValidator.strategies.
The ContextValidator orchestrates the 12 strategies wired by
ComponentFactory: Tone (10), Orthography (15), Syntactic (20),
Statistical Confusable (24), Broken Compound (25), POS Sequence (30),
Question Structure (40), Homophone (45), Confusable Compound Classifier (47),
Confusable Semantic (48), N-gram Context (50), and Semantic (70).
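One way to picture this orchestration is a list of strategies sorted by a numeric key and run in turn. The sketch below is illustrative only: SketchStrategy and run_strategies are hypothetical names, and it assumes the numbers in parentheses are priorities where a lower value runs earlier, which the text above does not state explicitly.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SketchStrategy:
    name: str
    priority: int  # assumption: a lower number runs earlier
    check: Callable[[list[str]], list[str]]  # returns error messages


def run_strategies(strategies: list[SketchStrategy], words: list[str]) -> list[str]:
    """Run every strategy in ascending priority order and collect errors."""
    errors: list[str] = []
    for strategy in sorted(strategies, key=lambda s: s.priority):
        errors.extend(f"{strategy.name}: {msg}" for msg in strategy.check(words))
    return errors


# Two toy strategies registered out of order on purpose.
strategies = [
    SketchStrategy("ngram", 50, lambda ws: ["unlikely bigram"] if len(ws) > 1 else []),
    SketchStrategy("tone", 10, lambda ws: ["tone mark"] if "x" in ws else []),
]
result = run_strategies(strategies, ["x", "y"])
```

Sorting at run time means the registration order of strategies does not matter; only the priority value does.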
Strategy Pattern
Pluggable components implement common interfaces:
# Provider Strategy
class DictionaryProvider(ABC):
    def is_valid_syllable(self, syllable: str) -> bool: ...
    def is_valid_word(self, word: str) -> bool: ...
    def get_word_frequency(self, word: str) -> int: ...
    def get_bigram_probability(self, w1: str, w2: str) -> float: ...

# Segmenter Strategy
class Segmenter(ABC):
    def segment_syllables(self, text: str) -> list[str]: ...
    def segment_words(self, text: str) -> list[str]: ...

# POS Tagger Strategy
class POSTaggerBase(ABC):
    def tag_word(self, word: str) -> str: ...
    def tag_sequence(self, words: list[str]) -> list[str]: ...
Chain of Responsibility
Validators form a chain, augmented by 38 post-normalization detectors inherited from mixins (see Component Diagram for the full mixin architecture and detection registry):
class SpellChecker(
    PreNormalizationDetectorsMixin,
    PostNormalizationDetectorsMixin,
    SentenceDetectorsMixin,
    SuggestionPipelineMixin,
    ErrorSuppressionMixin,
):
    # check() delegates to _prepare_text() → _run_validation() → _finalize_response()
    def _run_validation_layers(self, normalized_text, level, use_semantic):
        errors = []
        layers_applied = []  # filled in by each _validate_* call below

        # Layer 1: Syllable validation (always runs)
        self._validate_syllables(normalized_text, errors, layers_applied)
        self._suppress_cascade_syllable_errors(errors, normalized_text)
        self._suppress_pali_stacking_errors(errors, normalized_text)

        # 38 post-normalization detectors (run unconditionally, all levels)
        from myspellchecker.core.detection_registry import POST_NORM_DETECTOR_SEQUENCE
        for entry in POST_NORM_DETECTOR_SEQUENCE:
            getattr(self, entry.method_name)(normalized_text, errors)

        # Layer 2 & 3: Word and context validation (if level == WORD)
        if level == ValidationLevel.WORD:
            self._validate_words(normalized_text, errors, layers_applied)
            self._validate_context(normalized_text, errors, layers_applied, use_semantic)

        # Suggestion reconstruction + dedup pipeline
        self._reconstruct_compound_suggestions(normalized_text, errors)
        self._dedup_errors_by_position(errors)
        self._dedup_errors_by_span(errors)
        return errors, layers_applied
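The registry-driven loop in the middle of that method can be reproduced in a few lines. This is a minimal sketch under stated assumptions: DetectorEntry, SKETCH_SEQUENCE, and the two toy detectors are hypothetical and stand in for the real registry and mixin methods.

```python
from typing import NamedTuple


class DetectorEntry(NamedTuple):
    """Hypothetical registry entry; only the method name is modelled here."""
    method_name: str


# The order of this tuple is the order in which detectors run.
SKETCH_SEQUENCE = (
    DetectorEntry("_detect_double_space"),
    DetectorEntry("_detect_trailing_space"),
)


class SketchDetectorsMixin:
    """Detectors append to the shared error list instead of returning values."""

    def _detect_double_space(self, text: str, errors: list[str]) -> None:
        if "  " in text:
            errors.append("double_space")

    def _detect_trailing_space(self, text: str, errors: list[str]) -> None:
        if text.endswith(" "):
            errors.append("trailing_space")


class SketchPipeline(SketchDetectorsMixin):
    def run_detectors(self, text: str) -> list[str]:
        errors: list[str] = []
        for entry in SKETCH_SEQUENCE:
            # Look the detector up by name on self, then call it.
            getattr(self, entry.method_name)(text, errors)
        return errors


found = SketchPipeline().run_detectors("a  b ")
```

Keeping the sequence in data rather than in code lets the run order be changed without touching the mixins that implement the detectors.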
Provider Architecture
Interface
class DictionaryProvider(ABC):
    """Abstract interface for dictionary data access."""

    # Syllable operations
    def is_valid_syllable(self, syllable: str) -> bool: ...
    def get_syllable_frequency(self, syllable: str) -> int: ...

    # Word operations
    def is_valid_word(self, word: str) -> bool: ...
    def get_word_frequency(self, word: str) -> int: ...
    def get_word_pos(self, word: str) -> str | None: ...

    # N-gram operations
    def get_bigram_probability(self, w1: str, w2: str) -> float: ...
    def get_trigram_probability(self, w1: str, w2: str, w3: str) -> float: ...
    def get_top_continuations(self, prev_word: str, limit: int = 20) -> list[tuple[str, float]]: ...
Implementations
class SQLiteProvider(DictionaryProvider):
    """Disk-based provider using SQLite."""

    def __init__(
        self,
        database_path: Optional[str] = None,
        cache_size: int = DEFAULT_PROVIDER_CACHE_SIZE,
        check_same_thread: bool = False,
        pos_tagger: Optional["POSTaggerBase"] = None,
        pool_min_size: Optional[int] = None,
        pool_max_size: Optional[int] = None,
        pool_timeout: Optional[float] = None,
        pool_max_connection_age: Optional[float] = None,
        sqlite_timeout: Optional[float] = None,
        cache_manager: Optional["CacheManager"] = None,
    ):
        if database_path is None:
            raise MissingDatabaseError("No database path provided.")
        self.database_path = database_path

    def is_valid_syllable(self, syllable: str) -> bool:
        with self.pool.checkout() as conn:
            cursor = conn.execute(
                "SELECT 1 FROM syllables WHERE syllable = ?",
                (syllable,)
            )
            return cursor.fetchone() is not None

class MemoryProvider(DictionaryProvider):
    """In-memory provider for high performance."""

    def __init__(
        self,
        syllables: Optional[Dict[str, int]] = None,
        words: Optional[Dict[str, int]] = None,
        bigrams: Optional[Dict[Tuple[str, str], float]] = None,
        trigrams: Optional[Dict[Tuple[str, str, str], float]] = None,
        word_pos: Optional[Dict[str, str]] = None,
    ):
        self.syllables = syllables or {}
        self.words = words or {}
        self.bigrams = bigrams or {}

    def is_valid_syllable(self, syllable: str) -> bool:
        return syllable in self.syllables
Cython Integration
Wrapper Pattern
Python wrappers import the Cython implementation when it is available and fall back to pure Python otherwise:
# normalize.py
try:
    # Try to import Cython version
    from .normalize_c import remove_zero_width_chars, reorder_myanmar_diacritics
except ImportError:
    # Pure Python fallbacks
    def remove_zero_width_chars(text: str) -> str:
        """Remove zero-width characters from text."""
        return "".join(c for c in text if c not in ZERO_WIDTH_CHARS)

    def reorder_myanmar_diacritics(text: str) -> str:
        """Reorder diacritics to canonical order."""
        # ... implementation
        return text

def normalize(
    text: str,
    form: str = "NFC",
    # ... further parameters elided
) -> str:
    """Main normalization function (public API)."""
    text = remove_zero_width_chars(text)
    text = reorder_myanmar_diacritics(text)
    # ... more normalization steps
    return text
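The import-or-fallback shape can be exercised without any compiled extension. In the sketch below, the module name _sketch_normalize_c is hypothetical and is expected to be absent, so the pure Python branch runs; the contents of ZERO_WIDTH_CHARS are also an assumption, since the excerpt above does not list them.

```python
try:
    # Hypothetical compiled module; normally missing, so the fallback is used.
    from _sketch_normalize_c import remove_zero_width_chars
    USING_COMPILED = True
except ImportError:
    USING_COMPILED = False

    # Assumed set: zero-width space, non-joiner, joiner, and BOM.
    ZERO_WIDTH_CHARS = frozenset("\u200b\u200c\u200d\ufeff")

    def remove_zero_width_chars(text: str) -> str:
        """Pure Python fallback with the same name and signature."""
        return "".join(c for c in text if c not in ZERO_WIDTH_CHARS)


# Callers use the same name whichever branch defined it.
cleaned = remove_zero_width_chars("ab\u200bcd\ufeff")
```

Because both branches bind the same name, the rest of the module never needs to know which implementation is active; a flag such as USING_COMPILED is only useful for diagnostics.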
Cython Source
# normalize_c.pyx
from cpython.unicode cimport PyUnicode_READ, PyUnicode_GET_LENGTH

cpdef str remove_zero_width_chars(str text):
    """Fast removal of zero-width characters."""
    cdef:
        Py_UCS4 c
        list result = []
        int i, n
    n = PyUnicode_GET_LENGTH(text)
    for i in range(n):
        c = text[i]
        if c not in ZERO_WIDTH_SET:
            result.append(chr(c))
    return "".join(result)
Error Handling
Graceful Degradation
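Optional components fail gracefully: if one cannot be configured or initialized, the factory logs a warning and returns None, and checking continues without it: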
class ComponentFactory:
    def create_semantic_checker(self) -> Optional[SemanticChecker]:
        """Create semantic checker with graceful degradation."""
        has_paths = self.config.semantic.model_path and self.config.semantic.tokenizer_path
        has_instances = self.config.semantic.model and self.config.semantic.tokenizer
        if not (has_paths or has_instances):
            return None
        try:
            checker = SemanticChecker(
                model_path=self.config.semantic.model_path,
                tokenizer_path=self.config.semantic.tokenizer_path,
            )
            self.logger.info("SemanticChecker initialized")
            return checker
        except Exception as e:
            # Log and continue without semantic checking
            self.logger.warning(
                f"SemanticChecker failed to initialize: {e}. "
                "Continuing without semantic checking."
            )
            return None
Exception Hierarchy
class MyanmarSpellcheckError(Exception):
    """Base exception for all spell checker errors."""
class ConfigurationError(MyanmarSpellcheckError): ...
class InvalidConfigError(ConfigurationError): ...
class DataLoadingError(MyanmarSpellcheckError): ...
class MissingDatabaseError(DataLoadingError): ...
class ProcessingError(MyanmarSpellcheckError): ...
class ValidationError(ProcessingError): ...
class TokenizationError(ProcessingError): ...
class NormalizationError(ProcessingError): ...
class ProviderError(MyanmarSpellcheckError): ...
class ConnectionPoolError(ProviderError): ...
class PipelineError(MyanmarSpellcheckError): ...
class IngestionError(PipelineError): ...
class PackagingError(PipelineError): ...
class ModelError(MyanmarSpellcheckError): ...
class ModelLoadError(ModelError): ...
class InferenceError(ModelError): ...
class MissingDependencyError(MyanmarSpellcheckError): ...
class InsufficientStorageError(MyanmarSpellcheckError): ...
class CacheError(MyanmarSpellcheckError): ...
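A practical consequence of a single-rooted hierarchy is that callers can catch the base class and still see the specific subclass. The sketch below redefines three of the classes locally so it runs on its own; load_database is a hypothetical helper, not part of the library.

```python
class MyanmarSpellcheckError(Exception):
    """Local copy of the base class, for illustration only."""


class DataLoadingError(MyanmarSpellcheckError): ...


class MissingDatabaseError(DataLoadingError): ...


def load_database(path):
    """Hypothetical loader that fails the same way the provider excerpt does."""
    if path is None:
        raise MissingDatabaseError("No database path provided.")
    return path


try:
    load_database(None)
except MyanmarSpellcheckError as exc:
    # The base-class handler receives the most specific subclass instance.
    caught = type(exc).__name__
    is_data_error = isinstance(exc, DataLoadingError)
```

Handlers can therefore be as broad or as narrow as the call site needs, from the root class down to a single leaf.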
Thread Safety
Connection Pooling
class ConnectionPool:
    """Thread-safe SQLite connection pool."""

    def __init__(
        self,
        database_path: Union[str, Path],
        pool_config: Optional[ConnectionPoolConfig] = None,
    ):
        self.database_path = Path(database_path)
        self.pool_config = pool_config or ConnectionPoolConfig()

    @contextmanager
    def checkout(self):
        """Get a connection from the pool."""
        conn = self._get_connection()
        try:
            yield conn
        finally:
            self._return_connection(conn)
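The excerpt leaves _get_connection and _return_connection out. One common way to fill them in is a bounded queue of open connections, sketched below with the standard library only. SketchPool, its size and timeout parameters, and the temporary database are assumptions for illustration and may differ from the library's actual pool.

```python
import queue
import sqlite3
import tempfile
from contextlib import contextmanager
from pathlib import Path


class SketchPool:
    """Fixed-size pool; queue.Queue provides the thread-safe hand-off."""

    def __init__(self, database_path, size: int = 2, timeout: float = 5.0):
        self._timeout = timeout
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets whichever thread checks the
            # connection out use it; the pool allows one user at a time.
            conn = sqlite3.connect(str(database_path), check_same_thread=False)
            self._idle.put(conn)

    @contextmanager
    def checkout(self):
        conn = self._idle.get(timeout=self._timeout)  # queue.Empty on timeout
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back

    def close(self) -> None:
        while not self._idle.empty():
            self._idle.get_nowait().close()


db_path = Path(tempfile.mkdtemp()) / "sketch.db"
pool = SketchPool(db_path)
with pool.checkout() as conn:
    conn.execute("CREATE TABLE syllables (syllable TEXT PRIMARY KEY)")
    conn.execute("INSERT INTO syllables VALUES (?)", ("ka",))
    conn.commit()
with pool.checkout() as conn:
    row = conn.execute(
        "SELECT 1 FROM syllables WHERE syllable = ?", ("ka",)
    ).fetchone()
pool.close()
```

The try/finally inside checkout() is what guarantees a connection goes back to the pool even when the caller's query raises.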
Eager Initialization via ComponentFactory
SpellChecker.__init__ creates all components eagerly via ComponentFactory.create_all().
There are no lazy properties for core components:
class SpellChecker:
    def __init__(
        self,
        config: Optional[SpellCheckerConfig] = None,
        segmenter: Optional[Segmenter] = None,
        provider: Optional[DictionaryProvider] = None,
        syllable_validator: Optional[SyllableValidator] = None,
        word_validator: Optional[WordValidator] = None,
        context_validator: Optional[ContextValidator] = None,
        factory: Optional[ComponentFactoryProtocol] = None,
    ):
        # Resolve config, provider, segmenter...
        self._factory = factory or ComponentFactory(self.config)
        components = self._factory.create_all(self.provider, self.segmenter)
        # All components created eagerly
        self.syllable_validator = syllable_validator or components["syllable_validator"]
        self.word_validator = word_validator or components["word_validator"]
        self.context_validator = context_validator or components["context_validator"]
Lazy imports are used at the module level (e.g., TYPE_CHECKING guards and imports
deferred into method bodies) to avoid circular imports and heavy dependencies, but
component instances are created eagerly during __init__.
Caching
Caching is implemented at the provider level, not the validator level. ComponentFactory
creates cached wrapper objects around the provider using LRU caches:
# ComponentFactory.create_cached_sources() wraps provider lookups
cached_sources = {
    "dictionary": CachedDictionaryLookup(provider, syllable_cache_size=..., word_cache_size=...),
    "frequency": CachedFrequencySource(provider, cache_size=...),
    "bigram": CachedBigramSource(provider, cache_size=...),
    "trigram": CachedTrigramSource(provider, cache_size=...),
}
Cache sizes are configured via SpellCheckerConfig.cache (AlgorithmCacheConfig).
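A cached wrapper of this kind can be approximated with functools.lru_cache from the standard library. The sketch below is an assumption about the general shape, not the library's implementation: SketchProvider and SketchCachedFrequency are hypothetical, and the lookup counter exists only to make the caching visible.

```python
from functools import lru_cache


class SketchProvider:
    """Toy provider that counts how often the backing store is hit."""

    def __init__(self, words: dict[str, int]) -> None:
        self._words = words
        self.lookups = 0

    def get_word_frequency(self, word: str) -> int:
        self.lookups += 1
        return self._words.get(word, 0)


class SketchCachedFrequency:
    """Wraps one provider method in an LRU cache owned by this instance."""

    def __init__(self, provider: SketchProvider, cache_size: int = 1024) -> None:
        # Wrapping the bound method keeps the cache per wrapper object.
        self._lookup = lru_cache(maxsize=cache_size)(provider.get_word_frequency)

    def get_word_frequency(self, word: str) -> int:
        return self._lookup(word)


provider = SketchProvider({"hello": 42})
cached = SketchCachedFrequency(provider)
first = cached.get_word_frequency("hello")
second = cached.get_word_frequency("hello")  # served from the cache
```

Putting the cache in a wrapper rather than in the provider keeps provider implementations simple and lets each lookup type have its own size limit.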