System Design

The system uses a layered architecture with pluggable components connected through factories, the Builder pattern, and strategy interfaces.

Component Architecture

Core Components

The core components are organized in layers: Configuration (Builder, Factory) → Validators (Syllable, Word, Context) → Algorithms (SymSpell, N-gram, POS, Semantic) → Infrastructure (Segmenter, Provider, Normalizer).

Component Responsibilities

Component	Responsibility
SpellChecker	Orchestrate validation, manage lifecycle
Configuration	Store settings, validation levels, thresholds
Builder	Fluent construction of SpellChecker
ComponentFactory	Create and wire components
SyllableValidator	Layer 1: syllable-level validation
WordValidator	Layer 2: word-level validation
ContextValidator	Layer 3: context validation via strategy pattern (created by `ContextValidatorFactory`; orchestrates Tone, Orthography, Syntactic, POS Sequence, Question Structure, Homophone, N-gram Context, Error Detection, and Semantic strategies)
SymSpell	Suggestion generation via precomputed deletes (lookup cost is largely independent of dictionary size)
NgramContextChecker	Context probability calculation (created by `ContextCheckerFactory`, distinct from `ContextValidatorFactory`)
POSTagger	Part-of-speech tagging
SemanticChecker	Deep context analysis
Segmenter	Text segmentation
Provider	Dictionary data access
Normalizer	Text normalization
Response	Result dataclass (text, errors, metadata)

Design Patterns

Builder Pattern

SpellCheckerBuilder provides fluent construction:

class SpellCheckerBuilder:
    def __init__(self):
        self._config = SpellCheckerConfig()
        self._provider = None
        self._segmenter = None

    def with_config(self, config: SpellCheckerConfig) -> "SpellCheckerBuilder":
        self._config = config
        return self

    def with_provider(self, provider: DictionaryProvider) -> "SpellCheckerBuilder":
        self._provider = provider
        return self

    def with_segmenter(self, segmenter: Segmenter) -> "SpellCheckerBuilder":
        self._segmenter = segmenter
        return self

    def build(self) -> SpellChecker:
        # Resolve provider (auto-detect SQLiteProvider or fallback to MemoryProvider)
        provider = self._provider
        if provider is None:
            try:
                provider = SQLiteProvider(
                    database_path=self._config.provider_config.database_path,
                    cache_size=self._config.provider_config.cache_size,
                )
            except MissingDatabaseError:
                if self._config.fallback_to_empty_provider:
                    provider = MemoryProvider()
                else:
                    raise

        # Resolve segmenter
        segmenter = self._segmenter or DefaultSegmenter(
            word_engine=self._config.word_engine,
        )

        # SpellChecker.__init__ uses ComponentFactory internally
        return SpellChecker(
            config=self._config,
            provider=provider,
            segmenter=segmenter,
        )
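
The fluent style above depends on each with_* method returning self. The following is a minimal, self-contained sketch of that shape; DemoConfig, DemoChecker, and DemoBuilder are hypothetical stand-ins, not the library's classes, and the default provider and segmenter used here are placeholders chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DemoConfig:
    """Hypothetical stand-in for SpellCheckerConfig."""
    word_engine: str = "default"


@dataclass
class DemoChecker:
    """Hypothetical stand-in for SpellChecker."""
    config: DemoConfig
    provider: object
    segmenter: object


class DemoBuilder:
    """Mirrors the fluent shape of SpellCheckerBuilder using stub types."""

    def __init__(self) -> None:
        self._config = DemoConfig()
        self._provider = None
        self._segmenter = None

    def with_config(self, config: DemoConfig) -> "DemoBuilder":
        self._config = config
        return self  # returning self is what makes chaining possible

    def with_provider(self, provider: object) -> "DemoBuilder":
        self._provider = provider
        return self

    def build(self) -> DemoChecker:
        # Fall back to placeholder defaults for anything not supplied.
        provider = self._provider if self._provider is not None else {}
        segmenter = self._segmenter if self._segmenter is not None else str.split
        return DemoChecker(self._config, provider, segmenter)


checker = (
    DemoBuilder()
    .with_config(DemoConfig(word_engine="custom"))
    .with_provider({"hello": 1})
    .build()
)
```

Because defaults are resolved only in build(), the order of the with_* calls does not matter.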

Factory Pattern

ComponentFactory creates configured components. Its constructor takes only the config, not the provider; the provider and segmenter are passed to create_all():

class ComponentFactory:
    def __init__(self, config: SpellCheckerConfig):
        self.config = config

    def create_all(self, provider, segmenter) -> dict[str, Any]:
        """Create all SpellChecker components in proper dependency order.

        Returns a dict with keys:
            syllable_validator, word_validator, context_validator,
            viterbi_tagger, joint_segment_tagger, syntactic_rule_checker,
            semantic_checker, context_checker, name_heuristic, phonetic_hasher
        """
        # 1. Create algorithm components (SymSpell, ViterbiTagger, etc.)
        phonetic_hasher = self.create_phonetic_hasher()
        symspell = self.create_symspell(provider, phonetic_hasher)
        # ... POS probabilities, context checker, etc.

        # 2. Create validators with their actual constructor signatures
        syllable_validator = SyllableValidator(
            config=self.config,
            segmenter=segmenter,
            repository=provider,          # SyllableRepository interface
            symspell=symspell,
            syllable_rule_validator=SyllableRuleValidator(),
        )

        word_validator = WordValidator(
            config=self.config,
            segmenter=segmenter,
            word_repository=provider,     # WordRepository interface
            syllable_repository=provider, # SyllableRepository interface
            symspell=symspell,
            context_checker=context_checker,
            suggestion_strategy=suggestion_strategy,
        )

        # ContextValidator uses the Strategy Pattern -- grammar checking
        # (SyntacticRuleChecker) is one of several strategies, not a
        # separate validator.
        context_validator = ContextValidator(
            config=self.config,
            segmenter=segmenter,
            strategies=[...],  # List of ValidationStrategy instances
            name_heuristic=name_heuristic,
        )

        return {
            "syllable_validator": syllable_validator,
            "word_validator": word_validator,
            "context_validator": context_validator,
            # ... other components
        }

Note: SyntacticRuleChecker is not a separate validator. It is wrapped as a SyntacticValidationStrategy and passed into ContextValidator.strategies. The ContextValidator orchestrates the 9 strategies wired by ComponentFactory: Tone (10), Orthography (15), Syntactic (20), POS Sequence (30), Question Structure (40), Homophone (45), N-gram Context (50), Error Detection (65), and Semantic (70).
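
One way to picture the orchestration is a loop over strategies sorted by a numeric priority. The sketch below assumes that the numbers in parentheses above are priorities and that lower values run first; neither is confirmed by this page. DemoStrategy and run_strategies are hypothetical names, and the toy checks stand in for the real strategies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DemoStrategy:
    """Hypothetical strategy record: a name, a priority, and a check function."""
    name: str
    priority: int
    check: Callable[[list[str]], list[str]]


def run_strategies(strategies: list[DemoStrategy], words: list[str]) -> list[str]:
    """Run every strategy in ascending priority order and collect messages."""
    errors: list[str] = []
    for strategy in sorted(strategies, key=lambda s: s.priority):
        errors.extend(f"{strategy.name}: {msg}" for msg in strategy.check(words))
    return errors


# Registered out of order on purpose; sorting restores the assumed run order.
strategies = [
    DemoStrategy("semantic", 70, lambda words: []),
    DemoStrategy("tone", 10, lambda words: ["bad tone"] if "x" in words else []),
    DemoStrategy("syntactic", 20, lambda words: ["bad order"] if words[:1] == ["y"] else []),
]

result = run_strategies(strategies, ["y", "x"])
```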

Strategy Pattern

Pluggable components implement common interfaces:

# Provider Strategy
class DictionaryProvider(ABC):
    def is_valid_syllable(self, syllable: str) -> bool: ...
    def is_valid_word(self, word: str) -> bool: ...
    def get_word_frequency(self, word: str) -> int: ...
    def get_bigram_probability(self, w1: str, w2: str) -> float: ...

# Segmenter Strategy
class Segmenter(ABC):
    def segment_syllables(self, text: str) -> list[str]: ...
    def segment_words(self, text: str) -> list[str]: ...

# POS Tagger Strategy
class POSTaggerBase(ABC):
    def tag_word(self, word: str) -> str: ...
    def tag_sequence(self, words: list[str]) -> list[str]: ...
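
To illustrate how a pluggable component might be supplied, here is a toy segmenter written against a simplified re-declaration of the Segmenter interface. DemoSegmenter and WhitespaceSegmenter are hypothetical; the @abstractmethod decorators are an assumption added for the sketch, and treating each character as a "syllable" is a placeholder, not real syllable segmentation.

```python
from abc import ABC, abstractmethod


class DemoSegmenter(ABC):
    """Simplified re-declaration of the Segmenter interface for this sketch."""

    @abstractmethod
    def segment_syllables(self, text: str) -> list[str]: ...

    @abstractmethod
    def segment_words(self, text: str) -> list[str]: ...


class WhitespaceSegmenter(DemoSegmenter):
    """Toy implementation: words are whitespace-separated tokens."""

    def segment_syllables(self, text: str) -> list[str]:
        # Placeholder rule: every non-space character counts as one unit.
        return [ch for ch in text if not ch.isspace()]

    def segment_words(self, text: str) -> list[str]:
        return text.split()


segmenter = WhitespaceSegmenter()
words = segmenter.segment_words("ab  cd")
```

Any object that satisfies the same two methods could be passed wherever a segmenter is expected, which is the point of the strategy interfaces.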

Chain of Responsibility

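Validators form a chain: each layer runs in order and appends its errors to a shared list, and higher layers run only when the requested validation level allows it: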

class SpellChecker:
    def check(self, text: str, level: ValidationLevel = ValidationLevel.SYLLABLE) -> Response:
        normalized = self.normalizer.normalize(text)

        all_errors = []

        # Layer 1: Syllable validation (always runs)
        syllable_errors = self._syllable_validator.validate(normalized)
        all_errors.extend(syllable_errors)

        # Layer 2: Word validation (runs only when level is WORD; context validation is not shown here)
        if level == ValidationLevel.WORD:
            word_errors = self._word_validator.validate(normalized)
            all_errors.extend(word_errors)

        return Response(text=text, errors=all_errors, ...)
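
The level gating can be sketched with an ordered enum, so that a comparison decides which layers run. This is an illustration under assumptions: DemoLevel is a hypothetical IntEnum (the real ValidationLevel may not be ordered this way), and the two validators are toy stand-ins that only show how errors accumulate across layers.

```python
from enum import IntEnum


class DemoLevel(IntEnum):
    """Hypothetical ordered levels; the real ValidationLevel may differ."""
    SYLLABLE = 1
    WORD = 2


KNOWN_WORDS = {"good", "text"}  # toy dictionary for the example


def validate_syllables(tokens: list[str]) -> list[str]:
    # Layer 1 stand-in: flag tokens that contain non-letter characters.
    return [f"syllable:{t}" for t in tokens if not t.isalpha()]


def validate_words(tokens: list[str]) -> list[str]:
    # Layer 2 stand-in: flag well-formed tokens missing from the dictionary.
    return [f"word:{t}" for t in tokens if t.isalpha() and t not in KNOWN_WORDS]


def check(text: str, level: DemoLevel = DemoLevel.SYLLABLE) -> list[str]:
    tokens = text.split()
    errors = validate_syllables(tokens)   # always runs
    if level >= DemoLevel.WORD:           # gated by the requested level
        errors.extend(validate_words(tokens))
    return errors


syllable_only = check("good txet t3xt")
with_words = check("good txet t3xt", DemoLevel.WORD)
```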

Provider Architecture

Interface

class DictionaryProvider(ABC):
    """Abstract interface for dictionary data access."""

    # Syllable operations
    def is_valid_syllable(self, syllable: str) -> bool: ...
    def get_syllable_frequency(self, syllable: str) -> int: ...

    # Word operations
    def is_valid_word(self, word: str) -> bool: ...
    def get_word_frequency(self, word: str) -> int: ...
    def get_word_pos(self, word: str) -> str | None: ...

    # N-gram operations
    def get_bigram_probability(self, w1: str, w2: str) -> float: ...
    def get_trigram_probability(self, w1: str, w2: str, w3: str) -> float: ...
    def get_top_continuations(self, prefix: str, limit: int) -> list[str]: ...

    # Lifecycle
    def close(self) -> None: ...

Implementations

class SQLiteProvider(DictionaryProvider):
    """Disk-based provider using SQLite."""

    def __init__(
        self,
        database_path: Optional[str] = None,
        cache_size: int = DEFAULT_PROVIDER_CACHE_SIZE,
        check_same_thread: bool = False,
        pos_tagger: Optional["POSTaggerBase"] = None,
        pool_min_size: int = 1,
        pool_max_size: int = 5,
        pool_timeout: float = 5.0,
        pool_max_connection_age: float = 3600.0,
        sqlite_timeout: float = 30.0,
        cache_manager: Optional["CacheManager"] = None,
    ):
        if database_path is None:
            raise MissingDatabaseError("No database path provided.")
        self.database_path = database_path

    def is_valid_syllable(self, syllable: str) -> bool:
        with self.pool.checkout() as conn:
            cursor = conn.execute(
                "SELECT 1 FROM syllables WHERE syllable = ?",
                (syllable,)
            )
            return cursor.fetchone() is not None


class MemoryProvider(DictionaryProvider):
    """In-memory provider for high performance."""

    def __init__(
        self,
        syllables: Optional[Dict[str, int]] = None,
        words: Optional[Dict[str, int]] = None,
        bigrams: Optional[Dict[Tuple[str, str], float]] = None,
        trigrams: Optional[Dict[Tuple[str, str, str], float]] = None,
        word_pos: Optional[Dict[str, str]] = None,
    ):
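        # Store every lookup table; missing arguments default to empty dicts.
        self.syllables = syllables or {}
        self.words = words or {}
        self.bigrams = bigrams or {}
        self.trigrams = trigrams or {}
        self.word_pos = word_pos or {}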

    def is_valid_syllable(self, syllable: str) -> bool:
        return syllable in self.syllables

Cython Integration

Wrapper Pattern

Python wrappers import the Cython implementation when it is available and fall back to pure Python otherwise:

# normalize.py
try:
    # Try to import Cython version
    from .normalize_c import remove_zero_width_chars, reorder_myanmar_diacritics
except ImportError:
    # Pure Python fallbacks
    def remove_zero_width_chars(text: str) -> str:
        """Remove zero-width characters from text."""
        return "".join(c for c in text if c not in ZERO_WIDTH_CHARS)

    def reorder_myanmar_diacritics(text: str) -> str:
        """Reorder diacritics to canonical order."""
        # ... implementation
        return text

def normalize(text: str, form: str = "NFC", ...) -> str:
    """Main normalization function (public API)."""
    text = remove_zero_width_chars(text)
    text = reorder_myanmar_diacritics(text)
    # ... more normalization steps
    return text
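
The import-with-fallback pattern can be tried in isolation. In this sketch the compiled module name _demo_fast_normalize is hypothetical and is not expected to exist, so the pure Python branch runs. The contents of ZERO_WIDTH_CHARS are an assumption for illustration; the library's actual set may differ.

```python
# Assumed set for the example: ZWSP, ZWNJ, ZWJ, and the BOM / ZWNBSP.
ZERO_WIDTH_CHARS = frozenset({"\u200b", "\u200c", "\u200d", "\ufeff"})

try:
    # Hypothetical compiled extension; absent here, so ImportError is raised.
    from _demo_fast_normalize import remove_zero_width_chars
    USING_COMPILED = True
except ImportError:
    USING_COMPILED = False

    def remove_zero_width_chars(text: str) -> str:
        """Pure Python fallback with the same signature as the fast version."""
        return "".join(c for c in text if c not in ZERO_WIDTH_CHARS)


cleaned = remove_zero_width_chars("a\u200bb\ufeffc")
```

Callers use remove_zero_width_chars without knowing which branch defined it, so both versions must keep the same signature and behavior.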

Cython Source

# normalize_c.pyx
from cpython.unicode cimport PyUnicode_READ, PyUnicode_GET_LENGTH

cpdef str remove_zero_width_chars(str text):
    """Fast removal of zero-width characters."""
    cdef:
        Py_UCS4 c
        list result = []
        int i, n

    n = PyUnicode_GET_LENGTH(text)
    for i in range(n):
        c = text[i]
        if c not in ZERO_WIDTH_SET:
            result.append(chr(c))

    return "".join(result)

Error Handling

Graceful Degradation

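Optional components fail gracefully: if initialization fails, the factory logs a warning and returns None so the rest of the pipeline keeps working: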

class ComponentFactory:
    def create_semantic_checker(self) -> Optional[SemanticChecker]:
        """Create semantic checker with graceful degradation."""
        has_paths = self.config.semantic.model_path and self.config.semantic.tokenizer_path
        has_instances = self.config.semantic.model and self.config.semantic.tokenizer
        if not (has_paths or has_instances):
            return None

        try:
            checker = SemanticChecker(
                model_path=self.config.semantic.model_path,
                tokenizer_path=self.config.semantic.tokenizer_path,
            )
            self.logger.info("SemanticChecker initialized")
            return checker
        except Exception as e:
            # Log and continue without semantic checking
            self.logger.warning(
                f"SemanticChecker failed to initialize: {e}. "
                "Continuing without semantic checking."
            )
            return None
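
The same idea generalizes to any optional component. Below is a small sketch, assuming a hypothetical helper create_optional that wraps a zero-argument build function; it is not part of ComponentFactory, and the logger name is arbitrary.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("demo.factory")


def create_optional(name: str, build: Callable[[], T]) -> Optional[T]:
    """Return the built component, or None if construction raises."""
    try:
        return build()
    except Exception as exc:
        # Log and continue: the caller treats None as "feature disabled".
        logger.warning("%s failed to initialize: %s. Continuing without it.", name, exc)
        return None


def failing_build() -> str:
    raise RuntimeError("model file missing")


semantic = create_optional("SemanticChecker", failing_build)
tagger = create_optional("Tagger", lambda: "tagger-instance")

# Downstream code keeps only the components that actually initialized.
active = [component for component in (semantic, tagger) if component is not None]
```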

Exception Hierarchy

class MyanmarSpellcheckError(Exception):
    """Base exception for all spell checker errors."""

class ConfigurationError(MyanmarSpellcheckError): ...
class InvalidConfigError(ConfigurationError): ...

class DataLoadingError(MyanmarSpellcheckError): ...
class MissingDatabaseError(DataLoadingError): ...

class ProcessingError(MyanmarSpellcheckError): ...
class ValidationError(ProcessingError): ...
class TokenizationError(ProcessingError): ...
class NormalizationError(ProcessingError): ...

class ProviderError(MyanmarSpellcheckError): ...
class ConnectionPoolError(ProviderError): ...

class PipelineError(MyanmarSpellcheckError): ...
class IngestionError(PipelineError): ...
class PackagingError(PipelineError): ...

class ModelError(MyanmarSpellcheckError): ...
class ModelLoadError(ModelError): ...
class InferenceError(ModelError): ...

class MissingDependencyError(MyanmarSpellcheckError): ...
class InsufficientStorageError(MyanmarSpellcheckError): ...
class CacheError(MyanmarSpellcheckError): ...
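
A single shared base class lets callers choose how specific their handlers are. The sketch below re-declares a small subset of the hierarchy so it runs on its own; classify is a hypothetical helper that shows why the more specific except clauses must come first.

```python
class MyanmarSpellcheckError(Exception):
    """Base exception (re-declared here so the sketch is self-contained)."""


class DataLoadingError(MyanmarSpellcheckError): ...
class MissingDatabaseError(DataLoadingError): ...
class ProcessingError(MyanmarSpellcheckError): ...
class TokenizationError(ProcessingError): ...


def classify(exc: Exception) -> str:
    """Report which handler catches exc; most specific handlers come first."""
    try:
        raise exc
    except MissingDatabaseError:
        return "missing-database"
    except DataLoadingError:
        return "data-loading"
    except MyanmarSpellcheckError:
        return "library-error"
    except Exception:
        return "unrelated"


labels = [
    classify(MissingDatabaseError("no db")),
    classify(DataLoadingError("corrupt file")),
    classify(TokenizationError("bad input")),
    classify(ValueError("not ours")),
]
```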

Thread Safety

Connection Pooling

class ConnectionPool:
    """Thread-safe SQLite connection pool."""

    def __init__(
        self,
        database_path: Union[str, Path],
        pool_config: Optional[ConnectionPoolConfig] = None,
    ):
        self.database_path = Path(database_path)
        self.pool_config = pool_config or ConnectionPoolConfig()

    @contextmanager
    def checkout(self):
        """Get a connection from the pool."""
        conn = self._get_connection()
        try:
            yield conn
        finally:
            self._return_connection(conn)
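
The excerpt above does not show how _get_connection and _return_connection work. One common way to build such a pool is a blocking queue of pre-opened connections, sketched here. DemoPool is hypothetical and far simpler than a production pool: it has a fixed size and does not model minimum and maximum sizes or connection age.

```python
import os
import queue
import sqlite3
import tempfile
import threading
from contextlib import contextmanager


class DemoPool:
    """Fixed-size pool: idle connections wait in a thread-safe queue."""

    def __init__(self, database_path: str, size: int = 2, timeout: float = 5.0):
        self._timeout = timeout
        self._idle: queue.Queue = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets another thread use the connection.
            self._idle.put(sqlite3.connect(database_path, check_same_thread=False))

    @contextmanager
    def checkout(self):
        conn = self._idle.get(timeout=self._timeout)  # raises queue.Empty on timeout
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back

    def close(self) -> None:
        while not self._idle.empty():
            self._idle.get_nowait().close()


db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
pool = DemoPool(db_path, size=2)

with pool.checkout() as conn:
    conn.execute("CREATE TABLE syllables (syllable TEXT PRIMARY KEY)")
    conn.execute("INSERT INTO syllables VALUES (?)", ("ka",))
    conn.commit()

results = []


def lookup(syllable: str) -> None:
    with pool.checkout() as conn:
        row = conn.execute(
            "SELECT 1 FROM syllables WHERE syllable = ?", (syllable,)
        ).fetchone()
        results.append((syllable, row is not None))


threads = [threading.Thread(target=lookup, args=(s,)) for s in ("ka", "zz", "ka", "zz")]
for t in threads:
    t.start()
for t in threads:
    t.join()
pool.close()
```

Four threads share two connections here; a thread that finds the queue empty blocks until another thread returns a connection or the timeout expires.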

Performance Considerations

Eager Initialization via ComponentFactory

SpellChecker.__init__ creates all components eagerly via ComponentFactory.create_all(). There are no lazy properties for core components:

class SpellChecker:
    def __init__(
        self,
        config: Optional[SpellCheckerConfig] = None,
        segmenter: Optional[Segmenter] = None,
        provider: Optional[DictionaryProvider] = None,
        syllable_validator: Optional[SyllableValidator] = None,
        word_validator: Optional[WordValidator] = None,
        context_validator: Optional[ContextValidator] = None,
        factory: Optional[ComponentFactoryProtocol] = None,
    ):
        # Resolve config, provider, segmenter...
        self._factory = factory or ComponentFactory(self.config)
        components = self._factory.create_all(self.provider, self.segmenter)

        # All components created eagerly
        self.syllable_validator = syllable_validator or components["syllable_validator"]
        self.word_validator = word_validator or components["word_validator"]
        self.context_validator = context_validator or components["context_validator"]

Lazy imports (e.g., TYPE_CHECKING guards and deferred imports inside methods) are used to avoid circular imports and to defer heavy dependencies, but component instances are created eagerly during __init__.

Caching

Caching is implemented at the provider level, not the validator level. ComponentFactory creates cached wrapper objects around the provider using LRU caches:

# ComponentFactory.create_cached_sources() wraps provider lookups
cached_sources = {
    "dictionary": CachedDictionaryLookup(provider, syllable_cache_size=..., word_cache_size=...),
    "frequency": CachedFrequencySource(provider, cache_size=...),
    "bigram": CachedBigramSource(provider, cache_size=...),
    "trigram": CachedTrigramSource(provider, cache_size=...),
}

Cache sizes are configured via SpellCheckerConfig.cache (AlgorithmCacheConfig).
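
The internals of the cached wrappers are not shown on this page. As a rough sketch of provider-level caching, the example below wraps one provider method with functools.lru_cache. CountingProvider and DemoCachedFrequencySource are hypothetical, and the real CachedFrequencySource may be implemented differently.

```python
from functools import lru_cache


class CountingProvider:
    """Toy provider that counts how often the backing store is hit."""

    def __init__(self, words: dict[str, int]) -> None:
        self._words = words
        self.calls = 0

    def get_word_frequency(self, word: str) -> int:
        self.calls += 1
        return self._words.get(word, 0)


class DemoCachedFrequencySource:
    """Wraps a provider lookup in a per-instance LRU cache."""

    def __init__(self, provider: CountingProvider, cache_size: int = 1024) -> None:
        # Wrapping the bound method gives each wrapper its own cache.
        self._lookup = lru_cache(maxsize=cache_size)(provider.get_word_frequency)

    def get(self, word: str) -> int:
        return self._lookup(word)

    def cache_info(self):
        return self._lookup.cache_info()


provider = CountingProvider({"hello": 42})
source = DemoCachedFrequencySource(provider, cache_size=2)

first = source.get("hello")   # miss: reaches the provider
second = source.get("hello")  # hit: served from the cache
```

Keeping the cache in a wrapper leaves the provider itself unaware of caching, so the same provider can be used with or without it.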

Next Steps

Validation Pipeline - Pipeline details
Extension Points - How to extend
API Reference - Complete API docs
