The system uses a layered architecture with pluggable components connected through factories, the Builder pattern, and strategy interfaces.

Component Architecture

Core Components

The core components are organized in layers: Configuration (Builder, Factory) → Validators (Syllable, Word, Context) → Algorithms (SymSpell, N-gram, POS, Semantic) → Infrastructure (Segmenter, Provider, Normalizer).

Component Responsibilities

| Component | Responsibility |
| --- | --- |
| SpellChecker | Orchestrate validation, manage lifecycle |
| Configuration | Store settings, validation levels, thresholds |
| Builder | Fluent construction of SpellChecker |
| ComponentFactory | Create and wire components |
| SyllableValidator | Layer 1: syllable-level validation |
| WordValidator | Layer 2: word-level validation |
| ContextValidator | Layer 3: context validation via strategy pattern (created by ContextValidatorFactory; orchestrates Tone, Orthography, Syntactic, POS Sequence, Question Structure, Homophone, N-gram Context, Error Detection, and Semantic strategies) |
| SymSpell | O(1) suggestion generation |
| NgramContextChecker | Context probability calculation (created by ContextCheckerFactory, distinct from ContextValidatorFactory) |
| POSTagger | Part-of-speech tagging |
| SemanticChecker | Deep context analysis |
| Segmenter | Text segmentation |
| Provider | Dictionary data access |
| Normalizer | Text normalization |
| Response | Result dataclass (text, errors, metadata) |

Design Patterns

Builder Pattern

SpellCheckerBuilder provides fluent construction:
class SpellCheckerBuilder:
    def __init__(self):
        self._config = SpellCheckerConfig()
        self._provider = None
        self._segmenter = None

    def with_config(self, config: SpellCheckerConfig) -> "SpellCheckerBuilder":
        self._config = config
        return self

    def with_provider(self, provider: DictionaryProvider) -> "SpellCheckerBuilder":
        self._provider = provider
        return self

    def with_segmenter(self, segmenter: Segmenter) -> "SpellCheckerBuilder":
        self._segmenter = segmenter
        return self

    def build(self) -> SpellChecker:
        # Resolve provider (auto-detect SQLiteProvider or fallback to MemoryProvider)
        provider = self._provider
        if provider is None:
            try:
                provider = SQLiteProvider(
                    database_path=self._config.provider_config.database_path,
                    cache_size=self._config.provider_config.cache_size,
                )
            except MissingDatabaseError:
                if self._config.fallback_to_empty_provider:
                    provider = MemoryProvider()
                else:
                    raise

        # Resolve segmenter
        segmenter = self._segmenter or DefaultSegmenter(
            word_engine=self._config.word_engine,
        )

        # SpellChecker.__init__ uses ComponentFactory internally
        return SpellChecker(
            config=self._config,
            provider=provider,
            segmenter=segmenter,
        )

Factory Pattern

ComponentFactory creates configured components. It takes only config (not provider) — the provider and segmenter are passed to create_all():
class ComponentFactory:
    def __init__(self, config: SpellCheckerConfig):
        self.config = config

    def create_all(self, provider, segmenter) -> dict[str, Any]:
        """Create all SpellChecker components in proper dependency order.

        Returns a dict with keys:
            syllable_validator, word_validator, context_validator,
            viterbi_tagger, joint_segment_tagger, syntactic_rule_checker,
            semantic_checker, context_checker, name_heuristic, phonetic_hasher
        """
        # 1. Create algorithm components (SymSpell, ViterbiTagger, etc.)
        phonetic_hasher = self.create_phonetic_hasher()
        symspell = self.create_symspell(provider, phonetic_hasher)
        # ... POS probabilities, context checker, etc.

        # 2. Create validators with their actual constructor signatures
        syllable_validator = SyllableValidator(
            config=self.config,
            segmenter=segmenter,
            repository=provider,          # SyllableRepository interface
            symspell=symspell,
            syllable_rule_validator=SyllableRuleValidator(),
        )

        word_validator = WordValidator(
            config=self.config,
            segmenter=segmenter,
            word_repository=provider,     # WordRepository interface
            syllable_repository=provider, # SyllableRepository interface
            symspell=symspell,
            context_checker=context_checker,
            suggestion_strategy=suggestion_strategy,
        )

        # ContextValidator uses the Strategy Pattern -- grammar checking
        # (SyntacticRuleChecker) is one of several strategies, not a
        # separate validator.
        context_validator = ContextValidator(
            config=self.config,
            segmenter=segmenter,
            strategies=[...],  # List of ValidationStrategy instances
            name_heuristic=name_heuristic,
        )

        return {
            "syllable_validator": syllable_validator,
            "word_validator": word_validator,
            "context_validator": context_validator,
            # ... other components
        }
Note: SyntacticRuleChecker is not a separate validator. It is wrapped as a SyntacticValidationStrategy and passed into ContextValidator.strategies. The ContextValidator orchestrates the 9 strategies wired by ComponentFactory: Tone (10), Orthography (15), Syntactic (20), POS Sequence (30), Question Structure (40), Homophone (45), N-gram Context (50), Error Detection (65), and Semantic (70).
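The numbers in parentheses read as ascending run priorities. A minimal sketch of what a priority-ordered strategy interface could look like; the class names, `priority` attribute, and `validate` signature here are assumptions for illustration, not the library's actual API:

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of a priority-ordered validation strategy
# interface; the real ValidationStrategy API may differ.
class ValidationStrategy(ABC):
    priority: int = 0

    @abstractmethod
    def validate(self, words: list[str]) -> list[str]:
        """Return a list of error descriptions for the word sequence."""

class ToneStrategy(ValidationStrategy):
    priority = 10
    def validate(self, words: list[str]) -> list[str]:
        return []  # tone checks elided

class SemanticStrategy(ValidationStrategy):
    priority = 70
    def validate(self, words: list[str]) -> list[str]:
        return []  # semantic checks elided

def run_strategies(strategies: list[ValidationStrategy], words: list[str]) -> list[str]:
    errors: list[str] = []
    # Run strategies in ascending priority order (cheapest checks first)
    for strategy in sorted(strategies, key=lambda s: s.priority):
        errors.extend(strategy.validate(words))
    return errors
```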

Strategy Pattern

Pluggable components implement common interfaces:
# Provider Strategy
class DictionaryProvider(ABC):
    def is_valid_syllable(self, syllable: str) -> bool: ...
    def is_valid_word(self, word: str) -> bool: ...
    def get_word_frequency(self, word: str) -> int: ...
    def get_bigram_probability(self, w1: str, w2: str) -> float: ...

# Segmenter Strategy
class Segmenter(ABC):
    def segment_syllables(self, text: str) -> list[str]: ...
    def segment_words(self, text: str) -> list[str]: ...

# POS Tagger Strategy
class POSTaggerBase(ABC):
    def tag_word(self, word: str) -> str: ...
    def tag_sequence(self, words: list[str]) -> list[str]: ...
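Any of these interfaces can be swapped for a custom implementation. As a toy illustration of the plug-in mechanism, here is a whitespace-based segmenter; this is a stand-in only, since real Myanmar text has no spaces between words:

```python
from abc import ABC, abstractmethod

# Segmenter ABC reproduced so the example is self-contained.
class Segmenter(ABC):
    @abstractmethod
    def segment_syllables(self, text: str) -> list[str]: ...

    @abstractmethod
    def segment_words(self, text: str) -> list[str]: ...

# Toy stand-in: real Myanmar segmentation cannot rely on whitespace,
# so this only demonstrates how a custom strategy plugs in.
class WhitespaceSegmenter(Segmenter):
    def segment_syllables(self, text: str) -> list[str]:
        return text.split()

    def segment_words(self, text: str) -> list[str]:
        return text.split()
```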

Chain of Responsibility

Validators form a chain:
class SpellChecker:
    def check(self, text: str, level: ValidationLevel = ValidationLevel.SYLLABLE) -> Response:
        normalized = self.normalizer.normalize(text)

        all_errors = []

        # Layer 1: Syllable validation (always runs)
        syllable_errors = self._syllable_validator.validate(normalized)
        all_errors.extend(syllable_errors)

        # Layer 2+: Word and context validation (if level >= WORD)
        if level >= ValidationLevel.WORD:
            word_errors = self._word_validator.validate(normalized)
            all_errors.extend(word_errors)

        return Response(text=text, errors=all_errors, ...)
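For the `level >= ValidationLevel.WORD` comparison to work, ValidationLevel is presumably an ordered enum. A minimal sketch of that shape; the exact member names and values are an assumption:

```python
from enum import IntEnum

# Assumed shape of ValidationLevel: IntEnum members support ordered
# comparison, so a higher level implies all lower validation layers.
class ValidationLevel(IntEnum):
    SYLLABLE = 1
    WORD = 2
    CONTEXT = 3
```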

Provider Architecture

Interface

class DictionaryProvider(ABC):
    """Abstract interface for dictionary data access."""

    # Syllable operations
    def is_valid_syllable(self, syllable: str) -> bool: ...
    def get_syllable_frequency(self, syllable: str) -> int: ...

    # Word operations
    def is_valid_word(self, word: str) -> bool: ...
    def get_word_frequency(self, word: str) -> int: ...
    def get_word_pos(self, word: str) -> str | None: ...

    # N-gram operations
    def get_bigram_probability(self, w1: str, w2: str) -> float: ...
    def get_trigram_probability(self, w1: str, w2: str, w3: str) -> float: ...
    def get_top_continuations(self, prefix: str, limit: int) -> list[str]: ...

    # Lifecycle
    def close(self) -> None: ...

Implementations

class SQLiteProvider(DictionaryProvider):
    """Disk-based provider using SQLite."""

    def __init__(
        self,
        database_path: Optional[str] = None,
        cache_size: int = DEFAULT_PROVIDER_CACHE_SIZE,
        check_same_thread: bool = False,
        pos_tagger: Optional["POSTaggerBase"] = None,
        pool_min_size: int = 1,
        pool_max_size: int = 5,
        pool_timeout: float = 5.0,
        pool_max_connection_age: float = 3600.0,
        sqlite_timeout: float = 30.0,
        cache_manager: Optional["CacheManager"] = None,
    ):
        if database_path is None:
            raise MissingDatabaseError("No database path provided.")
        self.database_path = database_path
        # ... connection pool (self.pool) and cache setup elided

    def is_valid_syllable(self, syllable: str) -> bool:
        with self.pool.checkout() as conn:
            cursor = conn.execute(
                "SELECT 1 FROM syllables WHERE syllable = ?",
                (syllable,)
            )
            return cursor.fetchone() is not None


class MemoryProvider(DictionaryProvider):
    """In-memory provider for high performance."""

    def __init__(
        self,
        syllables: Optional[Dict[str, int]] = None,
        words: Optional[Dict[str, int]] = None,
        bigrams: Optional[Dict[Tuple[str, str], float]] = None,
        trigrams: Optional[Dict[Tuple[str, str, str], float]] = None,
        word_pos: Optional[Dict[str, str]] = None,
    ):
        self.syllables = syllables or {}
        self.words = words or {}
        self.bigrams = bigrams or {}
        self.trigrams = trigrams or {}
        self.word_pos = word_pos or {}

    def is_valid_syllable(self, syllable: str) -> bool:
        return syllable in self.syllables
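A handful of entries is enough to exercise the in-memory provider in tests. The sketch below uses a simplified stand-in class so it runs on its own; the sample Myanmar entries and frequencies are illustrative only:

```python
from typing import Dict, Optional

# Simplified stand-in for MemoryProvider so the example is runnable;
# the real class also accepts bigrams, trigrams, and word_pos.
class MemoryProvider:
    def __init__(
        self,
        syllables: Optional[Dict[str, int]] = None,
        words: Optional[Dict[str, int]] = None,
    ):
        self.syllables = syllables or {}
        self.words = words or {}

    def is_valid_syllable(self, syllable: str) -> bool:
        return syllable in self.syllables

    def is_valid_word(self, word: str) -> bool:
        return word in self.words

    def get_word_frequency(self, word: str) -> int:
        return self.words.get(word, 0)

# Illustrative sample data, not real corpus frequencies
provider = MemoryProvider(
    syllables={"က": 120, "လေး": 80},
    words={"ကလေး": 50},
)
```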

Cython Integration

Wrapper Pattern

Python wrappers with Cython fallback:
# normalize.py
try:
    # Try to import Cython version
    from .normalize_c import remove_zero_width_chars, reorder_myanmar_diacritics
except ImportError:
    # Pure Python fallbacks
    def remove_zero_width_chars(text: str) -> str:
        """Remove zero-width characters from text."""
        return "".join(c for c in text if c not in ZERO_WIDTH_CHARS)

    def reorder_myanmar_diacritics(text: str) -> str:
        """Reorder diacritics to canonical order."""
        # ... implementation
        return text

def normalize(text: str, form: str = "NFC", ...) -> str:
    """Main normalization function (public API)."""
    text = remove_zero_width_chars(text)
    text = reorder_myanmar_diacritics(text)
    # ... more normalization steps
    return text
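The same fallback shape can be exercised standalone. Here ZERO_WIDTH_CHARS is given an assumed definition; the library's actual set may contain additional code points:

```python
# Assumed definition of ZERO_WIDTH_CHARS for illustration; the
# library's actual character set may differ.
ZERO_WIDTH_CHARS = {"\u200b", "\u200c", "\u200d", "\ufeff"}

try:
    # Prefer the compiled Cython build when it is available
    from normalize_c import remove_zero_width_chars
except ImportError:
    def remove_zero_width_chars(text: str) -> str:
        """Pure-Python fallback: strip zero-width characters."""
        return "".join(c for c in text if c not in ZERO_WIDTH_CHARS)
```

Callers never notice which implementation they got; both names resolve to the same public function.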

Cython Source

# normalize_c.pyx
from cpython.unicode cimport PyUnicode_GET_LENGTH

cpdef str remove_zero_width_chars(str text):
    """Fast removal of zero-width characters."""
    cdef:
        Py_UCS4 c
        list result = []
        Py_ssize_t i, n

    n = PyUnicode_GET_LENGTH(text)
    for i in range(n):
        c = text[i]
        if c not in ZERO_WIDTH_SET:
            result.append(chr(c))

    return "".join(result)

Error Handling

Graceful Degradation

Components fail gracefully:
class ComponentFactory:
    def create_semantic_checker(self) -> Optional[SemanticChecker]:
        """Create semantic checker with graceful degradation."""
        has_paths = self.config.semantic.model_path and self.config.semantic.tokenizer_path
        has_instances = self.config.semantic.model and self.config.semantic.tokenizer
        if not (has_paths or has_instances):
            return None

        try:
            checker = SemanticChecker(
                model_path=self.config.semantic.model_path,
                tokenizer_path=self.config.semantic.tokenizer_path,
            )
            self.logger.info("SemanticChecker initialized")
            return checker
        except Exception as e:
            # Log and continue without semantic checking
            self.logger.warning(
                f"SemanticChecker failed to initialize: {e}. "
                "Continuing without semantic checking."
            )
            return None

Exception Hierarchy

class MyanmarSpellcheckError(Exception):
    """Base exception for all spell checker errors."""

class ConfigurationError(MyanmarSpellcheckError): ...
class InvalidConfigError(ConfigurationError): ...

class DataLoadingError(MyanmarSpellcheckError): ...
class MissingDatabaseError(DataLoadingError): ...

class ProcessingError(MyanmarSpellcheckError): ...
class ValidationError(ProcessingError): ...
class TokenizationError(ProcessingError): ...
class NormalizationError(ProcessingError): ...

class ProviderError(MyanmarSpellcheckError): ...
class ConnectionPoolError(ProviderError): ...

class PipelineError(MyanmarSpellcheckError): ...
class IngestionError(PipelineError): ...
class PackagingError(PipelineError): ...

class ModelError(MyanmarSpellcheckError): ...
class ModelLoadError(ModelError): ...
class InferenceError(ModelError): ...

class MissingDependencyError(MyanmarSpellcheckError): ...
class InsufficientStorageError(MyanmarSpellcheckError): ...
class CacheError(MyanmarSpellcheckError): ...

Thread Safety

Connection Pooling

class ConnectionPool:
    """Thread-safe SQLite connection pool."""

    def __init__(
        self,
        database_path: Union[str, Path],
        pool_config: Optional[ConnectionPoolConfig] = None,
    ):
        self.database_path = Path(database_path)
        self.pool_config = pool_config or ConnectionPoolConfig()

    @contextmanager
    def checkout(self):
        """Get a connection from the pool."""
        conn = self._get_connection()
        try:
            yield conn
        finally:
            self._return_connection(conn)
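The checkout/return cycle can be backed by a `queue.Queue`, which is itself thread-safe. A minimal sketch under that assumption; the real ConnectionPool also handles min/max sizing, checkout timeouts, and connection aging:

```python
import queue
import sqlite3
from contextlib import contextmanager

# Minimal thread-safe pool sketch; not the library's implementation.
class MiniConnectionPool:
    def __init__(self, database_path: str, size: int = 2):
        self._pool: "queue.Queue[sqlite3.Connection]" = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(
                sqlite3.connect(database_path, check_same_thread=False)
            )

    @contextmanager
    def checkout(self, timeout: float = 5.0):
        conn = self._pool.get(timeout=timeout)  # blocks until one is free
        try:
            yield conn
        finally:
            self._pool.put(conn)  # always return the connection
```

Note that each `:memory:` connection is its own database, so this pattern only demos the concurrency mechanics; a real pool points every connection at the same file.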

Performance Considerations

Eager Initialization via ComponentFactory

SpellChecker.__init__ creates all components eagerly via ComponentFactory.create_all(). There are no lazy properties for core components:
class SpellChecker:
    def __init__(
        self,
        config: Optional[SpellCheckerConfig] = None,
        segmenter: Optional[Segmenter] = None,
        provider: Optional[DictionaryProvider] = None,
        syllable_validator: Optional[SyllableValidator] = None,
        word_validator: Optional[WordValidator] = None,
        context_validator: Optional[ContextValidator] = None,
        factory: Optional[ComponentFactoryProtocol] = None,
    ):
        # Resolve config, provider, segmenter...
        self._factory = factory or ComponentFactory(self.config)
        components = self._factory.create_all(self.provider, self.segmenter)

        # All components created eagerly
        self.syllable_validator = syllable_validator or components["syllable_validator"]
        self.word_validator = word_validator or components["word_validator"]
        self.context_validator = context_validator or components["context_validator"]
Lazy imports are used at the module level (e.g., TYPE_CHECKING guards, deferred import inside methods) to avoid circular imports and heavy dependencies, but component instances are created eagerly during __init__.

Caching

Caching is implemented at the provider level, not the validator level. ComponentFactory creates cached wrapper objects around the provider using LRU caches:
# ComponentFactory.create_cached_sources() wraps provider lookups
cached_sources = {
    "dictionary": CachedDictionaryLookup(provider, syllable_cache_size=..., word_cache_size=...),
    "frequency": CachedFrequencySource(provider, cache_size=...),
    "bigram": CachedBigramSource(provider, cache_size=...),
    "trigram": CachedTrigramSource(provider, cache_size=...),
}
Cache sizes are configured via SpellCheckerConfig.cache (AlgorithmCacheConfig).
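A wrapper like CachedBigramSource can be sketched with `functools.lru_cache`; the stub provider, class internals, and sample data below are illustrative assumptions, not the library's actual code:

```python
from functools import lru_cache

# Stub standing in for DictionaryProvider; data is illustrative.
class StubProvider:
    def __init__(self):
        self.calls = 0

    def get_bigram_probability(self, w1: str, w2: str) -> float:
        self.calls += 1  # count backend hits to show caching works
        return 0.25 if (w1, w2) == ("သူ", "သွား") else 0.0

# Hypothetical sketch of a cached bigram source: wrap the provider
# lookup in an LRU cache of configurable size.
class CachedBigramSource:
    def __init__(self, provider, cache_size: int = 1024):
        self._provider = provider
        self._lookup = lru_cache(maxsize=cache_size)(
            provider.get_bigram_probability
        )

    def get_bigram_probability(self, w1: str, w2: str) -> float:
        return self._lookup(w1, w2)
```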

Next Steps