This section documents every algorithm in the library — from O(1) SymSpell lookups through Viterbi decoding to optional ONNX-based semantic models — along with their complexity characteristics and tuning parameters.

Overview

mySpellChecker uses a multi-layered approach with specialized algorithms at each level:
| Layer | Algorithm | Purpose | Complexity |
|---|---|---|---|
| Syllable | Rule-based FSM | Structure validation | O(n) |
| Word | SymSpell | Fast suggestions | O(1) |
| Grammar | POS + Rules | Syntactic validation | O(n) |
| Context | N-gram LM | Probability checking | O(1) |
| Semantic | ONNX Neural | Deep understanding | O(n) |
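The layers run in order, each contributing its own findings. A minimal sketch of that dispatch, with hypothetical stand-in layers (the library's real pipeline classes and interfaces are not shown here):

```python
from typing import Callable

# A layer takes the text and returns a list of issue descriptions.
Layer = Callable[[str], list]

def run_pipeline(text: str, layers: list) -> dict:
    """Apply each validation layer in order and collect its findings."""
    return {name: check(text) for name, check in layers}

# Toy stand-ins for the syllable and word layers.
issues = run_pipeline("helo world", [
    ("syllable", lambda t: []),                                 # structure is valid
    ("word", lambda t: [w for w in t.split() if w == "helo"]),  # flag misspelling
])
print(issues)  # {'syllable': [], 'word': ['helo']}
```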

Contents

  • Core Algorithms
  • Suggestion System
  • Context & Grammar
  • POS and Tagging
  • Text Processing
  • Entity & Pattern Recognition

Algorithm Selection

When to Use Each Algorithm

The validation pipeline processes text through these stages in order:
| Stage | Algorithm | When It Runs | What It Does |
|---|---|---|---|
| Normalization | Unicode NFC + zero-width removal | Always | Cleans input text |
| Segmentation | Rule-based syllable breaking | Always | Splits text into syllables |
| Layer 1: Syllable | FSM + dictionary lookup | Always | Validates syllable structure |
| | SymSpell | On invalid syllables | Generates correction suggestions |
| Layer 2: Word | SymSpell | ValidationLevel.WORD | Validates words, suggests fixes |
| Layer 2.5: Grammar | Viterbi POS + rule engine | ValidationLevel.WORD | Checks syntactic correctness |
| Layer 3: Context | N-gram probabilities | ValidationLevel.WORD | Checks word-in-context fitness |
| | ONNX semantic model | When enabled | Deep context understanding |
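The normalization stage can be sketched in plain Python. The exact zero-width set below is an assumption; the library's `ZERO_WIDTH` constant may cover additional code points:

```python
import unicodedata

# Assumed zero-width set (ZWSP, ZWNJ, ZWJ, word joiner, BOM).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def normalize(text: str) -> str:
    """Normalization stage: Unicode NFC, then zero-width removal (both O(n))."""
    text = unicodedata.normalize("NFC", text)
    return "".join(c for c in text if c not in ZERO_WIDTH)

print(normalize("cafe\u0301\u200bs"))  # 'cafés': accent composed, ZWSP stripped
```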

Performance Characteristics

Time Complexity

| Algorithm | Best | Average | Worst |
|---|---|---|---|
| Normalization | O(n) | O(n) | O(n) |
| Segmentation | O(n) | O(n) | O(n) |
| Syllable Validation | O(1) | O(1) | O(k)* |
| SymSpell Lookup | O(1) | O(1) | O(1) |
| Viterbi POS | O(n·B²·T) | O(n·B²·T) | O(n·B²·T)* |
| N-gram Lookup | O(1) | O(1) | O(1) |

*k = number of rule checks; n = sequence length, B = beam width, T = average tags per word (beam pruning reduces Viterbi from O(nT²)).
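SymSpell reaches O(1) average lookups by precomputing delete variants of every dictionary word, so a query needs only a handful of hash probes. A minimal sketch of the delete-index idea; the real SymSpell additionally verifies the true edit distance of each candidate and ranks by word frequency:

```python
def deletes(word: str, max_dist: int = 1) -> set:
    """All variants reachable by deleting up to max_dist characters."""
    out, frontier = {word}, {word}
    for _ in range(max_dist):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        out |= frontier
    return out

# Precompute once: delete variant -> dictionary words that produce it.
index = {}
for entry in ["hello", "help", "hell"]:
    for d in deletes(entry):
        index.setdefault(d, set()).add(entry)

def candidates(query: str) -> set:
    """Lookup cost is a few O(1) hash probes, independent of dictionary size."""
    found = set()
    for d in deletes(query):
        found |= index.get(d, set())
    return found

print(sorted(candidates("helo")))  # ['hell', 'hello', 'help']
```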

Space Complexity

| Algorithm | Memory | Notes |
|---|---|---|
| SymSpell | ~50-300 MB | Pre-computed deletes (varies with dictionary size) |
| N-gram Model | ~50 MB | Indexed probabilities |
| Viterbi | O(nT) | Dynamic programming table |
| Semantic | ~300 MB | ONNX model |

Implementation Notes

Cython Optimizations

Performance-critical algorithms are implemented in Cython:

```python
# Pure Python (normalize.py)
def remove_zero_width(text: str) -> str:
    return ''.join(c for c in text if c not in ZERO_WIDTH)
```

```cython
# Cython (normalize_c.pyx) - operates on raw UTF-8 bytes via C++ std::string
from libcpp.string cimport string
from libcpp.utility cimport pair

cpdef str remove_zero_width_chars(str text):
    cdef string s = text.encode('utf-8')
    cdef string result
    result.reserve(s.length())

    cdef int i = 0
    cdef int n = s.length()
    cdef int cp
    cdef int char_len
    cdef pair[int, int] decoded

    while i < n:
        # Decode one code point starting at byte offset i.
        decoded = decode_utf8_char(s.c_str(), i, n)
        cp = decoded.first
        char_len = decoded.second

        # Copy the bytes through unless the code point is zero-width.
        if ZERO_WIDTH_CHARS.count(cp) == 0:
            result.append(s.substr(i, char_len))

        i += char_len

    return result.decode('utf-8')
```

Cython Word Segmentation

The word segmenter (word_segment.pyx) uses Viterbi decoding with C++ unordered_map for O(1) probability lookups and memory-mapped model loading for fork-safe parallel processing:
```cython
# mmap_reader.pyx - Used by word_segment.pyx for Viterbi word segmentation
from libcpp.unordered_map cimport unordered_map
from libcpp.string cimport string

cdef class MMapSegmentationReader:
    """
    Fast memory-mapped dictionary reader.

    Provides O(1) average lookups for unigrams and bigrams.
    All lookup methods are GIL-free for maximum parallelism.
    """
    ...
```

Quick Reference

Algorithm Parameters

| Algorithm | Key Parameter | Default | Range |
|---|---|---|---|
| SymSpell | max_edit_distance | 2 | 1-3 |
| SymSpell | prefix_length | 10 | 4-10 |
| N-gram | bigram_threshold | 0.0001 | 0-1 |
| Viterbi | min_prob | 1e-10 | 1e-15 to 1e-5 |
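How `bigram_threshold` is applied can be sketched as a simple probability cutoff. The helper name and the probability values are illustrative only; the library's actual N-gram provider interface may differ:

```python
def fits_context(prev: str, word: str, bigram_p: dict,
                 threshold: float = 0.0001) -> bool:
    """Accept a word in context when P(word | prev) clears bigram_threshold."""
    return bigram_p.get((prev, word), 0.0) >= threshold

# Toy bigram probabilities (hypothetical values, not from the shipped model).
probs = {("piece", "of"): 0.31, ("peace", "of"): 0.02}
print(fits_context("piece", "of", probs))  # True
print(fits_context("of", "piece", probs))  # False: unseen bigram scores 0.0
```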

Tuning Guidelines

  • Speed priority: Use edit distance 1, disable context
  • Accuracy priority: Use edit distance 2, enable context
  • Memory constrained: Use SQLite provider, disable semantic
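The guidelines above can be expressed as configuration profiles. The parameter names follow the table above, but the flag names for context, semantic, and provider selection are assumptions, not the library's confirmed API:

```python
# Hypothetical tuning profiles for mySpellChecker.
PROFILES = {
    "speed":    {"max_edit_distance": 1, "enable_context": False, "enable_semantic": False},
    "accuracy": {"max_edit_distance": 2, "enable_context": True,  "enable_semantic": True},
    "low_mem":  {"provider": "sqlite", "enable_semantic": False},
}
print(PROFILES["speed"]["max_edit_distance"])  # 1
```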

See Also