This section documents every algorithm in the library — from O(1) SymSpell lookups through Viterbi decoding to optional ONNX-based semantic models — along with their complexity characteristics and tuning parameters.

Overview

mySpellChecker uses a multi-layered approach with specialized algorithms at each level:
| Layer | Algorithm | Purpose | Complexity |
|---|---|---|---|
| Syllable | Rule-based FSM | Structure validation | O(n) |
| Word | SymSpell | Fast suggestions | O(1) |
| Grammar | POS + Rules | Syntactic validation | O(n) |
| Context | N-gram LM | Probability checking | O(1) |
| Semantic | ONNX Neural | Deep understanding | O(n) |
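The layers run in order, each contributing its own findings. A minimal sketch of that dispatch, with hypothetical stand-in layers (the library's real pipeline classes and interfaces are not shown here):

```python
from typing import Callable

# A layer takes the text and returns a list of issue descriptions.
Layer = Callable[[str], list]

def run_pipeline(text: str, layers: list) -> dict:
    """Apply each validation layer in order and collect its findings."""
    return {name: check(text) for name, check in layers}

# Toy stand-ins for the syllable and word layers.
issues = run_pipeline("helo world", [
    ("syllable", lambda t: []),                                 # structure is valid
    ("word", lambda t: [w for w in t.split() if w == "helo"]),  # flag misspelling
])
print(issues)  # {'syllable': [], 'word': ['helo']}
```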

Contents

  • Core Algorithms
  • Suggestion System
  • Context & Grammar
  • POS and Tagging
  • Text Processing
  • Entity & Pattern Recognition

Algorithm Selection

When to Use Each Algorithm

The validation pipeline processes text through these stages in order:
| Stage | Algorithm | When It Runs | What It Does |
|---|---|---|---|
| Normalization | Unicode NFC + zero-width removal | Always | Cleans input text |
| Segmentation | Rule-based syllable breaking | Always | Splits text into syllables |
| Layer 1: Syllable | FSM + dictionary lookup | Always | Validates syllable structure |
| | SymSpell | On invalid syllables | Generates correction suggestions |
| Layer 2: Word | SymSpell | ValidationLevel.WORD | Validates words, suggests fixes |
| Layer 2.5: Grammar | Viterbi POS + rule engine | ValidationLevel.WORD | Checks syntactic correctness |
| Layer 3: Context | N-gram probabilities | ValidationLevel.WORD | Checks word-in-context fitness |
| | ONNX semantic model | When enabled | Deep context understanding |
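The normalization stage can be sketched in plain Python. The exact zero-width set below is an assumption; the library's `ZERO_WIDTH` constant may cover additional code points:

```python
import unicodedata

# Assumed zero-width set (ZWSP, ZWNJ, ZWJ, word joiner, BOM).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def normalize(text: str) -> str:
    """Normalization stage: Unicode NFC, then zero-width removal (both O(n))."""
    text = unicodedata.normalize("NFC", text)
    return "".join(c for c in text if c not in ZERO_WIDTH)

print(normalize("cafe\u0301\u200bs"))  # 'cafés': accent composed, ZWSP stripped
```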

Performance Characteristics

Time Complexity

| Algorithm | Best | Average | Worst |
|---|---|---|---|
| Normalization | O(n) | O(n) | O(n) |
| Segmentation | O(n) | O(n) | O(n) |
| Syllable Validation | O(1) | O(1) | O(k)* |
| SymSpell Lookup | O(1) | O(1) | O(1) |
| Viterbi POS | O(n·B²·T) | O(n·B²·T) | O(n·B²·T)* |
| N-gram Lookup | O(1) | O(1) | O(1) |

*k = number of rule checks; n = sequence length, B = beam width, T = average tags per word (beam pruning reduces Viterbi from O(nT²)).
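SymSpell reaches O(1) average lookups by precomputing delete variants of every dictionary word, so a query needs only a handful of hash probes. A minimal sketch of the delete-index idea; the real SymSpell additionally verifies the true edit distance of each candidate and ranks by word frequency:

```python
def deletes(word: str, max_dist: int = 1) -> set:
    """All variants reachable by deleting up to max_dist characters."""
    out, frontier = {word}, {word}
    for _ in range(max_dist):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        out |= frontier
    return out

# Precompute once: delete variant -> dictionary words that produce it.
index = {}
for entry in ["hello", "help", "hell"]:
    for d in deletes(entry):
        index.setdefault(d, set()).add(entry)

def candidates(query: str) -> set:
    """Lookup cost is a few O(1) hash probes, independent of dictionary size."""
    found = set()
    for d in deletes(query):
        found |= index.get(d, set())
    return found

print(sorted(candidates("helo")))  # ['hell', 'hello', 'help']
```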

Space Complexity

| Algorithm | Memory | Notes |
|---|---|---|
| SymSpell | ~50-300 MB | Pre-computed deletes (varies with dictionary size) |
| N-gram Model | ~50 MB | Indexed probabilities |
| Viterbi | O(nT) | Dynamic programming table |
| Semantic | ~300 MB | ONNX model |

Implementation Notes

Cython Optimizations

Performance-critical algorithms are implemented in Cython:

```python
# Pure Python (normalize.py)
def remove_zero_width(text: str) -> str:
    return ''.join(c for c in text if c not in ZERO_WIDTH)
```

```cython
# Cython (normalize_c.pyx) - operates on raw UTF-8 bytes via C++ std::string
from libcpp.string cimport string
from libcpp.utility cimport pair

cpdef str remove_zero_width_chars(str text):
    cdef string s = text.encode('utf-8')
    cdef string result
    result.reserve(s.length())

    cdef int i = 0
    cdef int n = s.length()
    cdef int cp
    cdef int char_len
    cdef pair[int, int] decoded

    while i < n:
        # Decode one code point starting at byte offset i.
        decoded = decode_utf8_char(s.c_str(), i, n)
        cp = decoded.first
        char_len = decoded.second

        # Copy the bytes through unless the code point is zero-width.
        if ZERO_WIDTH_CHARS.count(cp) == 0:
            result.append(s.substr(i, char_len))

        i += char_len

    return result.decode('utf-8')
```

Cython Word Segmentation

The word segmenter (word_segment.pyx) uses Viterbi decoding with C++ unordered_map for O(1) probability lookups and memory-mapped model loading for fork-safe parallel processing:
```cython
# mmap_reader.pyx - Used by word_segment.pyx for Viterbi word segmentation
from libcpp.unordered_map cimport unordered_map
from libcpp.string cimport string

cdef class MMapSegmentationReader:
    """
    Fast memory-mapped dictionary reader.

    Provides O(1) average lookups for unigrams and bigrams.
    All lookup methods are GIL-free for maximum parallelism.
    """
    ...
```

Quick Reference

Algorithm Parameters

| Algorithm | Key Parameter | Default | Range |
|---|---|---|---|
| SymSpell | max_edit_distance | 2 | 1-3 |
| SymSpell | prefix_length | 10 | 4-10 |
| N-gram | bigram_threshold | 0.0001 | 0-1 |
| Viterbi | min_prob | 1e-10 | 1e-15 to 1e-5 |
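How `bigram_threshold` is applied can be sketched as a simple probability cutoff. The helper name and the probability values are illustrative only; the library's actual N-gram provider interface may differ:

```python
def fits_context(prev: str, word: str, bigram_p: dict,
                 threshold: float = 0.0001) -> bool:
    """Accept a word in context when P(word | prev) clears bigram_threshold."""
    return bigram_p.get((prev, word), 0.0) >= threshold

# Toy bigram probabilities (hypothetical values, not from the shipped model).
probs = {("piece", "of"): 0.31, ("peace", "of"): 0.02}
print(fits_context("piece", "of", probs))  # True
print(fits_context("of", "piece", probs))  # False: unseen bigram scores 0.0
```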

Tuning Guidelines

  • Speed priority: Use edit distance 1, disable context
  • Accuracy priority: Use edit distance 2, enable context
  • Memory constrained: Use SQLite provider, disable semantic
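The guidelines above can be expressed as configuration profiles. The parameter names follow the table above, but the flag names for context, semantic, and provider selection are assumptions, not the library's confirmed API:

```python
# Hypothetical tuning profiles for mySpellChecker.
PROFILES = {
    "speed":    {"max_edit_distance": 1, "enable_context": False, "enable_semantic": False},
    "accuracy": {"max_edit_distance": 2, "enable_context": True,  "enable_semantic": True},
    "low_mem":  {"provider": "sqlite", "enable_semantic": False},
}
print(PROFILES["speed"]["max_edit_distance"])  # 1
```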

See Also