Documentation Index
Fetch the complete documentation index at: https://docs.myspellchecker.com/llms.txt
Use this file to discover all available pages before exploring further.
This section documents every algorithm in the library — from O(1) SymSpell lookups through Viterbi decoding to optional ONNX-based semantic models — along with their complexity characteristics and tuning parameters.
Overview
mySpellChecker uses a multi-layered approach with specialized algorithms at each level:
| Layer | Algorithm | Purpose | Complexity |
|---|
| Syllable | Rule-based FSM | Structure validation | O(n) |
| Word | SymSpell | Fast suggestions | O(1) |
| Grammar | POS + Rules | Syntactic validation | O(n) |
| Context | N-gram LM | Probability checking | O(1) |
| Semantic | ONNX Neural | Deep understanding | O(n) |
Contents
Core Algorithms
Suggestion System
Context & Grammar
POS and Tagging
Text Processing
Entity & Pattern Recognition
Algorithm Selection
When to Use Each Algorithm
The validation pipeline processes text through these stages in order:
| Stage | Algorithm | When It Runs | What It Does |
|---|
| Normalization | Unicode NFC + zero-width removal | Always | Cleans input text |
| Segmentation | Rule-based syllable breaking | Always | Splits text into syllables |
| Layer 1: Syllable | FSM + dictionary lookup | Always | Validates syllable structure |
| SymSpell | On invalid syllables | Generates correction suggestions |
| Layer 2: Word | SymSpell | ValidationLevel.WORD | Validates words, suggests fixes |
| Layer 2.5: Grammar | Viterbi POS + rule engine | ValidationLevel.WORD | Checks syntactic correctness |
| Layer 3: Context | N-gram probabilities | ValidationLevel.WORD | Checks word-in-context fitness |
| ONNX semantic model | When enabled | Deep context understanding |
Time Complexity
| Algorithm | Best | Average | Worst |
|---|
| Normalization | O(n) | O(n) | O(n) |
| Segmentation | O(n) | O(n) | O(n) |
| Syllable Validation | O(1) | O(1) | O(k)* |
| SymSpell Lookup | O(1) | O(1) | O(1) |
| Viterbi POS | O(N·B²·T) | O(N·B²·T) | O(N·B²·T)* |
| N-gram Lookup | O(1) | O(1) | O(1) |
*k = number of rule checks; N = sequence length, B = beam width, T = avg tags per word (beam pruning reduces from O(nT²))
Space Complexity
| Algorithm | Memory | Notes |
|---|
| SymSpell | ~50-300MB | Pre-computed deletes (varies by dictionary size) |
| N-gram Model | ~50MB | Indexed probabilities |
| Viterbi | O(nT) | Dynamic programming table |
| Semantic | ~300MB | ONNX model |
Implementation Notes
Cython Optimizations
Performance-critical algorithms are implemented in Cython:
# Pure Python (normalize.py)
def remove_zero_width(text: str) -> str:
return ''.join(c for c in text if c not in ZERO_WIDTH)
# Cython (normalize_c.pyx) - operates on raw UTF-8 bytes via C++ std::string
cpdef str remove_zero_width_chars(str text):
cdef string s = text.encode('utf-8')
cdef string result
result.reserve(s.length())
cdef int i = 0
cdef int n = s.length()
cdef int cp
cdef int char_len
cdef pair[int, int] decoded
while i < n:
decoded = decode_utf8_char(s.c_str(), i, n)
cp = decoded.first
char_len = decoded.second
if ZERO_WIDTH_CHARS.count(cp) == 0:
result.append(s.substr(i, char_len))
i += char_len
Cython Word Segmentation
The word segmenter (word_segment.pyx) uses Viterbi decoding with C++ unordered_map for O(1) probability lookups and memory-mapped model loading for fork-safe parallel processing:
# mmap_reader.pyx - Used by word_segment.pyx for Viterbi word segmentation
from libcpp.unordered_map cimport unordered_map
from libcpp.string cimport string
cdef class MMapSegmentationReader:
"""
Fast memory-mapped dictionary reader.
Provides O(1) average lookups for unigrams and bigrams.
All lookup methods are GIL-free for maximum parallelism.
"""
...
Quick Reference
Algorithm Parameters
| Algorithm | Key Parameter | Default | Range |
|---|
| SymSpell | max_edit_distance | 2 | 1-3 |
| SymSpell | prefix_length | 10 | 4-10 |
| N-gram | bigram_threshold | 0.0001 | 0-1 |
| Viterbi | min_prob | 1e-10 | 1e-15 to 1e-5 |
Tuning Guidelines
- Speed priority: Use edit distance 1, disable context
- Accuracy priority: Use edit distance 2, enable context
- Memory constrained: Use SQLite provider, disable semantic
See Also