All 11 Cython extensions are listed here with their architecture patterns, build instructions, performance benchmarks, and troubleshooting steps.
Overview
mySpellChecker uses Cython to compile performance-critical Python code to C++ extensions. The project contains 11 Cython extensions:
| Extension | Location | Purpose | Speedup | OpenMP |
|---|---|---|---|---|
| normalize_c.pyx | text/ | Text normalization, zero-width char removal | 2-3x | No |
| edit_distance_c.pyx | algorithms/distance/ | Levenshtein, Damerau-Levenshtein distance | 4-6x | No |
| viterbi.pyx | algorithms/ | Viterbi POS tagger | 3-4x | No |
| syllable_rules_c.pyx | core/ | Syllable structure validation | 2-3x | No |
| batch_processor.pyx | data_pipeline/ | Parallel batch processing | 5-10x | Yes |
| frequency_counter.pyx | data_pipeline/ | Fast frequency counting | 3-5x | No |
| word_segment.pyx | tokenizers/cython/ | Word segmentation | 2-3x | No |
| mmap_reader.pyx | tokenizers/cython/ | Memory-mapped file reading | 2-4x | No |
| ingester_c.pyx | data_pipeline/ | Fast corpus ingestion | 2-3x | No |
| repair_c.pyx | data_pipeline/ | Segmentation repair | 2-3x | No |
| tsv_reader_c.pyx | data_pipeline/ | Fast TSV file parsing | 2-3x | No |
Building Extensions
Quick Start
Extensions are automatically built during installation:
# Install with development dependencies (builds extensions)
pip install -e ".[dev]"
# Or explicitly build extensions
python setup.py build_ext --inplace
Requirements
All Platforms:
- Python 3.10+
- Cython 3.0+
- C++ compiler (gcc 9+, clang 10+, or MSVC 2019+)
macOS (for OpenMP support):
# Install libomp for parallel processing
brew install libomp
# Apple Silicon (M1/M2/M3)
# libomp installed to: /opt/homebrew/opt/libomp
# Intel Mac
# libomp installed to: /usr/local/opt/libomp
Linux:
# OpenMP comes with gcc
sudo apt install build-essential # Debian/Ubuntu
sudo yum groupinstall "Development Tools" # RHEL/CentOS
Build Outputs
| File Type | Example | Purpose | Git Tracked |
|---|---|---|---|
| .pyx | normalize_c.pyx | Cython source | Yes |
| .pxd | normalize_c.pxd | C-level declarations | Yes |
| .py | normalize.py | Python wrapper (some modules include fallback) | Yes |
| .cpp | normalize_c.cpp | Generated C++ (build artifact) | No |
| .so | normalize_c.*.so | Compiled binary | No |
Debug Build
# Build with debug symbols
python setup.py build_ext --inplace --debug
Clean Build
# Remove compiled extensions
find . -name "*.so" -delete
find . -name "*.cpp" -path "*/myspellchecker/*" -delete
find . -name "*.pyc" -delete
rm -rf build/ *.egg-info/
# Rebuild from scratch
python setup.py build_ext --inplace
Architecture Patterns
1. Wrapper Pattern (Cython with Fallback)
Some, but not all, Cython modules have Python wrappers that fall back to pure Python when the compiled extension isn’t available. Two patterns are in use:
Pattern A: Hard import (no fallback) — used by normalize.py:
# normalize.py -- hard import, NO try/except fallback
from myspellchecker.text.normalize_c import (
    remove_zero_width_chars as c_remove_zero_width,
)
from myspellchecker.text.normalize_c import (
    reorder_myanmar_diacritics as c_reorder_diacritics,
)
normalize.py requires the Cython extension to be compiled. It does not provide pure Python fallbacks for core Cython functions. For systems without a C++ compiler, install via a pre-built wheel.
Pattern B: try/except with fallback — used by edit_distance.py and syllable_rules.py:
# edit_distance.py -- try/except with full Python fallback
try:
    from myspellchecker.algorithms.distance import edit_distance_c
    _HAS_CYTHON_EDIT_DISTANCE = True
except ImportError:
    _HAS_CYTHON_EDIT_DISTANCE = False

# Each function checks the flag and falls back to pure Python:
def levenshtein_distance(s1: str, s2: str) -> int:
    if _HAS_CYTHON_EDIT_DISTANCE:
        return edit_distance_c.levenshtein_distance(s1, s2)
    # ... pure Python implementation follows ...

# syllable_rules.py -- try/except with class-level fallback
try:
    from myspellchecker.core.syllable_rules_c import (
        SyllableRuleValidator as _SyllableRuleValidatorCython,
    )
    SyllableRuleValidator = _SyllableRuleValidatorCython
except ImportError:
    SyllableRuleValidator = _SyllableRuleValidatorPython
Benefits of Pattern B (where it exists):
- Modules using this pattern work without Cython compilation
- Tests run without compilation for those modules
- Gradual migration path
Important: Since normalize.py (Pattern A) is a core dependency used throughout the library, the package effectively requires Cython extensions to be compiled. Install via pre-built wheels on systems without a C++ compiler.
Checking Active Implementation
# Check if Cython extension loaded
try:
    from myspellchecker.text.normalize_c import remove_zero_width_chars
    print("Cython normalize: loaded")
except ImportError:
    print("Cython normalize: not available")
    # NOTE: There is NO Python fallback for normalize. The normalize.py
    # wrapper uses hard imports (Pattern A), so if the Cython extension
    # is not compiled, importing normalize.py will raise ImportError.

from myspellchecker.algorithms.viterbi import _HAS_CYTHON_VITERBI
print(f"Cython viterbi: {_HAS_CYTHON_VITERBI}")
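The same check can be generalized to several extensions at once. This is a sketch: the helper `extension_loaded` is a name made up here, and the module paths are the ones quoted elsewhere on this page, so adjust them if your checkout differs.

```python
import importlib


def extension_loaded(module_name: str) -> bool:
    """Return True if the named module imports successfully."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Module paths as they appear in the import examples on this page.
for name in (
    "myspellchecker.text.normalize_c",
    "myspellchecker.algorithms.distance.edit_distance_c",
    "myspellchecker.core.syllable_rules_c",
):
    status = "loaded" if extension_loaded(name) else "not available"
    print(f"{name}: {status}")
```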
2. Declaration Files (.pxd)
.pxd files declare C-level function signatures for cross-module imports:
# normalize_c.pxd
cdef str c_remove_zero_width_chars(str text)
cdef bint c_has_myanmar_script(str text)

# batch_processor.pyx (using the declarations)
from myspellchecker.text.normalize_c cimport c_remove_zero_width_chars

def process_batch(list texts):
    cdef str text
    for text in texts:
        # Direct C-level call (no Python overhead)
        text = c_remove_zero_width_chars(text)
3. OpenMP Parallel Processing
Only batch_processor.pyx uses OpenMP for parallelization:
# batch_processor.pyx
from cython.parallel import prange

def process_batch_parallel(list texts, int num_threads=4):
    cdef int i, n = len(texts)
    cdef list results = [None] * n
    # Parallel loop with OpenMP
    for i in prange(n, nogil=True, num_threads=num_threads):
        with gil:
            results[i] = process_single(texts[i])
    return results
Note: OpenMP is optional. If libomp isn’t installed, the library falls back to single-threaded processing.
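How the project's setup.py selects OpenMP flags is not shown here. The sketch below is one common way to choose them per platform. The function name `openmp_flags` and its default prefix are assumptions for illustration; the compiler switches themselves are the standard ones for Apple clang with Homebrew libomp, MSVC, and gcc.

```python
import sys


def openmp_flags(platform: str = sys.platform,
                 libomp_prefix: str = "/opt/homebrew/opt/libomp"):
    """Return (compile_args, link_args) that enable OpenMP on a platform."""
    if platform == "darwin":
        # Apple clang needs the preprocessor flag plus Homebrew's libomp.
        compile_args = ["-Xpreprocessor", "-fopenmp", f"-I{libomp_prefix}/include"]
        link_args = [f"-L{libomp_prefix}/lib", "-lomp"]
    elif platform == "win32":
        # MSVC enables OpenMP with a compiler switch only.
        compile_args, link_args = ["/openmp"], []
    else:
        # gcc on Linux ships OpenMP support.
        compile_args, link_args = ["-fopenmp"], ["-fopenmp"]
    return compile_args, link_args


print(openmp_flags())
```

On an Intel Mac, pass `libomp_prefix="/usr/local/opt/libomp"` to match the install location listed under Requirements.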
4. C++ Integration
All extensions use language="c++" for STL containers:
# distutils: language = c++
from libcpp.string cimport string
from libcpp.vector cimport vector
from libcpp.unordered_map cimport unordered_map

cdef class EditDistanceCalculator:
    cdef unordered_map[string, int] cache

    cdef int calculate(self, string s1, string s2) nogil:
        # C++ implementation with STL
        pass
Extension Details
Text Normalization (normalize_c.pyx)
from myspellchecker.text.normalize import (
    remove_zero_width_chars,
    normalize,
    normalize_for_lookup,
)

clean = remove_zero_width_chars("မြန်\u200bမာ")  # "မြန်မာ"
normalized = normalize("မြန်မာ")
lookup_form = normalize_for_lookup("မြန်မာ")
Uses C++ unordered_set for O(1) character lookups with pre-compiled character sets for Myanmar ranges.
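The set-lookup idea can be illustrated in pure Python, where `frozenset` membership is also O(1) on average. This is only a sketch: the code points in `ZERO_WIDTH_CHARS` are an assumed selection and may not match the library's actual set, and both function names are made up for the example.

```python
# Assumed set of zero-width code points; the library's actual set may differ.
ZERO_WIDTH_CHARS = frozenset({
    "\u200b",  # zero width space
    "\u200c",  # zero width non-joiner
    "\u200d",  # zero width joiner
    "\ufeff",  # zero width no-break space / BOM
})


def strip_zero_width(text: str) -> str:
    """Drop every character found in ZERO_WIDTH_CHARS."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH_CHARS)


def contains_myanmar(text: str) -> bool:
    """True if any code point lies in the main Myanmar block U+1000-U+109F."""
    return any("\u1000" <= ch <= "\u109f" for ch in text)


print(strip_zero_width("hello\u200bworld"))
```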
Edit Distance (edit_distance_c.pyx)
from myspellchecker.algorithms.distance.edit_distance_c import (
    levenshtein_distance,
    damerau_levenshtein_distance,
    weighted_damerau_levenshtein_distance,
    set_myanmar_substitution_costs,
)

dist = levenshtein_distance("မြန်", "မြမ်")  # 1
dist = damerau_levenshtein_distance("AB", "BA")  # 1 (includes transposition)
Row-based DP for O(min(m,n)) space complexity with proper UTF-8 handling for Myanmar’s 3-byte characters.
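The row-based technique can be sketched in pure Python as follows. This is an illustration of the general two-row approach, not the extension's actual code, and `levenshtein_two_rows` is a name chosen for the example. Python strings iterate by code point, so the multi-byte UTF-8 concern that the C++ code has to handle does not arise here.

```python
def levenshtein_two_rows(s1: str, s2: str) -> int:
    """Levenshtein distance keeping only two DP rows in memory."""
    # Make s2 the shorter string so each row has min(m, n) + 1 cells.
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein_two_rows("kitten", "sitting"))  # 3
```

Plain Levenshtein scores the swap "AB" to "BA" as 2. The Damerau variant needs one extra row of history to count it as a single transposition.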
Syllable Validation (syllable_rules_c.pyx)
from myspellchecker.core.syllable_rules import SyllableRuleValidator
validator = SyllableRuleValidator(strict=True)
is_valid = validator.validate("မြန်") # True
is_valid = validator.validate("ြမန်") # False (medial without consonant)
22+ validation checks per syllable with pre-computed character sets.
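A single check of this kind might look like the sketch below, which covers only the "medial without consonant" case from the example above. The function name and the exact rule are assumptions for illustration. The code point ranges are the Myanmar consonants (U+1000 to U+1021) and medial signs (U+103B to U+103E) from the Unicode block.

```python
# Pre-computed character sets built once at import time.
CONSONANTS = frozenset(chr(cp) for cp in range(0x1000, 0x1022))
MEDIALS = frozenset(chr(cp) for cp in range(0x103B, 0x103F))


def medials_follow_consonant(syllable: str) -> bool:
    """Return False if a medial sign is not preceded by a consonant or medial."""
    previous = ""
    for ch in syllable:
        if ch in MEDIALS and previous not in CONSONANTS and previous not in MEDIALS:
            return False
        previous = ch
    return True


print(medials_follow_consonant("\u1019\u103c\u1014\u103a"))  # consonant first
print(medials_follow_consonant("\u103c\u1019\u1014\u103a"))  # medial first
```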
Batch Processor (batch_processor.pyx)
from myspellchecker.data_pipeline.batch_processor import process_batch
results = process_batch(texts, num_threads=4)
Uses OpenMP parallel-for directives for multi-threading, with the C++ processing running while the GIL is released.
Viterbi POS Tagger (viterbi.pyx)
from myspellchecker.algorithms.viterbi import ViterbiTagger
tagger = ViterbiTagger()
tags = tagger.tag_sequence(["သူ", "က", "သွား", "တယ်"])
# ["N", "P_SUBJ", "V", "P_SENT"]
Log-space computation for numerical stability with optimized backtracking.
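To show what log-space scoring with backtracking means, here is a small pure Python sketch. It is not the extension's implementation. The two-tag model, its probabilities, and the ASCII tokens are invented toy data, and the function assumes a non-empty observation list.

```python
import math


def viterbi_log(observations, states, log_start, log_trans, log_emit):
    """Most likely state sequence, scoring paths by summed log-probabilities."""
    # best[t][s] = best log-score of any path ending in state s at step t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def to_log(probs):
    """Convert a dict of probabilities to log-probabilities."""
    return {key: math.log(p) for key, p in probs.items()}


states = ["N", "V"]
log_start = to_log({"N": 0.8, "V": 0.2})
log_trans = {"N": to_log({"N": 0.3, "V": 0.7}), "V": to_log({"N": 0.6, "V": 0.4})}
log_emit = {"N": to_log({"he": 0.9, "goes": 0.1}), "V": to_log({"he": 0.1, "goes": 0.9})}
print(viterbi_log(["he", "goes"], states, log_start, log_trans, log_emit))  # ['N', 'V']
```

Adding logs instead of multiplying probabilities avoids underflow to zero on long sequences.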
Benchmark results (10,000 iterations):
| Operation | Pure Python | Cython | Speedup |
|---|---|---|---|
| Levenshtein (short) | 100μs | 10μs | 10x |
| Levenshtein (long) | 1ms | 50μs | 20x |
| Text normalization | 50μs | 5μs | 10x |
| Syllable validation | 80μs | 10μs | 8x |
| Batch (1000 texts) | 5s | 0.5s | 10x |
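The page does not say how these figures were measured. One way to take comparable per-call timings is sketched below with the standard `timeit` module. The helper names are illustrative, and the default of 10,000 iterations mirrors the count quoted above.

```python
import timeit


def microseconds_per_call(func, *args, iterations: int = 10_000) -> float:
    """Average wall-clock time of func(*args) in microseconds."""
    total_seconds = timeit.timeit(lambda: func(*args), number=iterations)
    return total_seconds / iterations * 1e6


def speedup(baseline_us: float, optimized_us: float) -> float:
    """Ratio as reported in the Speedup column, e.g. 100 us / 10 us = 10x."""
    return baseline_us / optimized_us


# Stand-in workload; replace sorted with the function under test.
print(f"{microseconds_per_call(sorted, list(range(100))):.2f} us per call")
print(f"{speedup(100, 10):.0f}x")
```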
Testing Cython Code
Running Tests
# Run tests for Cython modules
pytest tests/test_normalize.py tests/test_edit_distance.py -v
# Run with coverage (tests both Python and Cython paths)
pytest tests/ --cov=myspellchecker --cov-report=html
Testing Both Implementations
Tests should verify Python/Cython consistency:
# test_normalize.py
import pytest

# CYTHON_AVAILABLE drives the skipif markers below.
try:
    from myspellchecker.text.normalize_c import remove_zero_width_chars
    CYTHON_AVAILABLE = True
except ImportError:
    CYTHON_AVAILABLE = False

@pytest.mark.skipif(not CYTHON_AVAILABLE, reason="Cython not compiled")
def test_remove_zero_width_basic():
    text = "hello\u200bworld"
    result = remove_zero_width_chars(text)
    assert result == "helloworld"

@pytest.mark.skipif(not CYTHON_AVAILABLE, reason="Cython not compiled")
def test_cython_zero_width_edge_cases():
    test_cases = [
        ("hello\u200bworld", "helloworld"),  # ZWSP
        ("မြန်မာ\u200b", "မြန်မာ"),  # Trailing ZWSP
        ("", ""),  # Empty string
    ]
    for text, expected in test_cases:
        assert remove_zero_width_chars(text) == expected
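For modules that do ship a pure Python fallback, a consistency check can compare the two implementations case by case. The helper below is a sketch with a made-up name. The two callables passed in the usage lines are stand-ins, so substitute the real Cython and Python functions.

```python
def find_mismatches(reference, candidate, cases):
    """Return (input, reference_output, candidate_output) for every disagreement."""
    mismatches = []
    for case in cases:
        expected, actual = reference(case), candidate(case)
        if expected != actual:
            mismatches.append((case, expected, actual))
    return mismatches


# Stand-in implementations that agree on every input.
def strip_with_replace(text):
    return text.replace("\u200b", "")


def strip_with_filter(text):
    return "".join(ch for ch in text if ch != "\u200b")


cases = ["hello\u200bworld", "", "plain"]
assert find_mismatches(strip_with_replace, strip_with_filter, cases) == []
```

Returning the full list of disagreements, rather than failing on the first one, makes it easier to see whether a divergence is isolated or systematic.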
Adding New Cython Modules
- Create .pyx file:
  src/myspellchecker/new_module/fast_impl.pyx
- Create .pxd file (if needed for cross-module imports):
  src/myspellchecker/new_module/fast_impl.pxd
- Add to setup.py:
  Extension(
      name="myspellchecker.new_module.fast_impl",
      sources=["src/myspellchecker/new_module/fast_impl.pyx"],
      language="c++",
  ),
- Create Python wrapper:
  # new_module/__init__.py
  try:
      from .fast_impl import function
  except ImportError:
      def function(*args, **kwargs):
          # Pure Python fallback goes here
          pass
- Add tests and rebuild:
  python setup.py build_ext --inplace
  pytest tests/test_fast_impl.py
Troubleshooting
“Cannot find Cython” during build
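No fix is listed for this error. A reasonable first diagnostic, offered here as an assumption rather than the documented procedure, is to confirm that Cython is installed in the active environment and meets the 3.0+ requirement stated above. The helper name is made up for this sketch.

```python
from importlib import metadata


def installed_cython_version():
    """Return the installed Cython version string, or None if it is missing."""
    try:
        return metadata.version("Cython")
    except metadata.PackageNotFoundError:
        return None


version = installed_cython_version()
if version is None:
    print('Cython is not installed; try: pip install "Cython>=3.0"')
else:
    print(f"Cython {version} found (this project requires 3.0+)")
```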
“libomp not found” on macOS
brew install libomp
# If build still fails:
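# brew --prefix resolves to /opt/homebrew/opt/libomp (Apple Silicon) or /usr/local/opt/libomp (Intel)
export LDFLAGS="-L$(brew --prefix libomp)/lib"
export CPPFLAGS="-I$(brew --prefix libomp)/include"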
python setup.py build_ext --inplace
Module import fails after changes
python setup.py build_ext --inplace --force
Compiler errors (C++ standard)
export CXXFLAGS="-std=c++11"
python setup.py build_ext --inplace
Segmentation fault in Cython code
Common causes: releasing GIL while accessing Python objects, buffer overflow in typed memoryviews, use-after-free in C++ containers.
# Build with debug symbols and debug with lldb
python setup.py build_ext --inplace --debug
lldb -- python -c "import myspellchecker"
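Before reaching for a native debugger, the standard library's `faulthandler` module can often show which Python call was active when the crash happened. The wrapper below is a sketch: the function name and the default log path are illustrative choices.

```python
import faulthandler


def enable_crash_tracebacks(log_path: str = "crash_traceback.log"):
    """Write Python tracebacks to log_path if the process hits a fatal signal."""
    # The file must stay open for as long as the handler is active.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Typical use: enable the handler first, then run the suspect code.
# log_file = enable_crash_tracebacks()
# import myspellchecker
```

The same handler can be switched on without code changes by running `python -X faulthandler`, which writes the traceback to stderr.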
Best Practices
- Provide Python fallback where feasible - use Pattern B (try/except) for modules where graceful degradation makes sense. Critical-path modules like normalize use Pattern A (hard imports) where Cython is required for correctness and performance
- Use type declarations - fully typed cdef functions for speed
- Minimize GIL releases - only release when safe (pure C/C++ operations)
- Use memory views for arrays - typed double[:] for efficient array access
- Document Cython-specific behavior - note which implementation is active