Overview
mySpellChecker uses Cython to compile performance-critical Python code to C++ extensions. The project contains 11 Cython extensions:

| Extension | Location | Purpose | Speedup | OpenMP |
|---|---|---|---|---|
| normalize_c.pyx | text/ | Text normalization, zero-width char removal | ~10x | No |
| edit_distance_c.pyx | algorithms/distance/ | Levenshtein, Damerau-Levenshtein distance | ~10x | No |
| viterbi.pyx | algorithms/ | Viterbi POS tagger | ~5x | No |
| syllable_rules_c.pyx | core/ | Syllable structure validation | ~8x | No |
| batch_processor.pyx | data_pipeline/ | Parallel batch processing | ~4x | Yes |
| frequency_counter.pyx | data_pipeline/ | Fast frequency counting | ~6x | No |
| word_segment.pyx | tokenizers/cython/ | Word segmentation | ~5x | No |
| mmap_reader.pyx | tokenizers/cython/ | Memory-mapped file reading | ~3x | No |
| ingester_c.pyx | data_pipeline/ | Fast corpus ingestion | ~4x | No |
| repair_c.pyx | data_pipeline/ | Segmentation repair | ~5x | No |
| tsv_reader_c.pyx | data_pipeline/ | Fast TSV file parsing | ~4x | No |
Building Extensions
Quick Start
Extensions are automatically built during installation.
Requirements
All Platforms:
- Python 3.10+
- Cython 3.0+
- C++ compiler (gcc 9+, clang 10+, or MSVC 2019+)
Build Outputs
| File Type | Example | Purpose | Git Tracked |
|---|---|---|---|
| .pyx | normalize_c.pyx | Cython source | Yes |
| .pxd | normalize_c.pxd | C-level declarations | Yes |
| .py | normalize.py | Python wrapper (some modules include fallback) | Yes |
| .cpp | normalize_c.cpp | Generated C++ (build artifact) | No |
| .so | normalize_c.*.so | Compiled binary | No |
Debug Build
Clean Build
Architecture Patterns
1. Wrapper Pattern (Cython with Fallback)
Some Cython modules have Python wrappers that fall back to pure Python when the compiled extension is unavailable, but not all do. Two patterns are in use.
Pattern A: hard import (no fallback), used by normalize.py.
normalize.py requires the Cython extension to be compiled. It does not provide pure Python fallbacks for core Cython functions. For systems without a C++ compiler, install via a pre-built wheel.
Pattern B: try/except with fallback, used by edit_distance.py and syllable_rules.py:
- Modules using this pattern work without Cython compilation
- Tests run without compilation for those modules
- Gradual migration path
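Pattern B can be sketched as follows. This is a minimal illustration rather than the project's actual wrapper: the import path and the function name levenshtein are assumptions, and the fallback is a plain dynamic-programming Levenshtein distance.

```python
# Pattern B sketch: prefer the compiled extension, fall back to pure Python.
# The module path and function name below are hypothetical.
try:
    from myspellchecker.algorithms.distance.edit_distance_c import levenshtein
    USING_CYTHON = True
except ImportError:
    USING_CYTHON = False

    def levenshtein(a: str, b: str) -> int:
        """Pure-Python Levenshtein distance using two rolling rows."""
        if len(a) < len(b):
            a, b = b, a  # keep the shorter string in the inner loop
        previous = list(range(len(b) + 1))
        for i, ch_a in enumerate(a, start=1):
            current = [i]
            for j, ch_b in enumerate(b, start=1):
                cost = 0 if ch_a == ch_b else 1
                current.append(min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                ))
            previous = current
        return previous[-1]
```

Callers import levenshtein from the wrapper and never need to know which implementation they received; the USING_CYTHON flag is there only for diagnostics.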
Because normalize.py (Pattern A) is a core dependency used throughout the library, the package effectively requires the Cython extensions to be compiled. On systems without a C++ compiler, install from a pre-built wheel.
Checking Active Implementation
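One way to see whether a compiled extension is in use is to import the module and check whether it was loaded from a binary file. The sketch below relies only on the standard library; the helper names are illustrative and not part of mySpellChecker.

```python
import importlib
import importlib.machinery
from types import ModuleType


def is_compiled(module: ModuleType) -> bool:
    """Return True if the module was loaded from a compiled extension file."""
    path = getattr(module, "__file__", None) or ""
    return path.endswith(tuple(importlib.machinery.EXTENSION_SUFFIXES))


def extension_available(name: str) -> bool:
    """Return True if `name` imports successfully and is a compiled extension."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return False
    return is_compiled(module)
```

For example, extension_available("myspellchecker.text.normalize_c") would report whether that extension is importable, assuming that is where the compiled module lives in an installed package.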
2. Declaration Files (.pxd)
.pxd files declare C-level function signatures so that other Cython modules can cimport them.
3. OpenMP Parallel Processing
Only batch_processor.pyx uses OpenMP for parallelization.
4. C++ Integration
All extensions use language="c++" so that they can use C++ STL containers.
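Both the OpenMP flags and the C++ language setting end up as arguments to the extension definitions in setup.py. The helper below is a sketch of how those arguments might be assembled per platform; the function name is hypothetical, the project's real setup.py may differ, and the macOS branch assumes libomp is installed separately.

```python
import sys


def extension_kwargs(use_openmp: bool = False, platform: str = sys.platform) -> dict:
    """Build keyword arguments for setuptools.Extension (illustrative only)."""
    kwargs = {
        "language": "c++",  # every extension is compiled as C++
        "extra_compile_args": [],
        "extra_link_args": [],
    }
    if use_openmp:
        if platform == "win32":
            kwargs["extra_compile_args"].append("/openmp")  # MSVC
        elif platform == "darwin":
            # Apple clang has no built-in OpenMP runtime; link against libomp.
            kwargs["extra_compile_args"] += ["-Xpreprocessor", "-fopenmp"]
            kwargs["extra_link_args"].append("-lomp")
        else:
            kwargs["extra_compile_args"].append("-fopenmp")  # gcc / clang
            kwargs["extra_link_args"].append("-fopenmp")
    return kwargs
```

The returned dictionary would be unpacked into a setuptools Extension(...) call, with use_openmp=True only for batch_processor.pyx.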
Extension Details
Text Normalization (normalize_c.pyx)
Uses C++ unordered_set for average O(1) character lookups, with pre-compiled character sets for Myanmar ranges.
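The same idea can be shown in pure Python, where frozenset plays the role of unordered_set. The character set below is an assumption chosen for illustration, and the function names are hypothetical; the extension's actual sets and API are not shown in this document.

```python
# Assumed set of zero-width characters; the extension's real set may differ.
ZERO_WIDTH = frozenset({
    "\u200b",  # ZERO WIDTH SPACE
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (BOM)
})


def strip_zero_width(text: str) -> str:
    """Drop every character that appears in ZERO_WIDTH."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def is_myanmar(ch: str) -> bool:
    """True if ch lies in the main Myanmar Unicode block (U+1000..U+109F)."""
    return "\u1000" <= ch <= "\u109f"
```

Building the set once at import time is what makes each membership test a constant-time hash lookup instead of a scan.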
Edit Distance (edit_distance_c.pyx)
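As a reference for what a Damerau-Levenshtein computation involves, here is a pure-Python sketch of the optimal string alignment variant, which extends Levenshtein with adjacent transpositions. Which variant the extension implements is not stated here, so treat both the variant and the function name as assumptions.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Adjacent transposition, e.g. "ab" -> "ba"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

A swap of two neighboring characters costs 1 here, whereas plain Levenshtein would count it as 2.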
Syllable Validation (syllable_rules_c.pyx)
Batch Processor (batch_processor.pyx)
Uses OpenMP parallel for directives for multi-threading, with GIL-free C++ processing.
Viterbi POS Tagger (viterbi.pyx)
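For readers unfamiliar with the algorithm, the following is a generic pure-Python Viterbi decoder over dictionary-based probability tables. It is a textbook sketch, not the extension's code: the function signature and table layout are assumptions, and it expects a non-empty observation sequence.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations.

    Probabilities are combined in log space to avoid underflow.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log-probability of the best path ending in s, previous state)
    best = [{s: (log(start_p[s]) + log(emit_p[s].get(observations[0], 0)), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + log(trans_p[p].get(s, 0))) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + log(emit_p[s].get(obs, 0)), prev)
        best.append(column)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]
```

In a POS tagger the states are tags and the observations are words; the nested loops over states at every position are the kind of work that benefits from compilation.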
Performance Comparison
Benchmark results (10,000 iterations):

| Operation | Pure Python | Cython | Speedup |
|---|---|---|---|
| Levenshtein (short) | 100μs | 10μs | 10x |
| Levenshtein (long) | 1ms | 50μs | 20x |
| Text normalization | 50μs | 5μs | 10x |
| Syllable validation | 80μs | 10μs | 8x |
| Batch (1000 texts) | 5s | 0.5s | 10x |
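The table does not say how the timings were collected. One plausible way to obtain comparable per-call figures is the standard timeit module, as in this sketch; the helper names are hypothetical.

```python
import timeit


def mean_time_us(func, *args, iterations: int = 10_000) -> float:
    """Average wall-clock time per call in microseconds."""
    total = timeit.timeit(lambda: func(*args), number=iterations)
    return total / iterations * 1e6


def speedup(baseline_us: float, optimized_us: float) -> float:
    """How many times faster the optimized implementation is."""
    return baseline_us / optimized_us
```

Calling mean_time_us once with the pure-Python function and once with the Cython function, on the same inputs, gives the two numbers that speedup compares.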
Testing Cython Code
Running Tests
Testing Both Implementations
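A simple way to compare two implementations is to run both on the same reproducible inputs and assert that the results match. The helpers below are an illustrative sketch and not part of the project's test suite.

```python
import random


def assert_consistent(py_func, cy_func, cases):
    """Raise AssertionError if the two implementations disagree on any case."""
    for args in cases:
        expected = py_func(*args)
        actual = cy_func(*args)
        assert actual == expected, f"mismatch for {args!r}: {actual!r} != {expected!r}"


def random_pairs(count: int, alphabet: str = "abc", max_len: int = 8, seed: int = 0):
    """Generate reproducible random string pairs for comparison tests."""
    rng = random.Random(seed)

    def word():
        return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))

    return [(word(), word()) for _ in range(count)]
```

A small alphabet and short strings make repeated characters and empty inputs likely, which is where implementations tend to diverge.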
Tests should verify that the Python and Cython implementations return the same results.
Adding New Cython Modules
- Create the .pyx file.
- Create the .pxd file (if needed for cross-module imports).
- Add the extension to setup.py.
- Create the Python wrapper.
- Add tests and rebuild.
Troubleshooting
"Cannot find Cython" during build
"libomp not found" on macOS
Module import fails after changes
Compiler errors (C++ standard)
Segmentation fault in Cython code
Common causes: releasing the GIL while accessing Python objects, buffer overflow in typed memoryviews, use-after-free in C++ containers.
Best Practices
- Always provide Python fallback - graceful degradation pattern
- Use type declarations - fully typed cdef functions for speed
- Minimize GIL releases - only release when safe (pure C/C++ operations)
- Use memory views for arrays - typed double[:] for efficient array access
- Document Cython-specific behavior - note which implementation is active
See Also
- Performance Tuning - Overall optimization strategies
- Edit Distance Algorithms - Algorithm details
- Development Setup - Environment configuration
- Cython Documentation