This page documents all 11 Cython extensions: architecture patterns, build instructions, performance benchmarks, and troubleshooting steps.

Overview

mySpellChecker uses Cython to compile performance-critical Python code to C++ extensions. The project contains 11 Cython extensions:
| Extension | Location | Purpose | Speedup | OpenMP |
| --- | --- | --- | --- | --- |
| normalize_c.pyx | text/ | Text normalization, zero-width char removal | ~10x | No |
| edit_distance_c.pyx | algorithms/distance/ | Levenshtein, Damerau-Levenshtein distance | ~10x | No |
| viterbi.pyx | algorithms/ | Viterbi POS tagger | ~5x | No |
| syllable_rules_c.pyx | core/ | Syllable structure validation | ~8x | No |
| batch_processor.pyx | data_pipeline/ | Parallel batch processing | ~4x | Yes |
| frequency_counter.pyx | data_pipeline/ | Fast frequency counting | ~6x | No |
| word_segment.pyx | tokenizers/cython/ | Word segmentation | ~5x | No |
| mmap_reader.pyx | tokenizers/cython/ | Memory-mapped file reading | ~3x | No |
| ingester_c.pyx | data_pipeline/ | Fast corpus ingestion | ~4x | No |
| repair_c.pyx | data_pipeline/ | Segmentation repair | ~5x | No |
| tsv_reader_c.pyx | data_pipeline/ | Fast TSV file parsing | ~4x | No |

Building Extensions

Quick Start

Extensions are automatically built during installation:
# Install with development dependencies (builds extensions)
pip install -e ".[dev]"

# Or explicitly build extensions
python setup.py build_ext --inplace

Requirements

All Platforms:
  • Python 3.10+
  • Cython 3.0+
  • C++ compiler (gcc 9+, clang 10+, or MSVC 2019+)
macOS (for OpenMP support):
# Install libomp for parallel processing
brew install libomp

# Apple Silicon (M1/M2/M3)
# libomp installed to: /opt/homebrew/opt/libomp

# Intel Mac
# libomp installed to: /usr/local/opt/libomp
Linux:
# OpenMP comes with gcc
sudo apt install build-essential  # Debian/Ubuntu
sudo yum groupinstall "Development Tools"  # RHEL/CentOS

Build Outputs

| File Type | Example | Purpose | Git Tracked |
| --- | --- | --- | --- |
| .pyx | normalize_c.pyx | Cython source | Yes |
| .pxd | normalize_c.pxd | C-level declarations | Yes |
| .py | normalize.py | Python wrapper (some modules include fallback) | Yes |
| .cpp | normalize_c.cpp | Generated C++ (build artifact) | No |
| .so | normalize_c.*.so | Compiled binary | No |

Debug Build

# Build with debug symbols
python setup.py build_ext --inplace --debug

Clean Build

# Remove compiled extensions
find . -name "*.so" -delete
find . -name "*.cpp" -path "*/myspellchecker/*" -delete
find . -name "*.pyc" -delete
rm -rf build/ *.egg-info/

# Rebuild from scratch
python setup.py build_ext --inplace

Architecture Patterns

1. Wrapper Pattern (Cython with Fallback)

Some Cython modules have Python wrappers that provide a fallback when the Cython extension isn’t available, but not all do. There are two patterns in use.

Pattern A: hard import (no fallback) — used by normalize.py:
# normalize.py -- hard import, NO try/except fallback
from myspellchecker.text.normalize_c import (
    remove_zero_width_chars as c_remove_zero_width,
)
from myspellchecker.text.normalize_c import (
    reorder_myanmar_diacritics as c_reorder_diacritics,
)
normalize.py requires the Cython extension to be compiled: it provides no pure-Python fallback for the core functions. On systems without a C++ compiler, install via a pre-built wheel.

Pattern B: try/except with fallback — used by edit_distance.py and syllable_rules.py:
# edit_distance.py -- try/except with full Python fallback
try:
    from myspellchecker.algorithms.distance import edit_distance_c
    _HAS_CYTHON_EDIT_DISTANCE = True
except ImportError:
    _HAS_CYTHON_EDIT_DISTANCE = False

# Each function checks the flag and falls back to pure Python:
def levenshtein_distance(s1: str, s2: str) -> int:
    if _HAS_CYTHON_EDIT_DISTANCE:
        return edit_distance_c.levenshtein_distance(s1, s2)
    # ... pure Python implementation follows ...

# syllable_rules.py -- try/except with class-level fallback
try:
    from myspellchecker.core.syllable_rules_c import (
        SyllableRuleValidator as _SyllableRuleValidatorCython,
    )
    SyllableRuleValidator = _SyllableRuleValidatorCython
except ImportError:
    SyllableRuleValidator = _SyllableRuleValidatorPython
Benefits of Pattern B (where it exists):
  • Modules using this pattern work without Cython compilation
  • Tests run without compilation for those modules
  • Gradual migration path
Important: Since normalize.py (Pattern A) is a core dependency used throughout the library, the package effectively requires Cython extensions to be compiled. Install via pre-built wheels on systems without a C++ compiler.

Checking Active Implementation

# Check if Cython extension loaded
try:
    from myspellchecker.text.normalize_c import remove_zero_width_chars
    print("Cython normalize: loaded")
except ImportError:
    print("Cython normalize: not available")
    # NOTE: There is NO Python fallback for normalize. The normalize.py
    # wrapper uses hard imports (Pattern A), so if the Cython extension
    # is not compiled, importing normalize.py will raise ImportError.

from myspellchecker.algorithms.viterbi import _HAS_CYTHON_VITERBI
print(f"Cython viterbi: {_HAS_CYTHON_VITERBI}")
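The same check can be generalized into a small helper that reports which compiled modules are importable. This is a sketch using the standard library's importlib; the module names are taken from the table above:

```python
import importlib.util

def extension_available(module_name: str) -> bool:
    """Return True if the (compiled) module can be found on sys.path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Parent package itself is not installed
        return False

# Report status for a couple of the Cython modules
for name in [
    "myspellchecker.text.normalize_c",
    "myspellchecker.algorithms.distance.edit_distance_c",
]:
    status = "loaded" if extension_available(name) else "not available"
    print(f"{name}: {status}")
```

This is useful in CI to fail fast when a wheel was built without its extensions.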

2. Declaration Files (.pxd)

.pxd files declare C-level function signatures for cross-module imports:
# normalize_c.pxd
cdef str c_remove_zero_width_chars(str text)
cdef bint c_has_myanmar_script(str text)
# batch_processor.pyx (using the declarations)
from myspellchecker.text.normalize_c cimport c_remove_zero_width_chars

def process_batch(list texts):
    cdef str text
    for text in texts:
        # Direct C-level call (no Python overhead)
        text = c_remove_zero_width_chars(text)

3. OpenMP Parallel Processing

Only batch_processor.pyx uses OpenMP for parallelization:
# batch_processor.pyx
from cython.parallel import prange

def process_batch_parallel(list texts, int num_threads=4):
    cdef int i, n = len(texts)
    cdef list results = [None] * n

    # Parallel loop with OpenMP
    for i in prange(n, nogil=True, num_threads=num_threads):
        with gil:
            results[i] = process_single(texts[i])

    return results
Note: OpenMP is optional. If libomp isn’t installed, the library falls back to single-threaded processing.
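The shape of that fallback can be sketched in pure Python with concurrent.futures; `process_single` here is a stand-in for the real per-text function, not the library's API:

```python
from concurrent.futures import ThreadPoolExecutor

def process_single(text: str) -> str:
    # Stand-in for the real per-text processing step
    return text.strip()

def process_batch_fallback(texts, num_threads=4):
    """Pure-Python fallback: a thread pool instead of an OpenMP prange."""
    if num_threads <= 1:
        return [process_single(t) for t in texts]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map preserves input order, like the indexed prange loop
        return list(pool.map(process_single, texts))

print(process_batch_fallback(["  a ", " b"]))  # ['a', 'b']
```

Note that for CPU-bound pure-Python work the GIL prevents real parallel speedup; the value of the Cython version is that it releases the GIL inside the C++ processing.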

4. C++ Integration

All extensions use language="c++" for STL containers:
# distutils: language = c++

from libcpp.string cimport string
from libcpp.vector cimport vector
from libcpp.unordered_map cimport unordered_map

cdef class EditDistanceCalculator:
    cdef unordered_map[string, int] cache

    cdef int calculate(self, string s1, string s2) nogil:
        # C++ implementation with STL
        pass

Extension Details

Text Normalization (normalize_c.pyx)

from myspellchecker.text.normalize import (
    remove_zero_width_chars,
    normalize,
    normalize_for_lookup,
)

clean = remove_zero_width_chars("မြန်\u200bမာ")  # "မြန်မာ"
normalized = normalize("မြန်မာ")
lookup_form = normalize_for_lookup("မြန်မာ")
Uses C++ unordered_set for O(1) character lookups with pre-compiled character sets for Myanmar ranges.
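A pure-Python equivalent of that set-based scan looks like this; the character set below is an assumption for illustration, and the compiled module's pre-compiled set may differ:

```python
# Common zero-width code points (assumed set; normalize_c.pyx may use a larger one)
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def remove_zero_width_chars_py(text: str) -> str:
    """O(n) scan with O(1) set membership, mirroring the C++ unordered_set."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

print(remove_zero_width_chars_py("မြန်\u200bမာ"))  # မြန်မာ
```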

Edit Distance (edit_distance_c.pyx)

from myspellchecker.algorithms.distance.edit_distance_c import (
    levenshtein_distance,
    damerau_levenshtein_distance,
    weighted_damerau_levenshtein_distance,
    set_myanmar_substitution_costs,
)

dist = levenshtein_distance("မြန်", "မြမ်")  # 1
dist = damerau_levenshtein_distance("AB", "BA")  # 1 (includes transposition)
Row-based DP for O(min(m,n)) space complexity with proper UTF-8 handling for Myanmar’s 3-byte characters.
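The row-based DP can be sketched in pure Python as follows; Python strings are already Unicode code points, so no explicit UTF-8 handling is needed at this level:

```python
def levenshtein_two_row(s1: str, s2: str) -> int:
    """Levenshtein distance keeping only two DP rows: O(min(m, n)) space."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # iterate rows over the longer string
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein_two_row("kitten", "sitting"))  # 3
```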

Syllable Validation (syllable_rules_c.pyx)

from myspellchecker.core.syllable_rules import SyllableRuleValidator

validator = SyllableRuleValidator(strict=True)
is_valid = validator.validate("မြန်")  # True
is_valid = validator.validate("ြမန်")  # False (medial without consonant)
22+ validation checks per syllable with pre-computed character sets.
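One of those checks, the "medial without consonant" case from the example above, can be illustrated in pure Python. This is a simplified sketch, not the actual rule set:

```python
# Myanmar medial signs U+103B-U+103E (an assumed subset of the real rule data)
MEDIALS = {chr(cp) for cp in range(0x103B, 0x103F)}

def starts_with_medial(syllable: str) -> bool:
    """A medial must follow a consonant, so a leading medial marks an invalid syllable."""
    return bool(syllable) and syllable[0] in MEDIALS

print(starts_with_medial("ြမန်"))  # True  -> fails this check (invalid syllable)
print(starts_with_medial("မြန်"))  # False -> passes this check
```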

Batch Processor (batch_processor.pyx)

from myspellchecker.data_pipeline.batch_processor import BatchProcessor

processor = BatchProcessor(num_workers=4)
results = processor.process_batch(texts, callback=process_fn)
OpenMP parallel for directives for multi-threading with GIL-free C++ processing.

Viterbi POS Tagger (viterbi.pyx)

from myspellchecker.algorithms.viterbi import ViterbiTagger

tagger = ViterbiTagger()
tags = tagger.tag(["သူ", "က", "သွား", "တယ်"])
# ["N", "P_SUBJ", "V", "P_SENT"]
Log-space computation for numerical stability with optimized backtracking.
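To show why log space matters (products of many small probabilities underflow, sums of logs do not), here is a compact Viterbi sketch. The tags, probabilities, and words are made up for illustration and are not the library's model:

```python
import math

def viterbi_log(obs, states, log_start, log_trans, log_emit):
    """Viterbi in log space: probability products become sums."""
    # v[t][s] = best log-probability of any path ending in state s at step t
    v = [{s: log_start[s] + log_emit[s].get(obs[0], -math.inf) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: v[t - 1][p] + log_trans[p][s])
            v[t][s] = (v[t - 1][best_prev] + log_trans[best_prev][s]
                       + log_emit[s].get(obs[t], -math.inf))
            back[t][s] = best_prev
    # Backtrack from the best final state
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy two-tag model (hypothetical numbers)
states = ["N", "V"]
log = math.log
log_start = {"N": log(0.6), "V": log(0.4)}
log_trans = {"N": {"N": log(0.3), "V": log(0.7)},
             "V": {"N": log(0.8), "V": log(0.2)}}
log_emit = {"N": {"dog": log(0.9), "runs": log(0.1)},
            "V": {"runs": log(0.9)}}
print(viterbi_log(["dog", "runs"], states, log_start, log_trans, log_emit))  # ['N', 'V']
```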

Performance Comparison

Benchmark results (10,000 iterations):
| Operation | Pure Python | Cython | Speedup |
| --- | --- | --- | --- |
| Levenshtein (short) | 100μs | 10μs | 10x |
| Levenshtein (long) | 1ms | 50μs | 20x |
| Text normalization | 50μs | 5μs | 10x |
| Syllable validation | 80μs | 10μs | 8x |
| Batch (1000 texts) | 5s | 0.5s | 10x |
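Numbers like these can be reproduced with the standard library's timeit; a minimal harness sketch, where the measured function is a placeholder for the Python or Cython variant:

```python
import timeit

def bench(fn, *args, number=10_000):
    """Return average microseconds per call over `number` iterations."""
    total = timeit.timeit(lambda: fn(*args), number=number)
    return total / number * 1e6

# Example with a trivial stand-in function:
print(f"{bench(str.lower, 'MYANMAR TEXT'):.2f} us/call")
```

Run the same harness against both implementations on the same inputs to compute the speedup column.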

Testing Cython Code

Running Tests

# Run tests for Cython modules
pytest tests/test_normalize_c.py tests/test_edit_distance_c.py -v

# Run with coverage (tests both Python and Cython paths)
pytest tests/ --cov=myspellchecker --cov-report=html

Testing Both Implementations

Tests should verify Python/Cython consistency:
# test_normalize_c.py
import pytest

try:
    from myspellchecker.text.normalize_c import remove_zero_width_chars
    CYTHON_AVAILABLE = True
except ImportError:
    CYTHON_AVAILABLE = False

def test_remove_zero_width_basic():
    text = "hello\u200bworld"
    result = remove_zero_width_chars(text)
    assert result == "helloworld"

@pytest.mark.skipif(not CYTHON_AVAILABLE, reason="Cython not compiled")
def test_cython_zero_width_edge_cases():
    from myspellchecker.text.normalize_c import remove_zero_width_chars

    test_cases = [
        ("hello\u200bworld", "helloworld"),   # ZWSP
        ("မြန်မာ\u200b", "မြန်မာ"),          # Trailing ZWSP
        ("", ""),                              # Empty string
    ]
    for text, expected in test_cases:
        assert remove_zero_width_chars(text) == expected
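A consistency test can also assert parity directly: run every case through both paths and require identical output. In this self-contained sketch a trivial pure-Python reference stands in for both implementations, since the compiled module may be absent:

```python
# Hypothetical parity check: the pure-Python reference and the fast path
# must agree on every case. Both functions here are stand-ins for illustration.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def remove_zero_width_py(text: str) -> str:      # reference implementation
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

remove_zero_width_fast = remove_zero_width_py    # stand-in for the Cython path

def test_parity():
    cases = ["hello\u200bworld", "မြန်မာ\u200b", "", "plain"]
    for text in cases:
        assert remove_zero_width_fast(text) == remove_zero_width_py(text)

test_parity()
```

In the real suite, gate the fast path behind the `CYTHON_AVAILABLE` flag as above so the test is skipped rather than failing when extensions are not compiled.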

Adding New Cython Modules

  1. Create .pyx file:
    src/myspellchecker/new_module/fast_impl.pyx
    
  2. Create .pxd file (if needed for cross-module imports):
    src/myspellchecker/new_module/fast_impl.pxd
    
  3. Add to setup.py:
    Extension(
        name="myspellchecker.new_module.fast_impl",
        sources=["src/myspellchecker/new_module/fast_impl.pyx"],
        language="c++",
    ),
    
  4. Create Python wrapper:
    # new_module/__init__.py
    try:
        from .fast_impl import function
    except ImportError:
        def function(*args, **kwargs):
            # Pure-Python fallback
            pass
    
  5. Add tests and rebuild:
    python setup.py build_ext --inplace
    pytest tests/test_fast_impl.py
    

Troubleshooting

"Cannot find Cython" during build

pip install "cython>=3.0"

"libomp not found" on macOS

brew install libomp

# If build still fails:
export LDFLAGS="-L/opt/homebrew/opt/libomp/lib"
export CPPFLAGS="-I/opt/homebrew/opt/libomp/include"
python setup.py build_ext --inplace

Module import fails after changes

python setup.py build_ext --inplace --force

Compiler errors (C++ standard)

export CXXFLAGS="-std=c++11"
python setup.py build_ext --inplace

Segmentation fault in Cython code

Common causes: releasing GIL while accessing Python objects, buffer overflow in typed memoryviews, use-after-free in C++ containers.
# Build with debug symbols and debug with lldb
python setup.py build_ext --inplace --debug
lldb python -c "import myspellchecker"
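Python's built-in faulthandler can also surface the Python-level frame at the moment of a crash, before reaching for lldb; a quick sketch:

```python
import faulthandler

# Dump a Python traceback on SIGSEGV/SIGABRT, even when the crash
# originates inside compiled extension code
faulthandler.enable()

# Now import and exercise the suspect extension, e.g.:
# import myspellchecker
print(faulthandler.is_enabled())  # True
```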

Best Practices

  1. Prefer a Python fallback where practical - Pattern B gives graceful degradation (note that normalize.py currently does not)
  2. Use type declarations - fully typed cdef functions for speed
  3. Minimize GIL releases - only release when safe (pure C/C++ operations)
  4. Use memory views for arrays - typed double[:] for efficient array access
  5. Document Cython-specific behavior - note which implementation is active

See Also