Learn how to build, test, debug, and contribute Cython extensions, including OpenMP parallel processing and the Python fallback pattern.

Overview

mySpellChecker uses Cython for performance-critical operations:
Module                 Purpose                        Performance Gain
batch_processor.pyx    Parallel batch processing      5-10x
frequency_counter.pyx  Fast frequency calculations    3-5x
normalize_c.pyx        Text normalization             2-3x
edit_distance_c.pyx    Levenshtein distance           4-6x
viterbi.pyx            Viterbi algorithm for POS      3-4x
word_segment.pyx       Word segmentation              2-3x
mmap_reader.pyx        Memory-mapped file reading     2-4x
syllable_rules_c.pyx   Syllable validation rules      2-3x
ingester_c.pyx         Corpus ingestion               2-3x
repair_c.pyx           Segmentation repair            2-3x
tsv_reader_c.pyx       TSV file reading               2-3x

Prerequisites

Required Tools

# macOS
brew install python cython libomp

# Ubuntu/Debian
sudo apt-get install python3-dev cython3 libomp-dev

# Windows (with Visual Studio)
pip install cython

Verify Installation

# Check Cython version
cython --version

# Check C++ compiler
c++ --version  # macOS/Linux
cl             # Windows

Project Structure

src/myspellchecker/
├── text/
│   ├── normalize.py         # Python wrapper (public API)
│   ├── normalize_c.pyx      # Cython implementation
│   └── normalize_c.pxd      # C-level declarations (header)
├── algorithms/
│   ├── viterbi.py           # Python wrapper
│   ├── viterbi.pyx          # Cython implementation
│   └── distance/
│       └── edit_distance_c.pyx  # Edit distance (Cython)
└── data_pipeline/
    ├── batch_processor.py   # Python wrapper
    └── batch_processor.pyx  # Cython implementation (uses OpenMP)

File Types

Extension   Purpose                                Git Tracked?
.pyx        Cython source code                     Yes
.pxd        C-level declarations (like C headers)  Yes
.py         Python wrapper/fallback                Yes
.cpp        Generated C++ code                     No
.so / .pyd  Compiled binary                        No
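
Generated artifacts stay out of version control. A .gitignore fragment matching the table above might look like this (a sketch; exact patterns depend on the project's layout):

# Generated by Cython - do not commit
*.cpp
*.so
*.pyd
build/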

Building Cython Extensions

Development Build

# Rebuild after modifying .pyx files
python setup.py build_ext --inplace

# Clean and rebuild
python setup.py clean --all
python setup.py build_ext --inplace

Build with Debug Symbols

# For debugging with gdb/lldb
CFLAGS="-g -O0" python setup.py build_ext --inplace

Build Options

The setup.py automatically detects:
  • OpenMP availability (macOS requires brew install libomp)
  • C++ compiler capabilities
  • Platform-specific flags
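
The platform-flag selection can be sketched roughly like this (a simplified sketch, not the project's actual setup.py; flag values are illustrative of common per-platform defaults):

```python
import sys

def openmp_build_args(platform: str = sys.platform) -> dict:
    """Pick OpenMP compiler/linker flags per platform (illustrative values)."""
    if platform == "darwin":
        # Apple clang has no built-in OpenMP runtime; libomp comes from Homebrew
        return {
            "extra_compile_args": ["-Xpreprocessor", "-fopenmp"],
            "extra_link_args": ["-lomp"],
        }
    if platform.startswith("linux"):
        return {
            "extra_compile_args": ["-fopenmp"],
            "extra_link_args": ["-fopenmp"],
        }
    if platform == "win32":
        # MSVC spells the flag differently and needs no link flag
        return {"extra_compile_args": ["/openmp"], "extra_link_args": []}
    # Unknown platform: build without OpenMP rather than fail
    return {"extra_compile_args": [], "extra_link_args": []}
```

These dicts would be splatted into setuptools Extension(...) calls; a real setup.py would additionally try a test compile to confirm the flags actually work.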

Writing Cython Code

Basic Pattern

# normalize_c.pyx
from cpython cimport PyUnicode_AsUTF8String
from libc.string cimport memcpy, strlen

cdef class Normalizer:
    """Cython normalizer for Myanmar text."""

    cdef str _text

    def __init__(self, str text):
        self._text = text

    cpdef str normalize(self):
        """Public method callable from Python."""
        return self._normalize_impl()

    cdef str _normalize_impl(self):
        """Private C-level method (faster, not callable from Python)."""
        # Real implementation goes here; placeholder returns input unchanged
        return self._text

Creating .pxd Files

# normalize_c.pxd
cdef class Normalizer:
    cdef str _text
    cpdef str normalize(self)
    cdef str _normalize_impl(self)

Cross-Module Imports

# corpus_segmenter.pyx
from myspellchecker.text.normalize_c cimport Normalizer

cdef class CorpusSegmenter:
    cdef Normalizer normalizer

    def __init__(self):
        self.normalizer = Normalizer("")

Import Pattern

The core normalize.py module imports directly from the Cython extension without pure Python fallbacks:
# normalize.py - direct Cython imports (no fallback)
from myspellchecker.text.normalize_c import (
    get_myanmar_ratio as c_get_myanmar_ratio,
    remove_zero_width_chars as c_remove_zero_width,
    reorder_myanmar_diacritics as c_reorder_diacritics,
)

def normalize(text: str, form: str = "NFC", ...) -> str:
    """Main normalization function (public API)."""
    text = c_remove_zero_width(text)
    text = c_reorder_diacritics(text)
    # ... more normalization steps
    return text
Note: Unlike some other modules that use try/except ImportError fallbacks, normalize.py requires the Cython extension. For systems without a C++ compiler, install from a pre-built wheel.
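
For reference, the try/except fallback pattern used by those other modules looks like this (a sketch; the pure-Python body and the zero-width code-point set shown here are illustrative stand-ins, not the project's actual implementation):

```python
# wrapper module: prefer the Cython extension, fall back to pure Python
try:
    from myspellchecker.text.normalize_c import (
        remove_zero_width_chars as _remove_zero_width,
    )
    CYTHON_AVAILABLE = True
except ImportError:
    CYTHON_AVAILABLE = False

    # Pure-Python fallback (illustrative): strip common zero-width chars
    _ZERO_WIDTH = {"\u200b", "\ufeff"}

    def _remove_zero_width(text: str) -> str:
        return "".join(ch for ch in text if ch not in _ZERO_WIDTH)

def remove_zero_width_chars(text: str) -> str:
    """Public API: same signature whichever backend is active."""
    return _remove_zero_width(text)
```

Callers import only the wrapper, so the binding happens once at import time and there is no per-call dispatch cost.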

OpenMP Integration

For parallel processing (used in batch_processor.pyx):
# batch_processor.pyx
from cython.parallel cimport prange, parallel
from openmp cimport omp_get_max_threads

cdef class BatchProcessor:
    cdef int num_threads

    def __init__(self, int num_threads=0):
        if num_threads <= 0:
            # omp_get_num_threads() returns 1 outside a parallel region,
            # so query the available thread count instead
            self.num_threads = omp_get_max_threads()
        else:
            self.num_threads = num_threads

    cpdef list process_batch(self, list items):
        cdef int n = len(items)
        cdef int i
        results = [None] * n

        with nogil, parallel(num_threads=self.num_threads):
            for i in prange(n, schedule='dynamic'):
                with gil:
                    results[i] = self._process_item(items[i])

        return results
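
The same fan-out shape can be written in pure Python with concurrent.futures, which makes a useful baseline when debugging the OpenMP version (a sketch; process_item here is a hypothetical stand-in for the per-item work):

```python
from concurrent.futures import ThreadPoolExecutor

def process_item(item: str) -> str:
    """Hypothetical per-item work; stands in for _process_item above."""
    return item.upper()

def process_batch(items: list, num_threads: int = 4) -> list:
    # executor.map preserves input order, mirroring results[i] assignment
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(process_item, items))
```

Under CPython's GIL this gives concurrency rather than true CPU parallelism for pure-Python work, which is exactly why the hot loop lives in Cython under nogil.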

macOS OpenMP Setup

# Install OpenMP
brew install libomp

# Set environment variables (add to ~/.zshrc)
export LDFLAGS="-L/opt/homebrew/opt/libomp/lib"
export CPPFLAGS="-I/opt/homebrew/opt/libomp/include"

Testing Cython Code

Unit Tests

# tests/test_normalize_c.py
import pytest
from myspellchecker.text.normalize import normalize
from myspellchecker.text.normalize_c import remove_zero_width_chars

class TestNormalize:
    def test_zero_width_removal(self):
        result = remove_zero_width_chars("test\u200btext")
        assert result == "testtext"

    def test_basic_normalization(self):
        result = normalize("မြန်မာ")
        assert isinstance(result, str)

    def test_cython_available(self):
        """Check if Cython extension is loaded."""
        try:
            from myspellchecker.text.normalize_c import remove_zero_width_chars as c_func
            cython_available = True
        except ImportError:
            cython_available = False
        # Test passes either way - just informational
        print(f"Cython normalize: {cython_available}")

    def test_performance(self):
        """Test normalization performance."""
        import time
        text = "test" * 1000

        start = time.perf_counter()
        for _ in range(1000):
            remove_zero_width_chars(text)
        elapsed = time.perf_counter() - start

        assert elapsed < 1.0  # Adjust threshold as needed

Benchmark Tests

# Run benchmarks
pytest tests/ -k benchmark --benchmark-only
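
A quick way to quantify the speedup outside pytest is a small timeit harness comparing the Cython function against a pure-Python baseline (a sketch; the baseline shown is illustrative, not the project's fallback code):

```python
import timeit

def py_remove_zero_width(text: str) -> str:
    """Pure-Python baseline for comparison (illustrative)."""
    return text.replace("\u200b", "").replace("\ufeff", "")

def bench(func, text: str, number: int = 1000) -> float:
    """Total seconds for `number` calls of func(text)."""
    return timeit.timeit(lambda: func(text), number=number)

if __name__ == "__main__":
    sample = "test\u200btext" * 500
    py_time = bench(py_remove_zero_width, sample)
    try:
        from myspellchecker.text.normalize_c import remove_zero_width_chars
        c_time = bench(remove_zero_width_chars, sample)
        print(f"python: {py_time:.4f}s  cython: {c_time:.4f}s  "
              f"speedup: {py_time / c_time:.1f}x")
    except ImportError:
        print(f"python: {py_time:.4f}s  (Cython extension not built)")
```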

Debugging

# In .pyx file
from libc.stdio cimport printf

cdef void debug_print(const char* msg) nogil:
    printf("DEBUG: %s\n", msg)

GDB/LLDB

# Build with debug symbols
CFLAGS="-g -O0" python setup.py build_ext --inplace

# Debug with lldb (macOS)
lldb python
(lldb) run -c "from myspellchecker.text.normalize_c import remove_zero_width_chars"

# Debug with gdb (Linux)
gdb python
(gdb) run -c "from myspellchecker.text.normalize_c import remove_zero_width_chars"

Memory Profiling

# Check for memory leaks
valgrind --leak-check=full python -c "
from myspellchecker.text.normalize_c import remove_zero_width_chars
for _ in range(10000):
    remove_zero_width_chars('test')
"

Common Pitfalls

1. Forgetting to Rebuild

After modifying .pyx files, always rebuild:
python setup.py build_ext --inplace

2. GIL Management

# Wrong - fails to compile: Python operations require the GIL
cdef void process() nogil:
    result = some_python_function()  # Compile error in a nogil context

# Correct
cdef void process() nogil:
    with gil:
        result = some_python_function()

3. Memory Management

# Wrong - memory leak
from libc.stdlib cimport malloc, free

cdef char* create_string():
    cdef char* s = <char*>malloc(100)
    return s  # Caller must free!

# Better - use Python strings
cdef str create_string():
    return "result"  # Python handles memory

4. Type Declarations

# Slow - Python object
def slow_function(x):
    return x * 2

# Fast - typed
cpdef int fast_function(int x):
    return x * 2

Performance Tips

  1. Use cdef for internal functions - Not callable from Python, but faster
  2. Use typed memoryviews - Faster than NumPy arrays in loops
  3. Minimize GIL acquisition - Use nogil where possible
  4. Use cpdef for hybrid - Callable from Python and fast from Cython
  5. Profile before optimizing - Use cython -a to see Python interactions

Annotation Output

# Generate HTML with Python interaction highlighting
cython -a normalize_c.pyx

# Yellow lines indicate Python interactions (slow)
# White lines are pure C (fast)

Contributing

When contributing Cython code:
  1. Include both .pyx and .py wrapper
  2. Add .pxd file if cross-module imports needed
  3. Write tests that work with both backends
  4. Document performance characteristics
  5. Test on multiple platforms if possible