Overview
Available Resources
| Resource | Description | Size | Use Case |
|---|---|---|---|
| segmentation | Word segmentation dictionary (mmap) | ~50MB | Word tokenization |
| crf | CRF model for word segmentation | ~10MB | Word tokenization |
| curated_lexicon | Curated lexicon CSV | ~1MB | Dictionary enrichment |
Resource Resolution
Resources are resolved in this order:
- Local bundled path: Check whether the package includes the file
- Cache directory: Check previously downloaded files
- Download: Fetch from HuggingFace on first use
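The resolution order above can be sketched in a few lines. This is an illustrative stand-in, not the library's implementation: `resolve_resource`, its parameters, and the `download` callback are hypothetical names, and the sketch assumes each resource is a single file stored under its own name.

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_resource(
    filename: str,
    bundled_dir: Optional[Path],
    cache_dir: Path,
    download: Callable[[str, Path], None],
) -> Path:
    """Return a local path for `filename`, trying bundled, cache, then download."""
    # 1. Local bundled path: use the file if the package ships it.
    if bundled_dir is not None:
        bundled = bundled_dir / filename
        if bundled.is_file():
            return bundled

    # 2. Cache directory: reuse a previously downloaded copy.
    cached = cache_dir / filename
    if cached.is_file():
        return cached

    # 3. Download: fetch into the cache on first use.
    cache_dir.mkdir(parents=True, exist_ok=True)
    download(filename, cached)
    if not cached.is_file():
        raise RuntimeError(f"download did not produce {cached}")
    return cached
```

Passing `bundled_dir=None` skips the first step, which is one way a "skip bundled resources" option could behave; whether the library exposes it that way is not confirmed here.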
Core Functions
get_resource_path
get_resource_path is the general-purpose resource getter.
Convenience Functions
Each resource also has its own specific getter.
Downloading All Resources
To pre-download all resources, call the getter for each resource.
Cache Locations
Default Cache Directory
Local Bundled Paths
Resources can be bundled with the package:
myspellchecker
data
models
segmentation.mmap
wordseg_c2_crf.crfsuite
HuggingFace Integration
Resources are hosted on HuggingFace Datasets.
Download Behavior
First Use
Subsequent Uses
Skip Bundled Resources
Error Handling
Unknown Resource
Download Failure
Network Issues
The loader handles network failures gracefully.
Integration Examples
With DefaultSegmenter
With CRF Tokenizer
Offline Mode
Pre-download resources for offline use.
Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
| MYSPELL_CACHE_DIR | Custom cache directory | ~/.cache/myspellchecker/resources |
| MYSPELL_OFFLINE | Disable downloads | false |
Custom Cache Directory
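As a rough sketch of how a custom cache directory could be chosen, the function below checks an explicit value, then the MYSPELL_CACHE_DIR variable, then the documented default. The function name `resolve_cache_dir` is hypothetical, and the precedence of an explicit value over the environment variable is an assumption, not confirmed library behavior.

```python
import os
from pathlib import Path
from typing import Optional

# Documented default location: ~/.cache/myspellchecker/resources
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "myspellchecker" / "resources"


def resolve_cache_dir(explicit: Optional[str] = None) -> Path:
    """Pick the cache directory: explicit argument, then env var, then default."""
    # Assumption: an explicit value wins over MYSPELL_CACHE_DIR.
    chosen = explicit or os.environ.get("MYSPELL_CACHE_DIR")
    return Path(chosen).expanduser() if chosen else DEFAULT_CACHE_DIR
```

In a container, setting MYSPELL_CACHE_DIR to a mounted volume keeps downloaded files across restarts.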
ResourceConfig
The ResourceConfig class provides a Pydantic model for configuring resource download and caching behavior. It controls the HuggingFace repository URL, version tag, and local cache directory.
| Field | Type | Default | Description |
|---|---|---|---|
| resource_version | str | "main" | Resource version tag on HuggingFace. Bump with releases for reproducibility. |
| hf_repo_base | str | "https://huggingface.co/datasets/thettwe/myspellchecker-resources/resolve" | Base URL for the HuggingFace dataset repository (without version suffix). |
| cache_dir | str \| None | None | Local cache directory. Defaults to ~/.cache/myspellchecker/resources. Can be overridden with MYSPELL_CACHE_DIR env var. |
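To make the relationship between hf_repo_base and resource_version concrete, here is a minimal stand-in that mirrors the three documented fields. It uses a standard-library dataclass rather than Pydantic so it runs without extra dependencies. `ResourceConfigSketch` and `url_for` are hypothetical names, and the assumption that files sit at the repository root under the names used here is unverified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceConfigSketch:
    """Stand-in mirroring the documented fields; not the library's class."""

    resource_version: str = "main"
    hf_repo_base: str = (
        "https://huggingface.co/datasets/thettwe/myspellchecker-resources/resolve"
    )
    cache_dir: Optional[str] = None

    def url_for(self, repo_path: str) -> str:
        """Join base URL, version tag, and a file path (hypothetical helper)."""
        base = self.hf_repo_base.rstrip("/")
        return f"{base}/{self.resource_version}/{repo_path.lstrip('/')}"


# The version tag is appended to the base URL, followed by the file path.
print(ResourceConfigSketch().url_for("segmentation.mmap"))
```

Pinning resource_version to a release tag instead of "main" means the same URL keeps returning the same file, which is the reproducibility benefit the table mentions.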
Airgapped / Offline Deployments
For environments without internet access, pre-download resources on a connected machine and point cache_dir to the local path.
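Before cutting off network access, it can help to confirm that the copied directory actually contains the files. The check below is a sketch under assumptions: `missing_resources` is a hypothetical helper, the two file names come from the bundled-paths listing in this document, and the flat layout inside the cache directory is assumed rather than confirmed. The lexicon CSV file name is not given here, so it is left out.

```python
from pathlib import Path
from typing import Iterable, List

# File names from the bundled-paths listing; flat layout is an assumption.
EXPECTED_FILES = ("segmentation.mmap", "wordseg_c2_crf.crfsuite")


def missing_resources(
    cache_dir: Path, expected: Iterable[str] = EXPECTED_FILES
) -> List[str]:
    """Return the expected file names that are absent or empty in cache_dir."""
    missing = []
    for name in expected:
        path = cache_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Running this in a deployment health check and failing fast on a non-empty result avoids discovering a missing model only when the first download attempt fails.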
Best Practices
1. Pre-download in Production
2. Use Custom Cache for Containers
3. Handle Network Failures
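One common way to handle transient network failures is to retry with exponential backoff. The wrapper below is a generic sketch: `with_retries` and the `fetch` callable are hypothetical, and catching OSError is an assumption about which errors a download raises (many Python network errors, such as ConnectionError and TimeoutError, derive from it).

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

A resource getter could be wrapped as `with_retries(lambda: getter())`, with the caller falling back to a degraded mode if the final attempt still raises.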
4. Update Cached Resources
See Also
- Segmenters - Text segmentation using resources
- Configuration Guide - General configuration
- Data Pipeline - Building dictionaries
- Installation Guide - Setup instructions