Word segmentation (myword and CRF engines) requires binary resources, specifically a dictionary file and a CRF model, that are too large to bundle with the pip package. They download from HuggingFace on first use and are cached locally. This page explains how the cache works, how to pre-populate it for offline or airgapped deployments, and how to point it at a custom directory.
Overview
from myspellchecker.tokenizers.resource_loader import (
    get_resource_path,
    get_segmentation_mmap_path,
    get_crf_model_path,
    get_curated_lexicon_path,
)
# Get path to resource (auto-downloads if needed)
mmap_path = get_segmentation_mmap_path()
crf_path = get_crf_model_path()
lexicon_path = get_curated_lexicon_path()
Available Resources
| Resource | Description | Size | Use Case |
|---|---|---|---|
| segmentation | Word segmentation dictionary (mmap) | ~50MB | Word tokenization |
| crf | CRF model for word segmentation | ~10MB | Word tokenization |
| curated_lexicon | Curated lexicon CSV | ~1MB | Dictionary enrichment |
Resource Resolution
Resources are resolved in this order:
- Local bundled path: Check if package includes the file
- Cache directory: Check previously downloaded files
- Download: Fetch from HuggingFace on first use
def get_resource_path(
    name: str,
    cache_dir: Optional[Path] = None,
    force_download: bool = False,
) -> Path:
    """Get path to a resource, downloading if necessary.

    Args:
        name: Resource name ("segmentation", "crf")
        cache_dir: Custom cache directory
        force_download: Skip bundled resources (does NOT force
            re-download of cached resources; delete the cache
            directory manually to force a fresh download)

    Returns:
        Path to the resource file

    Raises:
        ValueError: If resource name is unknown
        RuntimeError: If download fails
    """
Core Functions
get_resource_path
General-purpose resource getter:
from pathlib import Path
from myspellchecker.tokenizers.resource_loader import get_resource_path
# Get segmentation dictionary
path = get_resource_path("segmentation")
# Get CRF model
path = get_resource_path("crf")
# Skip bundled resources, use cached or download
path = get_resource_path("segmentation", force_download=True)
# Custom cache directory
path = get_resource_path("crf", cache_dir=Path("/custom/cache"))
Convenience Functions
Specific getters for each resource:
from myspellchecker.tokenizers.resource_loader import (
    get_segmentation_mmap_path,
    get_crf_model_path,
    get_curated_lexicon_path,
)
# Word segmentation dictionary
mmap_path = get_segmentation_mmap_path()
# CRF word segmentation model
crf_path = get_crf_model_path()
# Curated lexicon
lexicon_path = get_curated_lexicon_path()
Downloading All Resources
To pre-download all resources, call each getter:
from myspellchecker.tokenizers.resource_loader import (
    get_segmentation_mmap_path,
    get_crf_model_path,
    get_curated_lexicon_path,
)
# Download all resources (auto-caches)
mmap_path = get_segmentation_mmap_path()
crf_path = get_crf_model_path()
lexicon_path = get_curated_lexicon_path()
# Skip bundled resources, use cached or download
mmap_path = get_segmentation_mmap_path(force_download=True)
crf_path = get_crf_model_path(force_download=True)
lexicon_path = get_curated_lexicon_path(force_download=True)
Cache Locations
Default Cache Directory
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "myspellchecker" / "resources"
# Example: /home/user/.cache/myspellchecker/resources/
Local Bundled Paths
Resources can be bundled with the package. If the files exist locally, they are used instead of downloading.
HuggingFace Integration
Resources are hosted on HuggingFace Datasets:
RESOURCE_VERSION = "main"
HF_REPO = f"https://huggingface.co/datasets/thettwe/myspellchecker-resources/resolve/{RESOURCE_VERSION}"
RESOURCE_URLS = {
    "segmentation": f"{HF_REPO}/segmentation/segmentation.mmap",
    "crf": f"{HF_REPO}/models/wordseg_c2_crf.crfsuite",
    "curated_lexicon": f"{HF_REPO}/curated_lexicon/curated_lexicon.csv",
}
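Because the version is part of every URL, pinning a branch or tag changes all download locations at once. The sketch below rebuilds the URL table for an arbitrary version. `build_resource_urls` and `RESOURCE_PATHS` are hypothetical names introduced for illustration, not part of the package, and whether a given tag such as `v1.0.0` exists in the repository is not confirmed here.

```python
HF_REPO_BASE = "https://huggingface.co/datasets/thettwe/myspellchecker-resources/resolve"

# Relative paths copied from the RESOURCE_URLS table above.
RESOURCE_PATHS = {
    "segmentation": "segmentation/segmentation.mmap",
    "crf": "models/wordseg_c2_crf.crfsuite",
    "curated_lexicon": "curated_lexicon/curated_lexicon.csv",
}


def build_resource_urls(version: str = "main", base: str = HF_REPO_BASE) -> dict[str, str]:
    """Return a name -> download URL mapping for one branch or tag.

    Hypothetical helper: it mirrors how the URLs above are composed,
    but it is not the library's own function.
    """
    root = f"{base.rstrip('/')}/{version}"
    return {name: f"{root}/{path}" for name, path in RESOURCE_PATHS.items()}


if __name__ == "__main__":
    for name, url in build_resource_urls("main").items():
        print(f"{name}: {url}")
```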
Download Behavior
First Use
# First time: Downloads and caches
path = get_segmentation_mmap_path()
# Output:
# INFO: Downloading segmentation resource (first time only)...
# INFO: Download complete: ~/.cache/myspellchecker/resources/segmentation.mmap
Subsequent Uses
# Subsequent calls: Uses cache (silent)
path = get_segmentation_mmap_path()
# No output - uses cached file
Skip Bundled Resources
# Skip bundled resources, use cached or download from HuggingFace
path = get_segmentation_mmap_path(force_download=True)
# Note: This skips the package-bundled resource check, but does NOT
# re-download if the resource is already cached. To force a fresh
# download, delete the cache directory first.
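The lookup order and the `force_download` semantics described on this page can be sketched as a small standalone function. This is a simplified stand-in under stated assumptions, not the package's implementation: `resolve_resource`, its parameters, and the injected `download` callable are all hypothetical, and the real loader may differ in details such as logging and error types.

```python
from pathlib import Path
from typing import Callable


def resolve_resource(
    filename: str,
    bundled_dir: Path,
    cache_dir: Path,
    download: Callable[[Path], None],
    force_download: bool = False,
) -> Path:
    """Illustrative lookup: bundled file, then cache, then download."""
    # Step 1: a file shipped with the package wins, unless the caller
    # asked to skip bundled resources.
    bundled = bundled_dir / filename
    if not force_download and bundled.exists():
        return bundled

    # Step 2: a previously downloaded copy is reused as-is, even when
    # force_download is set (it does not trigger a re-download).
    cached = cache_dir / filename
    if cached.exists():
        return cached

    # Step 3: nothing usable locally, so fetch into the cache.
    cache_dir.mkdir(parents=True, exist_ok=True)
    try:
        download(cached)
    except OSError as exc:
        raise RuntimeError(f"Download failed for {filename}: {exc}") from exc
    return cached
```

Passing the downloader in as a callable keeps the sketch testable without network access; in the sketch, calling it twice with `force_download=True` invokes `download` only once, because the second call finds the cached file.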
Error Handling
Unknown Resource
try:
    path = get_resource_path("unknown_resource")
except ValueError as e:
    print(e)  # "Unknown resource: unknown_resource. Available: ['segmentation', 'crf']"
Download Failure
try:
    path = get_resource_path("segmentation")
except RuntimeError as e:
    print(f"Download failed: {e}")
Network Issues
When a download fails because of a network problem, the loader raises an exception. To degrade gracefully, catch it and fall back to a local copy if one exists:
from pathlib import Path
from myspellchecker.tokenizers.resource_loader import get_resource_path
try:
    path = get_resource_path("segmentation")
except Exception:
    # Fall back to bundled resource if available
    local_path = Path("data/models/segmentation.mmap")
    if local_path.exists():
        path = local_path
    else:
        raise
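Transient network errors often clear up on a second attempt, so a retry wrapper can be tried before falling back. The helper below is a generic sketch, not something the package provides: `call_with_retries` and its parameters are hypothetical, and it assumes download failures surface as `RuntimeError`, as the `get_resource_path` docstring states.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on RuntimeError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except RuntimeError:
            # Out of retries: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, then 4x, ... between attempts.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

With the package installed, a call might look like `call_with_retries(lambda: get_resource_path("segmentation"))`. The `sleep` parameter exists only so the delay can be replaced in tests.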
Integration Examples
With DefaultSegmenter
from myspellchecker.segmenters import DefaultSegmenter
# Resources are loaded automatically with default word engine
segmenter = DefaultSegmenter(word_engine="myword")
# Alternative engines
segmenter_crf = DefaultSegmenter(word_engine="crf")
With CRF Tokenizer
from myspellchecker.tokenizers.resource_loader import get_crf_model_path
# Get CRF model path (automatically cached)
model_path = get_crf_model_path()
print(f"CRF model cached at: {model_path}")
# Note: CRF model is used internally by segmenters when word_engine='crf'
# See configuration documentation for usage
Offline Mode
Pre-download resources for offline use:
from pathlib import Path
from myspellchecker.tokenizers.resource_loader import get_resource_path
# Download to default cache during installation/setup
get_resource_path("segmentation")
get_resource_path("crf")
# Or download to specific directory for deployment
get_resource_path("segmentation", cache_dir=Path("/app/resources"))
get_resource_path("crf", cache_dir=Path("/app/resources"))
Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
| MYSPELL_CACHE_DIR | Custom cache directory | ~/.cache/myspellchecker/resources |
| MYSPELL_OFFLINE | Disable downloads | false |
Custom Cache Directory
import os
from pathlib import Path
from myspellchecker.tokenizers.resource_loader import get_resource_path
# Via environment variable
os.environ["MYSPELL_CACHE_DIR"] = "/custom/path"
# Or via function parameter
path = get_resource_path("segmentation", cache_dir=Path("/custom/path"))
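With two ways to set the cache directory, it helps to be explicit about which one wins. The sketch below assumes the function argument takes precedence over `MYSPELL_CACHE_DIR`, which in turn overrides the default. That precedence is an assumption for illustration; this page does not state the order, and `effective_cache_dir` is a hypothetical helper rather than a package function.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Union

DEFAULT_CACHE_DIR = Path.home() / ".cache" / "myspellchecker" / "resources"


def effective_cache_dir(
    cache_dir: Optional[Union[str, Path]] = None,
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Pick a cache directory: argument, then MYSPELL_CACHE_DIR, then default.

    The precedence shown here is assumed, not confirmed by the library.
    """
    if cache_dir is not None:
        return Path(cache_dir).expanduser()
    from_env = env.get("MYSPELL_CACHE_DIR", "").strip()
    if from_env:
        return Path(from_env).expanduser()
    return DEFAULT_CACHE_DIR


if __name__ == "__main__":
    print(effective_cache_dir())
```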
ResourceConfig
The ResourceConfig class provides a Pydantic model for configuring resource download and caching behavior. It controls the HuggingFace repository URL, version tag, and local cache directory.
from myspellchecker.core.config import ResourceConfig
resource_config = ResourceConfig(
resource_version="main", # HuggingFace version tag (bump for reproducibility)
hf_repo_base="https://huggingface.co/datasets/thettwe/myspellchecker-resources/resolve",
cache_dir=None, # None = ~/.cache/myspellchecker/resources
)
| Field | Type | Default | Description |
|---|---|---|---|
| resource_version | str | "main" | Resource version tag on HuggingFace. Bump with releases for reproducibility. |
| hf_repo_base | str | "https://huggingface.co/datasets/thettwe/myspellchecker-resources/resolve" | Base URL for the HuggingFace dataset repository (without version suffix). |
| cache_dir | str \| None | None | Local cache directory. Defaults to ~/.cache/myspellchecker/resources. Can be overridden with MYSPELL_CACHE_DIR env var. |
Airgapped / Offline Deployments
For environments without internet access, pre-download resources on a connected machine and point cache_dir to the local path:
from myspellchecker.core.config import ResourceConfig
# Point to pre-downloaded resources (no HuggingFace calls)
resource_config = ResourceConfig(
cache_dir="/app/data/resources",
)
# Or pin a specific version for reproducible builds
resource_config = ResourceConfig(
resource_version="v1.0.0",
cache_dir="/app/data/resources",
)
Alternatively, set the environment variables to redirect all resource lookups and disable downloads:
export MYSPELL_CACHE_DIR=/app/data/resources
export MYSPELL_OFFLINE=true
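Before shipping an airgapped image, it can be worth checking that the directory really contains what the loader will look for. The sketch below reports which required resources are absent or empty. It assumes cached files keep the last path segment of their download URL as the file name (the page confirms this only for `segmentation.mmap`); `missing_resources` and `EXPECTED_FILES` are hypothetical names.

```python
from pathlib import Path

# File names assumed to match the last segment of each download URL.
EXPECTED_FILES = {
    "segmentation": "segmentation.mmap",
    "crf": "wordseg_c2_crf.crfsuite",
    "curated_lexicon": "curated_lexicon.csv",
}


def missing_resources(cache_dir: Path, required=("segmentation", "crf")) -> list[str]:
    """Return names of required resources with no non-empty file in cache_dir."""
    missing = []
    for name in required:
        candidate = cache_dir / EXPECTED_FILES[name]
        # Treat zero-byte files as missing: they usually mean an
        # interrupted copy or download.
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_resources(Path("/app/data/resources"))
    print("missing:", gaps if gaps else "none")
```

Running a check like this in the image build step turns a missing file into a build failure instead of a runtime download attempt on a machine with no network.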
Best Practices
1. Pre-download in Production
# In deployment script
from myspellchecker.tokenizers.resource_loader import get_resource_path
# Download all resources before starting application
seg_path = get_resource_path("segmentation")
crf_path = get_resource_path("crf")
print(f"Resources ready: segmentation={seg_path}, crf={crf_path}")
2. Use Custom Cache for Containers
from pathlib import Path
from myspellchecker.tokenizers.resource_loader import get_resource_path
# In Docker, use a persistent volume
get_resource_path("segmentation", cache_dir=Path("/app/data/resources"))
get_resource_path("crf", cache_dir=Path("/app/data/resources"))
3. Handle Network Failures
import logging
from myspellchecker.tokenizers.resource_loader import get_resource_path
try:
    path = get_resource_path("segmentation")
except Exception as e:
    logging.warning(f"Failed to download resource: {e}")
    # Use fallback or raise user-friendly error
4. Update Cached Resources
# force_download=True only skips bundled resources, it does NOT
# re-download cached files. To get fresh copies from HuggingFace,
# delete the cache directory first, then re-download:
import shutil
from pathlib import Path
cache_dir = Path.home() / ".cache" / "myspellchecker" / "resources"
shutil.rmtree(cache_dir, ignore_errors=True)
from myspellchecker.tokenizers.resource_loader import get_resource_path
get_resource_path("segmentation")
get_resource_path("crf")
See Also