Word segmentation (the myword and CRF engines) requires binary resources (a dictionary file and a CRF model) that are too large to bundle with the pip package. They are downloaded from HuggingFace on first use and cached locally. This page explains how the cache works, how to pre-populate it for offline or airgapped deployments, and how to point it at a custom directory.

Overview

from myspellchecker.tokenizers.resource_loader import (
    get_resource_path,
    get_segmentation_mmap_path,
    get_crf_model_path,
    get_curated_lexicon_path,
)

# Get path to resource (auto-downloads if needed)
mmap_path = get_segmentation_mmap_path()
crf_path = get_crf_model_path()
lexicon_path = get_curated_lexicon_path()

Available Resources

| Resource        | Description                         | Size  | Use Case              |
| --------------- | ----------------------------------- | ----- | --------------------- |
| segmentation    | Word segmentation dictionary (mmap) | ~50MB | Word tokenization     |
| crf             | CRF model for word segmentation     | ~10MB | Word tokenization     |
| curated_lexicon | Curated lexicon CSV                 | ~1MB  | Dictionary enrichment |

Resource Resolution

Resources are resolved in this order:
  1. Local bundled path: Check if package includes the file
  2. Cache directory: Check previously downloaded files
  3. Download: Fetch from HuggingFace on first use
def get_resource_path(
    name: str,
    cache_dir: Optional[Path] = None,
    force_download: bool = False,
) -> Path:
    """Get path to a resource, downloading if necessary.

    Args:
        name: Resource name ("segmentation", "crf", "curated_lexicon")
        cache_dir: Custom cache directory
        force_download: Skip bundled resources (does NOT force
            re-download of cached resources — delete the cache
            directory manually to force a fresh download)

    Returns:
        Path to the resource file

    Raises:
        ValueError: If resource name is unknown
        RuntimeError: If download fails
    """

Core Functions

get_resource_path

General-purpose resource getter:
from myspellchecker.tokenizers.resource_loader import get_resource_path

# Get segmentation dictionary
path = get_resource_path("segmentation")

# Get CRF model
path = get_resource_path("crf")

# Skip bundled resources, use cached or download
path = get_resource_path("segmentation", force_download=True)

# Custom cache directory
path = get_resource_path("crf", cache_dir=Path("/custom/cache"))

Convenience Functions

Specific getters for each resource:
from myspellchecker.tokenizers.resource_loader import (
    get_segmentation_mmap_path,
    get_crf_model_path,
    get_curated_lexicon_path,
)

# Word segmentation dictionary
mmap_path = get_segmentation_mmap_path()

# CRF syllable model
crf_path = get_crf_model_path()

# Curated lexicon
lexicon_path = get_curated_lexicon_path()

Downloading All Resources

To pre-download all resources, call each getter:
from myspellchecker.tokenizers.resource_loader import (
    get_segmentation_mmap_path,
    get_crf_model_path,
    get_curated_lexicon_path,
)

# Download all resources (auto-caches)
mmap_path = get_segmentation_mmap_path()
crf_path = get_crf_model_path()
lexicon_path = get_curated_lexicon_path()

# Skip bundled resources, use cached or download
mmap_path = get_segmentation_mmap_path(force_download=True)
crf_path = get_crf_model_path(force_download=True)
lexicon_path = get_curated_lexicon_path(force_download=True)

Cache Locations

Default Cache Directory

DEFAULT_CACHE_DIR = Path.home() / ".cache" / "myspellchecker" / "resources"
# Example: /home/user/.cache/myspellchecker/resources/
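Combined with the MYSPELL_CACHE_DIR override described under Configuration, the effective directory can be resolved as follows. This is a sketch of the precedence (explicit argument, then environment variable, then default); the loader's internal helper may differ:

```python
import os
from pathlib import Path
from typing import Optional

def resolve_cache_dir(explicit: Optional[Path] = None) -> Path:
    """Pick the cache directory: explicit argument, then env var, then default."""
    if explicit is not None:
        return explicit
    env = os.environ.get("MYSPELL_CACHE_DIR")
    if env:
        return Path(env)
    return Path.home() / ".cache" / "myspellchecker" / "resources"
```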

Local Bundled Paths

Resources can be bundled with the package:
myspellchecker/
  data/
    models/
      segmentation.mmap
      wordseg_c2_crf.crfsuite

If these files exist locally, they're used instead of downloading.
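One common way to perform such a local check is importlib.resources. The helper below is illustrative only and may not match the loader's actual mechanism:

```python
from importlib import resources
from pathlib import Path
from typing import Optional

def bundled_path(package: str, relative: str) -> Optional[Path]:
    """Return the path of a file shipped inside `package`, or None if absent."""
    try:
        root = resources.files(package)
    except ModuleNotFoundError:
        return None  # package itself is not installed
    candidate = Path(str(root)) / relative
    return candidate if candidate.exists() else None
```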

HuggingFace Integration

Resources are hosted on HuggingFace Datasets:
RESOURCE_VERSION = "main"
HF_REPO = f"https://huggingface.co/datasets/thettwe/myspellchecker-resources/resolve/{RESOURCE_VERSION}"

RESOURCE_URLS = {
    "segmentation": f"{HF_REPO}/segmentation/segmentation.mmap",
    "crf": f"{HF_REPO}/models/wordseg_c2_crf.crfsuite",
    "curated_lexicon": f"{HF_REPO}/curated_lexicon/curated_lexicon.csv",
}
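Pinning RESOURCE_VERSION changes every URL in the table. A small helper (hypothetical, following the URL scheme shown above) makes the pattern explicit:

```python
def resource_urls(version: str = "main") -> dict:
    """Build per-resource download URLs for a given HuggingFace version tag."""
    base = (
        "https://huggingface.co/datasets/thettwe/"
        f"myspellchecker-resources/resolve/{version}"
    )
    return {
        "segmentation": f"{base}/segmentation/segmentation.mmap",
        "crf": f"{base}/models/wordseg_c2_crf.crfsuite",
        "curated_lexicon": f"{base}/curated_lexicon/curated_lexicon.csv",
    }
```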

Download Behavior

First Use

# First time: Downloads and caches
path = get_segmentation_mmap_path()
# Output:
# INFO: Downloading segmentation resource (first time only)...
# INFO: Download complete: ~/.cache/myspellchecker/resources/segmentation.mmap

Subsequent Uses

# Subsequent calls: Uses cache (silent)
path = get_segmentation_mmap_path()
# No output - uses cached file

Skip Bundled Resources

# Skip bundled resources, use cached or download from HuggingFace
path = get_segmentation_mmap_path(force_download=True)
# Note: This skips the package-bundled resource check, but does NOT
# re-download if the resource is already cached. To force a fresh
# download, delete the cache directory first.

Error Handling

Unknown Resource

try:
    path = get_resource_path("unknown_resource")
except ValueError as e:
    print(e)  # e.g. "Unknown resource: unknown_resource. Available: ['segmentation', 'crf', 'curated_lexicon']"

Download Failure

try:
    path = get_resource_path("segmentation")
except RuntimeError as e:
    print(f"Download failed: {e}")

Network Issues

The loader handles network failures gracefully:
from myspellchecker.tokenizers.resource_loader import get_resource_path

try:
    path = get_resource_path("segmentation")
except Exception as e:
    # Fall back to bundled resource if available
    local_path = Path("data/models/segmentation.mmap")
    if local_path.exists():
        path = local_path
    else:
        raise

Integration Examples

With DefaultSegmenter

from myspellchecker.segmenters import DefaultSegmenter

# Resources are loaded automatically with default word engine
segmenter = DefaultSegmenter(word_engine="myword")

# Alternative engines
segmenter_crf = DefaultSegmenter(word_engine="crf")

With CRF Tokenizer

from myspellchecker.tokenizers.resource_loader import get_crf_model_path

# Get CRF model path (automatically cached)
model_path = get_crf_model_path()
print(f"CRF model cached at: {model_path}")

# Note: CRF model is used internally by segmenters when word_engine='crf'
# See configuration documentation for usage

Offline Mode

Pre-download resources for offline use:
from myspellchecker.tokenizers.resource_loader import get_resource_path

# Download to default cache during installation/setup
get_resource_path("segmentation")
get_resource_path("crf")

# Or download to specific directory for deployment
from pathlib import Path
get_resource_path("segmentation", cache_dir=Path("/app/resources"))
get_resource_path("crf", cache_dir=Path("/app/resources"))

Configuration

Environment Variables

| Variable          | Description            | Default                           |
| ----------------- | ---------------------- | --------------------------------- |
| MYSPELL_CACHE_DIR | Custom cache directory | ~/.cache/myspellchecker/resources |
| MYSPELL_OFFLINE   | Disable downloads      | false                             |
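A plausible reading of MYSPELL_OFFLINE is sketched below; the exact set of accepted truthy values is an assumption, not documented library behavior:

```python
import os

def offline_mode() -> bool:
    """True when MYSPELL_OFFLINE is set to a truthy value (assumed parsing)."""
    value = os.environ.get("MYSPELL_OFFLINE", "false")
    return value.strip().lower() in ("1", "true", "yes")
```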

Custom Cache Directory

import os
from pathlib import Path
from myspellchecker.tokenizers.resource_loader import get_resource_path

# Via environment variable
os.environ["MYSPELL_CACHE_DIR"] = "/custom/path"

# Or via function parameter
path = get_resource_path("segmentation", cache_dir=Path("/custom/path"))

ResourceConfig

The ResourceConfig class provides a Pydantic model for configuring resource download and caching behavior. It controls the HuggingFace repository URL, version tag, and local cache directory.
from myspellchecker.core.config import ResourceConfig

resource_config = ResourceConfig(
    resource_version="main",       # HuggingFace version tag (bump for reproducibility)
    hf_repo_base="https://huggingface.co/datasets/thettwe/myspellchecker-resources/resolve",
    cache_dir=None,                # None = ~/.cache/myspellchecker/resources
)
| Field            | Type          | Default | Description |
| ---------------- | ------------- | ------- | ----------- |
| resource_version | str           | "main"  | Resource version tag on HuggingFace. Bump with releases for reproducibility. |
| hf_repo_base     | str           | "https://huggingface.co/datasets/thettwe/myspellchecker-resources/resolve" | Base URL for the HuggingFace dataset repository (without version suffix). |
| cache_dir        | Optional[str] | None    | Local cache directory. Defaults to ~/.cache/myspellchecker/resources. Can be overridden with the MYSPELL_CACHE_DIR env var. |
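For readers without the package at hand, the shape of ResourceConfig can be mimicked with a dataclass. This is a stand-in for illustration only; the real class is a Pydantic model with validation, and the method names here are hypothetical:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class ResourceConfigSketch:
    resource_version: str = "main"
    hf_repo_base: str = (
        "https://huggingface.co/datasets/thettwe/myspellchecker-resources/resolve"
    )
    cache_dir: Optional[str] = None

    def repo_url(self) -> str:
        """Base URL with the version tag appended."""
        return f"{self.hf_repo_base}/{self.resource_version}"

    def resolved_cache_dir(self) -> Path:
        """Explicit cache_dir if set, else the default under the user's home."""
        if self.cache_dir:
            return Path(self.cache_dir)
        return Path.home() / ".cache" / "myspellchecker" / "resources"
```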

Airgapped / Offline Deployments

For environments without internet access, pre-download resources on a connected machine and point cache_dir to the local path:
from myspellchecker.core.config import ResourceConfig

# Point to pre-downloaded resources (no HuggingFace calls)
resource_config = ResourceConfig(
    cache_dir="/app/data/resources",
)

# Or pin a specific version for reproducible builds
resource_config = ResourceConfig(
    resource_version="v1.0.0",
    cache_dir="/app/data/resources",
)
Alternatively, set the environment variable to redirect all resource lookups:
export MYSPELL_CACHE_DIR=/app/data/resources
export MYSPELL_OFFLINE=true
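For container builds, the pre-populated cache from the connected machine can be copied into place with the standard library. This staging helper is hypothetical, not part of the package:

```python
import shutil
from pathlib import Path

def stage_resources(src: Path, dest: Path) -> None:
    """Copy a pre-downloaded resource cache into the deployment directory."""
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dest, dirs_exist_ok=True)
```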

Best Practices

1. Pre-download in Production

# In deployment script
from myspellchecker.tokenizers.resource_loader import get_resource_path

# Download all resources before starting application
seg_path = get_resource_path("segmentation")
crf_path = get_resource_path("crf")
print(f"Resources ready: segmentation={seg_path}, crf={crf_path}")

2. Use Custom Cache for Containers

from pathlib import Path
from myspellchecker.tokenizers.resource_loader import get_resource_path

# In Docker, use a persistent volume
get_resource_path("segmentation", cache_dir=Path("/app/data/resources"))
get_resource_path("crf", cache_dir=Path("/app/data/resources"))

3. Handle Network Failures

import logging
from myspellchecker.tokenizers.resource_loader import get_resource_path

try:
    path = get_resource_path("segmentation")
except Exception as e:
    logging.warning(f"Failed to download resource: {e}")
    # Use fallback or raise user-friendly error

4. Update Cached Resources

# force_download=True only skips bundled resources, it does NOT
# re-download cached files. To get fresh copies from HuggingFace,
# delete the cache directory first, then re-download:
import shutil
from pathlib import Path

cache_dir = Path.home() / ".cache" / "myspellchecker" / "resources"
shutil.rmtree(cache_dir, ignore_errors=True)

from myspellchecker.tokenizers.resource_loader import get_resource_path

get_resource_path("segmentation")
get_resource_path("crf")

See Also