The myword and CRF word segmentation engines require binary resources (dictionary and model files) that are hosted on HuggingFace and downloaded automatically on first use. This page covers cache management, offline mode, and custom cache directories.

Overview

from myspellchecker.tokenizers.resource_loader import (
    get_resource_path,
    get_segmentation_mmap_path,
    get_crf_model_path,
    ensure_resources_available,
)

# Get path to resource (auto-downloads if needed)
mmap_path = get_segmentation_mmap_path()
crf_path = get_crf_model_path()

Available Resources

| Resource | Description | Size | Use Case |
|---|---|---|---|
| segmentation | Word segmentation dictionary (mmap) | ~50 MB | Word tokenization |
| crf | CRF model for word segmentation | ~10 MB | Word tokenization |

Resource Resolution

Resources are resolved in this order:
  1. Local bundled path: Check if package includes the file
  2. Cache directory: Check previously downloaded files
  3. Download: Fetch from HuggingFace on first use
get_resource_path follows this order. Its signature:
from pathlib import Path
from typing import Optional

def get_resource_path(
    name: str,
    cache_dir: Optional[Path] = None,
    force_download: bool = False,
) -> Path:
    """Get path to a resource, downloading if necessary.

    Args:
        name: Resource name ("segmentation", "crf")
        cache_dir: Custom cache directory
        force_download: Force re-download even if cached

    Returns:
        Path to the resource file

    Raises:
        ValueError: If resource name is unknown
        RuntimeError: If download fails
    """
    ...  # body omitted; only the interface is shown here

Core Functions

get_resource_path

General-purpose resource getter:
from pathlib import Path

from myspellchecker.tokenizers.resource_loader import get_resource_path

# Get segmentation dictionary
path = get_resource_path("segmentation")

# Get CRF model
path = get_resource_path("crf")

# Force re-download
path = get_resource_path("segmentation", force_download=True)

# Custom cache directory
path = get_resource_path("crf", cache_dir=Path("/custom/cache"))

Convenience Functions

Specific getters for each resource:
from myspellchecker.tokenizers.resource_loader import (
    get_segmentation_mmap_path,
    get_crf_model_path,
)

# Word segmentation dictionary
mmap_path = get_segmentation_mmap_path()

# CRF word segmentation model
crf_path = get_crf_model_path()

ensure_resources_available

Download all resources at once:
from myspellchecker.tokenizers.resource_loader import ensure_resources_available

# Download all missing resources
paths = ensure_resources_available()
print(paths)
# {
#     "segmentation": Path("~/.cache/myspellchecker/resources/segmentation.mmap"),
#     "crf": Path("~/.cache/myspellchecker/resources/wordseg_c2_crf.crfsuite"),
# }

# Force re-download all
paths = ensure_resources_available(force_download=True)

clear_cache

Clear the cache directory:
from pathlib import Path

from myspellchecker.tokenizers.resource_loader import clear_cache

# Clear default cache
clear_cache()

# Clear custom cache
clear_cache(cache_dir=Path("/custom/cache"))

Cache Locations

Default Cache Directory

DEFAULT_CACHE_DIR = Path.home() / ".cache" / "myspellchecker" / "resources"
# Example: /home/user/.cache/myspellchecker/resources/

Local Bundled Paths

Resources can be bundled with the package:
# Project structure
myspellchecker/
├── data/
│   └── models/
│       ├── segmentation.mmap    # Bundled segmentation dict
│       └── wordseg_c2_crf.crfsuite  # Bundled CRF model
If files exist locally, they’re used instead of downloading.
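The lookup order from Resource Resolution (bundled file, then cache, then download) can be sketched as follows. This is a minimal illustration, not the library's implementation: `resolve_resource`, `FILENAMES`, and the injected `download` callable are hypothetical names, and the choice that `force_download` bypasses a bundled copy is an assumption.

```python
from pathlib import Path
from typing import Callable

# File names taken from the bundled layout above; the mapping itself is hypothetical.
FILENAMES = {
    "segmentation": "segmentation.mmap",
    "crf": "wordseg_c2_crf.crfsuite",
}


def resolve_resource(
    name: str,
    bundled_dir: Path,
    cache_dir: Path,
    download: Callable[[str, Path], None],
    force_download: bool = False,
) -> Path:
    """Return a path for `name`, trying bundled, then cache, then download."""
    if name not in FILENAMES:
        raise ValueError(f"Unknown resource: {name}. Available: {list(FILENAMES)}")

    bundled = bundled_dir / FILENAMES[name]
    cached = cache_dir / FILENAMES[name]

    # 1. A bundled copy wins (assumption: unless a re-download is requested).
    if bundled.exists() and not force_download:
        return bundled
    # 2. Otherwise reuse a previously downloaded copy.
    if cached.exists() and not force_download:
        return cached
    # 3. Nothing usable locally: fetch into the cache directory.
    cache_dir.mkdir(parents=True, exist_ok=True)
    download(name, cached)
    return cached
```

Passing the downloader in as a callable keeps the lookup logic testable without network access.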

HuggingFace Integration

Resources are hosted on HuggingFace Datasets:
HF_REPO = "https://huggingface.co/datasets/thettwe/myspellchecker-resources/resolve/v1.0.0"

RESOURCE_URLS = {
    "segmentation": f"{HF_REPO}/segmentation/segmentation.mmap",
    "crf": f"{HF_REPO}/models/wordseg_c2_crf.crfsuite",
}
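How the loader fetches these URLs is not shown here. One common approach, sketched below with the standard library only, is to stream the response into a temporary file and rename it into place so an interrupted transfer never leaves a partial file in the cache. The helper name `download_to` and its `timeout` parameter are hypothetical.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_to(url: str, dest: Path, timeout: float = 30.0) -> Path:
    """Download `url` to `dest`, replacing `dest` only after a full transfer."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to `dest` so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url, timeout=timeout) as resp:
            shutil.copyfileobj(resp, tmp)
        os.replace(tmp_name, dest)
    except Exception as exc:
        # Remove the partial file and surface a single error type to callers.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise RuntimeError(f"Download failed for {url}: {exc}") from exc
    return dest
```

Wrapping failures in `RuntimeError` mirrors the exception documented for `get_resource_path`; an existing cached file is left untouched when the download fails.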

Download Behavior

First Use

# First time: Downloads and caches
path = get_segmentation_mmap_path()
# Output:
# INFO: Downloading segmentation resource (first time only)...
# INFO: Download complete: ~/.cache/myspellchecker/resources/segmentation.mmap

Subsequent Uses

# Subsequent calls: Uses cache (silent)
path = get_segmentation_mmap_path()
# No output - uses cached file

Force Download

# Force re-download
path = get_segmentation_mmap_path(force_download=True)
# Output:
# INFO: Downloading segmentation resource...

Error Handling

Unknown Resource

try:
    path = get_resource_path("unknown_resource")
except ValueError as e:
    print(e)  # "Unknown resource: unknown_resource. Available: ['segmentation', 'crf']"

Download Failure

try:
    path = get_resource_path("segmentation")
except RuntimeError as e:
    print(f"Download failed: {e}")

Network Issues

When a download fails, you can catch the error and fall back to a local copy if one exists:
from pathlib import Path

from myspellchecker.tokenizers.resource_loader import get_resource_path

try:
    path = get_resource_path("segmentation")
except Exception:
    # Fall back to a local copy of the resource if available
    local_path = Path("data/models/segmentation.mmap")
    if local_path.exists():
        path = local_path
    else:
        raise

Integration Examples

With DefaultSegmenter

from myspellchecker.segmenters import DefaultSegmenter

# Resources are loaded automatically with default word engine
segmenter = DefaultSegmenter(word_engine="myword")

# Alternative engines
segmenter_crf = DefaultSegmenter(word_engine="crf")

With CRF Tokenizer

from myspellchecker.tokenizers.resource_loader import get_crf_model_path

# Get CRF model path (automatically cached)
model_path = get_crf_model_path()
print(f"CRF model cached at: {model_path}")

# Note: CRF model is used internally by segmenters when word_engine='crf'
# See configuration documentation for usage

Offline Mode

Pre-download resources for offline use:
# Download all resources during installation/setup
from pathlib import Path

from myspellchecker.tokenizers.resource_loader import ensure_resources_available

# Download to default cache
ensure_resources_available()

# Or download to specific directory for deployment
ensure_resources_available(cache_dir=Path("/app/resources"))
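Before cutting network access, it can help to confirm that every pre-downloaded file is actually on disk. The sketch below assumes a name-to-path mapping shaped like the dictionary printed in the `ensure_resources_available` example; `missing_resources` is a hypothetical helper, not part of the library.

```python
from pathlib import Path
from typing import Dict, List


def missing_resources(paths: Dict[str, Path]) -> List[str]:
    """Return the names whose file is absent or empty, sorted alphabetically."""
    return sorted(
        name
        for name, path in paths.items()
        if not path.is_file() or path.stat().st_size == 0
    )
```

A deployment script could pass the result of `ensure_resources_available()` to this helper and abort if the returned list is non-empty.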

Configuration

Environment Variables

| Variable | Description | Default |
|---|---|---|
| MYSPELL_CACHE_DIR | Custom cache directory | ~/.cache/myspellchecker/resources |
| MYSPELL_OFFLINE | Disable downloads | false |
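For illustration, this is one way such variables could be interpreted. The parsing rules are assumptions, in particular which spellings count as "true" for MYSPELL_OFFLINE, and the function names are hypothetical; the library's own handling may differ.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_CACHE_DIR = Path.home() / ".cache" / "myspellchecker" / "resources"


def cache_dir_from_env(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return MYSPELL_CACHE_DIR as a Path, or the default when unset or blank."""
    env = os.environ if env is None else env
    value = env.get("MYSPELL_CACHE_DIR", "").strip()
    return Path(value).expanduser() if value else DEFAULT_CACHE_DIR


def offline_from_env(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when MYSPELL_OFFLINE holds a truthy value."""
    env = os.environ if env is None else env
    # Assumption: these spellings mean "on"; anything else means "off".
    return env.get("MYSPELL_OFFLINE", "false").strip().lower() in {"1", "true", "yes", "on"}
```

Accepting an explicit mapping makes both helpers easy to exercise in tests without mutating the process environment.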

Custom Cache Directory

import os
from pathlib import Path
from myspellchecker.tokenizers.resource_loader import get_resource_path

# Via environment variable
os.environ["MYSPELL_CACHE_DIR"] = "/custom/path"

# Or via function parameter
path = get_resource_path("segmentation", cache_dir=Path("/custom/path"))

Best Practices

1. Pre-download in Production

# In deployment script
from myspellchecker.tokenizers.resource_loader import ensure_resources_available

# Download all resources before starting application
paths = ensure_resources_available()
print(f"Resources ready: {paths}")

2. Use Custom Cache for Containers

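# In Docker, point the cache at a persistent volume
from pathlib import Path
from myspellchecker.tokenizers.resource_loader import ensure_resources_available
ensure_resources_available(cache_dir=Path("/app/data/resources"))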

3. Handle Network Failures

import logging
from myspellchecker.tokenizers.resource_loader import get_resource_path

try:
    path = get_resource_path("segmentation")
except Exception as e:
    logging.warning("Failed to download resource: %s", e)
    # Use fallback or raise user-friendly error
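Transient network errors often succeed on a second attempt, so one possible fallback is a small retry wrapper with exponential backoff. This is a sketch under assumptions: `fetch_with_retry` is a hypothetical helper, and it retries only on `RuntimeError`, the exception documented for failed downloads.

```python
import time
from pathlib import Path
from typing import Callable


def fetch_with_retry(
    fetch: Callable[[], Path],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Path:
    """Call `fetch` up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except RuntimeError:
            # Re-raise the last failure so the caller sees the real error.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

In application code the `fetch` argument could be `lambda: get_resource_path("segmentation")`; the `sleep` parameter exists so tests can run without real delays.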

4. Clear Cache Periodically

# Clear and re-download to get latest resources
from myspellchecker.tokenizers.resource_loader import clear_cache, ensure_resources_available

clear_cache()
ensure_resources_available()

See Also