Word segmentation (the myword and CRF engines) requires binary resources (a dictionary file and a CRF model) that are too large to bundle with the pip package. They are downloaded from HuggingFace on first use and cached locally. This page explains how the cache works, how to pre-populate it for offline or airgapped deployments, and how to point it at a custom directory.

Overview

from myspellchecker.tokenizers.resource_loader import (
    get_resource_path,
    get_segmentation_mmap_path,
    get_crf_model_path,
    get_curated_lexicon_path,
)

# Get path to resource (auto-downloads if needed)
mmap_path = get_segmentation_mmap_path()
crf_path = get_crf_model_path()
lexicon_path = get_curated_lexicon_path()

Available Resources

| Resource        | Description                         | Size  | Use Case              |
| --------------- | ----------------------------------- | ----- | --------------------- |
| segmentation    | Word segmentation dictionary (mmap) | ~50MB | Word tokenization     |
| crf             | CRF model for word segmentation     | ~10MB | Word tokenization     |
| curated_lexicon | Curated lexicon CSV                 | ~1MB  | Dictionary enrichment |

Resource Resolution

Resources are resolved in this order:
  1. Local bundled path: Check if package includes the file
  2. Cache directory: Check previously downloaded files
  3. Download: Fetch from HuggingFace on first use
def get_resource_path(
    name: str,
    cache_dir: Optional[Path] = None,
    force_download: bool = False,
) -> Path:
    """Get path to a resource, downloading if necessary.

    Args:
        name: Resource name ("segmentation", "crf", "curated_lexicon")
        cache_dir: Custom cache directory
        force_download: Skip bundled resources (does NOT force
            re-download of cached resources — delete the cache
            directory manually to force a fresh download)

    Returns:
        Path to the resource file

    Raises:
        ValueError: If resource name is unknown
        RuntimeError: If download fails
    """

Core Functions

get_resource_path

General-purpose resource getter:
from myspellchecker.tokenizers.resource_loader import get_resource_path

# Get segmentation dictionary
path = get_resource_path("segmentation")

# Get CRF model
path = get_resource_path("crf")

# Skip bundled resources, use cached or download
path = get_resource_path("segmentation", force_download=True)

# Custom cache directory
path = get_resource_path("crf", cache_dir=Path("/custom/cache"))

Convenience Functions

Specific getters for each resource:
from myspellchecker.tokenizers.resource_loader import (
    get_segmentation_mmap_path,
    get_crf_model_path,
    get_curated_lexicon_path,
)

# Word segmentation dictionary
mmap_path = get_segmentation_mmap_path()

# CRF syllable model
crf_path = get_crf_model_path()

# Curated lexicon
lexicon_path = get_curated_lexicon_path()

Downloading All Resources

To pre-download all resources, call each getter:
from myspellchecker.tokenizers.resource_loader import (
    get_segmentation_mmap_path,
    get_crf_model_path,
    get_curated_lexicon_path,
)

# Download all resources (auto-caches)
mmap_path = get_segmentation_mmap_path()
crf_path = get_crf_model_path()
lexicon_path = get_curated_lexicon_path()

# Skip bundled resources, use cached or download
mmap_path = get_segmentation_mmap_path(force_download=True)
crf_path = get_crf_model_path(force_download=True)
lexicon_path = get_curated_lexicon_path(force_download=True)

Cache Locations

Default Cache Directory

DEFAULT_CACHE_DIR = Path.home() / ".cache" / "myspellchecker" / "resources"
# Example: /home/user/.cache/myspellchecker/resources/
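Combined with the MYSPELL_CACHE_DIR override described under Configuration, the effective directory can be resolved as follows. This is a sketch of the precedence (explicit argument, then environment variable, then default); the loader's internal helper may differ:

```python
import os
from pathlib import Path
from typing import Optional

def resolve_cache_dir(explicit: Optional[Path] = None) -> Path:
    """Pick the cache directory: explicit argument, then env var, then default."""
    if explicit is not None:
        return explicit
    env = os.environ.get("MYSPELL_CACHE_DIR")
    if env:
        return Path(env)
    return Path.home() / ".cache" / "myspellchecker" / "resources"
```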

Local Bundled Paths

Resources can be bundled with the package:
myspellchecker/
  data/
    models/
      segmentation.mmap
      wordseg_c2_crf.crfsuite

If these files exist locally, they're used instead of downloading.
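One common way to perform such a local check is importlib.resources. The helper below is illustrative only and may not match the loader's actual mechanism:

```python
from importlib import resources
from pathlib import Path
from typing import Optional

def bundled_path(package: str, relative: str) -> Optional[Path]:
    """Return the path of a file shipped inside `package`, or None if absent."""
    try:
        root = resources.files(package)
    except ModuleNotFoundError:
        return None  # package itself is not installed
    candidate = Path(str(root)) / relative
    return candidate if candidate.exists() else None
```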

HuggingFace Integration

Resources are hosted on HuggingFace Datasets:
RESOURCE_VERSION = "main"
HF_REPO = f"https://huggingface.co/datasets/thettwe/myspellchecker-resources/resolve/{RESOURCE_VERSION}"

RESOURCE_URLS = {
    "segmentation": f"{HF_REPO}/segmentation/segmentation.mmap",
    "crf": f"{HF_REPO}/models/wordseg_c2_crf.crfsuite",
    "curated_lexicon": f"{HF_REPO}/curated_lexicon/curated_lexicon.csv",
}
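Pinning RESOURCE_VERSION changes every URL in the table. A small helper (hypothetical, following the URL scheme shown above) makes the pattern explicit:

```python
def resource_urls(version: str = "main") -> dict:
    """Build per-resource download URLs for a given HuggingFace version tag."""
    base = (
        "https://huggingface.co/datasets/thettwe/"
        f"myspellchecker-resources/resolve/{version}"
    )
    return {
        "segmentation": f"{base}/segmentation/segmentation.mmap",
        "crf": f"{base}/models/wordseg_c2_crf.crfsuite",
        "curated_lexicon": f"{base}/curated_lexicon/curated_lexicon.csv",
    }
```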

Download Behavior

First Use

# First time: Downloads and caches
path = get_segmentation_mmap_path()
# Output:
# INFO: Downloading segmentation resource (first time only)...
# INFO: Download complete: ~/.cache/myspellchecker/resources/segmentation.mmap

Subsequent Uses

# Subsequent calls: Uses cache (silent)
path = get_segmentation_mmap_path()
# No output - uses cached file

Skip Bundled Resources

# Skip bundled resources, use cached or download from HuggingFace
path = get_segmentation_mmap_path(force_download=True)
# Note: This skips the package-bundled resource check, but does NOT
# re-download if the resource is already cached. To force a fresh
# download, delete the cache directory first.

Error Handling

Unknown Resource

try:
    path = get_resource_path("unknown_resource")
except ValueError as e:
    print(e)  # e.g. "Unknown resource: unknown_resource. Available: ['segmentation', 'crf', 'curated_lexicon']"

Download Failure

try:
    path = get_resource_path("segmentation")
except RuntimeError as e:
    print(f"Download failed: {e}")

Network Issues

The loader handles network failures gracefully:
from myspellchecker.tokenizers.resource_loader import get_resource_path

try:
    path = get_resource_path("segmentation")
except Exception as e:
    # Fall back to bundled resource if available
    local_path = Path("data/models/segmentation.mmap")
    if local_path.exists():
        path = local_path
    else:
        raise

Integration Examples

With DefaultSegmenter

from myspellchecker.segmenters import DefaultSegmenter

# Resources are loaded automatically with default word engine
segmenter = DefaultSegmenter(word_engine="myword")

# Alternative engines
segmenter_crf = DefaultSegmenter(word_engine="crf")

With CRF Tokenizer

from myspellchecker.tokenizers.resource_loader import get_crf_model_path

# Get CRF model path (automatically cached)
model_path = get_crf_model_path()
print(f"CRF model cached at: {model_path}")

# Note: CRF model is used internally by segmenters when word_engine='crf'
# See configuration documentation for usage

Offline Mode

Pre-download resources for offline use:
from myspellchecker.tokenizers.resource_loader import get_resource_path

# Download to default cache during installation/setup
get_resource_path("segmentation")
get_resource_path("crf")

# Or download to specific directory for deployment
from pathlib import Path
get_resource_path("segmentation", cache_dir=Path("/app/resources"))
get_resource_path("crf", cache_dir=Path("/app/resources"))

Configuration

Environment Variables

| Variable          | Description            | Default                           |
| ----------------- | ---------------------- | --------------------------------- |
| MYSPELL_CACHE_DIR | Custom cache directory | ~/.cache/myspellchecker/resources |
| MYSPELL_OFFLINE   | Disable downloads      | false                             |
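A plausible reading of MYSPELL_OFFLINE is sketched below; the exact set of accepted truthy values is an assumption, not documented library behavior:

```python
import os

def offline_mode() -> bool:
    """True when MYSPELL_OFFLINE is set to a truthy value (assumed parsing)."""
    value = os.environ.get("MYSPELL_OFFLINE", "false")
    return value.strip().lower() in ("1", "true", "yes")
```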

Custom Cache Directory

import os
from pathlib import Path
from myspellchecker.tokenizers.resource_loader import get_resource_path

# Via environment variable
os.environ["MYSPELL_CACHE_DIR"] = "/custom/path"

# Or via function parameter
path = get_resource_path("segmentation", cache_dir=Path("/custom/path"))

ResourceConfig

The ResourceConfig class provides a Pydantic model for configuring resource download and caching behavior. It controls the HuggingFace repository URL, version tag, and local cache directory.
from myspellchecker.core.config import ResourceConfig

resource_config = ResourceConfig(
    resource_version="main",       # HuggingFace version tag (bump for reproducibility)
    hf_repo_base="https://huggingface.co/datasets/thettwe/myspellchecker-resources/resolve",
    cache_dir=None,                # None = ~/.cache/myspellchecker/resources
)
| Field            | Type          | Default | Description |
| ---------------- | ------------- | ------- | ----------- |
| resource_version | str           | "main"  | Resource version tag on HuggingFace. Bump with releases for reproducibility. |
| hf_repo_base     | str           | "https://huggingface.co/datasets/thettwe/myspellchecker-resources/resolve" | Base URL for the HuggingFace dataset repository (without version suffix). |
| cache_dir        | Optional[str] | None    | Local cache directory. Defaults to ~/.cache/myspellchecker/resources. Can be overridden with the MYSPELL_CACHE_DIR env var. |
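For readers without the package at hand, the shape of ResourceConfig can be mimicked with a dataclass. This is a stand-in for illustration only; the real class is a Pydantic model with validation, and the method names here are hypothetical:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class ResourceConfigSketch:
    resource_version: str = "main"
    hf_repo_base: str = (
        "https://huggingface.co/datasets/thettwe/myspellchecker-resources/resolve"
    )
    cache_dir: Optional[str] = None

    def repo_url(self) -> str:
        """Base URL with the version tag appended."""
        return f"{self.hf_repo_base}/{self.resource_version}"

    def resolved_cache_dir(self) -> Path:
        """Explicit cache_dir if set, else the default under the user's home."""
        if self.cache_dir:
            return Path(self.cache_dir)
        return Path.home() / ".cache" / "myspellchecker" / "resources"
```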

Airgapped / Offline Deployments

For environments without internet access, pre-download resources on a connected machine and point cache_dir to the local path:
from myspellchecker.core.config import ResourceConfig

# Point to pre-downloaded resources (no HuggingFace calls)
resource_config = ResourceConfig(
    cache_dir="/app/data/resources",
)

# Or pin a specific version for reproducible builds
resource_config = ResourceConfig(
    resource_version="v1.0.0",
    cache_dir="/app/data/resources",
)
Alternatively, set the environment variable to redirect all resource lookups:
export MYSPELL_CACHE_DIR=/app/data/resources
export MYSPELL_OFFLINE=true
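For container builds, the pre-populated cache from the connected machine can be copied into place with the standard library. This staging helper is hypothetical, not part of the package:

```python
import shutil
from pathlib import Path

def stage_resources(src: Path, dest: Path) -> None:
    """Copy a pre-downloaded resource cache into the deployment directory."""
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dest, dirs_exist_ok=True)
```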

Best Practices

1. Pre-download in Production

# In deployment script
from myspellchecker.tokenizers.resource_loader import get_resource_path

# Download all resources before starting application
seg_path = get_resource_path("segmentation")
crf_path = get_resource_path("crf")
print(f"Resources ready: segmentation={seg_path}, crf={crf_path}")

2. Use Custom Cache for Containers

from pathlib import Path
from myspellchecker.tokenizers.resource_loader import get_resource_path

# In Docker, use a persistent volume
get_resource_path("segmentation", cache_dir=Path("/app/data/resources"))
get_resource_path("crf", cache_dir=Path("/app/data/resources"))

3. Handle Network Failures

import logging
from myspellchecker.tokenizers.resource_loader import get_resource_path

try:
    path = get_resource_path("segmentation")
except Exception as e:
    logging.warning(f"Failed to download resource: {e}")
    # Use fallback or raise user-friendly error

4. Update Cached Resources

# force_download=True only skips bundled resources, it does NOT
# re-download cached files. To get fresh copies from HuggingFace,
# delete the cache directory first, then re-download:
import shutil
from pathlib import Path

cache_dir = Path.home() / ".cache" / "myspellchecker" / "resources"
shutil.rmtree(cache_dir, ignore_errors=True)

from myspellchecker.tokenizers.resource_loader import get_resource_path

get_resource_path("segmentation")
get_resource_path("crf")

See Also