Disk-based precomputed weights caching

Note

This caching is only related to the precomputed and precomputed-local backends in regrid().

Purpose

earthkit-regrid uses a dedicated directory to store interpolation matrices and the related index file downloaded from the remote inventory. By default this directory serves as a cache and is managed (its size is checked and limited). This means that if we run regrid() again with the same input and output grids, the matrix is loaded from the cache instead of being downloaded again. Caching also offers monitoring and disk space management: when the cache is full, cached data is deleted according to the configuration (i.e. the oldest data is deleted first). The cache is implemented using a sqlite database running in a separate thread.
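The oldest-first clean-up described above can be illustrated with a small standalone sketch. It is not the earthkit-regrid implementation: the table schema, the file names and the helper trim_oldest() are all invented for illustration, and only the ideas (a sqlite index, oldest entries removed first until the size limit is met) come from the text.

```python
import sqlite3


def trim_oldest(conn, max_bytes):
    """Delete the oldest rows until the total size fits within max_bytes.

    Hypothetical helper; returns the removed paths, oldest first.
    """
    removed = []
    total = conn.execute("SELECT COALESCE(SUM(size), 0) FROM entries").fetchone()[0]
    rows = conn.execute(
        "SELECT path, size FROM entries ORDER BY creation_date ASC"
    ).fetchall()
    for path, size in rows:
        if total <= max_bytes:
            break
        conn.execute("DELETE FROM entries WHERE path = ?", (path,))
        total -= size
        removed.append(path)
    conn.commit()
    return removed


# Assumed schema and made-up sample rows (sizes in bytes).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (path TEXT PRIMARY KEY, size INTEGER, creation_date TEXT)"
)
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [
        ("matrix_a.npz", 400, "2023-10-28 09:00:00"),
        ("matrix_b.npz", 300, "2023-10-29 09:00:00"),
        ("matrix_c.npz", 500, "2023-10-30 09:00:00"),
    ],
)

# 1200 bytes are stored; a 900-byte limit forces out only the oldest entry.
removed = trim_oldest(conn, max_bytes=900)
remaining = [
    row[0] for row in conn.execute("SELECT path FROM entries ORDER BY creation_date")
]
print(removed)
print(remaining)
```

A real cache would also delete the files on disk; the sketch only updates the index to keep the eviction order easy to follow.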

Please note that the earthkit-regrid cache configuration is managed through the Configuration.

Warning

The earthkit-regrid cache is intended to be used by a single user. Sharing the cache between multiple users is not recommended. Downloading a local copy of the data to a shared disk so that multiple users can work with it is a different use case and should be supported through mirrors.

Cache policies

The primary config option to control the cache is cache-policy, which can take the following values: user (the default), temporary and off. Each policy is described in its own section below.

The cache location can be read and modified with Python (see the details below).

Tip

See the Matrix disk cache notebook for examples.

Note

It is recommended to restart your Jupyter kernels after changing the cache policy or location.

User cache policy

When the cache-policy is “user” the cache will be active and created in a managed directory defined by the user-cache-directory config option. This is the default value.

Note

The default location of the user cache directory is "~/.cache/earthkit-regrid" and its maximum size is 5 GB.

The user cache directory is not cleaned up on exit. So the next time you start earthkit-regrid it will still be there, unless it is deleted manually or it is configured in a way that a different path is assigned to it on each startup. Also, when you run multiple sessions of earthkit-regrid under the same user, they will share the same cache.

We can query the directory path via the Configuration and also by calling the directory() cache method.

>>> from earthkit.regrid import cache, config
>>> config.set("cache-policy", "user")
>>> config.get("user-cache-directory")
'/Users/username/.cache/earthkit-regrid'
>>> cache.directory()
'/Users/username/.cache/earthkit-regrid'

The following code shows how to change the user-cache-directory config option:

>>> from earthkit.regrid import config
>>> config.get("user-cache-directory")  # Find the current cache directory
'/Users/username/.cache/earthkit-regrid'
>>> # Change the value of the setting
>>> config.set("user-cache-directory", "/big-disk/earthkit-regrid-cache")

# Python kernel restarted

>>> from earthkit.regrid import config
>>> config.get("user-cache-directory")  # Cache directory has been modified
'/big-disk/earthkit-regrid-cache'

More generally, the earthkit-regrid config options can be read, modified and reset to their default values from Python; see the Configs documentation.

Temporary cache policy

When the cache-policy is “temporary” the cache will be active and located in a managed temporary directory created by tempfile.TemporaryDirectory. This directory will be unique for each earthkit-regrid session. When the directory object goes out of scope (at the latest on exit) the cache is cleaned up.

Due to the temporary nature of this directory, its path cannot be queried via the Configuration; instead we need to call the directory() cache method.

>>> from earthkit.regrid import cache, config
>>> config.set("cache-policy", "temporary")
>>> cache.directory()
'/var/folders/ng/g0zkhc2s42xbslpsywwp_26m0000gn/T/tmp_5bf5kq8'
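The lifetime of such a directory can be seen with the standard library alone. The following is a minimal sketch of how tempfile.TemporaryDirectory behaves, not of earthkit-regrid internals; the file name matrix.bin is a made-up placeholder for a cached matrix.

```python
import os
import tempfile

# Each TemporaryDirectory object gets its own unique path.
first = tempfile.TemporaryDirectory()
second = tempfile.TemporaryDirectory()
first_path, second_path = first.name, second.name
distinct = first_path != second_path

# Write a placeholder file standing in for a cached matrix.
with open(os.path.join(first_path, "matrix.bin"), "wb") as f:
    f.write(b"\x00" * 16)
existed_before = os.path.isdir(first_path)

# Explicit cleanup removes the directory and everything in it. The same
# happens implicitly when the object is garbage collected or at exit.
first.cleanup()
second.cleanup()
existed_after = os.path.exists(first_path)

print(distinct, existed_before, existed_after)
```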

We can specify the parent directory for the temporary cache by using the temporary-cache-directory-root config option. By default it is set to None (no parent directory specified).

>>> from earthkit.regrid import cache, config
>>> s = {
...     "cache-policy": "temporary",
...     "temporary-cache-directory-root": "~/my_demo_cache",
... }
>>> config.set(s)
>>> cache.directory()
'~/my_demo_cache/tmp0iiuvsz5'
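For comparison, the standard library offers the same kind of control through the dir argument of tempfile.TemporaryDirectory. The sketch below assumes a hypothetical helper make_temp_cache() that expands "~" and creates the parent if it is missing; whether earthkit-regrid performs these steps internally is not stated here.

```python
import os
import tempfile


def make_temp_cache(root=None):
    """Create a temporary directory, optionally under a parent ``root``.

    Hypothetical helper, not part of earthkit-regrid. ``root`` may contain
    ``~``; it is expanded and created if it does not exist yet.
    """
    if root is not None:
        root = os.path.expanduser(root)
        os.makedirs(root, exist_ok=True)
    return tempfile.TemporaryDirectory(dir=root)


# A throwaway parent stands in for a path such as "~/my_demo_cache".
with tempfile.TemporaryDirectory() as parent:
    tmp = make_temp_cache(parent)
    inside_parent = os.path.samefile(os.path.dirname(tmp.name), parent)
    tmp.cleanup()

print(inside_parent)
```

With root=None the directory is created in the platform default location, which mirrors the "no parent directory specified" default.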

Off cache policy

When the cache-policy is “off” no disk-based caching is available. In this case all files are downloaded into an unmanaged temporary directory created by tempfile.TemporaryDirectory. Since caching is disabled, all repeated calls to regrid() will download the interpolation matrix again! This temporary directory will be unique for each earthkit-regrid session. When the directory object goes out of scope (at the latest on exit) the directory will be cleaned up.

Due to the temporary nature of this directory, its path cannot be queried via the Configuration; instead we need to call the directory() cache method.

>>> from earthkit.regrid import cache, config
>>> config.set("cache-policy", "off")
>>> cache.directory()
'/var/folders/ng/g0zkhc2s42xbslpsywwp_26m0000gn/T/tmp_5bf5kq8'

We can specify the parent directory for the temporary directory by using the temporary-directory-root config option. By default it is set to None (no parent directory specified).

>>> from earthkit.regrid import cache, config
>>> s = {
...     "cache-policy": "off",
...     "temporary-directory-root": "~/my_demo_tmp",
... }
>>> config.set(s)
>>> cache.directory()
'~/my_demo_tmp/tmp0iiuvsz5'

Cache methods

The cache is controlled by a global object, which we can access as earthkit.regrid.cache.

>>> from earthkit.regrid import cache
>>> cache
<earthkit.regrid.utils.caching.Cache object at 0x117be7040>

When cache-policy is user or temporary, a set of methods is available on this object to manage and interact with the cache.

Methods/properties of the cache object

Methods                    Description
policy                     Get the current cache policy object.
directory()                Return the path to the current cache directory.
size()                     Return the total number of bytes stored in the cache.
check_size()               Check the cache size and trim it down when needed.
entries()                  Dump the entries stored in the cache.
summary_dump_database()    Return the number of items and total size of the cache.
purge()                    Delete entries from the cache.

Warning

check_size() automatically runs when a new entry is added to the cache or any of the Cache config parameters changes.

Examples:

>>> from earthkit.regrid import cache
>>> cache.policy.name
'user'
>>> cache.directory()
'/Users/username/.cache/earthkit-regrid/'
>>> cache.size()
846785699
>>> cache.summary_dump_database()
(40, 846785699)
>>> d = cache.entries()
>>> len(d)
40
>>> d[0].get("creation_date")
'2023-10-30 14:48:31.320322'
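The records returned by entries() can be post-processed with ordinary Python. The sketch below works on invented sample records rather than on a live cache: only the "creation_date" key and its timestamp format are taken from the example above, while the "size" key and all values are assumptions for illustration.

```python
from datetime import datetime

# Made-up records shaped like the dictionaries shown above.
entries = [
    {"creation_date": "2023-10-30 14:48:31.320322", "size": 120},
    {"creation_date": "2023-11-02 08:15:00.000000", "size": 300},
    {"creation_date": "2023-10-29 23:59:59.999999", "size": 80},
]


def created(entry):
    """Parse the creation timestamp of one record."""
    return datetime.fromisoformat(entry["creation_date"])


# The oldest record is the first candidate for deletion.
oldest = min(entries, key=created)

# Records created before a chosen cutoff date.
cutoff = datetime(2023, 11, 1)
before_cutoff = [e for e in entries if created(e) < cutoff]

print(oldest["creation_date"])
print(len(before_cutoff))
```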

Cache limits

Warning

These config options do not work when cache-policy is off.

Maximum-cache-size

The maximum-cache-size setting ensures that earthkit-regrid does not use too much disk space. Its value sets the maximum disk space used by the earthkit-regrid cache. When the cache disk usage goes above this limit, earthkit-regrid triggers its cache cleaning mechanism before downloading additional data. The value of maximum-cache-size is absolute (such as “10G”, “10M”, “1K”). To disable it use None.
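To make the absolute size notation concrete, here is a minimal sketch of how a string such as "10G" could be turned into a byte count. The helper parse_size() is hypothetical, and the use of binary multiples (1K = 1024 bytes) is an assumption; the library may use a different convention.

```python
import re

# Assumed binary multiples; a decimal convention (1K = 1000) is equally possible.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(value):
    """Convert a size such as "10G", "5GB" or "1K" to bytes.

    Hypothetical helper. None means "no limit" and is passed through.
    """
    if value is None:
        return None
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?)B?\s*", str(value).upper())
    if match is None:
        raise ValueError(f"Unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


for text in ("10G", "10M", "1K", "5GB", None):
    print(text, parse_size(text))
```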

Maximum-cache-disk-usage

The maximum-cache-disk-usage setting ensures that earthkit-regrid does not fill your disk. Its value sets the maximum disk usage as a percentage of the filesystem containing the cache directory. When the disk usage goes above this limit, earthkit-regrid triggers its cache cleaning mechanism before downloading additional data. The value of maximum-cache-disk-usage is relative (such as “90%” or “100%”). To disable it use None.
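A relative limit of this kind can be checked with shutil.disk_usage from the standard library. The following is a sketch under stated assumptions, not the library's own check: the helpers parse_percent(), disk_usage_percent() and over_limit() are hypothetical names, and usage is computed simply as used/total for the filesystem holding the given path.

```python
import shutil
import tempfile


def parse_percent(value):
    """Convert "90%" to 90.0; None means the check is disabled."""
    if value is None:
        return None
    text = str(value).strip()
    if not text.endswith("%"):
        raise ValueError(f"Expected a percentage such as '90%', got {value!r}")
    percent = float(text[:-1])
    if not 0 <= percent <= 100:
        raise ValueError(f"Percentage out of range: {value!r}")
    return percent


def disk_usage_percent(path):
    """Return how full the filesystem containing ``path`` is, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def over_limit(path, limit):
    """True when the filesystem is fuller than ``limit`` (e.g. "90%")."""
    percent = parse_percent(limit)
    return percent is not None and disk_usage_percent(path) > percent


# The system temporary directory stands in for the cache directory.
cache_dir = tempfile.gettempdir()
print(over_limit(cache_dir, "90%"))
```

Because the measurement covers the whole filesystem, data written by other applications counts towards the limit as well.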

Warning

If your disk is filled by another application, earthkit-regrid will happily delete its cached data to make room for the other application as soon as it has a chance.

Cache config parameters

Name                              Default                      Description
cache-policy                      'user'                       Caching policy. Valid values: off, temporary and user. See Disk-based precomputed weights caching for more information.
maximum-cache-disk-usage          None                         Disk usage threshold after which earthkit-regrid expires older cached entries (% of the full disk capacity). Can be set to None. See Disk-based precomputed weights caching for more information.
maximum-cache-size                '5GB'                        Maximum disk space used by the earthkit-regrid cache (e.g.: 100G or 2T). Can be set to None.
temporary-cache-directory-root    None                         Parent of the cache directory when cache-policy is temporary. See Disk-based precomputed weights caching for more information.
user-cache-directory              '~/.cache/earthkit-regrid'   Cache directory used when cache-policy is user. See Disk-based precomputed weights caching for more information.

Other earthkit-regrid config options can be found here.