Disk-based precomputed weights caching
Note
This caching only applies to the precomputed and precomputed-local backends in regrid().
Purpose
earthkit-regrid uses a dedicated directory to store interpolation matrices and the related index file downloaded from the remote inventory. By default this directory serves as a cache and is managed (its size is checked and limited). This means that if we run regrid() again with the same input and output grids, the matrix is loaded from the cache instead of being downloaded again. Additionally, caching offers monitoring and disk space management. When the cache is full, cached data is deleted according to the configuration (i.e. oldest data is deleted first). The cache is implemented using a sqlite database running in a separate thread.
Please note that the earthkit-regrid cache configuration is managed through the Configuration.
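The "oldest data is deleted first" behaviour can be illustrated with a small standalone sketch. It is not the earthkit-regrid implementation: the table schema, the `created` column and the helper names (`make_index`, `add_entry`, `total_size`, `trim`) are assumptions made only for this example, and a real cache would also remove the files from disk.

```python
import sqlite3


def make_index():
    """Create an in-memory index of cached files (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE cache (path TEXT PRIMARY KEY, size INTEGER, created REAL)"
    )
    return db


def add_entry(db, path, size, created):
    """Register a cached file with its size in bytes and creation time."""
    db.execute(
        "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)", (path, size, created)
    )


def total_size(db):
    """Return the sum of the sizes of all registered entries."""
    return db.execute("SELECT COALESCE(SUM(size), 0) FROM cache").fetchone()[0]


def trim(db, max_size):
    """Drop the oldest entries until the total size fits within max_size."""
    removed = []
    rows = db.execute(
        "SELECT path, size FROM cache ORDER BY created ASC"
    ).fetchall()
    current = total_size(db)
    for path, size in rows:
        if current <= max_size:
            break
        # A real cache would also delete the file on disk here.
        db.execute("DELETE FROM cache WHERE path = ?", (path,))
        current -= size
        removed.append(path)
    return removed


db = make_index()
add_entry(db, "matrix_a.npz", 400, created=1.0)
add_entry(db, "matrix_b.npz", 300, created=2.0)
add_entry(db, "matrix_c.npz", 500, created=3.0)
removed = trim(db, max_size=900)
```

In this sketch only the oldest entry is dropped, because removing it is enough to bring the total back under the limit.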
Warning
The earthkit-regrid cache is intended to be used by a single user. Sharing the cache between multiple users is not recommended. Downloading a local copy of the data to a shared disk so that multiple users can work with it is a different use case and should be handled using mirrors.
Cache policies
The primary config option to control the cache is cache-policy, which can take the following values: user (the default), temporary and off. Each policy is described in its own section below.
The cache location can be read and modified with Python (see the details below).
Tip
See the Matrix disk cache notebook for examples.
Note
It is recommended to restart your Jupyter kernels after changing the cache policy or location.
User cache policy
When the cache-policy is “user” the cache will be active and created in a managed directory defined by the user-cache-directory config option. This is the default value.
Note
The default location of the user cache directory is "~/.cache/earthkit-regrid" and its maximum size is 5 GB.
The user cache directory is not cleaned up on exit. The next time you start earthkit-regrid it will still be there, unless it is deleted manually or the configuration assigns a different path to it on each startup. Also, when you run multiple sessions of earthkit-regrid under the same user they will share the same cache.
We can query the directory path via the Configuration and also by calling the directory() cache method.
>>> from earthkit.regrid import cache, config
>>> config.set("cache-policy", "user")
>>> config.get("user-cache-directory")
'/Users/username/.cache/earthkit-regrid'
>>> cache.directory()
'/Users/username/.cache/earthkit-regrid'
The following code shows how to change the user-cache-directory config option:
>>> from earthkit.regrid import config
>>> config.get("user-cache-directory") # Find the current cache directory
'/Users/username/.cache/earthkit-regrid'
>>> # Change the value of the setting
>>> config.set("user-cache-directory", "/big-disk/earthkit-regrid-cache")
# Python kernel restarted
>>> from earthkit.regrid import config
>>> config.get("user-cache-directory") # Cache directory has been modified
'/big-disk/earthkit-regrid-cache'
More generally, the earthkit-regrid config options can be read, modified and reset to their default values from Python; see the Configs documentation.
Temporary cache policy
When the cache-policy is “temporary” the cache will be active and located in a managed temporary directory created by tempfile.TemporaryDirectory. This directory will be unique for each earthkit-regrid session. When the directory object goes out of scope (at the latest on exit) the cache is cleaned up.
Due to the temporary nature of this directory, its path cannot be queried via the Configuration. Instead, we need to call the directory() cache method.
>>> from earthkit.regrid import cache, config
>>> config.set("cache-policy", "temporary")
>>> cache.directory()
'/var/folders/ng/g0zkhc2s42xbslpsywwp_26m0000gn/T/tmp_5bf5kq8'
We can specify the parent directory for the temporary cache by using the temporary-cache-directory-root config option. By default it is set to None (no parent directory specified).
>>> from earthkit.regrid import cache, config
>>> s = {
... "cache-policy": "temporary",
... "temporary-cache-directory-root": "~/my_demo_cache",
... }
>>> config.set(s)
>>> cache.directory()
'~/my_demo_cache/tmp0iiuvsz5'
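Since the temporary cache is described as being backed by tempfile.TemporaryDirectory, its lifecycle can be explored with the standard library alone. The sketch below does not call earthkit-regrid; the `parent` directory merely stands in for the value of temporary-cache-directory-root, and the variable names are illustrative.

```python
import os
import tempfile

# A parent directory standing in for "temporary-cache-directory-root".
parent = tempfile.mkdtemp()

# With dir=None the platform default temporary location would be used.
tmp = tempfile.TemporaryDirectory(dir=parent)
cache_dir = tmp.name

# The uniquely named directory is created directly under the parent.
inside_parent = os.path.dirname(cache_dir) == parent
existed = os.path.isdir(cache_dir)

# Explicit cleanup; the same happens when the object is garbage collected
# or, at the latest, when the interpreter exits.
tmp.cleanup()
exists_after_cleanup = os.path.isdir(cache_dir)

# Remove the stand-in parent directory created for this example.
os.rmdir(parent)
```

Each new TemporaryDirectory object gets a different random name, which is why the path has to be queried at runtime rather than read from the configuration.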
Off cache policy
When the cache-policy is “off” no disk-based caching is available. In this case all files are downloaded into an unmanaged temporary directory created by tempfile.TemporaryDirectory. Since caching is disabled, all repeated calls to regrid() will download the interpolation matrix again! This temporary directory will be unique for each earthkit-regrid session. When the directory object goes out of scope (at the latest on exit) the directory will be cleaned up.
Due to the temporary nature of this directory, its path cannot be queried via the Configuration. Instead, we need to call the directory() cache method.
>>> from earthkit.regrid import cache, config
>>> config.set("cache-policy", "off")
>>> cache.directory()
'/var/folders/ng/g0zkhc2s42xbslpsywwp_26m0000gn/T/tmp_5bf5kq8'
We can specify the parent directory for the temporary directory by using the temporary-directory-root config option. By default it is set to None (no parent directory specified).
>>> from earthkit.regrid import cache, config
>>> s = {
... "cache-policy": "off",
... "temporary-directory-root": "~/my_demo_tmp",
... }
>>> config.set(s)
>>> cache.directory()
'~/my_demo_tmp/tmp0iiuvsz5'
Cache methods
The cache is controlled by a global object, which we can access as earthkit.regrid.cache.
>>> from earthkit.regrid import cache
>>> cache
<earthkit.regrid.utils.caching.Cache object at 0x117be7040>
When cache-policy is user or temporary, a set of methods is available on this object to manage and interact with the cache.
| Methods | Description |
|---|---|
| policy | Get the current cache policy object. |
| directory() | Return the path to the current cache directory. |
| size() | Return the total number of bytes stored in the cache. |
| check_size() | Check the cache size and trim it down when needed. |
| entries() | Dump the entries stored in the cache. |
| summary_dump_database() | Return the number of items and total size of the cache. |
| | Delete entries from the cache. |
Warning
check_size() runs automatically when a new entry is added to the cache or when any of the cache config parameters changes.
Examples:
>>> from earthkit.regrid import cache
>>> cache.policy.name
'user'
>>> cache.directory()
'/Users/username/.cache/earthkit-regrid/'
>>> cache.size()
846785699
>>> cache.summary_dump_database()
(40, 846785699)
>>> d = cache.entries()
>>> len(d)
40
>>> d[0].get("creation_date")
'2023-10-30 14:48:31.320322'
Cache limits
Warning
These config options do not work when cache-policy is off.
- maximum-cache-size
  The maximum-cache-size setting ensures that earthkit-regrid does not use too much disk space. Its value sets the maximum disk space used by the earthkit-regrid cache. When the cache disk usage goes above this limit, earthkit-regrid triggers its cache cleaning mechanism before downloading additional data. The value of maximum-cache-size is absolute (such as “10G”, “10M”, “1K”). To disable it use None.
- maximum-cache-disk-usage
  The maximum-cache-disk-usage setting ensures that earthkit-regrid does not fill your disk. Its value sets the maximum disk usage as a percentage of the filesystem containing the cache directory. When the disk usage goes above this limit, earthkit-regrid triggers its cache cleaning mechanism before downloading additional data. The value of maximum-cache-disk-usage is relative (such as “90%” or “100%”). To disable it use None.
Warning
If your disk is filled by another application, earthkit-regrid will happily delete its cached data to make room for the other application as soon as it has a chance.
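To make the difference between the absolute and the relative limit concrete, here is a standalone sketch of how such values could be interpreted. It is not the library's parser: the helper names (`parse_size`, `parse_percent`, `exceeds_disk_usage`) are hypothetical, and treating "K", "M", "G" and "T" as binary multiples (1K = 1024 bytes) is an assumption of this example.

```python
import shutil

# Assumption for this sketch: binary multiples (1K = 1024 bytes).
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(value):
    """Convert a string such as '10G' into bytes; None disables the limit."""
    if value is None:
        return None
    text = value.strip().upper()
    if text.endswith("B"):  # accept '5GB' as well as '5G'
        text = text[:-1]
    if text and text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(text)


def parse_percent(value):
    """Convert a string such as '90%' into a float; None disables the limit."""
    if value is None:
        return None
    return float(value.strip().rstrip("%"))


def exceeds_disk_usage(path, limit):
    """Return True when the filesystem holding path is fuller than limit."""
    threshold = parse_percent(limit)
    if threshold is None:
        return False
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total > threshold


size_limit = parse_size("5GB")
too_full = exceeds_disk_usage(".", "100%")
```

The absolute limit depends only on the cache contents, whereas the relative limit depends on everything stored on the filesystem, which is why other applications filling the disk can trigger cache cleaning.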
Cache config parameters
| Name | Default | Description |
|---|---|---|
| cache-policy | 'user' | Caching policy. Valid values: off, temporary and user. See Disk-based precomputed weights caching for more information. |
| maximum-cache-disk-usage | None | Disk usage threshold after which earthkit-regrid expires older cached entries (% of the full disk capacity). Can be set to None. See Disk-based precomputed weights caching for more information. |
| maximum-cache-size | '5GB' | Maximum disk space used by the earthkit-regrid cache (e.g.: 100G or 2T). Can be set to None. |
| temporary-cache-directory-root | None | Parent of the cache directory when cache-policy is temporary. |
| user-cache-directory | '~/.cache/earthkit-regrid' | Cache directory used when cache-policy is user. |
Other earthkit-regrid config options can be found here.