This topic describes how to configure cloud caching.
Dremio supports cloud caching for Parquet files on the following data sources:
Tip: Dremio AWS Edition provides cloud caching without manual configuration.
Enable cloud caching for supported data sources either when adding a new data source to your deployment or later by editing the data source.
To enable cloud caching:
1. On the Datasets page, select a dataset based on a supported data source.
2. In the Advanced Options pane, check Enable asynchronous access when possible.
3. In the Cache Options pane, check Enable local caching when possible.
4. Click Save.
The Max percent of total available cache space to use when possible option specifies the disk quota that a data source can use on any single executor node. The default is 100% of the total disk space available on the mount point provided for caching.
The following screenshot displays the Edit Source dialog for an Amazon S3 data source:
For the Dremio cluster, the following must be specified:
To provision cloud cache you can configure the following dremio.conf settings:
- The executor.cache.path.db setting provides the database directory path. If you do not specify a database path, the local directory path is used.
- The executor.cache.path.fs setting provides the file system cache directory. For good performance, SSD/NVMe disks are recommended for cloud cache. If you do not specify a file system path, the local directory path is used.
In the following dremio.conf example, a database path and four file system paths are specified. Both the database and file system paths are optional. If these paths are not specified, the cache manager uses the local path (/mnt/resource/dremio/data). Dremio uses 70 percent of the total available disk space for the specified database and file system mount paths.
paths: {
  # the local path for Dremio to store data.
  local: "/mnt/resource/dremio/data"
  # the distributed path for Dremio data, including job results, downloads, uploads, etc.
  #dist: "pdfs://"${paths.local}"/pdfs"
}
services: {
  coordinator.enabled: false,
  coordinator.master.enabled: false,
  executor.enabled: true,
  executor.cache.path.db: "/mnt/cachemanagerdisk/db",
  executor.cache.path.fs: [
    "/mnt/cachemanagerdisk/dir1",
    "/mnt/cachemanagerdisk/dir2",
    "/mnt/cachemanagerdisk/dir3",
    "/mnt/cachemanagerdisk/dir4"
  ]
}
zookeeper: "lak-azure-perf:"${services.coordinator.master.embedded-zookeeper.port}
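Because both cache paths are optional, a smaller configuration is also possible. The following is a minimal sketch, not a recommended layout: it sets a single file system cache directory and leaves executor.cache.path.db unset, so the database falls back to the local path as described above. The mount point /mnt/nvme0/dremio-cache is a hypothetical SSD/NVMe path chosen for illustration.

```hocon
paths: {
  # the local path for Dremio to store data; also used for the cache
  # database because executor.cache.path.db is not set below
  local: "/mnt/resource/dremio/data"
}
services: {
  coordinator.enabled: false,
  coordinator.master.enabled: false,
  executor.enabled: true,
  # hypothetical SSD/NVMe mount used as the only file system cache directory
  executor.cache.path.fs: [ "/mnt/nvme0/dremio-cache" ]
}
```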
If reflection data is in a cloud store and you want to use caching for reflection data, set dist.caching.enabled: true in dremio.conf under the debug section. For example:
paths: {
  # the local path for Dremio to store data.
  local: ${DREMIO_HOME}"/data"
  # the distributed path for Dremio data, including job results, downloads, uploads, etc.
  dist: "dremioS3:///qa1.dremio.com/testdata/c3_dist/"
}
services: {
  coordinator.enabled: false,
  coordinator.master.enabled: false,
  executor.enabled: true
}
debug: {
  # Enable caching for distributed storage; it is turned off by default
  dist.caching.enabled: true,
  # Max percent of total available cache space to use when possible for distributed storage
  dist.max.cache.space.percent: 100
}
The following describes best practices for cloud caching:
- To add a disk to the cache, add its path to the executor.cache.path.fs property in the dremio.conf file. Note that removing a disk is not supported.
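As a sketch of the executor.cache.path.fs practice above, assuming it refers to growing the cache by appending a new disk's mount path to the existing list, the change might look like the following. The fifth entry, /mnt/cachemanagerdisk2/dir5, is a hypothetical path for the new disk; the first four entries are taken from the earlier example and are left in place, since removing a disk is not supported.

```hocon
services: {
  executor.enabled: true,
  executor.cache.path.db: "/mnt/cachemanagerdisk/db",
  executor.cache.path.fs: [
    # existing cache directories: keep these entries unchanged
    "/mnt/cachemanagerdisk/dir1",
    "/mnt/cachemanagerdisk/dir2",
    "/mnt/cachemanagerdisk/dir3",
    "/mnt/cachemanagerdisk/dir4",
    # newly added disk (hypothetical mount path)
    "/mnt/cachemanagerdisk2/dir5"
  ]
}
```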