Configuring Cloud Cache
Dremio Cloud Cache parameters are defined in dremio.conf.
Supported Data Sources
Dremio supports cloud-caching for Parquet files on the following data sources:
- Amazon S3
- Azure Storage (ADLS Gen 2) - v2 only
- Google Cloud Storage
- HDFS
- Hive on S3, Azure Storage, GCS, and HDFS
Enable Caching
Enable cloud-caching for supported data sources either when adding a new data source to your deployment or later by editing the data source.
To enable cloud-caching:
- On the Datasets page, select a supported data source in the Data Lakes list.
- In the top-right corner of the page, click the gear icon.
- In the Edit Source dialog, follow these steps:
  - Select Advanced Options, and then select Enable asynchronous access when possible.
  - Under Cache Options, select Enable local caching when possible.
  - (Optional) In the Max percent of total available cache space to use when possible field, specify the maximum percentage of cache space that can be used by a data source on any single executor node. The default is 100% of the total disk space available on the mount point provided for caching.
  - Click Save.
Configuring the Cache Path and Directory
The following can be specified:
- executor.cache.path.db (optional): Database directory path. This is the path for the database directory to use for storing cached data. If you do not specify a database path, the paths.local directory path is used.
- executor.cache.path.fs (optional): File system path. This is the file system cache directory. If you do not specify a file system path, the paths.local path is used.
Cache is lost if the directory location is changed or the executor is restarted.
Dremio recommends a cloud cache directory size of at least 100 GB. See Additional Storage for more information.
Example
In the following dremio.conf example, a database path and four (4) file system paths are specified.
If these paths were not specified, the cache manager would use the local path /mnt/resource/dremio/data.
Dremio uses 70 percent of the total available disk space for the specified database and file system mount paths.
paths: {
  # the local path for Dremio to store data.
  local: "/mnt/resource/dremio/data"
  # the distributed path for Dremio data, including job results, downloads, uploads, etc.
  #dist: "file://"${paths.local}"/pdfs"
}
services: {
  coordinator.enabled: false,
  coordinator.master.enabled: false,
  executor.enabled: true,
  executor.cache.path.db : "/mnt/cachemanagerdisk/db",
  executor.cache.path.fs : [ "/mnt/cachemanagerdisk/dir1","/mnt/cachemanagerdisk/dir2","/mnt/cachemanagerdisk/dir3","/mnt/cachemanagerdisk/dir4"]
}
zookeeper: "lak-azure-perf:"${services.coordinator.master.embedded-zookeeper.port}
Setting Up Caching for Reflection Data
You can improve the performance of queries that use reflections by enabling caching of reflection data.
If you are using a cloud storage provider, such as AWS, Google Cloud Platform, or Microsoft Azure, as a distributed store, caching is enabled by default. If you want to disable it, add the support key reflection.cloud.cache.enabled and set it to false. See Support Keys to learn how to add a support key.
If you are using HDFS as a distributed store, uncomment dist.caching.enabled in the debug section of dremio.conf, and set it to true. Then, restart the cluster.
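For the HDFS case, the uncommented setting might look like the following in dremio.conf. This is a minimal sketch: only the key name, its section, and its value come from the text above. It assumes the debug section is written in block form, and any other entries already present in your debug section should be left in place.

```hocon
debug: {
  # Enables caching of reflection data when HDFS is the distributed store.
  # Assumption: other debug settings in your file, if any, remain unchanged.
  dist.caching.enabled: true
}
```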
Best Practices
The following describes best practices for cloud-caching:
- Use SSD/NVMe disks for best performance.
- Provide disks with sufficient space to benefit from caching. Note that the cache will evict unused data when required.
- To add capacity to the cache, add additional disks to the executor.cache.path.fs property in the dremio.conf file. Note that removing a disk is not supported.
- When removing local caching on executor nodes, you need to:
  - Remove the Cloud Cache options on the data source.
  - Delete the Cache Manager database and file system folders.
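As an illustration of adding capacity, the sketch below extends the executor.cache.path.fs list from the earlier example with one more directory. The path /mnt/cachemanagerdisk/dir5 is a hypothetical mount point used only for illustration; substitute the mount point of the disk you actually added.

```hocon
services: {
  executor.enabled: true,
  executor.cache.path.db : "/mnt/cachemanagerdisk/db",
  # dir1 through dir4 are the existing cache directories from the earlier example.
  # dir5 is a hypothetical new disk; existing entries stay in the list because
  # removing a disk is not supported.
  executor.cache.path.fs : [
    "/mnt/cachemanagerdisk/dir1",
    "/mnt/cachemanagerdisk/dir2",
    "/mnt/cachemanagerdisk/dir3",
    "/mnt/cachemanagerdisk/dir4",
    "/mnt/cachemanagerdisk/dir5"
  ]
}
```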