Configuring Cloud Cache
This topic describes how to configure cloud-caching.
Supported Data Sources
Dremio supports cloud-caching for Parquet files on the following data sources:
- Amazon S3
- ADLS (Gen 1)
- Azure Storage (ADLS Gen 2) - v2 only
- HDFS
- Hive on S3, ADLS, Azure Storage, and HDFS
Tip: Dremio AWS Edition provides cloud-caching without manual configuration.
Disk Space Recommendations
Dremio recommends that the cloud cache volume contain at least 100 GB of available space. You do not need to clean up this space manually because cloud cache is an LRU (least recently used) cache.
Enable Caching
Enable cloud-caching for supported data sources either when adding a new data source to your deployment or later by editing the data source.
To enable cloud-caching:
- On the Datasets page, select a supported data source in the Data Lakes list.
- In the top-right corner of the page, click the gear icon.
- In the Edit Source dialog, follow these steps:
a. Select Advanced Options, and then select Enable asynchronous access when possible.
b. Under Cache Options, select Enable local caching when possible.
c. (Optional) In the Max percent of total available cache space to use when possible field, specify the maximum percentage of cache space that can be used by a data source on any single executor node. The default is 100% of the total disk space available on the mount point provided for caching.
Caution: The specified percentage must correspond to at least 10 GB of cache space. If you keep the default percentage, make sure that the available storage space is at least 10 GB. If you decrease the percentage, make sure that the available storage space is proportionally greater than 10 GB. For example, at 50%, the mount point needs at least 20 GB of available space.
d. Click Save.
Setting Up Cache Path and Directory
For the Dremio cluster, the following must be specified:
- Database path -- This is the path of the database directory used for storing cached data.
- Cache directories -- This is the mount point (base directory) for storing all the data related to caching on the node. The cache is lost if the directory location is changed and the executor is restarted.
Configuring via dremio.conf
To provision cloud cache, you can configure the following dremio.conf settings:
- Local directory path -- By default, the cache manager uses the local directory path if database or file system paths are not specified.
- Database path -- The executor.cache.path.db setting provides the database directory path. If you do not specify a database path, the local directory path is used.
- File system path -- The executor.cache.path.fs setting provides the file system cache directory. Note that for good performance, SSD/NVMe disks are recommended for cloud cache. If you do not specify a file system path, the local directory path is used.
Example dremio.conf
In the following dremio.conf example, a database path and four file system paths are specified. Both the database and file system paths are optional. If these paths are not specified, the cache manager uses the local path (/mnt/resource/dremio/data). Dremio uses 70 percent of the total available disk space for the specified database and file system mount paths.
paths: {
# the local path for dremio to store data.
local: "/mnt/resource/dremio/data"
# the distributed path for Dremio data, including job results, downloads, uploads, etc.
#dist: "pdfs://"${paths.local}"/pdfs"
}
services: {
coordinator.enabled: false,
coordinator.master.enabled: false,
executor.enabled: true,
executor.cache.path.db : "/mnt/cachemanagerdisk/db",
executor.cache.path.fs : [ "/mnt/cachemanagerdisk/dir1","/mnt/cachemanagerdisk/dir2","/mnt/cachemanagerdisk/dir3","/mnt/cachemanagerdisk/dir4"]
}
zookeeper: "lak-azure-perf:"${services.coordinator.master.embedded-zookeeper.port}
Setting Up Caching for Reflection Data
You can improve the performance of queries that use reflections by enabling caching of reflection data in your distributed store.
If you are using a cloud storage provider, such as AWS, Google Cloud Platform, or Microsoft Azure, as a distributed store, caching is enabled by default. If you want to disable it, add the support key reflection.cloud.cache.enabled and set it to false. See Support Keys to learn how to add a support key.
If you are using HDFS as a distributed store, uncomment dist.caching.enabled in the debug section of dremio.conf, and set it to true. Then, restart the cluster.
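As a minimal sketch of the HDFS case described above, the debug section of dremio.conf might look like the following once the setting is uncommented. The block layout is an assumption modeled on the paths and services sections shown earlier in this topic, so check the dremio.conf shipped with your release for the exact form.

```hocon
debug: {
  # Assumed layout: enables caching of reflection data when the
  # distributed store is on HDFS. Restart the cluster after editing.
  dist.caching.enabled: true
}
```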
Best Practices
The following describes best practices for cloud-caching:
- Use SSD/NVMe disks for best performance.
- Provide disks with sufficient space to benefit from caching. Note that the cache will evict unused data when required.
- To add capacity to the cache, add additional disks to the executor.cache.path.fs property in the dremio.conf file. Note that removing a disk is not supported.
- When removing local caching on executor nodes, you need to:
  - Remove the Cloud Cache options on the data source.
  - Delete the Cache Manager database and file system folders.
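To illustrate the capacity recommendation above, the following sketch extends the earlier dremio.conf example with a fifth cache directory. The dir5 path is hypothetical and stands for the mount point of a newly attached disk. The four existing entries stay in place because removing a disk is not supported.

```hocon
services: {
  executor.enabled: true,
  executor.cache.path.db : "/mnt/cachemanagerdisk/db",
  # dir1 through dir4 are unchanged from the earlier example;
  # dir5 is a hypothetical mount point for the added disk.
  executor.cache.path.fs : [
    "/mnt/cachemanagerdisk/dir1",
    "/mnt/cachemanagerdisk/dir2",
    "/mnt/cachemanagerdisk/dir3",
    "/mnt/cachemanagerdisk/dir4",
    "/mnt/cachemanagerdisk/dir5"
  ]
}
```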