Configuring Cloud Cache
Dremio Cloud Cache parameters are defined in dremio.conf.
Supported Data Sources
Dremio supports cloud-caching for Parquet files on the following data sources:
- Amazon S3
- Azure Storage (ADLS Gen 2) - v2 only
- Google Cloud Storage
- HDFS
- Hive on S3, Azure Storage, GCS, and HDFS
Enable Caching
Enable cloud-caching for supported data sources either when adding a new data source to your deployment or later by editing the data source.
To enable cloud-caching:
- On the Datasets page, select a supported data source in the Data Lakes list.
- In the top-right corner of the page, click the gear icon.
- In the Edit Source dialog, follow these steps:
  - Select Advanced Options, and then select Enable asynchronous access when possible.
  - Under Cache Options, select Enable local caching when possible.
  - (Optional) In the Max percent of total available cache space to use when possible field, specify the maximum percentage of cache space that can be used by a data source on any single executor node. The default is 100% of the total disk space available on the mount point provided for caching.
  - Click Save.
Configuring the Cache Path and Directory
The following properties can be specified in dremio.conf:
- executor.cache.path.db (optional): Database directory path. This is the path for the database directory to use for storing cached data. If you do not specify a database path, the paths.local directory path is used.
- executor.cache.path.fs (optional): File system path. This is the file system cache directory. If you do not specify a file system path, the paths.local path is used.
The cache is lost if the directory location is changed or the executor is restarted.
Dremio recommends a cloud cache directory size of at least 100 GB. See Additional Storage for more information.
Example
In the following dremio.conf example, a database path and four file system paths are specified.
If these paths were not specified, the cache manager would use the local path /mnt/resource/dremio/data.
Dremio uses 70 percent of the total available disk space for the specified database and file system mount paths.
paths: {
    # the local path for dremio to store data.
    local: "/mnt/resource/dremio/data"
    # the distributed path for Dremio data, including job results, downloads, uploads, etc.
    #dist: "file://"${paths.local}"/pdfs"
}
services: {
    coordinator.enabled: false,
    coordinator.master.enabled: false,
    executor.enabled: true,
    executor.cache.path.db : "/mnt/cachemanagerdisk/db",
    executor.cache.path.fs : [ "/mnt/cachemanagerdisk/dir1","/mnt/cachemanagerdisk/dir2","/mnt/cachemanagerdisk/dir3","/mnt/cachemanagerdisk/dir4"]
}
zookeeper: "lak-azure-perf:"${services.coordinator.master.embedded-zookeeper.port}
Setting Up Caching for Reflection Data
You can improve the performance of queries that use Reflections by enabling caching of Reflection data.
If you are using a cloud storage provider, such as AWS, Google Cloud Platform, or Microsoft Azure, as a distributed store, caching is enabled by default.
If you are using HDFS as a distributed store, uncomment dist.caching.enabled in the debug section of dremio.conf, and set it to true. Then, restart the cluster.
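As a minimal sketch of the HDFS case above, the debug section of dremio.conf might look like the following once the property is uncommented. The property name and section come from the text above; the surrounding layout is an assumption, and any other settings already present in your debug section should be left in place.

```hocon
debug: {
    # Enable caching of Reflection data when HDFS is the distributed store.
    # This line is uncommented and set to true, as described above.
    dist.caching.enabled: true
}
```

As noted above, the cluster must be restarted after this change.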
Best Practices
Follow these best practices for cloud-caching:
- Use SSD/NVMe disks for best performance.
- Provide disks with sufficient space to benefit from caching. Note that the cache will evict unused data when required.
- To add capacity to the cache, add additional disks to the executor.cache.path.fs property in the dremio.conf file. Note that removing a disk is not supported.
- When removing local caching on executor nodes, you need to:
  - Remove the Cloud Cache options on the data source.
  - Delete the Cache Manager database and file system folders.
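To illustrate the practice of adding cache capacity, the sketch below extends the executor.cache.path.fs list from the earlier example with one more directory. The path /mnt/cachemanagerdisk2/dir1 is hypothetical and stands in for a mount point on a newly attached disk. The existing entries are kept as they are, since removing a disk is not supported.

```hocon
services: {
    executor.enabled: true,
    executor.cache.path.db : "/mnt/cachemanagerdisk/db",
    executor.cache.path.fs : [
        # Existing cache directories (unchanged).
        "/mnt/cachemanagerdisk/dir1",
        "/mnt/cachemanagerdisk/dir2",
        "/mnt/cachemanagerdisk/dir3",
        "/mnt/cachemanagerdisk/dir4",
        # Hypothetical directory on a newly added disk.
        "/mnt/cachemanagerdisk2/dir1"
    ]
}
```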