Configuring Cloud Cache

This topic describes how to configure cloud caching.

Support

As of Dremio 4.0, cloud caching is available for Parquet files on the following data sources:

  • Amazon S3
  • ADLS (Gen 1)
  • Azure Storage (ADLS Gen 2) - v2 only

Enable Caching

Cloud caching can be enabled and configured either when adding a data source or editing data source settings.

To enable cloud caching:

  1. For the data source, navigate to Advanced Options > Cache Options.
  2. Specify the following options and click Save.
    • Enable local caching when possible -- This setting enables or disables caching for the data source.
    • Max percent of total available cache space to use when possible -- This is the disk quota that can be used by a data source on any single executor node. The default is 100% of the total disk space available on the mount point provided for caching.

The following screenshot shows the Advanced Options when adding a new Amazon S3 data source:

Set Up Cache Path and Directory

For the Dremio cluster, the following must be specified:

  • Database path -- This is the path for the database directory to use for storing cached data.
  • Cache directories -- This is the mount point (base directory) for storing all the data related to caching on the node. Cache is lost if the directory location is changed and the executor is restarted.

Configuring via dremio.conf

To provision cloud cache, you can configure the following dremio.conf settings:

  • Local directory path -- By default, the cache manager uses the local directory path if database or file system paths are not specified.
  • Database path -- The executor.cache.path.db setting provides the database directory path. If you do not specify a database path, the local directory path is used.
  • File system path -- The executor.cache.path.fs setting provides the file system cache directory. Note that for good performance, SSD/NVMe disks are recommended for cloud cache. If you do not specify a file system path, the local directory path is used.

Example dremio.conf

In the following dremio.conf example, a database path and four (4) file system paths are specified. Both the database and file system paths are optional. If these paths are not specified, the cache manager uses the local path (/mnt/resource/dremio/data). Because no quota is specified in this example, the default quota of 70% is used for the database and file system mount paths.

paths: {
  # the local path for dremio to store data.
  local: "/mnt/resource/dremio/data"

  # the distributed path Dremio data including job results, downloads, uploads, etc
  #dist: "pdfs://"${paths.local}"/pdfs"
}

services: {
  coordinator.enabled: false,
  coordinator.master.enabled: false,
  executor.enabled: true
  executor.cache.path.db : "/mnt/cachemanagerdisk/db",
  executor.cache.path.fs : [ "/mnt/cachemanagerdisk/dir1","/mnt/cachemanagerdisk/dir2","/mnt/cachemanagerdisk/dir3","/mnt/cachemanagerdisk/dir4"]
}
zookeeper: "lak-azure-perf:"${services.coordinator.master.embedded-zookeeper.port}
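
Before starting an executor with settings like these, it can help to confirm that each configured directory exists and is writable. The following is a minimal Python sketch and is not part of Dremio: the helper name check_cache_dirs is hypothetical, and the paths are copied from the example above.

```python
import os
import tempfile


def check_cache_dirs(paths):
    """Return a dict mapping each path to "ok", "missing", or "not writable".

    Hypothetical helper, not part of Dremio: it only checks that each
    directory exists and that the current user can create a file in it.
    """
    status = {}
    for path in paths:
        if not os.path.isdir(path):
            status[path] = "missing"
            continue
        try:
            # Creating (and automatically removing) a temp file is a
            # direct test of write access to the directory.
            with tempfile.TemporaryFile(dir=path):
                pass
            status[path] = "ok"
        except OSError:
            status[path] = "not writable"
    return status


if __name__ == "__main__":
    # Paths copied from the example dremio.conf above.
    cache_dirs = [
        "/mnt/cachemanagerdisk/db",
        "/mnt/cachemanagerdisk/dir1",
        "/mnt/cachemanagerdisk/dir2",
        "/mnt/cachemanagerdisk/dir3",
        "/mnt/cachemanagerdisk/dir4",
    ]
    for path, state in check_cache_dirs(cache_dirs).items():
        print(f"{path}: {state}")
```

For the result to be meaningful, run the check as the same operating system user that runs the executor.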

Example dremio.conf with quota

In the following dremio.conf example, the executor.cache.pctquota.db and executor.cache.pctquota.fs settings are used to specify quotas for the database and file system mount paths.

paths: {
  # the local path for dremio to store data.
  local: "/mnt/resource/dremio/data"

  # the distributed path Dremio data including job results, downloads, uploads, etc
  #dist: "pdfs://"${paths.local}"/pdfs"
}

services: {
  coordinator.enabled: false,
  coordinator.master.enabled: false,
  executor.enabled: true
  executor.cache.path.db : "/mnt/cachemanagerdisk/db",
  executor.cache.path.fs : [ "/mnt/cachemanagerdisk/dir1","/mnt/cachemanagerdisk/dir2","/mnt/cachemanagerdisk/dir3","/mnt/cachemanagerdisk/dir4"]
  executor.cache.pctquota.db : 70
  executor.cache.pctquota.fs : [70,50,50,70]
}
zookeeper: "lak-azure-perf:"${services.coordinator.master.embedded-zookeeper.port}
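
To get a feel for what these percentages mean in bytes, the quotas can be applied to the capacity of each mount point. The sketch below is illustrative only and rests on two assumptions that this topic does not confirm: that a quota is a percentage of the total disk space on the mount point, and that the entries in executor.cache.pctquota.fs match the entries in executor.cache.path.fs by position. The function names quota_bytes and fs_budgets are hypothetical.

```python
import shutil


def quota_bytes(total_bytes, pct):
    """Bytes allowed by a percentage quota of a disk's total capacity."""
    if not 0 <= pct <= 100:
        raise ValueError(f"quota must be between 0 and 100, got {pct}")
    return total_bytes * pct // 100


def fs_budgets(paths, quotas, total_for=lambda p: shutil.disk_usage(p).total):
    """Pair each cache directory with its quota and return byte budgets.

    Assumption: quotas are matched to paths by position, as the example
    above suggests. total_for can be replaced to avoid touching real disks.
    """
    if len(paths) != len(quotas):
        raise ValueError("expected one quota per file system path")
    return {p: quota_bytes(total_for(p), q) for p, q in zip(paths, quotas)}


if __name__ == "__main__":
    # Values copied from the example above; a fixed 1 TB capacity is
    # assumed for every disk so the sketch runs without those mounts.
    paths = [f"/mnt/cachemanagerdisk/dir{i}" for i in range(1, 5)]
    quotas = [70, 50, 50, 70]
    one_tb = 10**12
    for path, budget in fs_budgets(paths, quotas, lambda p: one_tb).items():
        print(f"{path}: {budget} bytes")
```

Passing the capacity lookup as a parameter keeps the arithmetic separate from the disk query, so the same function can be pointed at real mount points by omitting the third argument.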

Set Up Caching for Reflection Data

If reflection data is in a cloud store and you want to use caching for reflection data:

  1. Set dist.caching.enabled: true in dremio.conf under the debug section.
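  2. Restart the cluster. This step is required for the change to take effect.

The following dremio.conf example shows the debug section with caching enabled for the distributed store:
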
paths: {
  # the local path for dremio to store data.
  local: ${DREMIO_HOME}"/data"

  # the distributed path Dremio data including job results, downloads, uploads, etc
  dist: "dremioS3:///qa1.dremio.com/testdata/c3_dist/"
}

services: {
  coordinator.enabled: false,
  coordinator.master.enabled: false,
  executor.enabled: true
}
debug: {
  # Enable caching for distributed storage, it is turned off by default
  dist.caching.enabled: true,
  # Max percent of total available cache space to use when possible for distributed storage
  dist.max.cache.space.percent: 100
}

Best Practices

The following describes best practices for cloud caching:

  • Use SSD/NVMe disks for best performance.
  • Provide disks with sufficient space to benefit from caching. Note that the cache will evict unused data when required.
  • To add capacity to the cache, add disks to the executor.cache.path.fs property in the dremio.conf file. Note that removing a disk is not supported.
  • To improve performance while using reflections, enable caching on your distributed store. Enabling caching on the distributed store also enables caching of reflection data, because Dremio stores data related to reflections, job results, downloads, and uploads on the distributed store.
  • When removing local caching on executor nodes, you need to:
    1. Remove the Cloud Cache options on the data source.
    2. Delete the Cache Manager database and file system folders.
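
The folder deletion in the last step can be scripted. The following Python sketch is a hypothetical helper, not a Dremio tool: the name remove_cache_dirs and the dry-run default are illustrative choices, and the sketch assumes that the executor process is stopped and that the directories hold nothing except cache data.

```python
import os
import shutil


def remove_cache_dirs(paths, dry_run=True):
    """Delete the given cache directories and return the ones acted on.

    Hypothetical helper. With dry_run=True (the default) nothing is
    deleted; the function only reports which directories exist.
    """
    acted_on = []
    for path in paths:
        if not os.path.isdir(path):
            continue  # already removed, or never created
        if not dry_run:
            shutil.rmtree(path)
        acted_on.append(path)
    return acted_on


if __name__ == "__main__":
    # Example paths taken from the dremio.conf examples in this topic.
    candidates = ["/mnt/cachemanagerdisk/db", "/mnt/cachemanagerdisk/dir1"]
    # Dry run first: prints what would be deleted without deleting it.
    print(remove_cache_dirs(candidates))
```

Reviewing the dry-run output before calling the function again with dry_run=False reduces the risk of deleting the wrong directory.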
