    Configuring Cloud Cache

    This topic describes how to configure cloud-caching.

    Supported Data Sources

    Dremio supports cloud-caching for Parquet files on the following data sources:

    • Amazon S3
    • ADLS (Gen 1)
    • Azure Storage (ADLS Gen 2) - v2 only
    • HDFS
    • Hive on S3, ADLS, Azure Storage, and HDFS

    Tip: Dremio AWS Edition provides cloud-caching without manual configuration.

    Enable Caching

    Enable cloud-caching for supported data sources either when adding a new data source to your deployment or later by editing the data source.

    To enable cloud-caching:

    1. On the Datasets page, select a supported data source in the Data Lakes list.

    2. In the top-right corner of the page, click the gear icon.

    3. In the Edit Source dialog, follow these steps:

      a. Select Advanced Options, and then select Enable asynchronous access when possible.

      b. Under Cache Options, select Enable local caching when possible.

      c. Click Save.

    The Max percent of total available cache space to use when possible option specifies the disk quota that a data source can use on any single executor node. The default is 100% of the total disk space available on the mount point provided for caching.

    Setting Up Cache Path and Directory

    For the Dremio cluster, the following must be specified:

    • Database path – The path of the database directory used for storing cached data.
    • Cache directories – The mount point (base directory) for storing all caching-related data on the node. The cache is lost if the directory location is changed and the executor is restarted.

    Configuring via dremio.conf

    To provision cloud cache, configure the following dremio.conf settings:

    • Local directory path – By default, the cache manager uses the local directory path if the database or file system paths are not specified.
    • Database path – The executor.cache.path.db setting provides the database directory path. If you do not specify a database path, the local directory path is used.
    • File system path – The executor.cache.path.fs setting provides the file system cache directory. Note that for good performance, SSD/NVMe disks are recommended for cloud cache. If you do not specify a file system path, the local directory path is used.

    Example dremio.conf

    In the following dremio.conf example, a database path and four file system paths are specified. Both the database and file system paths are optional. If these paths are not specified, the cache manager uses the local path (/mnt/resource/dremio/data). Dremio uses 70 percent of the total available disk space for the specified database and file system mount paths.

    paths: {
      # the local path for dremio to store data.
      local: "/mnt/resource/dremio/data"
    
      # the distributed path for Dremio data, including job results, downloads, uploads, etc.
      #dist: "pdfs://"${paths.local}"/pdfs"
    }
    
    services: {
      coordinator.enabled: false,
      coordinator.master.enabled: false,
      executor.enabled: true,
      executor.cache.path.db: "/mnt/cachemanagerdisk/db",
      executor.cache.path.fs: [ "/mnt/cachemanagerdisk/dir1", "/mnt/cachemanagerdisk/dir2", "/mnt/cachemanagerdisk/dir3", "/mnt/cachemanagerdisk/dir4" ]
    }
    zookeeper: "lak-azure-perf:"${services.coordinator.master.embedded-zookeeper.port}
    

    Setting Up Caching for Reflection Data

    You can improve the performance of queries that use reflections by enabling caching on your distributed store. Because Dremio stores reflection data on the distributed store, alongside job results and uploads, caching the distributed store also caches reflection data.

    To cache data, including reflection data, that is in a distributed store:

    1. In the debug section of dremio.conf, enable caching with the setting that applies to your environment:
      • If you are using HDFS for distributed storage, uncomment dist.caching.enabled and set it to true.
      • If you are using a cloud storage provider, such as AWS, Google Cloud Platform, or Microsoft Azure, uncomment reflection.cloud.cache.enabled and set it to true.
    2. Restart the cluster. This step is required for the change to take effect.
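
    As a rough illustration of step 1, the debug section might look like the following after editing. This is a sketch only: the nesting under a debug block is inferred from the step above, and only one of the two settings is expected to apply to a given deployment.

    ```
    debug: {
      # HDFS distributed store: enable caching of the distributed store.
      dist.caching.enabled: true

      # Cloud storage provider (AWS, Google Cloud Platform, Microsoft Azure):
      # uncomment this setting instead of the one above.
      #reflection.cloud.cache.enabled: true
    }
    ```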

    Best Practices

    Follow these best practices for cloud-caching:

    • Use SSD/NVMe disks for best performance.
    • Provide disks with sufficient space to benefit from caching. The cache evicts unused data when required.
    • To add capacity to the cache, add disks to the executor.cache.path.fs property in the dremio.conf file. Removing a disk is not supported.
    • To remove local caching on executor nodes:
      1. Remove the Cloud Cache options on the data source.
      2. Delete the Cache Manager database and file system folders.
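
    To illustrate adding cache capacity, the following sketch extends the executor.cache.path.fs list from the earlier example with one more directory. The dir5 path is hypothetical and stands in for the mount point of a newly attached disk; the existing entries stay in place because removing a disk is not supported.

    ```
    services: {
      executor.cache.path.fs: [
        "/mnt/cachemanagerdisk/dir1",
        "/mnt/cachemanagerdisk/dir2",
        "/mnt/cachemanagerdisk/dir3",
        "/mnt/cachemanagerdisk/dir4",
        # hypothetical mount point for the newly added disk
        "/mnt/cachemanagerdisk/dir5"
      ]
    }
    ```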