Version: current [25.0.x]

Configuring Distributed Storage

This topic specifies how to configure Dremio for distributed storage.

To configure distributed storage, the paths.dist property in the dremio.conf file must specify the cache location where Dremio holds the accelerator, table, job result, download, and upload data. If this property is updated, then it must be updated in the dremio.conf file on all nodes. By default, Dremio uses the disk space on local Dremio nodes.

| Store | Supported? |
| --- | --- |
| Amazon S3 | Yes |
| Azure Data Lake Store (ADLS Gen1) | Yes |
| Azure Storage (ADLS Gen2) | Yes |
| Google Cloud Storage | Yes |
| HDFS | Yes |
| MapR-FS | Yes |
| NAS | Yes |

Disk Space Recommendations

For cluster deployments, we recommend the following disk space sizes for data lake distributed storage types. The distributed storage cache location is specified with the paths.dist property in the dremio.conf file; for more information, see Distributed Storage. These recommendations apply to distributed stores such as HDFS and NFS, where volume size is part of the configuration.

| Type | Purpose | Minimum Size | Cleanup |
| --- | --- | --- | --- |
| Reflections | Stored reflection materializations | Start with 10% of data size | After expiration, with a configurable grace period |
| Unlimited split metadata | Metadata storage for unlimited-split Parquet-based datasets | Scales with number of tables and refresh snapshots; start with 150 GB and monitor size | As of Dremio 24.2, expired snapshots are gradually removed |
| Uploads & downloads | User-uploaded datasets and downloaded data | Start with 500 GB | Drop tables; manually remove downloads |
| Job results | Reuse query results | Scales with number of jobs and length of retention; start with 150 GB and monitor size | Set the cleanup threshold using results.max.age_in_days; by default, jobs older than 30 days are deleted automatically |
| Scratch | $scratch shared space | Varies with number of objects stored | Delete objects |

Amazon S3

Before configuring Amazon S3 as Dremio's distributed storage:

  • Perform a test by adding the same bucket as a Dremio source and verify the connection.
  • Ensure that the following minimum policy requirements for storing reflections are provided:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::BUCKET-NAME",
        "arn:aws:s3:::BUCKET-NAME/*"
      ]
    },
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": [
        "s3:ListAllMyBuckets",
        "s3:GetBucketLocation"
      ],
      "Resource": "*"
    }
  ]
}
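As a quick sanity check, a policy like the one above can be validated programmatically. The sketch below is illustrative only (not a Dremio tool; BUCKET-NAME is a placeholder) and confirms that the four S3 actions Dremio needs for reflections are granted:

```python
import json

# Parse the bucket policy (BUCKET-NAME is a placeholder).
policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::BUCKET-NAME", "arn:aws:s3:::BUCKET-NAME/*"]
    }
  ]
}
""")

# The four S3 actions listed above as the minimum for storing reflections.
required = {"s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:DeleteObject"}

granted = set()
for stmt in policy["Statement"]:
    if stmt["Effect"] == "Allow":
        granted.update(stmt["Action"])

missing = required - granted
print("missing actions:", missing or "none")  # missing actions: none
```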

Configuring Dremio for Amazon S3

note
  • In Dremio 24+, the fs.s3a.secret.key property can be encrypted using the dremio-admin encrypt CLI command. The encrypted value produced with the encrypt command must be prefixed with dremio+ (for example, dremio+secret:1.812WZLVORD26pwyAg8qKtQqw9Te8Xom5mdkSMmR_U4).

  • Before configuring, create the storage root directory that the dremio.conf and core-site.xml files will reference.

To set up configuration for distributed storage:

  1. Change the paths.dist property in the dremio.conf file.

    paths: {
    ...
    dist: "dremioS3:///<bucket_name>/<folder1>/<folder2>"
    }

    For example: dist: "dremioS3:///qa1.dremio.com/jduong/accel"

  2. Create core-site.xml and include IAM credentials with list, read, and write permissions.
    You can use an instance profile to configure the distributed storage:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.dremioS3.impl</name>
        <description>The FileSystem implementation. Must be set to com.dremio.plugins.s3.store.S3FileSystem</description>
        <value>com.dremio.plugins.s3.store.S3FileSystem</value>
      </property>
      <property>
        <name>fs.s3a.aws.credentials.provider</name>
        <description>The credential provider type.</description>
        <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
      </property>
    </configuration>

    You can also use an AWS named profile from the ~/.aws/credentials file. This includes support for specifying a credentials_process to automatically retrieve temporary credentials.

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.dremioS3.impl</name>
        <description>The FileSystem implementation. Must be set to com.dremio.plugins.s3.store.S3FileSystem</description>
        <value>com.dremio.plugins.s3.store.S3FileSystem</value>
      </property>
      <property>
        <name>com.dremio.awsProfile</name>
        <description>AWS Profile to use.</description>
        <value>profilename</value>
      </property>
      <property>
        <name>fs.s3a.aws.credentials.provider</name>
        <description>The credential provider type.</description>
        <value>com.dremio.plugins.s3.store.AWSProfileCredentialsProviderV1</value>
      </property>
    </configuration>

    You can also use an access key and secret for configuring the distributed storage in place of an instance profile or AWS named profile:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.dremioS3.impl</name>
        <description>The FileSystem implementation. Must be set to com.dremio.plugins.s3.store.S3FileSystem</description>
        <value>com.dremio.plugins.s3.store.S3FileSystem</value>
      </property>
      <property>
        <name>fs.s3a.access.key</name>
        <description>AWS access key ID.</description>
        <value></value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <description>AWS secret key.</description>
        <value></value>
      </property>
      <property>
        <name>fs.s3a.aws.credentials.provider</name>
        <description>The credential provider type.</description>
        <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
      </property>
    </configuration>
  3. Copy core-site.xml into Dremio's configuration directory (the same directory as dremio.conf) on all nodes.
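Because the same three-element <property> pattern repeats in every node's core-site.xml, it can help to generate the file rather than hand-edit it. The sketch below is illustrative only (not a Dremio utility) and builds the instance-profile variant with Python's standard library:

```python
import xml.etree.ElementTree as ET

# Property names/values from the instance-profile example above.
props = {
    "fs.dremioS3.impl": "com.dremio.plugins.s3.store.S3FileSystem",
    "fs.s3a.aws.credentials.provider":
        "com.amazonaws.auth.InstanceProfileCredentialsProvider",
}

# Build <configuration><property><name>...</name><value>...</value></property>...
config = ET.Element("configuration")
for name, value in props.items():
    prop = ET.SubElement(config, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value

xml_text = ET.tostring(config, encoding="unicode")
print(xml_text)
```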

Configuring Dremio for Minio

note

In Dremio 24+, the fs.s3a.secret.key property can be encrypted using the dremio-admin encrypt CLI command. The encrypted value produced with the encrypt command must be prefixed with dremio+ (for example, dremio+secret:1.812WZLVORD26pwyAg8qKtQqw9Te8Xom5mdkSMmR_U4).

Minio can be used as a distributed store for both unencrypted and SSL/TLS connections. Before configuring Minio as Dremio's distributed storage, create a storage root directory to use for the dremio.conf and core-site.xml files.

To configure Minio as a distributed store:

  1. Ensure that the provided root directory (bucket) already exists on the Minio server.

  2. Change the paths.dist property in the dremio.conf file.

    paths: {
    ...
    dist: "dremioS3:///<bucket_name>/<folder1>/<folder2>"
    }

    For example: dist: "dremioS3:///dremio"

  3. Create core-site.xml and include IAM credentials with list, read, and write permissions:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.dremioS3.impl</name>
        <description>The FileSystem implementation. Must be set to com.dremio.plugins.s3.store.S3FileSystem</description>
        <value>com.dremio.plugins.s3.store.S3FileSystem</value>
      </property>
      <property>
        <name>fs.s3a.access.key</name>
        <description>Minio server access key ID.</description>
        <value>ACCESS_KEY</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <description>Minio server secret key.</description>
        <value>SECRET_KEY</value>
      </property>
      <property>
        <name>fs.s3a.aws.credentials.provider</name>
        <description>The credential provider type.</description>
        <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
      </property>
      <property>
        <name>fs.s3a.endpoint</name>
        <description>Endpoint can either be an IP or a hostname where the Minio server is running. The endpoint value cannot contain the http(s) prefix. For example, 175.1.2.3:9000 is a valid endpoint.</description>
        <value>ENDPOINT</value>
      </property>
      <property>
        <name>fs.s3a.path.style.access</name>
        <description>Value has to be set to true.</description>
        <value>true</value>
      </property>
      <property>
        <name>dremio.s3.compat</name>
        <description>Value has to be set to true.</description>
        <value>true</value>
      </property>
      <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <description>Value can either be true or false; set to true to use SSL with a secure Minio server.</description>
        <value>SSL_ENABLED</value>
      </property>
    </configuration>
  4. Copy core-site.xml into Dremio's configuration directory (the same directory as dremio.conf) on all nodes.
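The fs.s3a.endpoint restriction (no http(s) prefix) is easy to get wrong. A minimal, hypothetical check — the helper name is illustrative, not part of Dremio:

```python
# Hypothetical helper: an fs.s3a.endpoint value for Minio must not
# carry an http(s) scheme; only host (or IP) and port are allowed.
def looks_like_valid_endpoint(value: str) -> bool:
    return not value.startswith(("http://", "https://"))

print(looks_like_valid_endpoint("175.1.2.3:9000"))         # True
print(looks_like_valid_endpoint("http://175.1.2.3:9000"))  # False
```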

note
  • Troubleshooting Minio: The default buffer directory is /tmp/hadoop-dremio/s3a. If that location does not have enough space, you may encounter an error like "DiskErrorException: No space available in any of the local directories." As an alternative, in the core-site.xml file, add or change the fs.s3a.buffer.dir setting to a directory of your choice (any directory with ample space and write permissions for Dremio). This resolves writing to the default /tmp directory.

  • The new directory is not used for buffering until you restart the executors.

Configuring KMS Encryption

Before configuring AWS Key Management Service (KMS) encryption for Dremio's distributed storage, create a storage root directory to use for the dremio.conf and core-site.xml files.

note

KMS encryption is available in Dremio Enterprise Edition.

To configure Dremio to use KMS encryption for Amazon S3:

  1. Change the paths.dist property in the dremio.conf file.

    paths: {
    ...
    dist: "dremioS3:///<bucket_name>/<folder1>/<folder2>"
    }

    For example: dist: "dremioS3:///qa1.dremio.com/jduong/accel"

  2. Modify core-site.xml to include the following properties:

    ```xml
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <value>true</value>
      </property>
      <property>
        <name>fs.s3a.server-side-encryption-algorithm</name>
        <value>SSE-KMS</value>
      </property>
      <property>
        <name>fs.s3a.server-side-encryption.key</name>
        <value>KEY_ARN</value>
      </property>
    </configuration>
    ```
    note

    To obtain the server-side encryption key from AWS, navigate to KMS > Customer managed keys > Create key.

  3. Copy core-site.xml into Dremio's configuration directory (the same directory as dremio.conf) on all nodes.
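The KEY_ARN value should be the full ARN of the KMS key. A hypothetical format check (the ARN below is a made-up example, and the pattern reflects the typical arn:aws:kms:&lt;region&gt;:&lt;account-id&gt;:key/&lt;key-id&gt; shape rather than anything Dremio-specific):

```python
import re

# Typical shape of a KMS key ARN: arn:aws:kms:<region>:<account-id>:key/<key-id>
ARN_PATTERN = re.compile(r"^arn:aws:kms:[a-z0-9-]+:\d{12}:key/[0-9a-f-]+$")

# Made-up example ARN.
key_arn = "arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab"
print(bool(ARN_PATTERN.match(key_arn)))  # True
```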

Azure Storage

caution

Soft delete for blobs is not supported for Azure Storage accounts. Soft delete should be disabled to establish a successful connection.

note
  • In Dremio 24+, the dremio.azure.key property can be encrypted using the dremio-admin encrypt CLI command. The encrypted value produced with the encrypt command must be prefixed with dremio+ (for example, dremio+secret:1.812WZLVORD26pwyAg8qKtQqw9Te8Xom5mdkSMmR_U4).

  • In Dremio 24.1+, you can provide the URI of an Azure Key Vault secret as the value for the dremio.azure.key property. Add dremio+azure-key-vault+ as a prefix to the secret URI (for example, dremio+azure-key-vault+https://myvault.vault.azure.net/secrets/mysecret). Dremio connects to Azure Key Vault to fetch the secret to use for the shared access key and does not store the fetched secret. Additional setup in Azure is required to use Azure Key Vault.

Azure Storage is the foundation for the ADLS Gen2 service. See Azure Storage for more information.

To set up configuration for distributed storage:

  1. Change the following property in the dremio.conf file. You will first need to create the storage root directory to use for the dremio.conf and core-site.xml files. The ALTERNATIVE_STORAGE_ROOT_DIRECTORY is optional and is used as an alternative location for creating sub-directories; if it is not specified, the sub-directories are created directly under the FILE_SYSTEM directory.

    paths: {
    ...
    dist: "dremioAzureStorage://:///<FILE_SYSTEM_NAME>/<ALTERNATIVE_STORAGE_ROOT_DIRECTORY>"
    }
  2. Create core-site.xml and add the following properties:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.dremioAzureStorage.impl</name>
        <description>FileSystem implementation. Must always be com.dremio.plugins.azure.AzureStorageFileSystem</description>
        <value>com.dremio.plugins.azure.AzureStorageFileSystem</value>
      </property>
      <property>
        <name>dremio.azure.account</name>
        <description>The name of the storage account.</description>
        <value>ACCOUNT_NAME</value>
      </property>
      <property>
        <name>dremio.azure.key</name>
        <description>The shared access key for the storage account.</description>
        <value>Azure Key Vault secret URI with prefix (in 24.1+) or ACCESS_KEY</value>
      </property>
      <property>
        <name>dremio.azure.mode</name>
        <description>The storage account type. Value: STORAGE_V2</description>
        <value>STORAGE_V2</value>
      </property>
      <property>
        <name>dremio.azure.secure</name>
        <description>Boolean option to enable SSL connections. Default: True. Value: True/False</description>
        <value>True</value>
      </property>
    </configuration>
  3. Copy the core-site.xml file to Dremio's configuration directory (the same location as the dremio.conf file) on all nodes.
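For the Azure Key Vault option noted above, the dremio.azure.key value is simply the secret URI with the dremio+azure-key-vault+ prefix. A sketch (the vault and secret names are placeholders):

```python
# Placeholder Key Vault secret URI.
secret_uri = "https://myvault.vault.azure.net/secrets/mysecret"

# Prefix it as Dremio 24.1+ expects for dremio.azure.key.
dremio_azure_key = "dremio+azure-key-vault+" + secret_uri
print(dremio_azure_key)
```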

Configuring OAuth 2.0

note
  • In Dremio 24+, the dremio.azure.clientSecret property can be encrypted using the dremio-admin encrypt CLI command. The encrypted value produced with the encrypt command must be prefixed with dremio+ (for example, dremio+secret:1.812WZLVORD26pwyAg8qKtQqw9Te8Xom5mdkSMmR_U4).

  • In Dremio 24.1+, you can provide the URI of an Azure Key Vault secret as the value for the dremio.azure.clientSecret property. Add dremio+azure-key-vault+ as a prefix to the secret URI (for example, dremio+azure-key-vault+https://myvault.vault.azure.net/secrets/mysecret). Dremio connects to Azure Key Vault to fetch the secret to use for the shared access key and does not store the fetched secret. Additional setup in Azure is required to use Azure Key Vault.

To enable distributed storage with OAuth 2.0, update the core-site.xml file. See the following sample information for reference:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.dremioAzureStorage.impl</name>
    <description>FileSystem implementation. Must always be com.dremio.plugins.azure.AzureStorageFileSystem</description>
    <value>com.dremio.plugins.azure.AzureStorageFileSystem</value>
  </property>
  <property>
    <name>dremio.azure.account</name>
    <description>The name of the storage account.</description>
    <value>ACCOUNT_NAME</value>
  </property>
  <property>
    <name>dremio.azure.mode</name>
    <description>The storage account type. Value: STORAGE_V1 or STORAGE_V2</description>
    <value>MODE</value>
  </property>
  <property>
    <name>dremio.azure.secure</name>
    <description>Boolean option to enable SSL connections. Default: True. Value: True/False</description>
    <value>SECURE</value>
  </property>
  <property>
    <name>dremio.azure.credentialsType</name>
    <description>The credentials used for authentication. Value: ACCESS_KEY or AZURE_ACTIVE_DIRECTORY</description>
    <value>CREDENTIALS_TYPE</value>
  </property>
  <property>
    <name>dremio.azure.clientId</name>
    <description>The client ID of the Azure application used for Azure Active Directory.</description>
    <value>CLIENT_ID</value>
  </property>
  <property>
    <name>dremio.azure.tokenEndpoint</name>
    <description>OAuth 2.0 token endpoint for Azure Active Directory (v1.0).</description>
    <value>TOKEN_ENDPOINT</value>
  </property>
  <property>
    <name>dremio.azure.clientSecret</name>
    <description>The client secret of the Azure application used for Azure Active Directory.</description>
    <value>Azure Key Vault secret URI with prefix (in 24.1+) or CLIENT_SECRET</value>
  </property>
</configuration>
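The TOKEN_ENDPOINT value for Azure Active Directory (v1.0) generally takes the shape below. This is an assumption based on standard Azure AD conventions, not something this document specifies; TENANT_ID is a placeholder, and you should confirm the endpoint in your Azure portal:

```python
# Placeholder tenant; the v1.0 token endpoint shape is an assumption based on
# standard Azure AD conventions.
tenant_id = "TENANT_ID"
token_endpoint = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
print(token_endpoint)  # https://login.microsoftonline.com/TENANT_ID/oauth2/token
```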

Configuring OAuth 2.0 with Azure Government Cloud

To use OAuth 2.0 authentication with Azure Government cloud platform, add the following property to the core-site.xml:

<property>
  <name>fs.azure.endpoint</name>
  <description>Azure Government Cloud endpoint.</description>
  <value>GOVERNMENT_CLOUD_ENDPOINT</value>
</property>

Azure Government

To configure the Azure Storage data source to access data on the Azure Government cloud platform, add the fs.azure.endpoint property to the core-site.xml file along with the general Azure Storage properties, and copy the core-site.xml file to Dremio's configuration directory on all nodes.

  • For Storage V2:

    <property>
      <name>fs.azure.endpoint</name>
      <description>The Azure Storage endpoint to use.</description>
      <value>dfs.core.usgovcloudapi.net</value>
    </property>
  • For Storage V1:

    <property>
      <name>fs.azure.endpoint</name>
      <description>The Azure Storage endpoint to use.</description>
      <value>blob.core.usgovcloudapi.net</value>
    </property>

This configuration is done in addition to the configuration done via the UI. See Azure Storage for more information.

Google Cloud Storage

Google Cloud Storage (GCS) is a globally distributed object storage service provided by Google Cloud Platform. See Google Cloud Storage for more information.

Before configuring GCS as Dremio's distributed storage, create a storage root directory to use for the dremio.conf and core-site.xml files.

To set up configuration for distributed storage:

  1. Change the paths.dist property in the dremio.conf file.

    paths: {
    ...
    dist: "dremiogcs:///<bucket_name>/<folder1>"
    }

    For example: dist: "dremiogcs:///dremio/dremiodiststorage"

  2. Create core-site.xml and add the following properties:
    To use Google application default credentials to authenticate to the GCS bucket, configure the parameters below.

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>dremio.gcs.whitelisted.buckets</name>
        <description>GCS bucket to use for distributed storage.</description>
        <value>BUCKET_NAME</value>
      </property>
      <property>
        <name>fs.dremiogcs.impl</name>
        <description>The FileSystem implementation. Must be set to com.dremio.plugins.gcs.GoogleBucketFileSystem</description>
        <value>com.dremio.plugins.gcs.GoogleBucketFileSystem</value>
      </property>
      <property>
        <name>dremio.gcs.use_keyfile</name>
        <description>Do not use the key file.</description>
        <value>false</value>
      </property>
    </configuration>

    You can also use a client ID and secret for configuring the distributed storage in place of application default credentials:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>dremio.gcs.whitelisted.buckets</name>
        <description>GCS bucket to use for distributed storage.</description>
        <value>BUCKET_NAME</value>
      </property>
      <property>
        <name>fs.dremiogcs.impl</name>
        <description>The FileSystem implementation. Must be set to com.dremio.plugins.gcs.GoogleBucketFileSystem</description>
        <value>com.dremio.plugins.gcs.GoogleBucketFileSystem</value>
      </property>
      <property>
        <name>dremio.gcs.use_keyfile</name>
        <description>Use the key file.</description>
        <value>true</value>
      </property>
      <property>
        <name>dremio.gcs.projectId</name>
        <description>GCP project ID.</description>
        <value>GCP_PROJECT_ID</value>
      </property>
      <property>
        <name>dremio.gcs.clientId</name>
        <description>GCP service account client ID.</description>
        <value>SERVICE_ACCOUNT_CLIENT_ID</value>
      </property>
      <property>
        <name>dremio.gcs.clientEmail</name>
        <description>GCP service account client email.</description>
        <value>SERVICE_ACCOUNT_EMAIL</value>
      </property>
      <property>
        <name>dremio.gcs.privateKeyId</name>
        <description>GCP service account private key ID.</description>
        <value>PRIVATE_KEY</value>
      </property>
      <property>
        <name>dremio.gcs.privateKey</name>
        <description>GCP service account private key.</description>
        <value>SECRET_KEY</value>
      </property>
    </configuration>
  3. Copy the core-site.xml file to Dremio's configuration directory (the same location as the dremio.conf file) on all nodes.
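The dremio.gcs.* key-file properties correspond to fields in a standard GCP service-account JSON key file. The mapping below is a sketch under that assumption (the sample values are placeholders, and real key files contain more fields):

```python
import json

# Placeholder service-account key file; real files contain more fields.
key_file = json.loads("""
{
  "project_id": "GCP_PROJECT_ID",
  "client_id": "SERVICE_ACCOUNT_CLIENT_ID",
  "client_email": "SERVICE_ACCOUNT_EMAIL",
  "private_key_id": "PRIVATE_KEY_ID",
  "private_key": "PRIVATE_KEY"
}
""")

# Assumed mapping from key-file fields to core-site.xml properties.
properties = {
    "dremio.gcs.projectId": key_file["project_id"],
    "dremio.gcs.clientId": key_file["client_id"],
    "dremio.gcs.clientEmail": key_file["client_email"],
    "dremio.gcs.privateKeyId": key_file["private_key_id"],
    "dremio.gcs.privateKey": key_file["private_key"],
}
print(properties["dremio.gcs.projectId"])  # GCP_PROJECT_ID
```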

HDFS

Before configuring HDFS as Dremio's distributed storage, test adding the same cluster as a Dremio source and verify the connection.

The following are dremio.conf file changes:

paths: {
...
dist: "hdfs://<NAMENODE_HOST>:8020/path"
}

When deploying on Hadoop with YARN, Dremio automatically copies this option to all nodes, so it only needs to be configured manually on coordinator nodes.

Name Node HA
If Name Node HA is enabled, specify the distributed storage path (paths.dist in dremio.conf) using the fs.defaultFS value instead of the active name node (e.g., <value_for_fs_defaultFS>/path).

The fs.defaultFS value can be found in core-site.xml (typically under /etc/hadoop/conf).
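A small sketch of pulling fs.defaultFS out of core-site.xml and building the paths.dist value (the sample XML stands in for /etc/hadoop/conf/core-site.xml, and the cluster name is a placeholder):

```python
import xml.etree.ElementTree as ET

# Stand-in for /etc/hadoop/conf/core-site.xml on an HA cluster.
core_site = ET.fromstring("""
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
  </property>
</configuration>
""")

# Find the fs.defaultFS value and append the storage path.
default_fs = next(
    prop.findtext("value")
    for prop in core_site.iter("property")
    if prop.findtext("name") == "fs.defaultFS"
)
dist_path = default_fs + "/path"
print(dist_path)  # hdfs://mycluster/path
```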

As described in the Hadoop-with-YARN deployment guide, ensure that you have copied the core-site.xml, hdfs-site.xml, and yarn-site.xml files (typically under /etc/hadoop/conf) into Dremio's conf directory.

MapR-FS

Before configuring MapR-FS as Dremio's distributed storage, test adding the same cluster as a Dremio source and verify the connection.

The following are dremio.conf file changes:

paths: {
...
dist: "maprfs:///<MOUNT_PATH>/<CACHE_DIRECTORY>"
}

When deploying on MapR with YARN, Dremio automatically copies this option to all nodes, so it only needs to be configured manually on coordinator nodes.

NAS

NAS (network attached storage) is a device that serves files via a network using a protocol such as NFS. Dremio supports the NFS protocol with NAS.

| NFS protocol type | Supported? |
| --- | --- |
| Netapp | Yes |
| MapR NFS shares | No |
| Windows file shares | No |

To configure NAS as Dremio's distributed storage, add the distributed path to the dremio.conf file:

paths: {
...
dist: "file:///shared_mount_path"
}

Relocating Distributed Storage and Metadata

In versions 24.1.4+ and 24.2.0+, Dremio supports relocating the distributed storage and metadata of Parquet tables (internal Iceberg tables). If you use Parquet tables with unlimited splits, this feature lets you move the distributed storage and metadata to a new location without having to forget and re-promote your tables.

caution

Distributed storage relocation does not currently support reflections.

Relocation is supported in the following distributed storage configurations:

  • Amazon S3
  • Azure Storage
  • Google Cloud Storage
  • HDFS
  • NAS
note

Distributed storage relocation is supported only within the same source type. For example, you can move your distributed storage from one S3 bucket to another S3 bucket, but you cannot relocate from a NAS source to an S3 bucket.

Follow the steps below to relocate distributed storage and its metadata:

  1. Update paths.dist in dremio.conf:

    a. Locate your dremio.conf file and open it with a text editor.

    b. Find the paths: section in the file, which contains local: and dist: variables. The value after dist: represents your current distributed storage configuration.

    c. Change the value of dist: to the new location, making sure to note the original value you are replacing.

    d. Save the file, then repeat these instructions for all executor and coordinator nodes in your Dremio instance.

  2. Restart Dremio.

  3. Move your metadata contents to the new distributed storage:

    a. Identify the file path of the original distributed storage variable.

    b. Locate the /metadata subdirectory following your paths.dist value.

    c. Copy or move the /metadata directory (including all subdirectories) from the old distributed storage location to the new paths.dist directory. This process replaces the empty /metadata directory with your existing metadata content.

    The steps for moving metadata will vary according to your distributed storage source type.

    • Amazon S3: Utilize the Move/Copy button within the AWS Console or CLI.
    • Azure Storage: There are many ways to relocate an Azure container. We recommend using azcopy.
    • Google Cloud Storage: There are many ways to relocate files within GCS. We recommend using gsutil.
    • HDFS: Use your Hadoop / HDFS command line tools, ensuring the correct permissions for relocating files.
    • NAS: Use your operating system's GUI or command line.
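Whichever source type you use, it is worth verifying that the copied /metadata tree matches the original before relying on it. A hypothetical verification sketch using per-file checksums (the paths in the comment are placeholders):

```python
import hashlib
import os

def tree_digest(root: str) -> dict:
    """SHA-256 digest of every file under root, keyed by relative path."""
    digests = {}
    for dirpath, _, files in os.walk(root):
        for fname in files:
            full = os.path.join(dirpath, fname)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as f:
                digests[rel] = hashlib.sha256(f.read()).hexdigest()
    return digests

# Placeholder locations; compare old and new metadata directories:
# assert tree_digest("/old/dist/metadata") == tree_digest("/new/dist/metadata")
```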
caution

If your Dremio instance is in an invalid state after completing the steps above, try restarting Dremio again to pick up the path relocation changes.