Azure Storage
The Dremio Azure Storage Connector includes support for the following Azure Storage services:
- Azure Blob Storage: Azure Blob Storage is Microsoft's object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data, such as text or binary data.
- Azure Data Lake Storage Gen2: Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on top of Azure Blob Storage, that converges the capabilities of Azure Blob Storage and Azure Data Lake Storage Gen1. Features from Azure Data Lake Storage Gen1, such as file system semantics, directory- and file-level security, and scale, are combined with the low-cost tiered storage and high availability/disaster recovery capabilities of Azure Blob Storage.
Soft delete for blobs is not supported for Azure Storage accounts. Disable soft delete to establish a successful connection.
If you see 0-byte files created alongside your Iceberg tables in your Azure Storage account, these files do not impact Dremio's functionality and can be ignored if you cannot update your storage container. If you can update your container, see Azure Data Lake Storage Gen2 hierarchical namespace for information on enabling hierarchical namespace, which prevents the creation of these files.
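If you are not sure whether hierarchical namespace is enabled on your account, one way to check is with the Azure management SDK for Python. The following sketch assumes the azure-identity and azure-mgmt-storage packages are installed, that your signed-in identity can read the storage account, and that the subscription, resource group, and account names shown are placeholders:

```python
# Sketch: check whether hierarchical namespace (HNS) is enabled on a storage
# account before using it as a Dremio source. All names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

subscription_id = "<subscription-id>"       # placeholder
resource_group = "<resource-group-name>"    # placeholder
account_name = "<storage-account-name>"     # placeholder

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)
account = client.storage_accounts.get_properties(resource_group, account_name)

# is_hns_enabled is True for Azure Data Lake Storage Gen2 (HNS) accounts.
print(f"Hierarchical namespace enabled: {account.is_hns_enabled}")
```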
Configuring Azure Storage as a Source
- On the Datasets page, to the right of Sources in the left panel, click the add source button.
- In the Add Data Source dialog, under Object Storage, select Azure Storage.
General
- Name: Name to use for the Azure Storage source. The name cannot include the following special characters: /, :, [, or ].
Connection
- Account Name: Name of the Azure Storage account.
- Encrypt connection: Select to encrypt network traffic over SSL.
- Account Version: Version of Azure Storage. The options are Storage V1 and Storage V2. The default is Storage V2.
Authentication
For Authentication Type, select Shared access key or Microsoft Entra ID.
If you select Shared access key authentication, select the secret store method from the dropdown menu:
- Dremio: Provide the shared access key in plain text. Dremio stores the key.
- Azure Key Vault: Provide the URI for the Azure Key Vault secret that stores the shared access key. The URI format is https://<vault_name>.vault.azure.net/secrets/<secret_name> (for example, https://myvault.vault.azure.net/secrets/mysecret). Dremio connects to Azure Key Vault and fetches the secret to use as the shared access key. Dremio does not store the fetched secret.
  Note: To use Azure Key Vault as your secret store, you must:
  - Deploy Dremio on Azure.
  - Complete the Requirements for Authenticating with Azure Key Vault.
  It is not necessary to restart the Dremio coordinator when you rotate secrets stored in Azure Key Vault. Read Requirements for Secrets Rotation for more information.
- AWS Secrets Manager: Provide the Amazon Resource Name (ARN) for the AWS Secrets Manager secret that holds the shared access key. The ARN is available in the AWS web console or through the AWS command line tools.
- HashiCorp Vault: Choose the HashiCorp secrets engine you are using from the dropdown menu and enter the secret reference for the shared access key, in the correct format, in the provided field.
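Before saving the source, you can optionally confirm that the account name and shared access key work outside of Dremio. The following sketch assumes the azure-storage-blob Python package is installed and uses placeholder account values; it lists the containers in the account with the same credentials you plan to give Dremio:

```python
# Sketch: verify an Azure Storage account name and shared access key before
# entering them in Dremio. The account name and key below are placeholders.
from azure.storage.blob import BlobServiceClient

account_name = "<storage-account-name>"   # placeholder
account_key = "<shared-access-key>"       # placeholder

service = BlobServiceClient(
    account_url=f"https://{account_name}.blob.core.windows.net",
    credential=account_key,
)

# A successful listing confirms that the account name and key are valid.
for container in service.list_containers():
    print(container.name)
```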
If you select Microsoft Entra ID authentication:
- For Application ID, specify the Application (Client) ID in Microsoft Entra ID.
- For OAuth 2.0 Token Endpoint, specify the OAuth 2.0 token endpoint for your Azure application.
- For Application Secret Store, select the secret store method from the dropdown menu:
  - Dremio: Provide the application secret in plain text. Dremio stores the secret.
  - Azure Key Vault: Provide the URI for the Azure Key Vault secret that stores the application secret. The URI format is https://<vault_name>.vault.azure.net/secrets/<secret_name> (for example, https://myvault.vault.azure.net/secrets/mysecret). Dremio connects to Azure Key Vault and fetches the secret to use as the application secret. Dremio does not store the fetched secret.
    Note: To use Azure Key Vault as your application secret store, you must:
    - Deploy Dremio on Azure.
    - Complete the Requirements for Authenticating with Azure Key Vault.
    It is not necessary to restart the Dremio coordinator when you rotate secrets stored in Azure Key Vault. Read Requirements for Secrets Rotation for more information.
  - AWS Secrets Manager: Provide the Amazon Resource Name (ARN) for the AWS Secrets Manager secret that holds the application secret. The ARN is available in the AWS web console or through the AWS command line tools.
  - HashiCorp Vault: Choose the HashiCorp secrets engine you are using from the dropdown menu and enter the secret reference for the application secret, in the correct format, in the provided field.
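You can run a similar check for Microsoft Entra ID credentials before saving the source. The following sketch assumes the azure-identity and azure-storage-blob Python packages are installed and uses placeholder tenant, application, and secret values; the tenant ID is the same one that appears in your OAuth 2.0 token endpoint (for example, https://login.microsoftonline.com/<tenant_id>/oauth2/token):

```python
# Sketch: verify a Microsoft Entra ID application (client) ID and application
# secret before configuring Entra ID authentication in Dremio. All values
# below are placeholders.
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

credential = ClientSecretCredential(
    tenant_id="<tenant-id>",                # placeholder
    client_id="<application-id>",           # placeholder
    client_secret="<application-secret>",   # placeholder
)

service = BlobServiceClient(
    account_url="https://<storage-account-name>.blob.core.windows.net",
    credential=credential,
)

# Listing containers succeeds only if the application has been granted access
# to the storage account (for example, through a Storage Blob Data role).
for container in service.list_containers():
    print(container.name)
```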
Requirements for Authenticating with Azure Key Vault
Dremio uses Microsoft Entra ID managed identities to connect to Azure Key Vault. Follow the Microsoft Entra ID instructions linked from the steps below to ensure that Dremio can connect to Azure Key Vault for authentication when you create an Azure Storage source:
- Create a user-assigned managed identity in Microsoft Entra ID.
- Assign the managed identity to the Dremio coordinator and executor virtual machines (VMs).
- Assign the Azure Key Vault access policy to allow access to the managed identity.
- Add a secret in Azure Key Vault whose value is the shared access key or the application secret (depending on the authentication type you select) that Dremio requires to connect to your Azure Storage source.
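To confirm that the managed identity assigned to the Dremio VMs can read the secret, you can run a quick check from one of those VMs. The following sketch assumes the azure-identity and azure-keyvault-secrets Python packages are installed and uses placeholder vault, secret, and client ID values:

```python
# Sketch: from a Dremio coordinator or executor VM, confirm that the assigned
# managed identity can fetch the Key Vault secret that Dremio will use.
# The vault URL, secret name, and client ID below are placeholders.
from azure.identity import ManagedIdentityCredential
from azure.keyvault.secrets import SecretClient

# For a user-assigned managed identity, pass its client ID.
credential = ManagedIdentityCredential(client_id="<managed-identity-client-id>")

client = SecretClient(
    vault_url="https://<vault_name>.vault.azure.net",
    credential=credential,
)

secret = client.get_secret("<secret_name>")
print(f"Fetched secret '{secret.name}' (value not printed).")
```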
Requirements for Secrets Rotation
For seamless rotation of secrets stored in Azure Key Vault, the rotation must be done with two secrets. After the Azure Key Vault secret value is updated, both secrets must remain valid for the minimum holdover period:
- Plain secrets: 5 minutes
- Microsoft Entra ID client secrets: 90 minutes
You may invalidate the old secret when the holdover period expires.
It is not necessary to restart the Dremio coordinator when you rotate secrets stored in Azure Key Vault.
Advanced Options
- Enable asynchronous access when possible: Select to enable cloud caching so that the Azure Storage source can support simultaneous actions like adding and editing new sources.
- Enable partition column inference: Select if Dremio should use partition column inference to handle partition columns.
- Root Path: Root path for the source. The default is /.
- Advanced Properties: Add connection properties, specifying their names and values.
- Blob Containers & Filesystem Allowlist: Add the names of containers to include in the source. This setting disables automatic container and filesystem discovery. Dremio limits the available containers and filesystems to those you add to the allowlist.
Cache Options
- Enable local caching when possible: Select to create local caches of any data used from the source. Read Configuring Cloud Caching for more information.
- Max percent of total available cache space to use when possible: Maximum amount of cache space, as a percentage, that a source can use on any single executor node when local caching is enabled. The default value is 100.
Reflection Refresh
The reflection refresh options control how often Dremio refreshes reflections automatically and the time limit after which reflections expire and are removed.
Refresh Policy
- Never refresh: Select to prevent the automatic refresh of reflections. The default is to allow automatic refreshes.
- Refresh every: If using automatic refresh, how often to refresh reflections, specified in minutes, hours, days, or weeks. The default is 1 hour. Ignored if you select Never refresh.
- Never expire: Select to prevent the expiration of reflections. The default is expiration after the specified time limit.
- Expire after: Time limit after which reflections expire and are removed from Dremio, specified in minutes, hours, days, or weeks. The default is 3 hours. Ignored if you select Never expire.
Metadata
Metadata settings include options for dataset handling and metadata refresh.
Dataset Handling
- Remove dataset definitions if underlying data is unavailable: Select to automatically remove datasets if their underlying files and folders are removed from Azure Storage or if the folder or source is not accessible. This option is selected by default. If not selected, Dremio does not remove dataset definitions even if their underlying files and folders are removed from Azure Storage, which is useful when files are temporarily deleted and replaced with a new set of files.
- Automatically format files into tables when users issue queries: Select to automatically promote folders to tables using the default options when a user runs a query on the folder data for the first time. This option is not selected by default. For Azure Storage sources that contain CSV files, especially CSV files with non-default formatting, consider leaving this option unselected.
Metadata Refresh
Metadata Refresh settings allow you to configure the refresh interval for gathering detailed information about promoted tables, including fields, data types, shards, statistics, and locality. Dremio uses this information during query planning and optimization.
- Fetch mode: The default is Only Queried Datasets, which updates details only for previously queried objects in a source. This option increases query performance because the datasets require less work at query time. Other options are deprecated.
- Fetch every: How often to refresh dataset details, specified in minutes, hours, days, or weeks. The default is 1 hour.
- Expire after: Time limit after which dataset details expire, specified in minutes, hours, days, or weeks. The default is 3 hours.
Privileges
On the Privileges tab, you can grant privileges to specific users or roles. See Access Controls for additional information about privileges.
All privileges are optional.
- For Privileges, enter the user name or role name that you want to grant access to and click the Add to Privileges button. The added user or role is displayed in the USERS/ROLES table.
- For the users or roles in the USERS/ROLES table, toggle the checkmark for each privilege you want to grant on the Dremio source that is being created.
- Click Save after setting the configuration.
Updating an Azure Storage Source
To update an Azure Storage source:
- On the Datasets page, under Object Storage in the panel on the left, find the name of the source you want to edit.
- Right-click the source name and select Settings from the list of actions. Alternatively, click the source name and then click the settings icon at the top right corner of the page.
- In the Source Settings dialog, edit the settings you wish to update. Dremio does not support updating the source name. For information about the settings options, see Configuring Azure Storage as a Source.
- Click Save.
Deleting an Azure Storage Source
If the source is in a bad state (for example, Dremio cannot authenticate to the source or the source is otherwise unavailable), only users who belong to the ADMIN role can delete the source.
To delete an Azure Storage source, perform these steps:
- On the Datasets page, click Sources > Object Storage in the panel on the left.
- In the list of data sources, hover over the name of the source you want to remove and right-click.
- From the list of actions, click Delete.
- In the Delete Source dialog, click Delete to confirm that you want to remove the source.
Deleting a source causes all downstream views that depend on objects in the source to break.
Distributed Storage
See Configuring Distributed Storage for information to configure Azure Storage as a distributed storage source.
Azure Government
To configure Azure Storage for the Azure Government platform, add one of the following properties to the Advanced Options tab under Advanced Properties:
- Storage V1: If the Azure Storage source is of Account Kind Storage V1, add the following property and value:
  fs.azure.endpoint = blob.core.usgovcloudapi.net
- Storage V2: If the Azure Storage source is of Account Kind Storage V2, add the following property and value:
  fs.azure.endpoint = dfs.core.usgovcloudapi.net
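To confirm that your account is reachable on the Azure Government endpoint before adding the property, you can run a quick connectivity check. The following sketch assumes the azure-storage-blob Python package is installed, uses placeholder account values, and points the client at the blob.core.usgovcloudapi.net endpoint:

```python
# Sketch: confirm connectivity to an Azure Government storage account over the
# government cloud endpoint. The account name and key below are placeholders.
from azure.storage.blob import BlobServiceClient

account_name = "<storage-account-name>"   # placeholder
account_key = "<shared-access-key>"       # placeholder

service = BlobServiceClient(
    account_url=f"https://{account_name}.blob.core.usgovcloudapi.net",
    credential=account_key,
)

# get_account_information() returns basic account metadata (such as SKU and
# account kind) when the endpoint and credentials are correct.
print(service.get_account_information())
```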
Columnar Cloud Cache
Azure Storage supports Columnar Cloud Cache.