Azure Storage
The Dremio Azure Storage Connector includes support for the following Azure Storage services:
- Azure Blob Storage – Azure Blob storage is Microsoft’s object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data, such as text or binary data.
- Azure Data Lake Storage Gen2 – Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on top of Azure Blob storage. It converges the capabilities of Azure Blob Storage and Azure Data Lake Storage Gen1: features from Gen1, such as file system semantics, directory- and file-level security, and scale, are combined with the low-cost tiered storage and high availability/disaster recovery capabilities of Azure Blob storage.
Configuration
General
Dremio Field | Azure Property | Description |
---|---|---|
Resource Name | Name | Name of the Azure Storage source. |
Connection | Account Name | Name of the Azure Storage account. |
Connection | Account Kind | Select the Azure Storage version for this source connection. Default: StorageV2 |
Connection | Encrypt connection | Select to encrypt network traffic with TLS. |
Authentication | Shared Access Key | Use an Azure Shared Access Key for authentication. |
Authentication | Azure Active Directory | Use Azure Active Directory credentials for authentication. |
Azure Active Directory Authentication
To configure the Azure Storage source to use Azure Active Directory for authentication, provide the following values:
- Application ID - The Application (Client) ID in Azure.
- OAuth 2.0 Token Endpoint - The OAuth 2.0 token endpoint (v1.0).
- Client Secret - The secret key generated for the application.
To obtain the Azure Active Directory configuration values:
- Log in to the Azure Portal.
- Navigate to App Registrations.
- If not already done, create an app for OAuth 2.0.
The configuration values are available from the Portal.
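As a rough illustration of the three values listed above, the sketch below groups them in one structure and builds a v1.0 token endpoint URL. It assumes the Azure public cloud, where the v1.0 endpoint has the form `https://login.microsoftonline.com/<tenant-id>/oauth2/token`. The names `AzureAdSettings`, `v1_token_endpoint`, and `looks_like_v1_endpoint`, and all sample values, are hypothetical and not part of Dremio or Azure.

```python
from dataclasses import dataclass

# Assumption: Azure public cloud login host. Other Azure clouds use different hosts.
LOGIN_HOST = "https://login.microsoftonline.com"


@dataclass(frozen=True)
class AzureAdSettings:
    """Hypothetical container for the three values the source form asks for."""
    application_id: str   # Application (Client) ID
    token_endpoint: str   # OAuth 2.0 token endpoint (v1.0)
    client_secret: str    # Secret key generated for the application


def v1_token_endpoint(tenant_id: str) -> str:
    """Build the v1.0 token endpoint URL for a tenant (directory) ID."""
    return f"{LOGIN_HOST}/{tenant_id}/oauth2/token"


def looks_like_v1_endpoint(url: str) -> bool:
    """Return True if the URL has the v1.0 shape rather than the v2.0 shape."""
    return url.startswith(LOGIN_HOST + "/") and url.endswith("/oauth2/token")


# Placeholder values only; real values come from the app registration in the Portal.
settings = AzureAdSettings(
    application_id="00000000-0000-0000-0000-000000000000",
    token_endpoint=v1_token_endpoint("example-tenant-id"),
    client_secret="<client-secret>",
)
```

A check such as `looks_like_v1_endpoint` can catch the common mistake of pasting the v2.0 endpoint (which contains `/oauth2/v2.0/token`) into a field that expects v1.0.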
Advanced Options
Advanced Options include:
- Enable asynchronous access when possible (default)
- Enable exports into the source (CTAS and DROP).
- Root Path – Root path for the source.
- Advanced Properties – A list of connection properties (name and value).
- Blob containers and Filesystem Whitelist – Specifies a list of containers to include. Note that providing this list disables automatic container and filesystem discovery; Dremio limits the available containers and filesystems to the ones provided.
- Cache Options
  - Enable Columnar Cloud Cache when possible
  - Max percent of total available cache space to use when possible
Reflection Refresh
The Reflection refresh policy controls how often Reflections are scheduled to refresh automatically and the time limit after which Reflections expire and are removed.
Refresh Settings
- Never refresh - Select to prevent automatic reflection refresh. The default is to refresh automatically.
- Refresh every - How often to refresh reflections, specified in hours, days or weeks. This option is ignored if Never refresh is selected.
Expire Settings
- Never expire - Select to prevent reflections from expiring. The default is to expire automatically after the time limit below.
- Expire after - The time limit after which reflections expire and are removed from Dremio, specified in hours, days or weeks. This option is ignored if Never expire is selected.
Metadata
Metadata settings include:
- Dataset Handling options
- Metadata Refresh options
Dataset Handling
- Remove dataset definitions if underlying data is unavailable (Default).
  - When selected, datasets are automatically removed if the underlying folders or files for a dataset are removed from Azure Storage, or if the folder or source is not accessible. When not selected, Dremio does not remove dataset definitions even if the underlying files or folders are removed from Azure Storage. This is useful if files are temporarily deleted and then replaced with a new set of files.
- Automatically format files into physical datasets when users issue queries.
  - When selected, Dremio automatically promotes a folder to a physical dataset (PDS) using default options. If you have CSV files, especially with non-default formatting, it might be useful to leave this option unselected.
Metadata Refresh
- Dataset Discovery – Refresh interval for top-level objects including folders and physical datasets.
  - Fetch every – How often to perform dataset discovery, specified in hours, days or weeks. Default: 1 hour
- Dataset Details – Refresh interval to gather detailed information on promoted PDS tables, including fields, data types, shards, statistics, and locality. This information is used during query planning and optimization.
  - Fetch mode – Select either Only Queried Datasets, All Datasets, or As Needed. Default: Only Queried Datasets
    - Only Queried Datasets – Only update details for previously queried objects in a source. This option increases query performance because less work is needed at query time for these datasets.
    - All Datasets – Updates details for all datasets in a source. This option increases query performance because less work is needed at query time.
    - As Needed – Only update details for a dataset at query runtime. This option minimizes metadata refresh when datasets are not used, but can lead to longer planning times if metadata needs to be refreshed at runtime.
  - Fetch every – How often to refresh dataset details, specified in minutes, hours, days, or weeks. Default: 1 hour
  - Expire after – Time limit to expire dataset details, specified in minutes, hours, days, or weeks. Default: 3 hours
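To make the interaction between the Dataset Details settings Fetch every and Expire after more concrete, here is a minimal conceptual sketch using the defaults listed above (1 hour and 3 hours). It is only a mental model of the two intervals, not Dremio's implementation; the function name `dataset_details_state` and the state labels are hypothetical.

```python
from datetime import datetime, timedelta

# Defaults taken from the Dataset Details settings described above.
FETCH_EVERY = timedelta(hours=1)
EXPIRE_AFTER = timedelta(hours=3)


def dataset_details_state(
    last_fetched: datetime,
    now: datetime,
    fetch_every: timedelta = FETCH_EVERY,
    expire_after: timedelta = EXPIRE_AFTER,
) -> str:
    """Classify cached dataset details by age (illustrative labels only)."""
    age = now - last_fetched
    if age >= expire_after:
        return "expired"          # past the Expire after limit
    if age >= fetch_every:
        return "due for refresh"  # past Fetch every, not yet expired
    return "fresh"


# Example timeline for details fetched at 09:00.
fetched = datetime(2024, 1, 1, 9, 0)
state_0930 = dataset_details_state(fetched, datetime(2024, 1, 1, 9, 30))
state_1030 = dataset_details_state(fetched, datetime(2024, 1, 1, 10, 30))
state_1215 = dataset_details_state(fetched, datetime(2024, 1, 1, 12, 15))
```

Under this model, setting Expire after to a value longer than Fetch every leaves a window in which details are stale but still usable, which matches the default pairing of 1 hour and 3 hours.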
Sharing
Sharing options control which users can edit datasets in the source. Access controls allow privileges to be applied to users or groups at the object level. With these permissions, administrators decide who can view and alter data. See Access Controls for additional information on user privileges.
Distributed Storage
See Configuring Distributed Storage for information on configuring Azure Storage as a distributed storage source.
Azure Government
To configure Azure Storage for the Azure Government platform, add one of the following properties under Advanced Properties on the Advanced Options tab, depending on whether the Azure Storage source is of Account Kind Storage V1 or Storage V2.
- Storage V1 – Add the following property and value if the Azure Storage source is of Account Kind Storage V1:
  fs.azure.endpoint = blob.core.usgovcloudapi.net
- Storage V2 – Add the following property and value if the Azure Storage source is of Account Kind Storage V2:
  fs.azure.endpoint = dfs.core.usgovcloudapi.net
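The choice between the two Azure Government endpoints can be expressed as a small lookup. The sketch below is illustrative only: the property name and endpoint values come from this page, while the function name `gov_endpoint_property` and the account-kind keys `"STORAGE_V1"` and `"STORAGE_V2"` are hypothetical labels, not Dremio identifiers.

```python
# Endpoint values from this page, keyed by hypothetical account-kind labels.
GOV_ENDPOINTS = {
    "STORAGE_V1": "blob.core.usgovcloudapi.net",
    "STORAGE_V2": "dfs.core.usgovcloudapi.net",
}


def gov_endpoint_property(account_kind: str) -> tuple[str, str]:
    """Return the (name, value) pair to enter under Advanced Properties."""
    try:
        return ("fs.azure.endpoint", GOV_ENDPOINTS[account_kind])
    except KeyError:
        raise ValueError(f"Unknown account kind: {account_kind!r}") from None


# Example: the property for a Storage V2 source.
name, value = gov_endpoint_property("STORAGE_V2")
```

Raising an error for an unrecognized account kind avoids silently falling back to the wrong endpoint.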
Columnar Cloud Cache
Azure Storage supports Columnar Cloud Cache.
See Configuring Cloud Cache for additional information.