The Dremio Azure Storage Connector includes support for the following Azure Storage services:
Azure Blob Storage
Azure Blob storage is Microsoft’s object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data, such as text or binary data.
Azure Data Lake Storage Gen2
Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on top of Azure Blob storage. It converges the capabilities of Azure Blob Storage and Azure Data Lake Storage Gen1: features from Azure Data Lake Storage Gen1, such as file system semantics, directory- and file-level security, and scale, are combined with the low-cost tiered storage and high availability/disaster recovery capabilities of Azure Blob storage.
- Name of the Azure Storage source.
- Name of the Azure Storage account.
- Select Azure Storage version for this source connection. Default: StorageV2
- Select to encrypt network traffic with TLS.
- Shared Access Key – Use an Azure Shared Access Key for authentication.
- Azure Active Directory – Use Azure Active Directory credentials for authentication.
Azure Active Directory Authentication
To configure the Azure Storage source to use Azure Active Directory for authentication, provide the following values:
- Application ID - The Application (Client) ID in Azure.
- OAuth 2.0 Token Endpoint - The OAuth 2.0 token endpoint (v1.0).
- Client Secret - The secret key generated for the application.
To obtain the Azure Active Directory configuration values:
- Log in to the Azure Portal.
- Navigate to App Registrations.
- If not already done, create an app for OAuth 2.0.
The configuration values are available from the Portal.
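The token endpoint value is the one most easily entered incorrectly, because the Azure Portal lists both v1.0 and v2.0 endpoints. The following is a minimal sketch, assuming the Azure public cloud login host and a placeholder tenant ID. The helper names are hypothetical and are not part of Dremio or any Azure SDK.

```python
# Sketch only: assumes the Azure public cloud login host.
# Helper names are illustrative, not a Dremio or Azure API.
LOGIN_HOST = "https://login.microsoftonline.com"


def v1_token_endpoint(tenant_id: str) -> str:
    """Build the OAuth 2.0 (v1.0) token endpoint URL for a tenant."""
    tenant_id = tenant_id.strip()
    if not tenant_id:
        raise ValueError("tenant_id must not be empty")
    return f"{LOGIN_HOST}/{tenant_id}/oauth2/token"


def is_v1_token_endpoint(url: str) -> bool:
    """Return True if the URL ends like a v1.0 endpoint, not a v2.0 one."""
    path = url.rstrip("/")
    return path.endswith("/oauth2/token") and "/v2.0/" not in path


# "my-tenant-id" is a placeholder, not a real tenant.
endpoint = v1_token_endpoint("my-tenant-id")
print(endpoint)
print(is_v1_token_endpoint(endpoint))
```

A check like `is_v1_token_endpoint` can be run on a pasted value before it is saved in the source configuration.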
Advanced Options include:
- Enable asynchronous access when possible (default)
- Enable exports into the source (CTAS and DROP).
- Root Path – Root path for the source.
- Advanced Properties – A list of connection properties (name and value).
- Blob containers and Filesystem Whitelist – Specifies a list of containers to include.
Note: Providing this list disables automatic container and filesystem discovery, and Dremio limits the available containers and filesystems to the ones provided.
- Cache Options
- Enable Columnar Cloud Cache when possible
- Max percent of total available cache space to use when possible
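The whitelist behavior described under Advanced Options can be modeled in a few lines. This is a rough illustration of the rule as stated above, not Dremio's implementation. The function name and the container names are made up for the example.

```python
from typing import Iterable, List


def visible_containers(discovered: Iterable[str], whitelist: Iterable[str]) -> List[str]:
    """Model which containers a source exposes.

    A non-empty whitelist disables discovery, so only the listed
    containers are exposed. An empty whitelist falls back to whatever
    automatic discovery found.
    """
    # Drop blank entries and duplicates while keeping the given order.
    allowed = list(dict.fromkeys(name for name in whitelist if name))
    if allowed:
        return allowed
    return sorted(discovered)


# Hypothetical container names.
print(visible_containers(["raw", "logs", "tmp"], []))
print(visible_containers(["raw", "logs", "tmp"], ["raw", "curated"]))
```

In the second call the result contains only the whitelisted names, including one that discovery did not report, because discovery is no longer consulted.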
The Reflection refresh policy controls how often Reflections are scheduled to refresh automatically and the time limit after which Reflections expire and are removed.
- Never refresh - Select to prevent automatic reflection refresh. The default is to refresh automatically.
- Refresh every - How often to refresh reflections, specified in hours, days or weeks. This option is ignored if Never refresh is selected.
- Never expire - Select to prevent reflections from expiring. The default is to expire automatically after the time limit below.
- Expire after - The time limit after which reflections expire and are removed from Dremio, specified in hours, days or weeks. This option is ignored if Never expire is selected.
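The way the four refresh options interact can be sketched as a small data model. The class and method names below are hypothetical and only mirror the rules in the list above: each "Never" checkbox overrides the matching interval.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class ReflectionPolicy:
    """Illustrative model of the refresh policy options (not a Dremio class)."""

    refresh_every: timedelta
    expire_after: timedelta
    never_refresh: bool = False  # unselected by default: refresh automatically
    never_expire: bool = False   # unselected by default: expire automatically

    def effective_refresh(self) -> Optional[timedelta]:
        # "Refresh every" is ignored when "Never refresh" is selected.
        return None if self.never_refresh else self.refresh_every

    def effective_expiry(self) -> Optional[timedelta]:
        # "Expire after" is ignored when "Never expire" is selected.
        return None if self.never_expire else self.expire_after


# Example values chosen for illustration only.
policy = ReflectionPolicy(timedelta(hours=6), timedelta(days=1), never_expire=True)
print(policy.effective_refresh())
print(policy.effective_expiry())
```

Here `None` stands for "no scheduled refresh" or "no expiry".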
Metadata settings include:
- Dataset Handling options
- Metadata Refresh options
- Remove dataset definitions if underlying data is unavailable (Default).
- When selected, datasets are automatically removed if the underlying folders or files for a dataset are removed from Azure Storage, or if the folder or source is not accessible. When not selected, Dremio does not remove dataset definitions even if the underlying files or folders are removed from Azure Storage. This is useful if files are temporarily deleted and replaced with a new set of files.
- Automatically format files into physical datasets when users issue queries.
- When selected, Dremio automatically promotes a folder to a physical dataset (PDS) using default options. If you have CSV files, especially with non-default formatting, it might be useful to leave this option unselected.
- Dataset Discovery – Refresh interval for top-level objects including folders and physical datasets.
- Fetch every – How often to perform dataset discovery, specified in hours, days or weeks. Default: 1 hour
- Dataset Details – Refresh interval to gather detailed information on promoted PDS tables, including fields, data types, shards, statistics, and locality. This information is used during query planning and optimization.
- Fetch mode – Select either Only Queried Datasets, All Datasets, or As Needed. Default: Only Queried Datasets
- Only Queried Datasets – Only update details for previously queried objects in a source. This option increases query performance because less work is needed at query time for these datasets.
- All Datasets – Updates details for all datasets in a source. This option increases query performance because less work is needed at query time.
- As Needed – Only update details for a dataset at query runtime. This option minimizes metadata refresh when datasets are not used, but can lead to longer planning times if metadata needs to be refreshed at runtime.
- Fetch every – How often to refresh dataset details, specified in minutes, hours, days, or weeks. Default: 1 hour
- Expire after – Time limit to expire dataset details, specified in minutes, hours, days, or weeks. Default: 3 hours
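The difference between the three fetch modes can be summarized as "which datasets get their details refreshed on the schedule". The sketch below is a simplified reading of the descriptions above, with a hypothetical function name and made-up dataset names. It does not reflect how Dremio schedules the work internally.

```python
from typing import Iterable, List, Set


def datasets_to_refresh(mode: str, all_datasets: Iterable[str], queried: Set[str]) -> List[str]:
    """Return the datasets whose details a scheduled refresh would update."""
    if mode == "All Datasets":
        return list(all_datasets)
    if mode == "Only Queried Datasets":
        # Only datasets that have been queried before are kept up to date.
        return [name for name in all_datasets if name in queried]
    if mode == "As Needed":
        # Nothing is refreshed on the schedule; details are updated
        # at query runtime instead.
        return []
    raise ValueError(f"unknown fetch mode: {mode!r}")


# Hypothetical dataset names.
catalog = ["sales", "clicks", "archive"]
print(datasets_to_refresh("Only Queried Datasets", catalog, {"sales"}))
print(datasets_to_refresh("As Needed", catalog, {"sales"}))
```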
Sharing options for which users can edit datasets in the source:
- All users can edit – All users can edit datasets in the source
- Specific users – Only specified users can edit datasets in the source
See Sharing and Permissions for additional information on Sharing.
See Configuring Distributed Storage for information on configuring Azure Storage as a distributed storage source.
To configure Azure Storage for the Azure Government platform, add one of the following properties to the Advanced Options tab under Advanced Properties, depending on whether the Azure Storage source is of Account Kind Storage V1 or Storage V2.
- Storage V1 – Add the following property and value if the Azure Storage source is of Account Kind Storage V1
fs.azure.endpoint = blob.core.usgovcloudapi.net
- Storage V2 – Add the following property and value if the Azure Storage source is of Account Kind Storage V2
fs.azure.endpoint = dfs.core.usgovcloudapi.net
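When source definitions are generated by a script, the choice between the two Azure Government endpoints can be reduced to a lookup. This is a minimal sketch using only the property name and values listed above. The function name is hypothetical, and how the resulting pair is submitted to Dremio is left out.

```python
from typing import Tuple

# Endpoint values taken from the Azure Government settings above.
GOV_ENDPOINTS = {
    "storagev1": "blob.core.usgovcloudapi.net",
    "storagev2": "dfs.core.usgovcloudapi.net",
}


def gov_endpoint_property(account_kind: str) -> Tuple[str, str]:
    """Return the (name, value) advanced property for an account kind."""
    # Accept both "StorageV2" and "Storage V2" spellings.
    key = account_kind.replace(" ", "").lower()
    if key not in GOV_ENDPOINTS:
        raise ValueError(f"unknown account kind: {account_kind!r}")
    return ("fs.azure.endpoint", GOV_ENDPOINTS[key])


name, value = gov_endpoint_property("Storage V2")
print(f"{name} = {value}")  # prints "fs.azure.endpoint = dfs.core.usgovcloudapi.net"
```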
Columnar Cloud Cache
Azure Storage supports Columnar Cloud Cache as of Dremio 4.0.
See Configuring Cloud Cache for additional information.
For More Information