Azure Storage

The Dremio Azure Storage Connector includes support for the following Azure Storage services:

Azure Blob Storage
Azure Blob storage is Microsoft's object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data, such as text or binary data.

Azure Data Lake Storage Gen2
Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on top of Azure Blob storage. It converges the capabilities of Azure Blob Storage and Azure Data Lake Storage Gen1: features from Gen1, such as file system semantics, directory- and file-level security, and scale, are combined with the low-cost tiered storage and high availability/disaster recovery capabilities of Azure Blob storage.

Configuration

General

Dremio Field      Azure Property           Description
Resource Name     Name                     Name of the Azure Storage source.
Connection        Account Name             Name of the Azure Storage account.
                  Account Kind             Select the Azure Storage version for this source connection. Default: StorageV2
                  Encrypt connection       Select to encrypt network traffic with TLS.
Authentication    Shared Access Key        Use an Azure Shared Access Key for authentication.
                  Azure Active Directory   Use Azure Active Directory credentials for authentication.

[Figure: Azure Storage General Settings]

Azure Active Directory Authentication

To configure the Azure Storage source to use Azure Active Directory for authentication, provide the following values:

  • Application ID - The Application (Client) ID in Azure.
  • OAuth 2.0 Token Endpoint - The OAuth 2.0 token endpoint (v1.0).
  • Client Secret - The secret key generated for the application.
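
As a rough illustration of how these three values fit together, the sketch below builds (but does not send) an OAuth 2.0 client-credentials token request against a v1.0 token endpoint. The helper name build_token_request is hypothetical, the tenant ID, application ID, and secret are placeholders, and the resource value https://storage.azure.com/ is an assumption about the audience commonly used for Azure Storage. Dremio performs this exchange itself; the sketch only shows what each field represents.

```python
from urllib.parse import urlencode


def build_token_request(token_endpoint, application_id, client_secret,
                        resource="https://storage.azure.com/"):
    """Return (url, form_body) for a client-credentials token request.

    Nothing is sent over the network. A caller would POST form_body to url
    with Content-Type: application/x-www-form-urlencoded.
    """
    form = {
        "grant_type": "client_credentials",
        "client_id": application_id,     # Application ID
        "client_secret": client_secret,  # Client Secret
        "resource": resource,            # assumed audience for Azure Storage
    }
    return token_endpoint, urlencode(form)


# Placeholder values only; substitute the ones shown in the Azure Portal.
url, body = build_token_request(
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    "00000000-0000-0000-0000-000000000000",
    "placeholder-secret",
)
```

Building the request separately from sending it makes it easy to check the three values before entering them in the source settings.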

[Figure: Azure Storage Authentication Settings]

To obtain the Azure Active Directory configuration values:

  1. Log in to the Azure Portal.
  2. Navigate to App Registrations.
  3. If not already done, create an app for OAuth 2.0.

The configuration values are available from the Portal.

Advanced Options

Advanced Options include:

  • Enable asynchronous access when possible (default)
  • Enable exports into the source (CTAS and DROP).
  • Root Path -- Root path for the source.
  • Advanced Properties -- A list of connection properties (name and value).
  • Blob containers and Filesystem Whitelist -- Specifies a list of containers and filesystems to include. Note that providing a list disables automatic container and filesystem discovery, and Dremio limits the available containers and filesystems to the ones provided.
  • Cache Options
    • Enable Columnar Cloud Cache when possible
    • Max percent of total available cache space to use when possible
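
The effect of the whitelist option can be sketched as follows. This is only a model of the behavior described above, not Dremio's implementation; the function name visible_containers and its arguments are hypothetical.

```python
def visible_containers(discovered, whitelist=None):
    """Return the container/filesystem names a source would expose.

    discovered: names that automatic discovery would find.
    whitelist:  optional list of names; when non-empty, discovery is
                bypassed and only the listed names are exposed.
    """
    if whitelist:
        return list(whitelist)
    return list(discovered)
```

For example, with discovered containers raw, staging, and curated, a whitelist containing only curated leaves just that one container available in the source.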

[Figure: Azure Storage Advanced Options Settings]

Reflection Refresh

The reflection refresh policy controls how often reflections are scheduled to refresh automatically and the time limit after which reflections expire and are removed.

Refresh Settings

  • Never refresh - Select to prevent automatic reflection refresh. By default, reflections are refreshed automatically.
  • Refresh every - How often to refresh reflections, specified in hours, days or weeks. This option is ignored if Never refresh is selected.

Expire Settings

  • Never expire - Select to prevent reflections from expiring. By default, reflections expire automatically after the time limit below.
  • Expire after - The time limit after which reflections expire and are removed from Dremio, specified in hours, days or weeks. This option is ignored if Never expire is selected.

[Figure: Azure Storage Reflections Refresh Settings]

Metadata

Metadata settings include:

  • Dataset Handling options
  • Metadata Refresh options

Dataset Handling

  • Remove dataset definitions if underlying data is unavailable (Default).
    • When selected, datasets are automatically removed if the underlying folders or files for a dataset are removed from Azure Storage, or if the folder or source is not accessible. When not selected, Dremio does not remove dataset definitions even if the underlying files or folders are removed from Azure Storage. This is useful if files are temporarily deleted and replaced with a new set of files.
  • Automatically format files into physical datasets when users issue queries.
    • When selected, Dremio automatically promotes a folder to a physical dataset (PDS) using default options. If you have CSV files, especially with non-default formatting, it might be useful to not select this option.

Metadata Refresh

  • Dataset Discovery -- Refresh interval for top-level objects including folders and physical datasets.
    • Fetch every -- How often to perform dataset discovery, specified in hours, days or weeks. Default: 1 hour
  • Dataset Details -- Refresh interval to gather detailed information on promoted PDS tables, including fields, data types, shards, statistics, and locality. This information is used during query planning and optimization.
    • Fetch mode -- Select either Only Queried Datasets, All Datasets, or As Needed. Default: Only Queried Datasets
      • Only Queried Datasets -- Only update details for previously queried objects in a source. This option increases query performance because less work is needed at query time for these datasets.
      • All Datasets -- Updates details for all datasets in a source. This option increases query performance because less work is needed at query time.
      • As Needed -- Only update details for a dataset at query runtime. This option minimizes metadata refresh when datasets are not used, but can lead to longer planning times if metadata needs to be refreshed at runtime.
    • Fetch every -- How often to refresh dataset details, specified in minutes, hours, days, or weeks. Default: 1 hour
    • Expire after -- Time limit to expire dataset details, specified in minutes, hours, days, or weeks. Default: 3 hours
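
To make the relationship between the two Dataset Details intervals concrete, the sketch below classifies cached details by age using the defaults listed above. It is a simplified model under the assumption that only the age of the last refresh matters; it is not Dremio's scheduler, and the names details_state, FETCH_EVERY, and EXPIRE_AFTER are hypothetical.

```python
from datetime import datetime, timedelta

FETCH_EVERY = timedelta(hours=1)   # "Fetch every" default for dataset details
EXPIRE_AFTER = timedelta(hours=3)  # "Expire after" default for dataset details


def details_state(last_refreshed, now,
                  fetch_every=FETCH_EVERY, expire_after=EXPIRE_AFTER):
    """Classify cached dataset details by the age of their last refresh."""
    age = now - last_refreshed
    if age >= expire_after:
        return "expired"      # past the expiry limit
    if age >= fetch_every:
        return "refresh due"  # still usable, but older than the fetch interval
    return "fresh"


now = datetime(2024, 1, 1, 12, 0)
state = details_state(now - timedelta(hours=2), now)
```

In this model, keeping Expire after longer than Fetch every leaves a window in which details are older than the fetch interval but have not yet expired.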

[Figure: Azure Storage Metadata Settings]

Sharing

Sharing options for which users can edit datasets in the source:

  • All users can edit -- All users can edit datasets in the source
  • Specific users -- Only specified users can edit datasets in the source.

See Sharing and Permissions for additional information on Sharing.

[Figure: Azure Storage Sharing Settings]

Distributed Storage

See Configuring Distributed Storage for information on configuring Azure Storage as a distributed storage source.

Azure Government

To configure Azure Storage for the Azure Government platform, add one of the following properties under Advanced Properties on the Advanced Options tab, depending on whether the Azure Storage source is of Account Kind Storage V1 or Storage V2.

  • Storage V1 -- Add the following property and value: fs.azure.endpoint = blob.core.usgovcloudapi.net
  • Storage V2 -- Add the following property and value: fs.azure.endpoint = dfs.core.usgovcloudapi.net
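
The choice between the two endpoints can be expressed as a small lookup. In this sketch the property name and endpoint values come from the list above, while the function name gov_advanced_property and the dictionary keys "StorageV1" and "StorageV2" are assumed identifiers chosen for illustration.

```python
# Endpoint values as listed above; the keys are illustrative identifiers.
GOV_ENDPOINTS = {
    "StorageV1": "blob.core.usgovcloudapi.net",
    "StorageV2": "dfs.core.usgovcloudapi.net",
}


def gov_advanced_property(account_kind):
    """Return the (name, value) pair to add under Advanced Properties."""
    try:
        return ("fs.azure.endpoint", GOV_ENDPOINTS[account_kind])
    except KeyError:
        raise ValueError(f"unknown account kind: {account_kind!r}") from None


name, value = gov_advanced_property("StorageV2")
```

Raising an error for an unrecognized account kind avoids silently falling back to the wrong endpoint.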

Columnar Cloud Cache

Azure Storage supports Columnar Cloud Cache as of Dremio 4.0.

See Cloud Cache and Configuring Cloud Cache for additional information.

For More Information

