    Azure Storage

    The Dremio Azure Storage Connector includes support for the following Azure Storage services:

    • Azure Blob Storage – Azure Blob storage is Microsoft’s object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data, such as text or binary data.

    • Azure Data Lake Storage Gen2 – Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on top of Azure Blob storage, that converges the capabilities of Azure Blob Storage and Azure Data Lake Storage Gen1. Features from Azure Data Lake Storage Gen1, such as file system semantics, directory- and file-level security, and scale, are combined with the low-cost tiered storage and high availability/disaster recovery capabilities of Azure Blob storage.

    Configuration

    General

    Dremio Field    | Azure Property         | Description
    Resource Name   | Name                   | Name of the Azure Storage source.
    Connection      | Account Name           | Name of the Azure Storage account.
                    | Account Kind           | Select Azure Storage version for this source connection. Default: StorageV2
                    | Encrypt connection     | Select to encrypt network traffic with TLS.
    Authentication  | Shared Access Key      | Use an Azure Shared Access Key for authentication.
                    | Azure Active Directory | Use Azure Active Directory credentials for authentication.

    Azure Active Directory Authentication

    To configure the Azure Storage source to use Azure Active Directory for authentication, provide the following values:

    • Application ID - The Application (Client) ID in Azure.
    • OAuth 2.0 Token Endpoint - The OAuth 2.0 token endpoint (v1.0).
    • Client Secret - The secret key generated for the application.

    To obtain the Azure Active Directory configuration values:

    1. Log in to the Azure Portal.
    2. Navigate to App Registrations.
    3. If not already done, create an app for OAuth 2.0.

    The configuration values are available from the Portal.
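
As an illustration of how the three values fit together, the sketch below builds a standard OAuth 2.0 client-credentials request for a v1.0 token endpoint. It is a minimal sketch, not Dremio's implementation: the function name `build_token_request` is hypothetical, the placeholder values must be replaced with the ones shown in the Portal, and the use of `https://storage.azure.com/` as the requested resource is an assumption about how a storage token would be requested. The request is only constructed, not sent.

```python
from urllib.parse import urlencode, urlsplit

# Assumed resource identifier for Azure Storage (v1.0 endpoints take a "resource" field).
STORAGE_RESOURCE = "https://storage.azure.com/"


def build_token_request(token_endpoint: str, application_id: str, client_secret: str):
    """Return (url, headers, body) for a client-credentials token request.

    Hypothetical helper: it only assembles the POST request, it does not send it.
    """
    if urlsplit(token_endpoint).scheme != "https":
        raise ValueError("token endpoint must use https")
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": application_id,      # Application ID
        "client_secret": client_secret,   # Client Secret
        "resource": STORAGE_RESOURCE,
    })
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return token_endpoint, headers, body.encode("ascii")


if __name__ == "__main__":
    # Placeholder values; substitute the ones shown in the Azure Portal.
    url, headers, body = build_token_request(
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
        "<application-id>",
        "<client-secret>",
    )
    print(url)
    print(body.decode("ascii"))
```

Keeping the request construction separate from any network call makes it easy to check the values before they are entered in the source configuration.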

    Advanced Options

    Advanced Options include:

    • Enable asynchronous access when possible (default)
    • Enable exports into the source (CTAS and DROP).
    • Root Path – Root path for the source.
    • Advanced Properties – A list of connection properties (name and value).
    • Blob containers and Filesystem Whitelist – Specifies a list of containers to include. Note that this disables automatic container and filesystem discovery; Dremio limits the available containers and filesystems to the ones provided.
    • Cache Options
      • Enable Columnar Cloud Cache when possible
      • Max percent of total available cache space to use when possible
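
The effect of the whitelist option can be sketched as follows. This is an illustrative model only: `visible_containers` is a hypothetical function, not part of Dremio, and it assumes that an empty whitelist means automatic discovery stays enabled.

```python
from typing import Iterable, List


def visible_containers(discovered: Iterable[str], whitelist: Iterable[str]) -> List[str]:
    """Return the containers/filesystems a source would expose (hypothetical model)."""
    allowed = [name.strip() for name in whitelist if name.strip()]
    if not allowed:
        # No whitelist: automatic discovery decides what is visible.
        return sorted(discovered)
    # Whitelist present: discovery is disabled, only the listed names are used.
    return list(dict.fromkeys(allowed))  # de-duplicate, keep the given order


if __name__ == "__main__":
    print(visible_containers(["logs", "raw"], []))
    print(visible_containers(["logs", "raw"], ["raw", "curated"]))
```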

    Reflection Refresh

    The Reflection refresh policy controls how often Reflections are scheduled to refresh automatically and the time limit after which Reflections expire and are removed.

    Refresh Settings

    • Never refresh - Select to prevent automatic reflection refresh. By default, reflections refresh automatically.
    • Refresh every - How often to refresh reflections, specified in hours, days, or weeks. This option is ignored if Never refresh is selected.

    Expire Settings

    • Never expire - Select to prevent reflections from expiring. By default, reflections expire automatically after the time limit below.
    • Expire after - The time limit after which reflections expire and are removed from Dremio, specified in hours, days, or weeks. This option is ignored if Never expire is selected.
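
The interaction between the four settings can be modeled with a small sketch. The class `ReflectionPolicy` and its field names are hypothetical, and measuring both intervals from the last refresh is an assumption made for illustration, not documented behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ReflectionPolicy:
    """Hypothetical model of the Refresh and Expire settings."""
    refresh_every: timedelta
    expire_after: timedelta
    never_refresh: bool = False
    never_expire: bool = False

    def next_refresh(self, last_refresh: datetime) -> Optional[datetime]:
        # "Refresh every" is ignored when "Never refresh" is selected.
        if self.never_refresh:
            return None
        return last_refresh + self.refresh_every

    def expires_at(self, last_refresh: datetime) -> Optional[datetime]:
        # "Expire after" is ignored when "Never expire" is selected.
        if self.never_expire:
            return None
        return last_refresh + self.expire_after


if __name__ == "__main__":
    policy = ReflectionPolicy(refresh_every=timedelta(hours=1), expire_after=timedelta(hours=3))
    last = datetime(2024, 1, 1, 0, 0)
    print(policy.next_refresh(last), policy.expires_at(last))
```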

    Metadata

    Metadata settings include:

    • Dataset Handling options
    • Metadata Refresh options

    Dataset Handling

    • Remove dataset definitions if underlying data is unavailable (Default).
      • When selected, datasets are automatically removed if the underlying folders or files for a dataset are removed from Azure Storage, or if the folder or source is not accessible. When not selected, Dremio does not remove dataset definitions even if the underlying files or folders are removed from Azure Storage. This is useful if files are temporarily deleted and replaced with a new set of files.
    • Automatically format files into physical datasets when users issue queries.
      • When selected, Dremio automatically promotes a folder to a physical dataset (PDS) using default options. If you have CSV files, especially with non-default formatting, it might be useful to leave this option unselected.

    Metadata Refresh

    • Dataset Discovery – Refresh interval for top-level objects including folders and physical datasets.
      • Fetch every – How often to perform dataset discovery, specified in hours, days or weeks. Default: 1 hour
    • Dataset Details – Refresh interval to gather detailed information on promoted PDS tables, including fields, data types, shards, statistics, and locality. This information is used during query planning and optimization.
      • Fetch mode – Select either Only Queried Datasets, All Datasets, or As Needed. Default: Only Queried Datasets
        • Only Queried Datasets – Updates details only for previously queried objects in a source. This option increases query performance because less work is needed at query time for these datasets.
        • All Datasets – Updates details for all datasets in a source. This option increases query performance because less work is needed at query time.
        • As Needed – Updates details for a dataset only at query runtime. This option minimizes metadata refresh when datasets are not used, but can lead to longer planning times if metadata needs to be refreshed at runtime.
      • Fetch every – How often to refresh dataset details, specified in minutes, hours, days, or weeks. Default: 1 hour
      • Expire after – Time limit to expire dataset details, specified in minutes, hours, days, or weeks. Default: 3 hours
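
The three fetch modes differ in which datasets a scheduled refresh would touch. The sketch below models that selection; the names `FetchMode` and `datasets_to_refresh` are hypothetical and the logic is a simplified reading of the descriptions above, not Dremio's scheduler.

```python
from enum import Enum
from typing import Dict, List


class FetchMode(Enum):
    ONLY_QUERIED = "Only Queried Datasets"
    ALL = "All Datasets"
    AS_NEEDED = "As Needed"


def datasets_to_refresh(mode: FetchMode, queried: Dict[str, bool]) -> List[str]:
    """Pick datasets for a scheduled details refresh.

    `queried` maps a dataset name to whether it has been queried before.
    """
    if mode is FetchMode.ALL:
        return sorted(queried)
    if mode is FetchMode.ONLY_QUERIED:
        return sorted(name for name, was_queried in queried.items() if was_queried)
    # AS_NEEDED: nothing is refreshed on a schedule; details are fetched at query runtime.
    return []


if __name__ == "__main__":
    state = {"sales": True, "archive": False}
    for mode in FetchMode:
        print(mode.value, "->", datasets_to_refresh(mode, state))
```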

    Sharing

    Sharing options control which users can edit datasets in the source. Access controls allow privileges to be applied to users or groups at the object level. With these privileges, administrators decide who can view and alter data. See Access Controls for additional information on user privileges.

    Distributed Storage

    See Configuring Distributed Storage for information on configuring Azure Storage as a distributed storage source.

    Azure Government

    To configure Azure Storage for the Azure Government platform, add one of the following properties to the Advanced Options tab under Advanced Properties, depending on whether the Azure Storage source is of Account Kind Storage V1 or Storage V2.

    • Storage V1 – Add the following property and value if the Azure Storage source is of Account Kind Storage V1

      Property and value for Storage V1
       fs.azure.endpoint = blob.core.usgovcloudapi.net
      
    • Storage V2 – Add the following property and value if the Azure Storage source is of Account Kind Storage V2

      Property and value for Storage V2
       fs.azure.endpoint = dfs.core.usgovcloudapi.net
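
The choice between the two endpoint values can be expressed as a small lookup. This is an illustrative sketch: the helper `azure_government_property` is hypothetical, and accepting both the "Storage V1" and "StorageV1" spellings is an assumption made for convenience.

```python
from typing import Tuple

# Endpoint values taken from the properties listed above.
GOV_ENDPOINTS = {
    "STORAGEV1": "blob.core.usgovcloudapi.net",
    "STORAGEV2": "dfs.core.usgovcloudapi.net",
}


def azure_government_property(account_kind: str) -> Tuple[str, str]:
    """Return the (property, value) pair to add under Advanced Properties."""
    key = account_kind.replace(" ", "").replace("_", "").upper()
    if key not in GOV_ENDPOINTS:
        raise ValueError(f"unknown account kind: {account_kind!r}")
    return "fs.azure.endpoint", GOV_ENDPOINTS[key]


if __name__ == "__main__":
    name, value = azure_government_property("Storage V2")
    print(f"{name} = {value}")
```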
      

    Columnar Cloud Cache

    Azure Storage supports Columnar Cloud Cache.

    See Configuring Cloud Cache for additional information.
