Version: current [26.x Preview]

Dremio Catalog Enterprise

Dremio's built-in lakehouse catalog is built on Apache Polaris (incubating). The catalog enables centralized, secure read and write access to your Iceberg tables across different REST-compatible query engines, and automates data maintenance operations to maximize query performance. Key features include:

  • Iceberg REST compatibility: Read and write from Dremio Catalog using any engine or framework compatible with the Iceberg REST API. For example, use Spark or Flink to ingest data into the catalog, and then use Dremio to curate and serve data products built on that data.

  • Role-Based Access Control and Fine-Grained Access Control: Secure data using Role-Based Access Control (RBAC) privileges, and create row filters and column masks to ensure users only access the data they need. For example, create a column mask to obfuscate credit card numbers, or create a row filter on your employee details table that only returns rows with employees in your region.

  • Automated table maintenance: Dremio Catalog automates Iceberg maintenance operations like compaction and vacuum, which maximizes query performance, minimizes storage costs, and eliminates the need to run manual data maintenance. Dremio Catalog also simplifies Iceberg table management and eliminates the risk of poor performance from sub-optimal data layouts with support for Iceberg clustering keys.

  • Enable data analysts: Dremio Catalog is fully compatible with Dremio's built-in data product capabilities, including semantic search (use natural language to discover AI-ready data products), descriptions (use built-in descriptions and labels to understand how to use data products to answer business questions), and lineage (use lineage graphs to understand how data products are derived and transformed, and assess the impact of changes on downstream datasets).

Prerequisites

Ensure you have properly configured your storage settings based on your storage provider in the Dremio Helm chart. This configuration is required to enable support for vended credentials and to allow access to the table metadata necessary for Iceberg table operations.

Configuring Dremio Catalog as a Source

To add Dremio Catalog as a source:

  1. On the Datasets page, under Dremio Catalog in the left panel, click Add a Dremio Catalog.

    The New Dremio Catalog dialog box appears, which contains the following tabs:

    • General: Create a name for your Dremio Catalog source.

    • Storage: View your default storage URI and manage credentials to set up storage authentication and authorization.

    • Advanced Options: Manage catalog properties and automated table maintenance settings.

    • Reflection Refresh: (Optional) Set a policy to control how often reflections are refreshed and expired.

    • Metadata: (Optional) Specify dataset handling and metadata refresh.

    • Privileges: (Optional) Add privileges for users or roles.

    Refer to the following sections for guidance on how to edit each tab.

General

  1. In the Name field, enter a name for your Dremio Catalog.
note

The name you enter must be unique in the organization and cannot be edited once the source is created, so choose a name that is easy for users to reference. The name cannot exceed 255 characters and must contain only the following characters: 0-9, A-Z, a-z, underscore (_), or hyphen (-).
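
The naming rules in this note can be checked before you open the dialog. The following is a minimal sketch, not part of Dremio: the function name `is_valid_catalog_name` is hypothetical, it assumes an empty name is also invalid, and it cannot check uniqueness within the organization.

```python
import re

# Rule from the note: at most 255 characters, drawn only from
# 0-9, A-Z, a-z, underscore (_), and hyphen (-).
# Assumption: an empty name is rejected as well.
_NAME_PATTERN = re.compile(r"[0-9A-Za-z_-]{1,255}")


def is_valid_catalog_name(name: str) -> bool:
    """Return True if `name` meets the documented character and length rules."""
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("sales_catalog-01", "sales catalog", "a" * 256):
        print(repr(candidate[:20]), is_valid_catalog_name(candidate))
```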

Storage

  1. The Default storage uri field displays the default storage location you configured in the Dremio Helm chart.

  2. Use the Storage access field to configure your preferred authentication method. Dremio Catalog supports two types of credentials for authentication:

  • Use credential vending (Recommended): Dremio Catalog provides the query engine executing the query with a temporary storage credential. This credential allows the query engine to access an Iceberg table's underlying directory location.

  • Use master storage credentials: These credentials authenticate access to all storage URIs within this catalog, so all resources are accessible through a single authentication method. Use this option if STS is unavailable or the vended credentials mechanism is disabled. Select the object storage provider that hosts the location specified in the Default storage uri field:

    • S3: Select AWS for Amazon S3 and S3-compatible storage. You can refer to the Dremio documentation for connecting to Amazon S3, which is also applicable here.

    • Azure: Select Azure for Azure Blob Storage. You can refer to the Dremio documentation for connecting to Azure Storage, which is also applicable here.

Advanced Options

To set advanced options:

  1. (Optional) Enable Asynchronous Access for Parquet Datasets is selected by default; clear the checkbox to deactivate it. Dremio enables asynchronous access and local caching when possible so that asynchronous requests do not wait for data to return from your storage. Leaving this option enabled can reduce query times.

  2. Under Cache Options, review the following table and edit the options to meet your needs.

    Cache Option | Description
    Enable local caching when possible | Selected by default. Together with asynchronous access, local caching can improve query performance. See Cloud Columnar Cache for details.
    Max percent of total available cache space to use when possible | Specifies the disk quota, as a percentage, that a source can use on any single executor node. Applies only when local caching is enabled. The default is 100 percent of the total disk space available on the mount point provided for caching. Enter a percentage in the value field or use the arrows at the far right to adjust it.

  3. Under Table maintenance, manage settings for automated table maintenance operations:

    • Enable auto optimization: Compacts small files into larger files. Clusters data if Iceberg clustering keys are set on the table.

    • Enable table cleanup: Deletes expired snapshots and orphaned metadata files.

Reflection Refresh

You can set the policy that controls how often reflections are scheduled to be refreshed automatically, as well as the time limit after which reflections expire and are removed. See the following options.

Option | Description
Never refresh | Select to prevent automatic reflection refresh. By default, reflections are refreshed automatically.
Refresh every | How often to refresh reflections, specified in hours, days, or weeks. This option is ignored if Never refresh is selected.
Set refresh schedule | Specify the daily or weekly schedule.
Never expire | Select to prevent reflections from expiring. By default, reflections expire automatically after the time limit set in Expire after.
Expire after | The time limit after which reflections expire and are removed from Dremio, specified in hours, days, or weeks. This option is ignored if Never expire is selected.

Metadata

Use the following settings to specify metadata options:

Dataset Handling

  • Remove dataset definitions if underlying data is unavailable (Default). If this box is not checked and the underlying files under a folder are removed or the folder/source is not accessible, Dremio does not remove the dataset definitions. Leaving the box unchecked is useful when files are temporarily deleted and then replaced with new sets of files.

Metadata Refresh

These are the optional Metadata Refresh parameters:

  • Dataset Discovery: The refresh interval for fetching top-level source object names such as databases and tables. Set the time interval using this parameter.

    Parameter | Description
    Fetch every | How often to fetch object names, specified in minutes, hours, days, or weeks. The default is 1 hour.

  • Dataset Details: The metadata that Dremio needs for query planning, such as information about fields, types, shards, statistics, and locality. Use the following parameters to control how dataset details are fetched.

    Parameter | Description
    Fetch mode | Fetch details only from queried datasets. Dremio updates details for previously queried objects in a source. The default is Only Queried Datasets.
    Fetch every | How often to fetch dataset details, specified in minutes, hours, days, or weeks. The default is 1 hour.
    Expire after | How long dataset details are kept before they expire, specified in minutes, hours, days, or weeks. The default is 3 hours.

Privileges

You have the option to grant privileges to specific users or roles. See Access Control for additional information about privileges.

To grant access to a user or role:

  1. For Privileges, enter the user name or role name that you want to grant access to and click the Add to Privileges button. The added user or role is displayed in the USERS/ROLES table.

  2. For the users or roles in the USERS/ROLES table, toggle the checkmark for each privilege you want to grant on the Dremio source that is being created.

  3. Click Save after setting the configuration.

Updating a Dremio Catalog Source

To update a Dremio Catalog source:

  1. On the Datasets page, in the panel on the left, find the name of the Dremio Catalog source you want to edit.

  2. Right-click the source name and select Settings from the list of actions. Alternatively, click the source name and then the Settings icon at the top right corner of the page.

  3. In the Source Settings dialog, edit the settings you wish to update. Dremio does not support updating the source name.

  4. Click Save.

note

Once you have configured Dremio Catalog, the Catalog REST APIs are accessible via http://{DREMIO_ADDRESS}:8181/api/catalog, where DREMIO_ADDRESS is the IP address of your Dremio cluster.
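
As an illustration of how an external engine might be pointed at this endpoint, the sketch below builds the REST URL from the note and a set of Spark properties for Apache Iceberg's REST catalog client. The helper names and the Spark catalog alias are hypothetical. Authentication and warehouse settings are not described on this page and are left out, and the credential-vending header is included on the assumption that your deployment uses credential vending.

```python
def catalog_rest_url(dremio_address: str, port: int = 8181) -> str:
    """Build the Catalog REST API base URL in the form given in the note."""
    return f"http://{dremio_address}:{port}/api/catalog"


def spark_rest_catalog_conf(spark_catalog: str, dremio_address: str) -> dict[str, str]:
    """Return Spark properties that register an Iceberg REST catalog.

    `spark_catalog` is a hypothetical alias used on the Spark side only.
    Authentication and warehouse properties are intentionally omitted.
    """
    prefix = f"spark.sql.catalog.{spark_catalog}"
    return {
        # Standard Iceberg Spark catalog implementation and REST catalog type.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": catalog_rest_url(dremio_address),
        # Assumption: asks the catalog for vended storage credentials
        # through the Iceberg REST access-delegation header.
        f"{prefix}.header.X-Iceberg-Access-Delegation": "vended-credentials",
    }


if __name__ == "__main__":
    for key, value in spark_rest_catalog_conf("dremio", "10.0.0.5").items():
        print(f"{key}={value}")
```

Each key/value pair can be passed to Spark as a `--conf` option or set on a session builder; add the authentication properties your environment requires before use.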

Using Dremio Catalog with Multiple Storage Locations

You can use one Dremio Catalog instance to work with data stored in multiple storage buckets. For example, you can create different folders (namespaces) in one Dremio Catalog instance, such that data in Folder A is stored in Storage Bucket 1, and data in Folder B is stored in Storage Bucket 2. This feature is named Storage URIs (Uniform Resource Identifiers).

A Storage URI is an optional attribute that can be attached to a folder, and consists of a path to an object storage location. When you create a folder, you can either let it use the “inherited” storage location or give it a custom Storage URI. The inherited location is the Storage URI set on the nearest parent folder that has one or, if no parent folder has one, the default storage location you defined when you configured Dremio Catalog. To configure the folder to use a custom Storage URI, add the path to the object storage location you would like to use during folder creation. Ensure that the Storage Credentials you are using for Dremio Catalog can access the object storage location you added for your newly created folder.

Storage URIs Example

The diagram below depicts a Dremio Catalog that contains two namespaces (NS1, NS2), whose underlying folders use Storage URIs to store data in custom storage locations:

In this example:

  1. TBL1 would be stored in <Uri1>/NS3/TBL1
  2. TBL3 would be stored in <Uri2>/NS5/NS6/TBL3
  3. TBL4 would be stored in <Default uri>/NS2/TBL4
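
The three paths above are consistent with a simple rule: use the Storage URI of the nearest parent folder that has one and append the remaining path, otherwise fall back to the default URI plus the full path. The sketch below illustrates that reading only. The function name is hypothetical, and the folder layout in the usage section is an assumption chosen to reproduce the listed paths, because the diagram is not reproduced here.

```python
from typing import Mapping, Sequence, Tuple


def resolve_table_location(
    path: Sequence[str],
    default_uri: str,
    custom_uris: Mapping[Tuple[str, ...], str],
) -> str:
    """Return the storage location for a table.

    `path` lists the folder names followed by the table name.
    `custom_uris` maps a folder path (as a tuple) to its custom Storage URI.
    """
    # Walk from the table's immediate parent folder up to the top-level folder.
    for depth in range(len(path) - 1, 0, -1):
        uri = custom_uris.get(tuple(path[:depth]))
        if uri is not None:
            # The nearest folder with a custom URI wins; append what lies below it.
            return "/".join([uri.rstrip("/"), *path[depth:]])
    # No folder on the path has a custom URI: inherit the catalog default.
    return "/".join([default_uri.rstrip("/"), *path])


if __name__ == "__main__":
    # Hypothetical layout (assumption): NS1 carries <Uri1>, NS1/NS4 carries <Uri2>.
    custom = {("NS1",): "<Uri1>", ("NS1", "NS4"): "<Uri2>"}
    default = "<Default uri>"
    print(resolve_table_location(["NS1", "NS3", "TBL1"], default, custom))
    print(resolve_table_location(["NS1", "NS4", "NS5", "NS6", "TBL3"], default, custom))
    print(resolve_table_location(["NS2", "TBL4"], default, custom))
```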
note

When creating a table from an external Dremio Catalog source, the default Storage URI that the table will use is the root path of the external Dremio Catalog source, unless one of the folders on the table’s path has been set with a custom Storage URI.