Version: current [26.x]

Open Catalog Enterprise

Dremio's built-in lakehouse catalog is built on Apache Polaris (incubating). The catalog enables centralized, secure read and write access to your Iceberg tables across different REST-compatible query engines and automates data maintenance operations to maximize query performance. Key features include:

  • Iceberg REST compatibility: Read and write from the Open Catalog using any engine or framework compatible with the Iceberg REST API. For example, use Spark or Flink to ingest data into the catalog, and then use Dremio to curate and serve data products built on that data.
  • Role-Based Access Control and Fine-Grained Access Control: Secure data using Role-Based Access Control (RBAC) privileges and create row filters and column masks to ensure users only access the data they need. For example, create a column mask to obfuscate credit card numbers or create a row filter on your employee details table that only returns rows with employees in your region.
  • Automated table maintenance: Open Catalog automates Iceberg maintenance operations like compaction and vacuum, which maximizes query performance, minimizes storage costs, and eliminates the need to run manual data maintenance. Open Catalog also simplifies Iceberg table management and eliminates the risk of poor performance from suboptimal data layouts with support for Iceberg clustering keys.
  • Enable data analysts: Open Catalog is fully compatible with Dremio's built-in data product capabilities, including semantic search (use natural language to discover AI-ready data products), descriptions (use built-in descriptions and labels to understand how to use data products to answer business questions), and lineage (use lineage graphs to understand how data products are derived and transformed and assess the impact of changes on downstream datasets).
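As a concrete illustration of the Iceberg REST compatibility described above, the snippet below assembles the `--conf` flags an engine such as Spark might be launched with. This is a sketch under assumptions: the catalog name `opencatalog` and the cluster address are placeholders, while the port 8181 and the `/api/catalog` path come from the Open Catalog configuration covered later on this page.

```python
# Sketch: assemble Spark --conf flags for an Iceberg REST catalog.
# "opencatalog" and the address are illustrative placeholders, not values
# prescribed by this page; authentication properties are omitted.
def spark_conf_flags(dremio_address: str, catalog_name: str = "opencatalog") -> list:
    props = {
        f"spark.sql.catalog.{catalog_name}": "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog_name}.type": "rest",
        f"spark.sql.catalog.{catalog_name}.uri": f"http://{dremio_address}:8181/api/catalog",
    }
    return [f"--conf {key}={value}" for key, value in sorted(props.items())]

for flag in spark_conf_flags("203.0.113.10"):
    print(flag)
```

How your deployment secures the REST endpoint determines which additional authentication properties an engine needs; those are intentionally left out of this sketch.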

This page provides instructions for configuring the Open Catalog. If you would like to connect to Open Catalogs deployed in other Dremio instances, see Open Catalog (External).

Prerequisites

Before you configure Open Catalog, complete the prerequisite configurations for your deployment. These configurations are required to enable support for vended credentials and to allow access to the table metadata necessary for Iceberg table operations.

Configure the Open Catalog

To configure Open Catalog:

  • When creating the first Open Catalog, select Add an Open Catalog. Add a Name for the catalog.
  • When configuring an existing Open Catalog, right-click on your catalog and select Settings from the dropdown.

Storage

  1. The Default storage URI field displays the default storage location you configured in Dremio's Helm chart.
  2. Use the Storage access field to configure your preferred authentication method. Open Catalog supports two types of credentials for authentication:
    • Use credential vending (Recommended): Credential vending is a security mechanism where the catalog service issues temporary, scoped access credentials to the query engine for accessing table storage. The engine is "vended" a temporary credential just in time for the query.
    • Use master storage credentials: The credentials authenticate access to all storage URIs within this catalog. These credentials ensure all resources are accessible through a single authentication method. This should be used if STS is unavailable or the vended credentials mechanism is disabled. Select the object storage provider that hosts the location specified in the Default storage URI field:
      • AWS – Select AWS for Amazon S3 and S3-compatible storage. You can refer to the Dremio documentation for connecting to Amazon S3, which is also applicable here. If you choose to assume an IAM role, ensure that the role policy grants access to the bucket or folder specified in the Default storage URI field.
      • Azure – Select Azure for Azure Blob Storage. You can refer to the Dremio documentation for connecting to Azure Storage, which is also applicable here.
      • Google Cloud Storage – Select Google for Google Cloud Storage (GCS). You can refer to the Dremio documentation for connecting to GCS, which is also applicable here.
  3. Enter any required storage connection properties in the Connection Properties field. Refer to the Advanced Options section for your storage provider (Amazon S3, Azure, or GCS) for available properties.
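Credential vending, as described above, can be pictured with a small stdlib-only Python sketch. This is an illustration of the general mechanism, not Dremio's implementation: the catalog mints a short-lived token scoped to one storage prefix, and the storage layer honors it only within that scope and lifetime.

```python
import time
import secrets
from dataclasses import dataclass

# Toy illustration of credential vending (not Dremio's implementation):
# the catalog issues a temporary credential scoped to one table's storage
# prefix, just in time for the query.

@dataclass
class VendedCredential:
    token: str
    allowed_prefix: str   # storage path the credential is scoped to
    expires_at: float     # epoch seconds

def vend_credential(table_prefix: str, ttl_seconds: int = 900) -> VendedCredential:
    """Issue a temporary credential scoped to a single table's storage prefix."""
    return VendedCredential(
        token=secrets.token_hex(16),
        allowed_prefix=table_prefix,
        expires_at=time.time() + ttl_seconds,
    )

def authorize(cred: VendedCredential, path: str) -> bool:
    """The storage layer allows access only within scope and before expiry."""
    return path.startswith(cred.allowed_prefix) and time.time() < cred.expires_at

cred = vend_credential("s3://my_bucket/sales/")
print(authorize(cred, "s3://my_bucket/sales/data.parquet"))  # in scope: allowed
print(authorize(cred, "s3://my_bucket/hr/data.parquet"))     # out of scope: denied
```

Because each credential is temporary and scoped, an engine never holds long-lived keys to the whole bucket, which is why credential vending is the recommended option.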

Advanced Options

To set advanced options:

  1. Under Cache Options, review the following table and edit the options to meet your needs.

    | Cache Option | Description |
    | --- | --- |
    | Enable local caching when possible | Selected by default. Along with asynchronous access for cloud caching, local caching can improve query performance. See Cloud Columnar Cache for details. |
    | Max percent of total available cache space to use when possible | Specifies the disk quota, as a percentage, that a source can use on any single executor node only when local caching is enabled. The default is 100 percent of the total disk space available on the mount point provided for caching. You can either manually enter a percentage in the value field or use the arrows to the far right to adjust the percentage. |
  2. Under Table maintenance, manage settings for automated table maintenance operations:

    • Enable auto optimization: Compacts small files into larger files. Clusters data if Iceberg clustering keys are set on the table.
    • Enable table cleanup: Deletes expired snapshots and orphaned metadata files.

Reflection Refresh

You can set the policy that controls how often Reflections are scheduled to be refreshed automatically, as well as the time limit after which Reflections expire and are removed. See the following options:

| Option | Description |
| --- | --- |
| Never refresh | Select to prevent automatic Reflection refresh. The default is to refresh automatically. |
| Refresh every | How often to refresh Reflections, specified in hours, days, or weeks. This option is ignored if Never refresh is selected. |
| Set refresh schedule | Specify the daily or weekly schedule. |
| Never expire | Select to prevent Reflections from expiring. The default is to expire automatically after the time limit below. |
| Expire after | The time limit after which Reflections expire and are removed from Dremio, specified in hours, days, or weeks. This option is ignored if Never expire is selected. |

Metadata

Use the following settings to specify how Dremio handles dataset metadata:

Dataset Handling

  • Remove dataset definitions if the underlying data is unavailable (default).
  • If this box is not checked and the underlying files are removed or the folder or source becomes inaccessible, Dremio does not remove the dataset definitions. Unchecking this option is useful when files are temporarily deleted and later replaced with new sets of files.

Metadata Refresh

These are the optional Metadata Refresh parameters:

  • Dataset Discovery: The refresh interval for fetching top-level source object names, such as databases and tables. Set the time interval using this parameter.

    | Parameter | Description |
    | --- | --- |
    | Fetch every | Sets how often Dremio fetches object names, in minutes, hours, days, or weeks. The default is 1 hour. |
  • Dataset Details: The metadata that Dremio needs for query planning, such as information needed for fields, types, shards, statistics, and locality. These are the parameters to fetch the dataset information.

    | Parameter | Description |
    | --- | --- |
    | Fetch mode | Dremio updates details only for objects in the source that have previously been queried. The default is Only Queried Datasets. |
    | Fetch every | Sets how often Dremio fetches dataset details, in minutes, hours, days, or weeks. The default is 1 hour. |
    | Expire after | Sets when dataset details expire, in minutes, hours, days, or weeks. The default is 3 hours. |

Privileges

You have the option to grant privileges to specific users or roles. See Access Control for additional information about privileges.

To grant access to a user or role:

  1. For Privileges, enter the user name or role name that you want to grant access to and click the Add to Privileges button. The added user or role is displayed in the USERS/ROLES table.
  2. For the users or roles in the USERS/ROLES table, toggle the checkmark for each privilege you want to grant on the Dremio source that is being created.
  3. Click Save after setting the configuration.

Configure Storage Access

To configure access to the storage, select your storage provider below and follow the steps:

S3 and STS Access via IAM Role (Preferred)

  1. Create an Identity and Access Management (IAM) user or use an existing IAM user for Open Catalog.

  2. Create an IAM policy that grants access to your S3 location. For example:

    Example of a policy

    ```json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:GetObjectVersion",
            "s3:DeleteObject",
            "s3:DeleteObjectVersion"
          ],
          "Resource": "arn:aws:s3:::<my_bucket>/*"
        },
        {
          "Effect": "Allow",
          "Action": [
            "s3:ListBucket",
            "s3:GetBucketLocation"
          ],
          "Resource": "arn:aws:s3:::<my_bucket>",
          "Condition": {
            "StringLike": {
              "s3:prefix": ["*"]
            }
          }
        }
      ]
    }
    ```
  3. Create an IAM role to grant privileges to the S3 location.

    1. In your AWS console, select Create Role.
    2. Enter an externalId. For example, my_catalog_external_id.
    3. Attach the policy created in the previous step and create the role.
  4. Create IAM user permissions to access the bucket via STS:

    The sts:AssumeRole permission is required for Open Catalog to function with vended credentials, because Open Catalog relies on temporary STS tokens to grant query engines scoped access to storage.

    1. Select the IAM role created in the previous step.

    2. Edit the trust policy and add the following:

      Trust policy

      ```json
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
              "AWS": "<dremio_catalog_user_arn>"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
              "StringEquals": {
                "sts:ExternalId": "<dremio_catalog_external_id>"
              }
            }
          }
        ]
      }
      ```

      Replace the following values with the ones obtained in the previous steps:

      • <dremio_catalog_user_arn> - The ARN of the IAM user created in step 1.
      • <dremio_catalog_external_id> - The external ID entered in step 3.

S3 and STS Access via Access Key

  1. In the Dremio console, select Use master storage credentials when adding Open Catalog.

  2. Ensure that the access keys have permission to access the bucket and to call the STS service.

  3. Create a Kubernetes secret named catalog-server-s3-storage-creds to access the configured location. Here is an example for S3 using an access key and secret key:

    Run kubectl to create the Kubernetes secret

    ```shell
    export AWS_ACCESS_KEY_ID=<access_key_id>
    export AWS_SECRET_ACCESS_KEY=<secret_access_key>
    kubectl create secret generic catalog-server-s3-storage-creds \
      --namespace $NAMESPACE \
      --from-literal awsAccessKeyId=$AWS_ACCESS_KEY_ID \
      --from-literal awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY
    ```

Update an Open Catalog Source

To update an Open Catalog source:

  1. On the Datasets page, in the panel on the left, find the name of the Open Catalog source you want to edit.

  2. Right-click the source name and select Settings from the list of actions. Alternatively, click the source name and then click the Settings icon at the top right corner of the page.

  3. In the Source Settings dialog, edit the settings you wish to update. Dremio does not support updating the source name.

  4. Click Save.

Once you have configured Open Catalog, the Catalog REST APIs are accessible at http://{DREMIO_ADDRESS}:8181/api/catalog, where DREMIO_ADDRESS is the IP address of your Dremio cluster.
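As a sketch, assuming a placeholder cluster address, a client derives its endpoints from that base URL; `/v1/config` is the standard entry point defined by the Iceberg REST specification that most clients call first.

```python
# Build the Iceberg REST base URL for an Open Catalog, as described above.
# The address is a placeholder; the port (8181) and the /api/catalog path
# come from the configuration on this page.
def catalog_base_url(dremio_address: str) -> str:
    return f"http://{dremio_address}:8181/api/catalog"

base = catalog_base_url("203.0.113.10")
# Iceberg REST clients typically begin with the spec's config endpoint:
print(f"{base}/v1/config")  # http://203.0.113.10:8181/api/catalog/v1/config
```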

Using the Open Catalog with Multiple Storage Locations

You can use one Open Catalog instance to work with data stored in multiple storage buckets. For example, you can create different folders (namespaces) in one Open Catalog instance, such that data in Folder A is stored in Storage Bucket 1, and data in Folder B is stored in Storage Bucket 2. This feature is named Storage URIs (Uniform Resource Identifiers).

A Storage URI is an optional attribute of a folder that consists of a path to an object storage location. When you create a folder, you can either let it inherit its storage location (the location you defined when you configured Open Catalog, or a Storage URI set on one of its parent folders), or you can give it a custom Storage URI by adding the path to the object storage location you would like to use during folder creation. Ensure that the storage credentials you are using for the Open Catalog can access the object storage location you added for your newly created folder.

Storage URIs Example

The diagram below depicts an Open Catalog that contains two namespaces (NS1, NS2), whose underlying folders use Storage URIs to store data in custom storage locations:

In this example:

  1. TBL1 would be stored in <Uri1>/NS3/TBL1
  2. TBL3 would be stored in <Uri2>/NS5/NS6/TBL3
  3. TBL4 would be stored in <Default URI>/NS2/TBL4
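The resolution rule these examples follow can be sketched in Python: walk up the table's folder path and use the most deeply nested Storage URI found, falling back to the default. The folder names and URIs below are the hypothetical ones from the diagram, and the exact rule (the URI-bearing folder's own name is kept in the path) is inferred from the three examples above.

```python
# Sketch of Storage URI resolution for the example above (not Dremio's
# implementation). storage_uris maps a folder name to its custom Storage URI.
def table_location(path, storage_uris, default_uri):
    """Resolve a table's storage location from its folder path.

    path lists folder names from the catalog root, ending with the table
    name. The most deeply nested folder with a Storage URI wins, and its
    subtree (including the folder's own name) is stored under that URI.
    """
    for i in range(len(path) - 2, -1, -1):  # walk from the table's parent up
        if path[i] in storage_uris:
            return "/".join([storage_uris[path[i]], *path[i:]])
    return "/".join([default_uri, *path])

uris = {"NS3": "<Uri1>", "NS5": "<Uri2>"}
print(table_location(["NS1", "NS3", "TBL1"], uris, "<Default URI>"))         # <Uri1>/NS3/TBL1
print(table_location(["NS2", "NS5", "NS6", "TBL3"], uris, "<Default URI>"))  # <Uri2>/NS5/NS6/TBL3
print(table_location(["NS2", "TBL4"], uris, "<Default URI>"))                # <Default URI>/NS2/TBL4
```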

When creating a table from an external Open Catalog source, the default Storage URI that the table will use is the root path of the external Open Catalog source, unless one of the folders on the table's path has been set with a custom Storage URI.