Nessie Catalogs
Nessie catalogs enable you to process, manage, consume, and share data in the same way that code is shared during software development. That is, you can take control of your data using concepts such as version control, commits, and testing and development in isolation from your production data. Dremio enables you to perform data-as-code activities using Project Nessie, which provides Git-like capabilities for the data lakehouse.
Prerequisites
Dremio supports Nessie version 0.59.0 and later. If you have not yet set up a Nessie server and connected it with your dataset, you can choose to either set up a server in a fast-start Docker image or with secure HTTPS transport in Minikube.
When using Nessie as a source, Dremio can connect to Amazon S3 buckets, Azure Storage, Google Cloud Storage (GCS), or S3-compatible storage providers like MinIO and Dell ECS. Read Storage for details about the required credentials for connecting to each storage provider.
Configuring Nessie as a Source
To add a Nessie source to your project:
- On the Datasets page, to the right of Sources in the left panel, click the icon that opens the Add Data Source dialog.
- In the Add Data Source dialog, under Nessie Catalogs, select Nessie.
  The New Nessie Source dialog box appears, which contains the following sections:
  - General: Create a name for your Nessie source, specify the endpoint URL, and set the authentication type. The name cannot include the following special characters: /, :, [, or ].
  - Storage: Set the storage option by setting up the authentication type and the connection properties.
  - Advanced Options: (Optional) Use the default settings or configure access preferences and cache options.
  - Privileges: (Optional) Add privileges for users or roles.
Refer to the following for guidance on how to edit each section.
General
This tab provides options for configuring connections to a Nessie source.
- In the Name field, enter a name.
The name you enter must be unique in the organization. Also, consider a name that is easy for users to reference. This name cannot be edited once the source is created. The name cannot exceed 255 characters and must contain only the following characters: 0-9, A-Z, a-z, underscore(_), or hyphen (-).
- In the Nessie endpoint URL field, specify the endpoint URL, including the IP address (or hostname) and port, that you have set up for your Nessie server (e.g., https://localhost:19120/api/v2). For more information, see Project Nessie Configuration.
- Under Nessie authentication type, select either None or Bearer:
  - None: The Nessie server does not require authentication.
  - Bearer: Set authentication using an OpenID bearer token. For more information about setting up this type of authentication, see Project Nessie's Authentication page. Then, choose a method for providing the bearer token from the dropdown menu:
    - Dremio: Provide the bearer token in plain text. Dremio stores the bearer token.
    - Azure Key Vault: Provide the URI for the Azure Key Vault secret that stores the bearer token. The URI format is https://<vault_name>.vault.azure.net/secrets/<secret_name> (for example, https://myvault.vault.azure.net/secrets/mysecret).
      Note: To use Azure Key Vault as your application secret store, you must:
      - Deploy Dremio on Azure.
      - Complete the Requirements for Authenticating with Azure Key Vault.
      It is not necessary to restart the Dremio coordinator when you rotate secrets stored in Azure Key Vault. Read Requirements for Secrets Rotation for more information.
    - AWS Secrets Manager: Provide the Amazon Resource Name (ARN) for the AWS Secrets Manager secret that holds the bearer token, which is available in the AWS web console or using command line tools.
    - HashiCorp Vault: Choose the HashiCorp secrets engine you're using from the dropdown menu and enter the secret reference for the bearer token in the correct format in the provided field.
Next, set up the storage options.
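The name rules and endpoint settings above can be checked before you open the dialog. The following is a minimal sketch, not part of Dremio or Nessie: the function names are hypothetical, the name check assumes the name must be non-empty, and the request assumes the server exposes a `config` resource under the endpoint URL, as in Nessie's REST API v2. The request is only built here, not sent.

```python
import re
import urllib.request
from typing import Optional

# Name rule from the General tab: at most 255 characters, and only
# 0-9, A-Z, a-z, underscore, or hyphen. Non-empty is an assumption.
_NAME_PATTERN = re.compile(r"[0-9A-Za-z_-]{1,255}")


def is_valid_source_name(name: str) -> bool:
    """Return True if the name satisfies the documented character and length rules."""
    return _NAME_PATTERN.fullmatch(name) is not None


def build_config_request(
    endpoint_url: str, bearer_token: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a GET request for the server's config resource.

    Assumption: `<endpoint>/config` exists, as in Nessie's REST API v2.
    The Authorization header is added only for the Bearer authentication type.
    """
    url = endpoint_url.rstrip("/") + "/config"
    request = urllib.request.Request(url, method="GET")
    request.add_header("Accept", "application/json")
    if bearer_token is not None:
        request.add_header("Authorization", f"Bearer {bearer_token}")
    return request
```

Passing the returned object to `urllib.request.urlopen` would send the request. A successful response suggests that the endpoint URL and token are usable, but it does not replace the connection test that Dremio performs when you save the source.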
Storage
Nessie sources can use Amazon S3 buckets (AWS), Azure Storage (Azure), Google Cloud Storage [Google (Preview)], or S3-compatible storage providers like MinIO and Dell ECS as storage.
- AWS
- Azure
- Google (Preview)
To connect an Amazon S3 bucket or an S3-compatible storage provider to the Nessie source, select the AWS storage provider option.
S3 Storage
In the field under AWS root path, provide the root path of the S3 bucket to use. We recommend that you have either a dedicated S3 bucket or a dedicated folder in which to store Nessie objects.
S3 Authentication
Under Authentication method, choose the method you want to use to authenticate to Amazon S3.
- AWS Access Key:
  - In the field under AWS access key, provide the access key for the Amazon S3 account.
  - Under AWS access secret, use the dropdown menu to choose a method for providing the access secret for the Amazon S3 account:
    - Dremio: Provide the Amazon S3 access secret in plain text. Dremio stores the Amazon S3 access secret.
    - Azure Key Vault: Provide the URI for the Azure Key Vault secret that stores the Amazon S3 access secret. The URI format is https://vault_name.vault.azure.net/secrets/secret_name. To use Azure Key Vault as your application secret store, you must deploy Dremio on Azure and complete the requirements for authenticating with Azure Key Vault.
    - AWS Secrets Manager: Provide the Amazon Resource Name (ARN) for the AWS Secrets Manager secret that holds the Amazon S3 access secret, which is available in the AWS web console or using command line tools.
    - HashiCorp Vault: Choose the HashiCorp secrets engine you're using from the dropdown menu and provide the secret reference for the Amazon S3 access secret in the correct format in the provided field.
  - In the field under IAM role to assume, provide the ARN of the IAM role.
- EC2 Metadata: In the field under IAM role to assume, provide the ARN of an IAM role with privileges on the S3 bucket. This role could be attached to the EC2 instance or to an IAM role to assume for connecting to the S3 bucket. In either case, the role must provide privileges to use the S3 bucket.
- AWS Profile: In the field under AWS profile (optional), provide the AWS Profile name. If you leave the field blank, Dremio uses the default AWS Profile.
- No Authentication: Select this option if no credentials are required because you are connecting the Nessie source to a public Amazon S3 bucket.
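Several of the options above ask for an ARN. As a rough illustration, the sketch below checks whether a value has the general shape of an AWS Secrets Manager secret ARN or an IAM role ARN before you paste it into the dialog. The function names are hypothetical, and the patterns are deliberately loose: they check the overall layout only and say nothing about whether the secret or role exists.

```python
import re

# General ARN layout: arn:<partition>:<service>:<region>:<account-id>:<resource>.
# IAM is a global service, so the region segment of a role ARN is empty.
_PARTITION = r"(aws|aws-cn|aws-us-gov)"
_SECRET_ARN = re.compile(
    rf"arn:{_PARTITION}:secretsmanager:[a-z0-9-]+:\d{{12}}:secret:.+"
)
_ROLE_ARN = re.compile(rf"arn:{_PARTITION}:iam::\d{{12}}:role/.+")


def looks_like_secret_arn(value: str) -> bool:
    """Return True if the value has the shape of a Secrets Manager secret ARN."""
    return _SECRET_ARN.fullmatch(value) is not None


def looks_like_role_arn(value: str) -> bool:
    """Return True if the value has the shape of an IAM role ARN."""
    return _ROLE_ARN.fullmatch(value) is not None
```

For example, `looks_like_role_arn("arn:aws:iam::123456789012:role/dremio-nessie")` passes the shape check, while a bare role name does not. The account ID and role name here are made up for illustration.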
S3-Compatible Storage Provider Authentication
If you are connecting to S3-compatible storage like MinIO or Dell ECS, choose AWS access key for authentication and provide the access key and secret.
Other: Connection Properties
Provide the custom key-value pairs for the connection relevant to the source.
(Optional) If you are connecting to S3 storage, complete the following:
- Click Add Property.
- For Name, provide a connection property.
- For Value, provide the corresponding value for the connection property.
If you are connecting to S3-compatible storage like MinIO or Dell ECS, complete the following:
- Add fs.s3a.path.style.access and set the value to true. This setting ensures that the request path is created correctly when using IP addresses or hostnames as the endpoint.
- Add the fs.s3a.endpoint property and its corresponding server endpoint value (IP address). The endpoint value cannot contain the http(s):// prefix, nor can it start with the string s3. For example, if the endpoint is http://123.1.2.3:9000, the value is 123.1.2.3:9000.
- Add dremio.s3.compat and set the value to true.
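The three properties for S3-compatible storage can be derived from the server's endpoint URL. The sketch below is one way to do that under the rules stated above: it strips the http(s):// prefix and rejects a value that starts with s3. The function name is hypothetical, and the returned dictionary is only a convenient way to list the name/value pairs you would enter with Add Property.

```python
from typing import Dict
from urllib.parse import urlsplit


def s3_compatible_properties(endpoint: str) -> Dict[str, str]:
    """Return the connection properties for an S3-compatible endpoint.

    Accepts the endpoint with or without an http(s):// prefix and keeps
    only the host and port, as the fs.s3a.endpoint value requires.
    """
    # Without a scheme, urlsplit needs a leading "//" to recognize the host part.
    parts = urlsplit(endpoint if "://" in endpoint else "//" + endpoint)
    host_port = parts.netloc
    if not host_port:
        raise ValueError(f"no host found in endpoint: {endpoint!r}")
    if host_port.startswith("s3"):
        raise ValueError("the endpoint value cannot start with the string 's3'")
    return {
        "fs.s3a.path.style.access": "true",
        "fs.s3a.endpoint": host_port,
        "dremio.s3.compat": "true",
    }
```

Calling `s3_compatible_properties("http://123.1.2.3:9000")` yields `123.1.2.3:9000` for `fs.s3a.endpoint`, matching the example in the list above.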
Other: Encrypt connection
Optional: To secure the connections between the Amazon S3 bucket and Dremio, select the Encrypt connection checkbox.
To save the configuration, click Save. To configure additional settings, proceed to Advanced Options.
To connect Azure Storage to the Nessie source, select the Azure storage provider option.
Azure Storage
- In the field under Storage Account Name, provide the name of the Azure Storage account to use.
- In the field under Azure root path, provide the path in your Azure Storage account to the write location that Dremio should use for Iceberg metadata and data. The root path includes the name of the Azure Storage container, followed by the names of any folders (for example, /containername/optional/folder/path).
Azure Authentication
Under Authentication method, choose whether you want to authenticate to Azure Storage with a shared access key or Microsoft Entra ID.
- Shared access key: Use the dropdown menu to choose a method for providing the shared access key for the Azure Storage account:
  - Dremio: Provide the shared access key in plain text. Dremio stores the shared access key.
  - Azure Key Vault: Provide the URI for the Azure Key Vault secret that stores the shared access key. The URI format is https://vault_name.vault.azure.net/secrets/secret_name. To use Azure Key Vault as your application secret store, you must deploy Dremio on Azure and complete the requirements for authenticating with Azure Key Vault.
  - AWS Secrets Manager: Provide the Amazon Resource Name (ARN) for the AWS Secrets Manager secret that holds the shared access key, which is available in the AWS web console or using command line tools.
  - HashiCorp Vault: Choose the HashiCorp secrets engine you're using from the dropdown menu and provide the secret reference for the shared access key in the correct format in the provided field.
- Microsoft Entra ID:
  - In the field under Application ID, provide the ID for the application (client) in Azure.
  - Under Client secret, use the dropdown menu to choose a method for providing the client secret for the Azure Storage account:
    - Dremio: Provide the client secret in plain text. Dremio stores the client secret.
    - Azure Key Vault: Provide the URI for the Azure Key Vault secret that stores the client secret. The URI format is https://vault_name.vault.azure.net/secrets/secret_name. To use Azure Key Vault as your application secret store, you must deploy Dremio on Azure and complete the requirements for authenticating with Azure Key Vault.
    - AWS Secrets Manager: Provide the Amazon Resource Name (ARN) for the AWS Secrets Manager secret that holds the client secret, which is available in the AWS web console or using command line tools.
    - HashiCorp Vault: Choose the HashiCorp secrets engine you're using from the dropdown menu and provide the secret reference for the client secret in the correct format in the provided field.
  - In the field under OAuth 2.0 token endpoint, provide the OAuth 2.0 token endpoint (v1.0), including the tenant ID, that the application uses to get an access token or a refresh token.
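Two of the values above follow a fixed layout: the Azure Key Vault secret URI and the OAuth 2.0 token endpoint. The sketch below checks the first against the URI format given on this page and builds the second. The function names are hypothetical, and the token endpoint assumes the Azure public cloud host login.microsoftonline.com; sovereign clouds use different hosts.

```python
import re

# URI format given on this page: https://<vault_name>.vault.azure.net/secrets/<secret_name>.
# Vault and secret names are limited here to alphanumerics and hyphens.
_KEY_VAULT_SECRET_URI = re.compile(
    r"https://[A-Za-z0-9-]+\.vault\.azure\.net/secrets/[A-Za-z0-9-]+"
)


def is_key_vault_secret_uri(uri: str) -> bool:
    """Return True if the URI matches the documented Key Vault secret format."""
    return _KEY_VAULT_SECRET_URI.fullmatch(uri) is not None


def entra_token_endpoint_v1(tenant_id: str) -> str:
    """Build the v1.0 OAuth 2.0 token endpoint for a tenant.

    Assumption: Azure public cloud. The v1.0 endpoint has no "v2.0" path segment.
    """
    return f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
```

The check is intentionally strict about the path: a URI that points at a key or certificate (`/keys/...`, `/certificates/...`) or that carries a trailing version segment does not match the format shown on this page and is rejected.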
Other: Connection Properties (Optional)
Provide the custom key-value pairs for the connection relevant to the source.
- Click Add Property.
- For Name, provide a connection property.
- For Value, provide the corresponding value for the connection property.
Other: Encrypt connection
Optional: To secure the connections between Azure Storage and Dremio, select the Encrypt connection checkbox.
To save the configuration, click Save. To configure additional settings, proceed to Advanced Options.
To connect Google Cloud Storage (GCS) to the Nessie source, select the Google storage provider option.
GCS Storage
- In the field under Google Project ID, provide the ID for your GCS project. You can find the ID in the Project info pane at the top-left of your screen on the GCS Home page.
- In the field under Google root path, provide the path for the GCS source that Dremio should use for Iceberg metadata and data.
GCS Authentication
Under Authentication method, choose whether you want to authenticate to GCS with a service account key or by automatic/service account.
- Service Account Keys:
  - In the field under Client Email, provide the email address associated with the GCS service account.
  - In the field under Client ID, provide the client ID for your GCS key pair.
  - In the field under Private Key ID, provide the key ID for your GCS key pair.
  - Under Private Key, use the dropdown menu to choose a method for providing the private key for your GCS key pair:
    - Dremio: Provide the private key in plain text. Dremio stores the private key.
    - Azure Key Vault: Provide the URI for the Azure Key Vault secret that stores the private key. The URI format is https://vault_name.vault.azure.net/secrets/secret_name. To use Azure Key Vault as your application secret store, you must deploy Dremio on Azure and complete the requirements for authenticating with Azure Key Vault.
    - AWS Secrets Manager: Provide the Amazon Resource Name (ARN) for the AWS Secrets Manager secret that holds the private key, which is available in the AWS web console or using command line tools.
    - HashiCorp Vault: Choose the HashiCorp secrets engine you're using from the dropdown menu and provide the secret reference for the private key in the correct format in the provided field.
- Automatic/Service Account: If you are running Dremio on a Google Compute instance, Dremio uses the active service account for your instance and does not require any additional information to integrate with your data.
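When you use Service Account Keys, the four values typically come from the JSON key file that Google Cloud generates for a service account. The sketch below pulls them out of such a file. The function name is hypothetical, and the pairing of each dialog field with a JSON field (`client_email`, `client_id`, `private_key_id`, `private_key`) is an assumption based on the field names, not something this page states.

```python
import json
from typing import Dict

# Assumed pairing of dialog labels with fields in a service account JSON key file.
_FIELD_MAP = {
    "Client Email": "client_email",
    "Client ID": "client_id",
    "Private Key ID": "private_key_id",
    "Private Key": "private_key",
}


def gcs_form_values(key_file_json: str) -> Dict[str, str]:
    """Return the dialog values, keyed by field label, from a JSON key file's text."""
    key = json.loads(key_file_json)
    missing = [name for name in _FIELD_MAP.values() if name not in key]
    if missing:
        raise KeyError(f"key file is missing fields: {', '.join(missing)}")
    return {label: key[name] for label, name in _FIELD_MAP.items()}
```

Treat the returned private key like any other credential: avoid printing or logging it, and prefer one of the secret stores listed above over plain text where that is an option.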
Other: Connection Properties (Optional)
Provide the custom key-value pairs for the connection relevant to the source.
- Click Add Property.
- For Name, provide a connection property.
- For Value, provide the corresponding value for the connection property.
Other: Encrypt connection
Optional: To secure the connections between GCS and Dremio, select the Encrypt connection checkbox.
To save the configuration, click Save. To configure additional settings, proceed to Advanced Options.
Advanced Options
Click Advanced Options in the left menu sidebar.
All advanced parameters are optional.
Review each option provided in the following table to set up the advanced options to meet your needs.
Advanced Option | Description |
---|---|
Enable asynchronous access when possible | Activated by default. To deactivate, clear the checkbox. Enables cloud caching for the S3 bucket to support simultaneous actions such as adding and editing a new source. |
Under Cache Options, review the following table and edit the options to meet your needs.
Cache Options | Description |
---|---|
Enable local caching when possible | Selected by default, along with asynchronous access for cloud caching. Uncheck the checkbox to disable this option. For more information about local caching, see Columnar Cloud Cache. |
Max percent of total available cache space to use when possible | Specifies the disk quota, as a percentage, that a source can use on any single executor node. This option applies only when local caching is enabled. The default is 100 percent of the total disk space available on the mount point provided for caching. You can either enter a percentage in the value field or use the arrows to the far right to adjust the percentage. |
Reflection Refresh
The Reflection Refresh section allows you to set a schedule for refreshing all of the reflections that are defined on tables in the catalog. You can override this schedule on individual tables in different branches. This section also lets you specify how long all reflections in the catalog exist until they expire. Again, you can override this setting on individual tables in different branches.
To learn more, see Refreshing Reflections and Setting the Expiration Policy for Reflections.
Privileges
On the Privileges tab, you can grant privileges to specific users or roles. See Access Controls for additional information about privileges.
All privileges are optional.
- For Privileges, enter the user name or role name that you want to grant access to and click the Add to Privileges button. The added user or role is displayed in the USERS/ROLES table.
- For the users or roles in the USERS/ROLES table, toggle the checkmark for each privilege you want to grant on the Dremio source that is being created.
- Click Save after setting the configuration.
At this point, a connection with the Nessie server is attempted. If a connection cannot be made, report the issue to the Project Nessie community's Zulip channel. You can also file a ticket on the Project Nessie community's GitHub page.
Updating a Nessie Source
To update a Nessie source:
- On the Datasets page, under Nessie Catalogs in the panel on the left, find the name of the source you want to edit.
- Right-click the source name and select Settings from the list of actions. Alternatively, click the source name and then the settings icon at the top right corner of the page.
- In the Source Settings dialog, edit the settings you wish to update. Dremio does not support updating the source name. For information about the settings options, see Configuring Nessie as a Source.
- Click Save.
Deleting a Nessie Source
If the source is in a bad state (for example, Dremio cannot authenticate to the source or the source is otherwise unavailable), only users who belong to the ADMIN role can delete the source.
To delete a Nessie source, perform these steps:
- On the Datasets page, click Sources > Nessie Catalogs in the panel on the left.
- In the list of data sources, hover over the name of the source you want to remove and right-click.
- From the list of actions, click Delete.
- In the Delete Source dialog, click Delete to confirm that you want to remove the source.
Deleting a source causes all downstream views that depend on objects in the source to break.
Limitations
- Changes to tables and views that are in Nessie sources are not logged. Nessie sources do not have audit logs. DX-64988
- The Catalog API is unable to retrieve or manage Nessie sources. DX-64994
- Dremio does not support moving, copying, or renaming tables and views in Nessie sources or removing the format from tables in Nessie sources.