Nessie Catalogs
Nessie catalogs enable you to process, manage, consume, and share data in the same way that code is shared during software development. That is, you are empowered to take control of your data using concepts including version control, commits, and testing and development in isolation from your production data. Dremio enables you to perform data as code activities using Project Nessie, which provides Git-like capabilities for the data lakehouse.
Prerequisites
Dremio supports Nessie version 0.59.0 and later. If you have not yet set up a Nessie server and connected it with your dataset, you can set one up either in a fast-start Docker image or with secure HTTPS transport in Minikube.
When using Nessie as a source, Dremio can connect to Amazon S3 buckets, Azure Storage, or Google Cloud Storage (GCS). Read Storage for details about the required credentials for connecting to each storage provider.
Configuring Nessie as a Source
To add a Nessie source to your project:
- On the Datasets page, to the right of Sources in the left panel, click the icon for adding a source.
- In the Add Data Source dialog, under Nessie Catalogs, select Nessie.
  The New Nessie Source dialog box appears, which contains the following sections:
  - General: Create a name for your Nessie source, specify the endpoint URL, and set the authentication type. The name cannot include the following special characters: /, :, [, or ].
  - Storage: Set the storage option by setting up the authentication type and the connection properties.
  - Advanced Options: (Optional) Use the default settings or configure access preferences and cache options.
  - Privileges: (Optional) Add privileges for users or roles.
  Refer to the following for guidance on how to edit each section.
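The naming rules can be checked before you open the dialog. The following is a minimal sketch, assuming the stricter rule stated under General (at most 255 characters drawn from 0-9, A-Z, a-z, underscore, and hyphen), which also rules out /, :, [, and ]. The function name `is_valid_source_name` is hypothetical and is not part of Dremio; the dialog performs its own validation.

```python
import re

# Allowed characters per the rules in this document: 0-9, A-Z, a-z, "_" and "-".
# The 1..255 length bound mirrors the documented 255-character limit.
# Hypothetical helper for illustration only; it is not a Dremio API.
_NAME_PATTERN = re.compile(r"[0-9A-Za-z_-]{1,255}")


def is_valid_source_name(name: str) -> bool:
    """Return True if `name` satisfies the documented length and character rules."""
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    # A name built only from allowed characters.
    print(is_valid_source_name("nessie_prod-01"))
    # A name containing "/", which the rules exclude.
    print(is_valid_source_name("nessie/prod"))
```

Uniqueness within the organization cannot be checked this way, so a name that passes may still be rejected when you save the source.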
General
This tab provides options for configuring connections to a Nessie source.
- In the Name field, enter a name.
The name you enter must be unique in the organization. Also, consider a name that is easy for users to reference. This name cannot be edited once the source is created. The name cannot exceed 255 characters and must contain only the following characters: 0-9, A-Z, a-z, underscore(_), or hyphen (-).
- In the Nessie endpoint URL field, specify the IP address and port that you have set up for your Nessie server (e.g., https://localhost:19120/api/v2). For more information, see Project Nessie Configuration.
- Under Nessie authentication type, select either None or Bearer:
  - None: The Nessie server does not require authentication.
  - Bearer: Set authentication using an OpenID bearer token. For more information about setting up this type of authentication, see Project Nessie's Authentication page. Then, choose a method for providing the bearer token from the dropdown menu:
    - Dremio: Provide the bearer token in plain text. Dremio stores the bearer token.
    - Azure Key Vault: Provide the URI for the Azure Key Vault secret that stores the bearer token. The URI format is https://<vault_name>.vault.azure.net/secrets/<secret_name> (for example, https://myvault.vault.azure.net/secrets/mysecret).
      Note: To use Azure Key Vault as your application secret store, you must:
      - Deploy Dremio on Azure.
      - Complete the Requirements for Authenticating with Azure Key Vault.
      It is not necessary to restart the Dremio coordinator when you rotate secrets stored in Azure Key Vault. Read Requirements for Secrets Rotation for more information.
    - AWS Secrets Manager: Provide the Amazon Resource Name (ARN) for the AWS Secrets Manager secret that holds the bearer token, which is available in the AWS web console or using command line tools.
    - HashiCorp Vault: Choose the HashiCorp secrets engine you're using from the dropdown menu and enter the secret reference for the bearer token in the correct format in the provided field.
Next, set up the storage options.
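Before saving the General settings, it can help to confirm that the endpoint URL and bearer token are accepted by the Nessie server. The sketch below uses only the Python standard library. It assumes that the server exposes a configuration resource at `/config` under the endpoint URL, which is how this sketch understands the Nessie REST API; check the Project Nessie documentation for your version. The function names are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_config_request(endpoint_url: str, bearer_token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for the configuration resource under `endpoint_url`.

    The "/config" path is an assumption of this sketch, not something stated
    in this document.
    """
    url = endpoint_url.rstrip("/") + "/config"
    headers = {"Accept": "application/json"}
    if bearer_token:
        # Bearer authentication: the token travels in the Authorization header.
        headers["Authorization"] = f"Bearer {bearer_token}"
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_nessie_config(endpoint_url: str, bearer_token: Optional[str] = None, timeout: float = 5.0) -> dict:
    """Send the request and return the decoded JSON body (needs a reachable server)."""
    request = build_config_request(endpoint_url, bearer_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds the request, so no server is contacted here.
    request = build_config_request("https://localhost:19120/api/v2")
    print(request.full_url)
```

With the authentication type set to None, omit the token. With Bearer, pass the same token that you provide in the dialog. Calling `fetch_nessie_config` raises an exception from `urllib` if the server is unreachable or rejects the token.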
Storage
This tab enables you to configure the storage options for the Nessie source. The options in this section apply to Amazon S3 storage, so you must specify the AWS authentication method to use, if one is required (see Prerequisites if you need to set up storage for the Nessie source). Additionally, you can set up connection properties and enable encryption of the connection.
Authentication
To connect an Amazon S3 bucket to the Nessie source, choose one of the following authentication methods:
- AWS Access Key: Enables an IAM user or the AWS account root user to access the Amazon S3 bucket. You can authenticate to the specified S3 bucket either with both an AWS Access Key and an AWS Access Secret or with an IAM Role to Assume.
  Either the bucket or, if specified, the whitelisted bucket associated with the authentication method you are connecting with will be made available.
  Note: For information about long-term credentials for an IAM user or the AWS account root user, see Managing access keys for IAM users.
  - Choice 1: AWS Access Key (for example, AKIAIOSFODNN7EXAMPLE) and AWS Access Secret (for example, wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY).
  - Choice 2: IAM Role to Assume: An identity within your AWS account that has specific permissions.
- EC2 Metadata: To authenticate to your Amazon S3 bucket using EC2 metadata, you need to provide an IAM role with privileges to the bucket. This role could be attached to the EC2 instance or to an IAM role to assume for connecting to the bucket. In either case, the role must provide privileges to use the bucket.
- AWS Profile: Dremio reads your credentials from the specified AWS profile. For information on how to set up a configuration or credentials file for AWS, see AWS Custom Authentication.
  - Profile Name (Optional): The AWS profile name. If this is left blank, the default profile is used. For more information about using profiles in a credentials or configuration file, see AWS's documentation on Configuration and credential file settings.
- No Authentication: Select this option when you are connecting the Nessie source to a public Amazon S3 bucket.
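The profile selection rule for the AWS Profile method (use the named profile, or the default profile when the name is blank) can be illustrated with a short sketch. It parses text in the format of an AWS shared credentials file with Python's `configparser`. The profile name `nessie` and the function name are made up for this example, the key values are placeholders rather than real credentials, and the code does not claim to reproduce how Dremio itself reads the file.

```python
import configparser
from typing import Optional

# Text in the format of an AWS shared credentials file.
# The "nessie" profile is hypothetical; all key values are placeholders.
SAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

[nessie]
aws_access_key_id = AKIAI44QH8DHBEXAMPLE
aws_secret_access_key = je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY
"""


def select_profile(credentials_text: str, profile_name: Optional[str] = None) -> dict:
    """Return the settings of `profile_name`, or of "default" when the name is blank."""
    # Interpolation is disabled so that "%" characters in secrets are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(credentials_text)
    name = profile_name or "default"
    if not parser.has_section(name):
        raise KeyError(f"profile not found: {name}")
    return dict(parser[name])


if __name__ == "__main__":
    # A blank profile name falls back to the default profile.
    print(select_profile(SAMPLE_CREDENTIALS)["aws_access_key_id"])
    print(select_profile(SAMPLE_CREDENTIALS, "nessie")["aws_access_key_id"])
```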
To connect to S3-compatible storage like MinIO:
- Choose AWS Access Key for authentication and provide the access key and secret.
- Click Add property under Connection Properties and add the following properties:
  - Add fs.s3a.path.style.access and set the value to true.
    Note: This setting ensures that the request path is created correctly when using IP addresses or hostnames as the endpoint.
  - Add the fs.s3a.endpoint property and its corresponding server endpoint value (IP address).
    Note: The endpoint value cannot contain the http(s):// prefix, nor can it start with the string s3. For example, if the endpoint is http://123.1.2.3:9000, the value is 123.1.2.3:9000.
  - Add dremio.s3.compat and set the value to true.
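The three connection properties for S3-compatible storage, together with the endpoint rule from the note, can be sketched as follows. The property names and the example endpoint come from the steps above; the helper names are hypothetical, and the code only prepares the name and value pairs that you would then enter under Connection Properties.

```python
import re


def normalize_endpoint(endpoint: str) -> str:
    """Strip an http:// or https:// prefix and reject values that start with "s3"."""
    value = re.sub(r"^https?://", "", endpoint.strip(), flags=re.IGNORECASE)
    if value.startswith("s3"):
        raise ValueError('the fs.s3a.endpoint value must not start with "s3"')
    return value


def s3_compatible_properties(endpoint: str) -> dict:
    """Return the connection properties listed in this document for S3-compatible storage."""
    return {
        "fs.s3a.path.style.access": "true",
        "fs.s3a.endpoint": normalize_endpoint(endpoint),
        "dremio.s3.compat": "true",
    }


if __name__ == "__main__":
    # Uses the example endpoint from the note above.
    for name, value in s3_compatible_properties("http://123.1.2.3:9000").items():
        print(f"{name} = {value}")
```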
AWS Root Path
The root path to the Amazon S3 bucket. It is recommended that you either have a dedicated S3 bucket or, at least, a dedicated folder in which to store Nessie objects.
Example: /bucket-name/optional/folder/path
Locations in which Iceberg Tables are Created
Where the CREATE TABLE command creates a table depends on the type of data source being used. For Nessie data sources, top-level Nessie schemas have a configurable physical storage. This is used as the default root physical location.
In the project store, each top-level Nessie schema has its own directory path. For example, in the project's Nessie, the top-level schema marketing would be located in project_store/marketing, and this directory would be used by default as the root physical location. From there, the same schema.table resolution as described for Hive would apply.
Connection Properties (Optional)
When using Nessie as a source, tables and metadata files are stored in an Amazon S3 bucket. This section enables you to provide the custom key value pairs for the connection relevant to the source.
- Click Add Property.
- For Name, enter a connection property.
- For Value, enter the corresponding connection property value.
(Optional) To secure the connections between the S3 buckets and Dremio, tick the Encrypt connection checkbox.
After configuring the General and Storage options, you can either save your settings or continue on to set up the optional settings.
- To save the configuration, click Save.
- To configure the optional settings, proceed to Advanced Options.
Nessie sources can use Amazon S3 buckets (AWS), Azure Storage (Azure), or Google Cloud Storage [Google (Preview)] as the storage provider.
- AWS
- Azure
- Google (Preview)
To connect an Amazon S3 bucket to the Nessie source, select the AWS storage provider option.
S3 Storage
In the field under AWS root path, provide the root path of the S3 bucket to use. We recommend that you have either a dedicated S3 bucket or a dedicated folder in which to store Nessie objects.
S3 Authentication
Under Authentication method, choose the method you want to use to authenticate to Amazon S3.
- AWS Access Key:
  - In the field under AWS access key, provide the access key for the Amazon S3 account.
  - Under AWS access secret, use the dropdown menu to choose a method for providing the access secret for the Amazon S3 account:
    - Dremio: Provide the Amazon S3 access secret in plain text. Dremio stores the Amazon S3 access secret.
    - Azure Key Vault: Provide the URI for the Azure Key Vault secret that stores the Amazon S3 access secret. The URI format is https://vault_name.vault.azure.net/secrets/secret_name. To use Azure Key Vault as your application secret store, you must deploy Dremio on Azure and complete the requirements for authenticating with Azure Key Vault.
    - AWS Secrets Manager: Provide the Amazon Resource Name (ARN) for the AWS Secrets Manager secret that holds the Amazon S3 access secret, which is available in the AWS web console or using command line tools.
    - HashiCorp Vault: Choose the HashiCorp secrets engine you're using from the dropdown menu and provide the secret reference for the Amazon S3 access secret in the correct format in the provided field.
  - In the field under IAM role to assume, provide the ARN of the IAM role.
- EC2 Metadata: In the field under IAM role to assume, provide the ARN of an IAM role with privileges on the S3 bucket. This role could be attached to the EC2 instance or to an IAM role to assume for connecting to the S3 bucket. In either case, the role must provide privileges to use the S3 bucket.
- AWS Profile: In the field under AWS profile (optional), provide the AWS Profile name. If you leave the field blank, Dremio uses the default AWS Profile.
- No Authentication: Select this option if no credentials are required because you are connecting the Nessie source to a public Amazon S3 bucket.
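Secret references are easy to mistype, so a quick format check before pasting one into the dialog may save a failed connection attempt. The sketch below recognizes the Azure Key Vault URI format given in this document and the general shape of an AWS Secrets Manager ARN. The ARN pattern, the returned labels, and the example ARN are assumptions of this sketch, and HashiCorp Vault references are not covered because their format depends on the secrets engine.

```python
import re

# Azure Key Vault secret URI, following the format given in this document.
_KEY_VAULT_URI = re.compile(r"https://[A-Za-z0-9-]+\.vault\.azure\.net/secrets/[A-Za-z0-9-]+")

# Assumed general shape of an AWS Secrets Manager secret ARN:
# arn:<partition>:secretsmanager:<region>:<12-digit account ID>:secret:<name>
_SECRET_ARN = re.compile(r"arn:aws[a-z-]*:secretsmanager:[a-z0-9-]+:\d{12}:secret:.+")


def classify_secret_reference(reference: str) -> str:
    """Return a hypothetical label for the secret store that `reference` appears to target."""
    if _KEY_VAULT_URI.fullmatch(reference):
        return "azure-key-vault"
    if _SECRET_ARN.fullmatch(reference):
        return "aws-secrets-manager"
    return "unknown"


if __name__ == "__main__":
    print(classify_secret_reference("https://myvault.vault.azure.net/secrets/mysecret"))
    # Made-up ARN for illustration only.
    print(classify_secret_reference(
        "arn:aws:secretsmanager:us-east-1:123456789012:secret:nessie-s3-secret-AbCdEf"
    ))
```

A match only means that the reference is well formed. It does not confirm that the secret exists or that Dremio is permitted to read it.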
Other: Connection Properties (Optional)
Provide the custom key-value pairs for the connection relevant to the source.
- Click Add Property.
- For Name, provide a connection property.
- For Value, provide the corresponding value for the connection property.
Other: Encrypt connection
Optional: To secure the connections between the Amazon S3 bucket and Dremio, select the Encrypt connection checkbox.
To save the configuration, click Save. To configure additional settings, proceed to Advanced Options.
To connect Azure Storage to the Nessie source, select the Azure storage provider option.
Azure Storage
- In the field under Storage Account Name, provide the name of the Azure Storage account to use.
- In the field under Azure root path, provide the path in your Azure Storage account to the write location that Dremio should use for Iceberg metadata and data. The root path includes the name of the Azure Storage container, followed by the names of any folders (for example, /containername/optional/folder/path).
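The structure of the Azure root path (container name first, then any folders) can be made explicit with a small sketch. The function name is hypothetical, and the code only splits the string; it does not check that the container exists in the storage account.

```python
from typing import Tuple


def split_azure_root_path(root_path: str) -> Tuple[str, str]:
    """Split a root path into (container name, folder path inside the container)."""
    parts = [part for part in root_path.split("/") if part]
    if not parts:
        raise ValueError("the root path must begin with a container name")
    return parts[0], "/".join(parts[1:])


if __name__ == "__main__":
    container, folder = split_azure_root_path("/containername/optional/folder/path")
    print(container)  # containername
    print(folder)     # optional/folder/path
```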
Azure Authentication
Under Authentication method, choose whether you want to authenticate to Azure Storage with a shared access key or Azure Active Directory.
- Shared access key: Use the dropdown menu to choose a method for providing the shared access key for the Azure Storage account:
  - Dremio: Provide the shared access key in plain text. Dremio stores the shared access key.
  - Azure Key Vault: Provide the URI for the Azure Key Vault secret that stores the shared access key. The URI format is https://vault_name.vault.azure.net/secrets/secret_name. To use Azure Key Vault as your application secret store, you must deploy Dremio on Azure and complete the requirements for authenticating with Azure Key Vault.
  - AWS Secrets Manager: Provide the Amazon Resource Name (ARN) for the AWS Secrets Manager secret that holds the shared access key, which is available in the AWS web console or using command line tools.
  - HashiCorp Vault: Choose the HashiCorp secrets engine you're using from the dropdown menu and provide the secret reference for the shared access key in the correct format in the provided field.
- Azure Active Directory:
  - In the field under Application ID, provide the ID for the application (client) in Azure.
  - Under Client secret, use the dropdown menu to choose a method for providing the client secret for the Azure Storage account:
    - Dremio: Provide the client secret in plain text. Dremio stores the client secret.
    - Azure Key Vault: Provide the URI for the Azure Key Vault secret that stores the client secret. The URI format is https://vault_name.vault.azure.net/secrets/secret_name. To use Azure Key Vault as your application secret store, you must deploy Dremio on Azure and complete the requirements for authenticating with Azure Key Vault.
    - AWS Secrets Manager: Provide the Amazon Resource Name (ARN) for the AWS Secrets Manager secret that holds the client secret, which is available in the AWS web console or using command line tools.
    - HashiCorp Vault: Choose the HashiCorp secrets engine you're using from the dropdown menu and provide the secret reference for the client secret in the correct format in the provided field.
  - In the field under OAuth 2.0 token endpoint, provide the OAuth 2.0 token endpoint (v1.0), including the tenant ID, that the application uses to get an access token or a refresh token.
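As an illustration of what a v1.0 token endpoint that includes the tenant ID looks like, the sketch below builds one from a tenant ID. It assumes the global Azure cloud, where the host is login.microsoftonline.com; sovereign clouds use different hosts. The function name is hypothetical and the tenant ID shown is a placeholder, so copy the actual endpoint from your application's registration in Azure where possible.

```python
import re

# Tenant IDs are GUIDs, for example 00000000-0000-0000-0000-000000000000.
_GUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def oauth_v1_token_endpoint(tenant_id: str) -> str:
    """Build the v1.0 OAuth 2.0 token endpoint for a tenant (global Azure cloud assumed)."""
    if not _GUID.fullmatch(tenant_id):
        raise ValueError("the tenant ID is expected to be a GUID")
    return f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"


if __name__ == "__main__":
    # Placeholder tenant ID for illustration only.
    print(oauth_v1_token_endpoint("00000000-0000-0000-0000-000000000000"))
```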
Other: Connection Properties (Optional)
Provide the custom key-value pairs for the connection relevant to the source.
- Click Add Property.
- For Name, provide a connection property.
- For Value, provide the corresponding value for the connection property.
Other: Encrypt connection
Optional: To secure the connections between Azure Storage and Dremio, select the Encrypt connection checkbox.
To save the configuration, click Save. To configure additional settings, proceed to Advanced Options.
To connect Google Cloud Storage (GCS) to the Nessie source, select the Google storage provider option.
GCS Storage
- In the field under Google Project ID, provide the ID for your GCS project. You can find the ID in the Project info pane at the top-left of your screen on the GCS Home page.
- In the field under Google root path, provide the path for the GCS source that Dremio should use for Iceberg metadata and data.
GCS Authentication
Under Authentication method, choose whether you want to authenticate to GCS with a service account key or by automatic/service account.
- Service Account Keys:
  - In the field under Client Email, provide the email address associated with the GCS service account.
  - In the field under Client ID, provide the client ID for your GCS key pair.
  - In the field under Private Key ID, provide the key ID for your GCS key pair.
  - Under Private Key, use the dropdown menu to choose a method for providing the private key for your GCS key pair:
    - Dremio: Provide the private key in plain text. Dremio stores the private key.
    - Azure Key Vault: Provide the URI for the Azure Key Vault secret that stores the private key. The URI format is https://vault_name.vault.azure.net/secrets/secret_name. To use Azure Key Vault as your application secret store, you must deploy Dremio on Azure and complete the requirements for authenticating with Azure Key Vault.
    - AWS Secrets Manager: Provide the Amazon Resource Name (ARN) for the AWS Secrets Manager secret that holds the private key, which is available in the AWS web console or using command line tools.
    - HashiCorp Vault: Choose the HashiCorp secrets engine you're using from the dropdown menu and provide the secret reference for the private key in the correct format in the provided field.
- Automatic/Service Account: If you are running Dremio on a Google Compute instance, Dremio uses the active service account for your instance and does not require any additional information to integrate with your data.
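If you downloaded a JSON key file for the service account, the values for the Service Account Keys fields can be read from it. The sketch below assumes that the dialog fields correspond to the `client_email`, `client_id`, `private_key_id`, and `private_key` entries of such a key file; that correspondence is this sketch's assumption, not something this document states. The function name and every value in the sample key are made up.

```python
import json

# Assumed mapping from dialog field labels to keys in a service account JSON key file.
FIELD_TO_KEY = {
    "Client Email": "client_email",
    "Client ID": "client_id",
    "Private Key ID": "private_key_id",
    "Private Key": "private_key",
}


def dialog_values_from_key_file(key_json: str) -> dict:
    """Extract the values for the dialog fields from the text of a JSON key file."""
    key = json.loads(key_json)
    missing = [name for name in FIELD_TO_KEY.values() if name not in key]
    if missing:
        raise KeyError(f"key file is missing: {', '.join(missing)}")
    return {field: key[name] for field, name in FIELD_TO_KEY.items()}


# Fake key material for illustration only.
SAMPLE_KEY_JSON = json.dumps({
    "type": "service_account",
    "private_key_id": "0123456789abcdef",
    "private_key": "-----BEGIN PRIVATE KEY-----\nFAKE\n-----END PRIVATE KEY-----\n",
    "client_email": "dremio-reader@example-project.iam.gserviceaccount.com",
    "client_id": "123456789012345678901",
})

if __name__ == "__main__":
    values = dialog_values_from_key_file(SAMPLE_KEY_JSON)
    for field in ("Client Email", "Client ID", "Private Key ID"):
        print(f"{field}: {values[field]}")
```

The private key is deliberately not printed. If you store it in a secret manager instead of in Dremio, provide the secret reference in the dialog rather than the key itself.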
Other: Connection Properties (Optional)
Provide the custom key-value pairs for the connection relevant to the source.
- Click Add Property.
- For Name, provide a connection property.
- For Value, provide the corresponding value for the connection property.
Other: Encrypt connection
Optional: To secure the connections between GCS and Dremio, select the Encrypt connection checkbox.
To save the configuration, click Save. To configure additional settings, proceed to Advanced Options.
Advanced Options
Click Advanced Options in the left menu sidebar.
All advanced parameters are optional.
Review each option provided in the following table to set up the advanced options to meet your needs.
Advanced Option | Description |
---|---|
Enable asynchronous access when possible | Activated by default; clear the checkbox to deactivate. Enables cloud caching for the S3 bucket to support simultaneous actions such as adding and editing a new source. |
Under Cache Options, review the following table and edit the options to meet your needs.
Cache Options | Description |
---|---|
Enable local caching when possible | Selected by default, along with asynchronous access for cloud caching. Uncheck the checkbox to disable this option. For more information about local caching, see Columnar Cloud Cache. |
Max percent of total available cache space to use when possible | Specifies the disk quota, as a percentage, that a source can use on any single executor node. This option applies only when local caching is enabled. The default is 100 percent of the total disk space available on the mount point provided for caching. You can either enter a percentage in the value field or use the arrows to the far right to adjust the percentage. |
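To see what a given percentage means in absolute terms, the quota can be worked out from the size of the cache mount point on one executor node. This is a plain arithmetic sketch: the function name is hypothetical, the accepted range of 0 to 100 is an assumption, and Dremio's internal accounting may round differently.

```python
def cache_quota_bytes(total_cache_bytes: int, max_percent: int = 100) -> int:
    """Return the cache space, in bytes, that one source may use on a single executor node."""
    if not 0 <= max_percent <= 100:
        raise ValueError("max_percent must be between 0 and 100")
    return total_cache_bytes * max_percent // 100


if __name__ == "__main__":
    gib = 1024 ** 3
    # Example: a 500 GiB cache mount point with the option set to 40 percent.
    quota = cache_quota_bytes(500 * gib, 40)
    print(quota // gib, "GiB")
```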
Reflection Refresh
The Reflection Refresh section allows you to set a schedule for refreshing all of the reflections that are defined on tables in the catalog. You can override this schedule on individual tables in different branches. This section also lets you specify how long all reflections in the catalog exist until they expire. Again, you can override this setting on individual tables in different branches.
To learn more, see Refreshing Reflections and Setting the Expiration Policy for Reflections.
Privileges
On the Privileges tab, you can grant privileges to specific users or roles. See Access Controls for additional information about privileges.
All privileges are optional.
- For Privileges, enter the user name or role name that you want to grant access to and click the Add to Privileges button. The added user or role is displayed in the USERS/ROLES table.
- For the users or roles in the USERS/ROLES table, toggle the checkmark for each privilege you want to grant on the Dremio source that is being created.
- Click Save after setting the configuration.
At this point, a connection with the Nessie server is attempted. If a connection cannot be made, report the issue to the Project Nessie community's Zulip channel. You can also file a ticket on the Project Nessie community's GitHub page.
Updating a Nessie Source
To update a Nessie source:
- On the Datasets page, under Nessie Catalogs in the panel on the left, find the name of the source you want to edit.
- Right-click the source name and select Settings from the list of actions. Alternatively, click the source name and then click the settings icon at the top right corner of the page.
- In the Source Settings dialog, edit the settings you wish to update. Dremio does not support updating the source name. For information about the settings options, see Configuring Nessie as a Source.
- Click Save.
Deleting a Nessie Source
If the source is in a bad state (for example, Dremio cannot authenticate to the source or the source is otherwise unavailable), only users who belong to the ADMIN role can delete the source.
To delete a Nessie source, perform these steps:
- On the Datasets page, click Sources > Nessie Catalogs in the panel on the left.
- In the list of data sources, hover over the name of the source you want to remove and right-click.
- From the list of actions, click Delete.
- In the Delete Source dialog, click Delete to confirm that you want to remove the source.
Deleting a source causes all downstream views that depend on objects in the source to break.
Limitations
- Changes to tables and views that are in Nessie sources are not logged. Nessie sources do not have audit logs. DX-64988
- The Catalog API is unable to retrieve or manage Nessie sources. DX-64994
- Dremio does not support moving, copying, or renaming tables and views in Nessie sources or removing the format from tables in Nessie sources.