Nessie
Nessie is an intelligent metastore and catalog for Apache Iceberg. It offers a modern alternative to Hive Metastore for Iceberg tables and views, and adds advanced features that make data lakes more effective. These features include:
- Adding or changing data on a versioned branch, testing that branch for quality, and merging the changes to general user availability, all within the same data lake and without impacting production data.
- Creating specialized versions of data for specific use cases.
- Atomically updating many tables, with many changes, thus eliminating data inconsistencies and aberrations in the middle of a change sequence.
Architecture
The Nessie service is a lightweight Java-based REST API server. Nessie uses configurable authentication and a configurable backend datastore (multiple database types are currently supported). This architecture allows Nessie to run in one or more Docker instances according to capacity requirements. The Nessie Helm chart deploys the front-end load balancer and assists with other details, such as the configuration of HTTPS. When only a single Nessie instance is required, for example in a local development or test environment, the Nessie JAR file can be deployed directly.
Objects in Nessie
When working with a Nessie source, you work in or with the following objects:
- Branches: Named references, each of which is a movable pointer to a commit.
- Folders: Used to help you organize your tables in a Nessie source.
- Tables: Contain the data from your source, formatted as rows and columns. A table can be modified by query engines that connect to your Nessie source.
- Views: Virtual tables, created by running SQL statements or functions on a table or another view.
You can create and store Apache Iceberg tables and views in the Nessie catalog. No other file or source types can be stored in the Nessie catalog.
Git-like Data Management
Nessie is a native Apache Iceberg catalog that provides Git-like data management. As a result, data engineering teams can use commits, branches, and tags to experiment on Apache Iceberg tables.
- Commit: A transaction affecting one or more tables or views. It may take place over a short or long period of time. Examples include:
  - Updating a table using Dremio Sonar (INSERT, UPDATE, DELETE, MERGE, TRUNCATE) or another engine such as Spark
  - Updating a view or the definition of a view
  - Updating the schema of a table via SQL (ALTER TABLE) or Spark
- Branch: A movable pointer to a commit. Every time you commit, the branch pointer moves forward automatically. Branches can be merged via a commit.
- Tag: A named commit. You can tag a commit with a specific name so that users can refer to it without specifying a commit hash.
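The relationship between these three concepts can be sketched with a small in-memory model. This is a toy illustration only, not Nessie's implementation or API: the `Commit` and `Catalog` classes, the metadata file names, and the `q1-report` tag are all hypothetical. It shows that a branch pointer moves forward on every commit, while a tag keeps pointing at the commit it named.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    """Immutable catalog snapshot: which metadata file each table points to."""
    message: str
    tables: Tuple[Tuple[str, str], ...]  # sorted (table name, metadata pointer) pairs
    parent: Optional["Commit"] = None

    @property
    def commit_hash(self) -> str:
        # Hash the parent hash plus this commit's content, Git-style.
        parent_hash = self.parent.commit_hash if self.parent else ""
        payload = f"{parent_hash}|{self.message}|{self.tables}"
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


class Catalog:
    """Hypothetical toy catalog; real Nessie persists this in a backend datastore."""

    def __init__(self) -> None:
        root = Commit("initial empty catalog", ())
        self.branches: Dict[str, Commit] = {"main": root}  # movable pointers
        self.tags: Dict[str, Commit] = {}                  # fixed names for commits

    def commit(self, branch: str, message: str, changes: Dict[str, str]) -> Commit:
        head = self.branches[branch]
        tables = dict(head.tables)
        tables.update(changes)
        new_commit = Commit(message, tuple(sorted(tables.items())), head)
        self.branches[branch] = new_commit  # the branch pointer moves forward
        return new_commit

    def tag(self, name: str, branch: str) -> None:
        # A tag names the commit the branch points to right now.
        self.tags[name] = self.branches[branch]

    def read(self, ref: str) -> Dict[str, str]:
        # Resolve a branch or tag name to the table pointers visible at that commit.
        if ref in self.branches:
            return dict(self.branches[ref].tables)
        if ref in self.tags:
            return dict(self.tags[ref].tables)
        raise KeyError(f"unknown reference: {ref}")


catalog = Catalog()
catalog.commit("main", "create orders", {"orders": "orders-v1.metadata.json"})
catalog.tag("q1-report", "main")
catalog.commit("main", "update orders", {"orders": "orders-v2.metadata.json"})

print(catalog.read("main"))       # latest state on the branch
print(catalog.read("q1-report"))  # state frozen at the tagged commit
```

Reading through the tag returns the table state as of the tagged commit, even though `main` has since moved on. This is the same property that the reproducibility use case relies on.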
These capabilities enable a variety of use cases such as:
- Multi-statement transactions: With branches, data is updated in isolation and changes are merged atomically. The updates can be performed through a single engine (for example, SQL DML statements in Dremio Sonar) or through multiple engines (for example, ingest data in Spark and delete a record in Dremio Sonar), and may span any period of time and any number of users.
- Experimentation: Experimenting on the live lakehouse risks exposing incorrect or inconsistent data to other users. Instead, you can easily create a sandbox branch and experiment there. Because the data is not duplicated, there is no cost to creating a sandbox. When you are done, you can either delete the branch or merge your changes into the main branch.
- Reproducibility: The ability to retrain machine learning models and rebuild BI dashboards based on historical data is important for reproducible research and regulatory compliance. Nessie enables any engine to access previous versions of the lakehouse by referencing a specific commit, tag (a named commit), or timestamp.
- Governance: Nessie provides a user interface familiar to users of GitHub and GitLab that makes it easy to see every commit in every branch, so that you don’t have to wonder who updated or deleted a table, or where a table originated.
The following illustration shows an example of a new branch that is forked from the main branch, then merged back atomically after multiple commits:
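The same fork-and-merge flow can also be sketched as a toy model in code. This is a minimal sketch under stated assumptions, not how Nessie is implemented: `ToyCatalog`, the `etl` branch name, and the snapshot labels such as `orders@s1` are hypothetical, and the merge shown here only handles added or updated tables. It illustrates that forking copies a pointer rather than data, that commits on the branch stay invisible on `main`, and that the merge lands on `main` as a single commit.

```python
from typing import Dict, Iterator, Optional


class ToyCatalog:
    """Hypothetical in-memory model of branches over a commit graph."""

    def __init__(self) -> None:
        self.commits: Dict[int, Dict[str, str]] = {0: {}}  # commit id -> table state
        self.parents: Dict[int, Optional[int]] = {0: None}
        self.branches: Dict[str, int] = {"main": 0}        # branch name -> head commit id
        self._next_id = 1

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        # Forking copies only a pointer; no table data is duplicated.
        self.branches[name] = self.branches[from_branch]

    def commit(self, branch: str, changes: Dict[str, str]) -> int:
        head = self.branches[branch]
        commit_id = self._next_id
        self._next_id += 1
        self.commits[commit_id] = {**self.commits[head], **changes}
        self.parents[commit_id] = head
        self.branches[branch] = commit_id  # move the branch pointer forward
        return commit_id

    def _ancestors(self, commit_id: Optional[int]) -> Iterator[int]:
        while commit_id is not None:
            yield commit_id
            commit_id = self.parents[commit_id]

    def merge(self, source: str, target: str = "main") -> int:
        # Find the commit where the two branches diverged.
        target_history = set(self._ancestors(self.branches[target]))
        base = next(c for c in self._ancestors(self.branches[source]) if c in target_history)
        base_state = self.commits[base]
        source_state = self.commits[self.branches[source]]
        target_state = self.commits[self.branches[target]]

        # Collect every table changed on each side since the fork point.
        source_changes = {t: s for t, s in source_state.items() if base_state.get(t) != s}
        target_changed = {t for t, s in target_state.items() if base_state.get(t) != s}

        conflicts = target_changed & set(source_changes)
        if conflicts:
            raise ValueError(f"conflicting tables: {sorted(conflicts)}")

        # All source changes become visible on the target in one commit.
        return self.commit(target, source_changes)


catalog = ToyCatalog()
catalog.commit("main", {"orders": "orders@s1", "customers": "customers@s1"})

catalog.create_branch("etl")
catalog.commit("etl", {"orders": "orders@s2"})
catalog.commit("etl", {"customers": "customers@s2"})

# main is untouched while work proceeds on the branch.
main_before = dict(catalog.commits[catalog.branches["main"]])

catalog.merge("etl", "main")
main_after = dict(catalog.commits[catalog.branches["main"]])

print(main_before)
print(main_after)
```

Readers of `main` see either the state before the merge or the state after it, never a mix of the two table updates, which is the property the multi-statement transaction use case depends on.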
Getting Started
To get started using Nessie, you will need to deploy a Nessie server and then add it as a source in Dremio. For more information, see Configuring Nessie as a Source.