Dremio Arctic preview
Dremio Arctic is an intelligent metastore for Apache Iceberg, powered by Nessie. It provides a modern, cloud-native alternative to Hive Metastore. Arctic offers the following capabilities:
- Git-like data management: Brings Git-like version control to data lakes, enabling data engineers to manage the data lake with the same best practices Git enables for software development, including commits, tags, and branches.
- Works with all engines: Supports all Apache Iceberg-compatible technologies, including query engines (Dremio Sonar, Hive), processing engines (Spark), and streaming engines (Flink).
Dremio Arctic is a service within Dremio Cloud. You manage data in catalogs, which contain a listing of all the tables and views that are in the data lake along with their locations. Using catalogs, engines that access data in the data lake have a common view of all the available datasets and their current state.
An Arctic catalog consists of a metadata service. The data that you work with in Arctic, including Apache Iceberg metadata and manifest files, is stored in a storage system such as Amazon S3. Arctic itself maintains pointers to the Iceberg metadata files, and engines read the files directly from storage. Therefore, when you use an engine such as Spark or Flink, you must ensure that the engine has the necessary credentials to access your S3 buckets.
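As a rough sketch of what connecting an engine involves, the Python helper below assembles the Spark properties that the open-source Nessie and Iceberg documentation describe for a Nessie-backed Iceberg catalog. The function name, the catalog name `arctic`, the endpoint URL, the token, and the bucket path are all hypothetical placeholders; take the actual values and the authoritative property list from the Set Up Guide. S3 credentials are not part of this dictionary: the sketch assumes the engine picks them up separately, for example from the AWS default credential chain.

```python
from typing import Dict


def nessie_spark_conf(
    catalog_uri: str,
    token: str,
    warehouse: str,
    ref: str = "main",
    catalog_name: str = "arctic",  # hypothetical name for the Spark catalog
) -> Dict[str, str]:
    """Build Spark properties for a Nessie-backed Iceberg catalog (a sketch)."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Iceberg and Nessie SQL extensions (branch and tag statements).
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions"
        ),
        # Register an Iceberg catalog whose implementation is Nessie.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        # Where the metadata service lives and which branch to start on.
        f"{prefix}.uri": catalog_uri,
        f"{prefix}.ref": ref,
        # Authenticate to the metadata service with a bearer token.
        f"{prefix}.authentication.type": "BEARER",
        f"{prefix}.authentication.token": token,
        # Where table data and Iceberg metadata files are written.
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }


# Placeholder values only; none of these are real endpoints or secrets.
conf = nessie_spark_conf(
    catalog_uri="https://example.invalid/catalog",
    token="PERSONAL_ACCESS_TOKEN",
    warehouse="s3://my-bucket/warehouse",
)
```

Each key/value pair can then be passed to `SparkSession.builder.config(key, value)` before the session is created.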
The following illustration outlines the high-level architecture of Arctic:
The Dremio Arctic service is independent of the Dremio Sonar service. Arctic is a metastore that works with any query engine (such as Spark, Hive, and Sonar), while Sonar is a query engine that works directly with any metastore (such as Hive Metastore, AWS Glue, and Arctic).
Objects in Arctic
When you work in the Arctic user interface, you work in or with the objects that are depicted in this diagram:
An Arctic catalog consists of one or more branches. A branch contains zero or more folders, tables, and views. Additionally, an Arctic catalog is an Iceberg catalog and enables you to list and manage Iceberg tables and views.
A branch is a named reference and a movable pointer to a commit.
A folder helps you organize your tables in Arctic.
A table contains the data from your source, formatted as rows and columns. A table can be modified by query engines that connect to Arctic.
A view is a virtual table, created by running SQL statements or functions on a table or another view.
Git-like Data Management
Dremio Arctic is powered by Nessie, a native Apache Iceberg catalog that provides Git-like data management. As a result, data engineering teams can use commits, branches, and tags to experiment on Apache Iceberg tables.
- Commit: A transaction affecting one or more tables/views. It may take place over a short or long period of time. Examples include:
  - Updating one or more tables via SQL (INSERT, UPDATE, DELETE, MERGE, TRUNCATE) or Spark
  - Updating the definition of a view
  - Updating the schema of a table via SQL (ALTER TABLE) or Spark
- Branch: A movable pointer to a commit. Every time you commit, the branch pointer moves forward automatically.
- Tag: A named commit. You can tag a commit with a specific name so that users can refer to it without specifying a commit hash.
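To make the three terms concrete, here is a minimal toy model in Python. It is not the Arctic or Nessie API: the class `ToyCatalog`, its methods, and the metadata file paths are invented for illustration. It only shows the relationships described above, namely that a commit records the state of the tables, a branch is a pointer that moves forward with each commit, and a tag is a fixed name for one commit.

```python
import hashlib
from typing import Dict, Optional, Tuple


class ToyCatalog:
    """Illustrative model only: commits, branches, and tags as plain dicts."""

    def __init__(self) -> None:
        # commit hash -> (parent hash, message, {table name: metadata file})
        self.commits: Dict[str, Tuple[Optional[str], str, Dict[str, str]]] = {}
        self.branches: Dict[str, str] = {}  # branch name -> commit hash
        self.tags: Dict[str, str] = {}      # tag name -> commit hash
        self.branches["main"] = self._store(None, "initial commit", {})

    def _store(self, parent: Optional[str], message: str,
               tables: Dict[str, str]) -> str:
        payload = repr((parent, message, sorted(tables.items())))
        commit_hash = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.commits[commit_hash] = (parent, message, dict(tables))
        return commit_hash

    def commit(self, branch: str, message: str,
               changes: Dict[str, str]) -> str:
        """Record a new table state and move the branch pointer forward."""
        head = self.branches[branch]
        tables = dict(self.commits[head][2])
        tables.update(changes)
        new_head = self._store(head, message, tables)
        self.branches[branch] = new_head
        return new_head

    def tag(self, name: str, commit_hash: str) -> None:
        """Give a commit a fixed, human-readable name."""
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists")
        self.tags[name] = commit_hash

    def tables_at(self, ref: str) -> Dict[str, str]:
        """Resolve a branch name, tag name, or commit hash to its tables."""
        commit_hash = self.branches.get(ref) or self.tags.get(ref) or ref
        return self.commits[commit_hash][2]


catalog = ToyCatalog()
first = catalog.commit("main", "create orders",
                       {"orders": "orders/metadata/v1.metadata.json"})
catalog.tag("q1-report", first)
second = catalog.commit("main", "append to orders",
                        {"orders": "orders/metadata/v2.metadata.json"})
```

After the second commit, `main` points at `second`, while the tag `q1-report` still resolves to the state recorded by `first`.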
These capabilities enable a variety of use cases such as:
- Multi-statement transactions: With branches, data is updated in isolation and changes are merged atomically. The updates can be performed through a single engine (for example, SQL DML statements in Dremio Sonar) or through multiple engines (for example, ingest data in Spark and delete a record in Dremio Sonar), and may span any period of time and any number of users.
- Experimentation: Experimenting on the live lakehouse risks exposing incorrect or inconsistent data to other users. Instead, you can create a sandbox branch and experiment there. Because the data is not duplicated, creating a sandbox incurs no data-copying cost. When you are done, you can either delete the branch or merge your changes into the main branch.
- Reproducibility: The ability to retrain machine learning models and rebuild BI dashboards from historical data is important for reproducible research and regulatory compliance. Arctic enables any engine to access previous versions of the lakehouse by referencing a specific commit, tag (a named commit), or timestamp.
- Governance: Arctic provides a user interface familiar to users of GitHub and GitLab that makes it easy to see every commit in every branch, so that you don’t have to wonder who updated or deleted a table, or where a table originated. This is true regardless of what engine was used to make the changes (Dremio Sonar, Spark, Flink, etc.).
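The isolation, atomic merge, and historical reads described in these use cases can be sketched with another small Python toy. Again, this is an assumption-laden illustration and not how Arctic or Nessie is implemented: `ToyLakehouse`, its methods, and the file paths are hypothetical, versions are addressed by list index instead of by commit hash or timestamp, and table deletions are not modeled.

```python
from typing import Dict, List

Snapshot = Dict[str, str]  # table name -> pointer to an Iceberg metadata file


class ToyLakehouse:
    """Illustrative model only: each branch is a list of snapshots."""

    def __init__(self) -> None:
        self.history: Dict[str, List[Snapshot]] = {"main": [{}]}
        self.fork_point: Dict[str, Snapshot] = {}

    def create_branch(self, name: str, source: str = "main") -> None:
        # Only the list of pointers is copied; no table data is duplicated.
        self.history[name] = list(self.history[source])
        self.fork_point[name] = self.history[source][-1]

    def commit(self, branch: str, changes: Snapshot) -> None:
        head = self.history[branch][-1]
        self.history[branch].append({**head, **changes})

    def merge(self, source: str, target: str = "main") -> None:
        """Apply everything changed on `source` to `target` in one step."""
        base = self.fork_point[source]
        head = self.history[source][-1]
        changed = {t: loc for t, loc in head.items() if base.get(t) != loc}
        target_head = self.history[target][-1]
        conflicts = [t for t in changed if target_head.get(t) != base.get(t)]
        if conflicts:
            raise RuntimeError(f"conflicting tables: {conflicts}")
        self.history[target].append({**target_head, **changed})

    def read(self, branch: str, version: int = -1) -> Snapshot:
        """Return the latest snapshot, or an earlier one by index."""
        return self.history[branch][version]


lake = ToyLakehouse()
lake.commit("main", {"orders": "orders/v1.metadata.json"})

lake.create_branch("etl")
lake.commit("etl", {"orders": "orders/v2.metadata.json"})
lake.commit("etl", {"customers": "customers/v1.metadata.json"})

before_merge = lake.read("main")  # readers of main see none of the etl work
lake.merge("etl")
after_merge = lake.read("main")   # both changes become visible together
historical = lake.read("main", version=1)  # earlier state is still readable
```

In this sketch, the two commits on `etl` reach `main` as a single new snapshot, which is the property that makes a multi-statement change appear atomic to readers of `main`.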
The following illustration shows an example of a new branch that is forked from the main branch, then merged back atomically after multiple commits:
Currently, Arctic is available in Preview mode. To get started using Arctic:
- Set up an Arctic (Preview) catalog and connect to an engine using the Set Up Guide.
- Learn more about Arctic catalogs.