Dremio Arctic Overview (Preview)
What is Dremio Arctic?
Dremio Arctic is an intelligent metastore for Apache Iceberg, powered by Nessie. It is a modern, cloud-native alternative to Hive Metastore, and Dremio offers it as a forever-free service. Arctic offers the following capabilities:
- Git-like data management: Brings Git-like version control to data lakes, enabling data engineers to manage the data lake with the same best practices Git enables for software development, including commits, tags, and branches.
- Data optimization: Automatically maintains and optimizes data to enable faster processing and reduce the manual effort involved in managing a lake. This includes ensuring that the data is columnarized, compressed, compacted (into larger files), and partitioned appropriately when data and schemas are updated.
- Works with all engines: Supports all Apache Iceberg-compatible technologies, including query engines (Dremio Sonar, Presto, Trino, Hive), processing engines (Spark), and streaming engines (Flink).
Dremio Arctic is a managed service within Dremio Cloud. An Arctic-enabled project in Dremio Cloud includes a single repository, which is the top-level data container. Arctic consists of two primary services:
- Metadata service: Unlike Git services such as GitHub, Dremio Arctic does not host the actual data, or even the Apache Iceberg metadata and manifest files. These files are stored, by the various engines, on a storage system such as Amazon S3. Arctic simply maintains pointers to the Iceberg metadata files. Therefore, when using an engine such as Spark or Flink, you will need to ensure that the engine has the necessary credentials to access your S3 buckets.
- Data optimization service: Dremio Arctic can automatically maintain and optimize your Iceberg tables to ensure high-performance access to the data. This is helpful even for advanced data engineering teams: as data is inserted, updated, and deleted, and as schemas evolve, tables need to be compacted, repartitioned, sorted, and garbage collected. When enabling data optimization, you will need to ensure that Dremio Arctic has the necessary credentials to access your S3 buckets.
The following illustration outlines the high-level architecture of Dremio Arctic:
The Dremio Arctic service is independent of Dremio Sonar. Dremio Arctic is a metastore that works with any Iceberg-compatible query engine (such as Spark, Trino, Presto, Hive, and Dremio Sonar), while Dremio Sonar is a query engine that works with metastores such as Hive Metastore and AWS Glue, or directly with object storage such as Amazon S3.
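As a rough illustration of what connecting an engine such as Spark to a Nessie-backed catalog can look like, the sketch below assembles the Spark properties into a plain dictionary. The property names follow the Nessie integration for Iceberg on Spark as commonly documented, but they should be checked against the versions you run. The endpoint URL, token, bucket, and the catalog name `arctic` are placeholders, and the helper `nessie_spark_conf` is a hypothetical name introduced here. S3 credentials are not covered; the engine must obtain them separately (for example, from its environment).

```python
def nessie_spark_conf(catalog_uri, token, warehouse, ref="main", name="arctic"):
    """Build Spark properties for an Iceberg catalog backed by a Nessie endpoint.

    All argument values are supplied by the caller; nothing here contacts a server.
    """
    prefix = f"spark.sql.catalog.{name}"
    return {
        # Iceberg and Nessie SQL extensions (the latter adds branch/tag statements)
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": catalog_uri,           # catalog endpoint (placeholder below)
        f"{prefix}.ref": ref,                   # branch or tag to read and write
        f"{prefix}.authentication.type": "BEARER",
        f"{prefix}.authentication.token": token,
        f"{prefix}.warehouse": warehouse,       # where the engine writes table files
    }


conf = nessie_spark_conf(
    catalog_uri="https://example.invalid/your-arctic-endpoint",  # placeholder URL
    token="<personal-access-token>",                             # placeholder
    warehouse="s3://your-bucket/warehouse",                      # placeholder
)
for key in sorted(conf):
    if not key.endswith(".token"):  # avoid echoing the secret
        print(f"{key}={conf[key]}")
```

Each entry could then be passed to a Spark session builder or to `spark-submit --conf`. Note that the `warehouse` location is a bucket you own, which is why the engine itself needs S3 access.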
Git-like Data Management
Dremio Arctic is powered by Nessie, a native Apache Iceberg catalog that provides Git-like data management. It is "Git-like" because Nessie does not actually use Git under the hood (that would be orders of magnitude too slow in terms of transactions per second), and it omits some of Git's more complicated features. It does, however, use concepts that Git users already know and use regularly, such as commits, branches, and tags.
- Commit: A transaction affecting one or more tables/views. It may take place over a short or long period of time. Examples include:
  - Updating one or more tables via SQL (INSERT, UPDATE, DELETE, MERGE, TRUNCATE) or Spark
  - Updating the definition of a view (coming soon)
  - Updating the schema of a table via SQL (ALTER TABLE) or Spark
- Branch: A movable pointer to a commit. Every time you commit, the branch pointer moves forward automatically.
- Tag: A named pointer to a specific commit that, unlike a branch, does not move when new commits are made. You can tag a commit with a specific name so that users can refer to it without specifying a commit hash.
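The relationship between commits, branches, and tags can be sketched with a small in-memory model. This is a toy for building intuition only, not how Nessie or Arctic is implemented; the class `ToyCatalog`, its methods, and the file locations are hypothetical names chosen for this example.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    parent: Optional[str]               # id of the previous commit, None for the root
    tables: Tuple[Tuple[str, str], ...]  # sorted (table name, metadata location) pairs
    message: str

    @property
    def commit_id(self) -> str:
        payload = repr((self.parent, self.tables, self.message)).encode()
        return hashlib.sha256(payload).hexdigest()[:12]


class ToyCatalog:
    def __init__(self) -> None:
        root = Commit(None, (), "root")
        self.commits: Dict[str, Commit] = {root.commit_id: root}
        self.branches: Dict[str, str] = {"main": root.commit_id}
        self.tags: Dict[str, str] = {}

    def commit(self, branch: str, changes: Dict[str, str], message: str) -> str:
        head = self.commits[self.branches[branch]]
        tables = dict(head.tables)
        tables.update(changes)  # one commit may touch several tables
        new = Commit(head.commit_id, tuple(sorted(tables.items())), message)
        self.commits[new.commit_id] = new
        self.branches[branch] = new.commit_id  # the branch pointer moves forward
        return new.commit_id

    def tag(self, name: str, commit_id: str) -> None:
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists")  # tags never move
        self.tags[name] = commit_id

    def tables_at(self, ref: str) -> Dict[str, str]:
        # A ref may be a branch name, a tag name, or a raw commit id.
        commit_id = self.branches.get(ref) or self.tags.get(ref) or ref
        return dict(self.commits[commit_id].tables)


catalog = ToyCatalog()
first = catalog.commit("main", {"sales": "s3://bucket/sales/v1.metadata.json"}, "create sales")
catalog.tag("q1-close", first)
catalog.commit("main", {"sales": "s3://bucket/sales/v2.metadata.json"}, "update sales")

print(catalog.tables_at("q1-close")["sales"])  # still the v1 metadata file
print(catalog.tables_at("main")["sales"])      # the v2 metadata file
```

The catalog stores only pointers to metadata files, so committing or tagging never copies table data.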
These capabilities enable a variety of use cases such as:
- Multi-statement transactions: With branches, data is updated in isolation and changes are merged atomically. The updates can be performed through a single engine (for example, SQL DML statements in Dremio Sonar) or through multiple engines (for example, ingest data in Spark and delete a record in Dremio Sonar), and may span any period of time and any number of users.
- Experimentation: Experimenting on the live lakehouse risks exposing incorrect or inconsistent data to other users. Instead, you can create a sandbox branch and experiment there. Because the data is not duplicated, creating a sandbox incurs no additional storage cost. When you are done, you can either delete the branch or merge your changes into the main branch.
- Reproducibility: The ability to retrain machine learning models and regenerate BI dashboards from historical data is important and sometimes required by regulators. Arctic enables any engine to access previous versions of the lakehouse by referencing a specific commit, tag (a named commit), or timestamp.
- Governance: Arctic provides a GitHub-inspired interface that makes it easy to see every commit in every branch, so that you don’t have to wonder who updated or deleted a table, or where a table originated. This is true regardless of what engine was used to make the changes (Dremio Sonar, Spark, Trino, Flink, etc.).
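To make the isolation and atomic merge behavior described above concrete, here is a minimal sketch under simplifying assumptions: each branch is a list of snapshots mapping table names to metadata pointers, branches are always forked from and merged into `main`, and table deletions are ignored. The conflict rule used here (reject the merge if both sides changed the same table) is an assumption of this toy and not a statement about how Arctic resolves conflicts. `ToyLake` and its methods are hypothetical names.

```python
from typing import Dict, List

Snapshot = Dict[str, str]  # table name -> pointer to its current metadata file


class ToyLake:
    def __init__(self) -> None:
        self.history: Dict[str, List[Snapshot]] = {"main": [{}]}
        self.fork_point: Dict[str, int] = {}

    def head(self, branch: str) -> Snapshot:
        return self.history[branch][-1]

    def create_branch(self, name: str, source: str = "main") -> None:
        # Only pointers are copied; no table data is duplicated.
        self.history[name] = [dict(self.head(source))]
        self.fork_point[name] = len(self.history[source]) - 1

    def commit(self, branch: str, changes: Snapshot) -> None:
        snapshot = dict(self.head(branch))
        snapshot.update(changes)
        self.history[branch].append(snapshot)

    def merge(self, source: str, target: str = "main") -> None:
        base = self.history[target][self.fork_point[source]]
        ours, theirs = self.head(source), self.head(target)
        changed_here = {t for t in ours if ours[t] != base.get(t)}
        changed_there = {t for t in theirs if theirs[t] != base.get(t)}
        conflicts = changed_here & changed_there
        if conflicts:
            raise RuntimeError(f"conflicting tables: {sorted(conflicts)}")
        merged = dict(theirs)
        merged.update({t: ours[t] for t in changed_here})
        # A single new snapshot on the target: readers see all changes or none.
        self.history[target].append(merged)


lake = ToyLake()
lake.commit("main", {"orders": "orders-v1"})

lake.create_branch("etl")
lake.commit("etl", {"orders": "orders-v2"})        # e.g. an ingest job
lake.commit("etl", {"customers": "customers-v1"})  # e.g. a second engine

print(lake.head("main"))  # main is untouched while work happens on the branch
lake.merge("etl")
print(lake.head("main"))  # both changes become visible together
```

Deleting the `etl` branch instead of merging it would leave `main` exactly as it was, which is what makes sandbox experimentation safe.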
The following illustration shows an example of a new branch that is forked from the main branch, then merged back atomically after multiple commits:
Data Optimization (coming soon)
Dremio Arctic includes a data optimization service that maintains and optimizes the Iceberg tables. There are numerous reasons why this service is beneficial even to the most advanced data team:
- Iceberg enables users to perform record-level mutations, such as inserting, updating, and deleting specific records. Under the hood, these mutations result in new Parquet files. It is important to integrate these mutations into the table over time to avoid performance degradation.
- Iceberg enables you to modify the schema of a table. This may include changing the data type of a column, changing the partition columns, changing the sort columns, etc. While Iceberg allows these changes to happen instantaneously, the data is not automatically updated. It is important to update the data over time to ensure that the chosen schema (for example, new partition columns) is reflected in the data.
- Over time, there may be metadata and data files that are no longer needed. It is important to identify and delete these unneeded files to avoid unnecessary storage costs as well as performance degradation.
The following types of optimizations are currently supported:
- Garbage collection
Additional optimizations will be available in the GA release.
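Conceptually, garbage collection amounts to finding files in storage that no retained version of the lakehouse still references. The sketch below shows that set difference under simplifying assumptions; it is not Arctic's implementation, and a real service must also account for retention windows and writes that are still in flight. The function name `unreferenced_files` and the sample file names are hypothetical.

```python
from typing import Dict, Iterable, Set


def unreferenced_files(
    all_files: Iterable[str],
    live_refs: Dict[str, Set[str]],
) -> Set[str]:
    """Return files that no live branch or tag references.

    live_refs maps each branch or tag name to the set of files reachable from it.
    """
    referenced: Set[str] = set().union(*live_refs.values())
    return set(all_files) - referenced


files_in_storage = ["a.parquet", "b.parquet", "c.parquet", "d.parquet"]
reachable = {
    "main": {"b.parquet", "c.parquet"},
    "q1-close": {"a.parquet", "b.parquet"},  # a tag keeps older files alive
}

print(sorted(unreferenced_files(files_in_storage, reachable)))  # ['d.parquet']
```

Because tags and branches keep older files reachable, removing a stale tag or branch is what eventually makes its files eligible for collection in this model.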
Arctic currently relies on Amazon EMR to schedule the optimization jobs in your AWS account. Therefore, when creating an Arctic-enabled project in Dremio Cloud, the necessary IAM permissions for Amazon EMR are requested.
Works with All Engines
Dremio Arctic complements Apache Iceberg, and therefore works with all Iceberg-compatible engines and tools, including:
- Dremio Sonar
- Trino (coming soon)
- Presto (coming soon)
Access Management (coming soon)
Dremio Arctic supports database-style privileges. Privileges on tables, views, commits, tags, and branches can be granted to users through the Arctic REST API or with SQL GRANT commands through Dremio Sonar.
To learn more about users, roles, and privileges in Dremio Cloud, see Access Management.
Currently, Arctic is available in Preview mode. To get started using Arctic with Nessie: