Data Branching

With Arctic, you can manage your data with the same best practices Git enables for software development. For example, you can:

  • Create branches to change data without disrupting production workloads or requiring separate dev/test environments
  • Merge changes from a development branch into production only when data quality has been validated
  • Immediately undo changes and recover from mistakes
  • Reproduce models and analyses with catalog-level time travel

Data branching lets you eliminate the infrastructure costs of duplicated environments and pipelines, give line-of-business users immediate access to fresh data, and immediately roll back mistakes without data downtime. Branching also makes it easy for data analysts and data scientists to run experiments and create models on their entire lakehouse without disturbing production workloads.

The image below shows an example of a workflow that Arctic enables, in which data engineers create branches to make and validate changes to data in isolation before merging those changes into the main branch.

With branching, data engineers eliminate the need for separate physical environments for development and testing, as well as the need to replicate changes between them.

Key Concepts

Project Nessie

Arctic is powered by Project Nessie, a native Apache Iceberg catalog that enables a Git-like experience on Iceberg tables and views. Nessie applies many concepts from Git, including commits, branches, and tags, to data lakes rather than source code.

  • As a catalog, Nessie provides a consistent view of your lakehouse environment, across all tables and views.
  • Nessie does not store data files, only pointers to them. Data files are separate from the Nessie catalog, and are stored inside the customer's account.
  • Nessie does not physically copy data when creating branches. For each branch, Nessie maintains a list of files associated with each table in the catalog. If you change a table in a specific branch, Nessie updates only the list of files for that table in that branch.
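
As a rough illustration of the zero-copy idea in the last bullet, the following Python sketch models a catalog as a mapping from branch name to per-table file lists. `ToyCatalog`, its methods, and the `s3://` paths are hypothetical names invented for this example; this is not Nessie's API or its internal data model.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Hypothetical in-memory catalog: branch -> {table -> tuple of file paths}."""

    branches: dict = field(default_factory=dict)

    def create_branch(self, name: str, source: str) -> None:
        # Copy only the table -> file-list mapping (pointers).
        # No data file is read, moved, or duplicated.
        self.branches[name] = dict(self.branches[source])

    def append_file(self, branch: str, table: str, path: str) -> None:
        # A change to a table replaces that table's file list on this branch only.
        files = self.branches[branch].get(table, ())
        self.branches[branch][table] = files + (path,)


# Assumed example data: one table with one data file on main.
catalog = ToyCatalog(branches={"main": {"sales": ("s3://bucket/sales/f1.parquet",)}})
catalog.create_branch("etl", source="main")
catalog.append_file("etl", "sales", "s3://bucket/sales/f2.parquet")

print(catalog.branches["main"]["sales"])  # main is unchanged
print(catalog.branches["etl"]["sales"])   # etl sees the original file plus the new one
```

Both branches refer to the same original file path, so creating the branch costs only the size of the pointer mapping, not the size of the data.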
Note: To learn more about Project Nessie, visit About Nessie.

Glossary

Term | Definition in Arctic
Commit | The state of all objects in the catalog at a point in time. Each atomic change to tables/views in the catalog (e.g., updating one or more tables, updating the definition of a view, or updating the schema of a table) creates a new commit. Every commit (aside from the very first commit in a catalog) has references to its predecessor, and therefore previous versions of the data.
Branch | A named reference to a series of commits. Every time you make a commit to a branch, the branch automatically updates to refer to the latest version of data.
Tag | A named reference to a specific commit. You can tag a commit with a simple name so users can easily refer to data at a specific point in time without having to remember a commit hash.
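
The relationship between the three glossary terms can be sketched in a few lines of Python. This is a toy model under stated assumptions: `Commit`, `Refs`, the table name `orders`, the version labels, and the tag name `q1-report` are all hypothetical and do not correspond to Arctic's or Nessie's actual API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    """Catalog state at a point in time, linked to its predecessor."""

    state: dict                        # hypothetical: table name -> version label
    parent: Optional["Commit"] = None  # None only for the very first commit


class Refs:
    """Named references: a branch moves with each commit, a tag stays fixed."""

    def __init__(self, root: Commit) -> None:
        self.branches = {"main": root}
        self.tags = {}

    def commit(self, branch: str, changes: dict) -> Commit:
        head = self.branches[branch]
        new = Commit({**head.state, **changes}, parent=head)
        self.branches[branch] = new  # branch now refers to the latest commit
        return new

    def tag(self, name: str, branch: str) -> None:
        self.tags[name] = self.branches[branch]  # pin the current commit


refs = Refs(Commit({"orders": "v1"}))
refs.tag("q1-report", "main")
refs.commit("main", {"orders": "v2"})

print(refs.branches["main"].state)   # branch follows the newest commit
print(refs.tags["q1-report"].state)  # tag still points at the earlier state
```

Because each commit keeps a reference to its predecessor, the earlier state remains reachable from the branch head (`refs.branches["main"].parent`) as well as through the tag.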