Getting Started with Dremio Sonar and Arctic (Preview)
Dremio Arctic is a data lakehouse management service that enables you to manage your data as code with Git-like operations and automatically optimizes your data for performance. Arctic provides a catalog for Apache Iceberg tables in your data lakehouse and, by leveraging the open source Project Nessie, allows you to use Git-like operations for isolation, version control, and rollback of your datasets. These operations make it easy and efficient to experiment with, update, and audit your data with no data copies. You can use an Arctic catalog with different data lakehouse engines, including Dremio Sonar for lightning-fast queries, Apache Spark for data processing, and Apache Flink for streaming data. For more information on Arctic, see Dremio Arctic.
The following tutorial walks you through creating and using an Arctic catalog. You’ll also use the Dremio Sonar query engine to create, query, and update a table in the Arctic catalog using SQL. You’ll use Arctic’s Git-like operations, branching and merging, to isolate and verify changes in your dataset before adding them to the main branch. First, you’ll learn how to set up a Sonar project, create an Arctic catalog from the Sonar UI, and then use Sonar to create a table in the catalog. Next, you’ll use Arctic’s Git-like operations to create a development branch of your table so you can update the data in that table in isolation from the main branch. After you validate your updates, you can then merge your work to update the data in the catalog’s main branch.
After completing this tutorial, you’ll know how to:
- Add a Sonar project
- Add an Arctic catalog
- Create a table in the catalog
- Create a new branch on the table and insert new data into it
- Merge your changes into the main branch of the table
This tutorial assumes you are a new Dremio user and do not have an existing Sonar project available. If you have an existing project that you want to use for this tutorial, you can proceed directly to Step 2.
The examples below use the NYC-taxi-trips dataset from Dremio’s Samples bucket in Amazon S3. This sample data is automatically available as a source in the first Sonar project you create in Dremio. After completing this tutorial, you can continue to use this sample dataset.
An Account in an Existing Dremio Organization
Before you begin, you need an active Dremio account within an existing Dremio organization. If your company has an existing Dremio organization, contact your Dremio administrator to create an account in that organization. If you do not have an existing Dremio organization, you will need to sign up for Dremio Cloud.
An AWS Account
If you need to create a Sonar project (Step 1), you need to have access to an Amazon Web Services (AWS) account and the following:
- An AWS account that you want to use with Dremio Cloud.
- An Amazon Virtual Private Cloud (Amazon VPC) and subnets where Dremio Cloud can launch and manage engines using Amazon EC2 instances.
- The AWS permissions to run a CloudFormation Template that creates the following items:
- An EC2 security group, which acts as a virtual firewall to control the traffic that is allowed to and from the resources that the security group is associated with. The security group ensures that only traffic from Dremio Cloud reaches the resources that you have allocated for your Dremio Cloud organization.
- An S3 bucket for storing various types of Sonar project data.
- An AWS Identity and Access Management (IAM) role or IAM user that is granted the permissions for Dremio to access the project store and to manage engines. For information about the permissions you are granting Dremio, see Policy Template to Grant Access to the Project Store and Policy Template for Enabling Dremio Cloud to Manage Engines.
- Outbound connectivity from your Amazon VPC and subnets is required to allow query engines to communicate with Dremio Cloud. Engines establish a connection with the Dremio Cloud control plane using port 443 (HTTPS) outbound to the internet. No open ports are required in your Amazon VPC, and neither subnets for incoming connections nor engines require public IP addresses.
For guidance to set up the AWS account, see Prerequisites in AWS and Permissions Required by Dremio. If you do not have access to an AWS account, contact your AWS administrator to help you get the needed access.
Step 1. Add a Sonar Project
First, you need to add a Dremio Sonar project. In this Getting Started Guide, you’ll use Sonar to create, query, and update tables in an Arctic catalog.
If you have an existing Sonar project you want to use for this tutorial, proceed directly to Step 2. Or, if you prefer to add another project to use for this tutorial, see Adding a Project.
To add a Sonar project:
- From the Organization home page > Services > Sonar, select Add Sonar Project.
- In the Add Sonar Project dialog box, under Project Name, enter a name for your project. You cannot use the same name for two projects in the same organization. For this tutorial, enter MyTestProject.
- Choose the AWS Region where you want Dremio to create your project store and other cloud resources.
For a list of supported Regions, see Supported Regions.
- Click Next.
- Under Configure resource, click Launch CloudFormation Template. The AWS Console opens in a new tab. Complete the following fields in the template:
- For Stack name, enter a name. For this tutorial, enter ArcticDemoStack.
- Under the Parameters section > Dremio Configuration, do the following:
- Select a VPC.
- Select the Subnet.
- Under Capabilities, tick the checkbox for the acknowledgement.
- Click Create stack.
The CloudFormation template sets up the configurations and resources needed by Dremio Cloud. For more information, see Prerequisites in AWS and Permissions Required by Dremio. It can take a few minutes for the stack to be created in your AWS account. You can click the Refresh button to update the stack creation status. After the stack is created, return to Dremio. You’ll know the project was successfully created when you see the SQL Runner page in the Sonar project. You can immediately run a query in the SQL Runner, which displays a SELECT * statement on the NYC-taxi-trips table from Dremio’s Samples bucket in Amazon S3.
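The prepopulated statement looks similar to the following sketch (the exact statement Dremio generates, including any LIMIT clause, may differ):

```sql
-- Preview the bundled sample dataset from the Samples source
SELECT *
FROM Samples."samples.dremio.com"."NYC-taxi-trips"
LIMIT 100
```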
Step 2. Add an Arctic Catalog
Next, you need to create an Arctic catalog to store table information and enable you to update and manage your tables using Git-like operations. Although Arctic has its own service UI, you can also create an Arctic catalog directly from your Sonar project. In this Getting Started Guide, you’ll do the latter.
To add an Arctic catalog to your Sonar project:
- On the Datasets page, click Add Source at the bottom-left corner of the Data panel.
You can also click the Add Source icon next to Sources.
- In the Add Data Source dialog box, under the Data-as-code section, select Arctic (Preview).
- In the New Arctic (Preview) Source dialog box > General section > Choose a catalog, open the drop-down list of catalogs and select + New Arctic Catalog.
For this tutorial, you do not need to configure the other tabs (Storage, Advanced Options, and Privileges), which are set with defaults that you can edit as needed. For more information, see Connecting to Dremio Arctic.
- In the New Arctic Catalog dialog box, in the Catalog Name field, enter a name for your catalog. For this tutorial, enter MyTestCatalog.
- After the catalog has been added, click Save in the New Arctic (Preview) Source dialog box.
You’ll see the new Arctic catalog on the Datasets page, in the Data panel. You’ll create a table in this catalog in the next step.
Step 3. Create a Table in the Catalog
In this step, you’ll use Sonar to create a new table in the Arctic catalog using data from the NYC-taxi-trips dataset in Dremio’s Samples bucket in Amazon S3. The file format of this dataset is Apache Parquet, represented by a Table icon.
To use Sonar to create a table in the Arctic catalog:
- From the side navigation bar, select the SQL Runner icon.
- In the upper-right corner of the SQL Runner, verify that the Context (the catalog containing the dataset that the SQL statement applies to) shows the correct catalog (MyTestCatalog for this tutorial) and that Ref (the branch that the SQL statement will use) shows that you are on the main branch of the catalog.
- To create a table in your catalog’s main branch, run the following SQL statement in the SQL Runner. For this tutorial, you’ll use the table name sampleTaxiTable. The table uses the Apache Iceberg open table format, and it stores table data and metadata in the bucket in Amazon S3 (the project store that you specified in the CloudFormation template in Step 1). For this example, you’ll create the table from only 10 rows of the NYC-taxi-trips dataset to reduce table size and more easily show changes to the table in subsequent steps.
CREATE TABLE sampleTaxiTable AS SELECT * FROM Samples."samples.dremio.com"."NYC-taxi-trips" LIMIT 10
After the query completes, you’ll see the following result, which confirms the creation of the table, where it is stored (Path), and other details:
Step 4. Create a New Branch of the Table and Insert New Data Into It
In this step, you’ll create a branch in your catalog to isolate changes to your table from the main branch. Similar to Git’s version control semantics for code, a branch in Arctic represents an independent line of development isolated from the main branch. Arctic does this with no data copy, and enables you to update, insert, and delete data in tables on new branches without impacting the original table in the main branch. When you are ready to update the main branch with changes from a secondary branch, you can merge them (Step 5). For more information on branches and Arctic’s Git-like operations, see Git-like Data Management.
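As a quick reference, the branch lifecycle used in the rest of this tutorial maps to a handful of SQL statements. The summary below is a sketch; see the Dremio SQL reference for the full syntax of each command:

```sql
CREATE BRANCH etl                             -- branch off the current reference, no data copy
-- ...switch the Ref to etl, then insert and validate data...
MERGE BRANCH etl                              -- from main: fold the validated changes back in
ALTER BRANCH etl ASSIGN COMMIT "<commit-id>"  -- undo: point etl back at an earlier commit
```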
You can use Sonar to create a branch in your Arctic catalog using SQL:
- On the SQL Runner page, in the upper-right corner, verify that the Context shows the correct catalog (MyTestCatalog for this tutorial) and that Ref shows that you are on the main branch of the catalog.
- In the SQL Runner, run the following SQL statement:
CREATE BRANCH etl
The query result will show the details about the new branch you created:
Next, let’s insert a new row of data into the sampleTaxiTable in the etl branch of the catalog. This insert is isolated from the main branch of the catalog.
- Change the Ref from the main branch to the etl branch. First, verify that the Context displays the catalog name (MyTestCatalog). Then, to change the Ref from the main branch to the etl branch:
  - In the Ref field, select MyTestCatalog to open the branch dialog box.
  - Under Other Branches, select etl and click Apply.
- Insert a new row into sampleTaxiTable using Sonar by running the following SQL statement:
INSERT INTO sampleTaxiTable VALUES ('2013-02-10 20:00:00.000', 9, 3.00, 10.0, 3.0, 13.0)
After you see a confirmation message at the top of the page that the row has been inserted into the table in the etl branch, you can verify that your insert statement succeeded by running the following SQL statement:
SELECT COUNT(*) FROM sampleTaxiTable
The result has 11 rows, verifying that your row was correctly inserted (because you started with 10).
To demonstrate isolation across branches in your catalog, run the same query on the main branch: the record count is still 10. This shows that your insert was not added to the main branch. Any queries using the table in the main branch would therefore not be affected by inserts into the etl branch. In a production scenario, this is helpful because you can insert and validate new data in a new branch without affecting workloads on the main branch.
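If you prefer not to switch the Ref selector back and forth, Dremio SQL also lets you pin an individual query to a branch with an AT BRANCH clause. The comparison above can be sketched as follows (row counts per this tutorial’s data):

```sql
-- Same table, two branches: the insert is visible only on etl
SELECT COUNT(*) FROM sampleTaxiTable AT BRANCH etl;   -- 11 rows
SELECT COUNT(*) FROM sampleTaxiTable AT BRANCH main;  -- 10 rows
```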
Step 5. Merge Your Changes Into the Main Branch of the Table
Now that you’ve inserted new data and validated the changes in the isolated etl branch, you can safely merge these changes into the main branch. In a production scenario, this would be the process for updating the main branch that is consumed by production workloads.
You’ll use Sonar to merge changes into the main branch using SQL:
- First, change the Ref from the etl branch to the main branch. Verify that the Context displays the catalog name (MyTestCatalog). Then, to change the Ref from the etl branch to the main branch:
  - From Ref, hover your cursor over the branch name, etl, and click it to open the branch dialog box.
  - Under Default, select main and click Apply.
- Merge the etl branch into the main branch by running the following SQL statement:
Merging a Branch
MERGE BRANCH etl
When the merge is completed, you’ll see a confirmation message at the top of the page that the etl branch has successfully merged into the main branch. This means that the changes to the table in the etl branch are now also in the main branch. The etl branch is retained in the catalog until you manually drop the branch.
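When you no longer need the merged branch, you can delete it with a DROP BRANCH statement. The form below is a sketch; depending on your Dremio version, the command may require a FORCE keyword or an expected commit hash:

```sql
-- Remove the etl branch from the catalog after merging
DROP BRANCH etl IN MyTestCatalog
```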
To see that the data in the main branch is updated, you can count the rows using this query:
SELECT COUNT(*) FROM sampleTaxiTable
You’ll now see 11 rows in the table, reflecting the row that you inserted in the etl branch and merged into the main branch.
Recover From Mistakes
You can quickly and efficiently recover from a mistake when working in an Arctic catalog. Every change you make is contained in a commit, so you can simply revert to a previous commit. And because multiple table transactions can be grouped in one commit, you can roll back changes made to multiple tables with a single command. The example below demonstrates the command operating on one commit that applies to only one table.
Take the following steps to get the commit ID and revert to a previous commit:
- From the SQL Runner page (from Step 5), under the Data tab, locate MyTestCatalog and click it to open the catalog.
- In the main branch dialog box, in the right panel, select the etl branch.
- On the left panel, click the radio button next to Commit. Notice that all the commits in the etl branch become accessible.
- To copy a commit’s ID, hover to the right of a commit ID and click the Copy icon.
- Then, in the SQL Runner, take the following actions:
  a. In the upper-right corner, verify that the Context shows the correct catalog (MyTestCatalog) and that Ref shows that you are on the etl branch of the catalog.
  b. Run the following query to revert your etl branch to the previous commit:
ALTER BRANCH etl ASSIGN COMMIT "<commit-id>"
You can query the table and verify that it has reverted to the previous commit:
Querying the Table
SELECT * FROM sampleTaxiTable
Wrap-Up and Next Steps
Congratulations! You have successfully completed this tutorial and used Arctic’s Git-like operations to manage your data as code.
Arctic works with other engines including Apache Spark. For a tutorial on using Arctic with Apache Spark, see Getting Started with Apache Spark and Arctic.
To learn more about Dremio Arctic, see: