    Getting Started with Dremio Sonar and Arctic preview

    Dremio Arctic is a data lakehouse management service that enables you to manage your data as code with Git-like operations and automatically optimizes your data for performance. Arctic provides a catalog for Apache Iceberg tables in your data lakehouse and, leveraging the open source Project Nessie, lets you use Git-like operations for isolation, version control, and rollback of your datasets. These operations make it easy and efficient to experiment with, update, and audit your data with no data copies. You can use an Arctic catalog with different data lakehouse engines, including Dremio Sonar for lightning-fast queries, Apache Spark for data processing, and Apache Flink for streaming data. For more information on Arctic, see Dremio Arctic.

    Goal

    The following tutorial walks you through creating and using an Arctic catalog. You’ll also use the Dremio Sonar query engine to create, query, and update a table in the Arctic catalog using SQL. You’ll use Arctic’s Git-like operations, branching and merging, to isolate and verify changes in your dataset before adding them to the main branch. First, you’ll learn how to set up a Sonar project, create an Arctic catalog from the Sonar UI, and then use Sonar to create a table in the catalog. Next, you’ll use Arctic’s Git-like operations to create a development branch of your table so you can update the data in that table in isolation from the main branch. After you validate your updates, you can then merge your work to update the data in the catalog’s main branch.

    After completing this tutorial, you’ll know how to:

    1. Add a Sonar project
    2. Add an Arctic catalog
    3. Create a table in the catalog
    4. Create a new branch on the table and insert new data into it
    5. Merge your changes into the main branch of the table

    This tutorial assumes you are a new Dremio user and do not have an existing Sonar project available. If you have an existing project you want to use for this tutorial, you can proceed directly to Step 2.

    The examples below use the NYC-taxi-trips dataset from Dremio’s Samples bucket in Amazon S3. This sample data is automatically available as a source in the first Sonar project you create in Dremio. After completing this tutorial, you can continue to use this sample dataset.

    Prerequisites

    An Account in an Existing Dremio Organization

    Before you begin, you need an active Dremio account within an existing Dremio organization. If your company has an existing Dremio organization, contact your Dremio administrator to create an account in that organization. If you do not have an existing Dremio organization, you will need to sign up for Dremio Cloud.

    An AWS Account

    If you need to create a Sonar project (Step 1), you need to have access to an Amazon Web Services (AWS) account and the following:

    • An AWS account that you want to use with Dremio Cloud.
    • An Amazon Virtual Private Cloud (Amazon VPC) and subnets where Dremio Cloud can launch and manage engines using Amazon EC2 instances.
    • The AWS permissions to run a CloudFormation Template that creates the following items:
      • An EC2 security group, which acts as a virtual firewall to control the traffic that is allowed to and from the resources that the security group is associated with. The security group ensures that only traffic from Dremio Cloud reaches the resources that you have allocated for your Dremio Cloud organization.
      • An S3 bucket for storing various types of Sonar project data.
      • An AWS Identity and Access Management (IAM) role or IAM user that is granted the permissions for Dremio to access the project store and to manage engines. For information about the permissions you are granting Dremio, see Policy Template to Grant Access to the Project Store and Policy Template for Enabling Dremio Cloud to Manage Engines.
    • Outbound connectivity from your Amazon VPC and subnets is required to allow query engines to communicate with Dremio Cloud. Engines establish a connection with the Dremio Cloud control plane using port 443 (HTTPS) outbound to the internet. No open ports are required in your Amazon VPC, and neither subnets for incoming connections nor engines require public IP addresses.

    For guidance on setting up the AWS account, see Prerequisites in AWS and Permissions Required by Dremio. If you do not have access to an AWS account, contact your AWS administrator for help getting the needed access.

    Step 1. Add a Sonar Project

    First, you need to add a Dremio Sonar project. In this Getting Started Guide, you’ll use Sonar to create, query, and update tables in an Arctic catalog.

    note:

    If you have an existing Sonar project you want to use for this tutorial, proceed directly to Step 2. Or, if you prefer to add another project to use for this tutorial, see Adding a Project.

    To add a Sonar project:

    1. From the Organization home page > Services > Sonar, select Add Sonar Project. The Dremio Cloud Organization home page highlighting the Add Sonar Project button.
    2. In the Add Sonar Project dialog box, under Project Name, enter a name for your project. You cannot use the same name for two projects in the same organization. For this tutorial, enter MyTestProject.
      The Add Sonar Project dialog box.
    3. Choose the AWS Region where you want Dremio to create your project store and other cloud resources.
      Note:
      For a list of supported Regions, see Supported Regions.
    4. Click Next.
    5. Under Configure resource, click Launch CloudFormation Template. The AWS Console opens in a new tab. Complete the following fields in the template:
      1. For Stack name, enter a name. For this tutorial, enter ArcticDemoStack.
      2. Under the Parameters section > Dremio Configuration, do the following:
        1. Select a VPC.
        2. Select the Subnet.
      3. Under Capabilities, tick the checkbox for the acknowledgement.
        AWS CloudFormation template to create a stack.
    6. Click Create stack.

    The CloudFormation template sets up the configurations and resources needed by Dremio Cloud. For more information, see Prerequisites in AWS and Permissions Required by Dremio. It can take up to a few minutes for the stack to be created in your AWS account; you can click Refresh to update the stack creation status. After the stack is created, return to Dremio. You’ll know the project was successfully created when you see the SQL Runner page in the Sonar project. You can immediately run a query in the SQL Runner, which displays a SELECT * statement on the NYC-taxi-trips table from Dremio’s Samples bucket in Amazon S3.
    Your first query in the Sonar SQL Runner.

    Step 2. Add an Arctic Catalog

    Next, you need to create an Arctic catalog to store table information and enable you to update and manage your tables using Git-like operations. Although Arctic has its own service UI, you can also create an Arctic catalog directly from your Sonar project. In this Getting Started Guide, you’ll do the latter.

    To add an Arctic catalog to your Sonar project:

    1. From the Datasets page, in the bottom of the datasets panel, click (+) Add Source. A Sonar project's dataset page highlighting Add Source.

    note:

    You can also click the plus (+) icon next to Sources.

    2. In the Add Data Source dialog box, under the Data-as-code section, select Arctic (Preview).
      The Add Data Source dialog box.

    3. In the New Arctic (Preview) Source dialog box > General section > Choose a catalog, open the drop-down list of catalogs and select + New Arctic Catalog.

    note:

    For this tutorial, you do not need to configure the other tabs (Storage, Advanced Options, and Privileges), which are set with defaults that you can edit as needed. For more information, see Connecting to Dremio Arctic.

    The New Arctic Source dialog box.
    4. In the New Arctic Catalog dialog box, in the Catalog Name field, enter a name for your catalog. For this tutorial, enter MyTestCatalog.
      The New Arctic Catalog dialog box.

    5. Click Add.

    6. After the catalog has been added, click Save in the New Arctic (Preview) Source dialog box.

    You’ll see the new Arctic catalog on the datasets page, under the datasets panel. You’ll be creating a table in this catalog in the next step.
    Project's datasets page highlighting the new Arctic catalog that was created.

    Step 3. Create a Table in the Catalog

    In this step, you’ll use Sonar to create a new table in the Arctic catalog using data from the NYC-taxi-trips dataset in Dremio’s Samples bucket in Amazon S3. The file format of this dataset is Apache Parquet (purple folder icon).

    To use Sonar to create a table in the Arctic catalog:

    1. From the side navigation bar, select the SQL Runner (This is the icon for SQL Runner.) icon.

    2. In the upper-right corner of the SQL Runner, verify that the Context (the catalog containing the dataset that the SQL statement applies to) shows the correct catalog (MyTestCatalog for this tutorial) and Ref (the branch that the SQL statement will use) shows that you are on the main branch of the catalog.
      SQL Runner page with the context and reference highlighted.

    3. To create a table in your catalog’s main branch, run the following SQL statement in the SQL Runner. For this tutorial, you’ll use the table name sampleTaxiTable. The table created uses the Apache Iceberg open table format, and it will store table data and metadata in the bucket in Amazon S3 (the project store that you specified in the CloudFormation template in Step 1).

      For this example, you’ll only create a table from 10 rows of the NYC-taxi-trips dataset to reduce table size and more easily show changes to the table in subsequent steps.

    Creating a Table
    CREATE TABLE sampleTaxiTable AS 
      SELECT * 
      FROM Samples."samples.dremio.com"."NYC-taxi-trips" 
      LIMIT 10
    

    After the query completes, you’ll see the following result, which confirms the creation of the table, where it is stored (Path), and other details:
    Query result snapshot on the SQL Runner page.

    Step 4. Create a New Branch of the Table and Insert New Data Into It

    In this step, you’ll create a branch in your catalog to isolate changes to your table from the main branch. Similar to Git’s version control semantics for code, a branch in Arctic represents an independent line of development isolated from the main branch. Arctic does this with no data copy, and enables you to update, insert, and delete data in tables on new branches without impacting the original table in the main branch. When you are ready to update the main branch from changes in a secondary branch, you can merge them (Step 5). For more information on branches and Arctic’s Git-like operations, see Git-like Data Management.

    You can use Sonar to create a branch in your Arctic catalog using SQL:

    1. On the SQL Runner page in the upper-right corner, verify that the Context shows the correct catalog (MyTestCatalog for this tutorial) and Ref shows that you are on the main branch of the catalog.
    2. In the SQL Runner, run the following SQL statement:

    Creating a Branch
    CREATE BRANCH etl
    

    The query result will show the details about the new branch you created:
    Creating a branch using SQL Runner.
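
    As an alternative to switching branches in the Ref selector, you can point your SQL session at a branch directly with a USE BRANCH statement. The IN clause naming the catalog reflects an assumption about the exact syntax; check the SQL reference for your Dremio version:

    Switching the Session to a Branch
    USE BRANCH etl IN MyTestCatalog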

    Next, let’s insert a new row of data into the sampleTaxiTable in the etl branch of the catalog. Because the insert runs on the etl branch, the change is isolated from the main branch of the catalog.

    1. Change the Ref from the main branch to the etl branch. First, verify that the Context displays the catalog name (MyTestCatalog). Then, to change the Ref:
      1. In the Ref field, select MyTestCatalog to open the branch dialog box.
      2. Under Other Branches, select etl and click Apply.
    2. Insert a new row into sampleTaxiTable using Sonar by running the following SQL statement:

    Inserting a Row
    INSERT INTO sampleTaxiTable 
    VALUES ('2013-02-10 20:00:00.000', 
         9, 
         3.00, 
         10.0, 
         3.0, 
         13.0)
    

    After you see a confirmation message at the top of the page that the row has been inserted into the table in the etl branch, you can verify that your insert statement succeeded by running the following SQL statement:

    Verifying the Insert
    SELECT COUNT(*) FROM sampleTaxiTable
    
    SQL Runner page running a SQL statement to verify that the row has been inserted into the table in the etl branch.

    The result is a count of 11, verifying that your row was correctly inserted (you started with 10 rows).

    To demonstrate isolation across branches in your catalog, if you run the same query on the main branch, the record count is still 10. This shows that your insert was not added to the main branch. Any queries using the table in the main branch would therefore not be affected by inserts into the etl branch. In a production scenario, this is helpful because you can insert and validate new data in a new branch without affecting workloads on the main branch.
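
    For example, while your Ref is still set to the etl branch, you can query the table as it exists on main by adding a branch reference to the query. The AT BRANCH clause below follows Dremio’s branch-aware SQL; verify the syntax in the SQL reference for your version:

    Querying the Table on the Main Branch
    SELECT COUNT(*) FROM sampleTaxiTable AT BRANCH main

    The count returned for the main branch is still 10.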

    Step 5. Merge Your Changes Into the Main Branch of the Table

    Now that you’ve inserted new data and validated the changes in the isolated etl branch, you can safely merge these changes into the main branch. In a production scenario, this would be the process to update the main branch that is consumed by production workloads. You’ll use Sonar to merge changes to the main branch:

    1. First, change the Ref from the etl branch to the main branch. Verify that the Context displays the catalog name (MyTestCatalog). To change the Ref from the etl branch to the main branch:
      1. From Ref, hover your cursor over the branch name, etl, and click it to open the branch dialog box.
      2. Under Default, select main and click Apply.
      SQL Runner page selecting the branch name in Ref to change the branch reference.
    2. Merge the etl branch into main by running the following SQL statement:

    Merging a Branch
    MERGE BRANCH etl
    

    When the merge is completed, you’ll see a confirmation message at the top of the page that the etl branch has successfully merged into the main branch. This means that the changes made to the table in the etl branch are now reflected in the main branch. The etl branch is retained in the catalog until you manually drop it.
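
    Once you’ve verified the merge, you can remove the etl branch if you no longer need it. The FORCE keyword and IN clause below are assumptions about the exact DROP BRANCH syntax; check the SQL reference for your Dremio version:

    Dropping a Branch
    DROP BRANCH etl FORCE IN MyTestCatalog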

    To see that the data in the main branch is updated, you can count the rows using this query:

    Counting the Number of Rows in a Table
    SELECT COUNT(*) FROM sampleTaxiTable 
    

    You’ll now see 11 rows in the table, reflecting the inserted row made in and merged from the etl branch.

    Recover From Mistakes

    You can quickly and efficiently recover from a mistake when working in an Arctic catalog. Every change you make is contained in a commit, so you can simply revert to a previous commit. And because a single commit can include transactions on multiple tables, you can roll back changes to multiple tables with one command. The example below demonstrates the command operating on one commit that applies to only one table.

    Take the following steps to get the commit ID and revert to a previous commit:

    1. From the SQL Runner page (from Step 5), under the Data tab, locate MyTestCatalog and click the main branch box.
    2. In the main branch dialog box, right panel, select the etl branch.
    3. On the left panel, click the radio button next to Commit. Notice that all the commits in the etl branch become accessible.
    4. To copy a commit’s ID, hover to the right of a commit ID and click the Copy icon.
      MyTestCatalog dialog box showing available branches.
    5. Then in the SQL Runner, take the following actions:
      a. In the upper-right corner, verify that the Context shows the correct catalog (MyTestCatalog) and the Ref shows that you are on the etl branch of the catalog.
      b. Run the following query to revert your etl branch to the previous commit:

    Reverting to a Previous Commit
    ALTER BRANCH etl 
      ASSIGN COMMIT "<commit-id>"
    

    You can query the table and verify that it has reverted to the previous commit:

    Querying the Table
    SELECT * FROM sampleTaxiTable
    

    Wrap-Up and Next Steps

    Congratulations! You have successfully completed this tutorial and used Arctic’s Git-like operations to manage your data as code.

    Arctic works with other engines including Apache Spark. For a tutorial on using Arctic with Apache Spark, see Getting Started with Apache Spark and Arctic.

    To learn more about Dremio Arctic, see: