Skip to main content

Getting Started with Dremio

This tutorial introduces key Dremio concepts and tasks including:

note

This tutorial can be completed using your own data sources and datasets; for simplicity, we recommend using the datasets that we’ve provided.

Prerequisites

Before you begin, sign up for Dremio Cloud and verify that you meet the prerequisites. After you’ve signed up, you’ll be logged into your organization and navigated to your organization homepage.

This is a screenshot showing the organization homepage.

You can now get started with adding a Sonar project.

Step 1. Add a Sonar Project

note

This tutorial uses the CloudFormation template for connecting an AWS account, which is the recommended method for configuring project resources. For more information on the resources being created and the permissions granted to Dremio, see the annotated CloudFormation Template.

If you want to manually create cloud and project resources for AWS instead, see Configuring Your Cloud Resources Manually.

After addressing the prerequisites, follow these steps to create your first Sonar project:

  1. On your organization homepage, click Add Sonar Project on the Sonar card.

  2. In the Add Sonar Project dialog, for Project name, specify a name to identify your project. You can change the name of the project after creation.

  3. For Arctic catalog name, specify a name to identify the primary catalog in your project. The name cannot be changed once the catalog has been created.

  4. Select a cloud provider.

  5. Select the region in which compute resources and the project store will be created. For a list of supported regions, see Supported Regions.

  6. (Optional) Add one or more tags to help you identify compute resources for this project in your cloud provider account.

  7. Click Next.

    To continue the configuration, follow the steps for your cloud provider.

  1. Click Launch CloudFormation Template, which will open the AWS console in a new browser tab.

  2. On the “Quick create stack” page in the AWS Console, for Stack name, enter a name to identify the stack. This name must be unique to your AWS account and cannot include underscores.

  3. For VPC, select the VPC in which compute resources will be created.

  4. For Subnets, select one or more subnets where compute resources will be created.

  5. For Project Store, a unique name has been generated for the S3 bucket that will be created in your AWS account as the metadata store for your project. You can specify a different name here but it must be unique across AWS.

  6. For Encryption Type, SSE-S3 is the default encryption mechanism for the project store S3 bucket. You can select SSE-KMS (AWS Managed Key) or SSE-KMS (Customer Managed Key) for additional encryption. If you select SSE-KMS (Customer Managed Key), specify the KMS Key ARN.

  7. Check I acknowledge that AWS CloudFormation might create IAM resources.

  8. Click Create stack.
    It will take about 5 minutes to create the required storage and compute resources for the project.

When stack creation has completed in the AWS Console, return to the Dremio Console to start querying sample data and creating your first Iceberg table.
note

If you encounter errors during resource creation for AWS, see Dremio Cloud Troubleshooting for AWS.

Step 2. Create an Iceberg Table

For this step, you will be working with the NYC-taxi-trips.csv file (330+ million rows) in an Amazon S3 bucket.

You will be creating a folder and a table in your catalog. In each of the steps, replace catalog_name with the name of the catalog you created in the previous step.

Create a Folder

To create a folder where the table you will create will reside:

  1. Click This is the icon that represents the SQL runner. in the side navigation bar.

  2. Copy and paste the SQL below and replace catalog_name with the name of your catalog. Click Run.

    Create folder in catalog
    CREATE FOLDER "catalog_name"."my_folder";

Create a Table

To create a table in your catalog:

  1. Click This is the icon that represents the SQL runner. in the side navigation bar.

  2. Copy and paste the SQL below and click Run.

    Create table in folder
    CREATE TABLE "catalog_name"."my_folder"."nyc_trips" (
    pickup_datetime TIMESTAMP,
    passenger_count INT,
    trip_distance_mi FLOAT,
    fare_amount FLOAT,
    tip_amount FLOAT,
    total_amount FLOAT
    );
  3. To populate the nyc_trips table with sample data, run the SQL below.

    Populate table with data
    COPY INTO "catalog_name"."my_folder"."nyc_trips"
    FROM '@Samples/samples.dremio.com/' FILES('NYC-taxi-trips.csv');
  4. To query the populated data, run the SQL below.

    Query table
    SELECT *
    FROM "catalog_name"."my_folder"."nyc_trips";

Now that you've created your first Iceberg table in the catalog, you can accelerate your queries on this table using reflections.

Step 3. Accelerate Queries with a Reflection

Queries on the nyc_trips table are fast but queries on this table or views built from this table triggered from dashboards need to be sub-second. To optimize queries and get sub-second response times on 330+ million rows, you will create a data reflection. Data reflections organize and optimize the data close to Dremio Sonar’s query engine, using techniques such as pre-computation, columnarization, compression, sorting, partitioning, and data aggregation.

To create an aggregation reflection:

  1. Run the following SQL in the SQL Runner:

    Create reflection on the table
    ALTER TABLE                 "catalog_name"."my_folder"."nyc_trips"
    CREATE AGGREGATE REFLECTION "taxi_reflection"
    USING DIMENSIONS ("pickup_datetime")
    MEASURES (passenger_count,
    trip_distance_mi,
    fare_amount,
    tip_amount,
    total_amount);

    It takes only a few seconds to create this reflection. Creating a reflection on the table allows views created on top of that data to take advantage of the acceleration as well.

    note

    You can also create reflections using the Dremio console. See Creating Reflections to try it out.

  2. After you have created the reflection, run the query below, which takes advantage of the aggregation reflection.

    The reflection was created on the nyc_trips table but will also accelerate queries on views built on this table. When a query like this is triggered from the SQL Runner or a dashboard, data from all 330+ million rows is used to return the result.

    Query to see the impact of the reflection
    SELECT   DATE_PART('DAY', "catalog_name"."my_folder"."nyc_trips"."pickup_datetime") AS "pickup_day",
    SUM("catalog_name"."my_folder"."nyc_trips"."passenger_count") AS "passenger_count"
    FROM "catalog_name"."my_folder"."nyc_trips"
    GROUP BY pickup_day
    ORDER BY pickup_day;

Notice the query runs in less than a second.

Step 4. Add Dataset Info to Enhance Discoverability

For users who will be querying and building on the table that you created, add a markdown description and label to help them understand the data.

  1. Browse to the Datasets page and click on my_folder in the upper left corner.

  2. In my_folder, hover over the nyc_trips table that you created and click This is the icon that represents the Edit settings. on the right. Dremio takes you to the Edit Dataset page for nyc_trips, which is indicated by the name and path in the top left.

  3. In the Details tab, on the Dataset Overview panel on the right, click This is the icon that represents the Edit settings. to add a label that will help users identify the dataset, type public-data, and press Enter.

  4. In the Details tab, in the Wiki section, click Edit wiki to add markdown to describe the table and provide helpful content for the user to get started with the data (i.e., examples, usage notes, point of contact for questions or issues). Copy and paste the example below or write your own and click Save.

    Example wiki
    This table contains taxi trips data from NYC. It can be used to determine the taxi trends i.e. frequency of trips, distance traveled, and average fares or tips.

    Questions? Contact gnarly@dremio.com

    ### Columns
    `pickup_datetime`: Time at which the passenger was picked up.
    `passenger_count`: The number of passengers that were picked up for the taxi ride.
    `trip_distance_mi`: The total distance of the trip from pickup to drop off locations in miles.
    * Max: 351.0
    `fare_amount`: The dollar amount that the ride cost.
    * Max: 158995.81
    `tip_amount`: The dollar amount that the customer tipped in addition to the fare_amount.
    * Max: 888.19
    `total_amount`: The total dollar amount that the customer paid (fare + tip).

    ### Examples
    `SELECT DATE_PART('DAY', "catalog_name"."my_folder"."nyc_trips"."pickup_datetime") AS "pickup_day",
    SUM("catalog_name"."my_folder"."nyc_trips"."passenger_count") AS "passenger_count"
    FROM "catalog_name"."my_folder"."nyc_trips"
    GROUP BY pickup_day
    ORDER BY pickup_day;`

Now your users can easily understand and query the my_folder.nyc_trips table you created.

Wrap-up and Next Steps

Within minutes after creating your first project, you were able to create and populate an Iceberg table, accelerate queries using reflections, and make it easy to understand and query using the wiki and label capabilities.

Key Takeaways

  • You can create an Iceberg table in Dremio and populate it with data quickly. This enables you to know run DML operations on your data out of the box as you would in a traditional data warehouse.
  • You can accelerate your queries (triggered from the SQL Runner, dashboards, ODBC/JDBC) using data reflections.
  • You can enable your users to quickly understand, query, and work with data using the wiki and labels.

Clean Up (Optional)

Run the following SQL commands below from the SQL Runner to drop the objects that were created in this tutorial.

Drop objects
-- drop the table that you created
DROP TABLE "catalog_name"."my_folder"."nyc_trips"
AT BRANCH main;

Next Steps

Continue learning about and familiarizing yourself with the key Dremio concepts and features covered in this tutorial by using the following resources:

To apply the same concepts with your own data, see Connecting to Your Data.

Have Questions?

If you have questions about the concepts in this tutorial or other Dremio features, reach out to us using the chat feature This is the icon that represents the Intercom chat. in the bottom-right corner of the Dremio application.