On this page

    Getting Started with Dremio Cloud

    Welcome to the Getting Started guide! By the time you reach the end of it, you should feel confident that you have a basic, but solid, knowledge of how to use Dremio Cloud to achieve your data-related goals.

    What This Guide Covers

    • Adding a data lake to a project
    • Creating a physical dataset from source data
    • Creating a virtual dataset
    • Querying a virtual dataset
    • Accelerating a query with a reflection

    Signing Up for Dremio Cloud

    If you don’t already have a Dremio Cloud account, there are two ways to get one:

    • Sign up for an account, and create your own Dremio Cloud organization in the process. See Signing Up for Dremio Cloud.
    • Be invited to a Dremio Cloud organization, then follow the instructions sent to your email account.

    Before You Start

    You might be more comfortable with following the steps in this guide if you first get an overview of how Dremio Cloud works.


    Step 1: Add Dremio’s Samples Data Lake to Your Project

    In this first step, you add Dremio Cloud’s Samples data lake, which includes a number of sample datasets that you can practice with.

    Why

    Data resides in sources. A source is a data lake or external source (such as a relational database). To query data through Dremio Cloud, you first have to connect to the source of the data.

    The Samples data lake contains (among other datasets) the NYC-taxi-trips dataset, which is made available here by the NYC Taxi and Limousine Commission. The rest of this guide uses this dataset.

    How

    To add the Samples data lake, click the Add Sample Source button in the Data Lakes section of the Datasets page.

    The Samples data lake is now added to the project, as shown in this image:

    LocationDescription
    1The Samples data lake is listed in the Data Lakes section.
    2The title indicates that the Samples data lake is open.
    3The page lists the contents of the Samples data lake.

    Step 2: Create a Physical Dataset

    In this next step, you learn how to represent a folder of source files as a physical dataset.

    Why

    Dremio Cloud is for querying data that is in data lakes and other data sources. Data is stored in files that can be in many different formats. So that you can run standard SQL queries against data that is in different formats, you create entities that represent variously formatted data as tables. These entities are called physical datasets because they represent actual data. You can create a PDS from one data file or from a folder that contains data files.

    Physical datasets are the foundation on which you create virtual datasets. A virtual dataset (VDS) is a view derived from physical datasets or other virtual datasets. Virtual datasets are not copies of the data, so they use very little memory and always reflect the current state of the parent datasets they are derived from.

    The following figure shows schematically how data is represented in Dremio Cloud:

    In this step, you tell Dremio Cloud to consider all of the files in the NYC-taxi-trips folder as one physical dataset. As a result, you can create virtual datasets for different users and different uses of the NYC-taxi-trips data. However, in this guide, you create only one virtual dataset.

    How

    To create a physical dataset from the files in the NYC-taxi-trips folder:

    1. Click the samples.dremio.com folder to open it.

    2. Locate the NYC-Taxi-trips folder.

    3. Click the Format Folder icon in the Actions column for the NYC-Taxi-trips folder. You must hover your mouse cursor over the row for the folder to be able to see the button.

      The Dataset Settings dialog opens. The file format is automatically detected as Parquet and the dialog displays a subset of the data in the NYC-taxi-trips dataset.

    4. Click Save.

    Dremio Cloud creates the physical dataset (without making copies of the underlying data), runs a SELECT * statement as a preview query (which limits the number of rows returned to 70,000), opens the SQL Runner, and displays the statement and query results.

    LocationDescription
    1The indicator that displays the name of the open dataset. The icon on the left indicates the type of dataset. In this case, the icon is for a physical dataset created from a source folder.
    2The Data page is the page on which the SQL Runner is located.
    3The Data field lists the spaces and data sources in the current project. The Samples data lake is highlighted because the NYC-taxi-trips dataset is in this data lake.
    4This field shows the query that was used to generate the result set.
    5The field gives the number of rows in the result set.
    6This table displays the result set.

    Step 3: Create a Virtual Dataset in Your Homespace

    In this step, you learn how to run a query against your physical dataset, and then save that query as a virtual dataset in a space, which in this case is your homespace.

    Why

    After you create physical datasets that represent your source data as tables, you create one or more layers of virtual datasets. Think of virtual datasets as views. Users can run queries on the virtual datasets when they do not need all of the data in a physical dataset. Virtual datasets can also transform, filter, and join data, so that user queries do not have to. With virtual datasets, you can even restrict users to accessing only specific views of the data that is represented in physical datasets.

    How

    To create a virtual dataset from the physical dataset:

    1. Paste this query into the SQL Runner.

      The query returns the average tip and total amounts grouped by pick up times. It returns average tip and total amounts grouped by pick-up times for trips in NYC taxis. You can see whether there are trends in the time of day that passengers are most generous with their tips or when NYC taxis generate the most revenue.

      SELECT "pickup_datetime", AVG("tip_amount") AS "Avg Tip", AVG("total_amount") AS "Avg Total"
      FROM "NYC-taxi-trips"
      GROUP BY "pickup_datetime"
      
    2. Click Run.

      After the result set appears, Dremio Cloud displays this message: “A subset (1,047,552) of the total number of rows has been returned for optimized performance. Execute using ODBC/JDBC to retrieve the complete result set.”

      Also, note the amount of time that the query ran. The time appears to the left of the count of records.

    3. Click the Save button in the top-right corner, and then select Save As.

    4. In the Save Dataset As dialog, specify a name for the virtual dataset, such as nyc-taxi-vds, and then click Save.

    5. In the Data field on the left side of the SQL Runner, expand your homespace.

      The Data field lists the datasets in the spaces and data sources in your project. Your new virtual dataset is listed in your homespace.

    LocationDescription
    1The indicator that displays the name of the open dataset, which in this case is the name that you gave to your new virtual dataset. The icon on the left indicates the type of dataset. In this case, the icon is for a virtual dataset. Below the name is the name of the space in which the virtual dataset is located.
    2The Data field lists the virtual dataset in your homespace.
    3The listing for the virtual dataset also lists the fields that are in the dataset.

    Step 4: Run a Query Against Your Virtual Dataset

    In this step, you learn how to set the context of a query.

    Why

    The context tells the SQL Runner where in the project you want to run a query from. For example, if you run a query against your virtual dataset from the current context “Samples.samples.dremio.com”, the SQL Runner will not be able to find your dataset, which is not located in that context. If you want to run the query as it is, you need to set the context to your homespace.

    How

    To set the context of a query:

    1. Copy this query, paste it into the query field, and replace <name of your virtual dataset> with (you guessed it!) the name of your virtual dataset:

      SELECT * FROM <name of your virtual dataset>
      
    2. Set the context of the query that you are going to run.

      a. Click “Samples.samples.dremio.com” in the Context field at the top-right corner of the query field.

      b. In the Select Context dialog, select your homespace.

      c. Click Select. The Context field now is set to your homespace.

      Another way to tell the SQL Runner the location of a dataset that a query references is to use the full path of the dataset in your query. In the top-left corner of the SQL Runner, hover over the name of the space, which is directly below the name of your dataset. A clipboard icon appears. Clicking the icon copies the path to your computer’s clipboard. You can then paste the path into a query.

      Yet another way is to place your cursor in the position in a query where you want to insert a reference to a dataset. Then, locate the dataset in the Data field on the left side of the SQL Runner. Hover over the name of the dataset. A plus sign appears to the right of it. Click the plus sign. The full path to the dataset appears in the query.

      Of course, whenever you open the SQL Runner by clicking a dataset on the Datasets page, the SQL Runner displays the context of that dataset by default in the Context field, assuming that you want to run a query on that dataset.

    3. Click Run.

    LocationDescription
    1The Records field tells you how many records are in the result set. The SQL Runner limits result sets to around one million records. If you want to a result that includes all matching records beyond what the SQL Runner shows, you can use one of the supported BI clients, a custom client application, or Dremio Cloud’s SQL REST API.
    2This field shows the total execution time for a query that you have run. Note the time that your own query took to run in this step. You are going to make the query run faster in the next step.

    Step 5: Create a Reflection to Accelerate the Query

    In this step, you learn how you can make the execution time of your query unbelievably short by using a reflection.

    Why

    A reflection is an optimized materialization of source data or a query, similar to a materialized view, that is derived from an existing virtual or physical dataset.

    Dremio’s query optimizer can accelerate a query against datasets by using one or more reflections to partially or entirely satisfy that query, rather than processing the raw data in the underlying data source. Queries do not need to reference reflections directly. Instead, Dremio rewrites queries on the fly to use the reflections that satisfy them.

    There are two main types of reflections: raw and aggregation. Let’s create a raw reflection. It will have the same number of records as the virtual dataset, but it will be fraction of the size of the virtual dataset. In many cases, only a small subset of the columns are queried by users, and therefore, it makes sense to include only those columns in the raw reflection.

    How

    To create a raw reflection that is based on your virtual dataset:

    1. Click the Reflections icon at the top of the SQL Runner.

      The first location in the image is the icon for the Data page. The Data page is open, as you can tell by the fact that the icon is not grayed out.

      The second location is the icon for the Reflections page.

    1. On the Reflections page, click the toggle switch in the Raw Reflections bar.

      The page shows the Basic view. Clicking the toggle switch in this view creates a raw reflection that includes all of the fields of the dataset. In the Advanced view, you can define a fancier raw reflection, and we encourage you to do that on your own after you finish this guide.

    2. Click Save in the bottom-right corner of the page.

    3. Wait for Dremio Cloud to create the reflection.

      Dremio Cloud runs a job to create the raw reflection. To see the progress of the job, so that you know when you can re-run your query to use the reflection, click the History link on the right side of the Raw Reflections bar, as shown in the first image.

      Dremio Cloud opens the Jobs page and shows the job that it is running to create the reflection. On the left end of the job is an icon that tells you the job’s status. When the icon changes to a white checkmark in a green circle, the job is complete.

    1. Click your virtual dataset and display the SQL Runner.

      a. Click the Datasets icon in Dremio Cloud’s sidebar on the left side of the screen.

      You are now back in your homespace.

      b. Click the name of your virtual dataset.

      The SQL Runner opens and runs a preview query on your dataset. The preview query is a full select, the same query that you ran earlier.

    2. Click Run to run the query.

    The query should run much faster than it did when you ran it earlier. However, it might be the case that the query engine is asleep, and Dremio Cloud is starting it up again. Starting the engine takes time. After the query returns the result set, run it again, now that the engine is fully started. The execution time should be much faster, showing the benefit of the raw reflection.


    Finished!

    Congratulations for completing this tutorial, and many thanks to you for taking time out of your busy workday to do it!

    To learn more about Dremio Cloud, try out these other tasks: