Skip to main content

Create a Table & Add Dataset Info

For this module, you will be working with the NYC-taxi-trips.csv file (330+ million rows) in an Amazon S3 bucket.

You will be creating a folder and a table in your catalog. In each of the steps, replace catalog_name with the name of the catalog you created in the previous step.

Step 1. Create a Folder

To create a folder where the table you will create will reside:

  1. Click the SQL Runner This is the icon that represents the SQL runner. icon in the side navigation bar.

  2. Copy and paste the SQL below and replace catalog_name with the name of your catalog. Click Run.

    Create folder in catalog
    CREATE FOLDER "catalog_name"."my_folder";

Step 2. Create an Iceberg Table

To create a table in your catalog:

  1. Click the SQL Runner This is the icon that represents the SQL runner. icon in the side navigation bar.

  2. Copy and paste the SQL below and click Run.

    Create table in folder
    CREATE TABLE "catalog_name"."my_folder"."nyc_trips" (
    pickup_datetime TIMESTAMP,
    passenger_count INT,
    trip_distance_mi FLOAT,
    fare_amount FLOAT,
    tip_amount FLOAT,
    total_amount FLOAT
    );
  3. To populate the nyc_trips table with sample data, run the SQL below.

    Populate table with data
    COPY INTO "catalog_name"."my_folder"."nyc_trips"
    FROM '@Samples/samples.dremio.com/' FILES('NYC-taxi-trips.csv');
  4. To query the populated data, run the SQL below.

    Query table
    SELECT * FROM "catalog_name"."my_folder"."nyc_trips";

Now that you've created your first Iceberg table in the catalog, you can accelerate your queries on this table using reflections.

Step 3. Add Dataset Info to Enhance Discoverability

For users who will be querying and building on the table that you created, add a markdown description and label to help them understand the data.

  1. Browse to the Datasets page and click on my_folder in the upper left corner.

  2. In my_folder, hover over the nyc_trips table that you created and click This is the icon that represents the Edit settings. on the right. Dremio takes you to the Edit Dataset page for nyc_trips, which is indicated by the name and path in the top left.

  3. In the Details tab, on the Dataset Overview panel on the right, click This is the icon that represents the Edit settings. to add a label that will help users identify the dataset, type public-data, and press Enter.

  4. In the Details tab, in the Wiki section, click Edit wiki to add markdown to describe the table and provide helpful content for the user to get started with the data (i.e., examples, usage notes, point of contact for questions or issues). Copy and paste the example below or write your own and click Save.

    Example wiki
    This table contains taxi trips data from NYC. It can be used to determine the taxi trends i.e. frequency of trips, distance traveled, and average fares or tips.

    Questions? Contact gnarly@dremio.com

    ### Columns
    `pickup_datetime`: Time at which the passenger was picked up.
    `passenger_count`: The number of passengers that were picked up for the taxi ride.
    `trip_distance_mi`: The total distance of the trip from pickup to drop off locations in miles.
    * Max: 351.0
    `fare_amount`: The dollar amount that the ride cost.
    * Max: 158995.81
    `tip_amount`: The dollar amount that the customer tipped in addition to the fare_amount.
    * Max: 888.19
    `total_amount`: The total dollar amount that the customer paid (fare + tip).

    ### Examples
    `SELECT DATE_PART('DAY', "catalog_name"."my_folder"."nyc_trips"."pickup_datetime") AS "pickup_day",
    SUM("catalog_name"."my_folder"."nyc_trips"."passenger_count") AS "passenger_count"
    FROM "catalog_name"."my_folder"."nyc_trips"
    GROUP BY pickup_day
    ORDER BY pickup_day;`

Now your users can easily understand and query the my_folder.nyc_trips table you created.

Next: Accelerate Queries Using Reflections