On this page

    SELECT

    Dremio supports querying using standard SELECT statements. You can query tables and views that are contained in Dremio’s space entity (including your home space). You can query tables contained in the scratch directory and in the sources that are connected to Dremio (such as Amazon S3 and Oracle).

    When working with Apache Iceberg tables, you can query a table’s metadata as well as run queries by snapshot ID.

    note:

    For Apache Iceberg tables, Dremio supports only the copy-on-write storage mechanism and reads only the latest data files for each Iceberg v2 table that you run SQL commands against. Dremio does not support Iceberg v2 tables that have merge-on-read manifests.

    Querying a Table’s Data

    Syntax
    [ WITH <CTE_name> [ ( cte_column1, cte_column2, ... ) ] 
       AS ( query ), 
       
    SELECT [ ALL | DISTINCT ] 
       {
       *
       | <column_name1>, <column_name2>, ... 
       }
    FROM <table_path>
       <!-- Note: The following syntax applies to Apache Iceberg tables only
       [
          AT SNAPSHOT '<snapshot_id>'
       ]
       | <iceberg_metadata> 
       -->
       
    [ { PIVOT | UNPIVOT } ( <expression> ) ]
    [ WHERE <condition> ]
    [ GROUP BY <expression> ]
    [ QUALIFY <expression> ]
    [ ORDER BY <column_name1>, <column_name2>, ... [ DESC ] ]
    [ LIMIT <count> ]
    

    Parameters

    *

    Indicates that you want to query all columns in the table.


    <column_name1>, <column_name2>, ...

    String

    The name of the column(s) that you want to query.


    <table_path>

    String

    The path to the source that you want to query a table in. The source can be in a space, your scratch directory, or a data source.

    Parameters for Iceberg Tables Only

    AT SNAPSHOT '<snapshot_id>'

    String

    A time-travel query that enables you to specify an earlier version of a table to read. A snapshot ID is obtained either through the table_history() or table_snapshot() metadata function.


    <iceberg_metadata>

    Iceberg includes helpful system-table references which provide easy access to Iceberg-specific information on tables, including:

    • The data files for a table

    • The history of a table

    • The manifest files for a table

    • The snapshots for a table

    Supported Iceberg metadata clauses include:
    • TABLE( table_files( '<table_path>' ) ): Query an Iceberg table's data file metadata using the table_files() function. Dremio returns records that have these fields:

      ColumnData TypeDescription
      file_pathVARCHARFull file path and name
      file_formatVARCHARFormat, for example, PARQUET
      record_countBIGINTNumber of rows
      file_size_in_bytesBIGINTSize of the file
      column_sizesVARCHARList of columns with the size of each column
      value_countsVARCHARList of columns with the number of records with a value
      null_value_countsVARCHARList of columns with the number of records as NULL
      nan_value_countsVARCHARList of columns with the number of records as NaN
      lower_boundsVARCHARList of columns with the lower bound of each
      upper_boundsVARCHARList of columns with the upper bound of each
      key_metadataVARCHARKey metrics
      split_offsetsVARCHARSplit offsets

    • TABLE( table_files( '<table_history>' ) ): Query an Iceberg table's history metadata using the table_history() function. Dremio returns records that have these fields:

      ColumnData TypeDescription
      made_current_atTIMESTAMPThe timestamp the Iceberg snapshot was made at
      snapshot_idVARCHARThe Iceberg snapshot
      parent_idVARCHARThe parent snapshot ID, null if not exists
      is_current_ancestorBOOLEANIf the snapshot is part of the current history, shows abandoned snapshots

    • TABLE( table_files( '<table_manifests>' ) ): Query an Iceberg table's manifest file metadata using the table_manifests() function. Dremio returns records that have these fields:

      ColumnData TypeDescription
      pathVARCHARFull path and name of the manifest file
      lengthBIGINTSize in bytes
      partition_spec_idVARCHARID of the partition
      added_snapshot_idVARCHARID of the snapshot added to the manifest
      added_data_files_countBIGINTNumber of new data files added
      existing_data_files_countBIGINTNumber of existing data files
      deleted_data_files_countBIGINTNumber of files removed
      partition_summariesVARCHARPartition information

    • TABLE( table_files( '<table_snapshot>' ) ): Query an Iceberg table's manifest file metadata using the table_snapshot() function. Dremio returns records that have these fields:

      ColumnData TypeDescription
      committed_atTIMESTAMPThe timestamp the Iceberg snapshot was committed
      snapshot_idVARCHARThe Iceberg snapshot ID
      parent_idVARCHARThe parent snapshot ID, null if it does not exist
      operationVARCHARThe Iceberg operation (for example, append)
      manifest_listVARCHARList of manifest files for the snapshot
      summaryVARCHARAdditional attributes (records added, etc.)

    Optional Parameters

    WITH <CTE_name> [ ( cte_column1, cte_column2, ... ) ] AS ( query ), ]

    String

    Dremio supports common table expression (CTE), which is a named subquery defined in a WITH clause. A CTE is a temporary view that is used in the statement (that is defining the CTE). This expression defines the temporary view’s name, an optional list of column names, and a query expression (that is, a SELECT statement). The following parameters are used:

    • <CTE_name>: The name of the CTE you are defining. The CTE must have a unique name within a given query.
    • <cte_column#>: The names of the columns from the query that defines the CTE.
    • <query>: The query (SELECT) statement that defines the CTE.


    [ ALL | DISTINCT ]

    String

    Specifies the result set that is returned. Similar to the asterisk (*), ALL returns all the values in the result set. DISTINCT eliminate duplicates from the result set. If you do not specify an option, the default is ALL.


    WHERE <condition>

    Boolean

    Use the WHERE clause to filter your query and extract only the records that fulfill a specified condition. The following operators can be used: +, >, <, >=, <=, { <> | != }, BETWEEN, LIKE, IN. Additionally, <condition> can include logical operators, such as AND, OR, and NOT.


    { PIVOT | UNPIVOT } ( <expression> )

    String

    PIVOT converts a set of data from rows into columns. UNPIVOT converts a set of data from columns into rows. The expression can be one of the following:

    • pivot_clause: The query to aggregate the data on.
    • pivot_for_clause: Which columns to group and pivot on.
    • pivot_in_clause: Filters the values for the columns pivot_for_clause. Each of the values in this clause will be a separate column.

    Note:

    This keyword is applied to a SELECT statement. The syntax does not support an alias between the table/subquery and either the PIVOT or UNPIVOT clause. For example, SELECT name, dept FROM employees) <alias> PIVOT <query> is not supported.


    GROUP BY <expression>

    String

    Groups rows with the same group-by-item expressions and computes aggregate functions (such as COUNT(), MAX(), MIN(), SUM(), AVG() ) for the resulting group. A GROUP BY expression can be one or more column names, a number referencing a position in the SELECT list, or a general expression.


    QUALIFY <expression>

    Boolean

    Filters the results of window functions. To use QUALIFY, at least one window function must be present in either the SELECT statement or within the QUALIFY expression. The expression filters the result after aggregates and window functions are computed; it can also contain window functions. The boolean expression can be the result of a subquery.


    ORDER BY <column_name1>, <column_name2>, … [ DESC ]

    String

    Sorts the result by a specific column. By default, the records are sorted in ascending order. Use DESC to sort the records in descending order.


    LIMIT <count>

    Integer

    Constrains the maximum number of rows returned by the query. Must be a non-negative integer.

    Examples

    note:

    The following examples use Dremio’s Samples Data Lake, which includes a number of sample datasets that you can practice with.

    Query an existing table in a data lake source
    SELECT * 
    FROM Samples."samples.dremio.com"."zips.json"
    
    Query a specified column in an existing table
    SELECT city 
    FROM Samples."samples.dremio.com"."zips.json"
    
    Query a table using the DISTINCT option to eliminate duplicates from the result set
    SELECT DISTINCT city 
    FROM Samples."samples.dremio.com"."zips.json"
    
    Query a table and filter the results using the WHERE clause
    SELECT * 
    FROM Samples."samples.dremio.com"."zips.json"
    WHERE state = 'MA' AND city = 'AGAWAM'
    
    Query a table and group and order the result by the specified expression
    SELECT COUNT(city), city, state
    FROM Samples."samples.dremio.com"."zips.json"
    GROUP BY state, CITY
    ORDER BY COUNT(city) DESC
    
    Query an existing table using a CTE clause
    WITH cte_quantity (Total)
        AS (
            SELECT SUM(passenger_count) as Total
            FROM Samples."samples.dremio.com"."NYC-taxi-trips" where passenger_count > 2
            GROUP BY pickup_datetime
            )
    SELECT AVG(Total) average_pass
    FROM cte_quantity
    
    Query a table and filter the result using QUALIFY with window functions in the SELECT list
    SELECT passenger_count, trip_distance_mi, fare_amount,
    RANK() OVER (PARTITION BY passenger_count ORDER BY trip_distance_mi) AS pc_rank
    FROM "NYC-taxi-trips"
    QUALIFY pc_rank = 1
    
    Query a table and filter the result using QUALIFY with window functions in the QUALIFY clause
    SELECT passenger_count, trip_distance_mi, fare_amount
    FROM "NYC-taxi-trips"
    QUALIFY RANK() OVER (PARTITION BY passenger_count ORDER BY trip_distance_mi) = 1
    
    Query a table using the PIVOT and UNPIVOT clauses
    ALTER DATASET Samples."samples.dremio.com"."SF weather 2018-2019.csv" REFRESH METADATA auto promotion FORCE UPDATE;
    
    SELECT * FROM (
        SELECT  EXTRACT(YEAR FROM CAST(F AS DATE)) as "YEAR",
                EXTRACT(MONTH FROM CAST(F AS DATE)) as "MONTH",
                K as MAX_TEMP
        FROM Samples."samples.dremio.com"."SF weather 2018-2019.csv"
        where F <> 'DATE'
    )
    PIVOT (
        max(MAX_TEMP) for "MONTH" in (1 as JAN, 2 as FEB, 3 as MAR, 4 as APR, 5 as MAY, 6 as JUN, 7 as JUL, 8 as AUG, 9 as SEP, 10 as OCT, 11 as NOV, 12 as "DEC")
    )
    UNPIVOT (
        GLOBAL_MAX_TEMP for "MONTH" in (JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, "DEC")
    )
    ORDER BY "YEAR", "MONTH";
    

    Examples for Iceberg Tables

    Query an Iceberg table using a snapshot ID
    SELECT * 
    FROM myTable AT SNAPSHOT '5393090506354317772'
    
    Query an Iceberg table using the table's history metadata and a snapshot ID
    SELECT * 
    FROM TABLE(table_history('myTable')) 
    WHERE snapshot_id = 4593468819579153853
    
    Query an Iceberg table to find the number of snapshots for a table
    SELECT count(*) 
    FROM TABLE(table_snapshot('myTable')) 
    GROUP BY snapshot_id