Skip to main content

SELECT

Dremio supports querying using standard SELECT statements. You can query tables and views that are contained in connected sources and Arctic catalogs.

When working with Apache Iceberg tables, you can query a table's metadata as well as run queries by snapshot ID.

note

For Apache Iceberg tables, Dremio supports only the copy-on-write storage mechanism and reads only the latest data files for each Iceberg v2 table that you run SQL commands against. Dremio does not support Iceberg v2 tables that have merge-on-read manifests.

Syntax
[ WITH ... ]
SELECT [ ALL | DISTINCT ]
{ * | <column_name1>, <column_name2>, ... }
FROM { <table_name> | <view_name>
| TABLE ( <iceberg_metadata> ( <table_name> ) )
| UNNEST ( <list_expression> ) [ WITH ORDINALITY ] }
[ { PIVOT | UNPIVOT } ( <expression> ) ]
[ WHERE <condition> ]
[ GROUP BY <expression> ]
[ QUALIFY <expression> ]
[ ORDER BY <column_name1>, <column_name2>, ... [ DESC ] ]
[ LIMIT <count> ]
[ AT { { REF[ERENCE] | BRANCH | TAG | COMMIT } <reference_name>
[ AS OF <timestamp> ]
| { SNAPSHOT <snapshot_id> | <timestamp> } } ]

Parameters

[ WITH ... ] String   Optional

Defines a common table expression (CTE), which is a named subquery. For more information, read WITH.


[ ALL | DISTINCT ] String   Optional

Specifies the result set that is returned. Similar to the asterisk (*), ALL returns all the values in the result set. DISTINCT eliminates duplicates from the result set. If you do not specify an option, the default is ALL.


*

Indicates that you want to query all columns in the table.


<column_name1>, <column_name2>, ... String

The name of the column(s) that you want to query.


FROM { <table_name> | <view_name> } String

The name of the table or view that you want to select from. The table can be in the scratch directory or a data lake source.


FROM TABLE ( <iceberg_metadata> ( <table_name> ) ) String

The <table_name> is the name of the Iceberg table that you want to select from and the name must be enclosed in single quotes. For <iceberg_metadata>, Iceberg includes helpful system-table references which provide easy access to Iceberg-specific information on tables, including:

  • The data files for a table
  • The history of a table
  • The manifest files for a table
  • The snapshots for a table

Supported Iceberg metadata clauses include:

  • TABLE ( table_files ( <table_name> ) ): Query an Iceberg table's data file metadata using the table_files() function. Dremio returns records that have these fields:

    ColumnData TypeDescription
    file_pathVARCHARFull file path and name
    file_formatVARCHARFormat, for example, PARQUET
    partitionVARCHARPartition information
    record_countBIGINTNumber of rows
    file_size_in_bytesBIGINTSize of the file
    column_sizesVARCHARList of columns with the size of each column
    value_countsVARCHARList of columns with the number of records with a value
    null_value_countsVARCHARList of columns with the number of records as NULL
    nan_value_countsVARCHARList of columns with the number of records as NaN
    lower_boundsVARCHARList of columns with the lower bound of each
    upper_boundsVARCHARList of columns with the upper bound of each
    key_metadataVARCHARKey metrics
    split_offsetsVARCHARSplit offsets
  • TABLE ( table_history ( <table_name> ) ): Query an Iceberg table's history metadata using the table_history() function. Dremio returns records that have these fields:

    ColumnData TypeDescription
    made_current_atTIMESTAMPThe timestamp the Iceberg snapshot was made at
    snapshot_idVARCHARThe Iceberg snapshot
    parent_idVARCHARThe parent snapshot ID, null if not exists
    is_current_ancestorBOOLEANIf the snapshot is part of the current history, shows abandoned snapshots
  • TABLE ( table_manifests ( <table_name> ) ): Query an Iceberg table's manifest file metadata using the table_manifests() function. Dremio returns records that have these fields:

    ColumnData TypeDescription
    pathVARCHARFull path and name of the manifest file
    lengthBIGINTSize in bytes
    partition_spec_idVARCHARID of the partition
    added_snapshot_idVARCHARID of the snapshot added to the manifest
    added_data_files_countBIGINTNumber of new data files added
    existing_data_files_countBIGINTNumber of existing data files
    deleted_data_files_countBIGINTNumber of files removed
    partition_summariesVARCHARPartition information
  • TABLE ( table_snapshot ( <table_name> ) ): Query an Iceberg table's snapshot metadata using the table_snapshot() function. Dremio returns records that have these fields:

    ColumnData TypeDescription
    committed_atTIMESTAMPThe timestamp the Iceberg snapshot was committed
    snapshot_idVARCHARThe Iceberg snapshot ID
    parent_idVARCHARThe parent snapshot ID, null if it does not exist
    operationVARCHARThe Iceberg operation (for example, append)
    manifest_listVARCHARList of manifest files for the snapshot
    summaryVARCHARAdditional attributes (records added, etc.)

FROM UNNEST ( <list_expression> ) [ WITH ORDINALITY ] String   Optional

Expands a LIST into a table with a single row for each element in the LIST. WITH ORDINALITY returns an additional column with the offset of each element, which can be used with the ORDER BY clause to order the rows by their offsets. UNNEST cannot unnest a correlated variable.


{ PIVOT | UNPIVOT } ( <expression> ) String   Optional

PIVOT converts a set of data from rows into columns. UNPIVOT converts a set of data from columns into rows. The expression can be one of the following:

  • pivot_clause: The query to aggregate the data on.
  • pivot_for_clause: Which columns to group and pivot on.
  • pivot_in_clause: Filters the values for the columns pivot_for_clause. Each of the values in this clause will be a separate column.
note

This keyword is applied to a SELECT statement. The syntax does not support an alias between the table/subquery and either the PIVOT or UNPIVOT clause. For example, SELECT (name, dept FROM employees) <alias> PIVOT <query> is not supported.


WHERE <condition> Boolean   Optional

Filters your query and extracts only the records that fulfill a specified condition. The following operators can be used: +, >, <, >=, <=, { <> | != }, BETWEEN, LIKE, IN. Additionally, <condition> can include logical operators, such as AND, OR, and NOT.


GROUP BY <expression> String   Optional

Groups rows with the same group-by-item expressions and computes aggregate functions (such as COUNT(), MAX(), MIN(), SUM(), AVG() ) for the resulting group. A GROUP BY expression can be one or more column names, a number referencing a position in the SELECT list, or a general expression.


QUALIFY <expression> Boolean;Optional

Filters the results of window functions. To use QUALIFY, at least one window function must be present in either the SELECT statement or within the QUALIFY expression. The expression filters the result after aggregates and window functions are computed; it can also contain window functions. The boolean expression can be the result of a subquery.


ORDER BY <column_name1>, <column_name2>, … [ DESC ] String   Optional

Sort the result by a specific column. By default, the records are sorted in ascending order. Use DESC to sort the records in descending order.


LIMIT <count> Integer   Optional

Constrains the maximum number of rows returned by the query. Must be a non-negative integer.


AT { REF[ERENCE] | BRANCH | TAG | COMMIT } <reference_name> String   Optional

Specifies a reference to run the query against. When this parameter is omitted, the current reference is used.

  • REF: Identifies a reference to run the query against, which can be a branch, tag, or commit.
  • BRANCH: Identifies the branch reference to run the query against.
  • TAG: Identifies the tag reference to run the query against.
  • COMMIT: Identifies the commit reference to run the query against. Commit hashes must be enclosed in double quotes (for example, “ff2fe50fef5a030c4fc8e61b252bdc33c72e2b6f929d813833d998b8368302e2”).

AS OF <timestamp> String   Optional

Changes the commit reference point to the provided timestamp. Can only be applied to REF, BRANCH, and TAG. <timestamp> may be any SQL expression that resolves to a single timestamp type value, for example: CAST( DATE_SUB(CURRENT_DATE,1) AS TIMESTAMP ) or TIMESTAMP '2022-07-01 01:30:00.000'.


AT SNAPSHOT <snapshot_id> String   Optional

Applies to Iceberg and Delta Lake tables only. A time-travel query that enables you to specify an earlier version of a table to read. A snapshot ID is obtained either through the table_history() or table_snapshot() metadata function. Must be enclosed in single quotes.


AT <timestamp> String   Optional

Available for Iceberg and Delta Lake table queries only. Changes the commit reference point to the most recent Iceberg snapshot as of the provided timestamp. <timestamp> may be any SQL expression that resolves to a single timestamp type value, for example: CAST( DATE_SUB(CURRENT_DATE,1) AS TIMESTAMP ) or TIMESTAMP '2022-07-01 01:30:00.000'.

Examples

Query an existing table in a data lake source
SELECT *
FROM Samples."samples.dremio.com"."zips.json";
Query a specified column in an existing table
SELECT city
FROM Samples."samples.dremio.com"."zips.json";
Query a table using the DISTINCT option to eliminate duplicates from the result set
SELECT DISTINCT city
FROM Samples."samples.dremio.com"."zips.json";
Query a table and filter the results using the WHERE clause
SELECT *
FROM Samples."samples.dremio.com"."zips.json"
WHERE state = 'MA' AND city = 'AGAWAM';
Query a table and filter the result using QUALIFY with window functions in the SELECT list
SELECT passenger_count, trip_distance_mi, fare_amount,
RANK() OVER (PARTITION BY passenger_count ORDER BY trip_distance_mi) AS pc_rank
FROM "NYC-taxi-trips"
QUALIFY pc_rank = 1;
Query a table and filter the result using QUALIFY with window functions in the QUALIFY clause
SELECT passenger_count, trip_distance_mi, fare_amount
FROM "NYC-taxi-trips"
QUALIFY RANK() OVER (PARTITION BY passenger_count ORDER BY trip_distance_mi) = 1;
Query a table and group and order the result by the specified expression
SELECT COUNT(city), city, state
FROM Samples."samples.dremio.com"."zips.json"
GROUP BY state, CITY
ORDER BY COUNT(city) DESC;
Query an existing table using a CTE clause
WITH cte_quantity (Total)
AS (
SELECT SUM(passenger_count) as Total
FROM Samples."samples.dremio.com"."NYC-taxi-trips" where passenger_count > 2
GROUP BY pickup_datetime
)
SELECT AVG(Total) average_pass
FROM cte_quantity;
Query a table using the PIVOT and UNPIVOT clauses
ALTER DATASET Samples."samples.dremio.com"."SF weather 2018-2019.csv" REFRESH METADATA auto promotion FORCE UPDATE;

SELECT * FROM (
SELECT EXTRACT(YEAR FROM CAST(F AS DATE)) as "YEAR",
EXTRACT(MONTH FROM CAST(F AS DATE)) as "MONTH",
K as MAX_TEMP
FROM Samples."samples.dremio.com"."SF weather 2018-2019.csv"
where F <> 'DATE'
)
PIVOT (
max(MAX_TEMP) for "MONTH" in (1 as JAN, 2 as FEB, 3 as MAR, 4 as APR, 5 as MAY, 6 as JUN, 7 as JUL, 8 as AUG, 9 as SEP, 10 as OCT, 11 as NOV, 12 as "DEC")
)
UNPIVOT (
GLOBAL_MAX_TEMP for "MONTH" in (JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, "DEC")
)
ORDER BY "YEAR", "MONTH";
Return a table with a single row for each element in a LIST
SELECT   index,
UPPER(array_item)
FROM UNNEST ( ARRAY [ 'a', 'b', 'c' ]) WITH ORDINALITY AS my_table ( array_item, index)
ORDER BY index;
Query a table using a branch reference
SELECT *
FROM myCatalog.demo_table AT REF main_branch;
Query a view using a commit
SELECT *
FROM myCatalog.demo_view AT COMMIT "7f643f2b9cf250ce1f5d6ff4397237b705d866fbf34d714";
Time travel query on an Iceberg table using a timestamp
SELECT *
FROM myTable AT TIMESTAMP '2022-01-01 17:30:50.000';
Time travel query on Iceberg table using a snapshot ID
SELECT *
FROM myTable AT SNAPSHOT '5393090506354317772';
Query an Iceberg table using the table's history metadata and a snapshot ID
SELECT *
FROM TABLE(table_history('myTable'))
WHERE snapshot_id = 4593468819579153853;
Query an Iceberg table to find the number of snapshots for a table
SELECT count(*)
FROM TABLE(table_snapshot('myTable'))
GROUP BY snapshot_id;

Column Aliasing

If you specify an alias for a column or an expression in the SELECT clause, you can refer to that alias elsewhere in the query, including in the SELECT list or in the WHERE clause.

Example 1
SELECT c_custkey AS c, lower(c)
FROM "customer.parquet";
Example 2
SELECT c_custkey AS c, lower(c)
FROM (
SELECT c_custkey, c_mktsegment AS c
FROM "customer.parquet");
Example 3
SELECT  c_name AS n, n
FROM (
SELECT c_mktsegment AS n, c_name
FROM "customer.parquet")
AS MY_TABLE
WHERE n = 'BUILDING';
Example 4
SELECT c_custkey
FROM (
SELECT c_custkey, c_name AS c
FROM "customer.parquet" )
WHERE c = 'aa';
Example 5
SELECT *
FROM (
SELECT c_custkey AS c, c_name
FROM "customer.parquet" )
JOIN "orders.parquet" ON c = o_orderkey;
Example 6
SELECT c_custkey AS c
FROM "customer.parquet"
JOIN "orders.parquet" ON c = o_orderkey;

Distributing Data Evenly Across Execution Engines During Joins

You can use a BROADCAST hint if a query profile indicates that data involved in a join of two tables is heavily skewed and overloading one or more execution engines. The hint forces an even distribution of the data across all execution engines.

note

These hints are ignored for nested-loop joins and are not supported on views.

A BROADCAST hint must be used immediately after the name of a table.

Syntax of a BROADCAST hint
/*+ BROADCAST */
Example 1
SELECT *
FROM T1 /*+ BROADCAST */
INNER JOIN t2 ON t1.key = t2.key
INNER JOIN t3 ON t2.key = t3.key;
Example 2
SELECT *
FROM T1
INNER JOIN (select key, max(cost) cost from t2 /*+ BROADCAST */) T2 ON t1.key = t2.key
INNER JOIN t3 ON t2.key = t3.key;

Querying information about rejected records in CSV or JSON files used by a COPY INTO operation for which ON_ERROR was set to 'continue'

Queries use the copy_errors() function.

Syntax
SELECT * 
FROM TABLE(copy_errors(<table_name>, [<query_id>])
  • <table_name>
    The name of the target table on which the COPY INTO operation was performed.

  • <query_id>
    Optional parameter. The ID of the job that ran the COPY INTO operation. You can obtain this ID from the SYS.COPY_ERRORS_HISTORY system table. If you do not specify an ID, the default value is the ID of the last job started by the current user to run COPY INTO on the target table.

The records returned consist of these fields:

ColumnData TypeDescription
job_idstringThe ID of the job that ran the COPY INTO operation.
file_namestringThe full path of the file where the validation error was encountered.
line_numberlongThe number of the line (physical position) in the file where the error was encountered.
row_numberlongThe row (record) number in input file.
column_namestringThe name of the column where the error was encountered.
errorstringA message describing the error.