SELECT
Dremio supports querying using standard SELECT
statements. You can query tables and views that are contained in connected sources and Arctic catalogs.
When working with Apache Iceberg tables, you can query a table's metadata as well as run queries by snapshot ID.
For Apache Iceberg tables, Dremio supports only the copy-on-write storage mechanism and reads only the latest data files for each Iceberg v2 table that you run SQL commands against. Dremio does not support Iceberg v2 tables that have merge-on-read manifests.
[ WITH ... ]
SELECT [ ALL | DISTINCT ]
{
{ * | <column_name1>, <column_name2>, ... }
FROM { <table_name> | <view_name> | TABLE( <iceberg_metadata> ( '<table_name>' ) ) | UNNEST( <list_expression> ) [ WITH ORDINALITY ] }
[ { PIVOT | UNPIVOT } <expression> ]
[ WHERE <condition> ]
[ GROUP BY <expression> ]
[ QUALIFY <expression> ]
[ ORDER BY <column_name1>, <column_name2>, ... [ DESC ] ]
[ LIMIT <count> ]
[ AT {
REF[ERENCE] <reference_name>
| BRANCH <branch_name>
| TAG <tag_name>
| COMMIT <"commit_hash">
| SNAPSHOT '<snapshot_id>'
| <timestamp>
}
]
}
Parameters
[ WITH ... ]
String
Optional
Defines a common table expression (CTE), which is a named subquery.
For more information, read WITH.
[ ALL | DISTINCT ]
String
Optional
Specifies the result set that is returned. Similar to the asterisk (*), ALL
returns all the values in the result set. DISTINCT
eliminates duplicates from the result set. If you do not specify an option, the default is ALL
.
*
Indicates that you want to query all columns in the table.
<column_name1>, <column_name2>, ...
String
The name of the column(s) that you want to query.
{ <table_name> | <view_name> | TABLE( <iceberg_metadata> ( '<table_name>' ) ) }
String
The path to the source that you want to query a table in. The source can be in a space, your scratch directory, an Arctic catalog, or a data source.
For <iceberg_metadata>
, Iceberg includes helpful system-table references which provide easy access to Iceberg-specific information on tables, including:
- The data files for a table
- The history of a table
- The manifest files for a table
- The snapshots for a table
TABLE( table_files( '<table_name>' ) )
: Query an Iceberg table's data file metadata using thetable_files()
function. Dremio returns records that have these fields:Column Data Type Description file_path VARCHAR Full file path and name file_format VARCHAR Format, for example, PARQUET partition VARCHAR Partition information record_count BIGINT Number of rows file_size_in_bytes BIGINT Size of the file column_sizes VARCHAR List of columns with the size of each column value_counts VARCHAR List of columns with the number of records with a value null_value_counts VARCHAR List of columns with the number of records as NULL nan_value_counts VARCHAR List of columns with the number of records as NaN lower_bounds VARCHAR List of columns with the lower bound of each upper_bounds VARCHAR List of columns with the upper bound of each key_metadata VARCHAR Key metrics split_offsets VARCHAR Split offsets TABLE( table_history( '<table_name>' ) )
: Query an Iceberg table's history metadata using thetable_history()
function. Dremio returns records that have these fields:Column Data Type Description made_current_at TIMESTAMP The timestamp the Iceberg snapshot was made at snapshot_id VARCHAR The Iceberg snapshot parent_id VARCHAR The parent snapshot ID, null if not exists is_current_ancestor BOOLEAN If the snapshot is part of the current history, shows abandoned snapshots TABLE( table_manifests( '<table_name>' ) )
: Query an Iceberg table's manifest file metadata using thetable_manifests()
function. Dremio returns records that have these fields:Column Data Type Description path VARCHAR Full path and name of the manifest file length BIGINT Size in bytes partition_spec_id VARCHAR ID of the partition added_snapshot_id VARCHAR ID of the snapshot added to the manifest added_data_files_count BIGINT Number of new data files added existing_data_files_count BIGINT Number of existing data files deleted_data_files_count BIGINT Number of files removed partition_summaries VARCHAR Partition information TABLE( table_snapshot( '<table_name>' ) )
: Query an Iceberg table's snapshot metadata using thetable_snapshot()
function. Dremio returns records that have these fields:Column Data Type Description committed_at TIMESTAMP The timestamp the Iceberg snapshot was committed snapshot_id VARCHAR The Iceberg snapshot ID parent_id VARCHAR The parent snapshot ID, null if it does not exist operation VARCHAR The Iceberg operation (for example, append) manifest_list VARCHAR List of manifest files for the snapshot summary VARCHAR Additional attributes (records added, etc.)
UNNEST( <list_expression> ) [ WITH ORDINALITY ]
String
Optional
Expands a LIST into a table with a single row for each element in the LIST. WITH ORDINALITY
returns an additional column with the offset of each element, which can be used with the ORDER BY
clause to order the rows by their offsets. UNNEST
cannot unnest a correlated variable.
{ PIVOT | UNPIVOT } ( <expression> )
String
Optional
PIVOT
converts a set of data from rows into columns. UNPIVOT
converts a set of data from columns into rows. The expression can be one of the following:
- pivot_clause: The query to aggregate the data on.
- pivot_for_clause: Which columns to group and pivot on.
- pivot_in_clause: Filters the values for the columns pivot_for_clause. Each of the values in this clause will be a separate column.
NOTE: This keyword is applied to a
SELECT
statement. The syntax does not support an alias between the table/subquery and either the PIVOT
or UNPIVOT
clause. For example, SELECT name, dept FROM employees) <alias> PIVOT <query>
is not supported.WHERE <condition>
Boolean
Optional
Filters your query and extracts only the records that fulfill a specified condition. The following operators can be used: +
, >
, <
, >=
, <
=, { <>
| !=
}, BETWEEN
, LIKE
, IN
. Additionally, <condition>
can include logical operators, such as AND
, OR
, and NOT
.
GROUP BY <expression>
String
Optional
Groups rows with the same group-by-item expressions and computes aggregate functions (such as COUNT()
, MAX()
, MIN()
, SUM()
, AVG()
) for the resulting group. A GROUP BY
expression can be one or more column names, a number referencing a position in the SELECT
list, or a general expression.
QUALIFY <expression>
Boolean
Filters the results of window functions. To use QUALIFY, at least one window function must be present in either the SELECT statement or within the QUALIFY expression. The expression filters the result after aggregates and window functions are computed; it can also contain window functions. The boolean expression can be the result of a subquery.
ORDER BY <column_name1>, <column_name2>, … [ DESC ]
String
Optional
Sort the result by a specific column. By default, the records are sorted in ascending order. Use DESC
to sort the records in descending order.
LIMIT <count>
Integer
Optional
Constrains the maximum number of rows returned by the query. Must be a non-negative integer.
AT
Optional
Specifies a reference for the current SQL session. When this parameter is omitted, the current reference is used.
REF[ERENCE] <reference>
: Applies to Arctic catalogs only. Identifies the branch, tag, or commit that you want to use as the new commit reference.BRANCH <branch_name>
: Applies to Arctic catalogs only. Identifies the branch to run the query against.TAG <tag_name>
: Applies to Arctic catalogs only. Identifies the tag to run the query against.COMMIT <commit_hash>
: Applies to Arctic catalogs only. Identifies the commit to run the query against. Commit hashes must be enclosed in double quotes (for example,“ff2fe50fef5a030c4fc8e61b252bdc33c72e2b6f929d813833d998b8368302e2”
).SNAPSHOT '<snapshot_id>'
: Applies to Iceberg and Delta Lake tables only. A time-travel query that enables you to specify an earlier version of a table to read. A snapshot ID is obtained either through thetable_history()
ortable_snapshot()
metadata function.<timestamp>
: Available for Iceberg and Delta Lake table queries only. Changes the commit reference point to the most recent Iceberg snapshot as of the provided timestamp.<timestamp>
may be any SQL expression that resolves to a single timestamp type value, for example:CAST( DATE_SUB(CURRENT_DATE,1) AS TIMESTAMP )
orTIMESTAMP '2022-07-01 01:30:00.000'
.
Examples
Query an existing table in a data lake sourceSELECT *
FROM Samples."samples.dremio.com"."zips.json"
SELECT city
FROM Samples."samples.dremio.com"."zips.json"
SELECT DISTINCT city
FROM Samples."samples.dremio.com"."zips.json"
SELECT *
FROM Samples."samples.dremio.com"."zips.json"
WHERE state = 'MA' AND city = 'AGAWAM'
SELECT passenger_count, trip_distance_mi, fare_amount,
RANK() OVER (PARTITION BY passenger_count ORDER BY trip_distance_mi) AS pc_rank
FROM "NYC-taxi-trips"
QUALIFY pc_rank = 1
SELECT passenger_count, trip_distance_mi, fare_amount
FROM "NYC-taxi-trips"
QUALIFY RANK() OVER (PARTITION BY passenger_count ORDER BY trip_distance_mi) = 1
SELECT COUNT(city), city, state
FROM Samples."samples.dremio.com"."zips.json"
GROUP BY state, CITY
ORDER BY COUNT(city) DESC
WITH cte_quantity (Total)
AS (
SELECT SUM(passenger_count) as Total
FROM Samples."samples.dremio.com"."NYC-taxi-trips" where passenger_count > 2
GROUP BY pickup_datetime
)
SELECT AVG(Total) average_pass
FROM cte_quantity
ALTER DATASET Samples."samples.dremio.com"."SF weather 2018-2019.csv" REFRESH METADATA auto promotion FORCE UPDATE;
SELECT * FROM (
SELECT EXTRACT(YEAR FROM CAST(F AS DATE)) as "YEAR",
EXTRACT(MONTH FROM CAST(F AS DATE)) as "MONTH",
K as MAX_TEMP
FROM Samples."samples.dremio.com"."SF weather 2018-2019.csv"
where F <> 'DATE'
)
PIVOT (
max(MAX_TEMP) for "MONTH" in (1 as JAN, 2 as FEB, 3 as MAR, 4 as APR, 5 as MAY, 6 as JUN, 7 as JUL, 8 as AUG, 9 as SEP, 10 as OCT, 11 as NOV, 12 as "DEC")
)
UNPIVOT (
GLOBAL_MAX_TEMP for "MONTH" in (JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, "DEC")
)
ORDER BY "YEAR", "MONTH";
SELECT index,
UPPER(array_item)
FROM UNNEST ( ARRAY [ 'a', 'b', 'c' ]) WITH ORDINALITY AS my_table ( array_item, index)
ORDER BY index;
Examples for Iceberg Tables
Time travel query on an Iceberg table using a timestampSELECT *
FROM myTable AT TIMESTAMP '2022-01-01 17:30:50.000'
SELECT *
FROM myTable AT SNAPSHOT '5393090506354317772'
SELECT *
FROM TABLE(table_history('myTable'))
WHERE snapshot_id = 4593468819579153853
SELECT count(*)
FROM TABLE(table_snapshot('myTable'))
GROUP BY snapshot_id
Column Aliasing
If you specify an alias for a column or an expression in the SELECT
clause, you can refer to that alias elsewhere in the query, including in the SELECT
list or in the WHERE
clause.
SELECT c_custkey AS c, lower(c)
FROM "customer.parquet"
SELECT c_custkey AS c, lower(c)
FROM (
SELECT c_custkey, c_mktsegment AS c
FROM "customer.parquet")
SELECT c_name AS n, n
FROM (
SELECT c_mktsegment AS n, c_name
FROM "customer.parquet")
AS MY_TABLE
WHERE n = 'BUILDING'
SELECT c_custkey
FROM (
SELECT c_custkey, c_name AS c
FROM "customer.parquet" )
WHERE c = 'aa'
SELECT *
FROM (
SELECT c_custkey AS c, c_name
FROM "customer.parquet" )
JOIN "orders.parquet" ON c = o_orderkey
SELECT c_custkey AS c
FROM "customer.parquet"
JOIN "orders.parquet" ON c = o_orderkey
Examples for Arctic Catalogs
Query a table using a (branch) referenceSELECT *
FROM myCatalog.demo_table AT REF main_branch
SELECT *
FROM myCatalog.demo_view AT COMMIT "7f643f2b9cf250ce1f5d6ff4397237b705d866fbf34d714"
Distributing Data Evenly Across Execution Engines During Joins
You can use a BROADCAST
hint if a query profile indicates that data involved in a join of two tables is heavily skewed and overloading one or more execution engines. The hint forces an even distribution of the data across all execution engines.
These hints are ignored for nested-loop joins and are not supported on views.
A BROADCAST
hint must be used immediately after the name of a table.
/*+ BROADCAST */
SELECT *
FROM T1 /*+ BROADCAST */
INNER JOIN t2 ON t1.key = t2.key
INNER JOIN t3 ON t2.key = t3.key
SELECT *
FROM T1
INNER JOIN (select key, max(cost) cost from t2 /*+ BROADCAST */) T2 ON t1.key = t2.key
INNER JOIN t3 ON t2.key = t3.key