SELECT
The SELECT
command queries table data and table metadata, and can query by timestamp and snapshot ID.
Dremio supports only the copy-on-write storage mechanism and reads only the latest data files for each Iceberg v2 table that you run SQL commands against. Dremio does not support Iceberg v2 tables that have merge-on-read manifests.
Querying a Table's Data
SyntaxSELECT [ <column1_name> [ , <column2_name> ... ] | * ]
FROM <table_name>
[WHERE where_condition]
[constructs]
Parameters
<column1_name> [ , <column2_name> ... ] String
The name of the column or columns that you want to query.
*
Indicates that you want to query all columns in the table.
<table_name> String
The name of the table that you want to query.
WHERE where_condition String
The condition to use to query a subset of the rows in the table.
contructs String
The SELECT
command supports many constructs, such as GROUP BY
, ORDER BY
, SORT
, and more.
Metadata Queries
Iceberg includes helpful system-table references which provide an easy access to Iceberg-specific information on tables, including:
- The data files for a table
- The history of a table
- The manifest files for a table
- The partitions of a table
- The snapshots for a table
- Information about records from CSV or JSON files that were in error and not loaded into a table by a COPY INTO operation for which ON ERROR was set to 'continue'
Querying a Table's Data File Metadata
Queries use the table_files()
function.
SELECT *
FROM TABLE( table_files('<table_name>') )
Dremio returns records that have these fields:
Column | Data Type | Description |
---|---|---|
file_path | VARCHAR | Full file path and name |
file_format | VARCHAR | Format, e.g. PARQUET |
record_count | BIGINT | Number of rows |
file_size_in_bytes | BIGINT | Size of file |
column_sizes | VARCHAR | List of columns with size of each column |
value_counts | VARCHAR | List of columns with number of records with a value |
null_value_counts | VARCHAR | List of columns with number of records as NULL |
nan_value_counts | VARCHAR | List of columns with number of records as NaN |
lower_bounds | VARCHAR | List of columns with lower bound of each |
upper_bounds | VARCHAR | List of columns with upper bound of each |
key_metadata | VARCHAR | Key metrics |
split_offsets | VARCHAR | Split offsets |
Parameters
<table_name> String
The name of the table that you want to query.
Querying a Table's History Metadata
Queries use the table_history()
function.
SELECT *
FROM TABLE( table_history('<table_name>') )
Dremio returns records that have these fields:
Column | Data Type | Description |
---|---|---|
made_current_at | TIMESTAMP | The timestamp the Iceberg snapshot was made at |
snapshot_id | VARCHAR | The Iceberg snapshot ID |
parent_id | VARCHAR | The parent snapshot ID, null if not exists |
is_current_ancestor | BOOLEAN | If the snapshot is part of the current history, shows abandoned snapshots |
Example
Example with table_history()SELECT *
FROM TABLE( table_history('myTable'))
WHERE snapshot_id = 4593468819579153853
Querying a Table's Manifest File Metadata
Queries use the table_manifests()
function.
SELECT *
FROM TABLE( table_manifests('<table_name>') )
Dremio returns records that have these fields:
Column | Data Type | Description |
---|---|---|
path | VARCHAR | Full path and name of manifest file |
length | BIGINT | Size in bytes |
partition_spec_id | VARCHAR | |
added_snapshot_id | VARCHAR | ID of snapshot added to manifest |
added_data_files_count | BIGINT | Number of new data files added |
existing_data_files_count | BIGINT | Number of existing data files |
deleted_data_files_count | BIGINT | Number of files removed |
partition_summaries | VARCHAR | Partition information |
Querying a Table's Partition Metadata
Queries use the table_partitions()
function to return partition-related statistics.
SELECT *
FROM TABLE( table_partitions('<table_name>') )
Dremio returns records that have these fields:
Column Name | Data Type | Description |
---|---|---|
partition | CHARACTER VARYING | The partition key |
record_count | INTEGER | The number of records in the partition |
file_count | INTEGER | The number of data files in the partition |
spec_id | INTEGER | The ID of the partition specification on which partition is based. |
Querying a Table's Snapshot Metadata
Queries use the table_snapshot()
function.
SELECT *
FROM TABLE( table_snapshot('<table_name>') )
Dremio returns records that have these fields:
Column | Data Type | Description |
---|---|---|
committed_at | TIMESTAMP | The timestamp the Iceberg snapshot was committed |
snapshot_id | VARCHAR | The Iceberg snapshot ID |
parent_id | VARCHAR | The parent snapshot ID, null if not exists |
operation | VARCHAR | The Iceberg operation (e.g. append) |
manifest_list | VARCHAR | List of manifest files for the snapshot |
summary | VARCHAR | Additional attributes (records added, etc) |
Example
Finds the number of snapshots for a tableSELECT count(*)
FROM TABLE( table_snapshot('myTable'))
GROUP BY snapshot_id
Querying information about rejected records in CSV or JSON files used by a COPY INTO operation for which ON_ERROR was set to 'continue'
Queries use the copy_errors()
function.
SELECT *
FROM TABLE(copy_errors(‘<table_name>’, [‘<query_id>’])
-
<table_name>
The name of the target table on which the COPY INTO operation was performed. -
<query_id>
Optional parameter. The ID of the job that ran the COPY INTO operation. You can obtain this ID from the SYS.COPY_ERRORS_HISTORY system table. If you do not specify an ID, the default value is the ID of the last job started by the current user to run COPY INTO on the target table.
The records returned consist of these fields:
Column | Data Type | Description |
---|---|---|
job_id | string | The ID of the job that ran the COPY INTO operation. |
file_name | string | The full path of the file where the validation error was encountered. |
line_number | long | The number of the line (physical position) in the file where the error was encountered. |
row_number | long | The row (record) number in input file. |
column_name | string | The name of the column where the error was encountered. |
error | string | A message describing the error. |
This function reconstructs the original COPY INTO command and executes it in validation mode. No records are inserted into the target table. The maximum number of input files that it uses is determined by the dremio.copy.into.errors_max_input_files
support key, which has a default value of 100. The selection of input files is based on alphabetical sorting and might represent a subset of the original list of input files that contain rejected records.
Time Travel Queries
Iceberg tables support both TIMESTAMP-based references and Snapshot ID-based references to specify an earlier version of a table to read.
Time Travel by Timestamps
The query returns an error stating that the reference for this table was out of range if either of these two conditions is true:
- The value of the timestamp is some time in the future, even by one second.
- The value of the timestamp is older than the oldest valid snapshot for the table.
To run a SELECT
command, the user must have Dremio's SELECT
privilege.
SELECT <column_name>
FROM <table_name> AT <timestamp>
Parameters
<column_name> String
The name of the column that you want to query.
<table_name> String
The name of the table that you want to query.
<timestamp> String
Uses the most recent Iceberg snapshot as of the provided timestamp. <timestamp>
may be any SQL expression that resolves to a single timestamp type value, for example: CAST( DATE_SUB(CURRENT_DATE,1) AS TIMESTAMP )
or TIMESTAMP '2022-07-01 01:30:00.000'
.
Example
Time travel query on an Iceberg table using a timestampSELECT count(*)
FROM my_table AT TIMESTAMP '2022-07-01 01:30:00.000'
Time Travel by Snapshot ID
To run a SELECT
command, the user must have Dremio's SELECT
privilege.
SELECT <column_name>
FROM <table_name> AT SNAPSHOT '<snapshot-id>'
Parameters
<column_name> String
The name of the column that you want to query.
<table_name> String
The name of the table that you want to query.
<snapshot_id> String
A snapshot ID obtained either through the table_history()
or table_snapshot()
metadata function.
Example
Time travel query on an Iceberg table using a snapshot IDSELECT *
FROM myTable AT SNAPSHOT '5393090506354317772'