Delta Lake
Delta Lake is an open source storage layer that makes data lakes more reliable. This topic describes the Delta Lake table format.
Dremio version 14.0.0 and later provides read-only support for the Delta Lake table format.
Overview
Delta Lake is an open source table format that provides transactional consistency and increased scale for datasets by maintaining a consistent definition of each dataset, including schema evolution and data mutations. With Delta Lake, updates to a dataset are viewed consistently across every application that consumes it, and users are prevented from seeing an inconsistent view of the data during a transformation. The result is a consistent and reliable view of datasets in the data lake as they are updated and evolve.
Data consistency is enabled by a series of manifest files that define the schema and data for a given point in time, together with a transaction log that records every transaction on the dataset in order. By reading the transaction log and manifest files, applications are guaranteed a consistent view of the data at any point in time, and intermediate changes remain invisible until a write operation is complete.
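As a simplified illustration (the table name and file names below are examples, not specific to Dremio), a Delta Lake table is a directory of Parquet data files plus a _delta_log subdirectory, which holds the transaction log as ordered JSON commit files and periodic Parquet checkpoints:

sales_delta/
  _delta_log/
    00000000000000000000.json                  (first commit: initial schema and added data files)
    00000000000000000001.json                  (later commit: data files added or removed by an update)
    00000000000000000010.checkpoint.parquet    (periodic checkpoint summarizing earlier commits)
  part-00000-....snappy.parquet                (data files referenced by the log)
  part-00001-....snappy.parquet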
Delta Lake provides the following benefits:
- Large-scale support: Efficient metadata handling enables applications to readily process petabyte-sized datasets with millions of files.
- Schema consistency: All applications processing a dataset operate on a consistent, shared definition of the dataset metadata, such as columns, data types, and partitions.
Supported Data Sources
The Delta Lake table format, with data files stored in the Parquet file format, is supported for the following sources:
- Hive (supported in Dremio 24.0 and later)
Analyzing Delta Lake Datasets
Dremio supports analyzing Delta Lake datasets on the sources listed above through a native, high-performance reader. It automatically identifies which datasets are saved in the Delta Lake format and imports table information from the Delta Lake manifest files. Dataset promotion is seamless and works the same as for any other data format in Dremio: users can promote a file system directory containing a Delta Lake dataset to a table manually, or automatically by querying the directory. When using the Delta Lake format, Dremio supports datasets of any size, including petabyte-sized datasets with billions of files.
Dremio reads Delta Lake tables created or updated by other engines, such as Spark, with transactional consistency. Dremio automatically identifies tables that are in the Delta Lake format and selects the appropriate format for the user.
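For example, once a directory containing a Delta Lake dataset has been promoted, it can be queried like any other table. The path below is a hypothetical example:

SELECT * FROM s3."my-bucket".sales."orders_delta" LIMIT 10;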
Refreshing Metadata
Metadata refresh is required to query the latest version of a Delta Lake table. You can wait for an automatic refresh of metadata or manually refresh it.
Example of Querying a Delta Lake Table
Perform the following steps to query a Delta Lake table:
- In Dremio, open the Datasets page.
- Go to the data source that contains the Delta Lake table.
- If the data source is not AWS Glue or Hive, follow these steps:
  - Hover over the row for the table and click the icon to the right. Dremio automatically identifies tables that are in the Delta Lake format and selects the appropriate format.
  - Click Save.
- If the data source is AWS Glue or Hive, hover over the row for the table and click the icon to the right.
- Run a query on the Delta Lake table to see the results.
- Update the Delta Lake table in the data source.
- Go back to the Datasets page and wait for the table metadata to refresh, or manually refresh it using the syntax below.

  Syntax to manually refresh table metadata:

  ALTER TABLE <path_of_the_dataset>
  REFRESH METADATA

  The following statement shows refreshing the metadata of a Delta Lake table:

  ALTER TABLE s3."data.dremio.com".data.deltalake."tpcds10_delta"."call_center"
  REFRESH METADATA

- Run the previous query on the Delta Lake table to retrieve the results from the updated Delta Lake table. An end-to-end example of this sequence is shown below.
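Putting the steps together, a typical sequence looks like the following. The table path is a hypothetical example, and the update in the middle happens outside Dremio, for example through a Spark job that appends rows to the Delta Lake table.

-- Query the Delta Lake table
SELECT COUNT(*) FROM s3."my-bucket".sales."orders_delta";

-- The table is then updated by another engine, for example a Spark job that appends new rows.

-- Manually refresh the table metadata so the new Delta Lake commit becomes visible in Dremio
ALTER TABLE s3."my-bucket".sales."orders_delta"
REFRESH METADATA

-- Run the query again; the result now reflects the updated table
SELECT COUNT(*) FROM s3."my-bucket".sales."orders_delta";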
Table Metadata and Time Travel Queries
Dremio supports time travel on Delta Lake tables in version 24.2.0 and later. You can query a Delta table's history using the following SQL commands:
SELECT * FROM TABLE(table_history('<full path of the table>'));
SELECT * FROM TABLE(table_snapshot('<full path of the table>'));
To retrieve the data as of a specific timestamp, use the following SQL:
SELECT * FROM <table> AT TIMESTAMP '2019-10-07 18:13:16.852';
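For example, using a hypothetical table path, the following queries list the table's history and read the data as it existed at a specific point in time:

SELECT * FROM TABLE(table_history('s3."my-bucket".sales."orders_delta"'));
SELECT * FROM s3."my-bucket".sales."orders_delta" AT TIMESTAMP '2023-05-01 12:00:00.000';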
Limitations
- Creating Delta Lake tables is not supported.
- DML operations are not supported.
- Incremental reflections are not supported.
- Metadata refresh is required to query the latest version of a Delta Lake table.
- Only Delta Lake tables with minReaderVersion 1 can be read.