Delta Lake is an open-source table format that provides transactional consistency and increased scale for datasets. It maintains a consistent definition of each dataset that accounts for schema evolution and data mutations. With Delta Lake, all applications consuming a dataset see updates in a consistent manner, and users never see inconsistent views of data during transformations. Datasets in a data lake stay consistent and reliable even as they are updated and modified over time.
Delta Lake enables data consistency for a dataset by creating a series of manifest files, which define the schema and data at a given point in time, and a transaction log, which is an ordered record of every transaction on the dataset. By reading the transaction log and manifest files, applications are guaranteed to see a consistent view of the data at any point in time, and intermediate changes stay invisible until a write operation is complete.
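As a rough illustration of how an ordered transaction log yields a consistent view, the sketch below replays a simplified log up to a chosen version and returns the data files visible at that point. It assumes the "add" and "remove" action shape used by the Delta Lake transaction log. The in-memory `log` dictionary, the file names, and the `live_files` helper are hypothetical and stand in for the JSON commit files on storage. It is not Dremio's reader.

```python
import json


def live_files(log_entries, version):
    """Return the set of data files visible at `version` (inclusive).

    `log_entries` maps a commit version to a list of JSON action lines,
    a simplified stand-in for the commit files in a transaction log.
    """
    files = set()
    # Commits are applied strictly in version order.
    for commit_version, actions in sorted(log_entries.items()):
        if commit_version > version:
            break
        for line in actions:
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


# Hypothetical three-commit log; file names are made up for illustration.
log = {
    0: ['{"add": {"path": "part-000.parquet"}}'],
    1: ['{"add": {"path": "part-001.parquet"}}'],
    2: [
        '{"remove": {"path": "part-000.parquet"}}',
        '{"add": {"path": "part-002.parquet"}}',
    ],
}

print(sorted(live_files(log, 1)))  # files visible after commit 1
print(sorted(live_files(log, 2)))  # files visible after commit 2
```

A reader that stops at commit 1 never sees the partial effect of commit 2: either both of its actions are applied or neither is, which is the property the transaction log is meant to provide.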
Delta Lake provides the following benefits:
- Large-scale support: Efficient metadata handling enables applications to readily process petabyte-sized datasets with millions of files.
- Schema consistency: All applications processing a dataset operate on a consistent and shared definition of the dataset metadata, such as columns, data types, and partitions.
Supported Data Sources
The Delta Lake table format is supported with the following sources in the Parquet file format:
Analyzing Delta Lake Datasets
Dremio supports analyzing Delta Lake datasets on the sources listed above through a native and high-performance reader. It automatically identifies which datasets are saved in the Delta Lake format, and imports table information from the Delta Lake manifest files. Dataset promotion is seamless and operates the same as any other data format in Dremio, where users can promote file system directories containing a Delta Lake dataset to a table manually or automatically by querying the directory. When using Delta Lake format, Dremio supports datasets of any size including petabyte-sized datasets with billions of files.
Dremio reads Delta Lake tables created or updated by other engines, such as Spark, with transactional consistency. Dremio automatically identifies tables that are in the Delta Lake format and selects the appropriate format for the user.
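To make the idea of format detection concrete, here is a minimal sketch of one way a directory could be recognized as a Delta Lake table: by looking for a `_delta_log` subdirectory that contains at least one commit file with a 20-digit, zero-padded version name, as laid out by the Delta Lake protocol. This is an assumption about how detection can work in general, not a description of Dremio's internal logic. The `looks_like_delta_table` helper is hypothetical, and it checks a local file system only.

```python
import os
import re
import tempfile

# Delta commit files are named by a 20-digit zero-padded version number.
COMMIT_FILE = re.compile(r"^\d{20}\.json$")


def looks_like_delta_table(directory):
    """Return True if `directory` has a _delta_log with at least one commit."""
    log_dir = os.path.join(directory, "_delta_log")
    if not os.path.isdir(log_dir):
        return False
    return any(COMMIT_FILE.match(name) for name in os.listdir(log_dir))


# Small demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as root:
    print(looks_like_delta_table(root))  # no _delta_log directory yet
    os.makedirs(os.path.join(root, "_delta_log"))
    commit = os.path.join(root, "_delta_log", "0" * 20 + ".json")
    open(commit, "w").close()
    print(looks_like_delta_table(root))  # commit file is now present
```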
A metadata refresh is required to query the latest version of a Delta Lake table. You can wait for an automatic metadata refresh or refresh the metadata manually.
Example of Querying a Delta Lake Table
Perform the following steps to query a Delta Lake table:
1. In Dremio, open the Datasets page.
2. Go to the data source that contains the Delta Lake table.
3. If the data source is not AWS Glue, follow these steps:
   - Hover over the row for the table and click the icon to the right. Dremio automatically identifies tables that are in the Delta Lake format and selects the appropriate format.
   - Click Save.
4. If the data source is AWS Glue, hover over the row for the table and click the icon to the right.
5. Run a query on the Delta Lake table to see the results.
6. Update the table in the data source.
7. Go back to the Datasets UI and wait for the table metadata to refresh, or manually refresh it using the syntax below.

   Syntax to manually refresh table metadata
   ALTER TABLE `<path_of_the_dataset>` REFRESH METADATA

   The following statement refreshes the metadata of a Delta Lake table.

   Example command to manually refresh table metadata
   ALTER TABLE s3."data.dremio.com".data.deltalake."tpcds10_delta"."call_center" REFRESH METADATA

8. Run the previous query again to retrieve the results from the updated Delta Lake table.
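When the manual refresh is scripted, the dataset path has to be quoted correctly, because components such as "data.dremio.com" contain dots. The sketch below builds an `ALTER TABLE ... REFRESH METADATA` statement from a list of path components. The `quote_identifier` and `refresh_metadata_sql` helpers are hypothetical, and quoting every component while doubling embedded double quotes is an assumption based on standard SQL identifier rules, so check the result against the Dremio SQL reference before relying on it.

```python
def quote_identifier(part):
    """Wrap one path component in double quotes, doubling embedded quotes.

    Assumption: standard SQL identifier quoting applies here.
    """
    return '"' + part.replace('"', '""') + '"'


def refresh_metadata_sql(path_parts):
    """Build a metadata refresh statement for the given dataset path."""
    if not path_parts:
        raise ValueError("dataset path must have at least one component")
    path = ".".join(quote_identifier(p) for p in path_parts)
    return f"ALTER TABLE {path} REFRESH METADATA"


# Path components taken from the example statement in this section.
parts = ["s3", "data.dremio.com", "data", "deltalake", "tpcds10_delta", "call_center"]
print(refresh_metadata_sql(parts))
```

Passing the path as separate components, rather than as one dotted string, avoids ambiguity about which dots are separators and which belong to a name.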
Limitations
- Creating Delta Lake tables is not supported.
- DML operations are not supported.
- Incremental reflections are not supported.
- A metadata refresh is required to query the latest version of a Delta Lake table.
- Time travel and data versioning are not supported.