Delta Lake

Delta Lake is an open source storage layer that makes data lakes more reliable. This topic describes the Delta Lake table format.

NOTE:

Dremio version 14.0.0 and later provides read-only support for the Delta Lake table format.

Overview

Delta Lake is an open source table format that provides transactional consistency and increased scale for datasets by maintaining a consistent definition of each dataset, including its schema evolution changes and data mutations. With Delta Lake, every application consuming a dataset sees updates to it in a consistent manner, and users never see an inconsistent view of the data while it is being transformed. The result is a consistent and reliable view of datasets in the data lake as they are updated and evolved.

Data consistency is enabled by creating a series of manifest files that define the schema and data for a given point in time, as well as a transaction log that holds an ordered record of every transaction on the dataset. By reading the transaction log and manifest files, applications are guaranteed to see a consistent view of the data at any point in time, and intermediate changes stay invisible until a write operation is complete.
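
To illustrate how an ordered transaction log yields a consistent view, the following Python sketch replays a list of commits in order and computes the set of data files visible to a reader. It is a simplified model and not Dremio's reader: the replay_log function, the in-memory commits list, and the file names are hypothetical, and the add/remove action shape is only loosely modeled on the JSON actions in the open Delta Lake transaction log.

```python
import json


def replay_log(commits):
    """Return the set of data files visible after replaying commits in order.

    Each commit is a string of newline-delimited JSON, one action per line.
    Only "add" and "remove" actions are handled; this is an assumption made
    to keep the sketch short, not a full log reader.
    """
    live_files = set()
    for body in commits:
        for line in body.splitlines():
            if not line.strip():
                continue  # skip blank lines
            action = json.loads(line)
            if "add" in action:
                live_files.add(action["add"]["path"])
            elif "remove" in action:
                live_files.discard(action["remove"]["path"])
    return live_files


# Hypothetical log: commit 0 writes two files, commit 1 rewrites one of them.
commits = [
    '{"add": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0001.parquet"}}',
    '{"remove": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0002.parquet"}}',
]

print(sorted(replay_log(commits[:1])))  # view before the second commit
print(sorted(replay_log(commits)))      # view after the second commit
```

Because each commit is applied as a whole, a reader sees either the file set before the second commit or the one after it, never a mix of the two.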

Delta Lake provides:

  • Large-scale support: Efficient metadata handling enables applications to readily process petabyte-sized datasets with millions of files.
  • Schema consistency: All applications processing a dataset operate on a consistent and shared definition of the dataset metadata, such as columns, data types, and partitions.

Supported Data Sources

The Delta Lake table format is supported with the following sources in the Parquet file format:

Analyzing Delta Lake Datasets

Dremio supports analyzing Delta Lake datasets on the sources listed above through a native, high-performance reader. It automatically identifies which datasets are saved in the Delta Lake format and imports table information from the Delta Lake manifest files. Dataset promotion works the same as for any other data format in Dremio: users can promote a file system directory containing a Delta Lake dataset to a table manually, or automatically by querying the directory. When using the Delta Lake format, Dremio supports datasets of any size, including petabyte-sized datasets with billions of files.

Dremio reads Delta Lake tables created or updated by other engines, such as Spark, with transactional consistency. Dremio automatically identifies tables that are in the Delta Lake format and selects the appropriate format for the user.

Refreshing Metadata

Metadata refresh is required to query the latest version of a Delta Lake table. You can wait for an automatic refresh of metadata or manually refresh it.

Example of Querying a Delta Lake Table

Perform the following steps to query a Delta Lake table:

  1. In the Dremio UI, navigate to Datasets.
  2. Go to the source that contains the Delta Lake table.
  3. Click the Delta Lake table to format it into a physical dataset. Dremio automatically identifies tables that are in the Delta Lake format and selects the appropriate format.

  4. Click OK.

  5. Run a query on the Delta Lake table to see the results.

  6. Update the Delta Lake table in Spark.

  7. Go back to the Datasets UI and wait for the table metadata to refresh, or manually refresh it using the syntax below.

     ALTER TABLE <path_of_the_dataset>
     REFRESH METADATA

     The following statement refreshes the metadata of a Delta Lake table.

     ALTER TABLE s3."data.dremio.com".data.deltalake."tpcds10_delta"."call_center"
     REFRESH METADATA

  8. Run the previous query on the Delta Lake table to retrieve the results from the updated Delta Lake table.
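
When the refresh statement is generated by a script, path components that contain dots, such as data.dremio.com in the example above, must be quoted so that they are not split into several components. The following Python sketch builds the statement from a list of path components. The function names are hypothetical, and the sketch assumes that quoting every component and doubling any embedded double quote is acceptable to Dremio; verify this against your version before relying on it.

```python
def quote_identifier(part):
    """Wrap one path component in double quotes.

    Assumption: an embedded double quote is escaped by doubling it.
    """
    return '"' + part.replace('"', '""') + '"'


def refresh_metadata_sql(path_parts):
    """Build an ALTER TABLE ... REFRESH METADATA statement for a dataset path."""
    if not path_parts:
        raise ValueError("dataset path must have at least one component")
    dataset_path = ".".join(quote_identifier(p) for p in path_parts)
    return f"ALTER TABLE {dataset_path} REFRESH METADATA"


# Path components taken from the example statement in this topic.
statement = refresh_metadata_sql(
    ["s3", "data.dremio.com", "data", "deltalake", "tpcds10_delta", "call_center"]
)
print(statement)
```

The resulting string can then be submitted through whichever SQL client or interface you already use to run queries against Dremio.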

Limitations

  • Creating Delta Lake tables is not supported.
  • DML operations are not supported.
  • Incremental reflections are not supported.
  • Metadata refresh is required to query the latest version of a Delta Lake table.
  • Time travel or data versioning is not supported.