Clustering (Preview, Enterprise)

Clustered Iceberg tables in Dremio provide a more intuitive data layout with performance characteristics comparable to Iceberg partitioning.

Iceberg clustering sorts individual records in data files based on the clustered columns provided in the CREATE TABLE or ALTER TABLE statement. Because the data is clustered at the data-file level, Parquet metadata can be used during query planning and execution to reduce the amount of data a query scans. In addition, clustering eliminates common problems with partitioned data, such as over-partitioned tables and partition skew.
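
To illustrate why sorted data files let per-file min/max statistics prune a scan, here is a minimal Python sketch. It is a toy model and not Dremio's implementation: the helper names `write_files` and `files_to_scan`, the 100-row file size, and the sample ids are all assumptions made for illustration.

```python
from typing import List, Tuple


def write_files(values: List[int], rows_per_file: int) -> List[Tuple[int, int]]:
    """Split values into fixed-size 'files' and return each file's (min, max)."""
    stats = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        stats.append((min(chunk), max(chunk)))
    return stats


def files_to_scan(stats: List[Tuple[int, int]], target: int) -> int:
    """Count files whose min/max range could contain the target value."""
    return sum(1 for low, high in stats if low <= target <= high)


# 1,000 ids arriving in a scattered order. 37 is coprime with 1000,
# so this is a permutation of 0..999 (an assumed sample workload).
arrival_order = [(i * 37) % 1000 for i in range(1000)]

unclustered = write_files(arrival_order, 100)        # written as the rows arrive
clustered = write_files(sorted(arrival_order), 100)  # records sorted before writing

print("unclustered files to scan:", files_to_scan(unclustered, 437))
print("clustered files to scan:", files_to_scan(clustered, 437))
```

In this toy model, a lookup for id 437 cannot rule out any of the ten unclustered files, while the clustered layout narrows the scan to a single file.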

Recommendations

We recommend that you first tune Iceberg tables by clustering them, as clustering provides a general-purpose file layout that enables both efficient reads and writes. Note that you may not see immediate benefits from clustering if the tables are too small.

A common pattern is to cluster on columns that are either primary keys of the table or commonly used in query filters. These choices narrow the working dataset effectively, which improves query times. Order the clustering columns by how often they are filtered on and by cardinality, placing the most commonly queried, highest-cardinality columns first.
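
The effect of column order can be sketched with a small Python model. This assumes a plain lexicographic sort on the clustering columns, which may not match the layout algorithm Dremio actually uses; the columns `customer_id` and `region`, the helper names, and the 100-row file size are hypothetical.

```python
from typing import List, Tuple

Row = Tuple[int, int]  # (customer_id, region): hypothetical columns


def id_ranges(rows: List[Row], rows_per_file: int) -> List[Tuple[int, int]]:
    """Split rows into fixed-size 'files'; return each file's (min, max) customer_id."""
    ranges = []
    for start in range(0, len(rows), rows_per_file):
        ids = [customer_id for customer_id, _ in rows[start:start + rows_per_file]]
        ranges.append((min(ids), max(ids)))
    return ranges


def files_to_scan(ranges: List[Tuple[int, int]], target_id: int) -> int:
    """Count files whose customer_id range could contain target_id."""
    return sum(1 for low, high in ranges if low <= target_id <= high)


# 1,000 rows: customer_id is unique (high cardinality), region has 4 values.
rows = [(i, i % 4) for i in range(1000)]

# Two candidate orderings of the same clustering columns.
id_first = sorted(rows, key=lambda r: (r[0], r[1]))
region_first = sorted(rows, key=lambda r: (r[1], r[0]))

scanned_id_first = files_to_scan(id_ranges(id_first, 100), 437)
scanned_region_first = files_to_scan(id_ranges(region_first, 100), 437)
print(scanned_id_first, scanned_region_first)
```

In this model, a filter on `customer_id` touches one file when `customer_id` leads the sort key and several files when the low-cardinality `region` column leads.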

Supported Data Types for Clustered Columns

Dremio Iceberg clustering supports clustered columns of the following data types:

DECIMAL
INT
BIGINT
FLOAT
DOUBLE
VARCHAR
VARBINARY
DATE
TIME
TIMESTAMP

CTAS Behavior and Clustering

When you run a CREATE TABLE AS statement with clustering, the data is written unordered. For the best performance, run an OPTIMIZE TABLE command after creating a table with a CREATE TABLE AS statement.

Limitations

OPTIMIZE TABLE commands on clustered tables must be run from Dremio to ensure that clustering is enforced.
Clustering keys must be columns in the table. Transformations are not supported.
