Version: current [24.2.x]

Copying Data Into Apache Iceberg Tables

Performing analytics at scale directly on CSV or JSON files is not ideal. Queries return much faster when the data is in Apache Iceberg tables, which store data in the column-oriented Parquet file format. Parquet supports efficient data storage and retrieval at very high volumes and concurrencies. Once your data is in Iceberg tables, you can use all of the features that Dremio supports for such tables.

You can load data from CSV or JSON files into existing Iceberg tables. The operation loads data into columns in the target table that match corresponding columns represented in the data.

The operation is supported on Iceberg tables in the following types of catalogs:

  • Glue
  • Hive Metastore
  • Nessie

The storage location can be in the following types of object storage:

  • ADLS
  • GCS
  • HDFS
  • NAS
  • S3

The operation verifies that at least one column in the target table matches a column represented in the data files. It then follows these rules:

  • If a match is found, the values in the data files are loaded into the column or columns.
  • If additional non-matching columns are present in the data files, the values in these columns are not loaded.
  • If additional non-matching columns are present in the target table, the operation inserts NULL values into these columns.
  • If no column in the target table matches any column represented in the data files, the operation fails.

The operation ignores case when comparing column names.
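
The matching rules above can be modeled in a few lines of Python. This is only an illustrative sketch of the documented behavior, not Dremio's implementation: the function name `plan_copy` and the sample column names are hypothetical, and the sketch ignores data types and file parsing.

```python
def plan_copy(table_columns, file_columns):
    """Model the documented matching rules (hypothetical helper, not a Dremio API).

    Returns three lists:
      loaded  - target columns that receive values from the data files
      ignored - file columns with no match in the target table (not loaded)
      nulled  - target columns with no match in the files (filled with NULL)
    """
    # Column names are compared without regard to case.
    file_keys = {name.lower() for name in file_columns}
    table_keys = {name.lower() for name in table_columns}

    loaded = [col for col in table_columns if col.lower() in file_keys]
    if not loaded:
        # No target column matches any file column: the operation fails.
        raise ValueError("no column in the target table matches the data files")

    ignored = [name for name in file_columns if name.lower() not in table_keys]
    nulled = [col for col in table_columns if col.lower() not in file_keys]
    return loaded, ignored, nulled


loaded, ignored, nulled = plan_copy(
    ["id", "name", "region"],   # columns in the target table
    ["ID", "Name", "comment"],  # columns found in the data files
)
print(loaded)   # ['id', 'name']
print(ignored)  # ['comment']
print(nulled)   # ['region']
```

In this example, `ID` and `Name` match `id` and `name` despite the difference in case, `comment` is skipped, and `region` would receive NULL values.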

To perform this operation, use the COPY INTO <table> SQL command.

Routing to Specific Queues

You can route jobs that run the COPY INTO <table> command to specific queues by using a routing rule that uses the query_label() condition. For more information, see Workload Management.

Requirements

  • At least one column in the target table must match a column represented in every data file.
  • Column names must be unique within each data file. The operation throws an error if it finds duplicate names.
  • CSV data files must have a header line at the start of the file.
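
Before loading, it can be useful to check CSV files for duplicate column names locally. The following is a minimal sketch using only the Python standard library. The helper name `duplicate_header_names` is hypothetical, and comparing names without regard to case is an assumption made here for consistency with how the operation matches columns; the check cannot tell whether a header line is actually present.

```python
import csv
import io
from collections import Counter


def duplicate_header_names(csv_text):
    """Return the column names that appear more than once in the first line.

    Hypothetical pre-check, not part of Dremio. Names are lowercased before
    comparison, which is an assumption based on case-insensitive matching.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # the first line is assumed to be the header
    counts = Counter(name.strip().lower() for name in header)
    return sorted(name for name, count in counts.items() if count > 1)


print(duplicate_header_names("id,Name,name\n1,a,b\n"))  # ['name']
print(duplicate_header_names("id,name\n1,a\n"))         # []
```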