Autoingesting Data into Apache Iceberg (Preview)

You can use autoingest pipes to automatically ingest data into Apache Iceberg tables.

Autoingest pipes are objects in Dremio that represent event-driven ingestion pipelines, which collect and load data into a centralized data repository for further analysis and utilization. Event-driven ingestion, or autoingestion, occurs when a new file is added to a specified cloud storage location, which sets off an event in Dremio to copy the new file into an Iceberg table. Dremio automatically ingests the file with micro-batch processing, loading files in small, predefined batches at regular intervals.

Autoingest pipes remove the complexity and operational overhead of setting up, running, and monitoring data pipelines by providing:

  • Single-Command Setup: Dremio provides a streamlined process for setting up and running autoingest pipes. Create an autoingest pipe using the CREATE PIPE SQL command to specify the parameters, and run a cloud-specific CLI command to set up the required infrastructure and connect your cloud storage to Dremio Cloud for autoingestion.

  • File Deduplication: File deduplication eliminates copies of files and improves storage utilization. Dremio's autoingest pipes provide file deduplication by ensuring that your pipe loads each file only once and preventing files with the same name from loading again within a given time period (the DEDUPE_LOOKBACK_PERIOD).

  • Event-Based Execution of COPY INTO: After a new file is added in the source location, an event is sent to Dremio to run a COPY INTO statement. Ingested files are processed in micro-batches to optimize user resources and ensure that the data is efficiently loaded into Dremio.

  • Execution Monitoring and Error Handling: For common issues that can occur with data pipelines, such as single rows that do not conform to the target table schema or read permissions being revoked on the source, Dremio takes the appropriate action to alert the user and suggest potential solutions.

  • Efficient Batching for Optimal Engine Utilization: When implementing an event-based loading system, users often execute a load command for every file immediately after the file is added to cloud storage. This frequent loading increases the likelihood that Iceberg file sizes will be smaller than the optimal size and the engine will be overutilized. Both of these issues result in higher total cost of ownership because they require running OPTIMIZE TABLE jobs more frequently to compact Iceberg tables, which uses engine resources inefficiently. Dremio’s autoingest pipes efficiently organize requests into micro-batches that minimize the cost of loading data and maintain acceptable latency in a data lifecycle.

note

Autoingest pipes can only ingest data from Amazon S3 sources in Dremio.

Autoingest Pipe Operation

When you use the CREATE PIPE SQL command to create an autoingest pipe, use the following parameters to define the pipe's behavior. Read Configuring an Autoingest Pipe for examples that show how to use these parameters.

Syntax for CREATE PIPE
CREATE PIPE [ IF NOT EXISTS ] <pipe_name>
[ DEDUPE_LOOKBACK_PERIOD <number_of_days> ]
AS COPY INTO <table_name>
[ AT BRANCH <branch_name> ]
FROM '@<storage_location_name>[ /<folder_name> ]'
[ FILE_FORMAT 'csv' | 'json' | 'parquet' ]
[ ( [ csv_format_options ] | [ json_format_options ] | [ parquet_format_options ] ) ]
  • DEDUPE_LOOKBACK_PERIOD: Optional parameter that specifies how many days to look back when comparing incoming files to previously loaded files. If two files with the same name are added to the storage location within the lookback period, Dremio considers the second file a duplicate and does not load it, even if a user explicitly deletes and reuploads a file with the same name. The default value is 14 days. You can set DEDUPE_LOOKBACK_PERIOD to any value from 0 to 90 days. If you set it to 0 days, Dremio does not perform file deduplication.

  • COPY INTO: Required parameter that specifies the target table and expected file format.

  • @<storage_location_name>: Required parameter that specifies the storage directory where new files are loaded. Autoingest pipes can load from a hierarchical directory structure, but they do not load file paths or Hive partition columns that are embedded in the file paths.
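The later examples do not show DEDUPE_LOOKBACK_PERIOD in use, so here is a minimal sketch of a pipe that extends the lookback window to 30 days. The pipe name, target table, and folder are hypothetical placeholders.

```sql
-- Hypothetical pipe with a 30-day deduplication window:
-- a file whose name was already loaded in the last 30 days is skipped.
CREATE PIPE IF NOT EXISTS Sales_pipe
DEDUPE_LOOKBACK_PERIOD 30
AS COPY INTO Sales_sink
FROM '@<storage_location_name>/sales'
FILE_FORMAT 'parquet'
```

Setting DEDUPE_LOOKBACK_PERIOD 0 instead would disable deduplication entirely, so reuploading a file with the same name would load it again.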

Configuring an Autoingest Pipe

Follow the steps in this section to configure an autoingest pipe.

Step 1: Create an Iceberg Table

You can create the table in any catalog, although the table schema must match the expected schema for all files in the cloud storage source. See the example below:

Example of creating an Iceberg table
CREATE TABLE Pipe_sink
(Col_one int, Col_two varchar)

If a file has at least one record that does not match the schema, Dremio skips the entire file. To check the file load status, see sys.project.copy_file_history.
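For example, a query along the following lines lists recent file load results; the exact columns that sys.project.copy_file_history exposes depend on your Dremio version, so the unqualified SELECT is an assumption rather than a reference.

```sql
-- Inspect recent COPY INTO file load results for the project;
-- filter or sort on the status and file-path columns your version exposes.
SELECT *
FROM sys.project."copy_file_history"
```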

Step 2: Create an Autoingest Pipe

When you use the CREATE PIPE command to create an autoingest pipe, you must use a preconfigured data source in Dremio that allows file ingestion; this source serves as the @<storage_location_name> in the COPY INTO statement. Within this source, you can specify a folder or one of its subfolders by appending /<folder_name>. Any new files added to the specified folder or its subdirectories are loaded into the target table.

The following example shows how to create an autoingest pipe for an Amazon S3 source:

Example of creating a pipe
CREATE PIPE Example_pipe AS
COPY INTO Pipe_sink
FROM '@<storage_location_name>/folder'
FILE_FORMAT 'csv'
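The syntax also allows loading from a nested subfolder and writing to a specific branch of a versioned catalog via AT BRANCH. The following sketch combines both; the pipe name, branch name, and folder path are hypothetical.

```sql
-- Hypothetical pipe that writes to the "ingest" branch of the target table
-- and watches a nested subfolder of the storage location.
CREATE PIPE Branch_pipe AS
COPY INTO Pipe_sink
AT BRANCH ingest
FROM '@<storage_location_name>/folder/subfolder'
FILE_FORMAT 'json'
```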

Step 3: Retrieve Cloud Command Settings

To link the pipe to your cloud storage location and retrieve cloud command settings, run a SELECT statement for cloud_cli_command_settings:

Example of retrieving cloud command settings
SELECT cloud_cli_command_settings
FROM sys.project."pipes"
WHERE pipe_name = 'Example_pipe'

The result is a cloud-specific CLI command that you can run to connect your cloud storage to Dremio Cloud.

For example, the command returns this output for an Amazon S3 source:

cloud_cli_command_settings
aws s3api put-bucket-notification-configuration --bucket tlelek-test-bucket --notification-configuration "$(notification=$(aws s3api get-bucket-notification-configuration --bucket pipetest); if [ -z "$notification" ]; then echo '{"TopicConfigurations": []}'; else echo "$notification"; fi | jq --argjson new_config '{"Id":"pipe_5b4b8cdb-a308-4919-ac42-486d4bbcd879_525ffe04-97bb-473e-8e76-fa559faa7d0d","TopicArn":"arn:aws:sns:us-west-2:1234abc:dremio-69505c78-17f8-456e-ba37-aa72a151fec8","Events":["s3:ObjectCreated:Put","s3:ObjectCreated:Post"],"Filter":{"Key":{"FilterRules":[{"Name":"prefix","Value":"ingestionE2E/232"}]}}}' '.TopicConfigurations += [$new_config]')"

Step 4: Execute Cloud Command Settings

cloud_cli_command_settings is a self-contained AWS CLI command that sets the appropriate event notifications on the Amazon S3 bucket identified in the pipe definition to alert Dremio when new files are added to the bucket.

To execute the CLI command, you must have s3:PutBucketNotification permissions. Contact your cloud administrator to ensure that you have the required permissions.

After you execute the cloud_cli_command_settings CLI command, the notification configuration may take a few minutes to take effect.

note

Amazon S3 does not support multiple notifications for the same prefix path, so you must have unique paths for the S3 notification settings.

Step 5: Add New Files to Your Amazon S3 Source

Now that you have established connectivity between your cloud storage and Dremio, the autoingest pipe will run every time you add new files to the source.

tip

For other autoingest pipe commands, see ALTER PIPE, DESCRIBE PIPE, and DROP PIPE.

Recommendations

Iceberg Optimization

If a pipe runs frequently, you may need to run OPTIMIZE TABLE to compact the small data files in your Iceberg table. The frequency of this maintenance depends on the incoming file size and load frequency.
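For example, a periodic maintenance job could compact the target table from Step 1; scheduling and any additional OPTIMIZE options are left to your deployment.

```sql
-- Compact small data files produced by frequent micro-batch loads.
OPTIMIZE TABLE Pipe_sink
```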