External Engines

Dremio's Open Catalog is built on Apache Polaris, providing a standards-based, open approach to data catalog management. At its core is the Iceberg REST interface, which enables seamless integration with any query engine that supports the Apache Iceberg REST catalog specification. This open architecture means you can connect industry-standard engines such as Apache Spark, Trino, and Apache Flink directly to Dremio.

| Engine | Best For | Key Features |
| --- | --- | --- |
| Apache Spark | Data engineering, ETL | Token exchange, nested folders, views |
| Trino | Interactive analytics | Fast queries, BI workloads |
| Apache Flink | Real-time streaming | Event-driven, continuous pipelines |

By leveraging the Iceberg REST standard, the Open Catalog acts as a universal catalog layer that query engines can communicate with using a common language. This allows organizations to build flexible data architectures where multiple engines can work together, each accessing and managing the same Iceberg tables through Dremio's centralized catalog.
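Because the catalog speaks the standard Iceberg REST protocol, you can inspect it with nothing more than an HTTP client. The sketch below is a minimal connectivity check, assuming your personal access token is accepted as a bearer token (as described in the Trino and Flink sections); it calls the configuration endpoint defined by the Iceberg REST specification.

Iceberg REST configuration check
#!/bin/bash
# Minimal connectivity check against the Iceberg REST /v1/config endpoint.
# Replace the placeholders with your PAT and catalog name before running.
curl -s \
  -H "Authorization: Bearer <personal_access_token>" \
  "https://catalog.dremio.cloud/api/iceberg/v1/config?warehouse=<catalog_name>"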

Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing, widely used for ETL, batch processing, and data engineering workflows.

Prerequisites

This example uses Spark 3.5.3 with Iceberg 1.9.1. For other versions, ensure compatibility between Spark, Scala, and Iceberg runtime versions. Additional prerequisites include:

  • The following JAR files downloaded to your local directory (these are the JARs referenced by the --jars option in the scripts below; see the download sketch after this list):
    • authmgr-oauth2-runtime-0.0.5.jar
    • iceberg-spark-runtime-3.5_2.12-1.9.1.jar
    • iceberg-aws-bundle-1.9.1.jar
  • Docker installed and running.
  • Your Dremio catalog name – The default catalog in each project has the same name as the project.
  • If authenticating with a PAT, you must generate a token. See Personal Access Tokens for step-by-step instructions.
  • If authenticating with an identity provider (IDP), your IDP or other external token provider must be configured as a trusted OAuth external token provider in Dremio.
  • You must have an OAuth2 client registered in your IDP configured to issue tokens that Dremio accepts (matching audience and scopes) and with a client ID and client secret provided by your IDP.
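As a convenience, the runtime JARs can be fetched with wget, mirroring the approach shown in the Flink section. The paths for the two Iceberg artifacts follow the standard repo1.maven.org layout; the path for the Dremio auth manager JAR is an assumption inferred from its package name (com.dremio.iceberg.authmgr), so verify it before relying on it.

Download the Spark JARs
#!/bin/bash
# Download the three runtime JARs into $HOME/downloads.
wget -P $HOME/downloads https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.9.1/iceberg-spark-runtime-3.5_2.12-1.9.1.jar
wget -P $HOME/downloads https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.9.1/iceberg-aws-bundle-1.9.1.jar
# Assumed Maven path for the auth manager; adjust if your repository layout differs.
wget -P $HOME/downloads https://repo1.maven.org/maven2/com/dremio/iceberg/authmgr/authmgr-oauth2-runtime/0.0.5/authmgr-oauth2-runtime-0.0.5.jar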

Authenticate with a PAT

You can authenticate your Apache Spark session with a Dremio personal access token using the following script. Replace <personal_access_token> with your Dremio personal access token and replace <catalog_name> with your catalog name.

In addition, you can adjust the volume mount paths to match where you've downloaded the JAR files and where you want your workspace directory. The example uses $HOME/downloads and $HOME/workspace.

Spark with PAT Authentication
#!/bin/bash
export CATALOG_NAME="<catalog_name>"
export DREMIO_PAT="<personal_access_token>"

docker run -it \
-v $HOME/downloads:/opt/jars \
-v $HOME/workspace:/workspace \
apache/spark:3.5.3 \
/opt/spark/bin/spark-shell \
--jars /opt/jars/authmgr-oauth2-runtime-0.0.5.jar,/opt/jars/iceberg-spark-runtime-3.5_2.12-1.9.1.jar,/opt/jars/iceberg-aws-bundle-1.9.1.jar \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.polaris.type=rest \
--conf spark.sql.catalog.polaris.cache-enabled=false \
--conf spark.sql.catalog.polaris.warehouse=$CATALOG_NAME \
--conf spark.sql.catalog.polaris.uri=https://catalog.dremio.cloud/api/iceberg \
--conf spark.sql.catalog.polaris.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials \
--conf spark.sql.catalog.polaris.rest.auth.type=com.dremio.iceberg.authmgr.oauth2.OAuth2Manager \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.token-endpoint=https://login.dremio.cloud/oauth/token \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.grant-type=token_exchange \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.client-id=dremio-catalog-cli \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.scope=dremio.all \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.token-exchange.subject-token="$DREMIO_PAT" \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.token-exchange.subject-token-type=urn:ietf:params:oauth:token-type:dremio:personal-access-token

Note:

In this configuration, polaris is the catalog identifier used within Spark. This identifier is mapped to your actual Dremio catalog via the spark.sql.catalog.polaris.warehouse property.
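The name polaris is only the Spark-side alias; you can choose any identifier as long as you rename it consistently in every property key. A minimal sketch with a hypothetical identifier analytics (all other values unchanged):

# Hypothetical: expose the same Dremio catalog as "analytics" instead of "polaris".
--conf spark.sql.catalog.analytics=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.analytics.type=rest \
--conf spark.sql.catalog.analytics.warehouse=$CATALOG_NAME \
# ...and so on for the remaining spark.sql.catalog.analytics.* properties.
# Queries then use the new identifier: spark.sql("SHOW NAMESPACES IN analytics").show()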

Authenticate with an IDP

You can authenticate your Apache Spark session using an external token provider that has been integrated with Dremio.

Using this configuration:

  • Spark obtains a user-specific JWT from the external token provider.
  • Spark connects to Dremio and exchanges the JWT for an access token.
  • Spark connects to the Open Catalog using the access token.
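Steps 2 and 3 are handled automatically by the OAuth2 auth manager, but it can help to see the exchange spelled out. The sketch below approximates step 2 using the RFC 8693 token-exchange parameters that appear in the configuration; the exact request the auth manager sends may differ.

Token exchange (approximate)
#!/bin/bash
# Approximate shape of the JWT-for-access-token exchange (RFC 8693).
# <idp_jwt> is the user-specific token obtained from your IDP in step 1.
curl -s https://login.dremio.cloud/oauth/token \
  -d grant_type=urn:ietf:params:oauth:grant-type:token-exchange \
  -d subject_token=<idp_jwt> \
  -d subject_token_type=urn:ietf:params:oauth:token-type:jwt \
  -d scope=dremio.all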

In the following script, replace <catalog_name> with your catalog name, <idp_url> with the URL of your external token provider, and <idp_client_id> and <idp_client_secret> with the credentials issued by your IDP.

In addition, you can adjust the volume mount paths to match where you've downloaded the JAR files and where you want your workspace directory. The example uses $HOME/downloads and $HOME/workspace.

Spark with IDP Authentication
#!/bin/bash
export CATALOG_NAME="<catalog_name>"
export IDP_URL="<idp_url>"
export CLIENT_ID="<idp_client_id>"
export CLIENT_SECRET="<idp_client_secret>"

docker run -it \
-v $HOME/downloads:/opt/jars \
-v $HOME/workspace:/workspace \
apache/spark:3.5.3 \
/opt/spark/bin/spark-shell \
--jars /opt/jars/authmgr-oauth2-runtime-0.0.5.jar,/opt/jars/iceberg-spark-runtime-3.5_2.12-1.9.1.jar,/opt/jars/iceberg-aws-bundle-1.9.1.jar \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.polaris.type=rest \
--conf spark.sql.catalog.polaris.cache-enabled=false \
--conf spark.sql.catalog.polaris.warehouse=$CATALOG_NAME \
--conf spark.sql.catalog.polaris.uri=https://catalog.dremio.cloud/api/iceberg \
--conf spark.sql.catalog.polaris.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials \
--conf spark.sql.catalog.polaris.rest.auth.type=com.dremio.iceberg.authmgr.oauth2.OAuth2Manager \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.issuer-url=$IDP_URL \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.grant-type=device_code \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.client-id=$CLIENT_ID \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.client-secret=$CLIENT_SECRET \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.scope=dremio.all \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.impersonation.enabled=true \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.impersonation.token-endpoint=https://login.dremio.cloud/oauth/token \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.impersonation.scope=dremio.all \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.token-exchange.subject-token-type=urn:ietf:params:oauth:token-type:jwt

Usage Examples

With these configurations, polaris is the catalog identifier used within Spark. This identifier is mapped to your actual Dremio catalog via the spark.sql.catalog.polaris.warehouse property. Once Spark is running and connected to your Dremio catalog:

List namespaces
spark.sql("SHOW NAMESPACES IN polaris").show()
Query a table
spark.sql("SELECT * FROM polaris.your_namespace.your_table LIMIT 10").show()
Create a table
spark.sql("""
CREATE TABLE polaris.your_namespace.new_table (
id INT,
name STRING
) USING iceberg
""")

Trino

Trino is a distributed SQL query engine designed for fast analytic queries against data sources of all sizes. It excels at interactive SQL analysis, ad hoc queries, and joining data across multiple sources.

Prerequisites

  • Docker installed and running.
  • A valid Dremio personal access token – See Personal Access Tokens for instructions to generate a personal access token.
  • Your Dremio catalog name – The default catalog in each project has the same name as the project.

Configuration

To connect Trino to Dremio using Docker, follow these steps:

  1. Create a directory for Trino configuration and add a catalog configuration:

    mkdir -p ~/trino-config/catalog

    In trino-config/catalog, create a catalog configuration file named polaris.properties with the following values:

    Trino polaris.properties
    connector.name=iceberg
    iceberg.catalog.type=rest
    iceberg.rest-catalog.uri=https://catalog.dremio.cloud/api/iceberg
    iceberg.rest-catalog.oauth2.token=<personal_access_token>

    iceberg.rest-catalog.warehouse=<catalog_name>
    iceberg.rest-catalog.security=OAUTH2

    iceberg.rest-catalog.vended-credentials-enabled=true
    fs.native-s3.enabled=true
    s3.region=<region>

    Replace the following:

    • <personal_access_token> with your Dremio personal access token.
    • <catalog_name> with your catalog name.
    • <region> with the AWS region where your data is stored, such as us-west-2.
    Note:
    • In this configuration, polaris (from the filename polaris.properties) is the catalog identifier used in Trino queries. The iceberg.rest-catalog.warehouse property maps this identifier to your actual Dremio catalog.
    • In oauth2.token, you provide your Dremio personal access token directly. Dremio's catalog API accepts PATs as bearer tokens without requiring token exchange.
  2. Pull and start the Trino container:

    docker run --name trino -d -p 8080:8080 trinodb/trino:latest
  3. Verify that Trino is running:

    docker ps

    You can access the web UI at http://localhost:8080 and log in as admin.

  4. Restart Trino with the configuration:

    docker stop trino
    docker rm trino

    # Start with mounted configuration
    docker run --name trino -d -p 8080:8080 -v ~/trino-config/catalog:/etc/trino/catalog trinodb/trino:latest

    # Verify Trino is running
    docker ps

    # Check logs
    docker logs trino -f
  5. In another terminal window, connect to the Trino CLI:

    docker exec -it trino trino --user admin

    You should see the Trino prompt:

    trino>
  6. Verify the catalog connection:

    trino> show catalogs;

Usage Examples

Once Trino is running and connected to your Dremio catalog:

List namespaces
trino> show schemas from polaris;
Query a table
trino> select * from polaris.your_namespace.your_table;
Create a table
trino> CREATE TABLE polaris.demo_namespace.test_table (
  id INT,
  name VARCHAR,
  created_date DATE,
  value DOUBLE
);
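
To verify writes end to end, you can insert a row into the table created above; this is standard Trino SQL, shown with sample values.

Insert a row
trino> INSERT INTO polaris.demo_namespace.test_table VALUES (1, 'example', DATE '2024-01-01', 42.0);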

Limitations

  • Case sensitivity: Namespace and table names must be in lowercase. Trino will not list or access tables in namespaces that begin with an uppercase character.
  • View compatibility: Trino cannot read views created in Dremio due to SQL dialect incompatibility; such queries fail with the error "Cannot read unsupported dialect 'DremioSQL'."

Apache Flink

Apache Flink is a distributed stream processing framework designed for stateful computations over bounded and unbounded data streams, enabling real-time data pipelines and event-driven applications.

To connect Apache Flink to Dremio using Docker Compose, follow these steps:

Prerequisites

You'll need to download the required JAR files and organize them in a project directory structure.

  1. Create the project directory structure:

    mkdir -p flink-dremio/jars
    cd flink-dremio
  2. Download the required JARs into the jars/ directory:

    • Iceberg Flink Runtime 1.20:

      wget -P jars/ https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.20/1.9.1/iceberg-flink-runtime-1.20-1.9.1.jar
    • Iceberg AWS Bundle for vended credentials:

      wget -P jars/ https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.9.1/iceberg-aws-bundle-1.9.1.jar
    • Hadoop dependencies required by Flink:

      wget -P jars/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.8.3-10.0/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar
  3. Create the Dockerfile.

    Create a file named Dockerfile in the flink-dremio directory:

    Flink Dockerfile
    FROM flink:1.20-scala_2.12 

    # Copy all required JARs
    COPY jars/*.jar /opt/flink/lib/
  4. Create the docker-compose.yml file in the flink-dremio directory:

    Flink docker-compose.yml
    services:
      flink-jobmanager:
        build: .
        ports:
          - "8081:8081"
        command: jobmanager
        environment:
          - |
            FLINK_PROPERTIES=
            jobmanager.rpc.address: flink-jobmanager
            parallelism.default: 2
          - AWS_REGION=us-west-2

      flink-taskmanager:
        build: .
        depends_on:
          - flink-jobmanager
        command: taskmanager
        scale: 1
        environment:
          - |
            FLINK_PROPERTIES=
            jobmanager.rpc.address: flink-jobmanager
            taskmanager.numberOfTaskSlots: 4
            parallelism.default: 2
          - AWS_REGION=us-west-2
  5. Build and start the Flink cluster:

    # Build and start the cluster
    docker-compose build --no-cache
    docker-compose up -d

    # Verify the cluster is running
    docker-compose ps

    # Verify required JARs are present
    docker-compose exec flink-jobmanager ls -la /opt/flink/lib/ | grep -E "(iceberg|hadoop)"

    You should see the JARs you downloaded in the previous step.

  6. Connect to the Flink SQL client:

    docker-compose exec flink-jobmanager ./bin/sql-client.sh

    You can also access the Flink web UI at http://localhost:8081 to monitor jobs.

  7. Create the Dremio catalog connection in Flink:

    CREATE CATALOG polaris WITH (
      'type' = 'iceberg',
      'catalog-impl' = 'org.apache.iceberg.rest.RESTCatalog',
      'uri' = 'https://catalog.dremio.cloud/api/iceberg',
      'token' = '<personal_access_token>',
      'warehouse' = '<catalog_name>',
      'header.X-Iceberg-Access-Delegation' = 'vended-credentials',
      'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO'
    );

    Replace the following:

    • <personal_access_token> with your Dremio personal access token.
    • <catalog_name> with your catalog name.
    Note:
    • In this configuration, polaris is the catalog identifier used in Flink queries. The CREATE CATALOG command maps this identifier to your actual Dremio catalog.
    • In token, you provide your Dremio personal access token directly. Dremio's catalog API accepts PATs as bearer tokens without requiring token exchange.
  8. Verify the catalog connection:

    Flink SQL> show catalogs;
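
Optionally, you can make the new catalog the session default so that subsequent statements can reference tables as namespace.table without the polaris prefix; USE CATALOG is standard Flink SQL.

Set the default catalog
Flink SQL> USE CATALOG polaris;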

Usage Examples

Once Apache Flink is running and connected to your Dremio catalog:

List namespaces
Flink SQL> show databases in polaris;
Query a table
Flink SQL> select * from polaris.your_namespace.your_table;
Create a table
Flink SQL> CREATE TABLE polaris.demo_namespace.test_table (
  id INT,
  name STRING,
  created_date DATE,
  `value` DOUBLE
);

Limitations

  • Reserved keywords: Column names that are reserved keywords, such as value, timestamp, and date, must be enclosed in backticks when creating or querying tables.
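
For example, selecting the value column from the table created above requires backticks, using standard Flink SQL identifier quoting:

Query a reserved-keyword column
Flink SQL> SELECT id, `value` FROM polaris.demo_namespace.test_table;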