Version: current [26.x Preview]

Connecting to Dremio Catalog from Apache Spark

You can use any Iceberg REST-compatible engine to read from and write to Dremio Catalog. This page describes how to connect to Dremio Catalog from Spark.

When using Spark, you can authenticate with Dremio using one of the following methods:

  1. Dremio Personal Access Token (PAT)
  2. OAuth2 with external IdP

Both methods require additional client-side configuration so that Spark can authenticate with Dremio. The required settings are described in the respective sections below.

Prerequisites

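Include the authmgr-oauth2-runtime JAR when you start spark-sql so that Spark can load the OAuth2 AuthManager used in the AuthManager-based examples on this page:

spark-sql --jars /path/to/authmgr-oauth2-runtime-0.0.1-dremio.jar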

If you intend to use vended credentials, make sure to pass the following config to the spark-sql command:

spark-sql .. --conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials

Note: Ensure that the warehouse is set to default, as this is the warehouse used by Dremio Catalog.

Authenticating with Dremio Using Dremio PAT

Use this method if you want to use Spark with Dremio internal users. This method follows a two-step process:

1. Create a Dremio PAT

Select the user that will authenticate Spark jobs and create a Dremio PAT for that user. Then configure Spark to use the PAT as described in the next step.

2. Configure Spark to Use PAT to Access Dremio Catalog

The following example Spark configuration connects to Dremio Catalog through the Iceberg REST API, using a PAT for authentication:

export DREMIO_PAT=...
export DREMIO_ADDRESS=...

spark-sql \
--jars /path/to/authmgr-oauth2-runtime-0.0.1-dremio.jar \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.0 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.polaris.cache-enabled=false \
--conf spark.sql.catalog.polaris.type=rest \
--conf spark.sql.catalog.polaris.warehouse=default \
--conf spark.sql.catalog.polaris.uri=http://$DREMIO_ADDRESS:8181/api/catalog \
--conf spark.sql.catalog.polaris.rest.auth.type=com.dremio.iceberg.authmgr.oauth2.OAuth2Manager \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.token-endpoint=http://$DREMIO_ADDRESS:9047/oauth/token \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.grant-type=token_exchange \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.client-id=dremio \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.scope=dremio.all \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.token-exchange.subject-token="$DREMIO_PAT" \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.token-exchange.subject-token-type=urn:ietf:params:oauth:token-type:dremio:personal-access-token

Note: The “dremio” client ID is not used for actual authentication and can be any string. DREMIO_PAT holds the Dremio Personal Access Token (PAT).
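
Because the command above reads DREMIO_PAT and DREMIO_ADDRESS from the environment, an unset variable would produce a malformed URI or an empty subject token. The following is a minimal sketch of a pre-flight check. The helper name require_env is hypothetical and is not part of Spark, Iceberg, or Dremio.

```shell
# Hypothetical helper: fail early if any named environment variable is unset or empty.
require_env() {
  for name in "$@"; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required environment variable: $name" >&2
      return 1
    fi
  done
}
```

Running require_env DREMIO_PAT DREMIO_ADDRESS || exit 1 before spark-sql stops a script before Spark starts if either variable is missing.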

Authenticating with Dremio Using OAuth2 (External Identity Provider)

Use this method if you want to use Spark with users defined in an external identity provider, e.g., Keycloak.

1. Configure Dremio to Use OAuth2 to Authenticate Spark

First, establish trust between Dremio and your identity provider.

Note:
  • Choose any value for “Audience”. This value is critical for disambiguating different access paths that may involve the same IdP.
  • The value to set for “User Claim Mapping” depends on the IdP. It should point to the token claim that contains the username that Dremio should use to map external users to internal users.
  • “Issuer URL” must be the same URL that the Spark environment sees; otherwise, token exchange will fail.
  • Configure OAuth in the IdP so that Spark clients can obtain tokens for the “Audience” configured above. This document uses Keycloak as an example: the “dremio-catalog-cli” client is configured in Keycloak and assigned a new “catalog” scope, and an “Audience” mapper in the “catalog” scope produces the custom “poc” Audience value.

2. Configure Spark to Use OAuth2

The example below shows how to connect Spark to Dremio Catalog using an external IdP for user authentication. The process works as follows:

  1. Spark obtains a user-specific access token from an OAuth2 server (usually the IdP). Dremio requires that the token be in the form of a JWT for this use case.
  2. Spark connects to Dremio and exchanges the user’s IdP access token for a Dremio Access Token.
  3. Spark connects to Dremio Catalog using the Dremio Access Token.

export KEYCLOAK_ADDRESS=...
export DREMIO_ADDRESS=...
export CLIENT_SECRET=...

spark-sql \
--jars /path/to/authmgr-oauth2-runtime-0.0.1-dremio.jar \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.0 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.polaris.cache-enabled=false \
--conf spark.sql.catalog.polaris.type=rest \
--conf spark.sql.catalog.polaris.warehouse=default \
--conf spark.sql.catalog.polaris.uri=http://$DREMIO_ADDRESS:8181/api/catalog \
--conf spark.sql.catalog.polaris.rest.auth.type=com.dremio.iceberg.authmgr.oauth2.OAuth2Manager \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.issuer-url=http://$KEYCLOAK_ADDRESS:8080/realms/iceberg \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.grant-type=device_code \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.client-id=dremio \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.client-secret=$CLIENT_SECRET \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.scope=catalog \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.impersonation.enabled=true \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.impersonation.token-endpoint=http://$DREMIO_ADDRESS:9047/oauth/token \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.impersonation.scope=dremio.all \
--conf spark.sql.catalog.polaris.rest.auth.oauth2.token-exchange.subject-token-type=urn:ietf:params:oauth:token-type:jwt

Note:
  • The catalog client scope in Spark matches the catalog scope in Keycloak.
  • dremio in Spark matches the Keycloak client that has the poc Audience Mapper.
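
To make step 2 of the flow more concrete, the sketch below builds the form body of a token exchange request for a JWT issued by the IdP. It assumes that Dremio's /oauth/token endpoint accepts the same fields as the PAT exchange performed with curl later on this page, with only the subject token type changed. That assumption mirrors the Spark settings above but is not confirmed by this page. The names token_exchange_body and IDP_JWT are hypothetical.

```shell
# Hypothetical helper: print the fixed form fields of a token exchange request.
# $1 is the subject token type URN. The subject token itself is passed
# separately so that curl can URL-encode it.
token_exchange_body() {
  printf 'grant_type=%s&scope=%s&subject_token_type=%s' \
    'urn:ietf:params:oauth:grant-type:token-exchange' \
    'dremio.all' \
    "$1"
}
```

Under these assumptions, a manual exchange could look like the following, where IDP_JWT holds the access token obtained from the IdP:

curl -X POST "http://$DREMIO_ADDRESS:9047/oauth/token" -d "$(token_exchange_body urn:ietf:params:oauth:token-type:jwt)" --data-urlencode "subject_token=$IDP_JWT"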

Using Dremio PAT for Authentication with Iceberg Versions Older Than 1.9

If you are using a version of Iceberg older than 1.9, you must run the OAuth2 token exchange flow against Dremio yourself to obtain an access token, because those versions do not include AuthManager. Any OAuth2 client can be used for this. The example below uses curl for simplicity:

export DREMIO_PAT=...
export DREMIO_ADDRESS=...

curl -X POST https://$DREMIO_ADDRESS:9047/oauth/token -d "grant_type=urn:ietf:params:oauth:grant-type:token-exchange&scope=dremio.all&subject_token_type=urn:ietf:params:oauth:token-type:dremio:personal-access-token" --data-urlencode "subject_token=$DREMIO_PAT"

Extract the access token from the output of the token exchange flow. The examples below assume that the token is stored in the $DREMIO_TOKEN variable.
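
As a minimal sketch of the extraction step, the function below reads a token response on standard input and prints the token. It assumes that the response is a standard OAuth2 JSON object with an access_token field; this page does not show the response format. The function name extract_access_token is hypothetical. A JSON tool such as jq is a more robust choice where it is available.

```shell
# Hypothetical helper: print the access_token value from a JSON token response on stdin.
# Newlines are removed first so that pretty-printed JSON is handled as one line.
extract_access_token() {
  tr -d '\n' | sed -n 's/.*"access_token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

For example, adding the -s option to the curl command above, piping its output into extract_access_token, and capturing the result with command substitution would populate DREMIO_TOKEN.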

Note:
  • The token exchange output also includes the token expiry period.
  • It is also possible to obtain the access token via a custom IdP, but this is technically more challenging. Contact Dremio for more information if you need this use case.
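
Because the token expires, a long-running script may need to know when to repeat the exchange. The sketch below assumes that the expiry period is returned in the standard OAuth2 expires_in field as a number of seconds; this page does not name the field. The helper names token_expires_in and token_still_valid are hypothetical.

```shell
# Hypothetical helper: print the lifetime in seconds from a JSON token response on stdin.
token_expires_in() {
  tr -d '\n' | sed -n 's/.*"expires_in"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: succeed while the current time is before the epoch deadline in $1.
token_still_valid() {
  [ "$(date +%s)" -lt "$1" ]
}
```

A script could compute a deadline as the current epoch time plus the value printed by token_expires_in, then call token_still_valid with that deadline before each Spark launch and repeat the exchange when the check fails.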

Configuring Spark to Use an OAuth Token

The following example Spark configuration connects to Dremio Catalog through the Iceberg REST API, using an OAuth token for authentication:

export DREMIO_TOKEN=...
export DREMIO_ADDRESS=...

spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.polaris.cache-enabled=false \
--conf spark.sql.catalog.polaris.type=rest \
--conf spark.sql.catalog.polaris.warehouse=default \
--conf spark.sql.catalog.polaris.uri=http://$DREMIO_ADDRESS:8181/api/catalog \
--conf spark.sql.catalog.polaris.token="$DREMIO_TOKEN"
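
The catalog settings above are repeated in every example on this page. In Spark, the segment that follows spark.sql.catalog. is a client-side catalog name, so polaris could be replaced with another name as long as it is used consistently. The sketch below prints the shared settings for a given catalog name and Dremio host. The helper name catalog_conf is hypothetical.

```shell
# Hypothetical helper: print the catalog settings shared by the examples on this page.
# $1 is the Spark catalog name, $2 is the Dremio host name.
catalog_conf() {
  cat <<EOF
--conf spark.sql.catalog.$1=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.$1.cache-enabled=false
--conf spark.sql.catalog.$1.type=rest
--conf spark.sql.catalog.$1.warehouse=default
--conf spark.sql.catalog.$1.uri=http://$2:8181/api/catalog
EOF
}
```

The output can be spliced into a command with an unquoted command substitution, for example spark-sql $(catalog_conf polaris "$DREMIO_ADDRESS") followed by the remaining options from the example above. This relies on shell word splitting, which is safe here only because none of the values contain spaces or glob characters.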