Google BigLake Catalog (Preview)
Google BigLake is an Iceberg lakehouse catalog built on top of Google Cloud Storage. Dremio connects to Google BigLake using its Iceberg REST Catalog connector.
Connect to a Google BigLake Catalog
- In the Dremio console, click Add Data on the Home page.
- In the Add Data dialog, select Iceberg REST Catalog.
- Configure the connection using the sections below, then click Save.
General
To configure the source connection:
- For Name, enter a name for the source. The name must be unique within the organization, and it is worth choosing one that is easy for users to reference. The name cannot be edited after the source is created, cannot exceed 255 characters, and may contain only the following characters: 0-9, A-Z, a-z, underscore (_), or hyphen (-).
- For Endpoint URI, specify the catalog service URI as https://biglake.googleapis.com/iceberg/v1/restcatalog.
- By default, Use vended credentials is enabled. This allows Dremio to connect to the catalog and receive temporary credentials for the underlying storage location. If this is enabled, you do not need to add storage authentication in Advanced Options.
- For Allowed Namespaces, add your namespace and uncheck the Allowed Namespaces include their whole subtrees option.
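The source-name rules above are easy to check up front. A minimal sketch in Python (the regex and function name are illustrative, not part of Dremio):

```python
import re

# Source names must be at most 255 characters and use only
# 0-9, A-Z, a-z, underscore (_), or hyphen (-).
NAME_PATTERN = re.compile(r"^[0-9A-Za-z_-]{1,255}$")

def is_valid_source_name(name: str) -> bool:
    """Return True if `name` satisfies the documented source-name rules."""
    return bool(NAME_PATTERN.fullmatch(name))

print(is_valid_source_name("biglake-prod_01"))  # True
print(is_valid_source_name("bad name!"))        # False: space and '!' are not allowed
```

Uniqueness within the organization cannot be checked locally; Dremio enforces it when you save the source.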
Advanced Options
The values you set below depend on your Google BigLake Catalog settings. If you left Use vended credentials enabled on the General tab and your Google BigLake catalog is configured with credential vending mode, follow the Vended Credentials Catalog setup below. If you disabled Use vended credentials on the General tab and your Google BigLake catalog is configured with end-user credentials, follow the End User Catalog setup below.
Replace the placeholders inside <...> with your own values. For example, a warehouse value could be gs://yourstoragelocationhere.
Vended Credentials Catalog

| Name | Type | Value | Description |
|---|---|---|---|
| warehouse | Property | <warehouse> | Google BigLake Catalog location |
| rest.auth.type | Property | org.apache.iceberg.gcp.auth.GoogleAuthManager | Required value for a Google BigLake Catalog source |
| header.x-goog-user-project | Property | <project> | Google project where the catalog is located |
| gcp.auth.credentials-json | Credential | <your_ADC_JSON_here> | Provided file allows Dremio to authenticate with the catalog |

End User Catalog

| Name | Type | Value | Description |
|---|---|---|---|
| warehouse | Property | <warehouse> | Google BigLake Catalog location |
| rest.auth.type | Property | org.apache.iceberg.gcp.auth.GoogleAuthManager | Required value for a Google BigLake Catalog source |
| header.x-goog-user-project | Property | <project> | Google project where the catalog is located |
| fs.gs.impl | Property | com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem | Required value for a Google BigLake Catalog source |
| fs.gs.auth.service.account.enable | Property | true | Required value for a Google BigLake Catalog source |
| fs.AbstractFileSystem.gs.impl | Property | com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS | Required value for a Google BigLake Catalog source |
| fs.gs.project.id | Property | <project> | Google project where the catalog is located |
| fs.gs.auth.service.account.email | Property | <email_for_service_account> | Email of the service account used to access the underlying Google Cloud Storage location |
| fs.gs.auth.service.account.private.key.id | Property | <private_key_id> | Private key ID of the service account |
| dremio.gcs.use_keyfile | Property | true | Required value for a Google BigLake Catalog source |
| gcp.auth.credentials-json | Credential | <your_ADC_JSON_here> | Provided file allows Dremio to authenticate with the catalog |
| fs.gs.auth.service.account.private.key | Credential | <your_private_key_here> | Private key of the service account |
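The two configurations differ only in how storage access is granted: both modes share the catalog keys, while End User mode adds the GCS filesystem and service-account keys. A sketch of the two property sets as plain key/value maps (placeholder values stand in for your own; the dict names are illustrative):

```python
# Keys shared by both modes: catalog location, auth manager, and project header.
COMMON = {
    "warehouse": "<warehouse>",  # Google BigLake Catalog location
    "rest.auth.type": "org.apache.iceberg.gcp.auth.GoogleAuthManager",
    "header.x-goog-user-project": "<project>",  # project hosting the catalog
}

# Vended Credentials Catalog: storage credentials come from the catalog itself.
VENDED = {
    **COMMON,
    "gcp.auth.credentials-json": "<your_ADC_JSON_here>",  # entered as a credential
}

# End User Catalog: Dremio accesses GCS directly via a service account.
END_USER = {
    **COMMON,
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "fs.gs.auth.service.account.enable": "true",
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    "fs.gs.project.id": "<project>",
    "fs.gs.auth.service.account.email": "<email_for_service_account>",
    "fs.gs.auth.service.account.private.key.id": "<private_key_id>",
    "dremio.gcs.use_keyfile": "true",
    "gcp.auth.credentials-json": "<your_ADC_JSON_here>",        # credential
    "fs.gs.auth.service.account.private.key": "<your_private_key_here>",  # credential
}

print(len(VENDED), len(END_USER))  # 4 12
```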
Cache Options
- Enable local caching when possible: Selected by default. Along with asynchronous access for cloud caching, local caching can improve query performance.
- Max percent of total available cache space to use when possible: Specifies the disk quota, as a percentage, available on any single executor node when local caching is enabled. The default is 100 percent of the total disk space available on the mount point provided for caching. You can either manually enter a percentage in the value field or use the arrows to the far right to adjust the percentage.
Reflection Refresh
You can set the policy that controls how often reflections are scheduled to be refreshed automatically, as well as the time limit after which reflections expire and are removed. See the following options:
| Option | Description |
|---|---|
| Never refresh | Select to prevent automatic reflection refresh. The default is to automatically refresh. |
| Refresh every | How often to refresh reflections, specified in hours, days, or weeks. This option is ignored if Never refresh is selected. |
| Set refresh schedule | Specify the daily or weekly schedule. |
| Never expire | Select to prevent reflections from expiring. The default is to automatically expire after the time limit below. |
| Expire after | The time limit after which reflections expire and are removed from Dremio, specified in hours, days, or weeks. This option is ignored if Never expire is selected. |
Metadata
Metadata options are configured using the following settings.
Dataset Handling
- Remove dataset definitions if underlying data is unavailable (default).
- If this box is not checked and the underlying files under a folder are removed, or the folder/source becomes inaccessible, Dremio does not remove the dataset definitions. This option is useful when files are temporarily deleted and then replaced with new sets of files.
Metadata Refresh
These are the optional Metadata Refresh parameters:
- Dataset Discovery: The refresh interval for fetching top-level source object names such as databases and tables.

  | Parameter | Description |
  |---|---|
  | Fetch every | How often to fetch object names, in minutes, hours, days, or weeks. The default is 1 hour. |

- Dataset Details: The metadata that Dremio needs for query planning, such as information about fields, types, shards, statistics, and locality.

  | Parameter | Description |
  |---|---|
  | Fetch mode | Dremio updates details only for previously queried objects in a source. By default, this is set to Only Queried Datasets. |
  | Fetch every | How often to fetch dataset details, in minutes, hours, days, or weeks. The default is 1 hour. |
  | Expire after | How long until dataset details expire, in minutes, hours, days, or weeks. The default is 3 hours. |
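With the documented defaults (fetch details every 1 hour, expire after 3 hours), each expiry window spans three fetch windows, so details stay fresh as long as any one of those refreshes succeeds. A quick timedelta sketch (the timestamp is an arbitrary example):

```python
from datetime import datetime, timedelta

fetch_every = timedelta(hours=1)   # default "Fetch every" for dataset details
expire_after = timedelta(hours=3)  # default "Expire after"

last_fetch = datetime(2024, 1, 1, 9, 0)
next_fetch = last_fetch + fetch_every
expires_at = last_fetch + expire_after

print(next_fetch.isoformat())        # 2024-01-01T10:00:00
print(expires_at.isoformat())        # 2024-01-01T12:00:00
print(expire_after // fetch_every)   # 3 fetch attempts per expiry window
```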
Privileges
This connection inherits privileges from Project settings. To grant specific users or roles additional privileges in this connection:
- Enter the username or role name that you want to grant access to and click the Add to Privileges button. The added user or role is displayed in the USERS/ROLES table.
- For the users or roles in the USERS/ROLES table, toggle the checkmark for each privilege you want to grant on the Dremio source that is being created.
- Click Save after setting the configuration.
See Privileges for additional information about privileges.
Edit a Google BigLake Catalog Connection
- On the Open Catalog page, under Connections, right-click the connection and select Settings.
- Update the connection configuration as needed.
- Click Save.
Delete a Google BigLake Catalog Connection
If the source is in a bad state (for example, Dremio cannot authenticate to the source or the source is otherwise unavailable), only users who belong to the ADMIN role can delete the source.
- On the Open Catalog page, under Connections, right-click the connection and select Delete.
- Click Delete to confirm.