Google Cloud Storage (GCS)

Dremio can connect to the Google Cloud Storage (GCS) web service as a data source. Once the source is configured, you can access GCS data directly through the Dremio interface.

Software Requirements

General Tab

The following options are available from the General tab.

| Field | Description |
| --- | --- |
| Name | A name to identify the data source in Dremio. |
| Google Project ID | The ID of your Google Cloud project. You can find it in the **Project info** pane at the top left of the Google Cloud Console Home page. |
| Service Account Keys | The most common method of integrating Dremio with GCS is to create a service account key. When this option is selected, you must provide values for the Client Email, Client ID, Private Key ID, and Private Key fields. To obtain a service account key for these fields, follow the steps under Creating Service Account Keys. |
| Automatic/Service Account | Select this option if you are running Dremio on a Google Compute Engine instance. Dremio then uses the active service account and does not require any additional information to access your data. |
| Client Email | The email address associated with the GCS service account. Required only when the Service Account Keys radio button is selected. |
| Client ID | The client ID for your key pair. You obtain this value by creating a service account key. Required only when the Service Account Keys radio button is selected. |
| Private Key ID | The key ID for your key pair. You obtain this value by creating a service account key. Required only when the Service Account Keys radio button is selected. |
| Private Key | The private key for your key pair. You obtain this value by creating a service account key. Required only when the Service Account Keys radio button is selected. |

Creating Service Account Keys

To access your Google Cloud Storage source, Dremio needs credentials that identify a service account. You provide these by creating a public/private key pair for the service account. Google Cloud stores the public portion, and the private portion is made available to you to enter in Dremio.

The steps below outline the simplest method of creating a service account key.

  1. From the Google Cloud Console, navigate to the Service Accounts page.
  2. Select the desired project.
  3. Click the email address of the service account for which you want to create a key.
  4. Click on the Keys tab.
  5. Click the Add Key drop-down menu and then select Create new key.
  6. Select JSON as the Key Type and then click Create.

Your browser then downloads a service account key file. It should look similar to the example below:

{
  "type": "service_account",
  "project_id": "project-id",
  "private_key_id": "key-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----\n",
  "client_email": "service-account-email",
  "client_id": "client-id",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/service-account-email"
}

Copy and paste each value from this file into the corresponding field in the Dremio interface.
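
To make the copy-and-paste step less error-prone, the values can be pulled out of the key file programmatically. The following is a minimal sketch, not part of Dremio: the function name `dremio_fields_from_key` and the label mapping are illustrative assumptions, and the sample values are the placeholders from the example file above. Note that `json.loads` turns the `\n` escapes in `private_key` into real line breaks.

```python
import json

# Dremio General tab field label -> key name in the downloaded JSON key file.
# The labels come from the General tab; the mapping itself is an assumption
# based on the matching names.
DREMIO_FIELD_TO_KEY = {
    "Client Email": "client_email",
    "Client ID": "client_id",
    "Private Key ID": "private_key_id",
    "Private Key": "private_key",
}


def dremio_fields_from_key(key_json: str) -> dict:
    """Return the values to enter in Dremio, keyed by field label."""
    key = json.loads(key_json)
    if key.get("type") != "service_account":
        raise ValueError("not a service account key file")
    missing = [name for name in DREMIO_FIELD_TO_KEY.values() if name not in key]
    if missing:
        raise ValueError("key file is missing: " + ", ".join(missing))
    return {label: key[name] for label, name in DREMIO_FIELD_TO_KEY.items()}


# Placeholder key contents, mirroring the example file above.
sample_key = json.dumps({
    "type": "service_account",
    "project_id": "project-id",
    "private_key_id": "key-id",
    "private_key": "-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----\n",
    "client_email": "service-account-email",
    "client_id": "client-id",
})

for label, value in dremio_fields_from_key(sample_key).items():
    # Avoid echoing the private key to the terminal.
    shown = "<hidden>" if label == "Private Key" else value
    print(f"{label}: {shown}")
```

To use this with a real key, read the downloaded file's contents and pass them to the function in place of `sample_key`. Treat the file and the extracted private key as secrets.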

For additional methods of creating a key (e.g., the gcloud tool or REST APIs), see Google’s documentation.
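
As one illustration of the gcloud route, the sketch below builds the `gcloud iam service-accounts keys create` command from Python. The helper names, the service account email, and the output path are placeholders, not values from this page. Running the command requires an installed and authenticated gcloud CLI, so the example only prints the command line.

```python
import subprocess
from typing import List


def gcloud_create_key_command(service_account_email: str, output_path: str) -> List[str]:
    """Build the gcloud argv that creates a JSON key for a service account."""
    return [
        "gcloud", "iam", "service-accounts", "keys", "create", output_path,
        f"--iam-account={service_account_email}",
    ]


def create_key(service_account_email: str, output_path: str) -> None:
    """Run the command; raises CalledProcessError if gcloud reports a failure."""
    subprocess.run(
        gcloud_create_key_command(service_account_email, output_path),
        check=True,
    )


# Placeholder values for illustration only.
command = gcloud_create_key_command(
    "my-sa@my-project.iam.gserviceaccount.com", "key.json"
)
print(" ".join(command))
```

The resulting key file has the same JSON layout as the one downloaded from the console.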

Advanced Options Tab

The following settings control more advanced functionality in Dremio.

| Field | Description |
| --- | --- |
| Enable asynchronous access when possible | Allows multiple queries to run simultaneously instead of waiting for one query to complete before new tasks are performed. |
| Root Path | The root path for the GCS source. |
| Properties | Additional connection properties, each consisting of a property name and its value. |
| Whitelisted buckets | A list of buckets to whitelist, that is, to allow access to. |
| **Cache Options** | |
| Enable local caching when possible | Dremio locally caches data that it reads from the source. |
| Max percent of total available cache space to use when possible | Sets the maximum percentage of the total available cache space that may be used for local caching. The default is 100. |

Reflection Refresh Tab

This tab controls how often reflections on this data source are refreshed and how long they remain usable before they expire.

| Field | Description |
| --- | --- |
| Never refresh | Prevents any query reflections associated with this source from refreshing. |
| Refresh every | Sets the interval at which reflections for this source are refreshed. The interval can be set in hours, days, or weeks. |
| Never expire | Prevents any query reflections associated with this source from expiring. |
| Expire after | Sets how long after its creation a reflection expires and can no longer be used for queries. The timespan can be set in hours, days, or weeks. |
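
The interaction between the refresh and expiry settings can be illustrated with a small model. This is a sketch of the semantics described above, not Dremio's scheduler; `ReflectionPolicy`, `next_refresh`, and `is_expired` are hypothetical names, and `None` is used here to model the "Never" options.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ReflectionPolicy:
    refresh_every: Optional[timedelta]  # None models "Never refresh"
    expire_after: Optional[timedelta]   # None models "Never expire"


def next_refresh(policy: ReflectionPolicy, last_refresh: datetime) -> Optional[datetime]:
    """Time of the next refresh, or None if refreshing is disabled."""
    if policy.refresh_every is None:
        return None
    return last_refresh + policy.refresh_every


def is_expired(policy: ReflectionPolicy, created_at: datetime, now: datetime) -> bool:
    """True once the expiry timespan has elapsed since the reflection was created."""
    if policy.expire_after is None:
        return False
    return now >= created_at + policy.expire_after


# Example: refresh every hour, expire three hours after creation.
policy = ReflectionPolicy(refresh_every=timedelta(hours=1), expire_after=timedelta(hours=3))
created = datetime(2024, 1, 1, 0, 0)
print(next_refresh(policy, created))
print(is_expired(policy, created, datetime(2024, 1, 1, 2, 59)))
print(is_expired(policy, created, datetime(2024, 1, 1, 3, 0)))
```

In this model, a reflection whose expiry timespan is shorter than its refresh interval becomes unusable before the next refresh replaces it.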

Metadata Tab

This tab offers settings that control how dataset details are fetched and refreshed.

Field Description
Dataset Handling
Remove dataset definitions if underlying data is unavailable If this box is not checked, Dremio keeps the dataset definitions when the underlying files in a folder are removed or when the folder or source is not accessible. Leaving the box unchecked is useful when files are temporarily deleted and then replaced with new sets of files.
Automatically format files into physical datasets when users issue queries If this box is checked and a query runs against an un-promoted PDS/folder, Dremio automatically promotes it using default options. If you have CSV files, especially ones with non-default options, you may prefer to leave this box unchecked.
Metadata Refresh
Dataset Discovery Specifies the refresh interval for top-level source object names, such as database and table names. This is a lightweight operation.
  • Fetch every. Specifies the interval at which Dremio fetches object names. This can be set in minutes, hours, days, or weeks.
Dataset Details Specifies the metadata that Dremio needs for query planning, such as information about fields, types, shards, statistics, and locality.

  • Fetch mode. Restricts when metadata is retrieved.
    • Only Queried Datasets. Dremio updates metadata details only for previously queried objects in a source. This mode improves query performance because less work is needed at query time for these datasets.
    • All Datasets (deprecated). Dremio updates the details for all datasets in a source. This mode improves query performance because less work is needed at query time.
  • Fetch every. Specifies the interval at which metadata is fetched. This can be set in minutes, hours, days, or weeks.
  • Expire after. Specifies how long after a dataset is queried its details expire. This can be set in minutes, hours, days, or weeks.
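
The difference between the two fetch modes can be sketched as a selection rule over datasets. This is a simplified illustration of the descriptions above, not Dremio's implementation; `FetchMode`, `Dataset`, and `due_for_refresh` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Optional


class FetchMode(Enum):
    ONLY_QUERIED = "Only Queried Datasets"
    ALL_DATASETS = "All Datasets (deprecated)"


@dataclass
class Dataset:
    name: str
    last_fetched: datetime
    last_queried: Optional[datetime] = None  # None: never queried


def due_for_refresh(datasets: List[Dataset], mode: FetchMode,
                    fetch_every: timedelta, now: datetime) -> List[str]:
    """Names of datasets whose details are older than the fetch interval."""
    due = []
    for ds in datasets:
        # In "Only Queried Datasets" mode, never-queried datasets are skipped.
        if mode is FetchMode.ONLY_QUERIED and ds.last_queried is None:
            continue
        if now - ds.last_fetched >= fetch_every:
            due.append(ds.name)
    return due


now = datetime(2024, 1, 1, 12, 0)
datasets = [
    Dataset("queried_stale", now - timedelta(hours=2), now - timedelta(minutes=5)),
    Dataset("queried_fresh", now - timedelta(minutes=10), now - timedelta(minutes=5)),
    Dataset("never_queried", now - timedelta(hours=2)),
]
print(due_for_refresh(datasets, FetchMode.ONLY_QUERIED, timedelta(hours=1), now))
# prints ['queried_stale']
print(due_for_refresh(datasets, FetchMode.ALL_DATASETS, timedelta(hours=1), now))
# prints ['queried_stale', 'never_queried']
```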

Privileges Tab

From this tab, administrators may control access to the data source on a user-by-user or group-by-group basis.

For additional information, see the Users, Groups, and Roles page. If you’re using Dremio v16.0+, see the new Access Control functionality.