AWS Glue Data Catalog
Dremio supports Amazon S3 datasets cataloged in AWS Glue as a Dremio data source.
Files in S3 must be one of the following file formats or table formats:
- Apache Iceberg
- Delimited text files (CSV/TSV)
- Delta Lake (Dremio supports reading native Delta Lake tables in AWS Glue. Delta Lake symlink tables must first be crawled so that native Delta Lake tables are created from them. See Introducing native Delta Lake table support with AWS Glue crawlers in the AWS Big Data blog.)
- ORC
- Parquet
AWS Glue data sources added to projects default to using the Apache Iceberg table format. When upgrading, AWS Glue data sources added to projects before Dremio 22 are modified to use the Apache Iceberg table format as the default format.
AWS Glue Credentials
Dremio administrators need credentials to access files in Amazon S3 and list databases and tables in the Glue Catalog. Dremio recommends using the provided sample AWS managed policy when configuring a new Glue Catalog data source. See Dremio Configuration for more information about supported authentication mechanisms.
Dremio reads the table metadata from Glue and directly scans the data on S3 using its high-performance, massively parallel processing (MPP) engine. For this reason, you need to give permissions to connect to Glue as well as the permissions to read the data on S3 for those tables.
AWS IAM Policy for Accessing Amazon S3 and AWS Glue
Dremio recommends using the following AWS managed policy:
IAM policy for accessing Amazon S3 and AWS Glue
{
"Version": "2012-10-17",
"Statement": [
# Allow Dremio to run the listed AWS Glue API operations.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:GetDatabases",
"glue:GetPartition",
"glue:GetPartitions",
"glue:GetTable",
"glue:GetTableVersions",
"glue:GetTables",
"glue:GetConnection",
"glue:GetConnections",
"glue:GetDevEndpoint",
"glue:GetDevEndpoints",
"glue:GetUserDefinedFunction",
"glue:GetUserDefinedFunctions",
"glue:BatchGetPartition"
],
"Resource": [
"*"
]
},
# Allow Dremio to read and write files in a bucket.
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::aws-glue-*/*",
"arn:aws:s3:::*/*aws-glue-*/*"
]
},
# Allow Dremio to access the Amazon S3 buckets or folders with names containing either the 'aws-glue-' or 'crawler-public' prefixes.
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::crawler-public*",
"arn:aws:s3:::aws-glue-*"
]
},
# Allow Dremio to create or delete the 'aws-glue-service-resource' tag on EC2 network interfaces, security groups, and instances.
{
"Effect": "Allow",
"Action": [
"ec2:CreateTags",
"ec2:DeleteTags"
],
"Condition": {
"ForAllValues:StringEquals": {
"aws:TagKeys": [
"aws-glue-service-resource"
]
}
},
"Resource": [
"arn:aws:ec2:*:*:network-interface/*",
"arn:aws:ec2:*:*:security-group/*",
"arn:aws:ec2:*:*:instance/*"
]
}
]
}
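The lines beginning with # in the policy above are annotations. JSON has no comment syntax, so they need to be removed before the policy is pasted into the IAM console or passed to a tool. The following is a minimal Python sketch of that clean-up, assuming every annotation occupies a whole line that starts with #. The helper name strip_policy_annotations is hypothetical, and the sample text is an abbreviated fragment for illustration, not the full policy.

```python
import json


def strip_policy_annotations(annotated: str) -> dict:
    """Drop whole-line '#' annotations and parse the remainder as JSON."""
    kept = [
        line for line in annotated.splitlines()
        if not line.lstrip().startswith("#")
    ]
    # json.loads raises json.JSONDecodeError if the remainder is not valid JSON.
    return json.loads("\n".join(kept))


# Abbreviated fragment for illustration only, not the full policy.
SAMPLE = """
{
"Version": "2012-10-17",
"Statement": [
# Allow Dremio to run the listed AWS Glue API operations.
{
"Effect": "Allow",
"Action": ["glue:GetDatabase", "glue:GetTables"],
"Resource": ["*"]
}
]
}
"""

policy = strip_policy_annotations(SAMPLE)
print(json.dumps(policy, indent=2))
```

Parsing the result with json.loads doubles as a syntax check: a missing comma or bracket raises an error instead of producing a policy that IAM later rejects.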
Configuring AWS Glue Data Catalog as a Source
- On the Datasets page, to the right of Sources in the left panel, click the icon to add a source.
- In the Add Data Source dialog, under Metastores, select AWS Glue Data Catalog.
General
Users with proper privileges can configure access to AWS Glue Catalog with one of the three authentication methods.
Name
Specify a name for the data source. You cannot change the name after the source is created. The name cannot include the following special characters: /, :, [, or ].
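Because the name cannot be changed after the source is created, it can help to check a proposed name before saving. The sketch below only tests for the four special characters listed above. The function name invalid_name_chars is hypothetical, and Dremio may enforce further rules that are not covered here.

```python
# The four characters the documentation says a source name cannot include.
FORBIDDEN_NAME_CHARS = set("/:[]")


def invalid_name_chars(name: str) -> list[str]:
    """Return the forbidden characters found in a proposed source name, sorted."""
    return sorted(set(name) & FORBIDDEN_NAME_CHARS)


print(invalid_name_chars("glue_prod"))       # []
print(invalid_name_chars("glue/prod:[eu]"))  # ['/', ':', '[', ']']
```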
AWS Region Selection
Specify the AWS region whose Glue tables you want to see. Only tables from this region are shown after the connection is made.
Authentication
Choose one of the following authentication methods:
- AWS Access Key: All or allowed (if specified) buckets associated with this access key or IAM role to assume, if provided, will be available.
  - Under AWS Access Key, enter the AWS access key ID.
  - Under AWS Access Secret, provide the AWS access secret using one of the following methods:
    - Dremio: Provide the AWS access secret in plain text. Dremio stores the AWS access secret.
    - Azure Key Vault: Provide the URI for the Azure Key Vault secret that stores the AWS access secret. The URI format is https://<vault_name>.vault.azure.net/secrets/<secret_name> (for example, https://myvault.vault.azure.net/secrets/mysecret).
      Note: To use Azure Key Vault as your application secret store, you must:
      - Deploy Dremio on Azure.
      - Complete the Requirements for Authenticating with Azure Key Vault.
      It is not necessary to restart the Dremio coordinator when you rotate secrets stored in Azure Key Vault. Read Requirements for Secrets Rotation for more information.
    - AWS Secrets Manager: Provide the Amazon Resource Name (ARN) for the AWS Secrets Manager secret that holds the AWS access secret, which is available in the AWS web console or using command line tools.
    - HashiCorp Vault: Choose the HashiCorp secrets engine you're using from the dropdown menu and enter the secret reference for the AWS access secret in the correct format in the provided field.
  - Under IAM Role to Assume, enter the IAM role that Dremio should assume in conjunction with the AWS Access Key authentication method.
- EC2 Metadata: All or allowed (if specified) buckets associated with the specified IAM role attached to EC2 or IAM role to assume, if provided, will be available.
  - Under IAM Role to Assume, enter the IAM role that Dremio should assume in conjunction with the EC2 Metadata authentication method.
- AWS Profile: Dremio sources profile credentials from the specified AWS profile. For information on how to set up a configuration or credentials file for AWS, see AWS Custom Authentication.
  - AWS profile (optional): The AWS profile name. If this is left blank, then the default profile will be used. For more information about using profiles in a credentials or configuration file, see AWS's documentation on Configuration and credential file settings.
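For the Azure Key Vault option above, the secret URI has a fixed shape, so a quick check before entering it can catch typos. The following sketch matches only the format stated above. The helper name parse_key_vault_uri and the allowed character set (letters, digits, and hyphens) are assumptions, and the pattern does not account for any additional URI segments that Azure may accept.

```python
import re

# Pattern derived from the documented format:
#   https://<vault_name>.vault.azure.net/secrets/<secret_name>
# The character class (letters, digits, hyphens) is an assumption.
KEY_VAULT_SECRET_URI = re.compile(
    r"^https://(?P<vault>[A-Za-z0-9-]+)\.vault\.azure\.net"
    r"/secrets/(?P<secret>[A-Za-z0-9-]+)$"
)


def parse_key_vault_uri(uri: str):
    """Return (vault_name, secret_name), or None if the URI does not match."""
    m = KEY_VAULT_SECRET_URI.match(uri)
    return (m.group("vault"), m.group("secret")) if m else None


print(parse_key_vault_uri("https://myvault.vault.azure.net/secrets/mysecret"))
print(parse_key_vault_uri("http://myvault.vault.azure.net/secrets/mysecret"))
```

The first call uses the example URI from the text and returns the vault and secret names. The second returns None because the scheme is http rather than https.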
The Encrypt connection option is enabled by default to encrypt the connection to AWS Glue. Clear the checkbox to disable encryption.
Allowed Databases
If you want to limit the list of databases that are accessible via Dremio, add one or more databases that you want to allow access to. By default, all databases in a Glue catalog are accessible. By adding allowed databases, you limit access to those databases only. Databases entered must be valid. Misspelled or non-existent databases will not appear in the resulting source. If you add or remove one or more databases from the list, access to them is revoked immediately after you click Save.
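The allow-list behavior described above can be modeled in a few lines. This is an illustrative sketch, not Dremio's implementation: the function name visible_databases is hypothetical, and exact, case-sensitive name matching is an assumption.

```python
def visible_databases(catalog_databases, allowed=None):
    """Model of the allow-list behavior described above.

    An empty or missing allow list exposes every database. Otherwise only
    names that are both allowed and present in the catalog are shown.
    """
    if not allowed:
        return sorted(catalog_databases)
    return sorted(set(catalog_databases) & set(allowed))


catalog = ["sales", "marketing", "finance"]
print(visible_databases(catalog))                        # all three databases
print(visible_databases(catalog, ["sales", "finanse"]))  # misspelled entry is dropped
```

In the second call only sales is returned, which mirrors the statement that misspelled or non-existent databases do not appear in the resulting source.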
Advanced Options
All configurations are optional.
Enable asynchronous access when possible
By default, Dremio enables asynchronous access and local caching when possible so that asynchronous requests do not wait for data to return from S3. Activating this option can enable faster query times.
Connection Properties
A list of additional connection properties that can be specified to use with the connection.
Locations in which Iceberg Tables are Created
Where the CREATE TABLE command creates an Iceberg table depends on the type of data source being used. For AWS Glue data sources, the root directory is assumed by default to be /user/hive/warehouse. If you want to create tables in a different location, you must specify the S3 address of an Amazon S3 bucket in which to create them:
- On the Advanced Options page of the Edit Source dialog, add this connection property: hive.metastore.warehouse.dir.
- Set the value to the S3 address of an S3 bucket.
The schema path and table name are appended to the root location to determine the default physical location for a new Iceberg table.
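As a rough illustration of how the default location is composed, the sketch below appends a schema path and a table name to a root location. The function name default_table_location, the bucket name, and the use of plain / separators are assumptions. The exact path that Dremio generates is not specified here and may differ in detail.

```python
def default_table_location(root: str, schema_path: list[str], table: str) -> str:
    """Append the schema path and table name to the warehouse root."""
    parts = [root.rstrip("/"), *schema_path, table]
    return "/".join(parts)


# Root taken from a hypothetical hive.metastore.warehouse.dir value.
print(default_table_location("s3://example-bucket/warehouse", ["sales", "emea"], "orders"))
# Root left at the documented default.
print(default_table_location("/user/hive/warehouse", ["sales"], "orders"))
```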
Lake Formation Integration
Lake Formation provides access controls and allows administrators to define security policies. For how to enable this functionality and for details on the configuration options below, see the Integrating with Lake Formation page.
- Enforce AWS Lake Formation access permissions on datasets. Dremio checks any datasets included in the AWS Glue source for the required permissions to perform queries.
- Prefix to map Dremio users to AWS ARNs. Leave blank to default to the end user's username, or enter a regular expression.
- Prefix to map Dremio groups to AWS ARNs. Leave blank to default to the end user's group, or enter a regular expression.
Reflection Refresh
Specify how frequently Dremio refreshes Data Reflections based on the AWS Glue data source in the Reflection Refresh tab. By default, Dremio refreshes Reflections every hour and expires them after three hours.
- Never refresh -- Specifies how often to refresh based on hours, days, weeks, or never.
- Never expire -- Specifies how often to expire based on hours, days, weeks, or never.
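To make the default timings concrete, the sketch below computes when a Reflection would next refresh and when it would expire. It assumes that both intervals are measured from the last refresh, which is not stated here, and it uses None to stand for the "never" setting. The function name reflection_schedule is hypothetical.

```python
from datetime import datetime, timedelta

# Defaults stated above: refresh every hour, expire after three hours.
DEFAULT_REFRESH = timedelta(hours=1)
DEFAULT_EXPIRE = timedelta(hours=3)


def reflection_schedule(last_refresh, refresh_every=DEFAULT_REFRESH,
                        expire_after=DEFAULT_EXPIRE):
    """Return (next_refresh, expires_at); None stands for 'never'."""
    next_refresh = None if refresh_every is None else last_refresh + refresh_every
    expires_at = None if expire_after is None else last_refresh + expire_after
    return next_refresh, expires_at


last = datetime(2024, 1, 1, 12, 0)
print(reflection_schedule(last))                      # defaults
print(reflection_schedule(last, refresh_every=None))  # "Never refresh" selected
```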
Metadata
Specify how and how frequently Dremio refreshes metadata on the Metadata tab. By default, Dremio fetches top-level objects and dataset details every hour. Dremio retrieves details only for queried datasets by default to improve query performance.
Privileges
On the Privileges tab, you can grant privileges to specific users or roles. See Access Controls for additional information about privileges.
All privileges are optional.
- For Privileges, enter the user name or role name that you want to grant access to and click the Add to Privileges button. The added user or role is displayed in the USERS/ROLES table.
- For the users or roles in the USERS/ROLES table, toggle the checkmark for each privilege you want to grant on the Dremio source that is being created.
- Click Save after setting the configuration.
Updating an AWS Glue Data Catalog Source
To update an AWS Glue Data Catalog source:
- On the Datasets page, under Metastores in the panel on the left, find the name of the source you want to edit.
- Right-click the source name and select Settings from the list of actions. Alternatively, click the source name and then the settings icon at the top right corner of the page.
- In the Source Settings dialog, edit the settings you wish to update. Dremio does not support updating the source name. For information about the settings options, see Configuring AWS Glue Data Catalog as a Source.
- Click Save.
Deleting an AWS Glue Data Catalog Source
If the source is in a bad state (for example, Dremio cannot authenticate to the source or the source is otherwise unavailable), only users who belong to the ADMIN role can delete the source.
To delete an AWS Glue Data Catalog source, perform these steps:
- On the Datasets page, click Sources > Metastores in the panel on the left.
- In the list of data sources, hover over the name of the source you want to remove and right-click.
- From the list of actions, click Delete.
- In the Delete Source dialog, click Delete to confirm that you want to remove the source.
Deleting a source causes all downstream views that depend on objects in the source to break.