AWS Glue Data Catalog
Dremio supports Amazon S3 datasets cataloged in AWS Glue as a Dremio data source.
Files in S3 must be one of the following file formats or table formats:
- Apache Iceberg
- Delimited text files (CSV/TSV)
- Delta Lake (Dremio supports reading native Delta Lake tables in AWS Glue. Delta Lake symlink tables must first be crawled so that native Delta Lake tables are created from them. See "Introducing native Delta Lake table support with AWS Glue crawlers" on the AWS Big Data Blog.)
AWS Glue Credentials
Dremio administrators need credentials to access files in Amazon S3 and list databases and tables in the Glue Catalog. Dremio recommends using the provided sample AWS managed policy when configuring a new Glue Catalog data source. See Dremio Configuration for more information about supported authentication mechanisms.
Dremio reads table metadata from Glue and scans the data directly on S3 using its high-performance, massively parallel processing (MPP) engine. The credentials therefore need permission both to connect to Glue and to read the S3 data behind those tables.
AWS IAM Policy for Accessing Amazon S3 and AWS Glue
Dremio recommends using an IAM policy for accessing Amazon S3 and AWS Glue that grants the following permissions:
- Allow Dremio to run the AWS Glue API operations listed in the policy.
- Allow Dremio to read and write files in a bucket.
- Allow Dremio to access the Amazon S3 buckets or folders with names containing either the 'aws-glue-' or 'crawler-public' prefixes.
- Allow Dremio to create or delete tags in the Glue catalog.
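The policy body itself is not reproduced on this page. As a rough illustration only, the four permission groups described above might be assembled into an IAM policy document like this. The specific action names, the `example-bucket` name, the resource ARN patterns, and the `build_policy` helper are all assumptions made for the sketch, not Dremio's published policy; use the policy from the Dremio documentation for a real deployment.

```python
import json

# Assumed subset of read-only Glue actions; the real policy may list more.
GLUE_READ_ACTIONS = [
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetPartition",
    "glue:GetPartitions",
]


def build_policy(bucket="example-bucket"):
    """Return an illustrative IAM policy document as a plain dict.

    `bucket` is a hypothetical bucket name used only for this sketch.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            # 1. Run the listed AWS Glue API operations.
            {"Effect": "Allow", "Action": GLUE_READ_ACTIONS, "Resource": ["*"]},
            # 2. Read and write files in a bucket.
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
            # 3. Access buckets whose names carry the 'aws-glue-' or
            #    'crawler-public' prefixes (patterns are assumptions).
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [
                    "arn:aws:s3:::aws-glue-*",
                    "arn:aws:s3:::crawler-public*",
                ],
            },
            # 4. Create or delete tags in the Glue catalog.
            {
                "Effect": "Allow",
                "Action": ["glue:TagResource", "glue:UntagResource"],
                "Resource": ["*"],
            },
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON, ready to paste into the IAM console.
    print(json.dumps(build_policy(), indent=2))
```

Building the document as a dict and serializing it with `json.dumps` keeps the statements easy to review one group at a time before the JSON is attached to an IAM user or role.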
Users with proper privileges can configure access to AWS Glue Catalog with one of the three authentication methods.
Specify a name for the data source. You cannot change the name after the source is created.
AWS Region Selection
Specify the AWS region whose Glue tables you want to see. After the connection is made, only tables from this region are shown.
- AWS Access Key -- Makes available all buckets, or only the allowed buckets if specified, that are associated with this access key or, if provided, with the IAM role to assume.
- EC2 Metadata -- Makes available all buckets, or only the allowed buckets if specified, that are associated with the IAM role attached to the EC2 instance or, if provided, with the IAM role to assume.
- AWS Profile -- Dremio sources credentials from the specified AWS profile. The profile name is optional; if it is left blank, the default profile is used. For information on how to set up a configuration or credentials file for AWS, see AWS Custom Authentication. For more information about using profiles in a credentials or configuration file, see the AWS documentation on Configuration and credential file settings.
Note: If no authentication method is specified, only the buckets provided in Public Buckets will be available.
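To make the AWS Profile option more concrete, the sketch below parses a credentials file in the standard AWS INI layout and looks up a named profile, falling back to `default` when the name is blank. It only illustrates the file format; the `load_profile` helper, the `dremio-glue` profile name, and the sample key values are hypothetical, and this is not a description of how Dremio itself resolves credentials.

```python
import configparser

# Sample credentials file content in the standard AWS INI layout.
# All key values below are made-up placeholders.
SAMPLE_CREDENTIALS = """\
[default]
aws_access_key_id = AKIAEXAMPLEDEFAULT
aws_secret_access_key = example-default-secret

[dremio-glue]
aws_access_key_id = AKIAEXAMPLEGLUE
aws_secret_access_key = example-glue-secret
"""


def load_profile(credentials_text, profile_name=""):
    """Return the key pair for a profile; a blank name means 'default'."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    name = profile_name or "default"
    if not parser.has_section(name):
        raise KeyError(f"profile not found: {name}")
    section = parser[name]
    return {
        "access_key_id": section["aws_access_key_id"],
        "secret_access_key": section["aws_secret_access_key"],
    }


if __name__ == "__main__":
    # A blank profile name resolves to the [default] section.
    print(load_profile(SAMPLE_CREDENTIALS)["access_key_id"])
    print(load_profile(SAMPLE_CREDENTIALS, "dremio-glue")["access_key_id"])
```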
Encrypt connection -- Enabled by default to encrypt the connection to AWS Glue. Clear the checkbox to disable encryption.
All configurations are optional.
Enable asynchronous access when possible
Enabled by default. When possible, Dremio uses asynchronous access and local caching so that requests do not block while waiting for data to return from S3, which can reduce query times.
You can also specify a list of additional connection properties to use with the connection.
Lake Formation Integration
Lake Formation provides access controls and allows administrators to define security policies. See the Integrating with Lake Formation page for how to enable this functionality and for details on the configuration options below.
- Enforce AWS Lake Formation access permissions on datasets -- Dremio checks every dataset in this source for the permissions required to run queries.
- Prefix to map Dremio users to AWS ARNs -- Leave blank to default to the end user's username, or enter a regular expression.
- Prefix to map Dremio groups to AWS ARNs -- Leave blank to default to the end user's group, or enter a regular expression.
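As an illustration of the prefix options above, the following sketch shows one plausible reading: the configured prefix is prepended to the Dremio user or group name to form an IAM ARN, and a blank prefix leaves the name unchanged. The account ID, both prefixes, and the `to_arn` helper are made up for the example. The regular-expression variant is not shown because its semantics are not described on this page; see the Integrating with Lake Formation page for the actual behavior.

```python
# Hypothetical prefixes; 123456789012 is a placeholder account ID.
USER_PREFIX = "arn:aws:iam::123456789012:user/"
GROUP_PREFIX = "arn:aws:iam::123456789012:group/"


def to_arn(prefix, name):
    """Prepend `prefix` to a Dremio user or group name.

    A blank prefix returns the name unchanged, mirroring the
    "leave blank to default to the username/group" wording.
    """
    return f"{prefix}{name}" if prefix else name


if __name__ == "__main__":
    print(to_arn(USER_PREFIX, "alice"))
    print(to_arn(GROUP_PREFIX, "analysts"))
    print(to_arn("", "alice"))
```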
On the Reflection Refresh tab, specify how frequently Dremio refreshes Data Reflections based on the AWS Glue data source. By default, Dremio refreshes Reflections every hour and expires them after three hours.
- Never refresh -- Select to prevent Reflections from refreshing. Otherwise, specify how often to refresh in hours, days, or weeks.
- Never expire -- Select to prevent Reflections from expiring. Otherwise, specify how long until they expire in hours, days, or weeks.
On the Metadata tab, specify how and how frequently Dremio refreshes metadata. By default, Dremio fetches top-level objects and dataset details every hour. To improve query performance, Dremio retrieves details only for queried datasets by default.
On the Privileges page, you can grant privileges to specific users or roles. See Access Controls for additional information about user privileges.
- For Privileges, enter the user name or role name that you want to grant access to and click the Add to Privileges button. The added user or role is displayed in the Users table.
- For the users or roles in the Users table, toggle the green checkmark for each privilege you want to grant on the Dremio source that is being created.
- Click Save after setting the configuration.