AWS Glue Data Catalog
The AWS Glue Data Catalog is a metadata store that lets you store and share metadata in the AWS Cloud. To connect to the AWS Glue Data Catalog as a source, you must configure both your AWS and Dremio accounts.
Supported Formats
Dremio can query data stored in S3 in file formats (including delimited, Excel (XLSX), and Parquet) and in the Apache Iceberg and Delta Lake table formats.
Configuring Your AWS Account
To allow Dremio to access the metadata store in your AWS Glue Data Catalog, you must configure your AWS account using one of the following authentication methods: project data credentials or data source credentials. Additionally, to enable Dremio to access the Data Catalog, you need to modify IAM permissions during configuration by attaching the IAM policy, which is provided after the authentication instructions.
Review each authentication method below and choose the one that best meets your needs.
Authentication Using Project Data Credentials
Use project data credentials to enable Dremio to access the Data Catalog using the IAM role that is associated with your Dremio project. This IAM role was created when you signed up for Dremio and is the default credential that is used to access all the sources in your project.
For this option, you attach the IAM policy template, which enables Dremio to access the AWS Glue Data Catalog, to your Dremio project's IAM role.
For instructions on setting up the policy and attaching it to an IAM role, see Set up AWS IAM Permissions.
Authentication Using Data Source Credentials
Use data source credentials to access the AWS Glue Data Catalog using either a source-specific access key or an IAM role.
When you use a source-specific IAM role, the project key/role assumes that role to access the Data Catalog. To use source-specific credentials:
- Create an IAM role/user and add policies that give Dremio access to the Data Catalog.
- Modify the project key/role to grant it permissions to assume the source-specific role you created.
For steps on how to add IAM policies, see:
- Creating an IAM Role if you are using an IAM role.
- Creating an IAM User if you are using an IAM user.
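The two steps above correspond to two standard AWS policy documents: a permissions policy on the project role that allows `sts:AssumeRole` on the source-specific role, and a trust policy on the source-specific role that names the project role as a principal. The following is a minimal sketch of both documents. The ARNs and function names are hypothetical, and the linked Dremio instructions remain the authority for the exact policies to use.

```python
import json

def assume_role_permission(source_role_arn: str) -> dict:
    """Permissions policy for the project role: lets it assume the source-specific role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "sts:AssumeRole", "Resource": source_role_arn}
        ],
    }

def trust_policy(project_role_arn: str) -> dict:
    """Trust policy for the source-specific role: names the project role as a trusted principal."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": project_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }

# Hypothetical ARNs, for illustration only.
PROJECT_ROLE_ARN = "arn:aws:iam::111122223333:role/example-dremio-project-role"
SOURCE_ROLE_ARN = "arn:aws:iam::111122223333:role/example-glue-source-role"

print(json.dumps(assume_role_permission(SOURCE_ROLE_ARN), indent=2))
print(json.dumps(trust_policy(PROJECT_ROLE_ARN), indent=2))
```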
IAM Policy Template for Accessing the AWS Glue Data Catalog
The following IAM policy contains the minimum policy requirements to allow Dremio Cloud to read and query the Data Catalog.
IAM policy template for AWS Glue Catalog
{
"Version": "2012-10-17",
"Statement": [
# Allow Dremio to run the listed AWS Glue API operations.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:GetDatabases",
"glue:GetPartition",
"glue:GetPartitions",
"glue:GetTable",
"glue:GetTableVersions",
"glue:GetTables",
"glue:GetConnection",
"glue:GetConnections",
"glue:GetDevEndpoint",
"glue:GetDevEndpoints",
"glue:GetUserDefinedFunction",
"glue:GetUserDefinedFunctions",
"glue:BatchGetPartition"
],
"Resource": [
"*"
]
},
# Allow Dremio to read and write files in a bucket.
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::aws-glue-*/*",
"arn:aws:s3:::*/*aws-glue-*/*"
]
},
# Allow Dremio to access the Amazon S3 buckets or folders with names containing either the 'aws-glue-' or 'crawler-public' prefixes.
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::crawler-public*",
"arn:aws:s3:::aws-glue-*"
]
},
# Allow Dremio to create or delete the 'aws-glue-service-resource' tag on the listed EC2 resources.
{
"Effect": "Allow",
"Action": [
"ec2:CreateTags",
"ec2:DeleteTags"
],
"Condition": {
"ForAllValues:StringEquals": {
"aws:TagKeys": [
"aws-glue-service-resource"
]
}
},
"Resource": [
"arn:aws:ec2:*:*:network-interface/*",
"arn:aws:ec2:*:*:security-group/*",
"arn:aws:ec2:*:*:instance/*"
]
}
]
}
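The template above annotates each statement with lines that begin with `#`. JSON has no comment syntax, so those lines need to be removed before the document can be parsed as JSON or pasted into the IAM policy editor. The sketch below shows one way to do that with the Python standard library. The helper name and the shortened excerpt are illustrative assumptions, and the approach assumes that no string value in the policy spans a line starting with `#`.

```python
import json

def load_policy_template(text: str) -> dict:
    """Drop lines whose first non-blank character is '#', then parse the rest as JSON."""
    kept = [line for line in text.splitlines() if not line.lstrip().startswith("#")]
    return json.loads("\n".join(kept))

# Shortened excerpt of the template, for illustration only.
TEMPLATE_EXCERPT = """
{
  "Version": "2012-10-17",
  "Statement": [
    # Allow Dremio to run the listed AWS Glue API operations.
    {
      "Effect": "Allow",
      "Action": ["glue:GetDatabase", "glue:GetTables"],
      "Resource": ["*"]
    }
  ]
}
"""

policy = load_policy_template(TEMPLATE_EXCERPT)

# Collect the Glue actions granted across all statements.
glue_actions = [
    action
    for statement in policy["Statement"]
    for action in statement["Action"]
    if action.startswith("glue:")
]
print(glue_actions)
```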
Adding an AWS Glue Data Catalog
To add an AWS Glue Data Catalog to your project:
- From the Datasets page, in the Data Lakes section, click (+) Add Source.
Alternatively, click Data Lakes to display all data lake sources. From the top-right of the page, click the Add Data Lake button.
- In the Add Data Source dialog, under Metastores, click AWS Glue Data Catalog.
The New AWS Glue Data Catalog Source dialog box appears, which contains the following sections:
- General settings: required fields to add a Data Catalog
- Optional settings: Advanced Options, Reflection Refresh, Metadata, Privileges
Refer to the following for guidance on how to complete each section.
General
Users with proper privileges can configure access to an AWS Glue catalog.
Name
Specify a name for the data source. You cannot change the name after the source is created. The name cannot include the following special characters: /, :, [, or ].
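Because the name cannot be changed later, it can be worth checking a proposed name before creating the source. The following is a minimal sketch of such a check. The function names are hypothetical, and rejecting an empty name is an assumption of this sketch rather than a documented rule.

```python
# Characters that a source name cannot include: /, :, [, and ].
FORBIDDEN_CHARS = set("/:[]")

def invalid_name_chars(name: str) -> list[str]:
    """Return the forbidden characters found in a proposed source name, sorted."""
    return sorted(set(name) & FORBIDDEN_CHARS)

def is_valid_source_name(name: str) -> bool:
    """True if the name is non-empty and contains none of the forbidden characters."""
    # Rejecting the empty string is an assumption of this sketch.
    return bool(name) and not invalid_name_chars(name)

print(is_valid_source_name("glue_prod"))
print(invalid_name_chars("glue/prod:[eu]"))
```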
AWS Region Selection
Specify a region from which you want to see the tables from AWS Glue. Only tables from this region will be shown after the connection is made.
Authentication
Select your preferred authentication method:
- Choose Project Data Credentials if you prefer to use the default credentials that allow access to all sources in your project. These credentials were added when you originally signed up for Dremio and created your project. For setup instructions using this method, see Project Data Credentials with Access Key/IAM Role.
- Choose Data Source Credentials if you prefer to use credentials to access a specific source. These credentials are created as part of the source configuration setup. You can choose to set up these credentials using either an access key or IAM role.
  - To set up the credentials using an access key, see Data Source Credentials with Access Key.
  - To set up the credentials using an IAM role, see Data Source Credentials with IAM Role.
Allowed Databases
The allowed databases configuration is a post-connection filter on the databases visible from AWS Glue. When selective access to the databases within AWS Glue is required, the allowed databases filter will limit access within Dremio to only the needed databases per source connection, thus improving data security and source metadata refresh performance.
When the allowed database filter is empty, all databases from the AWS Glue source are visible in Dremio. When a database is added or removed from the filter, Dremio performs an asynchronous update to expose new databases and remove databases not included in the filter. Each entry in the allowed database filter must be a valid database name; misspelled or nonexistent databases are ignored.
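The filter semantics described above can be sketched in a few lines: an empty filter exposes every database, and entries that do not match an existing database have no effect. This is an illustration of the described behavior, not Dremio's implementation. The function name is hypothetical, and exact, case-sensitive matching is an assumption.

```python
def visible_databases(glue_databases: list[str], allowed: list[str]) -> list[str]:
    """Return the databases that would be visible for a given allowed-databases filter."""
    if not allowed:
        # Empty filter: all databases from the AWS Glue source are visible.
        return list(glue_databases)
    allowed_set = set(allowed)
    # Misspelled or nonexistent entries match nothing, so they are effectively ignored.
    return [db for db in glue_databases if db in allowed_set]

databases = ["sales", "hr", "ops"]
print(visible_databases(databases, []))
print(visible_databases(databases, ["hr", "salse"]))
```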
Encrypted Connection
To secure the connections between AWS Glue and Dremio, tick the Encrypt connection checkbox.
Advanced Options
All configurations are optional.
Enable asynchronous access when possible
This option is activated by default and enables cloud caching for the Data Catalog when you add or edit a source. Clear the checkbox to deactivate asynchronous access.
Connection Properties
You can add key-value pairs to provide custom connection properties relevant to the source.
- Click Add Property.
- For Name, enter a connection property.
- For Value, enter the corresponding connection property value.
Lake Formation
Lake Formation provides access controls and allows administrators to define security policies. For how to enable this functionality and for details on the configuration options below, see the Integrating with Lake Formation page.
- Enforce AWS Lake Formation access permissions on datasets. Dremio checks any datasets included in the AWS Glue source for the required permissions to perform queries.
- Prefix to map Dremio users to AWS ARNs. Leave blank to default to the end user's username, or enter a regular expression.
- Prefix to map Dremio groups to AWS ARNs. Leave blank to default to the end user's group, or enter a regular expression.
Cache Options
Enable local caching when possible -- Selected by default, along with asynchronous access for cloud caching. Clear the checkbox to disable this option. For more information about local caching, see Columnar Cloud Cache.
Max percent of total available cache space to use when possible -- Specifies the disk quota, as a percentage, that a source can use on any single executor node. This setting applies only when local caching is enabled. The default is 100 percent of the total disk space available on the mount point provided for caching. You can either enter a percentage in the value field or use the arrows to the far right to adjust the percentage.
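As a quick worked example of the percentage quota, the sketch below converts a percentage into the space a source could use on one executor node. The function name and the 500 GiB mount point size are hypothetical, and rounding down to whole bytes is an assumption.

```python
def cache_quota_bytes(total_cache_bytes: int, max_percent: int = 100) -> int:
    """Space a source may use on one executor node, given the cache mount point size."""
    if not 0 <= max_percent <= 100:
        raise ValueError("max_percent must be between 0 and 100")
    # Integer arithmetic; rounds down to a whole number of bytes.
    return total_cache_bytes * max_percent // 100

GIB = 1024 ** 3
# A hypothetical 500 GiB cache mount point with the quota set to 40 percent.
print(cache_quota_bytes(500 * GIB, 40) // GIB)
```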
Reflection Refresh
Click Reflection Refresh in the left menu sidebar. This section lets you manage how often reflections are refreshed and how long data can be served before expiration. To learn more about reflections, see Refreshing Reflections.
All reflection parameters are optional.
You can set the following refresh policies for reflections:
- Refresh period: Manage the refresh period by either enabling the option to never refresh or setting a refresh frequency in hours, days, or weeks. The default frequency to refresh reflections is every hour.
- Expiration period: Set the expiration period for the length of time that data can be served by either enabling the option to never expire or setting an expiration time in hours, days, or weeks. The default expiration time is set to three hours.
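To make the two periods concrete, the following sketch computes when the next refresh would be due and when the data would stop being served, using the default values of one hour and three hours. It assumes that both periods are measured from the time of the last refresh, which this page does not state. The function name is hypothetical, and passing `None` stands for the never-refresh or never-expire option.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

DEFAULT_REFRESH = timedelta(hours=1)      # default refresh frequency
DEFAULT_EXPIRATION = timedelta(hours=3)   # default expiration time

def reflection_schedule(
    last_refresh: datetime,
    refresh_period: Optional[timedelta] = DEFAULT_REFRESH,
    expiration_period: Optional[timedelta] = DEFAULT_EXPIRATION,
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return (next refresh due, expiration time); None means never."""
    # Assumption: both periods are counted from the last refresh.
    next_refresh = None if refresh_period is None else last_refresh + refresh_period
    expires_at = None if expiration_period is None else last_refresh + expiration_period
    return next_refresh, expires_at

last = datetime(2024, 1, 1, 9, 0)
print(reflection_schedule(last))
```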
Metadata
Click Metadata in the left menu sidebar. This section lets you configure settings to refresh metadata and enable other dataset options.
All metadata parameters are optional.
You can configure Dataset Handling and Metadata Refresh parameters.
Dataset Handling
You can review each option provided in the following table to set up the dataset handling options to meet your needs.
Parameter | Description |
---|---|
Remove dataset definitions if underlying data is unavailable | By default, Dremio Cloud removes dataset definitions if underlying data is unavailable. This option is for scenarios when files are temporarily deleted and added back in the same location with new sets of files. |
Metadata Refresh
The Metadata Refresh parameters include Dataset Discovery and Dataset Details.
- Dataset Discovery: The refresh interval for fetching top-level source object names such as databases and tables. Use this parameter to set the time interval.
You can choose to set the frequency to fetch object names in minutes, hours, days, or weeks. The default frequency to fetch object names is 1 hour.
- Dataset Details: The metadata that Dremio needs for query planning such as information required for fields, types, shards, statistics, and locality. The following table describes the parameters that fetch the dataset information.
Parameter | Description |
---|---|
Fetch mode | You can choose to fetch only from queried datasets, which is the default setting. Dremio updates details for previously queried objects in a source. Fetching from all datasets is deprecated. |
Fetch every | You can choose to set the frequency to fetch dataset details in minutes, hours, days, or weeks. The default frequency to fetch dataset details is one hour. |
Expire after | You can choose to set the expiry time of dataset details in minutes, hours, days, or weeks. The default expiry time of dataset details is three hours. |
Privileges
Click Privileges in the left menu sidebar. This section lets you grant privileges to specific users or roles. To learn more about how Dremio allows for the implementation of granular-level privileges, see Privileges.
All privileges parameters are optional.
To add a privilege for a user or role:
- In the Add User/Role field, enter the user or role name that you want to apply privileges to.
- Click Add to Privileges. The user or role is added to the Users table.
To set privileges for a user or role:
- In the Users table, identify the user to set privileges for and click under the appropriate column (Select, Alter, Create Table, etc.) to either enable or disable that privilege. A green checkmark indicates that the privilege is enabled.
- Click Save.
After you have connected Dremio to the AWS Glue Data Catalog, you’ll be able to edit the Data Catalog and remove it when it is no longer needed.
Editing the AWS Glue Data Catalog
To edit a Data Catalog source:
- From the Datasets page, on the bottom-left of the page, click Data Lakes. A list of data lakes displays.
- In the All Data Lakes section, under the Action column, hover over a data lake to display the hidden Settings (gear) icon, and click the icon > Edit Details.
Alternatively, you can click the name of the data lake, and, from the resulting data lake page, on the upper-right of the page, click the Settings (gear) icon.
- In the Edit Source dialog box, General settings tab, you can update the Authentication credentials. Additionally, you can make changes to any of the optional settings, including Advanced Options, Reflection Refresh, Metadata, and Privileges. For information about these settings and guidance on the changes you can make, see Adding an AWS Glue Data Catalog.
In the General tab, you cannot change the name of the AWS Glue source.
- Click Save.
Removing an AWS Glue Data Catalog Source
To remove a Data Catalog:
- From the Datasets page, on the bottom-left of the page, click Data Lakes. A list of data lakes displays.
- In the All Data Lakes section, under the Actions column, hover over a data lake to display the hidden Settings (gear) icon, and click the icon > Remove Source.
- Confirm that you want to remove the source.
Sources containing a large number of files or tables may take longer to be removed. During this time, the source name is grayed out and shows a spinner icon, indicating the source is being removed. Once complete, the source disappears.
Limitations
- VPC-restricted S3 buckets are not supported.