    AWS Glue Data Catalog

    The AWS Glue Data Catalog is a metadata store that lets you store and share metadata in the AWS Cloud. To connect to the AWS Glue Data Catalog as a source, you must configure both your AWS account and your Dremio account.

    Supported Formats

    Dremio can query data stored in Amazon S3 in file formats, including delimited, Excel (XLSX), JSON, and Parquet, and in the Apache Iceberg table format.

    Configuring Your AWS Account

    To allow Dremio to access the metadata store in your AWS Glue Data Catalog, configure your AWS account using one of two authentication methods: project data credentials or data source credentials. With either method, you must also attach the IAM policy provided after the authentication instructions so that Dremio has the permissions it needs to access the Data Catalog.

    Review each authentication method below and choose the one that best meets your needs.

    Authentication Using Project Data Credentials

    Use project data credentials to enable Dremio to access the Data Catalog using the IAM role that is associated with your Dremio project. This IAM role was created when you signed up for Dremio and is the default credential that is used to access all the sources in your project.

    For this option, attach the IAM policy template to your Dremio project’s IAM role so that Dremio can access the AWS Glue Data Catalog.

    For instructions on setting up the policy and attaching it to an IAM role, see Set up AWS IAM Permissions.

    Authentication Using Data Source Credentials

    Use data source credentials to access the AWS Glue Data Catalog using either a source-specific access key or an IAM role.

    When you use a source-specific IAM role, the project key/role assumes the role that you created to access the Data Catalog. To use source-specific credentials:

    1. Create an IAM role/user and add policies that give Dremio access to the Data Catalog.
    2. Modify the project key/role to grant it permission to assume the source-specific role you created.
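
    For the IAM role case, the two steps above correspond to two policy documents in AWS: a trust policy on the source-specific role, and a permissions statement on the project role. The following is a minimal sketch that builds both as Python dictionaries. The ARNs and function names are hypothetical placeholders, not values supplied by Dremio; substitute your own.

```python
import json


def source_role_trust_policy(project_role_arn: str) -> dict:
    """Trust policy for the source-specific role: lets the project role assume it."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": project_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def project_role_assume_policy(source_role_arn: str) -> dict:
    """Permissions policy for the project role: allows sts:AssumeRole on the source role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": source_role_arn,
            }
        ],
    }


# Hypothetical ARNs, for illustration only.
PROJECT_ROLE_ARN = "arn:aws:iam::111122223333:role/example-dremio-project-role"
SOURCE_ROLE_ARN = "arn:aws:iam::111122223333:role/example-glue-source-role"

trust_policy = source_role_trust_policy(PROJECT_ROLE_ARN)
assume_policy = project_role_assume_policy(SOURCE_ROLE_ARN)

print(json.dumps(trust_policy, indent=2))
print(json.dumps(assume_policy, indent=2))
```

    The source-specific role would additionally need the Data Catalog permissions from the IAM policy template on this page.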

    For steps on how to add IAM policies, see:

    IAM Policy Template for Accessing the AWS Glue Data Catalog

    The following IAM policy contains the minimum permissions that Dremio Cloud needs to read and query the Data Catalog. Each statement has a Sid that identifies its purpose:

    • AllowGlueCatalogRead: allows Dremio to run the listed AWS Glue API operations.
    • AllowGlueBucketReadWrite: allows Dremio to read and write files in buckets whose names start with 'aws-glue-' and under paths that contain 'aws-glue-'.
    • AllowGlueAndCrawlerBucketRead: allows Dremio to access the Amazon S3 buckets or folders with names containing either the 'aws-glue-' or 'crawler-public' prefixes.
    • AllowGlueServiceResourceTags: allows Dremio to create or delete the 'aws-glue-service-resource' tag on the listed Amazon EC2 resources.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowGlueCatalogRead",
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetPartition",
                    "glue:GetPartitions",
                    "glue:GetTable",
                    "glue:GetTableVersions",
                    "glue:GetTables",
                    "glue:GetConnection",
                    "glue:GetConnections",
                    "glue:GetDevEndpoint",
                    "glue:GetDevEndpoints",
                    "glue:GetUserDefinedFunction",
                    "glue:GetUserDefinedFunctions",
                    "glue:BatchGetPartition"
                ],
                "Resource": [
                    "*"
                ]
            },
            {
                "Sid": "AllowGlueBucketReadWrite",
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::aws-glue-*/*",
                    "arn:aws:s3:::*/*aws-glue-*/*"
                ]
            },
            {
                "Sid": "AllowGlueAndCrawlerBucketRead",
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject"
                ],
                "Resource": [
                    "arn:aws:s3:::crawler-public*",
                    "arn:aws:s3:::aws-glue-*"
                ]
            },
            {
                "Sid": "AllowGlueServiceResourceTags",
                "Effect": "Allow",
                "Action": [
                    "ec2:CreateTags",
                    "ec2:DeleteTags"
                ],
                "Condition": {
                    "ForAllValues:StringEquals": {
                        "aws:TagKeys": [
                            "aws-glue-service-resource"
                        ]
                    }
                },
                "Resource": [
                    "arn:aws:ec2:*:*:network-interface/*",
                    "arn:aws:ec2:*:*:security-group/*",
                    "arn:aws:ec2:*:*:instance/*"
                ]
            }
        ]
    }
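
    To see which Amazon S3 objects the Resource patterns in the read/write statement cover, the sketch below approximates IAM wildcard matching, in which * matches any sequence of characters and ? matches a single character. The function names and the example object ARNs are illustrative assumptions; IAM itself performs the real evaluation.

```python
import re

# Resource patterns copied from the s3:GetObject / s3:PutObject statement above.
READ_WRITE_PATTERNS = [
    "arn:aws:s3:::aws-glue-*/*",
    "arn:aws:s3:::*/*aws-glue-*/*",
]


def arn_matches(pattern: str, arn: str) -> bool:
    """Approximate IAM wildcard matching: '*' is any run of characters, '?' is one."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, arn) is not None


def is_covered(arn: str, patterns=READ_WRITE_PATTERNS) -> bool:
    """Return True if any of the Resource patterns matches the object ARN."""
    return any(arn_matches(p, arn) for p in patterns)


# Hypothetical object ARNs, for illustration only.
examples = [
    "arn:aws:s3:::aws-glue-demo/sales/part-0.parquet",
    "arn:aws:s3:::my-lake/aws-glue-temp/part-0.parquet",
    "arn:aws:s3:::my-lake/sales/part-0.parquet",
]
for arn in examples:
    print(arn, is_covered(arn))
```

    In this sketch the last example is not covered. If your table data lives in buckets that match neither pattern, the Resource lists would likely need to include those buckets as well.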
    

    Adding an AWS Glue Data Catalog

    To add an AWS Glue Data Catalog to your project:

    1. From the Datasets page, in the Data Lakes section, click (+) Add Source.

    note:

    Alternatively, click Data Lakes to display all data lake sources. From the top-right of the page, click the Add Data Lake button.

    2. In the Add Data Source dialog, under Metastores, click AWS Glue Data Catalog.

    The New AWS Glue Data Catalog Source dialog box appears, which contains the following sections:

    • General settings: required fields to add a Data Catalog
    • Optional settings: Advanced Options, Reflection Refresh, Metadata, Privileges

    Refer to the following for guidance on how to complete each section.

    General

    To add a Data Catalog:

    1. In the Name field, enter a name for the Data Catalog you are connecting to.

    2. For AWS Region Selection, use the drop-down menu to select the AWS Region where the Data Catalog is located.

    3. Under Authentication, select your preferred authentication method:

      • Choose Project Data Credentials if you prefer to use the default credentials that allow access to all sources in your project. These credentials were added when you originally signed up for Dremio and created your project. For setup instructions for this method, see Project Data Credentials with Access Key/IAM Role.
      • Choose Data Source Credentials if you prefer to use credentials that access a specific source. These credentials are created as part of the source configuration setup. You can set up these credentials using either an access key or an IAM role.
    4. (Optional) To secure the connections between the Data Catalog and Dremio, select the Encrypt connection checkbox.
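
    Before saving the source, it can help to confirm that the chosen credentials can list Glue databases in the selected Region. The sketch below shows one way to do that with boto3, the AWS SDK for Python; using boto3 is an assumption here, not something Dremio requires. It calls the Glue GetDatabases operation, which the IAM policy template allows. The function names and the Region in the usage comment are hypothetical.

```python
def make_glue_client(region_name: str):
    """Create a Glue client for the given Region (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so list_glue_databases can be used without it

    return boto3.client("glue", region_name=region_name)


def list_glue_databases(glue_client) -> list:
    """Return the names of all databases visible in the Data Catalog."""
    names = []
    paginator = glue_client.get_paginator("get_databases")
    for page in paginator.paginate():
        names.extend(db["Name"] for db in page["DatabaseList"])
    return names


# Example usage (requires real credentials; the Region is a placeholder):
# print(list_glue_databases(make_glue_client("us-west-2")))
```

    An access-denied error from this call would suggest that the policy is not attached to the credentials in use, or that the Region does not match the one that holds the Data Catalog.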

    Advanced Options

    Click Advanced Options in the left menu sidebar.

    note:

    All advanced options are optional.

    Review each option provided in the following table to set up the advanced options to meet your needs.

    | Advanced Option | Description |
    | --- | --- |
    | Enable asynchronous access when possible | Activated by default; clear the checkbox to deactivate. Enables cloud caching for the Data Catalog while adding a new source or editing it. |
    | Connection Properties | Custom key-value pairs for the connection that are relevant to the source. To add a property: (1) Click Add Property. (2) For Name, enter a connection property. (3) For Value, enter the corresponding connection property value. |

    The Lake Formation Integration options provide access controls for datasets in the AWS Glue Data Catalog and allow administrators to define security policies from a centralized location that can be shared across multiple tools. These options are shown as a preview only and are not yet active. The following table describes the options that will be available in a future update.

    | Lake Formation Integration Option | Description |
    | --- | --- |
    | Enforce AWS Lake Formation access permissions on datasets | Allows Dremio to check the datasets included in this source for the permissions required to perform queries. |
    | Prefix to map Dremio users to AWS ARNs | By default, uses the end user’s username, but you can enter a regular expression. |
    | Prefix to map Dremio groups to AWS ARNs | By default, uses the end user’s group, but you can enter a regular expression. |

    Under Cache Options, review the following table and edit the options to meet your needs.

    | Cache Option | Description |
    | --- | --- |
    | Enable local caching when possible | Selected by default, along with asynchronous access for cloud caching. Clear the checkbox to disable this option. For more information about local caching, see Columnar Cloud Cache. |
    | Max percent of total available cache space to use when possible | Specifies the disk quota, as a percentage, that a source can use on any single executor node. Applies only when local caching is enabled. The default is 100 percent of the total disk space available on the mount point provided for caching. You can either enter a percentage in the value field or use the arrows at the far right to adjust it. |
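
    As a worked example of the quota setting, the sketch below computes how much disk space a source could use on one executor node. The 500 GB mount point size and the function name are hypothetical.

```python
def cache_quota_gb(mount_point_gb: float, max_percent: float = 100.0) -> float:
    """Disk space (GB) a source may use on one executor node for local caching."""
    if not 0 <= max_percent <= 100:
        raise ValueError("max_percent must be between 0 and 100")
    return mount_point_gb * max_percent / 100.0


# With the default of 100 percent, the whole mount point is available.
print(cache_quota_gb(500))      # 500.0
# Lowering the setting to 40 percent caps this source at 200 GB per node.
print(cache_quota_gb(500, 40))  # 200.0
```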

    Reflection Refresh

    Click Reflection Refresh in the left menu sidebar. This section lets you manage how often reflections are refreshed and how long data can be served before expiration. To learn more about reflections, see Refreshing Reflections.

    note:

    All reflection parameters are optional.

    You can set the following refresh policies for reflections:

    • Refresh period: Manage the refresh period by either enabling the option to never refresh or setting a refresh frequency in hours, days, or weeks. The default frequency to refresh reflections is every hour.
    • Expiration period: Set the expiration period for the length of time that data can be served by either enabling the option to never expire or setting an expiration time in hours, days, or weeks. The default expiration time is set to three hours.
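
    The interaction between the two periods can be sketched with simple time arithmetic. The following is a simplified model that uses the defaults stated above (refresh every hour, expire after three hours); it is not Dremio's scheduler, and the function name and status labels are hypothetical.

```python
from datetime import datetime, timedelta

DEFAULT_REFRESH_PERIOD = timedelta(hours=1)
DEFAULT_EXPIRATION_PERIOD = timedelta(hours=3)


def reflection_status(
    last_refresh: datetime,
    now: datetime,
    refresh_period: timedelta = DEFAULT_REFRESH_PERIOD,
    expiration_period: timedelta = DEFAULT_EXPIRATION_PERIOD,
) -> str:
    """Classify reflection data by its age since the last refresh."""
    age = now - last_refresh
    if age >= expiration_period:
        return "expired"        # too old to be served
    if age >= refresh_period:
        return "refresh due"    # still servable, but a refresh is owed
    return "fresh"


last = datetime(2024, 1, 1, 12, 0)
print(reflection_status(last, datetime(2024, 1, 1, 12, 30)))  # fresh
print(reflection_status(last, datetime(2024, 1, 1, 13, 30)))  # refresh due
print(reflection_status(last, datetime(2024, 1, 1, 15, 30)))  # expired
```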

    Metadata

    Click Metadata in the left menu sidebar. This section lets you configure settings to refresh metadata and enable other dataset options.

    note:

    All metadata parameters are optional.

    You can configure Dataset Handling and Metadata Refresh parameters.

    Dataset Handling

    You can review each option provided in the following table to set up the dataset handling options to meet your needs.

    | Parameter | Description |
    | --- | --- |
    | Remove dataset definitions if underlying data is unavailable | By default, Dremio Cloud removes dataset definitions if underlying data is unavailable. This option is for scenarios when files are temporarily deleted and added back in the same location with new sets of files. |

    Metadata Refresh

    The Metadata Refresh parameters include Dataset Discovery and Dataset Details.

    • Dataset Discovery: The refresh interval for fetching top-level source object names such as databases and tables. Use this parameter to set the time interval.
      You can choose to set the frequency to fetch object names in minutes, hours, days, or weeks. The default frequency to fetch object names is 1 hour.

    • Dataset Details: The metadata that Dremio needs for query planning such as information required for fields, types, shards, statistics, and locality. The following table describes the parameters that fetch the dataset information.

      | Parameter | Description |
      | --- | --- |
      | Fetch mode | Fetches details only from queried datasets, which is the default. Dremio updates details for previously queried objects in a source. Fetching from all datasets is deprecated. |
      | Fetch every | Sets how often dataset details are fetched, in minutes, hours, days, or weeks. The default is one hour. |
      | Expire after | Sets when dataset details expire, in minutes, hours, days, or weeks. The default is three hours. |

    Privileges

    Click Privileges in the left menu sidebar. This section lets you grant privileges to specific users or roles. To learn more about how Dremio allows for the implementation of granular-level privileges, see Privileges.

    note:

    All privileges parameters are optional.

    To add a privilege for a user or role:

    • In the Add User/Role field, enter the name of the user or role that you want to apply privileges to, and then click Add to Privileges. The user or role is added to the Users table.

    To set privileges for a user or role:

    1. In the Users table, identify the user to set privileges for and click under the appropriate column (Select, Alter, Create Table, etc.) to either enable or disable that privilege. A green checkmark indicates that the privilege is enabled.
    2. Click Save.

    After you have connected Dremio to the AWS Glue Data Catalog, you’ll be able to edit the Data Catalog and remove it when it is no longer needed.

    Editing the AWS Glue Data Catalog

    To edit a Data Catalog source:

    1. From the Datasets page, click Data Lakes at the bottom-left of the page. A list of data lakes is displayed.
    2. In the All Data Lakes section, under the Action column, hover over a data lake to display the hidden Settings (gear) icon, and click the icon > Edit Details.

    note:

    Alternatively, you can click the name of the data lake, and, from the resulting data lake page, on the upper-right of the page, click the Settings (gear) icon.

    3. On the General settings tab of the Edit Source dialog box, you can update the Authentication credentials. You can also change any of the optional settings, including Advanced Options, Reflection Refresh, Metadata, and Privileges. For information about these settings and guidance on the changes you can make, see Adding an AWS Glue Data Catalog.

    note:

    In the General tab, you cannot change the name of the AWS Glue source.

    4. Click Save.

    Removing an AWS Glue Data Catalog Source

    To remove a Data Catalog:

    1. From the Datasets page, on the bottom-left of the page, click Data Lakes. A list of data lakes displays.
    2. In the All Data Lakes section, under the Actions column, hover over a data lake to display the hidden Settings (gear) icon, and click the icon > Remove Source.
    3. Confirm that you want to remove the source.

    Limitations

    • VPC-restricted S3 buckets are not supported.