Hive

Dremio and Hive

Dremio supports the following:

  • Hive table access using Hive's out-of-the-box SerDes interface, as well as custom SerDes or InputFormat/OutputFormat.
  • Reading any Hive-supported file format using Hive's own readers, even if Dremio does not support the format natively.

Note

Dremio does not support Hive views. However, you can create and query virtual datasets instead.

Table Statistics

By default, Dremio utilizes its own estimates for Hive table statistics when planning queries.

However, if you want to use Hive's own statistics, do the following:

  1. Set the store.hive.use_stats_in_metastore parameter to true.
    Example: store.hive.use_stats_in_metastore: true

  2. Run the ANALYZE TABLE COMPUTE STATISTICS command in Hive for each relevant Hive table. This step is required so that every table that Dremio interacts with has up-to-date statistics.

    ANALYZE TABLE <Table1> [PARTITION(col1,...)] COMPUTE STATISTICS;
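
As a concrete illustration of the syntax above, the following sketch shows the command for a non-partitioned table and for a partitioned table. The database, table, and partition column names (sales_db, customers, orders, order_date) are hypothetical placeholders, not names from this documentation; substitute your own. Run these statements in Hive, not in Dremio.

```sql
-- Hypothetical names, for illustration only.

-- Non-partitioned table: compute table-level statistics.
ANALYZE TABLE sales_db.customers COMPUTE STATISTICS;

-- Partitioned table: compute statistics for a single partition.
ANALYZE TABLE sales_db.orders PARTITION(order_date='2019-01-01') COMPUTE STATISTICS;

-- Partitioned table: compute statistics for every partition by
-- naming the partition column without a value.
ANALYZE TABLE sales_db.orders PARTITION(order_date) COMPUTE STATISTICS;

-- Optional check: the table parameters in the output should
-- include statistics such as numRows and totalSize.
DESCRIBE FORMATTED sales_db.customers;
```

Statistics are not updated automatically when data is loaded outside of Hive, so you may need to rerun the command after the underlying data changes.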
    

Dremio Configuration

Configuration is primarily accomplished through either the General or Advanced Options.

General

  • Name -- Hive source name
  • Connection -- Hive connection and security
    • Hive Metastore Host -- IP address. Example: 123.123.123.123
    • Port -- Port number. Default: 9083
    • Enable SASL -- Select this checkbox to enable SASL. If you enable SASL, specify the Hive Kerberos Principal.
  • Authorization -- Authorization type for the client. When adding a new Hive source, you have the following client options for Hive authorization:
    • Storage Based with User Impersonation -- Storage-based authorization in the metastore server, commonly used to add authorization to metastore server API calls. Dremio uses user impersonation to implement storage-based authorization.
    • SQL Based -- A SQL standards-based authorization option that allows Hive to be fully SQL compliant in its authorization model without causing backward compatibility issues.
    • Ranger Based -- An Apache Ranger plug-in that provides a security framework for authorization.
      • Ranger Service Name -- The security profile in Ranger. Example: hivedev
      • Ranger Host URL -- The URL of the Ranger server. Example: http://yourhostname.com:6080

Advanced Options

  • Impersonation User Delegation -- Specifies how the impersonation username is passed. Choose one of the following:
    • As is (Default)
    • Lowercase
    • Uppercase
  • Connection Properties -- Name and value of each Hive connection property.

Reflection Refresh

  • Never refresh -- Select to prevent reflections from refreshing. Otherwise, specify how often to refresh in hours, days, or weeks.
  • Never expire -- Select to prevent reflections from expiring. Otherwise, specify when to expire in hours, days, or weeks.

Metadata

Dataset Handling

  • Remove dataset definitions if underlying data is unavailable (Default).
    If this box is not checked and the underlying files under a folder are removed, or the folder or source is not accessible, Dremio does not remove the dataset definitions. Leaving the box unchecked is useful when files are temporarily deleted and then replaced with new sets of files.

Metadata Refresh

  • Dataset Discovery -- Refresh interval for top-level source object names such as names of DBs and tables.
    • Fetch every -- Specify fetch time based on minutes, hours, days, or weeks. Default: 1 hour
  • Dataset Details -- The metadata that Dremio needs for query planning such as information needed for fields, types, shards, statistics, and locality.
    • Fetch mode -- Specify either Only Queried Datasets, All Datasets, or As Needed. Default: Only Queried Datasets
      • Only Queried Datasets -- Dremio updates details for previously queried objects in a source.
        This mode increases query performance because less work is needed at query time for these datasets.
      • All Datasets -- Dremio updates details for all datasets in a source. This mode increases query performance because less work is needed at query time.
      • As Needed -- Dremio updates details for a dataset at query time. This mode minimizes metadata queries on a source when the source is not in use, but might lead to longer planning times.
    • Fetch every -- Specify fetch time based on minutes, hours, days, or weeks. Default: 1 hour
    • Expire after -- Specify expiration time based on minutes, hours, days, or weeks. Default: 3 hours
  • Authorization -- Used when impersonation is enabled. Specifies the maximum amount of time that Dremio caches authorization information before it expires.
    • Expire after -- Specify expiration time based on minutes, hours, days, or weeks. Default: 1 day
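
The fetch intervals above control when Dremio refreshes metadata on its own schedule. If a Hive table changes between scheduled fetches, a manual refresh may be useful. The following is a minimal sketch, assuming your Dremio version supports the ALTER PDS ... REFRESH METADATA command. The source, database, and table names (hive_prod, sales_db, orders) are hypothetical placeholders.

```sql
-- Hypothetical source and table names, for illustration only.
-- Run in Dremio to refresh the metadata of one Hive table
-- without waiting for the next scheduled fetch.
ALTER PDS hive_prod.sales_db.orders REFRESH METADATA;
```

This affects only the named dataset; the scheduled fetch and expiration settings for the source remain unchanged.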

Sharing

You can specify which users can edit. Options include:

  • All users can edit.
  • Specific users can edit.


