Version: current [25.0.x]

HDFS

This topic describes HDFS data source considerations and Dremio configuration.

HBase

HBase is an open-source, non-relational database that is built on top of HDFS and enables real-time analysis of data.

Note: Although HBase is no longer officially supported by Dremio as a source connection, you can still add HBase as a Dremio source by using a community connector.

Files stored in HDFS

You can query files and folders stored in your HDFS cluster. Dremio supports a number of different file formats. See Formatting Data to a Table for more information.

Co-location

Co-locating Dremio nodes with HDFS datanodes can lead to noticeably reduced data transfer times and more performant query execution.

Parquet File Performance

When HDFS data is stored in the Parquet file format, performance is best with one Parquet row group per file and a file size less than or equal to the HDFS block size. Parquet files that exceed the HDFS block size span multiple blocks, which can slow queries by adding a considerable amount of filesystem overhead.
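
The sizing rule above can be checked with simple arithmetic. The following is a minimal sketch, assuming the stock HDFS block size of 128 MiB (your cluster may override dfs.blocksize); the function names and file paths are hypothetical and not part of Dremio or Hadoop.

```python
from __future__ import annotations

# Assumption: 128 MiB is the stock default for dfs.blocksize.
# Check hdfs-site.xml for the value your cluster actually uses.
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024


def blocks_spanned(file_size: int, block_size: int = DEFAULT_BLOCK_SIZE) -> int:
    """Return how many HDFS blocks a file of file_size bytes occupies."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    # Ceiling division; an empty file occupies no blocks.
    return -(-file_size // block_size)


def oversized_files(sizes: dict[str, int], block_size: int = DEFAULT_BLOCK_SIZE) -> list[str]:
    """Return the paths (sorted) whose size exceeds a single HDFS block."""
    return sorted(path for path, size in sizes.items() if size > block_size)


if __name__ == "__main__":
    # Hypothetical file sizes in bytes, for illustration only.
    sizes = {
        "/data/sales/part-0.parquet": 120 * 1024 * 1024,
        "/data/sales/part-1.parquet": 200 * 1024 * 1024,
    }
    for path in oversized_files(sizes):
        print(path, "spans", blocks_spanned(sizes[path]), "blocks")
```

Files reported by oversized_files are candidates for rewriting with a smaller row group or file size.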

Note: Ensure that your Dremio cluster can reach the appropriate ports on each node of your HDFS source. By default, these are port 8020 for the HDFS NameNode (the port to specify when adding the source) and either port 50010 (Hadoop 2.x) or port 9866 (Hadoop 3.x) for HDFS DataNodes (dfs.datanode.address, used internally for data transfer).
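
One way to verify port access is a plain TCP connection test run from each Dremio node. The sketch below uses only the Python standard library; the helper names and the hostnames in the usage example are hypothetical, and the default ports are the ones listed in the note above.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_hdfs_ports(namenode, datanodes, namenode_port=8020,
                     datanode_port=9866, timeout=3.0):
    """Map each "host:port" target to whether it accepted a connection.

    datanode_port defaults to 9866; pass 50010 for clusters that use
    the older default for dfs.datanode.address.
    """
    targets = [(namenode, namenode_port)]
    targets += [(host, datanode_port) for host in datanodes]
    return {
        f"{host}:{port}": port_is_reachable(host, port, timeout)
        for host, port in targets
    }


if __name__ == "__main__":
    # Hypothetical hostnames; replace with the nodes of your cluster.
    results = check_hdfs_ports("namenode.example.com",
                               ["dn1.example.com", "dn2.example.com"])
    for target, ok in results.items():
        print(target, "reachable" if ok else "NOT reachable")
```

A successful connection only shows that the port is open from that node; it does not validate HDFS permissions or Kerberos settings.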

HDFS Configuration

This section describes the configuration required on the HDFS side.

Impersonation

To grant the Dremio service user the privilege to connect from any host and to impersonate a user belonging to any group, modify the core-site.xml file with the following values:

User impersonation settings for core-site.xml file
<property>
  <name>hadoop.proxyuser.dremio.hosts</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.dremio.groups</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.dremio.users</name>
  <value>*</value>
</property>

To make the properties more restrictive by specifying actual hosts and user names, modify the core-site.xml file with the following values:

More restrictive user impersonation settings for core-site.xml file
<property>
  <name>hadoop.proxyuser.dremio.hosts</name>
  <value>10.222.0.0/16,10.113.221.221</value>
</property>

<property>
  <name>hadoop.proxyuser.dremio.users</name>
  <value>user1,user2</value>
</property>
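
If the service user is not named dremio, or if you manage several clusters, the proxy-user properties can be generated rather than typed by hand. The following is a minimal sketch using the Python standard library; the function name is hypothetical, and the output still has to be pasted inside the configuration element of core-site.xml.

```python
import xml.etree.ElementTree as ET


def proxyuser_properties(service_user, hosts="*", groups=None, users=None):
    """Render hadoop.proxyuser.<service_user>.* properties as XML text.

    A setting passed as None is omitted from the output.
    """
    settings = {"hosts": hosts, "groups": groups, "users": users}
    blocks = []
    for key, value in settings.items():
        if value is None:
            continue
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = f"hadoop.proxyuser.{service_user}.{key}"
        ET.SubElement(prop, "value").text = value
        # encoding="unicode" returns a str without an XML declaration.
        blocks.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(blocks)


if __name__ == "__main__":
    # Values taken from the restrictive example above.
    print(proxyuser_properties(
        "dremio",
        hosts="10.222.0.0/16,10.113.221.221",
        users="user1,user2",
    ))
```

HDFS typically needs to reload or restart its configuration before changed proxy-user settings take effect.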

Impersonation and Privilege Delegation

You can enable user-specific file access permissions by turning on impersonation in HDFS sources (select Enable Impersonation in the source connection dialog). Users who access data stored on an HDFS source with impersonation enabled have their access governed by the HDFS privileges associated with their Dremio login name, rather than the ones associated with the Dremio daemon.

For example, let's say a Dremio user named bobsmith has been granted read access to the file /accounts/CustomerA.txt under the same username in HDFS. However, the dremio system user (the user that the dremio daemon runs as) does not have read access to this file. Unless impersonation was enabled when this HDFS source was added to Dremio, bobsmith will be unable to query the file.

Enabling impersonation also permits a kind of behavior called 'privilege delegation.' Under privilege delegation, HDFS data which is subject to restricted access can be shared with any other Dremio users via the creation of a view in a public (non-Home) space.

NameNode HA Configuration

If you have configured a secondary NameNode and a Dremio HA configuration, you must configure Dremio to reconnect with the secondary NameNode in the event the first NameNode goes down.

To configure a secondary NameNode:

  1. Ensure that the fs.defaultFS parameter is specified in the core-site.xml file without a port number. (In an HA setup, the NameNode ports are set in the dfs.namenode.rpc-address properties instead.)

    Specify fs.defaultFS parameter and value
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://xyzcluster</value>
    </property>
  2. Configure the NameNode HA parameters via one of the following methods:

    • Copy/symlink the Hadoop core-site.xml file to the Dremio conf folder if you haven't already done so.

    • Add the following parameters and values to the HDFS source in the Dremio UI under Advanced Options.

      HDFS source parameters and values
      dfs.nameservices - (for example, mycluster)
      dfs.ha.namenodes.mycluster - (for example, nn1,nn2)
      dfs.namenode.rpc-address.mycluster.nn1
      dfs.namenode.rpc-address.mycluster.nn2
      dfs.client.failover.proxy.provider.mycluster
  3. (Optional) Configure your distributed storage to use HDFS in the dremio.conf file.
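
The HA parameter names in step 2 all derive from the nameservice ID and the NameNode IDs, so they can be generated consistently. The sketch below builds the name/value pairs to enter under Advanced Options. The function name and hostnames are hypothetical, and the failover provider shown is the standard Hadoop class; confirm the value your cluster uses in its hdfs-site.xml.

```python
from __future__ import annotations

# Standard Hadoop client failover provider; verify against your hdfs-site.xml.
FAILOVER_PROVIDER = (
    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
)


def namenode_ha_properties(nameservice: str, namenodes: dict[str, str]) -> dict[str, str]:
    """Build the HDFS client properties for a NameNode HA nameservice.

    namenodes maps each NameNode ID (for example "nn1") to its
    "host:port" RPC address.
    """
    if len(namenodes) < 2:
        raise ValueError("NameNode HA needs at least two NameNodes")
    props = {
        "dfs.nameservices": nameservice,
        f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes),
    }
    for nn_id, address in namenodes.items():
        props[f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"] = address
    props[f"dfs.client.failover.proxy.provider.{nameservice}"] = FAILOVER_PROVIDER
    return props


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    props = namenode_ha_properties(
        "mycluster",
        {"nn1": "namenode1.example.com:8020", "nn2": "namenode2.example.com:8020"},
    )
    for name, value in props.items():
        print(f"{name} = {value}")
```

The nameservice ID passed here must match the host part of fs.defaultFS and the NameNode Host value entered for the source.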

For more information on NameNode HA, see the Cloudera or Hortonworks documentation.

Dremio Configuration

The HDFS source is usually configured when you add a new source, especially the Name and connection parameters. However, you can change or add options later by editing an existing source.

General

  • Name -- HDFS Name for the source.

  • Connection -- HDFS connection and impersonation

    • NameNode Host

      • No HA -- HDFS NameNode hostname.

      • HA -- Value for dfs.nameservices from hdfs-site.xml.

    • NameNode Port -- HDFS NameNode port.

    • Enable Impersonation -- When enabled, Dremio executes queries against HDFS on behalf of the user.

      • When Allow VDS-based Access Delegation is enabled (default), the owner of the view is used as the impersonated username.

      • When Allow VDS-based Access Delegation is disabled (unchecked), the query user is used as the impersonated username.
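
The impersonation rules described above can be summarized as a small decision function. This is a simplified mental model, not Dremio's implementation: the function name, the service_user default, and the username alice are hypothetical, and it assumes that a direct query on a table (no view involved) always runs as the query user.

```python
from typing import Optional


def impersonated_user(
    query_user: str,
    view_owner: Optional[str],
    enable_impersonation: bool,
    allow_vds_delegation: bool = True,
    service_user: str = "dremio",
) -> str:
    """Return the HDFS username that a query is expected to run as.

    view_owner is None when the query reads a table directly
    rather than going through a view.
    """
    if not enable_impersonation:
        # Without impersonation, HDFS only sees the Dremio daemon user.
        return service_user
    if allow_vds_delegation and view_owner is not None:
        # Privilege delegation: the view owner's HDFS privileges apply.
        return view_owner
    return query_user


if __name__ == "__main__":
    # bobsmith queries a view owned by the hypothetical user alice.
    print(impersonated_user("bobsmith", "alice", enable_impersonation=True))
    print(impersonated_user("bobsmith", "alice", enable_impersonation=True,
                            allow_vds_delegation=False))
```

In the bobsmith example from the Impersonation and Privilege Delegation section, the first branch is why the query fails when impersonation is off: HDFS evaluates the request as the dremio system user, which has no read access to the file.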

Advanced Options

The advanced options tab has the following values:

  • Enable exports into the source (CTAS and DROP)
  • Root Path -- Root path for the HDFS source
  • Short-Circuit Local Reads -- Controls short-circuit local reads, in which clients open the HDFS block files directly instead of reading through the DataNode.
    • HDFS Default
    • Enabled
    • Disabled (Default)
  • Impersonation User Delegate -- Specifies the letter case applied to the impersonation username:
    • As is (Default)
    • Lowercase
    • Uppercase
  • Connection Properties -- A list of additional HDFS connection properties.
  • Cache Options
    • Enable local caching when possible
    • Max percent of total available cache space to use when possible. Default: 100

Reflection Refresh

  • Never refresh -- Specifies how often to refresh based on hours, days, weeks, or never.
  • Never expire -- Specifies how often to expire based on hours, days, weeks, or never.

Metadata

Dataset Handling

  • Remove dataset definitions if underlying data is unavailable (Default).
    If this box is not checked and the underlying files under a folder are removed or the folder/source is not accessible, Dremio does not remove the dataset definitions. Leaving the box unchecked is useful when files are temporarily deleted and then replaced with new sets of files.
  • Automatically format files into tables when users issue queries.
    If this box is checked and a query runs against an unpromoted table or folder, Dremio automatically promotes it using default options. If you have CSV files, especially ones that need non-default options, consider leaving this box unchecked.

Metadata Refresh

  • Dataset Details -- The metadata that Dremio needs for query planning such as information needed for fields, types, shards, statistics, and locality.
    • Fetch mode -- Specify either Only Queried Datasets, All Datasets, or As Needed. Default: Only Queried Datasets
      • Only Queried Datasets -- Dremio updates details for previously queried objects in a source.
        This mode increases query performance because less work is needed at query time for these datasets.
      • All Datasets -- Dremio updates details for all datasets in a source. This mode increases query performance because less work is needed at query time.
      • As Needed -- Dremio updates details for a dataset at query time. This mode minimizes metadata queries on a source when it is not used, but might lead to longer planning times.
    • Fetch every -- Specify fetch time based on minutes, hours, days, or weeks. Default: 1 hour
    • Expire after -- Specify expiration time based on minutes, hours, days, or weeks. Default: 3 hours
  • Authorization -- When impersonation is enabled, the maximum amount of time that Dremio will cache authorization information.
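
To reason about how the Fetch every and Expire after settings interact, it can help to model the age of a dataset's metadata. The sketch below is a rough mental model under the default intervals listed above, not a description of Dremio's scheduler; the function name and status labels are hypothetical.

```python
from datetime import datetime, timedelta


def metadata_status(
    last_refresh: datetime,
    now: datetime,
    fetch_every: timedelta = timedelta(hours=1),
    expire_after: timedelta = timedelta(hours=3),
) -> str:
    """Classify dataset metadata by age relative to the two intervals."""
    age = now - last_refresh
    if age >= expire_after:
        # Too old to be trusted for planning.
        return "expired"
    if age >= fetch_every:
        # Still usable, but a background refresh is due.
        return "refresh due"
    return "fresh"


if __name__ == "__main__":
    refreshed = datetime(2024, 1, 1, 12, 0)
    for minutes in (30, 90, 200):
        now = refreshed + timedelta(minutes=minutes)
        print(minutes, "min:", metadata_status(refreshed, now))
```

Under this model, keeping Expire after comfortably larger than Fetch every leaves room for a refresh to complete before the metadata expires.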

Sharing

You can specify which users can edit. Options include:

  • All users can edit.
  • Specific users can edit.