This topic describes HDFS data source considerations and Dremio configuration.
You can query files and directories stored in your HDFS cluster. Dremio supports a number of different file formats. See Files and Directories for more information.
Co-locating Dremio nodes with HDFS DataNodes can noticeably reduce data transfer times and speed up query execution.
When HDFS data is stored in the Parquet file format, optimal performance is achieved by storing one Parquet row group per file, with a file size no larger than the HDFS block size. For example, with the default HDFS block size of 128 MB, aim for files of at most 128 MB, each containing a single row group. Parquet files that exceed the HDFS block size can slow queries considerably by incurring substantial filesystem overhead.
Ensure that your Dremio cluster has access to the appropriate ports for each node of your HDFS source. By default this should be port 8020 for an HDFS NameNode (which should be the one specified when adding the source), and port 50010 for HDFS DataNodes (used internally for data transfer).
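As a quick sanity check, a small Python sketch can verify that these ports are reachable from a Dremio node. The hostnames below are placeholders for your environment, not values from your actual cluster:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical hostnames -- substitute your own NameNode/DataNode addresses.
# port_open("namenode.example.com", 8020)    # NameNode RPC port
# port_open("datanode1.example.com", 50010)  # DataNode transfer port
```

Run the check from each Dremio node, since firewall rules can differ between hosts.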
This section describes how to configure HDFS for use with Dremio.
To grant the Dremio service user the privilege to connect from any host and to impersonate a user belonging to any group, modify the core-site.xml file with the following values:
```xml
<property>
  <name>hadoop.proxyuser.dremio.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.dremio.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.dremio.users</name>
  <value>*</value>
</property>
```
To make these properties more restrictive, specify actual hostnames (or IP ranges) and user names in the core-site.xml file instead of wildcards:
```xml
<property>
  <name>hadoop.proxyuser.super.hosts</name>
  <value>10.222.0.0/16,10.113.221.221</value>
</property>
<property>
  <name>hadoop.proxyuser.dremio.users</name>
  <value>user1,user2</value>
</property>
```
You can enable user-specific file access permissions by turning on impersonation for HDFS sources (check the 'Impersonation' box in the source connection dialog). Users who access data on an HDFS source with impersonation enabled have their access mediated by the HDFS privileges associated with their Dremio login name, rather than those associated with the Dremio daemon.
For example, suppose a Dremio user named `bobsmith` has been granted read access to `/accounts/CustomerA.txt` under the same username in HDFS, but the `dremio` system user (the user that the Dremio daemon runs as) does not have read access to this file. Unless impersonation was enabled when this HDFS source was added to Dremio, `bobsmith` will be unable to query the file.
Enabling impersonation also permits a behavior called 'ownership chaining': HDFS data that is subject to restricted access can be shared with any other Dremio user by creating a virtual dataset on top of it in a public (non-Home) space.
If you have configured a secondary NameNode and a Dremio HA configuration, you must configure Dremio to reconnect to the secondary NameNode if the primary NameNode goes down.
To configure a secondary NameNode:
1. Ensure that the `fs.defaultFS` parameter and value are specified in the core-site.xml file without the port number. (The port is already specified in the URI.)
2. Ensure that the following properties are specified in hdfs-site.xml:
   - `dfs.nameservices` (for example, `mycluster`)
   - `dfs.ha.namenodes.mycluster` (for example, `nn1,nn2`)
   - `dfs.namenode.rpc-address.mycluster.nn1`
   - `dfs.namenode.rpc-address.mycluster.nn2`
   - `dfs.client.failover.proxy.provider.mycluster`
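As an illustrative sketch (not a definitive configuration), an hdfs-site.xml fragment for a nameservice named `mycluster` with NameNodes `nn1` and `nn2` might look like the following; the hostnames are placeholders for your environment:

```xml
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1-host.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2-host.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

`ConfiguredFailoverProxyProvider` is the standard Hadoop client failover provider; your distribution may recommend a different one.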
For more information on NameNode HA, see the Cloudera or Hortonworks documentation.
The HDFS source is usually configured when you add a new source, in particular the Name and connection parameters. However, additional options can be changed or added by editing an existing source.
The advanced options tab has the following values:
You can specify which users can edit. Options include: