Hadoop Deployment (YARN)

Deployment Architecture

In YARN Deployment mode, Dremio integrates with YARN ResourceManager to secure compute resources in a shared multi-tenant environment. The integration enables enterprises to more easily deploy Dremio on a Hadoop cluster, including the ability to elastically expand and shrink the execution resources. The diagram below illustrates the high-level deployment architecture of Dremio on a Hadoop cluster.

Key components of the overall architecture:

  • Dremio Coordinator should be deployed on the edge node.
  • Dremio Coordinator is subsequently configured, via the Dremio UI, to launch Dremio Executors in YARN containers. The number of Executors and the resources allocated to them can be managed through the Dremio UI. See System Requirements for the resource needs of each node type.
  • It is recommended that a dedicated YARN queue be set up for the Dremio Executors in order to avoid resource conflicts.
  • Dremio Coordinators and Executors are configured to use HDFS volumes for the cache and spill directories.

Additional Requirements

Please refer to System Requirements for base requirements. The following are additional requirements for YARN (Hadoop) deployments.

Network

Purpose                 Port    From            To
ZooKeeper (external)    2181    Dremio nodes    ZooKeeper nodes
NameNode                8020    Coordinators    NameNode
DataNodes               50010   Dremio nodes    DataNodes
YARN ResourceManager    8032    Coordinators    YARN ResourceManager

Hadoop

  • A service user (e.g. dremio) that will own the Dremio process. This user must be present on edge and cluster nodes.

  • The Dremio service user must be granted read privileges for HDFS directories that will be queried directly or that map to Hive tables. This can also be configured using groups in Sentry or Ranger.
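
    A minimal sketch of granting read access with HDFS ACLs, assuming ACLs are enabled on the cluster and access is not managed through Sentry or Ranger (the path below is only an example):

    hdfs dfs -setfacl -R -m user:dremio:r-x /path/to/queried/data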

  • Create HDFS home directory for the Dremio user. This will be used for storing Dremio's distributed cache.
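
    For example, assuming the service user is named dremio:

    hdfs dfs -mkdir -p /user/dremio
    hdfs dfs -chown dremio:dremio /user/dremio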

  • Grant Dremio service user the privilege to impersonate the end user. Here is a sample core-site.xml entry:

    <property>
      <name>hadoop.proxyuser.dremio.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.dremio.groups</name>
      <value>*</value>
    </property>
    
  • Optionally, create a dedicated YARN queue for Dremio executor nodes, with job submission privileges for the user running Dremio. Here is a sample Fair Scheduler configuration, added as a fair-scheduler.xml entry:

    <allocations>
      <queue name="dremio">
        <minResources>1000000 mb,100 vcores,0 disks</minResources>
        <maxResources>1000000 mb,100 vcores,0 disks</maxResources>
        <schedulingPolicy>fair</schedulingPolicy>
      </queue>
    </allocations>
    

    Run the following for queue configuration changes to take effect:

    sudo -u yarn yarn rmadmin -refreshQueues
    

Kerberos

If connecting to a cluster with Kerberos:

  • Create a Kerberos principal for the Dremio user
  • Generate a Keytab file for the Dremio Kerberos principal
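
For example, with an MIT KDC these steps might look like the following (the realm and keytab path are illustrative; adjust them to your environment):

$ kadmin -q "addprinc -randkey dremio@REALM.COM"
$ kadmin -q "ktadd -k /path/to/keytab/file dremio@REALM.COM"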

Install Dremio Coordinator

You can follow the instructions for RPM/Tarball Installation. This will also create the directories that need to be configured in dremio.conf.

Before proceeding with configuration
Copy the core-site.xml, hdfs-site.xml, and yarn-site.xml files (typically found under /etc/hadoop/conf) into Dremio's conf directory on the coordinator node(s).
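
For example, assuming Dremio's conf directory is /opt/dremio/conf (adjust to your installation path):

$ cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/yarn-site.xml /opt/dremio/conf/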

For Hortonworks deployments, make the following changes in the yarn-site.xml that you've copied over to Dremio's conf directory:

  • Remove entry for FailoverProxyProvider completely.
  • Set yarn.timeline-service.enabled to false.
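
    For reference, a minimal sketch of the timeline service change in the copied yarn-site.xml (the FailoverProxyProvider entry, if present, is simply deleted rather than modified):

    <property>
      <name>yarn.timeline-service.enabled</name>
      <value>false</value>
    </property>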

Configure Dremio Coordinators

In dremio.conf:

  • Set master node and specify coordinator role:

    services: {
      coordinator.enabled: true,
      coordinator.master.enabled: true,
      executor.enabled: false
      ...
    }
    
  • Set the local metadata location. This should be a local directory that exists only on the coordinator node:

    paths: {
      local: "/var/lib/dremio"
      ...
    }
    
  • Use the dedicated HDFS directory you created for the Dremio user as the distributed cache location for all nodes:

    paths: {
      ...
      dist: "hdfs://<NAMENODE_HOST>:8020/path"
      # If Name Node HA is enabled, 'fs.defaultFS' should be used
      # instead of the active name node IP or host when specifying
      # the distributed storage path. The 'fs.defaultFS' value can be
      # found in 'core-site.xml'. (e.g. <value_for_fs_defaultFS>/path)
    }
    
  • Use Hadoop ZooKeeper for coordination:

    zookeeper: "<ZOOKEEPER_HOST_1>:2181,<ZOOKEEPER_HOST_2>:2181"
    services.coordinator.master.embedded-zookeeper.enabled: false
    
  • If using Kerberos, enter principal name and keytab file location:

    services.kerberos: {
      principal: "dremio@REALM.COM", # principal name must be generic and not tied to any host.
      keytab.file.path: "/path/to/keytab/file"
    }
    

In core-site.xml under Dremio's conf directory:

  • If using Kerberos, make sure there is a core-site.xml file under Dremio's configuration directory (the same directory as dremio.conf) and add the following to it:
    <property>
        <name>hadoop.security.authentication</name>
        <value>kerberos</value>
    </property>
    

Starting the Dremio Daemon

Once configuration is completed, you can start the Dremio Coordinator daemon with the command:

$ sudo service dremio start

Completing Coordinator Setup

Open a browser and navigate to http://<COORDINATOR_NODE>:9047. The UI will then walk you through creating the first admin user.

Deploy Dremio Executors on YARN

Once Dremio Coordinator is successfully deployed:

  1. Navigate to the UI > Admin > Provisioning section

  2. Select YARN and then select your Hadoop distribution and configuration. Dremio recommends having only one worker (YARN container) per node.

  3. Configure the Resource Manager and NameNode. The Resource Manager needs to be specified as a hostname or IP address (e.g. 192.168.0.1), and the NameNode needs to be specified with the protocol and port (e.g. hdfs://192.168.0.2:8020).

  4. Configure spill directories. Dremio recommends pointing this to the usercache directory under the path specified in yarn.nodemanager.local-dirs. As an example:

    file:///data1/hadoop/yarn/local/usercache/<DREMIO_SERVICE_USER>/
    file:///data2/hadoop/yarn/local/usercache/<DREMIO_SERVICE_USER>/
    
  5. Monitor and manage YARN executor nodes.

Configuring for Name Node HA and Resource Manager HA

Name Node HA
If Name Node HA is enabled, the fs.defaultFS value should be used as the NameNode value, instead of the active name node IP or host, when configuring provisioning in the Dremio UI.

Similarly, when specifying distributed storage (paths.dist in dremio.conf), the path should be specified using the fs.defaultFS value instead of the active name node (e.g. <value_for_fs_defaultFS>/path).

The fs.defaultFS value can be found in core-site.xml.
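
For example, if core-site.xml defines fs.defaultFS as follows (the nameservice ID here is illustrative):

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://nameservice1</value>
</property>

then the distributed storage path would be written as hdfs://nameservice1/path.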

Resource Manager HA
If Resource Manager HA is enabled, the yarn.resourcemanager.cluster-id value should be used as the Resource Manager value instead of the active resource manager IP or host when configuring provisioning in the Dremio UI.

The yarn.resourcemanager.cluster-id value can be found in yarn-site.xml.
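
For example, yarn-site.xml might contain an entry like the following (the cluster ID value is illustrative); that value is what you would enter as the Resource Manager when provisioning:

<property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarn-cluster</value>
</property>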

Sample Configuration Files

Here are templates for dremio.conf and dremio-env configuration files.

Sample dremio.conf for a master-coordinator node:

services: {
  coordinator.enabled: true,
  coordinator.master.enabled: true,
  executor.enabled: false,
}

paths: {

  # the local path for dremio to store data.
  local: "/var/lib/dremio"

  # the distributed path for Dremio data, including job results, downloads, uploads, etc.
  dist: "hdfs://<NAMENODE_HOST>:8020/path"
  # If Name Node HA is enabled, 'fs.defaultFS' should be used 
  # instead of the active name node IP or host when specifying 
  # distributed storage path. 'fs.defaultFS' value can be found 
  # in 'core-site.xml'. (e.g. <value_for_fs_defaultFS>/path)

}

zookeeper: "<ZOOKEEPER_HOST>:2181"

# optional
services.kerberos: {
    principal: "dremio@REALM.COM", # principal name must be generic and not tied to any host.
    keytab.file.path: "/path/to/keytab/file"
  }
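
Sample dremio-env for a master-coordinator node. This is a minimal sketch: the variable names below are standard dremio-env settings, but the values are illustrative and should be sized for your workload.

# Heap memory allocated to the Dremio process, in MB
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=4096

# Direct (off-heap) memory allocated to the Dremio process, in MB
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=8192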

Troubleshooting

  • In YARN deployments using Ranger, access is denied when attempting to query a data source configured with Ranger authorization, and Dremio logs a "FileNotFoundException */xasecure-audit.xml (No such file or directory)" error.
    This behavior is triggered within the Ranger plugin libraries when hdfs-site.xml, hive-site.xml, or hbase-site.xml are present in the Dremio configuration path.

    To fix this environment issue, rename the ranger-hive-audit.xml configuration file generated by the Ranger Hive plugin installer to xasecure-audit.xml and copy it to the Dremio configuration path on all coordinator nodes.
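
    For example (paths are illustrative; use the Ranger Hive plugin's actual configuration location and your Dremio conf directory):

    $ cp /etc/hive/conf/ranger-hive-audit.xml /opt/dremio/conf/xasecure-audit.xml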

