MapR Deployment (YARN)
This topic describes how to deploy Dremio on MapR in YARN deployment mode.
Architecture
In YARN Deployment mode, Dremio integrates with YARN ResourceManager to secure compute resources in a shared multi-tenant environment. The integration enables enterprises to more easily deploy Dremio on a Hadoop cluster, including the ability to elastically expand and shrink the execution resources. The following diagram illustrates the high-level deployment architecture of Dremio on a MapR cluster.
Key components of the overall architecture:
- Dremio Coordinator should be deployed on the edge node.
- Dremio Coordinator is subsequently configured, via the Dremio UI, to launch Dremio Executors in YARN containers. The number of Executors and the resources allocated to them can be managed through the Dremio UI. See system requirements for the resource needs of each node type.
- It is recommended that a dedicated YARN queue be set up for the Dremio Executors in order to avoid resource conflicts.
- Dremio Coordinators and Executors are configured to use MapR-FS volumes for the cache and spill directories.
- Dremio implements a watchdog that monitors Dremio processes and provides HTTP health checks to kill executor processes that do not shut down cleanly.
Step 1: Verify MapR-specific Requirements
Please refer to System Requirements for base requirements. The following are additional requirements for YARN (MapR) deployments.
Permissions
- Installing Dremio requires MapR administrative privileges. Dremio services running on MapR clusters should run as the `mapr` user, or as a service account with an impersonation ticket (see MapR 5.2.x, 6.1.x, or 6.2), and must have read privileges for the MapR-FS directories/files that will either be queried directly or that map to the Hive Metastore.
- Create a dedicated MapR volume and directory for Dremio's distributed cache. The Dremio user should have read and write permissions.
- Optionally, create a dedicated YARN queue for Dremio executor nodes with job submission privileges for the Dremio user. Be sure to run `sudo -u mapr yarn rmadmin -refreshQueues` for queue configuration changes to take effect.
```xml
<allocations>
  <queue name="dremio">
    <minResources>320000 mb,160 vcores,0 disks</minResources>
    <maxResources>640000 mb,320 vcores,0 disks</maxResources>
    <aclSubmitApps>mapr</aclSubmitApps>
  </queue>
</allocations>
```
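The volume and queue prerequisites above can be sketched as a short command sequence. This is a sketch only: the volume name, mount path, and ownership are hypothetical placeholders to adapt to your site.

```shell
# Create a dedicated MapR volume for Dremio's distributed cache
# (volume name and mount path are example values):
sudo -u mapr maprcli volume create -name dremio.cache -path /dremio/pdfs

# Give the Dremio service user read/write access to the directory:
sudo -u mapr hadoop fs -chown mapr:mapr /dremio/pdfs

# Reload queue configuration after editing the allocations file:
sudo -u mapr yarn rmadmin -refreshQueues
```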
CPU Configuration
In order for the CPU configuration specified in Dremio to be used and enforced on the YARN side, you need to do the following:
- Enable CPU scheduling in YARN.
- Enable Linux CGroup enforcement in YARN.
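As a rough illustration, CGroup enforcement is typically enabled in yarn-site.xml with properties like the following. This is a minimal sketch of the standard Hadoop property names; verify them against your Hadoop/MapR version, since CGroup setups usually need additional mount and hierarchy settings as well.

```xml
<!-- yarn-site.xml: illustrative snippet, not a complete CGroup setup -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
```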
Network Ports
Purpose | Port | From | To |
---|---|---|---|
ZooKeeper (External MapR) | 5181 | Dremio nodes | ZK |
CLDB (MapR) | 7222 | Coordinators | CLDB |
DataNodes (MapR) | 5660 | Dremio nodes | MapR data nodes |
YARN ResourceManager (MapR) | 8032 | Coordinators | YARN RM |
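Before installing, you may want to confirm these ports are reachable from the Dremio nodes. A minimal sketch in Python (the `*.example.com` hostnames are placeholders for your cluster):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Ports from the table above; replace the hosts with your cluster's nodes.
required = {
    "ZooKeeper":             ("zk-host.example.com", 5181),
    "CLDB":                  ("cldb-host.example.com", 7222),
    "MapR data node":        ("datanode.example.com", 5660),
    "YARN ResourceManager":  ("rm-host.example.com", 8032),
}

if __name__ == "__main__":
    for name, (host, port) in required.items():
        status = "open" if port_open(host, port) else "unreachable"
        print(f"{name} {host}:{port} -> {status}")
```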
Step 2: Install and Configure Dremio
This step involves installing and configuring Dremio on each node in your cluster.
Installing Dremio
Installation should be done as the `mapr` user and not as the `dremio` user.
See Installing and Upgrading via RPM or
Installing and Upgrading via Tarball for more information.
Configuring Dremio
When referring to a Dremio coordinator, the configuration is for a master-coordinator role.
Configuring Dremio via dremio.conf
The following properties must be reviewed and, if necessary, modified.
- Specify a master-coordinator role for the coordinator node:

  ```
  services: {
    coordinator.enabled: true,
    coordinator.master.enabled: true,
    executor.enabled: false
  }
  ```

- Specify a local metadata location that exists only on the coordinator node:

  ```
  paths: {
    local: "/var/lib/dremio"
    ...
  }
  ```

- Specify a distributed cache location for all nodes, using the dedicated MapR volume that you created:

  ```
  paths: {
    ...
    dist: "maprfs:///<MOUNT_PATH>/<CACHE_DIRECTORY>"
  }
  ```

- Specify the MapR ZooKeeper for coordination and disable the embedded ZooKeeper:

  ```
  zookeeper: "<ZOOKEEPER_HOST_1>:5181,<ZOOKEEPER_HOST_2>:5181"
  services.coordinator.master.embedded-zookeeper.enabled: false
  ```

- OPTIONAL - Set an alternative client endpoint port to avoid port collisions:

  ```
  services: {
    coordinator.client-endpoint.port: 31050
  }
  ```
Configuring Dremio via dremio-env
If the MapR cluster is secure, specify the path to the MapR ticket:

```
# For Secure Cluster
export MAPR_TICKETFILE_LOCATION=<MAPR_TICKET_PATH>
```
Starting the Dremio Daemon
Once configuration is complete, start the Dremio Coordinator daemon with one of the following commands.
Note that it must be started either as a user configured with a service ticket or as the `mapr` user.

```shell
sudo service dremio start
# OR
sudo -u mapr /opt/dremio/bin/dremio --config /etc/dremio/ start
```
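To confirm the daemon came up, you can check its status and tail the server log. The log path below assumes a default RPM install; adjust it if your installation relocated the log directory.

```shell
# Check the service status:
sudo service dremio status

# Inspect recent log output if the coordinator did not start cleanly:
tail -n 100 /var/log/dremio/server.log
```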
Accessing the Dremio UI
Open a browser and navigate to `http://<COORDINATOR_NODE>:9047`.
The Dremio UI flow walks you through creating the first Admin user.
Step 3: Deploy Dremio Executors on YARN
After you deploy the Dremio Coordinator, follow these steps to deploy Dremio executors:
1. Navigate to the Set Up YARN window by following either of these sets of steps:
   - If your version of Dremio displays a link labeled Admin in the top-right corner:
     a. Click Admin in the top-right corner of the screen.
     b. In the left panel, select Provisioning.
     c. Select YARN, then select MapR as your distribution.
   - If your version of Dremio displays a gear icon in a sidebar on the left side of the screen:
     a. Click the gear icon.
     b. In the Engines section of the left panel, select Elastic Engines.
     c. In the upper-right corner, click Add Engine.
     d. In the Set Up YARN window, select MapR in the Hadoop Engine field.
2. Enter the engine details. Dremio recommends having only one worker (YARN container) per node.
3. In the Resource Manager field, follow either of these steps:
   - If Resource Manager HA is not enabled, specify the hostname or IP address of the resource manager.
   - If Resource Manager HA is enabled, specify the value of the `yarn.resourcemanager.cluster-id` property, which is in the file yarn-site.xml.
4. In the CLDB field, accept the default of `maprfs:///`.
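For the Resource Manager HA case above, the relevant property appears in yarn-site.xml in a form like the following (the value shown is an example; use your cluster's actual id):

```xml
<!-- yarn-site.xml: look up this property when ResourceManager HA is enabled -->
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-cluster</value>
</property>
```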
You can now monitor and manage YARN executor nodes.
Sample Configuration Files
Sample dremio.conf file for a coordinator node:

```
paths: {
  # the local path for Dremio to store data
  local: "/var/lib/dremio"
  # the distributed path for Dremio data, including job results, downloads, uploads, etc.
  dist: "maprfs:///dremio/pdfs"
}
zookeeper: "<MAPR_ZOOKEEPER1>:5181,<MAPR_ZOOKEEPER2>:5181"
services: {
  coordinator.enabled: true,
  coordinator.master.enabled: true,
  executor.enabled: false
}
```

Sample dremio-env entry for a secure cluster:

```
# For Secure Cluster
export MAPR_TICKETFILE_LOCATION=<MAPR_TICKET_PATH>
```