High Availability

Dremio clusters can be made highly available by configuring one active master-coordinator node and one or more standby master-coordinator nodes.

To ensure that all nodes can access system metadata, the metadata store is kept on a shared network drive. Locking support on the network drive, together with locking in Dremio's metadata store, ensures that only one Dremio master-coordinator process is active at a time.

How HA Failover Works

When the active master-coordinator node fails:

  1. Dremio processes detect the failure, based on a configured ZooKeeper timeout, and elect a new Dremio master-coordinator node.
  2. The new master-coordinator node, already on standby, completes the startup using the metadata on the network drive.
  3. The other cluster nodes then re-connect to the new master-coordinator node.

[info] Note: When a failure occurs, Dremio processes are responsible for shutting themselves down.

Example: HA Failover

Two (2) master-coordinator nodes (NodeA and NodeB) are configured and started.

  • NodeA starts; NodeB waits on the filesystem lock and does not complete its startup.
  • NodeB remains passive and is not available until NodeA goes down.
  • When NodeA goes down, NodeB completes its startup and the other cluster nodes switch from NodeA to NodeB as their master-coordinator node.

After Failover

After HA failover is complete:

  • You need to restart any queries that were being processed at the time of the failure; they are not resumed automatically. In addition, the Dremio cluster cannot execute new queries until the other cluster nodes have re-connected to the new master-coordinator node.
  • You need to manually restart the failed master-coordinator node (after ensuring that it is usable). When it is restarted, it is brought back as a standby; see the example command after this list.
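
For instance, on a tarball-based installation the recovered node can typically be started again with the Dremio daemon script shown below; the exact command is an assumption that depends on how Dremio was installed (a service unit or init script may be used instead):

# start the recovered master-coordinator node; it rejoins the cluster as a standby
<DREMIO_HOME>/bin/dremio start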

[info] To see whether a master-coordinator node is active or not, use the GET /server_status REST API endpoint or, alternatively, ping the node.
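
For example, assuming the default web UI port of 9047 (the port the coordinator nodes expose in the nginx example later on this page), a status check takes the following form; the host is a placeholder and the path follows the endpoint named above:

GET http://<master-coordinator-host>:9047/server_status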

Configuration

See Configuration Overview for more configuration information.

Prerequisites

  • Network drive (NFS) with locking support
  • External ZooKeeper quorum
  • nginx for Linux (optional; used for Web Application HA & Load Balancing)
  • Read/write access to the shared network drive from all Dremio master-coordinator nodes (see the example mount entry after this list)
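
As an illustration, the shared metadata drive might be mounted on each master-coordinator node with an /etc/fstab entry such as the one below; the server name, export path, and mount point are placeholders, and the key requirement is that NFS locking stays enabled (do not use the nolock option):

# /etc/fstab: illustrative entry for the shared Dremio metadata drive
# keep NFS file locking enabled (no "nolock" option); Dremio relies on it
nfs-server:/dremio  /mnt/dremio  nfs  rw,hard  0 0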

Master-coordinator Nodes

For all master-coordinator nodes, edit the dremio.conf file and configure the following properties (a sample configuration follows the list):

  • zookeeper points to the external ZooKeeper quorum.
  • services.coordinator.master.embedded-zookeeper.enabled is set to false.
  • services.coordinator.master.enabled is set to true.
  • paths.db points to the shared network location.
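
Putting these properties together, a dremio.conf for a master-coordinator node might look like the following sketch; the paths.db location is a placeholder, and the executor.enabled setting is an assumption (master-coordinator nodes are often not used as executors):

services: {
  coordinator.enabled: true,
  coordinator.master.enabled: true,
  coordinator.master.embedded-zookeeper.enabled: false,
  executor.enabled: false
}

paths.db: "/mnt/dremio/db"

zookeeper: "<host1>:2181,<host2>:2181"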

Coordinator and Executor Nodes

In a Dremio HA environment, in addition to the master-coordinator nodes, you can have the following:

  • Zero (0) or more coordinator nodes.
  • One (1) or more executor nodes.

Coordinator Node

To configure a coordinator node, edit the dremio.conf file and set the following properties:

  • zookeeper points to the external ZooKeeper quorum.
  • services.coordinator.master.enabled is set to false.
  • services.coordinator.enabled is set to true.
services: {
  coordinator.enabled: true,
  coordinator.master.enabled: false,
  executor.enabled: false
}

zookeeper: "<host1>:2181,<host2>:2181"

Executor Node

To configure an executor node, edit the dremio.conf file and set the following properties:

  • zookeeper points to the external ZooKeeper quorum.
  • services.coordinator.master.enabled is set to false.
  • services.executor.enabled is set to true.
services: {
  coordinator.enabled: false,
  coordinator.master.enabled: false,
  executor.enabled: true
}

zookeeper: "<host1>:2181,<host2>:2181"

Web Application HA & Load Balancing

Dremio's web application can be made highly available by leveraging more than one coordinator node and a reverse proxy/load balancer.

All web clients connect to a single endpoint rather than directly connecting to an individual coordinator node. These connections are then distributed across available coordinator nodes.

Example: nginx.conf file

In the following sample configuration, all Dremio coordinator nodes are added as upstream servers and are proxied through the nginx server.

user  nginx;
worker_processes  1;

error_log  /var/log/nginx/error.log warn;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile        on;
    #tcp_nopush     on;

    keepalive_timeout  65;

    #gzip  on;

    include /etc/nginx/conf.d/*.conf;

    upstream myapp1 {
        server dremio_coordinator_1:9047;
        server dremio_coordinator_2:9047;
        server dremio_coordinator_3:9047;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://myapp1;
        }
    }
}

ODBC/JDBC HA & Load Balancing

Dremio recommends connecting to the ZooKeeper quorum rather than directly to a specific node when using ODBC and JDBC. Queries are then routed to, and planned by, one of the available coordinator nodes.
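
For example, a JDBC connection string that targets the ZooKeeper quorum generally takes the form below; the hosts are placeholders, and the exact URL syntax (including any custom ZooKeeper path) should be verified against your driver version:

jdbc:dremio:zk=<host1>:2181,<host2>:2181,<host3>:2181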

