High Availability

Dremio clusters can be made highly available by configuring one active master node and one or more standby master nodes. In this configuration, a Dremio process is responsible for terminating itself when it detects a failure; nodes that go down must be restarted manually to rejoin the cluster as standbys.

System metadata is kept on a shared network drive so that it is accessible to all master nodes. Locking on the network drive, together with locking in Dremio's metadata store, ensures that only one Dremio master process is active at a time.

If the active master fails:

  1. The remaining Dremio processes detect the failure, based on the configured ZooKeeper session timeout, and elect a new master.
  2. The new master, previously on standby, completes its startup using the metadata on the shared network drive.
  3. The other cluster nodes then reconnect to the new master node.

The Dremio cluster cannot execute new queries until the last step is complete. Queries that were being processed at the time of the failure must be restarted.

[info] Example of HA failover

If you start two (2) coordinator nodes that are both configured as master nodes (for example, NodeA and NodeB), NodeA starts while NodeB waits on the filesystem lock and does not finish starting. NodeB is passive and unavailable until NodeA goes down. When NodeA goes down, NodeB completes the startup process, and the other cluster nodes switch their master node interaction from NodeA to NodeB.

Prerequisites

Configuration

Node and Network Configuration

  • All master nodes must have read/write access to a shared network drive.
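
For example, the shared location could be an NFS export mounted at the same path on every master node. This is only a sketch; the server name, export path, and mount point below are hypothetical:

# Hypothetical NFS export and mount point -- adjust to your environment
sudo mkdir -p /mnt/dremio-metadata
sudo mount -t nfs nfs-server.example.com:/exports/dremio /mnt/dremio-metadata

The mount point can then be referenced by paths.db in dremio.conf, as described below.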

Dremio Configuration

All Master Nodes
In dremio.conf:

  • zookeeper points to the external ZooKeeper quorum
  • services.coordinator.master.embedded-zookeeper.enabled is set to false
  • services.coordinator.master.enabled is set to true
  • paths.db points to the shared network location
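
Putting these settings together, dremio.conf on a master node might look like the sketch below; the ZooKeeper hostnames and the metadata path are placeholders for your environment:

# dremio.conf (master nodes) -- hostnames and paths are examples
zookeeper: "zk1:2181,zk2:2181,zk3:2181"

services.coordinator.master.enabled: true
services.coordinator.master.embedded-zookeeper.enabled: false

# Metadata store on the shared network drive (same mount point on every master node)
paths.db: "/mnt/dremio-metadata/db"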

All Other Nodes
In dremio.conf:

  • zookeeper points to the external ZooKeeper quorum
  • services.coordinator.master.enabled is set to false (Coordinator nodes only)
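
For example, dremio.conf on these nodes might contain entries such as the following; the ZooKeeper hostnames are placeholders:

# dremio.conf (all other nodes) -- hostnames are examples
zookeeper: "zk1:2181,zk2:2181,zk3:2181"

# Secondary (non-master) coordinator nodes only:
services.coordinator.master.enabled: false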

Web Application HA & Load Balancing

Dremio's web application can be made highly available by leveraging more than one coordinator node and a reverse proxy/load balancer.

All web clients connect to a single endpoint rather than connecting directly to an individual coordinator node. These connections are then distributed across the available coordinator nodes.

In the sample configuration below, all Dremio coordinator nodes are added as upstream servers and are proxied through the nginx server.

Sample nginx.conf file

user  nginx;
worker_processes  1;

error_log  /var/log/nginx/error.log warn;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile  on;
    #tcp_nopush  on;

    keepalive_timeout  65;

    #gzip  on;

    include /etc/nginx/conf.d/*.conf;

    upstream myapp1 {
        server dremio_coordinator_1:9047;
        server dremio_coordinator_2:9047;
        server dremio_coordinator_3:9047;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://myapp1;
        }
    }
}

ODBC/JDBC HA & Load Balancing

When using ODBC or JDBC, Dremio recommends connecting to the ZooKeeper quorum rather than directly to a specific coordinator node. Each query is then routed to, and planned by, one of the available coordinator nodes.
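
For example, a JDBC connection URL can list the ZooKeeper quorum instead of a single host; the hostnames and port below are placeholders for your quorum:

jdbc:dremio:zk=zk1:2181,zk2:2181,zk3:2181

This contrasts with a direct connection (for example, jdbc:dremio:direct=<coordinator-host>:31010), which ties the client to a single coordinator node.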

