Dremio is a distributed system that can be deployed in a public cloud or on premises. A Dremio cluster can be co-located with one of the data sources (Hadoop or NoSQL database), or deployed separately.
Here are the common deployment models:
| Deployment model | When to use | Reflection store |
|---|---|---|
| AWS (EC2) | S3 or other databases hosted on AWS are the primary data sources | Cloud storage (S3) |
| Azure (VM) | Azure Data Lake Store or other databases hosted on Azure are the primary data sources | Cloud storage (Azure Data Lake Store) |
| Standalone | A Hadoop cluster is not available, and the data is not primarily in a single distributed NoSQL database | Local disks |
| Co-located with Hadoop | Hadoop is the primary data source, or there is a Hadoop cluster near the data sources | HDFS or MapR-FS |
| Co-located with NoSQL | A distributed NoSQL database, such as Elasticsearch or MongoDB, is the primary data source | Local disks |
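The rows above amount to a lookup from the primary data source to a deployment model and its storage. As a rough illustration only, the mapping could be sketched in Python as follows. The keys (`aws`, `azure`, `hadoop`, `nosql`), the `Deployment` type, and `choose_deployment` are hypothetical names made up for this sketch and are not part of Dremio.

```python
from typing import NamedTuple


class Deployment(NamedTuple):
    model: str    # first column of the table
    storage: str  # last column of the table


# Hypothetical keys describing where the primary data lives.
# The values are copied from the table rows above.
DEPLOYMENTS = {
    "aws": Deployment("AWS (EC2)", "Cloud storage (S3)"),
    "azure": Deployment("Azure (VM)", "Cloud storage (Azure Data Lake Store)"),
    "hadoop": Deployment("Co-located with Hadoop", "HDFS or MapR-FS"),
    "nosql": Deployment("Co-located with NoSQL", "Local disks"),
}

# Used when no Hadoop cluster is available and the data is not
# primarily in a single distributed NoSQL database.
STANDALONE = Deployment("Standalone", "Local disks")


def choose_deployment(primary_source: str) -> Deployment:
    """Return the table row for the primary data source, else Standalone."""
    return DEPLOYMENTS.get(primary_source.lower(), STANDALONE)


if __name__ == "__main__":
    for source in ("aws", "hadoop", "mixed"):
        row = choose_deployment(source)
        print(f"{source}: {row.model} / {row.storage}")
```

Real deployments often mix several sources, so a lookup like this is only a starting point for the decision, not a substitute for it.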
Additional considerations for high performance:
- There should be a low-latency, high-bandwidth network connection between Dremio and the data sources. A 10 GbE network is recommended when connecting to large data sources that hold terabytes or petabytes of data. When connecting to data on S3, it is better to run Dremio in the same region as the S3 buckets.
- For additional performance when running on AWS, reflections can be stored on EBS or EFS instead of S3.
- The Dremio process should have adequate CPU and memory resources. When Dremio is co-located with a Hadoop cluster or NoSQL database, it is important to use containers (e.g., cgroups, Docker, YARN containers) to ensure adequate resources for each process. Dremio features a high-performance asynchronous engine that minimizes the number of threads and context switches under heavy load, so unless containers are used, the operating system may over-allocate resources to other thread-hungry processes on the nodes.
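One way to sanity-check the low-latency recommendation above is to time TCP connections from a Dremio node to a data source. The following is a minimal sketch using only the Python standard library. The function name `connect_latency_ms` and its parameters are assumptions made for this example, and it measures connection setup time only, not bandwidth or query performance.

```python
import socket
import statistics
import time


def connect_latency_ms(host: str, port: int, attempts: int = 5,
                       timeout: float = 2.0) -> float:
    """Return the median time, in milliseconds, to open a TCP connection."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        # Open and immediately close a connection; only the setup is timed.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Calling it with the host and port of a data source as seen from a Dremio node (for example, a NameNode or Elasticsearch endpoint in your own environment) gives a rough figure for comparing candidate placements. A separate throughput test is still needed to check bandwidth.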