Setup and Best Practices
Container Location Databases (CLDBs)
When adding a MapR-FS data source, be sure to list each node that runs a CLDB in your cluster. This will allow Dremio to continue to query the source in the event of a CLDB node failure.
Unless your network hardware is exceptionally fast, colocating Dremio nodes with MapR-FS data nodes can noticeably reduce data transfer times and speed up query execution.
Parquet File Performance
When MapR-FS data is stored in the Parquet file format, optimal performance is achieved by storing one Parquet row group per file, with a file size less than or equal to the MapR-FS chunk size. Parquet files that exceed the MapR-FS chunk size can slow queries by incurring a considerable amount of filesystem overhead.
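One way to spot files that break the size guideline above is to compare each Parquet file's size against the chunk size. The following is a minimal sketch, not a Dremio or MapR tool. It assumes the data is reachable through a local or NFS-mounted path, and it assumes a 256 MiB chunk size, which you should replace with the value actually configured for your directories. The function name `oversized_parquet_files` is hypothetical, and the sketch does not inspect the number of row groups per file.

```python
import os

# Assumption: 256 MiB chunk size. Confirm the value configured for your
# MapR-FS directories and pass it explicitly if it differs.
ASSUMED_CHUNK_SIZE = 256 * 1024 * 1024


def oversized_parquet_files(root, chunk_size=ASSUMED_CHUNK_SIZE):
    """Return sorted (path, size_in_bytes) pairs for .parquet files under
    root whose size is strictly greater than chunk_size."""
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".parquet"):
                continue
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > chunk_size:
                oversized.append((path, size))
    return sorted(oversized)
```

Files reported by this check are candidates for rewriting into smaller files, ideally with one row group each.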
NOTE: Ensure that your Dremio cluster can reach the required ports on each node of your MapR-FS source. By default, these are port 7222 for CLDB processes (this is the port to specify when adding the cluster's CLDBs in the source dialog) and ports 5660 and 6660, which are used for internal purposes.
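To verify connectivity from a Dremio node, a plain TCP connection attempt to each port is usually enough. The sketch below uses only the Python standard library. The function name `check_ports` is hypothetical, and the default port list simply repeats the defaults named in the note, so adjust it if your cluster uses different ports. Note that port 7222 is only expected to be open on nodes that run a CLDB.

```python
import socket

# Defaults taken from the note above: 7222 (CLDB), 5660 and 6660 (internal).
DEFAULT_PORTS = (7222, 5660, 6660)


def check_ports(host, ports=DEFAULT_PORTS, timeout=3.0):
    """Return a dict mapping each port to True if a TCP connection to
    host:port succeeded within the timeout, and False otherwise."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and name resolution errors.
            results[port] = False
    return results
```

Calling `check_ports("<node hostname>")` for each node of the source, from each Dremio node, gives a quick picture of which ports are blocked.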
MapR Cluster Names
Dremio does not support MapR cluster names that are not URI-qualified (e.g. names containing the "_" character). Instead, use an alias for the cluster. The alias must be added to mapr-clusters.conf on all nodes of the cluster.
Here is a sample entry and command to generate a maprticket for a given alias:
mycluster_test secure=true 22.214.171.124:7222
bestcluster secure=true 126.96.36.199:7222

maprlogin password -cluster bestcluster
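To find out which cluster names need an alias, you can scan the entries of mapr-clusters.conf. The sketch below is illustrative only: the exact rule Dremio applies is not stated here, so it assumes that a usable name consists of letters, digits, "-" and "." (the characters valid in a URI host name), which rejects "_". It also assumes that the cluster name is the first whitespace-separated token on each line. The helper names `is_uri_safe` and `cluster_names` are hypothetical.

```python
import re

# Assumption: a usable name contains only letters, digits, "-" and ".",
# and starts and ends with a letter or digit. "_" is therefore rejected.
_URI_SAFE_NAME = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?$")


def is_uri_safe(cluster_name):
    """Return True if cluster_name matches the assumed URI-safe pattern."""
    return bool(_URI_SAFE_NAME.match(cluster_name))


def cluster_names(conf_text):
    """Return the first token of every non-empty line.
    Lines starting with '#' are skipped (an assumption about the format)."""
    names = []
    for line in conf_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line.split()[0])
    return names


# The sample entries from this section, reproduced as-is.
sample = """\
mycluster_test secure=true 22.214.171.124:7222
bestcluster secure=true 126.96.36.199:7222
"""

usable = {name: is_uri_safe(name) for name in cluster_names(sample)}
# usable == {"mycluster_test": False, "bestcluster": True}
```

Any name mapped to `False` would need an alias entry such as `bestcluster` before it can be used as the cluster name of a Dremio source.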
Dremio and MapR-FS
Impersonation and Ownership Chaining
You can enable flexible control over file permissions by turning on impersonation in MapR-FS sources (check the 'impersonation' box in the source connection dialog). This means that users who access data stored on this source will have their access mediated by the MapR-FS privileges associated with their Dremio login name, rather than the ones associated with the Dremio daemon.
Enabling impersonation also permits a behavior called 'ownership chaining.' Under ownership chaining, MapR-FS data that is subject to restricted access can be shared with other Dremio users by creating a virtual dataset in a public (non-Home) space.
Here are all available source-specific options:
|Name|Description|
|---|---|
|Cluster Name|MapR cluster name.|
|Encrypt Connection|Whether the cluster is secure or not.|
|Root Path|Root path for the MapR-FS source.|
|Properties|A list of additional MapR-FS connection properties.|