MapR-FS
Setup and Best Practices
Container Location Databases (CLDBs)
When adding a MapR-FS data source, be sure to list each node that runs a CLDB in your cluster. This will allow Dremio to continue to query the source in the event of a CLDB node failure.
Colocation
For all but the most robust network hardware, colocating Dremio nodes with MapR-FS datanodes can lead to noticeably reduced data transfer times and more performant query execution.
Parquet File Performance
When MapR-FS data is stored in the Parquet file format, optimal performance can be achieved by storing one Parquet row group per file, with a file size less than or equal to the MapR-FS chunk size. Parquet files that exceed the MapR-FS chunk size can negatively impact query times by incurring a considerable amount of filesystem overhead.
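As a rough aid for the file-size guidance above, the following Python sketch lists Parquet files that are larger than a given chunk size. It assumes the files are reachable through a local path (for example, a POSIX or NFS mount of the cluster) and assumes a chunk size of 256 MiB; check the value actually configured for your volume or directory. The function name `oversized_parquet_files` is hypothetical and is not part of Dremio or MapR. It checks file size only, not the number of row groups per file.

```python
from pathlib import Path

# Assumption: 256 MiB chunk size. Replace with the value configured on your cluster.
ASSUMED_CHUNK_SIZE_BYTES = 256 * 1024 * 1024


def oversized_parquet_files(directory, chunk_size=ASSUMED_CHUNK_SIZE_BYTES):
    """Return (path, size_in_bytes) pairs for .parquet files larger than chunk_size."""
    oversized = []
    # Walk the directory tree recursively, in a stable order.
    for path in sorted(Path(directory).rglob("*.parquet")):
        size = path.stat().st_size
        if size > chunk_size:
            oversized.append((path, size))
    return oversized
```

Any file this reports is a candidate for being rewritten into smaller files, each holding a single row group.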
NOTE: Ensure that your Dremio cluster has access to the appropriate ports for each node of your MapR-FS source. By default this should be port 7222 for CLDB processes (which should be the one specified when adding the CLDBs of the cluster in the source dialog), as well as ports 5660 and 6660 which are used for internal purposes.
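One way to confirm that a Dremio node can reach these ports is a plain TCP connection test. The Python sketch below is a minimal example using only the standard library; the function name `unreachable_ports` and the timeout value are illustrative assumptions. It only checks that a TCP connection can be opened, not that the MapR service behind the port is healthy.

```python
import socket

# Default ports named in the note above: CLDB (7222) plus 5660 and 6660.
MAPR_PORTS = (7222, 5660, 6660)


def unreachable_ports(host, ports=MAPR_PORTS, timeout=3.0):
    """Return the ports on host that refuse or time out a TCP connection."""
    failed = []
    for port in ports:
        try:
            # Open and immediately close a connection to test reachability.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            failed.append(port)
    return failed
```

Run it from each Dremio node against each MapR-FS node. An empty list means all the listed ports accepted a connection.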
MapR Cluster Names
Dremio does not support MapR cluster names that are not URI-qualified (for example, names containing the "_" character). Use an alias instead. The alias must be added to mapr-clusters.conf on all the nodes of the cluster.
Here is a sample entry and command to generate a maprticket for a given alias:
mycluster_test secure=true 123.0.0.1:7222
bestcluster secure=true 123.0.0.2:7222
maprlogin password -cluster bestcluster
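The alias rule can be checked before editing mapr-clusters.conf. The Python sketch below assumes that a "URI-qualified" name contains only letters, digits, "." and "-". That character set is an assumption, since the text only names "_" as an invalid example. The helper names `needs_alias` and `alias_entry` are hypothetical.

```python
import re

# Assumption: a URI-qualified name uses only letters, digits, "." and "-".
_URI_SAFE_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9.-]*$")


def needs_alias(cluster_name):
    """Return True if the name has characters outside the assumed URI-safe set."""
    return _URI_SAFE_NAME.match(cluster_name) is None


def alias_entry(alias, cldb_nodes, secure=True):
    """Format one mapr-clusters.conf line in the shape of the sample entry."""
    if needs_alias(alias):
        raise ValueError(f"alias {alias!r} is not URI-safe")
    flag = "true" if secure else "false"
    return f"{alias} secure={flag} {' '.join(cldb_nodes)}"
```

For example, `needs_alias("mycluster_test")` returns `True`, and `alias_entry("bestcluster", ["123.0.0.2:7222"])` produces a line matching the sample entry for bestcluster.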
Dremio and MapR-FS
Impersonation and Privilege Delegation
You can enable flexible control over file permissions by turning on impersonation in MapR-FS sources (check the 'impersonation' box in the source connection dialog). This means that users who access data stored on this source will have their access mediated by the MapR-FS privileges associated with their Dremio login name, rather than the ones associated with the Dremio daemon.
Enabling impersonation also permits a kind of behavior called 'privilege delegation.' Under privilege delegation, MapR-FS data which is subject to restricted access can be shared with any other Dremio users via the creation of a view in a public (non-Home) space.
Configuring MapR-FS as a Source
General
- Cluster Name -- MapR Cluster name. The name cannot include the following special characters: /, :, [, or ].
- Enable Impersonation -- When enabled, Dremio executes queries against HDFS on behalf of the user.
- When Allow VDS-based Access Delegation is enabled (default), the owner of the view is used as the impersonated username.
- When Allow VDS-based Access Delegation is disabled (unchecked), the query user is used as the impersonated username.
- Encrypt Connection -- Specifies whether the cluster is secure or not.
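The impersonation rules in the list above can be summarized as a small decision function. The Python sketch below is an illustration of the documented behavior, not Dremio's implementation. The function and parameter names are hypothetical, and it assumes that a query which does not go through a view always runs as the query user.

```python
def impersonated_user(query_user, view_owner=None, delegation_enabled=True):
    """Pick the username used for impersonation when impersonation is enabled.

    query_user: the Dremio user running the query.
    view_owner: owner of the view being queried, or None if no view is involved
                (assumption: without a view, the query user is always used).
    delegation_enabled: state of the "Allow VDS-based Access Delegation" option.
    """
    if delegation_enabled and view_owner is not None:
        return view_owner
    return query_user
```

With delegation enabled, `impersonated_user("alice", view_owner="bob")` returns `"bob"`. With delegation disabled, it returns `"alice"`.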
Advanced Options
- Enable exports into the source (CTAS and DROP).
- Root Path -- Root path for the source.
- Connection Properties -- A list of additional connection properties.
Reflection Refresh
- Never refresh -- Specifies how often to refresh based on hours, days, weeks, or never.
- Never expire -- Specifies how often to expire based on hours, days, weeks, or never.
Metadata
Dataset Handling
- Remove dataset definitions if underlying data is unavailable (Default). If this box is not checked and the underlying files under a folder are removed or the folder/source is not accessible, Dremio does not remove the dataset definitions. This option is useful in cases when files are temporarily deleted and put back in place with new sets of files.
- Automatically format files into tables when users issue queries. If this box is checked and a query runs against the un-promoted table/folder, Dremio automatically promotes using default options. If you have CSV files, especially with non-default options, it might be useful to not check this box.
Metadata Refresh
- Dataset Details -- The metadata that Dremio needs for query planning such as information needed for fields, types, shards, statistics, and locality.
- Fetch mode -- Specify either Only Queried Datasets, All Datasets, or As Needed. Default: Only Queried Datasets
- Only Queried Datasets -- Dremio updates details for previously queried objects in a source. This mode increases query performance because less work is needed at query time for these datasets.
- All Datasets -- Dremio updates details for all datasets in a source. This mode increases query performance because less work is needed at query time.
- As Needed -- Dremio updates details for a dataset at query time. This mode minimizes metadata queries on a source when not used, but might lead to longer planning times.
- Fetch every -- Specify fetch time based on minutes, hours, days, or weeks. Default: 1 hour
- Expire after -- Specify expiration time based on minutes, hours, days, or weeks. Default: 3 hours
- Authorization -- When impersonation is enabled, the maximum amount of time that Dremio will cache authorization information.
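To show how the Fetch every and Expire after settings relate, the Python sketch below classifies cached dataset details by age using the documented defaults (1 hour and 3 hours). It is a mental model only, not Dremio's internal logic. The function name `metadata_state` and the state labels are hypothetical.

```python
from datetime import datetime, timedelta

# Documented defaults for the Dataset Details settings.
FETCH_EVERY = timedelta(hours=1)
EXPIRE_AFTER = timedelta(hours=3)


def metadata_state(last_fetched, now, fetch_every=FETCH_EVERY, expire_after=EXPIRE_AFTER):
    """Classify cached metadata as 'fresh', 'due_for_refresh', or 'expired'."""
    age = now - last_fetched
    if age >= expire_after:
        return "expired"          # too old to be used
    if age >= fetch_every:
        return "due_for_refresh"  # still usable, but a refresh is due
    return "fresh"
```

Under this model, setting Expire after to a longer interval than Fetch every leaves a window in which cached details can still be used while a refresh is pending.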
Privileges
On the Privileges tab, you can grant privileges to specific users or roles. See Access Controls for additional information about privileges.
All privileges are optional.
- For Privileges, enter the user name or role name that you want to grant access to and click the Add to Privileges button. The added user or role is displayed in the USERS/ROLES table.
- For the users or roles in the USERS/ROLES table, toggle the checkmark for each privilege you want to grant on the Dremio source that is being created.
- Click Save after setting the configuration.
Updating a MapR-FS Source
To update a MapR-FS source:
- On the Datasets page, under Object Storage in the panel on the left, find the name of the source you want to edit.
- Right-click the source name and select Settings from the list of actions. Alternatively, click the source name and then click the settings icon at the top right corner of the page.
- In the Source Settings dialog, edit the settings you wish to update. Dremio does not support updating the source name. For information about the settings options, see Configuring MapR-FS as a Source.
- Click Save.
Deleting a MapR-FS Source
If the source is in a bad state (for example, Dremio cannot authenticate to the source or the source is otherwise unavailable), only users who belong to the ADMIN role can delete the source.
To delete a MapR-FS source, perform these steps:
- On the Datasets page, click Sources > Object Storage in the panel on the left.
- In the list of data sources, hover over the name of the source you want to remove and right-click.
- From the list of actions, click Delete.
- In the Delete Source dialog, click Delete to confirm that you want to remove the source.
Deleting a source causes all downstream views that depend on objects in the source to break.