This topic specifies how to configure Dremio for distributed storage. To configure distributed storage, the paths.dist property in the dremio.conf file must specify the cache location where Dremio holds the accelerator, table, job result, download, and upload data. If this property is updated, it must be updated in the dremio.conf file on all nodes. By default, Dremio uses the disk space on local Dremio nodes.
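As a sketch, a minimal paths block in dremio.conf might look like the following; the local path shown is illustrative, and the pdfs:// scheme (Dremio's default distributed store, striped across the executors' local disks) is an assumption about your installation's defaults:

```hocon
paths: {
  # Illustrative local metadata path; your installation may differ
  local: "/var/lib/dremio"
  # Default distributed store: Dremio's pdfs scheme over the nodes' local disks
  dist: "pdfs://"${paths.local}"/pdfs"
}
```

The sections below replace the dist value with a NAS, HDFS, MapR-FS, S3, or Azure location.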
Store | Supported? |
---|---|
NAS | Yes |
HDFS | Yes |
MapR-FS | Yes |
Amazon S3 | Yes |
Azure Data Lake Store (ADLS Gen1) | Yes |
Azure Storage (ADLS Gen2) | Yes |
NAS (network-attached storage) is a device that serves files over a network using a protocol such as NFS. Dremio supports NAS via the NFS protocol.
NFS protocol type | Supported? |
---|---|
NetApp | Yes |
MapR NFS shares | No |
Windows file shares | No |
Before Configuring
This information is applicable to Dremio 3.1.x and earlier.
Before configuring NAS for Dremio, mount your NFS share with the acdirmin=0,acdirmax=0
options. These options provide faster response times and avoid timeouts while results are being loaded. For example:
mount -t nfs -o acdirmin=0,acdirmax=0 172.28.1.8:/var/nfs /var/nfs
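To make the mount persistent across reboots, an equivalent /etc/fstab entry (using the same example server, export, and options as the mount command above) might look like the following sketch:

```
# NFS share for Dremio distributed storage, with attribute caching disabled
172.28.1.8:/var/nfs  /var/nfs  nfs  acdirmin=0,acdirmax=0  0  0
```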
To configure NAS as Dremio’s distributed storage, add the distributed path to the dremio.conf file:
paths: {
...
dist: "file:///shared_mount_path"
}
Before configuring HDFS as Dremio’s distributed storage, test adding the same cluster as a Dremio source and verify the connection.
Make the following changes in the dremio.conf file:
paths: {
...
dist: "hdfs://<NAMENODE_HOST>:8020/path"
}
When deploying on Hadoop using YARN, Dremio automatically copies this option to all nodes, so it only needs to be configured manually on coordinator nodes.
Name Node HA
If Name Node HA is enabled, specify the distributed storage path (paths.dist in dremio.conf) using the fs.defaultFS value instead of the active name node, e.g. <value_for_fs_defaultFS>/path.
The fs.defaultFS value can be found in core-site.xml (typically under /etc/hadoop/conf).
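For example, if core-site.xml defines the HA nameservice as follows (the nameservice name mycluster is illustrative):

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>
```

then the corresponding dremio.conf entry would be dist: "hdfs://mycluster/path", rather than pointing at the active name node directly.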
As described in the Hadoop-using-YARN deployment guide, ensure that you have copied the core-site.xml, hdfs-site.xml, and yarn-site.xml files (typically under /etc/hadoop/conf) into Dremio's conf directory.
Before configuring MapR-FS as Dremio’s distributed storage, test adding the same cluster as a Dremio source and verify the connection.
Make the following changes in the dremio.conf file:
paths: {
...
dist: "maprfs:///<MOUNT_PATH>/<CACHE_DIRECTORY>"
}
When deploying on MapR using YARN, Dremio automatically copies this option to all nodes, so it only needs to be configured manually on coordinator nodes.
Before configuring Amazon S3 as Dremio's distributed storage, ensure that the IAM user or role Dremio uses has a policy with the following permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:ListBucket",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::BUCKET-NAME",
"arn:aws:s3:::BUCKET-NAME/*"
]
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": [
"s3:ListAllMyBuckets",
"s3:HeadBucket",
"s3:GetBucketLocation"
],
"Resource": "*"
}
]
}
To configure Dremio for Amazon S3:
Change the paths.dist property in the dremio.conf file. Note that you must create the storage root directory first.
paths: {
...
dist: "dremioS3:///<bucket_name>/<folder1>/<folder2>"
}
For example: dist: "dremioS3:///qa1.dremio.com/jduong/accel"
Create core-site.xml and include IAM credentials with list, read and write permissions:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.dremioS3.impl</name>
<description>The FileSystem implementation. Must be set to com.dremio.plugins.s3.store.S3FileSystem</description>
<value>com.dremio.plugins.s3.store.S3FileSystem</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<description>AWS access key ID.</description>
<value></value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<description>AWS secret key.</description>
<value></value>
</property>
<property>
<name>fs.s3a.aws.credentials.provider</name>
<description>The credential provider type.</description>
<value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
</property>
</configuration>
Alternatively, you can configure distributed storage with an instance profile in place of an access key and secret key:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.dremioS3.impl</name>
<description>The FileSystem implementation. Must be set to com.dremio.plugins.s3.store.S3FileSystem</description>
<value>com.dremio.plugins.s3.store.S3FileSystem</value>
</property>
<property>
<name>fs.s3a.aws.credentials.provider</name>
<description>The credential provider type.</description>
<value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
</property>
</configuration>
Copy core-site.xml into Dremio's configuration directory (the same directory as dremio.conf) on all nodes.
As of Dremio 3.2.3, Minio can be used as a distributed store over both unencrypted and SSL/TLS connections. However, this feature is experimental and is not suitable for production environments.
To configure Minio as a distributed store:
Ensure that the provided root directory (bucket) has already been created on the Minio server.
Change the paths.dist property in the dremio.conf file.
paths: {
...
dist: "dremioS3:///<bucket_name>/<folder1>/<folder2>"
}
For example: dist: "dremioS3:///dremio"
Create core-site.xml and include the Minio server credentials with list, read, and write permissions:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.dremioS3.impl</name>
<description>The FileSystem implementation. Must be set to com.dremio.plugins.s3.store.S3FileSystem</description>
<value>com.dremio.plugins.s3.store.S3FileSystem</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<description>Minio server access key ID.</description>
<value>ACCESS_KEY</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<description>Minio server secret key.</description>
<value>SECRET_KEY</value>
</property>
<property>
<name>fs.s3a.aws.credentials.provider</name>
<description>The credential provider type.</description>
<value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
</property>
<property>
<name>fs.s3a.endpoint</name>
<description>The endpoint can be either an IP address or a hostname where the Minio server is running. The endpoint value cannot contain an http(s) prefix; e.g. 175.1.2.3:9000 is a valid endpoint.</description>
<value>ENDPOINT</value>
</property>
<property>
<name>fs.s3a.path.style.access</name>
<description>Value has to be set to true.</description>
<value>true</value>
</property>
<property>
<name>dremio.s3.compat</name>
<description>Value has to be set to true.</description>
<value>true</value>
</property>
<property>
<name>fs.s3a.connection.ssl.enabled</name>
<description>Value can either be true or false, set to true to use SSL with a secure Minio server.</description>
<value>SSL_ENABLED</value>
</property>
</configuration>
Copy core-site.xml into Dremio's configuration directory (the same directory as dremio.conf) on all nodes.
Troubleshooting Minio
The default buffer directory is /tmp/hadoop-dremio/s3a. If you do not have enough space there, you may encounter an error such as "DiskErrorException: No space available in any of the local directories." To resolve this, add or change the fs.s3a.buffer.dir setting in the core-site.xml file to a directory of your choice (any directory with ample space and write permissions for Dremio). Note: the new directory is not used for buffering until you restart the executors.
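For example, to move the buffer directory to a larger volume, add a property like the following to core-site.xml (the path shown is illustrative):

```xml
<property>
  <name>fs.s3a.buffer.dir</name>
  <description>Local directory used by the S3A connector for buffering uploads.</description>
  <value>/data/dremio/s3a-buffer</value>
</property>
```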
To configure Dremio to use AWS Key Management Service (KMS) encryption for Amazon S3:
KMS encryption is available as of Dremio 4.0 Enterprise Edition.
Change the distributed property in the dremio.conf file. Note that you must create the storage root directory first.
paths: {
...
dist: "dremioS3:///<bucket_name>/<folder1>/<folder2>"
}
For example: dist: "dremioS3:///qa1.dremio.com/jduong/accel"
Modify core-site.xml to include the following property:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.s3a.connection.ssl.enabled</name>
<value>true</value>
</property>
<property>
<name>fs.s3a.server-side-encryption-algorithm</name>
<value>SSE-KMS</value>
</property>
<property>
<name>fs.s3a.server-side-encryption.key</name>
<value>KEY_ARN</value>
</property>
</configuration>
Note: Obtain the server-side encryption key ARN from AWS under KMS > Customer managed keys > Create key.
Copy core-site.xml into Dremio's configuration directory (the same directory as dremio.conf) on all nodes.
You may encounter the following exception:
java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.SharedInstanceProfileCredentialsProvider not found
To resolve this issue, edit the core-site.xml file and update the fs.s3a.aws.credentials.provider property:
<property>
<name>fs.s3a.aws.credentials.provider</name>
<value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
</property>
This property change is required because Hadoop 3 removed the SharedInstanceProfileCredentialsProvider class.
To configure Azure Data Lake Storage Gen1 as Dremio's distributed storage, add the distributed path to the dremio.conf file:
paths: {
...
dist: "dremioAdl://<DATA_LAKE_STORE_NAME>.azuredatalakestore.net/<STORAGE_ROOT_DIRECTORY>"
}

Then create core-site.xml with the following properties and copy it into Dremio's configuration directory (the same directory as dremio.conf) on all nodes:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.dremioAdl.impl</name>
<description>Must be set to com.dremio.plugins.adl.store.DremioAdlFileSystem</description>
<value>com.dremio.plugins.adl.store.DremioAdlFileSystem</value>
</property>
<property>
<name>dfs.adls.oauth2.client.id</name>
<description>Application ID of the registered application under Azure Active Directory</description>
<value>APPLICATION_ID</value>
</property>
<property>
<name>dfs.adls.oauth2.credential</name>
<description>Generated password value for the registered application</description>
<value>PASSWORD</value>
</property>
<property>
<name>dfs.adls.oauth2.refresh.url</name>
<description>Azure Active Directory OAuth 2.0 Token Endpoint for registered applications.</description>
<value>OAUTH2_ENDPOINT</value>
</property>
<property>
<name>dfs.adls.oauth2.access.token.provider.type</name>
<description>Must be set to ClientCredential</description>
<value>ClientCredential</value>
</property>
<property>
<name>fs.dremioAdl.impl.disable.cache</name>
<description>Only include this property AFTER validating the ADLS connection.</description>
<value>false</value>
</property>
</configuration>
Azure Storage is the foundation of the ADLS Gen2 service. See Azure Storage for more information.
To set up distributed storage, add the distributed path to the dremio.conf file:
paths: {
...
dist: "dremioAzureStorage://:///<FILE_SYSTEM_NAME>/<ALTERNATIVE_STORAGE_ROOT_DIRECTORY>"
}

Then create core-site.xml with the following properties and copy it into Dremio's configuration directory (the same directory as dremio.conf) on all nodes:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.dremioAzureStorage.impl</name>
<description>FileSystem implementation. Must always be com.dremio.plugins.azure.AzureStorageFileSystem</description>
<value>com.dremio.plugins.azure.AzureStorageFileSystem</value>
</property>
<property>
<name>dremio.azure.account</name>
<description>The name of the storage account.</description>
<value>ACCOUNT_NAME</value>
</property>
<property>
<name>dremio.azure.key</name>
<description>The shared access key for the storage account.</description>
<value>ACCESS_KEY</value>
</property>
<property>
<name>dremio.azure.mode</name>
<description>The storage account type. Value: STORAGE_V2</description>
<value>STORAGE_V2</value>
</property>
<property>
<name>dremio.azure.secure</name>
<description>Boolean option to enable SSL connections. Default: True Value: True/False</description>
<value>True</value>
</property>
</configuration>
To enable distributed storage with OAuth 2.0, update the core-site.xml file. See the following sample information for reference:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.dremioAzureStorage.impl</name>
<description>FileSystem implementation. Must always be com.dremio.plugins.azure.AzureStorageFileSystem</description>
<value>com.dremio.plugins.azure.AzureStorageFileSystem</value>
</property>
<property>
<name>dremio.azure.account</name>
<description>The name of the storage account.</description>
<value>ACCOUNT_NAME</value>
</property>
<property>
<name>dremio.azure.mode</name>
<description>The storage account type. Value: STORAGE_V1 or STORAGE_V2</description>
<value>MODE</value>
</property>
<property>
<name>dremio.azure.secure</name>
<description>Boolean option to enable SSL connections. Default: True, Value: True/False</description>
<value>SECURE</value>
</property>
<property>
<name>dremio.azure.credentialsType</name>
<description>The credentials used for authentication. Value: ACCESS_KEY or AZURE_ACTIVE_DIRECTORY</description>
<value>CREDENTIALS_TYPE</value>
</property>
<property>
<name>dremio.azure.clientId</name>
<description>The client ID of the Azure application used for Azure Active Directory</description>
<value>CLIENT_ID</value>
</property>
<property>
<name>dremio.azure.tokenEndpoint</name>
<description>OAuth 2.0 token endpoint for Azure Active Directory (v1.0)</description>
<value>TOKEN_ENDPOINT</value>
</property>
<property>
<name>dremio.azure.clientSecret</name>
<description>The client secret of the Azure application used for Azure Active Directory</description>
<value>CLIENT_SECRET</value>
</property>
</configuration>
To use OAuth 2.0 authentication with Azure Government cloud platform, add the following property to the core-site.xml:
<property>
<name>fs.azure.endpoint</name>
<description>Azure Government Cloud Endpoint</description>
<value>GOVERNMENT_CLOUD_ENDPOINT</value>
</property>
To configure the Azure Storage data source to access data on the Azure Government cloud platform, add the fs.azure.endpoint property to the core-site.xml file along with the general Azure Storage properties, and copy the core-site.xml file into Dremio's configuration directory on all nodes.
For Storage V2:
<property>
<name>fs.azure.endpoint</name>
<description>The azure storage endpoint to use.</description>
<value>dfs.core.usgovcloudapi.net</value>
</property>
For Storage V1:
<property>
<name>fs.azure.endpoint</name>
<description>The azure storage endpoint to use.</description>
<value>blob.core.usgovcloudapi.net</value>
</property>
This configuration is done in addition to the configuration done via the UI. See Azure Storage for more information.