Pillar 3: Cost Optimization

While it is important to get the best possible performance from Dremio Cloud, it is equally important to optimize the costs of managing the Dremio Cloud platform.

Principles

Minimize Running Executor Nodes

Dremio Cloud can scale to many hundreds of nodes, but any given engine should have only as many nodes as are required to satisfy the current load and meet service-level agreements.

Dynamically Scale Executor Nodes Up and Down

When running Dremio Cloud engines, designers can use the concurrency-per-replica setting together with the minimum and maximum replica counts to expand and contract capacity dynamically based on load.

Eliminate Unnecessary Data Processing

As described in the best practices for Pillar 2: Performance Efficiency, creating too many reflections, especially those that perform similar work to other reflections or provide little added benefit in terms of query performance, can incur unnecessary costs because reflections need system resources to rebuild. For this reason, consider removing any unnecessary reflections.

To avoid processing data that a query does not need, use filters that can be pushed down to the source wherever possible. Partitioning source data in a way that aligns with those filters also speeds up data retrieval.
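
For example, in the hedged sketch below, a hypothetical "lake"."events" table is assumed to be partitioned by event_date; the filter on that column can be pushed down to the source so that only matching partitions are scanned.

    -- Hypothetical table and columns: assumes "lake"."events" is partitioned by event_date.
    -- Filtering on the partition column lets Dremio push the predicate down to the source
    -- and prune partitions, so only the data the query needs is read.
    SELECT device_id, COUNT(*) AS event_count
    FROM "lake"."events"
    WHERE event_date = DATE '2024-01-15'
    GROUP BY device_id;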

Also, optimize source data files by merging smaller files or splitting larger files whenever possible.
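
If the source files back an Apache Iceberg table that Dremio can write to, one option is Dremio's OPTIMIZE TABLE command, which compacts small data files into fewer, larger ones; the table path below is hypothetical.

    -- Compact small data files in a hypothetical Iceberg table; adjust the path to your environment.
    OPTIMIZE TABLE "lake"."events";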

Best Practices

Size Engines to the Minimum Replicas Required

To avoid accruing unnecessary cost, reduce the number of active replicas in your engines to the minimum required (typically 1, or 0 when the engine is not in use, such as on weekends or outside business hours). A minimum replica count of 0 delays the first query of the day while the engine starts up; you can mitigate this with an external script that executes a dummy SQL statement before normal daily use begins.
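
For instance, a scheduled job could submit a trivial statement (through the Dremio Cloud SQL API or any SQL client, routed to the target engine) shortly before business hours so that the engine scales up from 0 replicas before the first real user query arrives. The statement itself can be as simple as:

    -- Dummy warm-up statement; any lightweight query routed to the target engine will do.
    SELECT 1;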

Remove Unused Reflections

Analyze the Dremio Cloud sys.project.jobs_recent system table together with the sys.project.reflections and sys.project.materializations system tables to determine how frequently each reflection is actually used. For reflections that are not being used, you can dig further to determine whether they are still being refreshed and, if so, how many times they have been refreshed in the reporting period and how many hours of cluster execution time those refreshes have consumed.
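
The exact columns in these system tables vary by release, so the following is only a minimal sketch; names such as reflection_id, dataset_name, and last_refresh_duration_ms are illustrative assumptions that you should map to the actual schemas in your project.

    -- Illustrative sketch: estimate how much refresh work each reflection consumes.
    -- Column names are assumptions; check the system-table schemas before running.
    SELECT
      r.reflection_name,
      r.dataset_name,
      COUNT(m.materialization_id) AS refreshes_in_period,
      SUM(m.last_refresh_duration_ms) / 3600000.0 AS refresh_hours
    FROM sys.project.reflections r
    LEFT JOIN sys.project.materializations m
      ON m.reflection_id = r.reflection_id
    GROUP BY r.reflection_name, r.dataset_name
    ORDER BY refresh_hours DESC;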

Checking for and removing unused reflections is good practice because it reduces clutter in the reflection configuration and often frees up many hours of cluster execution cycles that can be used for more critical workloads.

Optimize Metadata Refresh Frequency

Ensure that metadata refresh frequencies are set appropriately, based on how often the metadata actually changes in each data source.

The default metadata refresh frequency for data sources is once per hour, which is too frequent for many sources. For example, if data in a source is only updated once every 6 hours, there is no need to refresh its datasets every hour; instead, change the refresh schedule to every 6 hours in the data source settings.

Furthermore, because metadata refreshes can be scheduled at the data source level, overridden at each individual table level, and performed programmatically, it makes sense to review each new data source to determine the most appropriate setting for it. For example, for data lake sources, you might set a long metadata refresh schedule such as 3000 weeks so that the scheduled refresh is very unlikely to fire, and then perform the refresh programmatically as part of the extract, transform, and load (ETL) process, where you know when the data generation has completed. You might set relational data sources to refresh every few days, but then override the source-level setting for tables that change more frequently.

When datasets are updated by overnight ETL runs, there is no point refreshing their metadata until the ETL process has finished. In this case, create a script that triggers a manual metadata refresh for each dataset as soon as its ETL step completes.
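
For example, the final step of the ETL job can issue a statement like the following for each dataset it has loaded; the table path here is hypothetical.

    -- Trigger a manual metadata refresh once the nightly load of this table has finished.
    ALTER TABLE "lake"."sales"."orders" REFRESH METADATA;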

For data sources that contain a large number of datasets but only a few that change structure or receive new files, it makes little sense to refresh at the source level on a fixed schedule. Instead, set a long source-level refresh interval, such as 52 weeks, and use scripts to trigger a manual refresh against specific datasets, using the same REFRESH METADATA statement shown above.

If you set a long metadata refresh schedule and have no scripting mechanism to refresh metadata, Dremio Cloud performs an inline metadata refresh during query planning whenever the planner detects that a dataset's metadata is stale or invalid. This can significantly lengthen query execution, because the planning phase then includes the time taken by that metadata refresh.