Elasticsearch

Dremio and Elasticsearch

Compatibility

Version support:

  • Elasticsearch 5.x and 6.x standalone
  • Amazon Elasticsearch Service (version 5.x, 6.0, 6.2, and 6.3).

Pushdown scripting support: Painless

Metadata Concepts

In order to plan and execute queries, Dremio captures and stores Elastic metadata in Dremio’s internal metadata database to efficiently plan and execute queries. This captured metadata is broken into two broad categories:

  • Dataset Discovery: Names of available Indices, Mappings and Aliases. This information is required to expose databases and tables in the Dremio UI and BI tool connections
  • Dataset Details: Complete information including definition of mapping, sampled schema and shard locations. This information is required to complete a query against a particular table.

Dremio will interact with the /_cluster/state/metadata api to understand the nature of the objects inside your Elasticsearch install. From this API endpoint, Dremio can learn metadata about each of these object type. By default, Dataset Discovery has an hourly refresh interval. Additionally, Dataset Details has an hourly refresh interval for Elastic tables that have been queried at least once.

Accessing Objects

The Dremio Elastic Connector is designed to provide a consistent and understandable view of Elastic Indices and Mappings, through the use of a two level hierarchy. In Dremio, these two levels can be thought of as database and table. Elastic Indices and Aliases are exposed as databases and each mapping within those index or alias is exposed as a table.

Dremio also supports exposing data inside Elastic aliases. In Dremio, aliases and indices are not visually distinguished and a user can easily interact with either entity. Additionally, Dremio understands filtered aliases and will correctly apply those filters as part of its operations.

Dremio also allows users access to Elastic’s capability to expose synthetic tables through wildcards and comma separated lists. A user can use wildcards in both the name of the database (index) or the name of the table (mapping). This is done by modifying the from clause in a standard SQL query. Once that query is executed, if Elastic recognizes the name, those entities will show up in the product as additional datasets available for query and access (and will be maintained and secured like any other table). If you want to have Dremio forget about those entities, an administrator can use ALTER TABLE <TABLE> FORGET METADATA to remove those synthetic entities.

SELECT * from elastic."feb*"."big"
SELECT * from elastic."feb,march"."big"
SELECT * from elastic."feb"."big,small"

Execution Metadata

When Dremio executes queries against Elastic, it usually parallelizes the query to interact with each shard in your Elastic cluster to move data as quickly as possible back to Dremio. Dremio does this by probing the /<indexOrAlias>/_search_shards API.

List Promotion Rules

Elastic does not distinguish between scalar and list of scalars but Dremio does. In order to ensure the best possible user experience, Dremio uses the schema analysis phases outlined above to expose the final user schema. To simplify things, once Dremio detects at least one field with a list of scalars, it exposes all records for that field as a list of scalars. This allows users to avoid having to deal with union types. An example:

  • Elastic mapping is defined as field ‘A’ and type integer.
  • Records 1-4 exist and each have a single integer for field ‘A’.
  • Dremio samples the schema and exposes field ‘A’ as a scalar.
  • Record five is inserted into the index
  • Dremio now exposes field ‘A’ as an int[] for all records 1-5.

Dremio does this promotion both at initial sampling time and during execution. If during execution Dremio discovers a value for a field that is of scalar type is actually a list type, Dremio will learn this schema and re-execute the query.

Special Handling for Geoshape

Geoshape is a special type in the Elastic ecosystem. This is because has a different schema depending on which type is exposed. Despite this, they are all represented at the type system level as a single type. In this situation, Dremio exposes the Geoshape type and specifically its potential coordinates fields as a group of union fields supporting from 1 to 4-dimensional double arrays to reflect the various types of Elastic geoshapes.

Mapping Consistency and Schema Learning

In some cases, it is possible that Dremio will query an index and find a schema change that was previously unknown to Dremio (different type for field or new field). In both cases, Dremio will do a two step verification process to correctly learn the new schema. Dremio maintains a mapping checksum for all identified schemas. When it encounters an unexpected change, it will first verify that the canonical schema from Elastic is consistent with Dremio’s previously known mapping. If it is, Dremio will follow its standard promotion rules. If it is not, Dremio will halt execution and request the user to use the ALTER TABLE <TABLE> REFRESH METADATA
operation to have Dremio immediately reread the updated Elastic mapping information. Note, this is an optional step as the mapping will also be updated on the schedule defined for automated metadata updates.

Discovery of New Fields

As part of the Dataset Details refresh, Dremio will automatically reload all Elastic mappings to learn about any new fields. Each time this happens, Dremio will resample and update its understanding of schema.

Mapping Merging

If you compose a query that includes multiple mappings, Dremio will do its best to merge those mappings. Mappings are merged on a field by field basis. Mappings can be merged if at least one of the following is true:

  1. Fields with overlapping positions are the same type (mapping1.a::int and mapping2.a::int)
  2. Fields are in non-overlapping positions (mapping1.a::int versus mapping2.b::float)

When Dremio merges a mapping, it does so linearly, inheriting the initial field order based on the first index queried.

Elastic Pushdowns

Dremio supports multiple types of pushdowns for different Elastic version and configuration combinations including:

  • Predicate (e.g. x < 5) pushdowns using Elastic queries
  • Lucene search queries using the CONTAINS syntax (starting from 5.3.x)
  • Optional source field/inclusion exclusion (disabled for performance reasons but can be enabled if Dremio has a slow connection to Elastic nodes).
  • Group by pushdowns for grouping by strings, dates, times, timestamps, integer, longs, doubles, floats, booleans using the Elastic Term Aggregation capabilities
  • Aggregate Measure pushdown including COUNT, COUNT(DISTINCT), SUM, AVG, STDDEV, VAR using Elastic aggregation framework.
  • Support for converting many arbitrary expressions and ~50 common functions through the use of Groovy (ES2) or Painless (ES5+) scripts for use in both filter and aggregate expressions.

Expression and Function Pushdowns

Dremio supports pushing down the following expressions and functions:

Type Expression/Function
Comparison Equals
Comparison Not equals
Comparison Greater than
Comparison Greater or equal to
Comparison Less than
Comparison Less or equal to
Comparison LIKE
Comparison ILIKE
Boolean NOT
Boolean OR
Boolean AND
NULL Check IS NULL
NULL Check IS NOT NULL
Flow CASE
Type Conversion CAST
String CHAR LENGTH
String UPPER
String LOWER
String TRIM
String CONCAT
Numeric Add
Numeric Subtract
Numeric Multiply
Numeric Divide
Numeric POWER
Numeric MOD
Numeric ABS
Numeric EXP
Numeric FLOOR
Numeric CEIL
Numeric LOG
Numeric LOG10
Numeric SQRT
Numeric SIGN
Numeric COT
Numeric ACOS
Numeric ASIN
Numeric ATAN
Numeric DEGREES
Numeric RADIANS
Numeric SIN
Numeric COS
Numeric TAN

How Dremio Decides What To Pushdown

Dremio works hard to pushdown as many operations as possible to Elastic to try to provide the highest performance experience. Dremio is also focused on maintaining a consistent SQL experience for users who may not understand Elastic or its APIs. As such, Dremio is very focused on providing a correct SQL experience. This includes respecting null semantics through the use of missing aggregation, expression evaluation consistency, correct aggregation semantics on analyzed fields, etc. Dremio also works well with Groovy and Painless to pushdown many more types of operations. It will work without scripts enabled but it is strongly recommended to enable scripts.

Given the nature of Elastic’s API, Dremio utilizes the following pieces of functionality to provide a SQL experience: Bucket Aggregations, Pipeline Aggregations, Filter Aggregations and searches using Elastic Query DSL.

Script Construction

Dremio builds custom Groovy (ES2) or Painless (ES5) scripts to interact with Elastic. Because of the many small differences in these languages (type handling, dynamic dispatch, type coercion, function signatures, primitive handling, etc), these scripts are different for each version of Elasticsearch. These scripts utilize Elastic’s doc values columnar capability where possible but also rely on _source fields for certain operations (e.g. aggregations on analyzed fields for example). As Dremio analyzes a user’s SQL expression, it decomposes the expression into a script that can be understood by Elastic’s scripting capability.

There are many situations where Dremio uses an expression that might at first be unexpected. These are because of the nature of some of Elastic apis. Some examples behaviors that Dremio does to ensure correct results:

  • Dremio uses _source fields for accessing IP addresses when aggregating or filtering in ES2 because the type has changed between ES2 and ES5
  • Dremio doesn’t push down multi-index complex expressions (table1.a[2].b[3].c[4]) using doc values because doc values can only reference leaf fields and leaf arrays
  • Dremio doesn’t do any array dereferencing using _source fields because they are not canonicalized to the Elastic mapping. This means that nested arrays [1,[2,3]] haven’t been flattened to the Elastic canonical representation [1,2,3]. This is done as otherwise scripts would produce wrong result.
  • Dremio won’t use a doc field reference for a field that has it implicitly disabled (string/text) or explicitly disabled (doc_values: false).
  • Dremio won’t use doc fields for GeoShapes. This is because Dremio doesn’t expose a first class shape objects and the fields exposed in Dremio (lists of arrays of doubles) are not directly related to Elastic’s internal representation or query capabilities.
  • Dremio won’t pushdown operations against nested fields. This is because nested fields are stored out of line of the core document (not in the original document’s doc values) and have semantics inconsistent with traditional SQL aggregation. (Dremio is exploring future work to expose this through enhancements to the language.) Note that Dremio also doesn’t use _source field scripts to interact with nested documents because they are exposed as arrays of values and suffer from the canonicalization issue described above.

Debugging and Logging

If you want to better understand how Dremio is interacting with your Elastic cluster, you can enable Dremio Elastic logging on each Dremio node. This will record each response and request to the Elastic cluster, including a portion of each message body.

You can do this by adding the following configuration to your conf/logback.xml file on all nodes:

  <appender name="elasticoutput" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>${dremio.log.path}/elastic.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>${dremio.log.path}/archive/elastic.%d{yyyy-MM-dd}.log.gz
      </fileNamePattern>
      <maxHistory>30</maxHistory>
    </rollingPolicy>
    <encoder>
      <pattern>%date{ISO8601} [%thread] %-5level %logger{36} - %msg%n
      </pattern>
    </encoder>
  </appender>
  <logger name="elastic.requests" additivity="false">
    <level value="info"/>
    <appender-ref ref="elasticoutput"/>
  </logger>

Working with Elasticsearch and x-pack

If your Elasticsearch source uses Shield, then your Elasticsearch user account must have the 'monitor' privilege at the cluster level (an admin user has this by default). In addition, for each index you want to query upon, your user account need to have the 'read' and 'view_index_metadata' privilleges as well. Both privilleges are included in 'all'.

Here is an example to set up a role 'dremio' with necessary privilleges to access 'test_schema_1' index:

POST /_xpack/security/role/dremio

{
  "cluster": [ "monitor" ],
  "indices": [
    {
      "names": [ "test_schema_1" ],
      "privileges": [ "read", "view_index_metadata" ]
    }
  ]
}

Working with Elasticsearch and Shield

If your Elasticsearch source uses Shield, then your Elasticsearch user account must have the 'monitor' privilege at the cluster level (an admin user has this by default). If your account lacks the 'monitor' privilege, and you don't have access to an admin user, you can create a new account with 'monitor' by following these steps:

  • Log in to a search node, go the Elasticsearch install's home directory, and open the file ./config/shield located inside.
  • Append this text, which gives monitor privileges to an Elasticsearch index called books for any user with the dremio_user role:
dremio_user:
cluster:
- cluster:monitor/nodes/info
- cluster:monitor/state
- cluster:monitor/health
indices:
'books' :
- read
- indices:monitor/stats
- indices:admin/get
- indices:admin/mappings/get
- indices:admin/shards/search_shards
  • Run this command, adding a new user to Shield that has the 'dremio_user' role:

    ./bin/shield/esusers useradd <username> -r dremio_user
    
  • Copy the Shield config file you edited to every other node in the Elasticsearch cluster:

    scp -r ./config/shield root@<other-es-node>:<elastic-install-dir>/config
    

Dremio Configuration

Here are all available source specific options:

Name Description
Hosts A list of Elasticsearch hosts.
Authentication Type No authentication or Master Authentication if Shield or other security is enabled.
Username Elasticsearch user name.
Password Elasticsearch password.
Scripts enabled in Elasticsearch cluster Dremio will not use script pushdowns if this option is disabled.
Show Hidden Indices Show/hide indices that start with a dot.
Use Painless Scripting Whether to use the Painless scripting language when connecting to Elasticsearch 5.0+ (experimental), optional.
Query whitelisted hosts only Use this option when connecting to managed Elasticsearch instances or when Dremio only has access to the hosts specified.
Show _id column Whether to show the ES ID column, optional.
Read Timeout Read timeout in seconds.
Scroll Timeout Scroll timeout in seconds.
Scroll Size Configure scroll size for Elasticsearch requests Dremio makes. This setting must be less than or equal to your Elasticsearch's setting for index.max_result_window setting (typically defaults to 10,000).
SSL enabled Whether to use secured connections, optional.
Use index on analyzed and normalized fields Use this option to enable pushdown filters and aggregations on analyzed text fields and normalized keyword fields. Depending on the config for analyzer / normalizer used for the field, queries may return unexpected results. For example, if lowercase analyzer is used, then WHERE city = 'Seattle' will match both seattle and Seattle.

results matching ""

    No results matching ""