Skip to main content
Version: current [24.3.x]

Data Curation

Describing Data

Dremio provides the following tools for describing, identifying, and displaying data:

  • Wikis
  • Tags
  • Catalogs

Wikis

The Wiki allows you to add a description for a Space (and its datasets) or a Source (and its datasets). The Wiki's formatting language is a Github-flavored markdown and supported by a rich text editor.

This image shows an example of the Wiki editor in Dremio.

Tags

Tags allow you to create and assign tags to all datasets. You can search for items based on a tag and click on a tag to start a search based on that specific tag. The following images shows a dataset in a space with several tags. In this example, "best tag" was selected and is shown in the search field.

This image shows an example of creating tags.

Catalog tab

The Catalog tab displays Wiki content, Tags, and Fields for all datasets.

Transforming Data

To begin a data transformation, via one of the following methods:

  • Highlighting a portion or all of a field's value
  • Using the dropdown menu for the transformation. The dropdown is to the right of the field's name.

Using Highlighting

The highlighting method is often the most intuitive method.
It provides enough context for Dremio to make some best guesses about how to execute the transformation that you have in mind.

For instance, you could highlight a portion of a field that contains customer names to quickly perform an extract that creates a new field with only last names.

Suggestion Cards
For transformations that are initiated by highlighting part or all of a field value, Dremio uses a heuristic to determine a set of "suggestion cards" that represent its best guesses as to your intended result.

You can click on one of these suggestion cards to inspect a preview of the new dataset and confirm that it matches your expectations. If no suggestion card is a perfect match, you can "flip" the card (by clicking the pencil icon in the upper right corner) to tweak the card's parameters before applying the transformation.

note

The highlight method is great for beginning an extract. However, in cases where other capabilities are required, the dropdown menu may be more useful.

Using Dropdown Menus

The dropdown menu provides a more complete list of transformations that are applicable to the data type. These include Unnest for complex list type data (see Dealing with Complex Types) and Convert Case for text type data (see Cleaning Text).

Fixing Inconsistent Data with a Join

In situations where the entries in a field are inconsistent (for example, different spellings or abbreviations for the same name), the following technique can be used to increase the quality of the dataset:

  1. Identify the field with the problematic data. It may be useful to run this command in the SQL Editor:

    Identify a field
    SELECT DISTINCT myProblemFieldName FROM myDatasource.myTable
  2. Download the results as CSV using the Download button.

  3. Open the file in a text editor or Excel and create lookup values for the distinct values from your table in a second column. For example, standardizing variations in color names to a single canonical name.

  4. Upload this file to your Home space on Dremio

  5. Open this new dataset and hit the Join button located on the left above the field names

  6. Select Custom Join and then the name of the inconsistent dataset you would like to fix, followed by Next

  7. Drag over the name of the left column from your uploaded dataset, and match it with the name of the field you would like to correct in the inconsistent dataset

  8. Apply the Join then drop the old field, renaming the new one to take its place

  9. Save the corrected dataset

Dealing with Mixed Types

note

Mixed types in columns for FileSystem sources are not supported in Dremio 18.0+. Instead, a standard schema is now enforced. See Mixed Data Types in FileSystem Sources for additional information.

Fields in datasets that contain two or more kinds of data (for instance text and numeric) are said to have a mixed type, and are indicated in Dremio by the mixed type symbol. You should resolve mixed type fields before performing any analyses that rely on them.

To begin cleaning up a mixed type field click the mixed type icon to the left of the field name. This provides a dialog that presents two options for resolving the field:

  • Convert to Single Data Type Migrates the field values by attempting to convert them all to a preferred type. If conversion is impossible you can choose to either delete that record, or replace the value.

  • Split by Data Type Solves the problem of mixed types by creating a new field for each type found within the mixed field.

In this example we resolve a mixed field with both text and numeric types by replacing the records that contain the text 'current' with the integer 2016:

Mixed Data Types in FileSystem Sources

For datasets in FileSystem sources with columns that contain mixed data types, this is no longer supported as of Dremio 18.0+. As a result, these mixed data types will be combined where compatible. Prior to upgrading to 18.0, we recommend that you contact Dremio Support to utilize a tool that will identify instances of mixed types which might result in issues post-upgrade.

note

If data types are not compatible, after upgrading to Dremio 18.0+ the dataset will be unreadable until this is resolved.

The following sections identify when similar data types will be coerced and when errors will result.

Decimal-to-Decimal Coercion

If the file’s Decimal type is different from the table’s Decimal type (such as due to schema learning or internal schema), the file’s type is implicitly coerced to the table’s Decimal definition when read by Dremio.

For Hive sources:

  • On truncation, truncate
  • On overflow, return NULL

For non-Hive sources:

  • On truncation, truncate
  • On overflow, return error with overflow
Relational-Database Sources

There is no coercion due to mixed types. If a relational-database source supports a conversion, then the task of converting is delegated to that source; otherwise, Dremio does the conversion.

If, when Dremio plans a query, a conflict is found between the data type of a column in a source table and the data type of that same column in the corresponding Dremio table, a CAST is added to the query, the processing of which is delegated to the source. Therefore, in the query results that are returned to Dremio, the data in the column uses the data type defined in the Dremio table.

This process differs from the one followed for file-based sources, where the data is always read before being coerced by Dremio.

MongoDB

The Mongo plugin reads columns as union types if it encounters mixed types. When it encounters a field of type DECIMAL where the precision and scales are different, it will change the scale of the field (it uses precision 38 by default) and resample the data.

The Mongo plugin also handles coercion on filters in very specific cases:

  • VARCHAR to number types
  • DATE/TIME/TIMESTAMP to an ISODate (a TIMESTAMP essentially)
  • VARCHAR to an ISODate

Dealing with Complex Types

This section describes the use of maps and lists to handle complex types.

Transforming Maps

Map type complex fields are separated from a value with a particular path by using the Extract transformation.

The easiest way to apply this transformation is to

  1. Expand a map with the ellipsis button.
  2. Click the value of interest.
  3. Select Extract from the dropdown menu.
  4. Create a field based on the path of the selected value.
note

Dremio allows access to nested fields using SQL. This requires aliasing the dataset that contains the nested data, for example:

select
t1.property_map.field1
from source.mytable t1

Transforming Lists

List type complex data is handled with the Unnest transformation, which creates a new record for each element in the list. Each additional record created by Unnest maintains the same values as the original list for all other fields.

You can also use Extract on lists to separate one value or a range of values into a new field. If you extract a range of values the resulting new field will also be a list type.

In this example we use Unnest on a field that contains a list of genres for each title in a database of books:

  • Before Unnest Transformation

  • After Unnest Transformation

Cleaning Text

For text data, excess whitespace and changing capitalization schemes are two common data cleanliness issues. Dremio provides two transformations for dealing with these possible inconsistencies: Trim Whitespace and Convert Case.

In this example, a text field with a mixed capitalization scheme is transformed into consistency:

Handling Invalid, Empty and NULL Values

Fields with invalid values (such as the text 'N/A' in an otherwise numeric column) are marked with an orange type icon as a 'mixed type' field. These can be handled by clicking the orange mixed type icon using the dialog that appears to clean up the data.
See Dealing with Mixed Types for more information.

Empty or NULL text values are best eliminated by using Exclude. You can initiate this transformation by:

  1. highlighting a value from the field that contains empty or NULL values, and

  2. selecting Exclude from the dropdown that appears. This renders a list of the values in this field, and the frequency at which they occur.

  3. Check the boxes next to the empty and/or NULL values you which to exclude from the dataset and click Apply.

In the following example, we remove blank values from a field that holds state names:

Working with Dates

In Dremio you can convert a text type field that contains date information into a proper date type field. This allows you to do more sophisticated analyses in external tools such as spotting by trends by month, year, or quarter.

You can begin this conversion by selecting 'Date & Time' from the type menu located to the left of the text field's name. In the subsequent dialog, Dremio provides a dropdown for selecting whether the output should be a time, date, date and time. It also gives a few default options for the format as well as a 'Custom' field for indicating a custom format.

Dremio supports the following formatting elements:

Format ElementDescriptionExample Output
AD/BCEra indicatorAD, BC
AM/PMMeridian indicatorAM, PM
CCCentury indicator (0-99)19
WWWeek of year (0-52)4, 43
DDay of week (1-7)6
DYAbbreviated day name of weekTue, Fri
DAYFull day name of weekTuesday, Friday
YYYYFour digits of year1996
YYLast two digits of year96
DDDDay of year (1-366)5, 245
MMMonth (1-12)8
MONAbbreviated month nameMar, Oct
MONTHFull month nameMarch, October
DDDay of month (1-31)24
HH/HH12Hour of day (1-12)4
HH24Hour of day (0-23)21
MIMinutes (0-59)22
SSSeconds (0-59)54
FFFMilliseconds121
TZDTimezone abbreviationUTC, PST
TZOTimezone offset+02:00, -0800
note
  • Only the following characters are allowed in a format pattern: - / , . ; :.
  • If you need to let a value pass through to the output unmodified, you can surround it with double quotes " (for example, "T").
  • If you convert a date to text, numeric values are zero padded for you.
    For example, MM returns 04 for April.

Copying and Downloading Result Sets

You can copy the result sets from the display window to a clipboard or Notepad. This is accomplished via a button on the results table that copies the data from the query to the clipboard. The maximum number of records copied is 5,000.

You can download your result sets. Downloaded jobs run much faster as they no longer rerun the original query, but essentially download the results from the distributed storage directly. That is, what is configured for the distributed storage. See Configuring Distributed Storage for more information.

Result sets can be downloaded in one of the following formats:

  • JSON
  • CSV
  • Parquet

If you click (or command-click) on the format type, the dataset is downloaded:

  • To your default download location.
  • With the default file name of the user Job ID.

If you right-click on the format type, you have the option to select Save Link As…. This allows you to:

  • Specify the file name.
  • Specify the download location.

The downloaded data is the previous output query results set (the query is not re-run for the download action).