Creating a Virtual Dataset
Virtual datasets are built on top of the immutable physical datasets found in sources. In fact, you can think of a virtual dataset as a layered stack of data transformations that have been performed on top of one or more physical datasets.
Each virtual dataset is ultimately described by a SQL query. You can view and edit this query by expanding the SQL Editor box at the top of a dataset view.
Let’s imagine a physical dataset with an ‘id’ field that we want to drop.
After we drop this field, we save the dataset by clicking the Save button in the upper-right corner. Because physical datasets cannot be modified, this creates a new virtual dataset. Dremio prompts us to name the new virtual dataset and to select, from a list of spaces, where it will be stored.
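As a rough sketch, the query behind the new virtual dataset might look like the following. The source, table, and column names (my_source.customers, name, city) are hypothetical and only illustrate the idea; the query Dremio generates for a real dataset will differ.

```sql
-- Hypothetical physical dataset my_source.customers with columns id, name, city.
-- Dropping 'id' amounts to selecting every remaining column explicitly.
SELECT name, city
FROM my_source.customers
```

The physical dataset itself is untouched; only the saved query leaves out the 'id' field.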
It is possible to create virtual datasets based on other virtual datasets or physical datasets.
There are a few considerations when chaining datasets:
- If a column that is used in the child dataset (either by direct column reference or by select *) is dropped from the parent dataset, the child dataset becomes invalid and stays invalid until it is updated.
- If a column is added to a parent dataset, it does not show up in the child dataset (even when using select *) until the child dataset is updated.
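To make these considerations concrete, here is a sketch of the queries behind a hypothetical parent and child dataset. All names (my_source.customers, my_space.customers_no_id, name, city) are assumptions used for illustration only.

```sql
-- Query behind the hypothetical parent virtual dataset my_space.customers_no_id:
SELECT name, city
FROM my_source.customers;

-- Query behind a hypothetical child virtual dataset that references
-- the parent's 'city' column directly:
SELECT city, COUNT(*) AS customer_count
FROM my_space.customers_no_id
GROUP BY city;
```

In this sketch, dropping 'city' from the parent would leave the child invalid until its query is edited to stop using that column. Adding a column to the parent would not change what the child returns until the child is updated.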
When viewing a dataset, you can see a history of all the applied transformations by hovering over the vertical row of dots on the right-hand side. The latest transformation is represented by an orange dot at the top of the list. Clicking on an earlier transformation returns you to that previous state of the dataset, where you can inspect the contents and even save a new dataset that follows a divergent transformation timeline.
In this way Dremio’s dataset history can provide a number of functions, such as:
- A tool for inspecting how a dataset was created
- A visual ‘undo’
- A method for ‘forking’ earlier versions of datasets and taking them down different analysis pathways
Exploring Data Lineage
You can click the Graph button, located above the contents of a dataset on the upper right, to inspect how the dataset is related to other datasets. Parent datasets and data sources are shown on the left, with child datasets located to the right.
Clicking on datasets in the graph moves the view left and right along the chain of inheritance, allowing you to inspect the relationship of the selected dataset to others on your Dremio cluster. You can refocus the graph on the selected dataset at any time by hitting the target icon.