Pillar 6 - Self-service Semantic Layer
Dremio has a unique capability in its semantic layer, where the physical structure of the underlying data storage is mapped to how the data is ultimately consumed ultimately via SQL queries. When the semantic layer is optimally designed and maintained, the data is easy to discover, queries are easy to write, and performance is optimized.
Principles
Layer Views
By layering your views, you can balance security, performance, and usability. Layered views will help you take the data in your physical tables and expose it to external consumption tools in the format they require, properly secure it, and address performance. A well-architected semantic layer consists of three layers to organize your views -- Preparation, Business, and Application. Each layer serves a purpose in transforming the data so it is ready for consumption by external tools
Annotate Datasets to Enhance Discovery and Understanding
Datasets can be tagged and documented in Dremio, allowing data to be more easily discovered. The data's veracity can be verified, and governance can be applied.
Best Practices
Use the Preparation Layer for 1:1 Mapping to Tables
The Preparation layer is closest to the data sources. This layer is used to organize and expose only the required datasets from a source rather than all datasets in a source. In this layer, each view is mapped to the table from which it is derived in the data source, and there are no joins to other views. Typically, a data engineer is responsible for preparing the data in this layer. In the Preparation layer, the data engineer should apply column aliasing so that all downstream views can use the normalized column names. Casting of column data types should be done in the Preparation layer so that all higher level views can leverage the correct type, and the conversion is done only once. Data should be cleansed in the Preparation layer so that it is centrally managed and all downstream views can used clean data. When derived columns based on existing columns are required, this should be done in the Preparation layer, thus enabling all future layers to use the new columns.
Use the Business Layer to Logically Join Datasets
The Business layer provides a holistic view of all data across your Dremio space. It is the first layer where joins among and between sources should occur, and all views in this layer must be built by either:
Querying resources in the Preparation layer. Views in the Business layer should start with selecting all columns from the Preparation layer of that view. This is typically a 1:1 mapping between the Preparation layer and the Business layer.
Querying other resources in the same Business layer. When joining two views together, they should be joined from the Business layer representation of that view, not the Preparation layer. This allows all changes made in the Business layer to propagate to all joins.
Use your list of common terms to describe the key business entities in your organization, such as a customer, product, or an order. Typically, a data modeler works with business experts and data providers to define the views that represent the business entities.
You can create many sub-layers inside the Business layer, each one consisting of views for different subject areas or verticals. These views are reusable components that can and should be shared across business lines. Typically, views do not filter rows or columns in the Business layer -- that is deferred to the Application layer.
Using this layer, you can improve productivity for analytics initiatives and minimize the risk of duplicative efforts in your organization by:
- reducing the cost of service delivery to lines of business
- providing a self service model for data engineers to quickly provision datasets
- enabling data consumers to quickly put those datasets to use and share with others
Use the Application Layer to Arrange Datasets for Consumption
Application layer views are arranged for the needs of data consumers, organizational departments, etc. Typically, data consumers, such as analysts or data scientists, leverage the views from the Business layer and work directly in the Application layer to create and modify views that are used in their own dashboards.
If the Application layer is providing for self-service access to Dremio’s semantic layer, you should expose, at a minimum, all Business layer views in the Application layer. Even if the view is SELECT * from BUSINESS_VIEW
, that provides logical separation for security and performance improvements.
If the Application layer is not for self-service but for particular applications, the views in the Application layer should be built on top of those self-service views in the Application layer and adding any application specific logic. Application logic should be row filters, as needed by the application. Columns can be left as-is, and then the list of columns selected by the application will be reduced in the SQL query.
Leverage Tags to Enhance Searchability
Create and assign tags to tables and views to enhance the discoverability of data across your organization. You can search for sets of tables and views based on a tag, or you can click on a tag to start a search based on that specific tag. The tagging functionality enables you to group related objects with the same tag. Objects can also have multiple tags so they can belong to different logical groups.
Create wiki content to embellish datasets
Use Dremio’s wiki functionality to add a description to a space, source, folder, table, or view. This will enhance the understanding of the data inside your organization. The wiki is extremely useful to provide more context around the dataset, such as descriptions for each column, or content that helps users get started with the data (e.g., usage examples, usage notes, or points of contact for questions or issues).
The formatting language for wikis is a Github-flavored markdown and supported by a rich text editor.
Use Data Lineage to Understand Relationships Between Objects
Use Dremio’s data lineage graph to understand the parents or children of a specific dataset and to understand how datasets combine. You can trace the lineage all the way from a view back to the table(s) in its associated data sources. The entire structure of tables, column names, data types, how many times it’s been queries etc. is presented as you traverse the graph.