Manage and Govern Your Data
Data management focuses on the operational efficiency, performance, and reliability of your data at scale. With Dremio’s autonomous management capabilities, many of these processes are intelligently automated, reducing manual effort and ensuring consistent optimization. Dremio automates table optimization by merging small files into optimally sized ones (typically around 512 MB), reducing metadata overhead, and reclaiming storage by physically removing deleted rows. It also reorganizes data to align with clustering specifications, ensuring consistent, high-performance queries across large datasets. Together, these autonomous management features help keep your lakehouse fast, efficient, and cost-effective.
Data governance is the foundation of a secure, reliable, and compliant lakehouse. It ensures that data across your environment is accurate, consistent, and properly controlled throughout its lifecycle. With Dremio, you can implement robust governance practices by maintaining complete data lineage for transparency and auditability, defining role-based and fine-grained (row-access and column-masking) access controls on data, and using documentation and tags to improve data discoverability. Together, these capabilities enable trustworthy, well-governed data that fuels analytics and AI with confidence.
Autonomous Management
Optimization
Managing Apache Iceberg tables is critical to maintaining fast and predictable query performance, especially for agentic AI workloads that demand low latency. As new data is ingested and tables are updated, metadata and small data files accumulate, and query performance degrades over time. Dremio automates table optimization by merging small files into optimally sized ones (typically around 512 MB), reducing metadata overhead, organizing data to align with clustering specifications, and reclaiming storage by physically removing deleted rows.
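To make the small-file merge concrete, the following is a minimal Python sketch of how small files could be grouped toward a target size. It is an illustration only, not Dremio's actual algorithm: the `plan_compaction` function, the first-fit-decreasing heuristic, and the sample file sizes are all assumptions introduced here. Only the approximate 512 MB target comes from the text above.

```python
from typing import List

# Approximate target size mentioned in the text; the real value may differ.
TARGET_MB = 512


def plan_compaction(file_sizes_mb: List[int], target_mb: int = TARGET_MB) -> List[List[int]]:
    """Group file sizes into bins whose totals stay at or below target_mb.

    Uses a first-fit-decreasing heuristic (an assumption for illustration):
    place each file, largest first, into the first bin that still has room.
    """
    bins: List[List[int]] = []
    totals: List[int] = []
    for size in sorted(file_sizes_mb, reverse=True):
        for i, total in enumerate(totals):
            if total + size <= target_mb:
                bins[i].append(size)
                totals[i] += size
                break
        else:
            # No existing bin has room, so start a new output file.
            bins.append([size])
            totals.append(size)
    return bins


# Hypothetical sizes (in MB) of data files in one table.
small_files = [40, 300, 120, 200, 60, 500, 16]
plan = plan_compaction(small_files)
for group in plan:
    print(f"merge {group} -> {sum(group)} MB")
```

With these sample sizes, the seven input files are planned into three output files, each at or below the target size.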
Clustering
Dremio also reorganizes data to align with clustering specifications, ensuring consistent, high-performance queries at scale.
Materialize and Query Rewrite
Dremio can autonomously materialize datasets using Reflections. A Reflection is a precomputed and optimized copy of source data or a query result, designed to speed up query performance. Dremio's query optimizer can accelerate a query against tables or views by using one or more Reflections to partially or entirely satisfy that query, rather than processing the raw data in the underlying data source. Queries do not need to reference Reflections directly. Instead, Dremio rewrites queries on the fly to use the Reflections that satisfy them. For more information, see Reflections.
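The idea that a query never names a Reflection, yet can still be served by one, can be sketched in a few lines of Python. This is a deliberately simplified toy model, not how Dremio's optimizer works: real matching operates on query plans, while this sketch only checks column coverage. The `Reflection` class, its fields, and `choose_reflection` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class Reflection:
    """Toy stand-in for a materialization (hypothetical structure)."""
    name: str
    dataset: str
    columns: FrozenSet[str]
    row_count: int


def choose_reflection(
    dataset: str, needed_columns: Set[str], reflections: List[Reflection]
) -> Optional[Reflection]:
    """Return the smallest Reflection that covers the query, or None.

    None means the query falls back to the underlying source data.
    """
    candidates = [
        r for r in reflections
        if r.dataset == dataset and needed_columns <= r.columns
    ]
    return min(candidates, key=lambda r: r.row_count, default=None)


reflections = [
    Reflection("orders_raw", "sales.orders",
               frozenset({"order_id", "region", "amount", "ts"}), 10_000_000),
    Reflection("orders_by_region", "sales.orders",
               frozenset({"region", "amount"}), 50),
]

# The query only names the dataset and columns, never a Reflection.
picked = choose_reflection("sales.orders", {"region", "amount"}, reflections)
fallback = choose_reflection("sales.orders", {"customer_id"}, reflections)
```

In this toy model the caller's query stays unchanged; the selection step decides whether a materialization can answer it, which mirrors the transparent rewrite described above.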
Governance
Lineage
Track and visualize how data flows through your lakehouse, from source to consumption. Lineage helps you understand data origins, track transformations, identify dependencies, and perform impact analysis.
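Impact analysis on lineage amounts to a graph traversal: starting from one dataset, find everything downstream of it. The sketch below shows this with a breadth-first search in Python. The dataset names and the `LINEAGE` edge map are hypothetical, and the code does not call any Dremio API; it only illustrates the concept.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage edges: each key feeds the datasets in its list.
LINEAGE: Dict[str, List[str]] = {
    "raw.orders": ["staging.orders_clean"],
    "raw.customers": ["staging.customers_clean"],
    "staging.orders_clean": ["marts.revenue_by_customer"],
    "staging.customers_clean": ["marts.revenue_by_customer"],
    "marts.revenue_by_customer": ["dashboards.exec_summary"],
}


def downstream(dataset: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Return every dataset that directly or indirectly depends on `dataset`."""
    seen: Set[str] = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Which datasets are affected if raw.orders changes?
impacted = downstream("raw.orders", LINEAGE)
print(sorted(impacted))
```

Reversing the edge map and running the same traversal would answer the opposite question: which sources a given dataset originates from.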
Wikis
Enrich data understanding by documenting datasets with wikis. Use Generative AI to automatically generate wikis, reducing manual documentation effort. Wikis are used by Dremio's AI Agent to understand the semantics of your environment and adhere to these definitions in response to user prompts.
Labels
Enhance data discoverability and searchability by categorizing datasets with labels. Use Generative AI to automatically generate labels, reducing manual cataloging effort.
Role-Based Access Control Policies
Manage access to datasets through roles rather than individual user grants for easier administration. Assign privileges to roles, simplifying management and ensuring users only have access to what they need to perform their job.
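The benefit of granting privileges to roles instead of users can be shown with a small Python model. This is a conceptual sketch, not Dremio's privilege system: the role names, user names, privilege names, and the `is_allowed` function are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical grants: privileges attach to roles, never directly to users.
ROLE_PRIVILEGES: Dict[str, Set[Tuple[str, str]]] = {
    "analyst": {("SELECT", "marts.revenue_by_customer")},
    "data_engineer": {("SELECT", "raw.orders"), ("ALTER", "raw.orders")},
}

# Hypothetical role membership.
USER_ROLES: Dict[str, Set[str]] = {
    "alice": {"analyst"},
    "bob": {"analyst", "data_engineer"},
}


def is_allowed(user: str, privilege: str, dataset: str) -> bool:
    """A user holds a privilege if any of their roles has been granted it."""
    return any(
        (privilege, dataset) in ROLE_PRIVILEGES.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


print(is_allowed("alice", "SELECT", "marts.revenue_by_customer"))
print(is_allowed("alice", "SELECT", "raw.orders"))
print(is_allowed("bob", "ALTER", "raw.orders"))
```

Onboarding a new analyst in this model means adding one role membership rather than repeating every dataset grant, which is the administrative simplification the section describes.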
Row-Access and Column-Masking Policies
Apply fine-grained access controls to protect sensitive data using row-access and column-masking policies. Control access to specific rows and columns based on rules and conditions to maintain compliance and adhere to regulatory requirements. For more information, see Row-Access & Column-Masking Policies.
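As a rough illustration of how the two policy types combine, the Python sketch below filters rows first and then masks a sensitive column. The policy rules, the `compliance` role, the column names, and the sample rows are all hypothetical; in Dremio these policies are defined on the data itself rather than in application code.

```python
from typing import Dict, List, Set

Row = Dict[str, str]


def mask_ssn(value: str, roles: Set[str]) -> str:
    """Column-masking rule (assumed): only the compliance role sees the full value."""
    if "compliance" in roles:
        return value
    return "XXX-XX-" + value[-4:]


def apply_policies(rows: List[Row], roles: Set[str], user_region: str) -> List[Row]:
    """Apply a row-access rule, then a column mask, to a result set."""
    # Row-access rule (assumed): users see only rows from their own region,
    # unless they hold the compliance role.
    visible = [
        r for r in rows
        if "compliance" in roles or r["region"] == user_region
    ]
    # Return copies so the source rows are never modified.
    return [{**r, "ssn": mask_ssn(r["ssn"], roles)} for r in visible]


customers = [
    {"name": "Ana", "region": "EMEA", "ssn": "123-45-6789"},
    {"name": "Raj", "region": "APAC", "ssn": "987-65-4321"},
]

analyst_view = apply_policies(customers, {"analyst"}, "EMEA")
compliance_view = apply_policies(customers, {"compliance"}, "EMEA")
```

The same query therefore returns different results depending on who runs it: the analyst sees one row with a masked value, while the compliance user sees both rows unmasked.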
Related Topics
- Roles - Manage role-based access control.
- Explore and Analyze Your Data - Explore and analyze your governed data.
- Catalog API - Lineage - Retrieve lineage information about datasets.