
The Data Lakehouse: Data Warehousing and More (2310.08697v1)

Published 12 Oct 2023 in cs.DB

Abstract: Relational Database Management Systems designed for Online Analytical Processing (RDBMS-OLAP) have been foundational to democratizing data and enabling analytical use cases such as business intelligence and reporting for many years. However, RDBMS-OLAP systems present some well-known challenges. They are primarily optimized only for relational workloads, lead to proliferation of data copies which can become unmanageable, and since the data is stored in proprietary formats, it can lead to vendor lock-in, restricting access to engines, tools, and capabilities beyond what the vendor offers. As the demand for data-driven decision making surges, the need for a more robust data architecture to address these challenges becomes ever more critical. Cloud data lakes have addressed some of the shortcomings of RDBMS-OLAP systems, but they present their own set of challenges. More recently, organizations have often followed a two-tier architectural approach to take advantage of both these platforms, leveraging both cloud data lakes and RDBMS-OLAP systems. However, this approach brings additional challenges, complexities, and overhead. This paper discusses how a data lakehouse, a new architectural approach, achieves the same benefits of an RDBMS-OLAP and cloud data lake combined, while also providing additional advantages. We take today's data warehousing and break it down into implementation independent components, capabilities, and practices. We then take these aspects and show how a lakehouse architecture satisfies them. Then, we go a step further and discuss what additional capabilities and benefits a lakehouse architecture provides over an RDBMS-OLAP.


Summary

  • The paper demonstrates that the data lakehouse integrates the strengths of legacy data warehouses with cloud data lakes to eliminate data duplication and reduce vendor lock-in.
  • It details how open file and table formats, together with decoupled storage and compute, enable scalable, ACID-compliant transactions and low-latency queries.
  • It provides a practical example for BI and machine learning using technologies like Apache Iceberg, Project Nessie, and Apache Spark to streamline data operations.

The paper "The Data Lakehouse: Data Warehousing and More" (2310.08697) presents the data lakehouse as a modern data architecture that combines the strengths of traditional data warehouses (specifically RDBMS-OLAP systems) and cloud data lakes, while addressing their respective limitations. The core argument is that a data lakehouse can fulfill the requirements of data warehousing practices and capabilities while offering additional benefits like handling diverse data types, avoiding vendor lock-in, reducing costs, and enabling advanced analytics like machine learning directly on the data lake.

Traditional RDBMS-OLAP data warehousing systems have been essential for business intelligence and reporting, optimized primarily for structured, relational data and SQL workloads. However, they face several challenges:

  • Limited to structured data: They are not well-suited for semi-structured or unstructured data, pushing organizations towards separate data lakes for such data and advanced analytics like machine learning.
  • Vendor lock-in and lock-out: Data is often stored in proprietary formats, limiting access by diverse tools and making data migration difficult. Exporting data to data lakes creates complex ETL pipelines and redundant data copies, leading to potential data drift and governance issues.
  • High costs: Storing large data volumes and running compute on proprietary systems can be expensive, especially with the need for pre-aggregated tables and materialized views.

The paper breaks down traditional data warehousing into:

  1. Technical Components (RDBMS-OLAP Components):
    • Data storage: Efficiently storing large volumes of data.
    • File format: How data is written within files (often proprietary, columnar formats like Parquet are beneficial for OLAP).
    • Table format: A metadata layer organizing data files (often proprietary in RDBMS-OLAP).
    • Storage engine: Manages data organization, updates, deletes, and constraints.
    • Compute engine: Executes queries and handles transformations and aggregations, often using Massively Parallel Processing (MPP).
    • Catalog: Stores metadata for data discovery.
  2. Technical Capabilities (RDBMS-OLAP Capabilities):
    • Governance and security: Access control (row/column-level, role-based), encryption, audit logging.
    • High concurrency: Handling multiple simultaneous reads and writes.
    • Low query latency: Achieved through optimization techniques like indexing, partitioning, caching, and query optimization.
    • Ad hoc queries: Support for interactive exploration.
    • Workload management (WLM): Managing resources for different types of workloads.
    • Schema and physical layout evolution: Adapting table structures over time without downtime or complex migrations.
    • ACID-compliant transactions: Ensuring atomicity, consistency, isolation, and durability for data modifications, including multi-statement/multi-table transactions and rollback capabilities.
    • Separation of storage and compute: A more recent feature in cloud RDBMS-OLAP systems, allowing independent scaling.
  3. Technology-Independent Practices (DW Practices):
    • Data modeling: Designing logical and physical models (e.g., star, snowflake schemas).
    • ETL/ELT: Processes for extracting, transforming, and loading data into the system. ELT (Extract, Load, Transform) is highlighted as more flexible, loading raw data first.
    • Data quality: Ensuring data accuracy and consistency through practices like Master Data Management (MDM), Referential Integrity, and handling Slowly Changing Dimensions (SCDs).

The paper introduces the Data Lakehouse as an architecture that addresses these points while overcoming the limitations of RDBMS-OLAP. A data lakehouse is characterized by:

  • Transactional support (ACID properties).
  • Storing data in open formats (file formats like Parquet, ORC, and table formats like Apache Iceberg, Apache Hudi, Delta Lake).
  • No unnecessary data copies, leveraging compute engines directly on the data lake.
  • Strong data quality and governance features.
  • Schema management capabilities.
  • Scalability via separated storage and compute.

In the data lakehouse architecture, the technical components are decoupled and often built using open source technologies or cloud services:

  • Data storage: Cloud object stores (S3, ADLS, GCS) offering low cost and massive scalability for any data type.
  • Storage engine: Handled by services or engines that perform data management tasks such as compaction and repartitioning (e.g., Dremio Arctic, Tabular, or engines like Spark/Flink with orchestration).
  • File format: Open columnar formats like Apache Parquet for efficient reads.
  • Table format: Open metadata layers (Apache Iceberg, Hudi, Delta Lake) providing ACID transactions, schema evolution, partitioning, and time travel on top of data files in object storage.
  • Catalog: A service tracking tables and their metadata (e.g., Apache Hive Metastore, AWS Glue, Project Nessie, Dremio Arctic, Tabular).
  • Compute engine: Decoupled engines optimized for specific workloads (e.g., Dremio Sonar for SQL, Apache Spark for ML, Apache Flink for streaming) that interact with data via the table format (see the configuration sketch after this list).
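
To make these decoupled components concrete, here is a minimal sketch (not taken from the paper) of configuring a Spark compute engine to use the Iceberg table format through a Nessie catalog backed by S3. The catalog name "arctic", the Nessie endpoint, and the bucket are illustrative assumptions, and the Iceberg Spark runtime and Nessie Spark SQL extension jars are assumed to be on the classpath.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    # Enable Iceberg DDL/DML and Nessie branch/merge SQL syntax
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    # Register an Iceberg catalog named "arctic" whose table metadata is tracked by Nessie
    .config("spark.sql.catalog.arctic", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.arctic.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.arctic.uri", "https://nessie.example.com/api/v1")       # hypothetical endpoint
    .config("spark.sql.catalog.arctic.ref", "main")                                    # default branch
    .config("spark.sql.catalog.arctic.warehouse", "s3://example-lakehouse/warehouse")  # hypothetical bucket
    .getOrCreate()
)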

The paper demonstrates how a data lakehouse replicates or improves upon RDBMS-OLAP capabilities:

  • Governance and security: Tools like Apache Ranger or modern lakehouse catalogs provide fine-grained access control on the data lake.
  • High concurrency: Lakehouse platforms scale compute resources dynamically and use optimistic concurrency control via table formats for safe concurrent access.
  • Low query latency: Achieved through engine optimizations (query acceleration, caching) and table format features (partitioning, clustering, indexing, compaction).
  • Ad hoc queries: Supported by SQL-based compute engines, often with native BI tool connectivity.
  • Workload management (WLM): Achieved by using different engines for different workloads and configuring queueing/resource limits within engines.
  • Schema and physical layout evolution: Open table formats enable in-place schema changes and partition evolution without rewriting data (see the sketch after this list).
  • ACID-compliant transactions: Core feature provided by table formats, enabling reliable data modifications and transaction rollback via snapshots. Multi-statement/multi-table transactions can be supported with version control systems like Project Nessie or LakeFS.
  • Separation of storage and compute: A fundamental principle of the lakehouse architecture, enabling independent scaling.
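
To illustrate the schema evolution, partition evolution, and snapshot-based time travel capabilities above, here is a minimal sketch assuming the SparkSession and "arctic" Iceberg/Nessie catalog from the earlier sketch; the column names and the snapshot id are hypothetical.

# In-place schema evolution: add a column without rewriting existing data files
spark.sql("ALTER TABLE arctic.telco.churnquarter ADD COLUMNS (region STRING)")

# Partition evolution: change the partition spec; only newly written data uses the new layout
spark.sql("ALTER TABLE arctic.telco.churnquarter ADD PARTITION FIELD bucket(16, customer_id)")

# ACID snapshots enable time travel and rollback: list snapshots, then read an older one
spark.sql("SELECT snapshot_id, committed_at FROM arctic.telco.churnquarter.snapshots").show()
old = spark.read.option("snapshot-id", 1234567890123).table("arctic.telco.churnquarter")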

Technology-independent data warehousing practices are fully supported in a lakehouse:

  • Data modeling: Standard modeling techniques are applied to data in the lakehouse layers.
  • ELT: Becomes the preferred approach, loading raw data to the lake first for schema-on-read flexibility, then transforming it for schema-on-write benefits.
  • Data quality: MDM, referential integrity checks (via SQL engines), and SCD implementation (using table format features for row-level updates/deletes and history tracking) are all possible (a MERGE-based SCD sketch follows this list).
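
As a concrete illustration of the SCD point above, here is a minimal sketch (not from the paper) of a Type 2 style update using Iceberg's MERGE INTO support in Spark SQL. The dimension table arctic.telco.dim_customer, the staged view customer_updates, and all column names are hypothetical; a full Type 2 flow would also insert the new current row for changed keys in a follow-up statement.

# Close out the current dimension row when a tracked attribute changes,
# and insert brand-new customers as current rows
spark.sql("""
    MERGE INTO arctic.telco.dim_customer AS t
    USING customer_updates AS s
      ON t.customer_id = s.customer_id AND t.is_current = true
    WHEN MATCHED AND t.plan <> s.plan THEN
      UPDATE SET t.is_current = false, t.valid_to = s.effective_date
    WHEN NOT MATCHED THEN
      INSERT (customer_id, plan, valid_from, valid_to, is_current)
      VALUES (s.customer_id, s.plan, s.effective_date, NULL, true)
""")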

Additional value provided by the data lakehouse includes:

  • Open data architecture: Data is stored openly, preventing vendor lock-in and allowing the use of diverse, best-of-breed engines for various workloads.
  • Fewer data copies and better governance: Eliminates the need to copy data from a data lake to a separate data warehouse for BI, or export from the warehouse for ML, reducing complexity, cost, and governance challenges.
  • Manage data as code: Catalogs like Project Nessie and LakeFS enable Git-like version control for data tables, supporting isolated branches for experiments, data quality checks, and atomic merging for production rollouts, similar to blue-green deployments (a branch-and-merge sketch follows this list).
  • Federation: Compute engines can query data directly from the lakehouse and other data sources (like operational databases), enabling integrated analysis without centralizing all data.
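
The "manage data as code" workflow can be sketched with Nessie's Spark SQL extensions, assuming the "arctic" catalog from the earlier sketch; the branch name and the staged_churn view are hypothetical.

# Create an isolated branch, load and validate on it, then publish atomically (blue-green style)
spark.sql("CREATE BRANCH IF NOT EXISTS etl_oct IN arctic FROM main")
spark.sql("USE REFERENCE etl_oct IN arctic")                                   # route reads/writes to the branch
spark.sql("INSERT INTO arctic.telco.churnquarter SELECT * FROM staged_churn")  # load on the branch
# ... run data quality checks against the branch here ...
spark.sql("MERGE BRANCH etl_oct INTO main IN arctic")                          # atomic publish to main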

The paper illustrates a basic lakehouse implementation for BI and machine learning using Apache Iceberg on Amazon S3, Project Nessie as a catalog, and Dremio Sonar and Apache Spark as compute engines.

The example demonstrates:

  1. Ingesting Parquet files into S3.
  2. Creating an Apache Iceberg table on top of these files, managed by a Project Nessie catalog.
  3. Using Dremio Sonar (a SQL engine connected to Nessie/Iceberg/S3) to query the data for BI dashboarding via tools like Tableau.
  4. Using Apache Spark (which also supports Iceberg) to read the same Iceberg table directly from S3 for machine learning model training using scikit-learn.

-- Example using Dremio Sonar to create an Iceberg table
CREATE TABLE arctic.telco.churnquarter AS
SELECT * FROM "churn-bigml-20_allfeat_Oct_train_data.parquet"
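
Alternatively (a sketch not shown in the paper), the same Iceberg table could be created from Spark by reading the raw Parquet files and writing them through the catalog; the S3 path below is hypothetical.

# Create/replace the Iceberg table from raw Parquet using Spark's DataFrameWriterV2 API
raw = spark.read.parquet("s3://example-lakehouse/raw/churn-bigml-20_allfeat_Oct_train_data.parquet")
raw.writeTo("arctic.telco.churnquarter").createOrReplace()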

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

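# Assumes an active SparkSession configured (as in the earlier sketch) with the "arctic"
# Iceberg catalog, so the Nessie-tracked table on S3 can be read directly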
df_telco = spark.read.table("arctic.telco.churnquarter").toPandas()

target = df_telco.iloc[:, -1].values
features = df_telco.iloc[:, :-1].values

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=101)

rfc = RandomForestClassifier(n_estimators=600)
rfc.fit(X_train, y_train)

predictions = rfc.predict(X_test)

acc = accuracy_score(y_test, predictions)
classReport = classification_report(y_test, predictions)
confMatrix = confusion_matrix(y_test, predictions)

print('Evaluation of the trained model:')
print('Accuracy:', acc)
print('Confusion Matrix:\n', confMatrix)
print('Classification Report:\n', classReport)

This example highlights the key benefit: different engines can access the same data copy, managed by the open table format and catalog, eliminating ETL complexity and data silos.

In conclusion, the paper argues that the data lakehouse architecture, built on open standards such as Apache Iceberg and on decoupled storage and compute, supplies all the components, capabilities, and practice support that data warehousing requires, while offering a more flexible, cost-effective, and future-proof platform that handles diverse data types and advanced analytical workloads directly on a single copy of the data.