
The Data Lakehouse: Data Warehousing and More (2310.08697v1)

Published 12 Oct 2023 in cs.DB

Abstract: Relational Database Management Systems designed for Online Analytical Processing (RDBMS-OLAP) have been foundational to democratizing data and enabling analytical use cases such as business intelligence and reporting for many years. However, RDBMS-OLAP systems present some well-known challenges. They are primarily optimized only for relational workloads, lead to proliferation of data copies which can become unmanageable, and since the data is stored in proprietary formats, it can lead to vendor lock-in, restricting access to engines, tools, and capabilities beyond what the vendor offers. As the demand for data-driven decision making surges, the need for a more robust data architecture to address these challenges becomes ever more critical. Cloud data lakes have addressed some of the shortcomings of RDBMS-OLAP systems, but they present their own set of challenges. More recently, organizations have often followed a two-tier architectural approach to take advantage of both these platforms, leveraging both cloud data lakes and RDBMS-OLAP systems. However, this approach brings additional challenges, complexities, and overhead. This paper discusses how a data lakehouse, a new architectural approach, achieves the same benefits of an RDBMS-OLAP and cloud data lake combined, while also providing additional advantages. We take today's data warehousing and break it down into implementation independent components, capabilities, and practices. We then take these aspects and show how a lakehouse architecture satisfies them. Then, we go a step further and discuss what additional capabilities and benefits a lakehouse architecture provides over an RDBMS-OLAP.


Summary

  • The paper demonstrates that the data lakehouse integrates the strengths of legacy data warehouses with cloud data lakes to eliminate data duplication and reduce vendor lock-in.
  • It details how open file and table formats, together with decoupled storage and compute, enable scalable, ACID-compliant transactions and low-latency queries.
  • It provides a practical example for BI and machine learning using technologies like Apache Iceberg, Project Nessie, and Apache Spark to streamline data operations.

The paper "The Data Lakehouse: Data Warehousing and More" (2310.08697) presents the data lakehouse as a modern data architecture that combines the strengths of traditional data warehouses (specifically RDBMS-OLAP systems) and cloud data lakes, while addressing their respective limitations. The core argument is that a data lakehouse can fulfill the requirements of data warehousing practices and capabilities while offering additional benefits like handling diverse data types, avoiding vendor lock-in, reducing costs, and enabling advanced analytics like machine learning directly on the data lake.

Traditional RDBMS-OLAP data warehousing systems have been essential for business intelligence and reporting, optimized primarily for structured, relational data and SQL workloads. However, they face several challenges:

  • Limited to structured data: They are not well-suited for semi-structured or unstructured data, pushing organizations towards separate data lakes for such data and advanced analytics like machine learning.
  • Vendor lock-in and lock-out: Data is often stored in proprietary formats, limiting access by diverse tools and making data migration difficult. Exporting data to data lakes creates complex ETL pipelines and redundant data copies, leading to potential data drift and governance issues.
  • High costs: Storing large data volumes and running compute on proprietary systems can be expensive, especially with the need for pre-aggregated tables and materialized views.

The paper breaks down traditional data warehousing into:

  1. Technical Components (RDBMS-OLAP Components):
    • Data storage: Efficiently storing large volumes of data.
    • File format: How data is written within files (often proprietary, columnar formats like Parquet are beneficial for OLAP).
    • Table format: A metadata layer organizing data files (often proprietary in RDBMS-OLAP).
    • Storage engine: Manages data organization, updates, deletes, and constraints.
    • Compute engine: Executes queries and handles transformations and aggregations, often using Massively Parallel Processing (MPP).
    • Catalog: Stores metadata for data discovery.
  2. Technical Capabilities (RDBMS-OLAP Capabilities):
    • Governance and security: Access control (row/column-level, role-based), encryption, audit logging.
    • High concurrency: Handling multiple simultaneous reads and writes.
    • Low query latency: Achieved through optimization techniques like indexing, partitioning, caching, and query optimization.
    • Ad hoc queries: Support for interactive exploration.
    • Workload management (WLM): Managing resources for different types of workloads.
    • Schema and physical layout evolution: Adapting table structures over time without downtime or complex migrations.
    • ACID-compliant transactions: Ensuring atomicity, consistency, isolation, and durability for data modifications, including multi-statement/multi-table transactions and rollback capabilities.
    • Separation of storage and compute: A more recent feature in cloud RDBMS-OLAP systems, allowing independent scaling.
  3. Technology-Independent Practices (DW Practices):
    • Data modeling: Designing logical and physical models (e.g., star, snowflake schemas).
    • ETL/ELT: Processes for extracting, transforming, and loading data into the system. ELT (Extract, Load, Transform) is highlighted as more flexible, loading raw data first.
    • Data quality: Ensuring data accuracy and consistency through practices like Master Data Management (MDM), Referential Integrity, and handling Slowly Changing Dimensions (SCDs).

The paper introduces the Data Lakehouse as an architecture that addresses these points while overcoming the limitations of RDBMS-OLAP. A data lakehouse is characterized by:

  • Transactional support (ACID properties).
  • Storing data in open formats (file formats like Parquet, ORC, and table formats like Apache Iceberg, Apache Hudi, Delta Lake).
  • No unnecessary data copies, leveraging compute engines directly on the data lake.
  • Strong data quality and governance features.
  • Schema management capabilities.
  • Scalability via separated storage and compute.

In the data lakehouse architecture, the technical components are decoupled and often built using open source technologies or cloud services:

  • Data storage: Cloud object stores (S3, ADLS, GCS) offering low cost and massive scalability for any data type.
  • Storage engine: Handled by services or engines that perform data management tasks such as compaction and repartitioning (e.g., Dremio Arctic, Tabular, or engines like Spark/Flink with orchestration).
  • File format: Open columnar formats like Apache Parquet for efficient reads.
  • Table format: Open metadata layers (Apache Iceberg, Hudi, Delta Lake) providing ACID transactions, schema evolution, partitioning, and time travel on top of data files in object storage.
  • Catalog: A service tracking tables and their metadata (e.g., Apache Hive Metastore, AWS Glue, Project Nessie, Dremio Arctic, Tabular).
  • Compute engine: Decoupled engines optimized for specific workloads (e.g., Dremio Sonar for SQL, Apache Spark for ML, Apache Flink for streaming) that interact with data via the table format (see the configuration sketch after this list).
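
To make these decoupled components concrete, here is a minimal sketch (not taken from the paper) of configuring a Spark compute engine to use the Iceberg table format through a Nessie catalog backed by S3. The catalog name "arctic", the Nessie endpoint, and the bucket are illustrative assumptions, and the Iceberg Spark runtime and Nessie Spark SQL extension jars are assumed to be on the classpath.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    # Enable Iceberg DDL/DML and Nessie branch/merge SQL syntax
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    # Register an Iceberg catalog named "arctic" whose table metadata is tracked by Nessie
    .config("spark.sql.catalog.arctic", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.arctic.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.arctic.uri", "https://nessie.example.com/api/v1")       # hypothetical endpoint
    .config("spark.sql.catalog.arctic.ref", "main")                                    # default branch
    .config("spark.sql.catalog.arctic.warehouse", "s3://example-lakehouse/warehouse")  # hypothetical bucket
    .getOrCreate()
)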

The paper demonstrates how a data lakehouse replicates or improves upon RDBMS-OLAP capabilities:

  • Governance and security: Tools like Apache Ranger or modern lakehouse catalogs provide fine-grained access control on the data lake.
  • High concurrency: Lakehouse platforms scale compute resources dynamically and use optimistic concurrency control via table formats for safe concurrent access.
  • Low query latency: Achieved through engine optimizations (query acceleration, caching) and table format features (partitioning, clustering, indexing, compaction).
  • Ad hoc queries: Supported by SQL-based compute engines, often with native BI tool connectivity.
  • Workload management (WLM): Achieved by using different engines for different workloads and configuring queueing/resource limits within engines.
  • Schema and physical layout evolution: Open table formats enable in-place schema changes and partition evolution without rewriting data (see the sketch after this list).
  • ACID-compliant transactions: Core feature provided by table formats, enabling reliable data modifications and transaction rollback via snapshots. Multi-statement/multi-table transactions can be supported with version control systems like Project Nessie or LakeFS.
  • Separation of storage and compute: A fundamental principle of the lakehouse architecture, enabling independent scaling.
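
To illustrate the schema evolution, partition evolution, and snapshot-based time travel capabilities above, here is a minimal sketch assuming the SparkSession and "arctic" Iceberg/Nessie catalog from the earlier sketch; the column names and the snapshot id are hypothetical.

# In-place schema evolution: add a column without rewriting existing data files
spark.sql("ALTER TABLE arctic.telco.churnquarter ADD COLUMNS (region STRING)")

# Partition evolution: change the partition spec; only newly written data uses the new layout
spark.sql("ALTER TABLE arctic.telco.churnquarter ADD PARTITION FIELD bucket(16, customer_id)")

# ACID snapshots enable time travel and rollback: list snapshots, then read an older one
spark.sql("SELECT snapshot_id, committed_at FROM arctic.telco.churnquarter.snapshots").show()
old = spark.read.option("snapshot-id", 1234567890123).table("arctic.telco.churnquarter")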

Technology-independent data warehousing practices are fully supported in a lakehouse:

  • Data modeling: Standard modeling techniques are applied to data in the lakehouse layers.
  • ELT: Becomes the preferred approach, loading raw data to the lake first for schema-on-read flexibility, then transforming it for schema-on-write benefits.
  • Data quality: MDM, referential integrity checks (via SQL engines), and SCD implementation (using table format features for row-level updates/deletes and history tracking) are all possible (a MERGE-based SCD sketch follows this list).
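
As a concrete illustration of the SCD point above, here is a minimal sketch (not from the paper) of a Type 2 style update using Iceberg's MERGE INTO support in Spark SQL. The dimension table arctic.telco.dim_customer, the staged view customer_updates, and all column names are hypothetical; a full Type 2 flow would also insert the new current row for changed keys in a follow-up statement.

# Close out the current dimension row when a tracked attribute changes,
# and insert brand-new customers as current rows
spark.sql("""
    MERGE INTO arctic.telco.dim_customer AS t
    USING customer_updates AS s
      ON t.customer_id = s.customer_id AND t.is_current = true
    WHEN MATCHED AND t.plan <> s.plan THEN
      UPDATE SET t.is_current = false, t.valid_to = s.effective_date
    WHEN NOT MATCHED THEN
      INSERT (customer_id, plan, valid_from, valid_to, is_current)
      VALUES (s.customer_id, s.plan, s.effective_date, NULL, true)
""")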

Additional value provided by the data lakehouse includes:

  • Open data architecture: Data is stored openly, preventing vendor lock-in and allowing the use of diverse, best-of-breed engines for various workloads.
  • Fewer data copies and better governance: Eliminates the need to copy data from a data lake to a separate data warehouse for BI, or export from the warehouse for ML, reducing complexity, cost, and governance challenges.
  • Manage data as code: Catalogs like Project Nessie and LakeFS enable Git-like version control for data tables, supporting isolated branches for experiments, data quality checks, and atomic merging for production rollouts, similar to blue-green deployments (a branch-and-merge sketch follows this list).
  • Federation: Compute engines can query data directly from the lakehouse and other data sources (like operational databases), enabling integrated analysis without centralizing all data.
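
The "manage data as code" workflow can be sketched with Nessie's Spark SQL extensions, assuming the "arctic" catalog from the earlier sketch; the branch name and the staged_churn view are hypothetical.

# Create an isolated branch, load and validate on it, then publish atomically (blue-green style)
spark.sql("CREATE BRANCH IF NOT EXISTS etl_oct IN arctic FROM main")
spark.sql("USE REFERENCE etl_oct IN arctic")                                   # route reads/writes to the branch
spark.sql("INSERT INTO arctic.telco.churnquarter SELECT * FROM staged_churn")  # load on the branch
# ... run data quality checks against the branch here ...
spark.sql("MERGE BRANCH etl_oct INTO main IN arctic")                          # atomic publish to main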

The paper illustrates a basic lakehouse implementation for BI and machine learning using Apache Iceberg on Amazon S3, Project Nessie as a catalog, and Dremio Sonar and Apache Spark as compute engines.

The example demonstrates:

  1. Ingesting Parquet files into S3.
  2. Creating an Apache Iceberg table on top of these files, managed by a Project Nessie catalog.
  3. Using Dremio Sonar (a SQL engine connected to Nessie/Iceberg/S3) to query the data for BI dashboarding via tools like Tableau.
  4. Using Apache Spark (which also supports Iceberg) to read the same Iceberg table directly from S3 for machine learning model training using scikit-learn.

-- Example using Dremio Sonar to create an Iceberg table
CREATE TABLE arctic.telco.churnquarter AS
SELECT * FROM "churn-bigml-20_allfeat_Oct_train_data.parquet"
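
Alternatively (a sketch not shown in the paper), the same Iceberg table could be created from Spark by reading the raw Parquet files and writing them through the catalog; the S3 path below is hypothetical.

# Create/replace the Iceberg table from raw Parquet using Spark's DataFrameWriterV2 API
raw = spark.read.parquet("s3://example-lakehouse/raw/churn-bigml-20_allfeat_Oct_train_data.parquet")
raw.writeTo("arctic.telco.churnquarter").createOrReplace()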

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

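# Assumes an active SparkSession configured (as in the earlier sketch) with the "arctic"
# Iceberg catalog, so the Nessie-tracked table on S3 can be read directly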
df_telco = spark.read.table("arctic.telco.churnquarter").toPandas()

target = df_telco.iloc[:, -1].values
features = df_telco.iloc[:, :-1].values

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=101)

rfc = RandomForestClassifier(n_estimators=600)
rfc.fit(X_train, y_train)

predictions = rfc.predict(X_test)

acc = accuracy_score(y_test, predictions)
classReport = classification_report(y_test, predictions)
confMatrix = confusion_matrix(y_test, predictions)

print('Evaluation of the trained model:')
print('Accuracy:', acc)
print('Confusion Matrix:\n', confMatrix)
print('Classification Report:\n', classReport)

This example highlights the key benefit: different engines can access the same data copy, managed by the open table format and catalog, eliminating ETL complexity and data silos.

In conclusion, the paper argues that the data lakehouse architecture, built on open standards such as Apache Iceberg and on decoupled storage and compute, supplies all the components, capabilities, and practice support that data warehousing requires, while offering a more flexible, cost-effective, and future-proof platform that handles diverse data types and advanced analytical workloads directly on a single copy of the data.