The Data Lakehouse: Data Warehousing and More

Published 12 Oct 2023 in cs.DB | (2310.08697v1)

Abstract: Relational Database Management Systems designed for Online Analytical Processing (RDBMS-OLAP) have been foundational to democratizing data and enabling analytical use cases such as business intelligence and reporting for many years. However, RDBMS-OLAP systems present some well-known challenges. They are primarily optimized only for relational workloads, lead to proliferation of data copies which can become unmanageable, and since the data is stored in proprietary formats, it can lead to vendor lock-in, restricting access to engines, tools, and capabilities beyond what the vendor offers. As the demand for data-driven decision making surges, the need for a more robust data architecture to address these challenges becomes ever more critical. Cloud data lakes have addressed some of the shortcomings of RDBMS-OLAP systems, but they present their own set of challenges. More recently, organizations have often followed a two-tier architectural approach to take advantage of both these platforms, leveraging both cloud data lakes and RDBMS-OLAP systems. However, this approach brings additional challenges, complexities, and overhead. This paper discusses how a data lakehouse, a new architectural approach, achieves the same benefits of an RDBMS-OLAP and cloud data lake combined, while also providing additional advantages. We take today's data warehousing and break it down into implementation independent components, capabilities, and practices. We then take these aspects and show how a lakehouse architecture satisfies them. Then, we go a step further and discuss what additional capabilities and benefits a lakehouse architecture provides over an RDBMS-OLAP.


Summary

  • The paper introduces a unified data lakehouse architecture that combines ACID transactions with open data formats to overcome traditional data warehousing and cloud lake limitations.
  • It details a modular system using cloud storage, Apache Parquet, and Iceberg to enable efficient data governance and streamlined analytics workflows.
  • The paper demonstrates that implementing a data lakehouse minimizes data duplication and vendor lock-in while supporting concurrent BI and ML applications.

The Data Lakehouse: An Architectural Perspective

The data lakehouse is introduced to address the limitations of traditional RDBMS-OLAP systems and cloud data lakes. It aims to unify the two architectures, combining their advantages while minimizing their drawbacks. This essay explores the technical facets and benefits of the data lakehouse, covering its components, capabilities, and practical implications.

Evolution of Data Architectures

Data architectures have evolved rapidly as the need for efficient data management and analysis has grown. Traditional RDBMS-OLAP systems are optimized for relational workloads and face challenges such as proliferating data copies, proprietary formats, and vendor lock-in. Cloud data lakes provide scalability and flexibility but lack transactional support. In response, organizations often adopt a two-tier architecture that uses both systems, which adds complexity and overhead.

Figure 1: What is Data Warehousing?

Fundamentals of Data Lakehouse

A data lakehouse represents a convergence of data warehousing capabilities with the flexibility and scalability of a data lake. It achieves the following:

  • Transaction Support: Ensures ACID compliance similar to RDBMS-OLAP systems.
  • Open Data Format: Uses open formats such as Apache Parquet (file format) and Apache Iceberg (table format), so that diverse analytical engines can access the data without being tied to a proprietary format.
  • No Data Copy: Reduces the number of data copies by allowing direct access to raw data.
  • Governance: Supports robust data governance and regulatory compliance.
  • Schema Management: Supports schema evolution while keeping the stored data consistent.
  • Scalability: Separates compute from storage so that each layer can scale independently.

    Figure 2: A data lakehouse architecture with the various components.
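
The transaction support listed above is commonly built on optimistic concurrency control: a writer prepares a new table snapshot, then commits by atomically swapping the catalog's pointer only if no other writer has committed in the meantime. The following is a minimal sketch of that idea, not the mechanism of any particular table format. The names `ToyCatalog`, `CommitConflict`, and `commit_with_retry` are hypothetical, and snapshots are reduced to integer ids.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when another writer committed before this one."""


@dataclass
class ToyCatalog:
    # Hypothetical structure: table name -> id of the current snapshot.
    current: dict = field(default_factory=dict)

    def load(self, table: str) -> int:
        # Tables that were never committed start at snapshot 0.
        return self.current.get(table, 0)

    def compare_and_swap(self, table: str, expected: int, new: int) -> None:
        # The swap succeeds only if the pointer still equals `expected`,
        # i.e. nobody committed since this writer read the table.
        if self.load(table) != expected:
            raise CommitConflict(f"{table}: snapshot {expected} is stale")
        self.current[table] = new


def commit_with_retry(catalog: ToyCatalog, table: str, max_attempts: int = 3) -> int:
    """Re-read the latest snapshot and retry when a commit loses the race."""
    for _ in range(max_attempts):
        base = catalog.load(table)
        try:
            catalog.compare_and_swap(table, base, base + 1)
            return base + 1
        except CommitConflict:
            continue
    raise CommitConflict(f"{table}: gave up after {max_attempts} attempts")
```

A writer holding a stale snapshot id is rejected rather than silently overwriting the other writer's work, which is what lets several engines write to the same table safely.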

Technical Components of the Lakehouse

The architecture of a data lakehouse is modular and comprises multiple components:

  • Data Storage: Utilizes cloud object stores for cost-effective, scalable data storage.
  • File Formats: Employs columnar open formats such as Apache Parquet.
  • Table Formats: Provides metadata layers for efficient data organization and transactional capabilities via Apache Iceberg or similar formats.
  • Catalogs: Maintains a registry of metadata to support efficient search and data access.
  • Compute Engines: Supports diverse workloads, from real-time dashboards to machine learning applications, using massively parallel processing (MPP) architectures.

    Figure 3: An example data lakehouse implementation.
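
To make the role of the table-format layer more concrete, the sketch below models a snapshot as a list of data files with per-file value ranges, and shows how an engine could use that metadata to skip files that cannot match a filter. It is loosely inspired by the per-file statistics that table formats such as Apache Iceberg record, but the classes `DataFile` and `Snapshot`, the function `plan_scan`, the `order_date` column, and the `s3://example-bucket` paths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str            # location of a Parquet file in the object store
    min_order_date: str  # smallest order_date in the file (ISO 8601 string)
    max_order_date: str  # largest order_date in the file (ISO 8601 string)


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # the DataFile entries that make up the table at this version


def plan_scan(snapshot: Snapshot, start: str, end: str) -> list:
    """Return paths of files whose date range overlaps [start, end].

    ISO 8601 dates compare correctly as plain strings, so no parsing is needed.
    """
    return [
        f.path
        for f in snapshot.files
        if f.max_order_date >= start and f.min_order_date <= end
    ]


snapshot = Snapshot(
    snapshot_id=1,
    files=(
        DataFile("s3://example-bucket/orders/jan.parquet", "2023-01-01", "2023-01-31"),
        DataFile("s3://example-bucket/orders/feb.parquet", "2023-02-01", "2023-02-28"),
        DataFile("s3://example-bucket/orders/mar.parquet", "2023-03-01", "2023-03-31"),
    ),
)
```

With this layout, a query restricted to mid-February only needs to open the February file; the other two are pruned using metadata alone, without reading any data from the object store.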

Implementing a Data Lakehouse

Implementing a data lakehouse involves choosing and configuring components based on specific requirements. An example setup includes:

  • A cloud object store for data files.
  • Apache Iceberg for table format management.
  • Project Nessie for cataloging.
  • Dremio Sonar or Apache Spark for compute engine tasks.

This architecture supports concurrent execution of analytical workflows without data duplication, enabling streamlined processing from BI to ML applications.

Figure 4: A dashboard built on top of an Apache Iceberg table.
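
As a rough illustration of how the example components could be wired together, the sketch below assembles the Spark settings that register an Iceberg catalog backed by Nessie. The property names reflect the Iceberg and Nessie Spark integration as commonly documented and should be checked against the versions in use; the helper `lakehouse_spark_conf`, the catalog name, the Nessie URI, and the warehouse path are hypothetical placeholders.

```python
def lakehouse_spark_conf(catalog_name: str, nessie_uri: str,
                         warehouse: str, ref: str = "main") -> dict:
    """Build Spark properties for an Iceberg catalog served by Nessie.

    Only a plain dict is returned, so this runs without Spark installed.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg's SQL extensions in the Spark session.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the catalog name as an Iceberg catalog.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Delegates catalog operations to Nessie.
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": nessie_uri,       # Nessie server endpoint (placeholder)
        f"{prefix}.ref": ref,              # Nessie branch to read and write
        f"{prefix}.warehouse": warehouse,  # object-store location for table data
    }


conf = lakehouse_spark_conf(
    catalog_name="lake",
    nessie_uri="http://localhost:19120/api/v1",
    warehouse="s3a://example-bucket/warehouse",
)
```

In a PySpark application, each key and value could then be passed to `SparkSession.builder.config(key, value)` before the session is created, after which tables would be addressed as `lake.<namespace>.<table>`. The required Iceberg and Nessie runtime packages are omitted here because their coordinates depend on the Spark and Scala versions.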

Advantages of Data Lakehousing

The data lakehouse provides several critical benefits, including but not limited to:

  • Future-Proof Architecture: Supports continuous integration with emerging technologies and analytics engines.
  • Vendor Independence: Ensures flexible interaction with various software providers without locking data.
  • Minimal Overhead: Reduces the need for large-scale data movements and complex ETL pipelines.
  • Enhanced Data Management: Allows for version control, governance, and management of data as code.
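
The "data as code" idea can be pictured as Git-style branching over table versions: changes are made on an isolated branch and become visible to readers of `main` only after a merge. The toy model below is a sketch under simplifying assumptions and is not the API of Project Nessie or any other catalog. The class `ToyVersionedCatalog` is hypothetical, snapshots are plain integers, and the merge simply overwrites the target without conflict detection.

```python
class ToyVersionedCatalog:
    """Each branch maps table names to the snapshot id visible on that branch."""

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        # A new branch starts as a copy of the source branch's table pointers;
        # no data files are copied.
        self.branches[name] = dict(self.branches[from_branch])

    def commit(self, branch: str, table: str, snapshot_id: int) -> None:
        # Point the table at a new snapshot on this branch only.
        self.branches[branch][table] = snapshot_id

    def merge(self, source: str, target: str = "main") -> None:
        # Naive merge: the source branch's pointers overwrite the target's.
        self.branches[target].update(self.branches[source])


catalog = ToyVersionedCatalog()
catalog.commit("main", "sales", 1)
catalog.create_branch("etl")
catalog.commit("etl", "sales", 2)  # readers of main still see snapshot 1
catalog.merge("etl")               # main now points at snapshot 2
```

Because a branch only copies pointers, an ETL job can be validated in isolation and published in one step, similar to a blue-green deployment.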

Conclusion

The data lakehouse model combines the strengths of data warehousing and cloud data lakes in a unified, scalable, and flexible architecture. It reduces the complexity of two-tier systems and lets multiple engines process the same data efficiently. By adopting open data formats and modular components, organizations can optimize analytics workflows across diverse applications, improving both data accessibility and governance. The data lakehouse represents a strategic evolution in how modern enterprises manage and use their data.
