Data Lakehouse Architecture
- Data Lakehouse is a unified analytics architecture that integrates the scalability of data lakes with the transactional and governance strengths of data warehouses.
- It combines open storage, metadata catalogs, and table formats like Delta, Iceberg, and Hudi to enable efficient, interoperable data management.
- The architecture supports diverse workloads—from BI to ML—by offering ACID compliance, unified query processing, and streamlined composability for modern analytics.
A data lakehouse is a unified analytics architecture that aims to combine the flexibility and scalability of data lakes with the robust transactional and governance capabilities of data warehouses. Unlike purely schema-on-write warehouses, a data lakehouse integrates open object storage, metadata management, ACID transactions, and advanced query processing to support diverse workloads—including business intelligence (BI), machine learning (ML), and real-time and batch analytics—on both structured and unstructured data. The hallmark of the data lakehouse paradigm is the exposure of open, governed, schema-rich, and versioned data to multiple analytical engines and use cases, while maintaining interoperability and scalability across heterogeneous platforms and storage backends.
1. Key Architectural Principles and Foundation
The data lakehouse paradigm synthesizes concepts from both data lakes and data warehouses by layering transactional processing and metadata management directly atop cloud object stores or distributed file systems. The foundational architecture typically includes:
- Open Storage Layer: The use of cloud object stores (e.g., Amazon S3, Azure Blob, Google Cloud Storage) or distributed file systems (such as HDFS or Ceph) for holding raw, immutable data files in open formats (commonly Apache Parquet, ORC, or JSON) (Mazumdar et al., 2023, Liu et al., 2020, Bag, 20 Mar 2025).
- Table Format Layer: Open table formats (e.g., Apache Iceberg, Delta Lake, Apache Hudi) sit above the storage, providing logical tables, schemas, ACID transaction support, versioning, and partitioning (Mazumdar et al., 2023, Priebe et al., 2022, Eswararaj et al., 18 Aug 2025).
- Metadata Catalog: Catalogs or metastore systems manage and index schema, partitioning, snapshots, and object locations, simplifying data discovery, enforcement of governance, and schema evolution (Sawadogo et al., 2019, Hai et al., 2021, Mazumdar et al., 2023).
- Compute and Processing Engines: Multiple engines—SQL engines (e.g., Spark, Flink, Dremio Sonar, DuckDB), ML platforms (e.g., MLflow), serverless runtimes, and specialized frameworks—can run directly on the lakehouse (Tagliabue et al., 2023, Fourny et al., 2021, Baker et al., 7 Aug 2025).
- Governance and Security: Fine-grained access control, audit logging, and version control (“data as code”) mechanisms are implemented at the catalog and table format layers (Mazumdar et al., 2023, Priebe et al., 2022).
This separation of concerns—storage, table format, catalog, compute—enables separation of compute and storage, elastic scalability, and loose coupling between analytics engines and physical storage (Mazumdar et al., 2023, Priebe et al., 2022).
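To make the layering concrete, the hedged sketch below wires these pieces together with Spark and Delta Lake; the bucket path and table name are hypothetical, and the Delta-specific session options assume the delta-spark package is installed. Any of the table formats, catalogs, or engines named above could stand in at each layer.

```python
from pyspark.sql import SparkSession

# Compute layer: a Spark session configured for the Delta table format.
spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Open storage layer: an object-store prefix (hypothetical bucket) holding open-format files.
events_path = "s3a://acme-lakehouse/bronze/events"

# Table-format layer: write a Delta table (Parquet data files plus a transaction log).
df = spark.createDataFrame(
    [(1, "click", "2025-01-01"), (2, "view", "2025-01-01")],
    ["user_id", "event_type", "event_date"],
)
df.write.format("delta").mode("append").partitionBy("event_date").save(events_path)

# Metadata catalog layer: register the table so other engines sharing the catalog can discover it.
spark.sql(f"CREATE TABLE IF NOT EXISTS events USING DELTA LOCATION '{events_path}'")

# Query: any engine that understands the table format can read the same physical files.
spark.sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type").show()
```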
2. Metadata Systems: Models, Features, and Importance
Metadata systems are essential for keeping large data lakehouse environments discoverable, governable, and analyzable. MEDAL, a graph-based metadata model, organizes metadata into (Sawadogo et al., 2019):
- Intra-object metadata: Per-object attributes (e.g., file size, modification date, schema, versions, summaries, semantic tags).
- Inter-object metadata: Relationships between objects (grouping, similarity links, derivation “parenthood”).
- Global metadata: Semantic resources (ontologies, taxonomies), index structures, and usage logs.
Formally, MEDAL represents this metadata as an attributed graph whose hypernodes carry the intra-object attributes of each data object, whose directed edges and hyperedges encode the inter-object relationships (grouping, similarity, derivation), and whose graph-level resources hold the global metadata (ontologies, indexes, usage logs).
Key evaluation criteria for lakehouse metadata systems, adapted from MEDAL, are: Semantic Enrichment, Data Indexing, Link Generation (object relations), Data Polymorphism (multiple representations), Data Versioning, and Usage Tracking (Sawadogo et al., 2019). MEDAL's graph-based formalism with hypernodes and directed edges/hyperedges is well-suited for heterogeneous, evolving, and highly relational lakehouse data, supporting both ad hoc analytics and governance at scale.
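As a rough illustration (not MEDAL's actual implementation), the Python sketch below shows how the three metadata tiers and their graph structure might be represented; the class names, relation labels, and helper methods are invented for the example.

```python
from dataclasses import dataclass, field

# Intra-object metadata: per-object attributes (size, schema, versions, semantic tags).
@dataclass
class ObjectNode:
    object_id: str
    properties: dict = field(default_factory=dict)   # e.g. {"size": 1048576, "format": "parquet"}
    tags: set = field(default_factory=set)           # semantic tags

# Inter-object metadata: typed, directed relationships (grouping, similarity, derivation).
@dataclass
class Edge:
    source: str
    target: str
    relation: str   # e.g. "derived_from", "similar_to", "grouped_with"

# Global metadata: ontologies, indexes, and usage logs attached at the graph level.
@dataclass
class MetadataGraph:
    nodes: dict = field(default_factory=dict)             # object_id -> ObjectNode
    edges: list = field(default_factory=list)
    global_resources: dict = field(default_factory=dict)  # e.g. {"ontology": ..., "usage_log": [...]}

    def add_object(self, node: ObjectNode):
        self.nodes[node.object_id] = node

    def link(self, source: str, target: str, relation: str):
        self.edges.append(Edge(source, target, relation))

    def lineage(self, object_id: str):
        """Objects this one was derived from (derivation 'parenthood')."""
        return [e.source for e in self.edges
                if e.target == object_id and e.relation == "derived_from"]
```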
3. Transactional Management and Data Consistency
Lakehouse architectures satisfy full ACID transaction semantics on open file formats using log- or manifest-based protocols (Mazumdar et al., 2023, Priebe et al., 2022, Götz et al., 29 Apr 2025). Advanced systems like LakeVilla extend these capabilities to multi-table and multi-query transactions, ensuring:
- Atomic reservation and commit across many tables using marker files in metadata.
- Isolation and concurrency control via dependency graphs and marker-shifting to avoid deadlocks, and a global version log to enforce serializability or even linearizability.
- Sublog-based recovery to provide undo/redo operations, preventing non-repeatable reads or stale data.
- Minimal performance overhead, as low as 2–2.5% even on write-heavy or read-heavy benchmarks (Götz et al., 29 Apr 2025), achieved through modular, non-invasive integration.
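The sketch below illustrates, in file-only terms, the marker-file idea behind multi-table atomicity listed above. It is a simplified illustration of the mechanism, not LakeVilla's code; the layout of marker and snapshot files is invented.

```python
import json
import time
import uuid
from pathlib import Path

def multi_table_commit(tables: dict, metadata_root: str) -> str:
    """Atomically publish new snapshots for several tables via a marker file.

    `tables` maps table names to the snapshot metadata to publish (illustrative).
    The commit point is a single atomic rename of the transaction marker, so
    readers either see all staged snapshots or none of them.
    """
    txn_id = uuid.uuid4().hex
    txn_dir = Path(metadata_root) / "_txn"
    txn_dir.mkdir(parents=True, exist_ok=True)

    # 1. Reserve: record the transaction and its participants as pending.
    pending = txn_dir / f"{txn_id}.pending.json"
    pending.write_text(json.dumps({"tables": sorted(tables), "ts": time.time()}))

    # 2. Stage: write per-table snapshot metadata that references the transaction.
    #    Readers skip snapshots whose transaction has not yet committed.
    for name, snapshot in tables.items():
        table_dir = Path(metadata_root) / name
        table_dir.mkdir(parents=True, exist_ok=True)
        (table_dir / f"snapshot-{txn_id}.json").write_text(
            json.dumps({"txn": txn_id, **snapshot}))

    # 3. Commit: one atomic rename flips the marker; all staged snapshots become
    #    visible together. A recovery pass that finds only a .pending marker can
    #    roll the staged snapshots back (undo) or forward (redo) from its contents.
    pending.replace(txn_dir / f"{txn_id}.committed.json")
    return txn_id
```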
Traditional data lakes lacked such transactional robustness, leading to data inconsistencies when faced with concurrent, long-running analytical or update workloads. Modern lakehouses address this foundational requirement for mission-critical analytics and governed, multi-tenant data platforms.
4. Data Modeling, Partitioning, and Table Format Diversity
Three open table formats dominate contemporary lakehouse implementations—Delta Lake, Apache Iceberg, and Apache Hudi—each with specific strengths and modeling strategies (Eswararaj et al., 18 Aug 2025):

| Format | Strengths | Notable Mechanisms |
|---|---|---|
| Delta Lake | ACID compliance, strong ML pipeline sync, governance | Append-only delta log, time travel, strict schema enforcement |
| Iceberg | Batch query performance, engine-agnostic, cloud-native | Manifest files, snapshot isolation, hidden partitioning |
| Hudi | Real-time ingestion, incremental processing, CDC | Merge-on-Read (MOR), Copy-on-Write (COW), timeline-based commit model |
Modeling strategies include time-based partitioning, hidden/automatic partition abstraction (Iceberg), and partition evolution. The formats differ in update semantics: Hudi supports incremental ingest and fast upserts, Delta Lake emphasizes batch analytics and stateful versioning, and Iceberg is preferred for high-throughput, low-latency cloud-native batch jobs.
Data consistency is maintained via explicit logs (Delta), versioned manifests (Iceberg), or timeline markers (Hudi). Each format integrates time travel, schema evolution, and CDC, supporting diverse analytics, ML, and regulatory workloads.
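As a small illustration of how these mechanisms surface to users, the hedged sketch below uses Delta Lake's Spark options for time travel, schema evolution, and change data capture; Iceberg and Hudi expose analogous but differently named mechanisms. The table path reuses the hypothetical bucket from the earlier sketch, and the session is assumed to be configured with the Delta extensions shown there.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()          # assumes delta-spark is configured as above
events_path = "s3a://acme-lakehouse/bronze/events"  # hypothetical table from the earlier sketch

# Time travel: read the table as of an earlier version recorded in the delta log.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(events_path)

# Schema evolution: append a new column and let the log record the widened schema.
(spark.read.format("delta").load(events_path)
    .withColumn("ingest_source", lit("mobile"))
    .write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(events_path))

# CDC-style incremental reads (Delta's change data feed; requires the table
# property delta.enableChangeDataFeed to be set). Iceberg incremental scans and
# Hudi incremental queries play the equivalent role in those formats.
changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .load(events_path))
```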
5. Integration of Structured, Unstructured, and ML-Ready Data
Modern lakehouses are designed for mixed workloads—querying structured, semi-structured, and unstructured data, including high-dimensional tensors and images (Hambardzumyan et al., 2022, Bao et al., 3 May 2024, Kienzler et al., 2023). Architectures such as Deep Lake and Delta Tensor introduce:
- Native tensor storage: Multidimensional/chunked array storage (e.g., FTSF, Zarr), tensor-aware indexing, support for ragged/heterogeneous data, and efficient binary representations.
- Efficient streaming: Block-level, chunked data access (with HTTP range reads, index maps, hierarchical statistical indices) to bring data directly to GPU memory (Kienzler et al., 2023).
- Integration with ML frameworks: Seamless APIs for PyTorch, TensorFlow, or native DataFrame conversion, with dataset factories linked to relational queries and transformations.
- Version control and data lineage: Commit, checkout, diff, and merge for dataset provenance and reproducible ML/analytics pipelines.
In ML-centric lakehouses, tensor slicing, feature extraction, and high-throughput streaming directly to compute resources are first-class capabilities, minimizing serialization overhead and maximizing GPU/accelerator utilization.
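The following is a minimal sketch of chunked tensor storage feeding an ML framework, using Zarr as a stand-in for the tensor-native formats discussed above; it is not the Deep Lake or Delta Tensor API, and the file name, shapes, and toy data are invented.

```python
import numpy as np
import torch
import zarr
from torch.utils.data import DataLoader, Dataset

# Chunked array storage: each chunk can be fetched independently, locally or via
# ranged reads against an object store exposed through a Zarr store.
store = zarr.open("images.zarr", mode="w", shape=(10_000, 3, 224, 224),
                  chunks=(64, 3, 224, 224), dtype="float32")
store[:128] = np.random.rand(128, 3, 224, 224).astype("float32")  # toy data

class ChunkedTensorDataset(Dataset):
    """Reads samples lazily from the chunked array, so only touched chunks are loaded."""
    def __init__(self, array):
        self.array = array

    def __len__(self):
        return self.array.shape[0]

    def __getitem__(self, idx):
        return torch.from_numpy(self.array[idx])

loader = DataLoader(ChunkedTensorDataset(store), batch_size=32)
batch = next(iter(loader))   # a (32, 3, 224, 224) tensor streamed chunk by chunk
```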
6. Unified Query and Workflow Model, Composability, and FaaS
The lakehouse programming model increasingly trends toward composability and declarative orchestration. Platforms such as Bauplan integrate:
- Unified function-as-a-service (FaaS) runtimes: Every workload—SQL query, Python transformation, ML pipeline—is expressed as a function, underpinned by cloud-native, serverless containers (Tagliabue et al., 2023, Srivastava et al., 19 May 2025).
- Composability: Modular DAG construction from declarative code (SQL, Python), with implicit artifact dependency extraction and logical/physical plan separation.
- Optimized scheduling: Systems like Eudoxia simulate task scheduling, resource allocation, and preemption policies, allowing evaluation and tuning of scheduling algorithms before real cloud deployment (Srivastava et al., 19 May 2025).
- Reproducibility and versioned pipelines: Data pipelines are modeled as functions over immutable, time-traveled data artifacts, with Git-like versioning and branching semantics (e.g., using Nessie) (Tagliabue et al., 21 Apr 2024).
This abstraction enables reproducible, low-latency, scalable workflows, integrating ML, batch analytics, and BI in a single substrate. Developers benefit from ergonomic, CLI-driven environments that mirror “data as code” and modern DevOps practices.
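The toy sketch below captures the flavor of this model: each step is a plain function, dependencies are declared rather than orchestrated imperatively, and a tiny resolver materializes the DAG. It is a generic illustration under these assumptions, not Bauplan's or Eudoxia's API; the decorator, registry, and function names are invented.

```python
# A toy registry mapping each declared step to the upstream artifacts it needs.
PIPELINE = {}

def node(*deps):
    """Declare a pipeline step and the named artifacts it depends on."""
    def register(fn):
        PIPELINE[fn.__name__] = (fn, deps)
        return fn
    return register

@node()
def raw_orders():
    # In a real lakehouse this would be a scan of a governed, versioned table.
    return [{"order_id": 1, "amount": 40.0}, {"order_id": 2, "amount": 15.5}]

@node("raw_orders")
def large_orders(raw_orders):
    return [o for o in raw_orders if o["amount"] > 20]

@node("large_orders")
def revenue_report(large_orders):
    return {"n_orders": len(large_orders), "revenue": sum(o["amount"] for o in large_orders)}

def run(target, cache=None):
    """Resolve dependencies recursively, materializing each artifact exactly once."""
    cache = {} if cache is None else cache
    if target not in cache:
        fn, deps = PIPELINE[target]
        cache[target] = fn(*(run(d, cache) for d in deps))
    return cache[target]

print(run("revenue_report"))   # {'n_orders': 1, 'revenue': 40.0}
```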
7. Interoperability, Ecosystem Integration, and Future Directions
Given ongoing evolution in table formats and analytic engines, seamless interoperability is critical. XTable exemplifies such functionality (Agrawal et al., 17 Jan 2024):
- Omni-directional metadata translation: Portable, incremental, low-latency conversion among table formats (Delta, Iceberg, Hudi) at the metadata layer, with universal internal representation and commit-by-commit update efficiency.
- No data duplication: Physical data files remain untouched; only lightweight metadata translation enables engine and format diversity.
- Scenarios: Multi-format import/export, cross-team analytics, and optimization of engine-specific query performance without fragmentation or copy proliferation.
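Conceptually, such translation folds the source format's commit history into a neutral snapshot description and re-emits it in the target format's metadata dialect, leaving the Parquet files untouched. The toy sketch below illustrates that idea; the field names and the simplified log structure are invented and do not reflect XTable's internal representation.

```python
from dataclasses import dataclass

@dataclass
class CommonCommit:
    """Format-neutral view of one table state: which physical files are live."""
    snapshot_id: int
    schema: dict          # column name -> type
    partition_spec: list  # partition column names
    data_files: list      # paths of Parquet files; these are never rewritten

def from_delta_like_log(log_entries: list, schema: dict, partition_spec: list) -> CommonCommit:
    """Fold a simplified Delta-style log of add/remove actions into one snapshot."""
    live = set()
    for entry in log_entries:
        if "add" in entry:
            live.add(entry["add"]["path"])
        if "remove" in entry:
            live.discard(entry["remove"]["path"])
    return CommonCommit(len(log_entries), schema, partition_spec, sorted(live))

def to_iceberg_like_metadata(commit: CommonCommit) -> dict:
    """Re-emit the same snapshot as an Iceberg-flavoured manifest listing the same files."""
    return {
        "snapshot-id": commit.snapshot_id,
        "schema": commit.schema,
        "partition-spec": commit.partition_spec,
        "manifest": [{"file_path": p, "content": "data"} for p in commit.data_files],
    }

log = [{"add": {"path": "part-0000.parquet"}},
       {"add": {"path": "part-0001.parquet"}},
       {"remove": {"path": "part-0000.parquet"}}]
commit = from_delta_like_log(log, {"user_id": "long"}, ["event_date"])
print(to_iceberg_like_metadata(commit)["manifest"])  # only part-0001.parquet remains live
```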
Lakehouse architectures increasingly intersect with cloud-native platforms (e.g., Kubernetes, Ceph, Airflow), bring machine learning to the data (Snowpark), and enable modular, secure, and reproducible research in domains from automotive to blockchain to the life sciences (Baker et al., 7 Aug 2025, Bag, 20 Mar 2025, Vargas-Solar et al., 29 Mar 2024). Open challenges remain, including schema evolution under high-velocity ingest, rich metadata enrichment, cross-domain security policy, and efficient ML deployment at lakehouse scale (Hai et al., 2021).
In conclusion, the data lakehouse is a convergent and extensible data architecture that provides robust, governed, and scalable analytics capabilities on heterogeneous data with an open, declaratively composable, and strongly interoperable foundation. It leverages open storage, flexible metadata modeling, multi-table transactional processing, modular workflow orchestration, and unified ML integration as core design elements—addressing many longstanding challenges of both classical data lakes and warehouses, while enabling next-generation analytics and AI at enterprise and cloud-native scale.