Data Lakehouse Architecture

Updated 20 August 2025
  • Data Lakehouse is a unified analytics architecture that integrates the scalability of data lakes with the transactional and governance strengths of data warehouses.
  • It combines open storage, metadata catalogs, and table formats like Delta, Iceberg, and Hudi to enable efficient, interoperable data management.
  • The architecture supports diverse workloads—from BI to ML—by offering ACID compliance, unified query processing, and streamlined composability for modern analytics.

A data lakehouse is a unified analytics architecture that aims to combine the flexibility and scalability of data lakes with the robust transactional and governance capabilities of data warehouses. Unlike purely schema-on-write warehouses, a data lakehouse integrates open object storage, metadata management, ACID transactions, and advanced query processing to support diverse workloads, including business intelligence (BI), machine learning (ML), and real-time and batch analytics, on both structured and unstructured data. The hallmark of the data lakehouse paradigm is the exposure of open, governed, schema-rich, and versioned data to multiple analytical engines and use cases, while maintaining interoperability and scalability across heterogeneous platforms and storage backends.

1. Key Architectural Principles and Foundation

The data lakehouse paradigm synthesizes concepts from both data lakes and data warehouses by layering transactional processing and metadata management directly atop cloud object stores or distributed file systems. The foundational architecture typically includes:

  • An open object storage layer or distributed file system holding data in open file formats such as Parquet.
  • An open table format (Delta, Iceberg, or Hudi) that adds transactions, schema enforcement, and versioning on top of those files.
  • A metadata catalog that tracks tables, schemas, and snapshots for discovery and governance.
  • Decoupled compute engines for SQL, batch, streaming, and ML workloads that access the same tables through the catalog and table format.

This separation of concerns among storage, table format, catalog, and compute enables independent scaling of compute and storage, elastic scalability, and loose coupling between analytics engines and physical storage (Mazumdar et al., 2023, Priebe et al., 2022).
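
As a minimal, hypothetical sketch of this layering (interface and method names are invented for illustration and do not correspond to any particular library), the four layers can be modeled as independent Python interfaces, with the compute engine coupled to storage only through the catalog and table format:

    from dataclasses import dataclass
    from typing import Optional, Protocol

    class ObjectStore(Protocol):          # storage layer (e.g., an S3-style object store)
        def read(self, path: str) -> bytes: ...
        def write(self, path: str, data: bytes) -> None: ...

    class TableFormat(Protocol):          # open table format (Delta/Iceberg/Hudi-like)
        def scan(self, table_location: str, snapshot: Optional[int] = None) -> list: ...
        def commit(self, table_location: str, added_files: list) -> int: ...

    class Catalog(Protocol):              # metadata catalog: table name -> metadata location
        def resolve(self, name: str) -> str: ...

    @dataclass
    class Engine:                         # any compute engine, loosely coupled to the layers below
        store: ObjectStore
        fmt: TableFormat
        catalog: Catalog

        def query(self, table_name: str) -> list:
            location = self.catalog.resolve(table_name)        # catalog lookup
            files = self.fmt.scan(location)                    # table format lists data files
            return [self.store.read(f) for f in files]         # storage is read independently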

2. Metadata Systems: Models, Features, and Importance

Metadata systems are essential for keeping large data lakehouse environments discoverable, governable, and analyzable. MEDAL, a graph-based metadata model, organizes metadata into (Sawadogo et al., 2019):

  • Intra-object metadata: Per-object attributes (e.g., file size, modification date, schema, versions, summaries, semantic tags).
  • Inter-object metadata: Relationships between objects (grouping, similarity links, derivation “parenthood”).
  • Global metadata: Semantic resources (ontologies, taxonomies), index structures, and usage logs.

The formal structure can be expressed as:

\text{DL} = \langle \mathcal{D}, \mathcal{M} \rangle, \quad \mathcal{M} = \langle \mathcal{M}_{\text{intra}}, \mathcal{M}_{\text{inter}}, \mathcal{M}_{\text{glob}} \rangle

where \mathcal{D} denotes the set of data objects and \mathcal{M} the associated metadata.

Key evaluation criteria for lakehouse metadata systems, adapted from MEDAL, are: Semantic Enrichment, Data Indexing, Link Generation (object relations), Data Polymorphism (multiple representations), Data Versioning, and Usage Tracking (Sawadogo et al., 2019). MEDAL’s graph-based formalism with hypernodes and directed edges and hyperedges is well-suited for heterogeneous, evolving, and highly relational lakehouse data, supporting both ad hoc analytics and governance at scale.
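
A minimal, hypothetical Python sketch of this model (class names invented for illustration, not MEDAL's implementation) represents each data object with its intra-object metadata, inter-object relationships as typed hyperedges, and global metadata as shared resources:

    from dataclasses import dataclass, field

    @dataclass
    class DataObject:                          # node carrying intra-object metadata
        object_id: str
        intra: dict = field(default_factory=dict)        # size, schema, version, tags, ...

    @dataclass
    class Relation:                            # inter-object metadata (grouping, similarity, derivation)
        kind: str
        source: str
        targets: list                          # hyperedge: one source, possibly many targets

    @dataclass
    class MetadataGraph:
        objects: dict = field(default_factory=dict)
        relations: list = field(default_factory=list)
        global_meta: dict = field(default_factory=dict)  # ontologies, indexes, usage logs

        def add_object(self, obj: DataObject) -> None:
            self.objects[obj.object_id] = obj

        def link(self, kind: str, source: str, *targets: str) -> None:
            self.relations.append(Relation(kind, source, list(targets)))

    # Usage: register two datasets and record that one is derived from the other.
    g = MetadataGraph()
    g.add_object(DataObject("sales_raw", intra={"format": "parquet", "version": 3}))
    g.add_object(DataObject("sales_clean", intra={"format": "parquet", "version": 1}))
    g.link("derivation", "sales_raw", "sales_clean")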

3. Transactional Management and Data Consistency

Lakehouse architectures satisfy full ACID transaction semantics on open file formats using log- or manifest-based protocols (Mazumdar et al., 2023, Priebe et al., 2022, Götz et al., 29 Apr 2025). Advanced systems like LakeVilla extend these capabilities to multi-table and multi-query transactions, ensuring:

  • Atomic reservation and commit across many tables using marker files in metadata.
  • Isolation and concurrency control via dependency graphs and marker-shifting to avoid deadlocks, and a global version log to enforce serializability or even linearizability.
  • Sublog-based recovery to provide undo/redo operations, preventing non-repeatable reads or stale data.
  • Minimal performance overhead, as low as 2–2.5% even on write-heavy or read-heavy benchmarks (Götz et al., 29 Apr 2025), achieved through modular and non-invasive integration.

Traditional data lakes lacked such transactional robustness, leading to data inconsistencies when faced with concurrent, long-running analytical or update workloads. Modern lakehouses address this foundational requirement for mission-critical analytics and governed, multi-tenant data platforms.
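
The core primitive underlying such log- or manifest-based protocols can be illustrated with a deliberately simplified, hypothetical Python sketch: writers race to create the next numbered log entry, and a loser must re-read the latest version and retry. Production systems layer multi-table coordination, marker files, isolation levels, and recovery on top of this idea, and object stores require conditional writes rather than the local-file semantics used here.

    import json, os, uuid

    class VersionLog:
        """Toy global version log: a commit atomically creates the next numbered entry."""
        def __init__(self, root: str):
            self.root = root
            os.makedirs(root, exist_ok=True)

        def latest_version(self) -> int:
            entries = [int(f.split(".")[0]) for f in os.listdir(self.root) if f.endswith(".json")]
            return max(entries, default=0)

        def commit(self, expected_version: int, actions: dict) -> int:
            """Optimistic commit: mode 'x' fails if the entry already exists, so a
            concurrent writer surfaces as a conflict instead of silently clobbering."""
            new_version = expected_version + 1
            path = os.path.join(self.root, f"{new_version:020d}.json")
            try:
                with open(path, "x") as f:
                    json.dump(actions, f)
            except FileExistsError:
                raise RuntimeError("write conflict: re-read the latest version and retry")
            return new_version

    # Usage: read the current version, then attempt to commit newly written data files.
    log = VersionLog("/tmp/demo_table/_log")
    v = log.latest_version()
    log.commit(v, {"add": [f"part-{uuid.uuid4().hex}.parquet"], "txn": "demo"})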

4. Data Modeling, Partitioning, and Table Format Diversity

Three open table formats, Delta Lake, Apache Iceberg, and Apache Hudi, dominate contemporary lakehouse implementations, each with specific strengths and modeling strategies (Eswararaj et al., 18 Aug 2025):

  • Delta Lake: ACID compliance, strong ML pipeline synchronization, and governance, built on an append-only delta log, time travel, and strict schema enforcement.
  • Apache Iceberg: batch query performance with an engine-agnostic, cloud-native design, built on manifest files, snapshot isolation, and hidden partitioning.
  • Apache Hudi: real-time ingestion, incremental processing, and change data capture (CDC), built on Merge-on-Read (MOR) and Copy-on-Write (COW) storage and a timeline-based commit model.

Modeling strategies include time-based partitioning, hidden/automatic partition abstraction (Iceberg), and use of partition evolution. Formats differ in update semantics: Hudi supports incremental ingest and fast upserts, Delta emphasizes batch analytics and stateful versioning, while Iceberg is preferred for high throughput/low-latency cloud-native batch jobs.

Data consistency is maintained via explicit logs (Delta), versioned manifests (Iceberg), or timeline markers (Hudi). Each format integrates time travel, schema evolution, and CDC, supporting diverse analytics, ML, and regulatory workloads.
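
For example, time travel is typically exposed as a read option in engine APIs. The following PySpark sketch assumes a Spark session already configured with the Delta Lake and Iceberg connectors; the paths, catalog, and snapshot identifiers are illustrative placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

    # Delta Lake: read the table as of an earlier version recorded in its delta log.
    events_v3 = (
        spark.read.format("delta")
        .option("versionAsOf", 3)                     # explicit log version
        .load("s3://demo-bucket/events_delta")
    )

    # Apache Iceberg: read a specific snapshot tracked by the table's manifests.
    orders_snapshot = (
        spark.read.format("iceberg")
        .option("snapshot-id", 1234567890123456789)   # snapshot id recorded in metadata
        .load("demo_catalog.sales.orders")
    )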

5. Integration of Structured, Unstructured, and ML-Ready Data

Modern lakehouses are designed for mixed workloads—querying structured, semi-structured, and unstructured data, including high-dimensional tensors and images (Hambardzumyan et al., 2022, Bao et al., 3 May 2024, Kienzler et al., 2023). Architectures such as Deep Lake and Delta Tensor introduce:

  • Native tensor storage: Multidimensional/chunked array storage (FTSF, ZARR), tensor-aware indexing, support for ragged/heterogeneous data, and efficient binary representations.
  • Efficient streaming: Block-level, chunked data access (with HTTP range reads, index maps, hierarchical statistical indices) to bring data directly to GPU memory (Kienzler et al., 2023).
  • Integration with ML frameworks: Seamless APIs for PyTorch, TensorFlow, or native DataFrame conversion, with dataset factories linked to relational queries and transformations.
  • Version control and data lineage: Commit, checkout, diff, and merge for dataset provenance and reproducible ML/analytics pipelines.

In ML-centric lakehouses, tensor slicing, feature extraction, and high-throughput streaming directly to compute resources are first-class capabilities, minimizing serialization overhead and maximizing GPU/accelerator utilization.
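
The chunking-plus-index pattern behind such tensor-native storage can be sketched in a few lines of Python; this toy example (not Deep Lake's or Delta Tensor's actual format) splits a tensor into fixed-size chunks along its first axis and keeps an index map so a reader fetches only the byte ranges a slice needs:

    import numpy as np

    class ChunkedTensorStore:
        """Toy chunked tensor store: chunks stand in for object-store keys that a real
        system would fetch via HTTP range reads before handing data to the GPU."""
        def __init__(self, chunk_rows: int = 1024):
            self.chunk_rows = chunk_rows
            self.chunks = {}     # chunk id -> raw bytes
            self.index = {}      # chunk id -> (shape, dtype)

        def write(self, tensor: np.ndarray) -> None:
            for cid, start in enumerate(range(0, tensor.shape[0], self.chunk_rows)):
                part = np.ascontiguousarray(tensor[start:start + self.chunk_rows])
                self.chunks[cid] = part.tobytes()
                self.index[cid] = (part.shape, str(part.dtype))

        def read_rows(self, start: int, stop: int) -> np.ndarray:
            """Fetch only the chunks overlapping [start, stop), then slice locally."""
            first, last = start // self.chunk_rows, (stop - 1) // self.chunk_rows
            parts = []
            for cid in range(first, last + 1):
                shape, dtype = self.index[cid]
                parts.append(np.frombuffer(self.chunks[cid], dtype=dtype).reshape(shape))
            block = np.concatenate(parts, axis=0)
            offset = first * self.chunk_rows
            return block[start - offset: stop - offset]

    # Usage: store a (10_000 x 64) feature tensor and stream a slice of rows.
    store = ChunkedTensorStore(chunk_rows=1024)
    store.write(np.random.rand(10_000, 64).astype("float32"))
    batch = store.read_rows(2_000, 2_512)          # touches only the chunks it needs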

6. Unified Query and Workflow Model, Composability, and FaaS

The lakehouse programming model increasingly trends toward composability and declarative orchestration. Platforms such as Bauplan integrate:

  • Unified function-as-a-service (FaaS) runtimes: Every workload—SQL query, Python transformation, ML pipeline—is expressed as a function, underpinned by cloud-native, serverless containers (Tagliabue et al., 2023, Srivastava et al., 19 May 2025).
  • Composability: Modular DAG construction from declarative code (SQL, Python), with implicit artifact dependency extraction and logical/physical plan separation.
  • Optimized scheduling: Systems like Eudoxia simulate task scheduling, resource allocation, and preemption policies, allowing evaluation and tuning of scheduling algorithms before real cloud deployment (Srivastava et al., 19 May 2025).
  • Reproducibility and versioned pipelines: Data pipelines are modeled as functions over immutable, time-traveled data artifacts, with Git-like versioning and branching semantics (e.g., using Nessie) (Tagliabue et al., 21 Apr 2024).

This abstraction enables reproducible, low-latency, scalable workflows, integrating ML, batch analytics, and BI in a single substrate. Developers benefit from ergonomic, CLI-driven environments that mirror “data as code” and modern DevOps practices.
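
A minimal, hypothetical sketch of this function-centric model in Python (the decorator and runner are invented for illustration, not Bauplan's API) declares each pipeline step as a plain function, records its upstream dependencies, and executes the resulting DAG in topological order:

    from graphlib import TopologicalSorter

    REGISTRY = {}   # step name -> (function, upstream step names)

    def step(*deps: str):
        def register(fn):
            REGISTRY[fn.__name__] = (fn, deps)
            return fn
        return register

    @step()
    def raw_orders():
        return [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 17.5}]

    @step("raw_orders")
    def cleaned_orders(raw_orders):
        return [r for r in raw_orders if r["amount"] > 0]

    @step("cleaned_orders")
    def revenue_report(cleaned_orders):
        return {"total": sum(r["amount"] for r in cleaned_orders)}

    def run(registry):
        """Execute steps in dependency order, passing upstream outputs as arguments."""
        order = TopologicalSorter({name: deps for name, (_, deps) in registry.items()}).static_order()
        results = {}
        for name in order:
            fn, deps = registry[name]
            results[name] = fn(*(results[d] for d in deps))
        return results

    print(run(REGISTRY)["revenue_report"])   # {'total': 59.5}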

7. Interoperability, Ecosystem Integration, and Future Directions

Given ongoing evolution in table formats and analytic engines, seamless interoperability is critical. XTable exemplifies such functionality (Agrawal et al., 17 Jan 2024):

  • Omni-directional metadata translation: Portable, incremental, low-latency conversion among table formats (Delta, Iceberg, Hudi) at the metadata layer, with universal internal representation and commit-by-commit update efficiency.
  • No data duplication: Physical data files remain untouched; only lightweight metadata translation enables engine and format diversity.
  • Scenarios: Multi-format import/export, cross-team analytics, and optimization of engine-specific query performance without fragmentation or copy proliferation.
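
The underlying idea can be sketched as a small, hypothetical translation loop in Python (not XTable's actual code): each new commit in the source format is mapped into a neutral representation and re-emitted as target-format metadata, while the Parquet data files are referenced in place rather than copied:

    from dataclasses import dataclass

    @dataclass
    class Commit:                      # neutral, format-agnostic view of one table version
        version: int
        added_files: list
        removed_files: list
        schema: dict

    def read_source_commits(since_version: int) -> list:
        """Stand-in for reading the source format's log or manifests (e.g., a Delta log)."""
        return [Commit(since_version + 1, ["part-000.parquet"], [], {"id": "long"})]

    def write_target_metadata(commit: Commit) -> None:
        """Stand-in for emitting target-format metadata (e.g., an Iceberg manifest/snapshot)."""
        print(f"target snapshot v{commit.version}: +{commit.added_files} -{commit.removed_files}")

    def sync(last_synced_version: int) -> int:
        """Incrementally translate only the commits produced since the last sync."""
        commits = read_source_commits(last_synced_version)
        for c in commits:
            write_target_metadata(c)       # no data files are rewritten or duplicated
        return commits[-1].version if commits else last_synced_version

    latest = sync(last_synced_version=41)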

Lakehouse architectures increasingly intersect with cloud-native platforms (e.g., Kubernetes, Ceph, Airflow), bring machine learning to the data (Snowpark), and enable modular, secure, and reproducible research in domains from automotive to blockchain to the life sciences (Baker et al., 7 Aug 2025, Bag, 20 Mar 2025, Vargas-Solar et al., 29 Mar 2024). Open challenges remain, including schema evolution under high-velocity ingest, rich metadata enrichment, cross-domain security policy, and efficient ML deployment at lakehouse scale (Hai et al., 2021).


In conclusion, the data lakehouse is a convergent and extensible data architecture that provides robust, governed, and scalable analytics capabilities on heterogeneous data with an open, declaratively composable, and strongly interoperable foundation. It leverages open storage, flexible metadata modeling, multi-table transactional processing, modular workflow orchestration, and unified ML integration as core design elements—addressing many longstanding challenges of both classical data lakes and warehouses, while enabling next-generation analytics and AI at enterprise and cloud-native scale.