XTable in Action: Seamless Interoperability in Data Lakes (2401.09621v1)

Published 17 Jan 2024 in cs.DB

Abstract: Contemporary approaches to data management increasingly rely on unified analytics and AI platforms to foster collaboration, interoperability, seamless access to reliable data, and high performance. Data lakes featuring open standard table formats such as Delta Lake, Apache Hudi, and Apache Iceberg are central components of these data architectures. Choosing the right format for managing a table is crucial for achieving these objectives. Selecting the best format is onerous, and the result may be only temporary, because the ideal choice can shift over time with data growth, evolving workloads, and the competitive development of table formats and processing engines. Moreover, restricting data access to a single format can hinder data sharing, diminishing business value over the long term. The ability to interoperate seamlessly between formats with negligible overhead can effectively address these challenges. Our solution in this direction is an innovative omni-directional translator, XTable, that facilitates writing data in one format and reading it in any format, thus achieving the desired format interoperability. In this work, we demonstrate the effectiveness of XTable through application scenarios inspired by real-world use cases.

