- The paper provides a benchmark analysis of Parquet and ORC, revealing that Parquet’s encoding methods yield faster decoding speeds.
- The paper finds that modern hardware reduces the need for heavy block compression, exposing trade-offs between performance and storage efficiency.
- The paper demonstrates ORC’s advantage in indexing for low-selectivity workloads, suggesting avenues for future enhancements in columnar formats.
Empirical Evaluation of Columnar Storage Formats
This paper presents a detailed analysis of open-source columnar storage formats, specifically Parquet and ORC, which are integral to modern data analytics systems. Developed to support Hadoop-based ecosystems, these formats have been adopted by numerous analytics platforms, including Hive, Spark, and Presto. However, given the substantial evolution of hardware and analytics workloads since their inception, the formats' design decisions warrant reassessment against contemporary requirements.
Overview
The paper provides an extensive evaluation of Parquet and ORC by dissecting their internal architectures and proposes modifications for their improvement. A meticulously designed benchmark stresses these formats under varying workload configurations to uncover performance and space efficiency dynamics. The researchers concentrate on core aspects like encoding algorithms, block compression, metadata organization, indexing, filtering, and nested data modeling. Their recommendations emphasize applying dictionary encoding by default, prioritizing decoding speed over raw compression ratio, treating block compression as optional, and embedding finer-grained auxiliary data structures.
Numerical Insights and Performance
- Encoding Practices: The analysis reveals that Parquet's aggressive dictionary encoding, which it applies across data types including integers, confers a slight file-size advantage under low cardinality. ORC's decoding, by contrast, is hampered by its more complex selection among multiple encoding algorithms, whereas Parquet's straightforward encoding schemes and strategic use of bit-packing and RLE result in faster decoding.
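To make the encoding discussion concrete, the sketch below illustrates dictionary encoding combined with run-length encoding in plain Python. This is an illustrative toy, not the formats' actual wire layout; all function names are hypothetical. The point it demonstrates is the one the paper makes: these schemes decode with cheap, branch-light expansion loops.

```python
# Toy sketch (NOT Parquet's actual binary layout): dictionary encoding
# maps repeated values to small integer codes, and run-length encoding
# (RLE) collapses runs of identical codes into (code, length) pairs.

def dict_encode(values):
    """Replace each value with an index into a compact dictionary."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def rle_encode(codes):
    """Collapse runs of identical codes into [code, run_length] pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

def rle_dict_decode(dictionary, runs):
    """Decoding is a simple expansion loop: fast on modern CPUs."""
    out = []
    for code, length in runs:
        out.extend([dictionary[code]] * length)
    return out

column = ["US", "US", "US", "EU", "EU", "US"]
d, codes = dict_encode(column)
runs = rle_encode(codes)
assert rle_dict_decode(d, runs) == column
assert runs == [[0, 3], [1, 2], [0, 1]]  # three runs instead of six values
```

The low-cardinality column compresses to a two-entry dictionary plus three runs, and decoding is a single pass of list expansion with no per-value branching on algorithm choice.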
- Compression Trade-offs: The necessity for block compression is questioned given modern hardware capabilities. Empirical evidence indicates that block compression can degrade overall performance despite reducing storage consumption, because decompression adds CPU work to every read. As storage and I/O capabilities improve, reliance on heavy block compression becomes harder to justify.
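The trade-off can be sketched with Python's standard-library `zlib` standing in for codecs such as Snappy or zstd (an assumption for illustration; the paper benchmarks the formats' real codecs). The storage win is real, but so is the extra decompression hop on the read path.

```python
import zlib

# Toy illustration of the block-compression trade-off: compressing a
# column chunk shrinks it on disk, but every read must now pay a full
# decompression pass before value decoding can even begin.
encoded_chunk = b"A" * 4096 + b"B" * 4096  # highly repetitive column data

compressed = zlib.compress(encoded_chunk, level=6)
assert len(compressed) < len(encoded_chunk)  # the storage win

# The extra step on the read path: decompress first, then decode.
restored = zlib.decompress(compressed)
assert restored == encoded_chunk
```

On fast NVMe storage, the CPU cycles spent in that decompression pass can exceed the I/O time saved by reading fewer bytes, which is the imbalance the paper's measurements expose.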
- Indexing Efficacy: ORC's advantage becomes notable under selection pruning given its finer zone map granularity compared to Parquet. However, this is situational and becomes impactful primarily under low-selectivity workloads.
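A minimal zone-map sketch makes the granularity point concrete. The helper names are hypothetical and the block sizes are illustrative stand-ins for Parquet's coarser row groups versus ORC's finer index strides; the mechanism shown (per-block min/max pruning) is the one both formats use.

```python
# Zone maps record per-block (min, max) statistics so a scan can skip
# blocks that cannot contain matching rows. Smaller blocks prune more
# precisely, but only matter when few rows qualify (low selectivity).

def build_zone_maps(column, block_size):
    """Record (min, max) for each fixed-size block of the column."""
    return [
        (min(column[i:i + block_size]), max(column[i:i + block_size]))
        for i in range(0, len(column), block_size)
    ]

def prune_blocks(zone_maps, low, high):
    """Return indices of blocks whose [min, max] overlaps [low, high]."""
    return [
        i for i, (lo, hi) in enumerate(zone_maps)
        if not (hi < low or lo > high)
    ]

column = list(range(1000))             # sorted values: best case for pruning
coarse = build_zone_maps(column, 500)  # coarse granularity (Parquet-like)
fine = build_zone_maps(column, 100)    # finer granularity (ORC-like)

# A highly selective predicate: 150 <= x <= 160
assert prune_blocks(coarse, 150, 160) == [0]  # must still scan 500 rows
assert prune_blocks(fine, 150, 160) == [1]    # scans only 100 rows
```

With a broad predicate, both granularities keep most blocks and the finer statistics buy nothing, which is why the advantage only shows up on selective queries.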
Theoretical Implications
Theoretically, the research challenges traditional practices in format design, specifically the balance between compression and computation. The findings suggest that prioritizing fast, efficient decoding schemes over higher compression ratios is more productive in current hardware environments. This shift in priorities underscores a need for formats that adapt not just to historical trends but to evolving hardware efficiencies and workload demands.
Machine Learning Workloads
Addressing machine learning workloads reveals inefficiencies in existing formats for handling frequent projections of numerous features or low-selectivity queries on vector embeddings. This signals a critical gap that could drive future advancements where columnar formats accommodate the unique demands of ML datasets.
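The projection problem described above can be sketched with a toy column store. The class and its access counter are hypothetical constructs for illustration, not part of either format; they show why projecting thousands of feature columns, as ML training pipelines do, erodes the usual columnar advantage of reading only what you need.

```python
# Each projected column incurs its own metadata lookup and decode, so
# reading thousands of features approaches the cost of a full-row scan.
# This toy store counts per-column accesses to make that visible.

class ToyColumnStore:
    def __init__(self, columns):
        self.columns = columns      # column name -> list of values
        self.column_reads = 0       # counts chunk fetches/decodes

    def project(self, names):
        """Read only the requested columns (one access each)."""
        out = {}
        for name in names:
            self.column_reads += 1  # one fetch + decode per column
            out[name] = self.columns[name]
        return out

store = ToyColumnStore({f"feat_{i}": [0.0] * 8 for i in range(1000)})

# A typical analytics query touching a few columns is cheap...
store.project(["feat_0", "feat_1"])
assert store.column_reads == 2

# ...but an ML job projecting every feature pays the cost per column.
store.project([f"feat_{i}" for i in range(1000)])
assert store.column_reads == 1002
```

The per-column overhead that is negligible for narrow analytical queries is multiplied a thousandfold here, which is the gap the paper identifies for ML workloads.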
Prospective Advancements
Future advancements in columnar storage formats should focus on computationally economical encoding strategies, efficient metadata handling, and sophisticated indexing capabilities to meet the growing demands for faster and more efficient data processing pipelines. The suitability of formats should also be evaluated in the emerging context of hybrid cloud environments, with adaptations to benefit from cloud-native architectures’ latency and I/O characteristics.
In summary, this evaluation identifies key areas where columnar storage formats can evolve to better support contemporary data-intensive applications. By factoring in modern hardware capacities and diversified analytics workloads, future format iterations can sustain efficient data processing ecosystems that align with emergent technological trends.