
Corra: Correlation-Aware Column Compression (2403.17229v2)

Published 25 Mar 2024 in cs.DB

Abstract: Column encoding schemes have witnessed a spark of interest with the rise of open storage formats (like Parquet) in data lakes in modern cloud deployments. This is not surprising -- as data volume increases, it becomes more and more important to reduce storage cost on block storage (such as S3) as well as reduce memory pressure in multi-tenant in-memory buffers of cloud databases. However, single-column encoding schemes have reached a plateau in terms of the compression size they can achieve. We argue that this is due to the neglect of cross-column correlations. For instance, consider the column pair ($\texttt{city}$, $\texttt{zip_code}$). Typically, cities have only a few dozen unique zip codes. If this information is properly exploited, it can significantly reduce the space consumption of the latter column. In this work, we depart from the established path of compressing data using only single-column encoding schemes and introduce several what we call $\textit{horizontal}$, correlation-aware encoding schemes. We demonstrate their advantages over single-column encoding schemes on the well-known TPC-H's $\texttt{lineitem}$, LDBC's $\texttt{message}$, DMV, and Taxi datasets. Our correlation-aware encoding schemes save up to 58.3% of the compressed size over single-column schemes for $\texttt{lineitem}$'s $\texttt{receiptdate}$, 53.7% for DMV's $\texttt{zip_code}$, and 85.16% for Taxi's $\texttt{total_amount}$.
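The (city, zip_code) intuition from the abstract can be illustrated with a small sketch. This is not the paper's encoding scheme; it is a minimal toy, assuming a fixed-width code and a per-city dictionary of zip codes, and the names `bits_needed`, `encode_correlated`, `decode_correlated`, and the sample data are all hypothetical. It shows only that indexing a zip code within its city's own list can need fewer bits per value than indexing it within a global dictionary.

```python
from math import ceil, log2


def bits_needed(n_distinct: int) -> int:
    """Bits per value for a fixed-width code over n_distinct symbols (at least 1)."""
    return max(1, ceil(log2(n_distinct)))


def encode_correlated(cities, zips):
    """Encode each zip code as its index within its own city's list of zip codes.

    Returns the per-city dictionaries (the metadata a decoder needs)
    and the list of local codes, one per row.
    """
    per_city = {}  # city -> distinct zip codes, in first-seen order
    codes = []
    for city, zip_code in zip(cities, zips):
        local = per_city.setdefault(city, [])
        if zip_code not in local:
            local.append(zip_code)
        codes.append(local.index(zip_code))
    return per_city, codes


def decode_correlated(cities, per_city, codes):
    """Recover the zip column from the city column plus the local codes."""
    return [per_city[city][code] for city, code in zip(cities, codes)]


# Hypothetical sample data, for illustration only.
cities = ["Boston", "Boston", "Austin", "Austin", "Boston", "Austin"]
zips = ["02108", "02109", "73301", "78701", "02108", "78701"]

per_city, codes = encode_correlated(cities, zips)
restored = decode_correlated(cities, per_city, codes)

# Width of a single-column dictionary code vs. the per-city code.
global_bits = bits_needed(len(set(zips)))
local_bits = bits_needed(max(len(v) for v in per_city.values()))
print(f"global dictionary: {global_bits} bits/value, per-city: {local_bits} bits/value")
```

Decoding needs the city column to be available first, so the saving on `zip_code` comes at the cost of a dependency between the two columns. The sketch also ignores the space taken by the per-city dictionaries themselves.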
