Corra: Correlation-Aware Column Compression (2403.17229v2)
Abstract: Column encoding schemes have seen a surge of interest with the rise of open storage formats (like Parquet) in the data lakes of modern cloud deployments. This is not surprising: as data volume increases, it becomes more and more important to reduce storage cost on block storage (such as S3) and to reduce memory pressure in the multi-tenant in-memory buffers of cloud databases. However, single-column encoding schemes have reached a plateau in the compressed size they can achieve. We argue that this is because they neglect cross-column correlations. For instance, consider the column pair ($\texttt{city}$, $\texttt{zip_code}$). Typically, a city has only a few dozen unique zip codes. If this information is properly exploited, it can significantly reduce the space consumption of the latter column. In this work, we depart from the established path of compressing data using only single-column encoding schemes and introduce several correlation-aware encoding schemes, which we call $\textit{horizontal}$. We demonstrate their advantages over single-column encoding schemes on the well-known TPC-H $\texttt{lineitem}$, LDBC $\texttt{message}$, DMV, and Taxi datasets. Our correlation-aware encoding schemes save up to 58.3% of the compressed size over single-column schemes for $\texttt{lineitem}$'s $\texttt{receiptdate}$, 53.7% for DMV's $\texttt{zip_code}$, and 85.16% for Taxi's $\texttt{total_amount}$.
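The ($\texttt{city}$, $\texttt{zip_code}$) intuition can be illustrated with a small sketch. This is not the paper's encoding scheme, which the abstract does not specify; it is a hypothetical illustration in which each zip code is stored as an index into a dictionary local to its city. The function names (`encode_dependent`, `decode_dependent`, `bits_for`) and the sample data are assumptions made for this example.

```python
def bits_for(n_values):
    """Bits per value for a fixed-width code over n_values distinct symbols."""
    return max(1, (n_values - 1).bit_length())


def encode_dependent(reference, dependent):
    """Encode `dependent` as indexes into a dictionary local to each reference value.

    Hypothetical helper for illustration only, not the paper's algorithm.
    """
    local_dicts = {}  # reference value -> distinct dependent values, in first-seen order
    codes = []
    for ref, dep in zip(reference, dependent):
        values = local_dicts.setdefault(ref, [])
        if dep not in values:
            values.append(dep)
        codes.append(values.index(dep))
    return local_dicts, codes


def decode_dependent(reference, codes, local_dicts):
    """Recover the dependent column from the reference column and the local codes."""
    return [local_dicts[ref][code] for ref, code in zip(reference, codes)]


# Made-up sample data: each city has only a handful of zip codes.
city = ["Berlin", "Berlin", "Munich", "Berlin", "Munich", "Hamburg"]
zip_code = ["10115", "10117", "80331", "10115", "80333", "20095"]

local_dicts, codes = encode_dependent(city, zip_code)
restored = decode_dependent(city, codes, local_dicts)

# Code width with a single global dictionary vs. per-city dictionaries.
global_bits = bits_for(len(set(zip_code)))
local_bits = bits_for(max(len(v) for v in local_dicts.values()))
```

On this sample, a global dictionary over the zip codes needs 3 bits per value, while the per-city codes need 1 bit. The local dictionaries themselves take space, so whether such a scheme pays off depends on how strong the correlation is in the actual data.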