CODAG: Characterizing and Optimizing Decompression Algorithms for GPUs (2307.03760v1)
Abstract: Data compression and decompression have become vital components of big-data applications to manage the exponential growth in the amount of data collected and stored. Furthermore, big-data applications have increasingly adopted GPUs due to their high compute throughput and memory bandwidth. Prior works presume that decompression is memory-bound, dedicating most of the GPU's threads to data movement and adopting complex software techniques to hide the memory latency of reading compressed data and writing uncompressed data. This paper shows that these techniques lead to poor GPU resource utilization, as most threads end up waiting on the few decoding threads, exposing compute and synchronization latencies. Based on this observation, we propose CODAG, a novel and simple kernel architecture for high-throughput decompression on GPUs. CODAG eliminates the use of specialized groups of threads, frees up compute resources to increase the number of parallel decompression streams, and leverages the ample compute work and the GPU's hardware scheduler to tolerate synchronization, compute, and memory latencies. Furthermore, CODAG provides a framework for users to easily incorporate new decompression algorithms without being burdened with implementing complex optimizations to hide memory latency. We validate our proposed architecture with three different encoding techniques, RLE v1, RLE v2, and Deflate, and a wide range of large datasets from different domains. We show that CODAG provides speedups of 13.46x, 5.69x, and 1.18x for RLE v1, RLE v2, and Deflate, respectively, compared to the state-of-the-art decompressors from NVIDIA RAPIDS.
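To make the contrast with prior designs concrete, the sketch below shows what a warp-per-chunk decompression kernel in the spirit of CODAG might look like for a simple RLE v1-style stream: no dedicated reader or writer thread groups, every warp independently owns one compressed chunk, and the GPU's hardware scheduler overlaps the many warps' memory and compute latencies. The chunk layout (pairs of a run-length byte and a value byte), the `ChunkDesc` structure, and all identifiers are illustrative assumptions for this sketch, not the paper's actual format or API.

```cuda
// Hypothetical warp-per-chunk RLE v1-style decompression sketch (not CODAG's
// actual implementation). Each warp decodes one independent chunk; all 32
// lanes follow the same decode loop and cooperate on writing each run.
#include <cstdint>
#include <cuda_runtime.h>

struct ChunkDesc {                // assumed per-chunk metadata layout
    const uint8_t* in;            // start of this chunk's compressed bytes
    uint8_t*       out;           // start of this chunk's uncompressed output
    uint32_t       in_bytes;      // compressed size of this chunk
};

__global__ void rle_v1_decompress(const ChunkDesc* chunks, int num_chunks) {
    // One warp per chunk: the global warp index selects the chunk.
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane    = threadIdx.x % 32;
    if (warp_id >= num_chunks) return;

    ChunkDesc c = chunks[warp_id];
    uint32_t in_pos = 0, out_pos = 0;

    // Assumed stream format: [run-length byte][value byte] pairs.
    while (in_pos + 1 < c.in_bytes) {
        uint32_t run = c.in[in_pos];      // same bytes read by all lanes
        uint8_t  val = c.in[in_pos + 1];
        in_pos += 2;

        // All lanes cooperate on the run so the warp issues wide, coalesced
        // stores instead of leaving 31 lanes idle behind a single decoder.
        for (uint32_t i = lane; i < run; i += 32)
            c.out[out_pos + i] = val;
        out_pos += run;
    }
}
```

A launch would simply assign one warp per chunk, e.g. `rle_v1_decompress<<<(num_chunks * 32 + 255) / 256, 256>>>(d_chunks, num_chunks);`. The point of the sketch is the design choice the abstract describes: rather than splitting a thread block into specialized producer (memory) and consumer (decode) roles that synchronize with each other, every warp runs the whole decode pipeline for its own chunk, so occupancy and the hardware scheduler, not hand-tuned software pipelining, hide the latencies.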