Efficiently Processing Joins and Grouped Aggregations on GPUs (2312.00720v2)
Abstract: There is a growing interest in leveraging GPUs for tasks beyond ML, especially in database systems. Despite the existing extensive work on GPU-based database operators, several questions are still open. For instance, the performance of almost all operators suffers from random accesses, which can account for up to 75% of the runtime. In addition, the group-by operator which is widely used in combination with joins, has not been fully explored for GPU acceleration. Furthermore, existing work often uses limited and unrepresentative workloads for evaluation and does not explore the query optimization aspect, i.e., how to choose the most efficient implementation based on the workload. In this paper, we revisit the state-of-the-art GPU-based join and group-by implementations. We identify their inefficiencies and propose several optimizations. We introduce GFTR, a novel technique to reduce random accesses, leading to speedups of up to 2.3x. We further optimize existing hash-based and sort-based group-by implementations, achieving significant speedups (19.4x and 1.7x, respectively). We also present a new partition-based group-by algorithm ideal for high group cardinalities. We analyze the optimizations with cost models, allowing us to predict the speedup. Finally, we conduct a performance evaluation to analyze each implementation. We conclude by providing practical heuristics to guide query optimizers in selecting the most efficient implementation for a given workload.
- Materialization Strategies in a Column-Oriented DBMS. In 2007 IEEE 23rd International Conference on Data Engineering. 466–475. https://doi.org/10.1109/ICDE.2007.367892
- TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
- Andy Adinets and Duane Merrill. 2022. Onesweep: A Faster Least Significant Digit Radix Sort for GPUs. arXiv:2206.01784 [cs.DC]
- Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems. Proc. VLDB Endow. 5, 10 (jun 2012), 1064–1075. https://doi.org/10.14778/2336664.2336678
- AMD. 2023. AMD Instinct MI250X Accelerator. https://www.amd.com/en/products/server-accelerators/instinct-mi250x
- Azure. 2023. GPU optimized virtual machine sizes. online. https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu
- Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited. Proc. VLDB Endow. 7, 1 (sep 2013), 85–96. https://doi.org/10.14778/2732219.2732227
- Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. In 2013 IEEE 29th International Conference on Data Engineering (ICDE). 362–373. https://doi.org/10.1109/ICDE.2013.6544839
- Main-Memory Hash Joins on Modern Processor Architectures. IEEE Transactions on Knowledge and Data Engineering 27, 7 (2015), 1754–1766. https://doi.org/10.1109/TKDE.2014.2313874
- To Partition, or Not to Partition, That is the Join Question in a Real System. In Proceedings of the 2021 International Conference on Management of Data (Virtual Event, China) (SIGMOD ’21). Association for Computing Machinery, New York, NY, USA, 168–180. https://doi.org/10.1145/3448016.3452831
- Sean Baxter. 2016. moderngpu 2.0. (2016). https://github.com/moderngpu/moderngpu/wiki.
- Design and Evaluation of Main Memory Hash Join Algorithms for Multi-Core CPUs. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (Athens, Greece) (SIGMOD ’11). Association for Computing Machinery, New York, NY, USA, 37–48. https://doi.org/10.1145/1989323.1989328
- Sebastian Breß. 2014. The Design and Implementation of CoGaDB: A Column-oriented GPU-accelerated DBMS. Datenbank-Spektrum 14 (2014), 199 – 209. https://api.semanticscholar.org/CorpusID:256060600
- ARDA: Automatic Relational Data Augmentation for Machine Learning. Proc. VLDB Endow. 13, 9 (may 2020), 1373–1387. https://doi.org/10.14778/3397230.3397235
- NVIDIA A100 Tensor Core GPU: Performance and Innovation. IEEE Micro 41, 2 (2021), 29–35. https://doi.org/10.1109/MM.2021.3061394
- HetExchange: Encapsulating Heterogeneous CPU-GPU Parallelism in JIT Compiled Engines. Proc. VLDB Endow. 12, 5 (jan 2019), 544–556. https://doi.org/10.14778/3303753.3303760
- NVIDIA Corporation. 2023a. Cooperative primitives for CUDA C++. https://github.com/NVIDIA/cub GitHub repository.
- NVIDIA Corporation. 2023b. NVIDIA H100 Tensor Core GPU Architecture. https://resources.nvidia.com/en-us-tensor-core
- NVIDIA Corporation. 2023c. NVIDIA Nsight Compute. online. https://developer.nvidia.com/nsight-compute
- NVIDIA Corporation. 2023d. Thrust: Parallel Algorithms Library. https://github.com/NVIDIA/thrust GitHub repository.
- RAPIDS cuDF Developers. 2023. cuDF: GPU DataFrame Library. https://github.com/rapidsai/cudf GitHub repository.
- Bill Dally. 2011. Power, programmability, and granularity: The challenges of exascale computing. In 2011 IEEE International Test Conference. IEEE Computer Society, 12–12.
- Domain-specific hardware accelerators. Commun. ACM 63, 7 (2020), 48–57.
- A Case for Graphics-Driven Query Processing. Proc. VLDB Endow. 16, 10 (jun 2023), 2499–2511. https://doi.org/10.14778/3603581.3603590
- Pipelined Query Processing in Coprocessor Environments. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD ’18). Association for Computing Machinery, New York, NY, USA, 1603–1618. https://doi.org/10.1145/3183713.3183734
- Database Systems: The Complete Book (2 ed.). Prentice Hall Press, USA.
- Google. 2023. Google Cloud GPU Platforms. online. https://cloud.google.com/compute/docs/gpus
- GPU Merge Path: A GPU Merging Algorithm. In Proceedings of the 26th ACM International Conference on Supercomputing (San Servolo Island, Venice, Italy) (ICS ’12). Association for Computing Machinery, New York, NY, USA, 331–340. https://doi.org/10.1145/2304576.2304621
- Efficient gather and scatter operations on graphics processors. In SC ’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. 1–12. https://doi.org/10.1145/1362622.1362684
- Relational Query Coprocessing on Graphics Processors. ACM Trans. Database Syst. 34, 4, Article 21 (dec 2009), 39 pages. https://doi.org/10.1145/1620585.1620588
- Relational Joins on Graphics Processors. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (Vancouver, Canada) (SIGMOD ’08). Association for Computing Machinery, New York, NY, USA, 511–524. https://doi.org/10.1145/1376616.1376670
- Query Processing on Tensor Computation Runtimes. Proc. VLDB Endow. 15, 11 (jul 2022), 2811–2825. https://doi.org/10.14778/3551793.3551833
- Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture. Proc. VLDB Endow. 6, 10 (aug 2013), 889–900. https://doi.org/10.14778/2536206.2536216
- TCUDB: Accelerating Database with Tensor Processors. In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD ’22). Association for Computing Machinery, New York, NY, USA, 1360–1374. https://doi.org/10.1145/3514221.3517869
- GPU Join Processing Revisited. In Proceedings of the Eighth International Workshop on Data Management on New Hardware (Scottsdale, Arizona) (DaMoN ’12). Association for Computing Machinery, New York, NY, USA, 55–62. https://doi.org/10.1145/2236584.2236592
- Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. Proc. VLDB Endow. 2, 2 (aug 2009), 1378–1389. https://doi.org/10.14778/1687553.1687564
- Learning Generalized Linear Models Over Normalized Data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (Melbourne, Victoria, Australia) (SIGMOD ’15). Association for Computing Machinery, New York, NY, USA, 1969–1984. https://doi.org/10.1145/2723372.2723713
- To Join or Not to Join? Thinking Twice about Joins before Feature Selection. In Proceedings of the 2016 International Conference on Management of Data (San Francisco, California, USA) (SIGMOD ’16). Association for Computing Machinery, New York, NY, USA, 19–34. https://doi.org/10.1145/2882903.2882952
- Revisiting Multi-Pass Scatter and Gather on GPUs. In Proceedings of the 47th International Conference on Parallel Processing (Eugene, OR, USA) (ICPP ’18). Association for Computing Machinery, New York, NY, USA, Article 25, 11 pages. https://doi.org/10.1145/3225058.3225095
- Accelerating Multi-Way Joins on the GPU. The VLDB Journal 31, 3 (nov 2021), 529–553. https://doi.org/10.1007/s00778-021-00708-y
- The Art of Balance: A RateupDB™ Experience of Building a CPU/GPU Hybrid Database Product. Proc. VLDB Endow. 14, 12 (jul 2021), 2999–3013. https://doi.org/10.14778/3476311.3476378
- HippogriffDB: Balancing I/O and GPU Bandwidth in Big Data Analytics. Proc. VLDB Endow. 9, 14 (oct 2016), 1647–1658. https://doi.org/10.14778/3007328.3007331
- Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD ’20). Association for Computing Machinery, New York, NY, USA, 1633–1649. https://doi.org/10.1145/3318464.3389705
- Triton Join: Efficiently Scaling to a Large Join State on GPUs with Fast Interconnects. In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD ’22). Association for Computing Machinery, New York, NY, USA, 1017–1032. https://doi.org/10.1145/3514221.3517911
- Optimizing main-memory join on modern hardware. IEEE Transactions on Knowledge and Data Engineering 14, 4 (July 2002), 709–730. https://doi.org/10.1109/TKDE.2002.1019210
- Cache-Conscious Radix-Decluster Projections. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30 (Toronto, Canada) (VLDB ’04). VLDB Endowment, 684–695.
- A Tensor Compiler for Unified Machine Learning Prediction Serving. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 899–917. https://www.usenix.org/conference/osdi20/presentation/nakandala
- PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
- Revisiting hash join on graphics processors: A decade later. Distributed and Parallel Databases 38 (2020), 771–793.
- GPL: A GPU-Based Pipelined Query Processing Engine. In Proceedings of the 2016 International Conference on Management of Data (San Francisco, California, USA) (SIGMOD ’16). Association for Computing Machinery, New York, NY, USA, 1935–1950. https://doi.org/10.1145/2882903.2915224
- MG-Join: A Scalable Join for Massively Parallel Multi-GPU Architectures. In Proceedings of the 2021 International Conference on Management of Data (Virtual Event, China) (SIGMOD ’21). Association for Computing Machinery, New York, NY, USA, 1413–1425. https://doi.org/10.1145/3448016.3457254
- Mark Raasveldt and Hannes Mühleisen. 2019. DuckDB: An Embeddable Analytical Database. In Proceedings of the 2019 International Conference on Management of Data (Amsterdam, Netherlands) (SIGMOD ’19). Association for Computing Machinery, New York, NY, USA, 1981–1984. https://doi.org/10.1145/3299869.3320212
- Query Processing on Heterogeneous CPU/GPU Systems. ACM Comput. Surv. 55, 1, Article 11 (jan 2022), 38 pages. https://doi.org/10.1145/3485126
- Efficient Join Algorithms for Large Database Tables in a Multi-GPU Environment. Proc. VLDB Endow. 14, 4 (dec 2020), 708–720. https://doi.org/10.14778/3436905.3436927
- Ran Rui and Yi-Cheng Tu. 2017. Fast Equi-Join Algorithms on GPUs: Design and Implementation. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management (Chicago, IL, USA) (SSDBM ’17). Association for Computing Machinery, New York, NY, USA, Article 17, 12 pages. https://doi.org/10.1145/3085504.3085521
- Amazon Web Service. [n.d.]. AWS EC2 Instance Types. online. https://aws.amazon.com/ec2/instance-types/
- A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD ’20). Association for Computing Machinery, New York, NY, USA, 1617–1632. https://doi.org/10.1145/3318464.3380595
- Cache Conscious Algorithms for Relational Query Processing. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB 94). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 510–521.
- Materialization strategies in the Vertica analytic database: Lessons learned. In 2013 IEEE 29th International Conference on Data Engineering (ICDE). 1196–1207. https://doi.org/10.1109/ICDE.2013.6544909
- Hardware-Conscious Hash-Joins on GPUs. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). 698–709. https://doi.org/10.1109/ICDE.2019.00068
- An Empirical Performance Comparison between Matrix Multiplication Join and Hash Join on GPUs. In 2023 IEEE 39th International Conference on Data Engineering Workshops (ICDEW). 184–190. https://doi.org/10.1109/ICDEW58674.2023.00034
- Neil C. Thompson and Svenja Spanuth. 2021. The Decline of Computers as a General Purpose Technology. Commun. ACM 64, 3 (feb 2021), 64–72. https://doi.org/10.1145/3430936
- TPC. 2023a. TPC-DS Benchmark. http://www.tpc.org/tpcds/. Accessed: November 13, 2023.
- TPC. 2023b. TPC-H Benchmark. http://www.tpc.org/tpch/. Accessed: November 13, 2023.
- Relational Joins on GPUs: A Closer Look. IEEE Transactions on Parallel and Distributed Systems 28, 9 (2017), 2663–2673. https://doi.org/10.1109/TPDS.2017.2677451
- Orchestrating Data Placement and Query Execution in Heterogeneous CPU-GPU DBMS. Proc. VLDB Endow. 15, 11 (jul 2022), 2491–2503. https://doi.org/10.14778/3551793.3551809
- The Yin and Yang of Processing Data Warehousing Queries on GPU Devices. Proc. VLDB Endow. 6, 10 (aug 2013), 817–828. https://doi.org/10.14778/2536206.2536210
- Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training: Industrial Product. In Proceedings of the 49th Annual International Symposium on Computer Architecture (New York, New York) (ISCA ’22). Association for Computing Machinery, New York, NY, USA, 1042–1057. https://doi.org/10.1145/3470496.3533044