Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
133 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
46 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Efficiently Processing Joins and Grouped Aggregations on GPUs (2312.00720v2)

Published 1 Dec 2023 in cs.DB

Abstract: There is a growing interest in leveraging GPUs for tasks beyond ML, especially in database systems. Despite the existing extensive work on GPU-based database operators, several questions are still open. For instance, the performance of almost all operators suffers from random accesses, which can account for up to 75% of the runtime. In addition, the group-by operator which is widely used in combination with joins, has not been fully explored for GPU acceleration. Furthermore, existing work often uses limited and unrepresentative workloads for evaluation and does not explore the query optimization aspect, i.e., how to choose the most efficient implementation based on the workload. In this paper, we revisit the state-of-the-art GPU-based join and group-by implementations. We identify their inefficiencies and propose several optimizations. We introduce GFTR, a novel technique to reduce random accesses, leading to speedups of up to 2.3x. We further optimize existing hash-based and sort-based group-by implementations, achieving significant speedups (19.4x and 1.7x, respectively). We also present a new partition-based group-by algorithm ideal for high group cardinalities. We analyze the optimizations with cost models, allowing us to predict the speedup. Finally, we conduct a performance evaluation to analyze each implementation. We conclude by providing practical heuristics to guide query optimizers in selecting the most efficient implementation for a given workload.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (68)
  1. Materialization Strategies in a Column-Oriented DBMS. In 2007 IEEE 23rd International Conference on Data Engineering. 466–475. https://doi.org/10.1109/ICDE.2007.367892
  2. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
  3. Andy Adinets and Duane Merrill. 2022. Onesweep: A Faster Least Significant Digit Radix Sort for GPUs. arXiv:2206.01784 [cs.DC]
  4. Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems. Proc. VLDB Endow. 5, 10 (jun 2012), 1064–1075. https://doi.org/10.14778/2336664.2336678
  5. AMD. 2023. AMD Instinct MI250X Accelerator. https://www.amd.com/en/products/server-accelerators/instinct-mi250x
  6. Azure. 2023. GPU optimized virtual machine sizes. online. https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu
  7. Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited. Proc. VLDB Endow. 7, 1 (sep 2013), 85–96. https://doi.org/10.14778/2732219.2732227
  8. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. In 2013 IEEE 29th International Conference on Data Engineering (ICDE). 362–373. https://doi.org/10.1109/ICDE.2013.6544839
  9. Main-Memory Hash Joins on Modern Processor Architectures. IEEE Transactions on Knowledge and Data Engineering 27, 7 (2015), 1754–1766. https://doi.org/10.1109/TKDE.2014.2313874
  10. To Partition, or Not to Partition, That is the Join Question in a Real System. In Proceedings of the 2021 International Conference on Management of Data (Virtual Event, China) (SIGMOD ’21). Association for Computing Machinery, New York, NY, USA, 168–180. https://doi.org/10.1145/3448016.3452831
  11. Sean Baxter. 2016. moderngpu 2.0. (2016). https://github.com/moderngpu/moderngpu/wiki.
  12. Design and Evaluation of Main Memory Hash Join Algorithms for Multi-Core CPUs. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (Athens, Greece) (SIGMOD ’11). Association for Computing Machinery, New York, NY, USA, 37–48. https://doi.org/10.1145/1989323.1989328
  13. Sebastian Breß. 2014. The Design and Implementation of CoGaDB: A Column-oriented GPU-accelerated DBMS. Datenbank-Spektrum 14 (2014), 199 – 209. https://api.semanticscholar.org/CorpusID:256060600
  14. ARDA: Automatic Relational Data Augmentation for Machine Learning. Proc. VLDB Endow. 13, 9 (may 2020), 1373–1387. https://doi.org/10.14778/3397230.3397235
  15. NVIDIA A100 Tensor Core GPU: Performance and Innovation. IEEE Micro 41, 2 (2021), 29–35. https://doi.org/10.1109/MM.2021.3061394
  16. HetExchange: Encapsulating Heterogeneous CPU-GPU Parallelism in JIT Compiled Engines. Proc. VLDB Endow. 12, 5 (jan 2019), 544–556. https://doi.org/10.14778/3303753.3303760
  17. NVIDIA Corporation. 2023a. Cooperative primitives for CUDA C++. https://github.com/NVIDIA/cub GitHub repository.
  18. NVIDIA Corporation. 2023b. NVIDIA H100 Tensor Core GPU Architecture. https://resources.nvidia.com/en-us-tensor-core
  19. NVIDIA Corporation. 2023c. NVIDIA Nsight Compute. online. https://developer.nvidia.com/nsight-compute
  20. NVIDIA Corporation. 2023d. Thrust: Parallel Algorithms Library. https://github.com/NVIDIA/thrust GitHub repository.
  21. RAPIDS cuDF Developers. 2023. cuDF: GPU DataFrame Library. https://github.com/rapidsai/cudf GitHub repository.
  22. Bill Dally. 2011. Power, programmability, and granularity: The challenges of exascale computing. In 2011 IEEE International Test Conference. IEEE Computer Society, 12–12.
  23. Domain-specific hardware accelerators. Commun. ACM 63, 7 (2020), 48–57.
  24. A Case for Graphics-Driven Query Processing. Proc. VLDB Endow. 16, 10 (jun 2023), 2499–2511. https://doi.org/10.14778/3603581.3603590
  25. Pipelined Query Processing in Coprocessor Environments. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD ’18). Association for Computing Machinery, New York, NY, USA, 1603–1618. https://doi.org/10.1145/3183713.3183734
  26. Database Systems: The Complete Book (2 ed.). Prentice Hall Press, USA.
  27. Google. 2023. Google Cloud GPU Platforms. online. https://cloud.google.com/compute/docs/gpus
  28. GPU Merge Path: A GPU Merging Algorithm. In Proceedings of the 26th ACM International Conference on Supercomputing (San Servolo Island, Venice, Italy) (ICS ’12). Association for Computing Machinery, New York, NY, USA, 331–340. https://doi.org/10.1145/2304576.2304621
  29. Efficient gather and scatter operations on graphics processors. In SC ’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. 1–12. https://doi.org/10.1145/1362622.1362684
  30. Relational Query Coprocessing on Graphics Processors. ACM Trans. Database Syst. 34, 4, Article 21 (dec 2009), 39 pages. https://doi.org/10.1145/1620585.1620588
  31. Relational Joins on Graphics Processors. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (Vancouver, Canada) (SIGMOD ’08). Association for Computing Machinery, New York, NY, USA, 511–524. https://doi.org/10.1145/1376616.1376670
  32. Query Processing on Tensor Computation Runtimes. Proc. VLDB Endow. 15, 11 (jul 2022), 2811–2825. https://doi.org/10.14778/3551793.3551833
  33. Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture. Proc. VLDB Endow. 6, 10 (aug 2013), 889–900. https://doi.org/10.14778/2536206.2536216
  34. TCUDB: Accelerating Database with Tensor Processors. In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD ’22). Association for Computing Machinery, New York, NY, USA, 1360–1374. https://doi.org/10.1145/3514221.3517869
  35. GPU Join Processing Revisited. In Proceedings of the Eighth International Workshop on Data Management on New Hardware (Scottsdale, Arizona) (DaMoN ’12). Association for Computing Machinery, New York, NY, USA, 55–62. https://doi.org/10.1145/2236584.2236592
  36. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. Proc. VLDB Endow. 2, 2 (aug 2009), 1378–1389. https://doi.org/10.14778/1687553.1687564
  37. Learning Generalized Linear Models Over Normalized Data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (Melbourne, Victoria, Australia) (SIGMOD ’15). Association for Computing Machinery, New York, NY, USA, 1969–1984. https://doi.org/10.1145/2723372.2723713
  38. To Join or Not to Join? Thinking Twice about Joins before Feature Selection. In Proceedings of the 2016 International Conference on Management of Data (San Francisco, California, USA) (SIGMOD ’16). Association for Computing Machinery, New York, NY, USA, 19–34. https://doi.org/10.1145/2882903.2882952
  39. Revisiting Multi-Pass Scatter and Gather on GPUs. In Proceedings of the 47th International Conference on Parallel Processing (Eugene, OR, USA) (ICPP ’18). Association for Computing Machinery, New York, NY, USA, Article 25, 11 pages. https://doi.org/10.1145/3225058.3225095
  40. Accelerating Multi-Way Joins on the GPU. The VLDB Journal 31, 3 (nov 2021), 529–553. https://doi.org/10.1007/s00778-021-00708-y
  41. The Art of Balance: A RateupDB™ Experience of Building a CPU/GPU Hybrid Database Product. Proc. VLDB Endow. 14, 12 (jul 2021), 2999–3013. https://doi.org/10.14778/3476311.3476378
  42. HippogriffDB: Balancing I/O and GPU Bandwidth in Big Data Analytics. Proc. VLDB Endow. 9, 14 (oct 2016), 1647–1658. https://doi.org/10.14778/3007328.3007331
  43. Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD ’20). Association for Computing Machinery, New York, NY, USA, 1633–1649. https://doi.org/10.1145/3318464.3389705
  44. Triton Join: Efficiently Scaling to a Large Join State on GPUs with Fast Interconnects. In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD ’22). Association for Computing Machinery, New York, NY, USA, 1017–1032. https://doi.org/10.1145/3514221.3517911
  45. Optimizing main-memory join on modern hardware. IEEE Transactions on Knowledge and Data Engineering 14, 4 (July 2002), 709–730. https://doi.org/10.1109/TKDE.2002.1019210
  46. Cache-Conscious Radix-Decluster Projections. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30 (Toronto, Canada) (VLDB ’04). VLDB Endowment, 684–695.
  47. A Tensor Compiler for Unified Machine Learning Prediction Serving. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 899–917. https://www.usenix.org/conference/osdi20/presentation/nakandala
  48. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
  49. Revisiting hash join on graphics processors: A decade later. Distributed and Parallel Databases 38 (2020), 771–793.
  50. GPL: A GPU-Based Pipelined Query Processing Engine. In Proceedings of the 2016 International Conference on Management of Data (San Francisco, California, USA) (SIGMOD ’16). Association for Computing Machinery, New York, NY, USA, 1935–1950. https://doi.org/10.1145/2882903.2915224
  51. MG-Join: A Scalable Join for Massively Parallel Multi-GPU Architectures. In Proceedings of the 2021 International Conference on Management of Data (Virtual Event, China) (SIGMOD ’21). Association for Computing Machinery, New York, NY, USA, 1413–1425. https://doi.org/10.1145/3448016.3457254
  52. Mark Raasveldt and Hannes Mühleisen. 2019. DuckDB: An Embeddable Analytical Database. In Proceedings of the 2019 International Conference on Management of Data (Amsterdam, Netherlands) (SIGMOD ’19). Association for Computing Machinery, New York, NY, USA, 1981–1984. https://doi.org/10.1145/3299869.3320212
  53. Query Processing on Heterogeneous CPU/GPU Systems. ACM Comput. Surv. 55, 1, Article 11 (jan 2022), 38 pages. https://doi.org/10.1145/3485126
  54. Efficient Join Algorithms for Large Database Tables in a Multi-GPU Environment. Proc. VLDB Endow. 14, 4 (dec 2020), 708–720. https://doi.org/10.14778/3436905.3436927
  55. Ran Rui and Yi-Cheng Tu. 2017. Fast Equi-Join Algorithms on GPUs: Design and Implementation. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management (Chicago, IL, USA) (SSDBM ’17). Association for Computing Machinery, New York, NY, USA, Article 17, 12 pages. https://doi.org/10.1145/3085504.3085521
  56. Amazon Web Service. [n.d.]. AWS EC2 Instance Types. online. https://aws.amazon.com/ec2/instance-types/
  57. A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD ’20). Association for Computing Machinery, New York, NY, USA, 1617–1632. https://doi.org/10.1145/3318464.3380595
  58. Cache Conscious Algorithms for Relational Query Processing. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB 94). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 510–521.
  59. Materialization strategies in the Vertica analytic database: Lessons learned. In 2013 IEEE 29th International Conference on Data Engineering (ICDE). 1196–1207. https://doi.org/10.1109/ICDE.2013.6544909
  60. Hardware-Conscious Hash-Joins on GPUs. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). 698–709. https://doi.org/10.1109/ICDE.2019.00068
  61. An Empirical Performance Comparison between Matrix Multiplication Join and Hash Join on GPUs. In 2023 IEEE 39th International Conference on Data Engineering Workshops (ICDEW). 184–190. https://doi.org/10.1109/ICDEW58674.2023.00034
  62. Neil C. Thompson and Svenja Spanuth. 2021. The Decline of Computers as a General Purpose Technology. Commun. ACM 64, 3 (feb 2021), 64–72. https://doi.org/10.1145/3430936
  63. TPC. 2023a. TPC-DS Benchmark. http://www.tpc.org/tpcds/. Accessed: November 13, 2023.
  64. TPC. 2023b. TPC-H Benchmark. http://www.tpc.org/tpch/. Accessed: November 13, 2023.
  65. Relational Joins on GPUs: A Closer Look. IEEE Transactions on Parallel and Distributed Systems 28, 9 (2017), 2663–2673. https://doi.org/10.1109/TPDS.2017.2677451
  66. Orchestrating Data Placement and Query Execution in Heterogeneous CPU-GPU DBMS. Proc. VLDB Endow. 15, 11 (jul 2022), 2491–2503. https://doi.org/10.14778/3551793.3551809
  67. The Yin and Yang of Processing Data Warehousing Queries on GPU Devices. Proc. VLDB Endow. 6, 10 (aug 2013), 817–828. https://doi.org/10.14778/2536206.2536210
  68. Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training: Industrial Product. In Proceedings of the 49th Annual International Symposium on Computer Architecture (New York, New York) (ISCA ’22). Association for Computing Machinery, New York, NY, USA, 1042–1057. https://doi.org/10.1145/3470496.3533044
Citations (1)

Summary

We haven't generated a summary for this paper yet.

X Twitter Logo Streamline Icon: https://streamlinehq.com