RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs (2311.18141v2)

Published 29 Nov 2023 in cs.DC

Abstract: Sparse matrix multiplication is an important kernel for large-scale graph processing and other data-intensive applications. In this paper, we implement various asynchronous, RDMA-based sparse-times-dense (SpMM) and sparse-times-sparse (SpGEMM) algorithms, evaluating their performance in a distributed-memory setting on GPUs. Our RDMA-based implementations use the NVSHMEM communication library for direct, asynchronous one-sided communication between GPUs. We compare our asynchronous implementations to state-of-the-art bulk synchronous GPU libraries as well as a CUDA-aware MPI implementation of the SUMMA algorithm. We find that asynchronous RDMA-based implementations are able to offer favorable performance compared to bulk synchronous implementations, while also allowing for the straightforward implementation of novel work stealing algorithms.
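The mechanism the abstract builds on is NVSHMEM's one-sided, device-initiated communication: a GPU can pull a tile of a remote matrix directly from another GPU's symmetric heap without the owner posting a matching send. The sketch below is a minimal, hypothetical illustration of that primitive, not the paper's implementation; the kernel name, buffer names, and block size are assumptions, and only standard NVSHMEM and CUDA runtime calls are used.

```cuda
// Hypothetical sketch: one GPU (PE) fetches a block of nonzero values from a
// neighboring PE's symmetric heap with a device-initiated, one-sided get.
// Names and sizes are illustrative; error checking is elided.
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void fetch_remote_nonzeros(double *local_buf,
                                      const double *remote_vals, // symmetric heap
                                      size_t nbytes,
                                      int owner_pe) {
    // A single thread issues the get for clarity; a real SpMM/SpGEMM kernel
    // would use warp- or block-scoped variants and overlap the transfer
    // with local computation.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        nvshmem_getmem(local_buf, remote_vals, nbytes, owner_pe);
    }
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    // Bind each PE to one GPU on its node.
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    const size_t nnz = 1 << 20;                  // assumed block size
    // Symmetric allocation: remotely accessible from every PE.
    double *vals = (double *)nvshmem_malloc(nnz * sizeof(double));
    double *local_buf;
    cudaMalloc(&local_buf, nnz * sizeof(double));

    int owner = (mype + 1) % npes;               // fetch from a neighboring PE
    fetch_remote_nonzeros<<<1, 32>>>(local_buf, vals, nnz * sizeof(double), owner);
    cudaDeviceSynchronize();                     // device-side get has completed
    nvshmem_barrier_all();

    cudaFree(local_buf);
    nvshmem_free(vals);
    nvshmem_finalize();
    return 0;
}
```

In the asynchronous algorithms the abstract describes, gets of this kind would be issued on demand for whichever tiles a GPU needs (or steals, in the work-stealing variants), rather than in the lockstep communication rounds of a bulk-synchronous SUMMA implementation.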

