RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs (2311.18141v2)
Abstract: Sparse matrix multiplication is an important kernel for large-scale graph processing and other data-intensive applications. In this paper, we implement several asynchronous, RDMA-based sparse-times-dense (SpMM) and sparse-times-sparse (SpGEMM) algorithms and evaluate their performance on distributed-memory GPU systems. Our RDMA-based implementations use the NVSHMEM communication library for direct, asynchronous, one-sided communication between GPUs. We compare our asynchronous implementations against state-of-the-art bulk-synchronous GPU libraries as well as a CUDA-aware MPI implementation of the SUMMA algorithm. We find that asynchronous RDMA-based implementations offer favorable performance compared to bulk-synchronous implementations, while also allowing for the straightforward implementation of novel work-stealing algorithms.
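The one-sided communication pattern underlying these algorithms can be illustrated with a minimal, hypothetical NVSHMEM sketch (not the paper's actual code): each PE publishes a matrix tile in symmetric memory, and any GPU can then fetch a remote PE's tile directly with an RDMA-style get, without the target posting a matching receive. The tile size, PE-to-GPU mapping, and kernel names below are illustrative assumptions.

```c
/* Minimal, hypothetical sketch of one-sided tile fetching with NVSHMEM;
 * illustrative only, not the paper's implementation. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

#define TILE 256 /* assumed tile edge length; the paper's tiling may differ */

/* Fill this PE's tile with a PE-specific value so the transfer is checkable. */
__global__ void init_tile(float *tile, float val) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < TILE * TILE) tile[i] = val;
}

/* One thread issues a blocking one-sided get for an entire remote tile.
 * The target PE does not participate: this is the RDMA-style communication
 * that lets each GPU progress asynchronously. */
__global__ void fetch_remote_tile(float *local_buf, const float *sym_tile,
                                  int target_pe) {
    if (blockIdx.x == 0 && threadIdx.x == 0)
        nvshmem_float_get(local_buf, sym_tile, TILE * TILE, target_pe);
}

int main(void) {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    /* Bind each PE to one GPU using its node-local PE id. */
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    /* Symmetric allocation: the same symmetric address is valid on every PE,
     * so a remote tile can be addressed through the local pointer. */
    float *sym_tile = (float *)nvshmem_malloc(TILE * TILE * sizeof(float));
    float *local_buf;
    cudaMalloc(&local_buf, TILE * TILE * sizeof(float));

    init_tile<<<(TILE * TILE + 255) / 256, 256>>>(sym_tile, (float)mype);
    cudaDeviceSynchronize();
    nvshmem_barrier_all(); /* ensure every PE's tile is published */

    /* Pull the next PE's tile; no send/recv pair or matching call needed. */
    int target = (mype + 1) % npes;
    fetch_remote_tile<<<1, 1>>>(local_buf, sym_tile, target);
    cudaDeviceSynchronize();

    printf("PE %d of %d fetched a %dx%d tile from PE %d\n",
           mype, npes, TILE, TILE, target);

    nvshmem_free(sym_tile);
    cudaFree(local_buf);
    nvshmem_finalize();
    return 0;
}
```

Under these assumptions, the sketch would be compiled with nvcc, linked against NVSHMEM, and launched with one process per GPU (e.g., via nvshmrun or an MPI launcher). The key property is that fetch_remote_tile completes without any cooperation from the target PE, which is what makes asynchronous tiling and work stealing straightforward to express.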
- M. M. A. Patwary, N. R. Satish, N. Sundaram, J. Park, M. J. Anderson, S. G. Vadlamudi, D. Das, S. G. Pudov, V. O. Pirogov, and P. Dubey, “Parallel efficient sparse matrix-matrix multiplication on multicore platforms,” in ISC. Springer, 2015, pp. 48–57.
- E. Saule, K. Kaya, and Ü. V. Çatalyürek, “Performance evaluation of sparse matrix multiplication kernels on Intel Xeon Phi,” in PPAM. Springer, 2013, pp. 559–570.
- C. Yang, A. Buluç, and J. D. Owens, “Design principles for sparse matrix multiplication on the GPU,” in Euro-Par. Springer, 2018, pp. 672–687.
- G. Schubert, H. Fehske, G. Hager, and G. Wellein, “Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems,” Parallel Processing Letters, vol. 21, no. 03, pp. 339–358, 2011.
- S. Acer, O. Selvitopi, and C. Aykanat, “Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems,” Parallel Computing, vol. 59, pp. 71–96, 2016.
- C. Hong, A. Sukumaran-Rajam, I. Nisa, K. Singh, and P. Sadayappan, “Adaptive sparse tiling for sparse matrix multiplication,” in PPoPP, 2019, pp. 300–314.
- E. Solomonik and T. Hoefler, “Sparse tensor algebra as a parallel programming model,” arXiv preprint arXiv:1512.00066, 2015.
- Z. Gu, J. Moreira, D. Edelsohn, and A. Azad, “Bandwidth optimized parallel algorithms for sparse matrix-matrix multiplication using propagation blocking,” in SPAA, 2020, pp. 293–303.
- R. A. Van De Geijn and J. Watts, “SUMMA: Scalable universal matrix multiplication algorithm,” Concurrency: Practice and Experience, vol. 9, no. 4, pp. 255–274, 1997.
- D. Chakrabarti, Y. Zhan, and C. Faloutsos, “R-MAT: A recursive model for graph mining,” in SDM. SIAM, 2004, pp. 442–446.
- G. M. Slota, S. Rajamanickam, and K. Madduri, “Order or shuffle: Empirically evaluating vertex order impact on parallel graph computations,” in IPDPSW. IEEE, 2017, pp. 588–597.
- A. Azad, A. Buluç, X. S. Li, X. Wang, and J. Langguth, “A distributed-memory algorithm for computing a heavy-weight perfect matching on bipartite graphs,” SIAM Journal on Scientific Computing, vol. 42, no. 4, pp. C143–C168, 2020.
- G. Huang, G. Dai, Y. Wang, and H. Yang, “GE-SpMM: General-purpose sparse matrix-matrix multiplication on GPUs for graph neural networks,” in SC’20, 2020.
- A. Tripathy, K. Yelick, and A. Buluç, “Reducing communication in graph neural network training,” in SC’20, 2020, pp. 1–17.
- Y. Hu, Z. Ye, M. Wang, J. Yu, D. Zheng, M. Li, Z. Zhang, Z. Zhang, and Y. Wang, “FeatGraph: A flexible and efficient backend for graph neural network systems,” in SC’20, 2020.
- A. Buluç and J. R. Gilbert, “The Combinatorial BLAS: Design, implementation, and applications,” The International Journal of High Performance Computing Applications, vol. 25, no. 4, pp. 496–509, 2011.
- A. Azad, A. Buluç, and J. Gilbert, “Parallel triangle counting and enumeration using matrix algebra,” in IPDPSW. IEEE, 2015, pp. 804–811.
- S. van Dongen, “Graph clustering by flow simulation,” Ph.D. dissertation, University of Utrecht, 2000.
- A. Bustamam, K. Burrage, and N. A. Hamilton, “Fast parallel Markov clustering in bioinformatics using massively parallel computing on GPU with CUDA and ELLPACK-R sparse format,” IEEE/ACM TCBB, vol. 9, no. 3, pp. 679–692, 2012.
- T. A. Davis, “Graph algorithms via SuiteSparse:GraphBLAS: Triangle counting and k-truss,” in HPEC. IEEE, 2018, pp. 1–6.
- U. Borštnik, J. VandeVondele, V. Weber, and J. Hutter, “Sparse matrix multiplication: The distributed block-compressed sparse row library,” Parallel Computing, vol. 40, no. 5-6, pp. 47–58, 2014.
- B. Brock, A. Buluç, and K. Yelick, “BCL: A cross-platform distributed data structures library,” in ICPP, 2019.
- J. Bachan, D. Bonachea, P. H. Hargrove, S. Hofmeyr, M. Jacquelin, A. Kamil, B. van Straalen, and S. B. Baden, “The UPC++ PGAS library for exascale computing,” in Proceedings of the Second Annual PGAS Applications Workshop. ACM, 2017, p. 7.
- K. Fürlinger, T. Fuchs, and R. Kowalewski, “DASH: A C++ PGAS library for distributed data structures and parallel algorithms,” in HPCC, Sydney, Australia, Dec. 2016, pp. 983–990.
- S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, A. Dener, V. Eijkhout, W. D. Gropp, D. Karpeyev, D. Kaushik, M. G. Knepley, D. A. May, L. C. McInnes, R. T. Mills, T. Munson, K. Rupp, P. Sanan, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang, “PETSc Web page,” 2021. [Online]. Available: https://www.mcs.anl.gov/petsc
- G. Guidi, O. Selvitopi, M. Ellis, L. Oliker, K. Yelick, and A. Buluç, “Parallel string graph construction and transitive reduction for de novo genome assembly,” in IPDPS. IEEE, 2021.
- F. Mössbauer, R. Kowalewski, T. Fuchs, and K. Fürlinger, “A portable multidimensional coarray for C++,” in PDP, Cambridge, UK, Mar. 2018.
- A. Azad, O. Selvitopi, M. T. Hussain, J. Gilbert, and A. Buluç, “Combinatorial BLAS 2.0: Scaling combinatorial algorithms on distributed-memory systems,” IEEE Transactions on Parallel and Distributed Systems, 2021.
- E. Solomonik, D. Matthews, J. Hammond, and J. Demmel, “Cyclops tensor framework: Reducing communication and eliminating load imbalance in massively parallel contractions,” in IPDPS. IEEE, 2013, pp. 813–824.
- A. Azad, G. Ballard, A. Buluç, J. Demmel, L. Grigori, O. Schwartz, S. Toledo, and S. Williams, “Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication,” SIAM Journal on Scientific Computing, vol. 38, no. 6, pp. C624–C651, 2016.