
A Systematic Survey of General Sparse Matrix-Matrix Multiplication (2002.11273v3)

Published 26 Feb 2020 in cs.DC

Abstract: General Sparse Matrix-Matrix Multiplication (SpGEMM) has attracted much attention from researchers in graph analytics, scientific computing, and deep learning. Many optimization techniques have been developed for different applications and computing architectures over the past decades. The objective of this paper is to provide a structured and comprehensive overview of research on SpGEMM. Existing work is grouped into categories based on target architectures and design choices. Covered topics include typical applications, compression formats, general formulations, key problems and techniques, architecture-oriented optimizations, and programming models. The rationales of different algorithms are analyzed and summarized. This survey covers the progress of SpGEMM research up to 2021. Moreover, a thorough performance comparison of existing implementations is presented. Based on our findings, we highlight future research directions to encourage better designs and implementations in later studies.
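To make the operation surveyed here concrete, the sketch below shows the classic row-wise SpGEMM formulation due to Gustavson, one of the general formulations the survey covers: each row of C = A * B is built by scaling and merging the rows of B selected by the nonzeros of the corresponding row of A, using a sparse accumulator. This is an illustrative, unoptimized sketch (the function name and plain-list CSR representation are assumptions for this example); production libraries such as cuSPARSE or MKL, discussed in the survey, use heavily tuned variants.

```python
def spgemm_gustavson(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val):
    """Compute C = A * B for CSR inputs, returning C in CSR form.

    Each matrix is given as (row pointers, column indices, values).
    """
    n_rows = len(a_ptr) - 1
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(n_rows):
        acc = {}  # sparse accumulator (hash map) for row i of C
        for k_pos in range(a_ptr[i], a_ptr[i + 1]):
            k, a_ik = a_idx[k_pos], a_val[k_pos]
            # Scale row k of B by A[i,k] and merge into the accumulator.
            for j_pos in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[j_pos]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[j_pos]
        for j in sorted(acc):  # emit row i with sorted column indices
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```

The hash-map accumulator here is one of several choices (dense arrays, sorting, and heaps are alternatives); which accumulator performs best on a given architecture is a central question in the SpGEMM literature.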

