A Systematic Literature Survey of Sparse Matrix-Vector Multiplication (2404.06047v1)
Abstract: Sparse matrix-vector multiplication (SpMV) is a crucial computing kernel with widespread applications in iterative algorithms. Over the past decades, research on SpMV optimization has made remarkable strides, producing a wide range of optimization techniques. However, a comprehensive and systematic literature survey that introduces, analyzes, discusses, and summarizes recent advancements in SpMV is currently lacking. Aiming to fill this gap, this paper compares existing techniques and analyzes their strengths and weaknesses. We begin by highlighting two representative applications of SpMV, then conduct an in-depth overview of the important techniques that optimize SpMV on modern architectures, which we classify as classic, auto-tuning, machine learning, and mixed-precision-based optimizations. We also elaborate on hardware platforms, including CPU, GPU, FPGA, processing-in-memory, heterogeneous, and distributed platforms. We present a comprehensive experimental evaluation comparing the performance of state-of-the-art SpMV implementations. Based on our findings, we identify several challenges and point out future research directions. This survey is intended to give researchers a comprehensive understanding of SpMV optimization on modern architectures and to provide guidance for future work.
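To make the kernel under discussion concrete, the following is a minimal sketch (not from the surveyed paper) of SpMV over the widely used CSR (Compressed Sparse Row) storage format, which many of the surveyed optimizations take as their baseline; the function name `spmv_csr` and the example matrix are illustrative assumptions.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR.

    values  -- nonzero entries of A, stored row by row
    col_idx -- column index of each nonzero in `values`
    row_ptr -- row_ptr[i]:row_ptr[i+1] delimits row i's nonzeros
    x       -- dense input vector
    """
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        # Accumulate the dot product of row i with x,
        # touching only the stored (nonzero) entries.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 3x3 example matrix: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
values  = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 1.0, 1.0])
print(spmv_csr(values, col_idx, row_ptr, x))  # [5. 2. 8.]
```

The irregular inner-loop trip count (rows have different numbers of nonzeros) and the indirect access `x[col_idx[k]]` are precisely the load-imbalance and memory-locality problems that the storage formats and auto-tuning techniques surveyed above aim to mitigate.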
- M. A. V. Kulkarni and P. Barde, “A survey on performance modelling and optimization techniques for spmv on gpus,” Int. J. Comput. Sci. Inf. Technol, vol. 5, pp. 7577–7582, 2014.
- M. Grossman, C. Thiele, M. Araya-Polo, F. Frank, F. O. Alpak, and V. Sarkar, “A survey of sparse matrix-vector multiplication performance on large matrices,” arXiv preprint arXiv:1608.00636, 2016.
- S. Filippone, V. Cardellini, D. Barbieri, and A. Fanfarillo, “Sparse matrix-vector multiplication on gpgpus,” ACM Trans. Math. Softw., vol. 43, no. 4, pp. 30:1–30:49, 2017. [Online]. Available: https://doi.org/10.1145/3017994
- Q. Wang, M. Li, J. Pang, and D. Zhu, “Research on performance optimization for sparse matrix-vector multiplication in multi/many-core architecture,” in 2020 2nd International Conference on Information Technology and Computer Application (ITCA). IEEE, 2020, pp. 350–362.
- G. Xiao, C. Yin, T. Zhou, X. Li, Y. Chen, and K. Li, “A survey of accelerating parallel sparse linear algebra,” ACM Comput. Surv., vol. 56, no. 1, aug 2023. [Online]. Available: https://doi.org/10.1145/3604606
- X. Fu, B. Zhang, T. Wang, W. Li, Y. Lu, E. Yi, J. Zhao, X. Geng, F. Li, J. Zhang, Z. Jin, and W. Liu, “Pangulu: A scalable regular two-dimensional block-cyclic sparse direct solver on distributed heterogeneous systems,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023, Denver, CO, USA, November 12-17, 2023, D. Arnold, R. M. Badia, and K. M. Mohror, Eds. ACM, 2023, pp. 51:1–51:14. [Online]. Available: https://doi.org/10.1145/3581784.3607050
- Y. Saad and M. H. Schultz, “Gmres: A generalized minimal residual algorithm for solving nonsymmetric linear systems,” SIAM Journal on scientific and statistical computing, vol. 7, no. 3, pp. 856–869, 1986.
- K. Suzuki, T. Fukaya, and T. Iwashita, “A novel ILU preconditioning method with a block structure suitable for SIMD vectorization,” J. Comput. Appl. Math., vol. 419, p. 114687, 2023. [Online]. Available: https://doi.org/10.1016/j.cam.2022.114687
- X. Shi, Z. Zheng, Y. Zhou, H. Jin, L. He, B. Liu, and Q. Hua, “Graph processing on gpus: A survey,” ACM Comput. Surv., vol. 50, no. 6, pp. 81:1–81:35, 2018. [Online]. Available: https://doi.org/10.1145/3128571
- H. Tong, C. Faloutsos, and J. Pan, “Random walk with restart: fast solutions and applications,” Knowl. Inf. Syst., vol. 14, no. 3, pp. 327–346, 2008. [Online]. Available: https://doi.org/10.1007/s10115-007-0094-2
- A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, and P. Sadayappan, “Fast sparse matrix-vector multiplication on gpus for graph applications,” in International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014, New Orleans, LA, USA, November 16-21, 2014, T. Damkroger and J. J. Dongarra, Eds. IEEE Computer Society, 2014, pp. 781–792. [Online]. Available: https://doi.org/10.1109/SC.2014.69
- T. Wu, B. Wang, Y. Shan, F. Yan, Y. Wang, and N. Xu, “Efficient pagerank and spmv computation on AMD gpus,” in 39th International Conference on Parallel Processing, ICPP 2010, San Diego, California, USA, 13-16 September 2010. IEEE Computer Society, 2010, pp. 81–89. [Online]. Available: https://doi.org/10.1109/ICPP.2010.17
- J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” J. ACM, vol. 46, no. 5, pp. 604–632, 1999. [Online]. Available: https://doi.org/10.1145/324133.324140
- S. Brin and L. Page, “Reprint of: The anatomy of a large-scale hypertextual web search engine,” Comput. Networks, vol. 56, no. 18, pp. 3825–3833, 2012. [Online]. Available: https://doi.org/10.1016/j.comnet.2012.10.007
- G. V. Paolini and G. R. D. Brozolo, “Data structures to vectorize cg algorithms for general sparsity patterns,” BIT Numerical Mathematics, vol. 29, no. 4, pp. 703–718, 1989.
- A. Peters, “Sparse matrix vector multiplication techniques on the IBM 3090 VF,” Parallel Comput., vol. 17, no. 12, pp. 1409–1424, 1991. [Online]. Available: https://doi.org/10.1016/S0167-8191(05)80007-9
- Y. Saad, “SPARSKIT: A Basic Took Kit for Sparse Matrix Computations, Version 2,” http://www. cs. umn. edu/saad/software/SPARSKIT/sparskit. html, 1994.
- N. Bell and M. Garland, “Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors,” in Proceedings of the ACM/IEEE Conference on High Performance Computing, SC, November 14-20, 2009, Portland, Oregon, USA. ACM, 2009, pp. 1–11. [Online]. Available: https://doi.org/10.1145/1654059.1654078
- S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens, “Scan Primitives for GPU Computing,” in Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, ser. GH’07. Goslar, DEU: Eurographics Association, 2007, p. 97–106.
- M. Garland, “Sparse Matrix Computations on Manycore GPU’s,” in Proceedings of the 45th annual Design Automation Conference, ser. DAC ’08. New York, NY, USA: Association for Computing Machinery, Jun 2008, p. 2–6. [Online]. Available: https://doi.org/10.1145/1391469.1391473
- H. Dang and B. Schmidt, “The Sliced COO Format for Sparse Matrix-Vector Multiplication on CUDA-Enabled GPUs,” in Proceedings of the International Conference on Computational Science, ICCS, Omaha, Nebraska, USA, 4-6 June, 2012, ser. Procedia Computer Science, H. H. Ali, Y. Shi, D. Khazanchi, M. Lees, G. D. van Albada, J. J. Dongarra, and P. M. A. Sloot, Eds., vol. 9. Elsevier, 2012, pp. 57–66. [Online]. Available: https://doi.org/10.1016/j.procs.2012.04.007
- E. F. D’Azevedo, M. R. Fahey, and R. T. Mills, “Vectorized sparse matrix multiply for compressed row storage format,” in Computational Science - ICCS 2005, 5th International Conference, Atlanta, GA, USA, May 22-25, 2005, Proceedings, Part I, ser. Lecture Notes in Computer Science, V. S. Sunderam, G. D. van Albada, P. M. A. Sloot, and J. J. Dongarra, Eds., vol. 3514. Springer, 2005, pp. 99–106. [Online]. Available: https://doi.org/10.1007/11428831\_13
- A. Monakov, A. Lokhmotov, and A. Avetisyan, “Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures,” in High Performance Embedded Architectures and Compilers, 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25-27, 2010. Proceedings, ser. Lecture Notes in Computer Science, Y. N. Patt, P. Foglia, E. Duesterwald, P. Faraboschi, and X. Martorell, Eds., vol. 5952. Springer, 2010, pp. 111–125. [Online]. Available: https://doi.org/10.1007/978-3-642-11515-8\_10
- D. Barbieri, V. Cardellini, A. Fanfarillo, and S. Filippone, “Three Storage Formats for Sparse Matrices on GPGPUs,” 2015.
- M. Kreutzer, G. Hager, G. Wellein, H. Fehske, A. Basermann, and A. R. Bishop, “Sparse Matrix-Vector Multiplication on GPGPU Clusters: A New Storage Format and a Scalable Implementation,” in 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops PhD Forum, 2012, pp. 1696–1702.
- M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, “A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors With Wide SIMD Units,” SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C401–C423, 2014.
- L. Yuan, Y. Zhang, X. Sun, and T. Wang, “Optimizing Sparse Matrix Vector Multiplication Using Diagonal Storage Matrix Format,” in 12th IEEE International Conference on High Performance Computing and Communications, HPCC, 1-3 September 2010, Melbourne, Australia. IEEE, 2010, pp. 585–590. [Online]. Available: https://doi.org/10.1109/HPCC.2010.67
- X. Sun, Y. Zhang, T. Wang, G. Long, X. Zhang, and Y. Li, “CRSD: Application Specific Auto-tuning of SpMV for Diagonal Sparse Matrices,” in Euro-Par Parallel Processing - 17th International Conference, Euro-Par, Bordeaux, France, August 29 - September 2, 2011, Proceedings, Part II, ser. Lecture Notes in Computer Science, vol. 6853. Springer, 2011, pp. 316–327. [Online]. Available: https://doi.org/10.1007/978-3-642-23397-5\_32
- X. Sun, Y. Zhang, T. Wang, X. Zhang, L. Yuan, and L. Rao, “Optimizing SpMV for Diagonal Sparse Matrices on GPU,” in International Conference on Parallel Processing, ICPP, Taipei, Taiwan, September 13-16, 2011. IEEE Computer Society, 2011, pp. 492–501. [Online]. Available: https://doi.org/10.1109/ICPP.2011.53
- A. Pinar and M. T. Heath, “Improving performance of sparse matrix-vector multiplication,” in Proceedings of the ACM/IEEE Conference on Supercomputing, SC 1999, November 13-19, 1999, Portland, Oregon, USA. ACM, 1999, p. 30. [Online]. Available: https://doi.org/10.1145/331532.331562
- Y. Li, P. Xie, X. Chen, J. Liu, B. Yang, S. Li, C. Gong, X. Gan, and H. Xu, “VBSF: a new storage format for SIMD sparse matrix-vector multiplication on modern processors,” The Journal of Supercomputing, vol. 76, no. 3, pp. 2063–2081, 2020. [Online]. Available: https://doi.org/10.1007/s11227-019-02835-4
- A. Ashari, N. Sedaghati, J. Eisenlohr, and P. Sadayappan, “An Efficient Two-Dimensional Blocking Strategy for Sparse Matrix-Vector Multiplication on GPUs,” in International Conference on Supercomputing, ICS’14, Muenchen, Germany, June 10-13, 2014, A. Bode, M. Gerndt, P. Stenström, L. Rauchwerger, B. P. Miller, and M. Schulz, Eds. ACM, 2014, pp. 273–282. [Online]. Available: https://doi.org/10.1145/2597652.2597678
- S. Jain-Mendon and R. Sass, “A case study of streaming storage format for sparse matrices,” in 2012 International Conference on Reconfigurable Computing and FPGAs, ReConFig 2012, Cancun, Mexico, December 5-7, 2012. IEEE, 2012, pp. 1–6. [Online]. Available: https://doi.org/10.1109/ReConFig.2012.6416788
- S. Yan, C. Li, Y. Zhang, and H. Zhou, “yaSpMV: Yet Another SpMV Framework on GPUs,” in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’14. New York, NY, USA: ACM, 2014, pp. 107–118. [Online]. Available: http://doi.acm.org/10.1145/2555243.2555255
- J. Godwin, J. Holewinski, and P. Sadayappan, “High-Performance Sparse Matrix-Vector Multiplication on GPUs for Structured Grid Computations,” in The 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, GPGPU-5, London, United Kingdom, March 3, D. R. Kaeli, J. Cavazos, and E. Sun, Eds. ACM, 2012, pp. 47–56. [Online]. Available: https://doi.org/10.1145/2159430.2159436
- J. Choi, A. Singh, and R. W. Vuduc, “Model-Driven Autotuning of Sparse Matrix-Vector Multiply on GPUs,” in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, Bangalore, India, January 9-14, 2010. ACM, 2010, pp. 115–126. [Online]. Available: https://doi.org/10.1145/1693453.1693471
- W. Liu and B. Vinter, “CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication,” in Proceedings of the 29th ACM on International Conference on Supercomputing, ICS’15, Newport Beach/Irvine, CA, USA, June 08 - 11, 2015, L. N. Bhuyan, F. Chong, and V. Sarkar, Eds. ACM, 2015, pp. 339–350. [Online]. Available: https://doi.org/10.1145/2751205.2751209
- H. Bian, J. Huang, R. Dong, L. Liu, and X. Wang, “CSR2: A New Format for SIMD-Accelerated SpMV,” in 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020, Melbourne, Australia, May 11-14, 2020. IEEE, 2020, pp. 350–359. [Online]. Available: https://doi.org/10.1109/CCGrid49817.2020.00-58
- K. Kourtis, V. Karakasis, G. I. Goumas, and N. Koziris, “CSX: an extended compression format for spmv on shared memory systems,” in Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2011, San Antonio, TX, USA, February 12-16, 2011, C. Cascaval and P. Yew, Eds. ACM, 2011, pp. 247–256. [Online]. Available: https://doi.org/10.1145/1941553.1941587
- B.-Y. Su and K. Keutzer, “ClSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs,” in Proceedings of the 26th ACM International Conference on Supercomputing, ser. ICS’12. New York, NY, USA: Association for Computing Machinery, 2012, p. 353–364. [Online]. Available: https://doi.org/10.1145/2304576.2304624
- N. Bell and M. Garland, “Efficient Sparse Matrix-Vector Multiplication on CUDA,” NVIDIA Corporation, NVIDIA Technical Report NVR-2008-004, Dec. 2008.
- H. Anzt, T. Cojean, C. Yen-Chen, J. J. Dongarra, G. Flegar, P. Nayak, S. Tomov, Y. M. Tsai, and W. Wang, “Load-Balancing Sparse Matrix Vector Product Kernels on GPUs,” ACM Transactions on Parallel Computing, vol. 7, no. 1, pp. 2:1–2:26, 2020. [Online]. Available: https://doi.org/10.1145/3380930
- A. Maringanti, V. Athavale, and S. B. Patkar, “Acceleration of Conjugate Gradient Method for Circuit Simulation Using CUDA,” in 16th International Conference on High Performance Computing, December 16-19, 2009, Kochi, India, Proceedings, Y. Yang, M. Parashar, R. Muralidhar, and V. K. Prasanna, Eds. IEEE Computer Society, 2009, pp. 438–444. [Online]. Available: https://doi.org/10.1109/HIPC.2009.5433184
- W. Yang, K. Li, and K. Li, “A parallel computing method using blocked format with optimal partitioning for spmv on GPU,” J. Comput. Syst. Sci., vol. 92, pp. 152–170, 2018. [Online]. Available: https://doi.org/10.1016/j.jcss.2017.09.010
- Y. Nagasaka, A. Nukada, and S. Matsuoka, “Adaptive Multi-Level Blocking Optimization for Sparse Matrix Vector Multiplication on GPU,” Procedia Computer Science, vol. 80, pp. 131–142, 2016, international Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, San Diego, California, USA. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S187705091630655X
- J. Bolz, I. Farmer, E. Grinspun, and P. Schröder, “Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid,” ACM Transactions on Graphics (TOG), vol. 22, no. 3, pp. 917–924, 2003.
- W. Yang, K. Li, Y. Liu, L. Shi, and L. Wan, “Optimization of Quasi-Diagonal Matrix-Vector Multiplication on GPU,” The International Journal of High Performance Computing Applications, vol. 28, no. 2, pp. 183–195, 2014. [Online]. Available: https://doi.org/10.1177/1094342013501126
- J. Gao, W. Ji, Z. Tan, Y. Wang, and F. Shi, “Taichi: A hybrid compression format for binary sparse matrix-vector multiplication on GPU,” IEEE Trans. Parallel Distributed Syst., vol. 33, no. 12, pp. 3732–3745, 2022. [Online]. Available: https://doi.org/10.1109/TPDS.2022.3170501
- W. T. Tang, R. Zhao, M. Lu, Y. Liang, H. P. Huyng, X. Li, and R. S. M. Goh, “Optimizing and auto-tuning scale-free sparse matrix-vector multiplication on intel xeon phi,” in Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015, San Francisco, CA, USA, February 07 - 11, 2015, K. Olukotun, A. Smith, R. Hundt, and J. Mars, Eds. IEEE Computer Society, 2015, pp. 136–145. [Online]. Available: https://doi.org/10.1109/CGO.2015.7054194
- A. Monakov and A. Avetisyan, “Implementing blocked sparse matrix-vector multiplication on NVIDIA gpus,” in Embedded Computer Systems: Architectures, Modeling, and Simulation, 9th International Workshop, SAMOS 2009, Samos, Greece, July 20-23, 2009. Proceedings, ser. Lecture Notes in Computer Science, K. Bertels, N. J. Dimopoulos, C. Silvano, and S. Wong, Eds., vol. 5657. Springer, 2009, pp. 289–297. [Online]. Available: https://doi.org/10.1007/978-3-642-03138-0\_32
- Z. Tan, W. Ji, J. Gao, Y. Zhao, A. Benatia, Y. Wang, and F. Shi, “MMSparse: 2D Partitioning of Sparse Matrix Based on Mathematical Morphology,” Future Generation Computer Systems, vol. 108, pp. 521 – 532, 2020. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167739X19327967
- Y. Niu, Z. Lu, M. Dong, Z. Jin, W. Liu, and G. Tan, “TileSpMV: A Tiled Algorithm for Sparse Matrix-Vector Multiplication on GPUs,” in 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021, Portland, OR, USA, May 17-21, 2021. IEEE, 2021, pp. 68–78. [Online]. Available: https://doi.org/10.1109/IPDPS49936.2021.00016
- J. Willcock and A. Lumsdaine, “Accelerating sparse matrix computations via data compression,” in Proceedings of the 20th Annual International Conference on Supercomputing, ICS 2006, Cairns, Queensland, Australia, June 28 - July 01, 2006, G. K. Egan and Y. Muraoka, Eds. ACM, 2006, pp. 307–316. [Online]. Available: https://doi.org/10.1145/1183401.1183444
- K. Kourtis, G. I. Goumas, and N. Koziris, “Optimizing sparse matrix-vector multiplication using index and value compression,” in Proceedings of the 5th Conference on Computing Frontiers, 2008, Ischia, Italy, May 5-7, 2008, A. Ramírez, G. Bilardi, and M. Gschwind, Eds. ACM, 2008, pp. 87–96. [Online]. Available: https://doi.org/10.1145/1366230.1366244
- W. T. Tang, W. J. Tan, R. Ray, Y. W. Wong, W. Chen, S. Kuo, R. S. M. Goh, S. J. Turner, and W. Wong, “Accelerating Sparse Matrix-Vector Multiplication on GPUs Using Bit-Representation-Optimized Schemes,” in International Conference for High Performance Computing, Networking, Storage and Analysis, SC’13, Denver, CO, USA - November 17 - 21, 2013, W. Gropp and S. Matsuoka, Eds. ACM, 2013, pp. 26:1–26:12. [Online]. Available: https://doi.org/10.1145/2503210.2503234
- W. T. Tang, W. J. Tan, R. S. M. Goh, S. J. Turner, and W.-F. Wong, “A family of bit-representation-optimized formats for fast sparse matrix-vector multiplication on the gpu,” IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 9, pp. 2373–2385, 2015.
- S. Kestur, J. D. Davis, and E. S. Chung, “Towards a universal FPGA matrix-vector multiplication architecture,” in 2012 IEEE 20th Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2012, 29 April - 1 May 2012, Toronto, Ontario, Canada. IEEE Computer Society, 2012, pp. 9–16. [Online]. Available: https://doi.org/10.1109/FCCM.2012.12
- J. M. Mellor-Crummey and J. Garvin, “Optimizing sparse matrix - vector product computations using unroll and jam,” Int. J. High Perform. Comput. Appl., vol. 18, no. 2, pp. 225–236, 2004. [Online]. Available: https://doi.org/10.1177/1094342004038951
- F. Vázquez, J. Fernández, and E. M. Garzón, “A new approach for sparse matrix vector product on NVIDIA gpus,” Concurr. Comput. Pract. Exp., vol. 23, no. 8, pp. 815–826, 2011. [Online]. Available: https://doi.org/10.1002/cpe.1658
- E. J. Anderson and Y. Saad, “Solving sparse triangular linear systems on parallel computers,” Int. J. High Speed Comput., vol. 1, no. 1, pp. 73–95, 1989. [Online]. Available: https://doi.org/10.1142/S0129053389000056
- R. G. Melhem, “Parallel solution of linear systems with striped sparse matrices,” Parallel Comput., vol. 6, no. 2, pp. 165–184, 1988. [Online]. Available: https://doi.org/10.1016/0167-8191(88)90082-8
- E. Montagne and A. Ekambaram, “An optimal storage format for sparse matrices,” Inf. Process. Lett., vol. 90, no. 2, pp. 87–92, 2004. [Online]. Available: https://doi.org/10.1016/j.ipl.2004.01.014
- M. M. Baskaran and R. Bordawekar, “Optimizing Sparse Matrix-Vector Multiplication on GPUs Using Compile-Time and Run-Time Strategies,” IBM Reserach Report, RC24704 (W0812-047), 2008.
- A. H. E. Zein and A. P. Rendell, “From Sparse Matrix to Optimal GPU CUDA Sparse Matrix Vector Product Implementation,” in 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, CCGrid, 17-20 May 2010, Melbourne, Victoria, Australia. IEEE Computer Society, 2010, pp. 808–813. [Online]. Available: https://doi.org/10.1109/CCGRID.2010.81
- ——, “Generating Optimal CUDA Sparse Matrix-Vector Product Implementations for Evolving GPU Hardware,” Concurrency Computation: Practice and Experience, vol. 24, no. 1, pp. 3–13, 2012. [Online]. Available: https://doi.org/10.1002/cpe.1732
- S. Dalton, N. Bell, L. Olson, and M. Garland, “Cusp: Generic Parallel Algorithms for Sparse Matrix and Graph Computations,” 2014, version 0.5.0. [Online]. Available: http://cusplibrary.github.io/
- D. Mukunoki and D. Takahashi, “Optimization of Sparse Matrix-Vector Multiplication for CRS Format on NVIDIA Kepler Architecture GPUs,” in Computational Science and Its Applications - ICCSA 13th International Conference, Ho Chi Minh City, Vietnam, June 24-27, 2013, Proceedings, Part V, ser. Lecture Notes in Computer Science, vol. 7975. Springer, 2013, pp. 211–223. [Online]. Available: https://doi.org/10.1007/978-3-642-39640-3\_15
- I. Reguly and M. Giles, “Efficient Sparse Matrix-Vector Multiplication on Cache-Based GPUs,” in 2012 Innovative Parallel Computing (InPar), May 2012, pp. 1–12.
- H. Yoshizawa and D. Takahashi, “Automatic Tuning of Sparse Matrix-Vector Multiplication for CRS Format on GPUs,” in 15th IEEE International Conference on Computational Science and Engineering, CSE 2012, Paphos, Cyprus, December 5-7, 2012. IEEE Computer Society, 2012, pp. 130–136. [Online]. Available: https://doi.org/10.1109/ICCSE.2012.28
- Z. Koza, M. Matyka, S. Szkoda, and Ł. Mirosław, “Compressed Multirow Storage Format for Sparse Matrices on Graphics Processing Units,” SIAM Journal on Scientific Computing, vol. 36, no. 2, pp. C219–C239, 2014. [Online]. Available: https://doi.org/10.1137/120900216
- H. Anzt, T. Cojean, G. Flegar, F. Göbel, T. Grützmacher, P. Nayak, T. Ribizel, Y. M. Tsai, and E. S. Quintana-Ortí, “Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing,” ACM Transactions on Mathematical Software, vol. 48, no. 1, pp. 2:1–2:33, 2022. [Online]. Available: https://doi.org/10.1145/3480935
- G. Flegar and H. Anzt, “Overcoming load imbalance for irregular sparse matrices,” in Proceedings of the Seventh Workshop on Irregular Applications: Architectures and Algorithms, IA3@SC 2017, Denver, CO, USA, November 12 - 17, 2017. ACM, 2017, pp. 2:1–2:8. [Online]. Available: https://doi.org/10.1145/3149704.3149767
- S. Liu, Y. Zhang, X. Sun, and R. Qiu, “Performance Evaluation of Multithreaded Sparse Matrix-Vector Multiplication Using OpenMP,” in 11th IEEE International Conference on High Performance Computing and Communications, HPCC, 25-27 June 2009, Seoul, Korea. IEEE, 2009, pp. 659–665. [Online]. Available: https://doi.org/10.1109/HPCC.2009.75
- X. Feng, H. Jin, R. Zheng, K. Hu, J. Zeng, and Z. Shao, “Optimization of sparse matrix-vector multiplication with variant CSR on gpus,” in 17th IEEE International Conference on Parallel and Distributed Systems, ICPADS, Tainan, Taiwan, December 7-9, 2011. IEEE Computer Society, 2011, pp. 165–172. [Online]. Available: https://doi.org/10.1109/ICPADS.2011.91
- J. L. Greathouse and M. Daga, “Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format,” in SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2014, pp. 769–780.
- M. Daga and J. L. Greathouse, “Structural Agnostic SpMV: Adapting CSR-Adaptive for Irregular Matrices,” in 2015 IEEE 22nd International Conference on High Performance Computing (HiPC), 2015, pp. 64–74.
- J. Gao, W. Ji, J. Liu, S. Shao, Y. Wang, and F. Shi, “AMF-CSR: adaptive multi-row folding of CSR for spmv on GPU,” in 27th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2021, Beijing, China, December 14-16, 2021. IEEE, 2021, pp. 418–425. [Online]. Available: https://doi.org/10.1109/ICPADS53394.2021.00058
- Y. Liu and B. Schmidt, “LightSpMV: Faster CSR-Based Sparse Matrix-Vector Multiplication on CUDA-Enabled GPUs,” in 26th IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP 2015, Toronto, ON, Canada, July 27-29, 2015. IEEE Computer Society, 2015, pp. 82–89. [Online]. Available: https://doi.org/10.1109/ASAP.2015.7245713
- ——, “LightSpMV: Faster CUDA-Compatible Sparse Matrix-Vector Multiplication Using Compressed Sparse Rows,” Journal of Signal Processing Systems, vol. 90, no. 1, pp. 69–86, 2018.
- D. Merrill and M. Garland, “Merge-Based Parallel Sparse Matrix-Vector Multiplication,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016, Salt Lake City, UT, USA, November 13-18, 2016, J. West and C. M. Pancake, Eds. IEEE Computer Society, 2016, pp. 678–689. [Online]. Available: https://doi.org/10.1109/SC.2016.57
- G. Flegar and E. S. Quintana-Ortí, “Balanced CSR Sparse Matrix-Vector Product on Graphics Processors,” in Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28 - September 1, 2017, Proceedings, ser. Lecture Notes in Computer Science, F. F. Rivera, T. F. Pena, and J. C. Cabaleiro, Eds., vol. 10417. Springer, 2017, pp. 697–709. [Online]. Available: https://doi.org/10.1007/978-3-319-64203-1\_50
- M. Steinberger, R. Zayer, and H. Seidel, “Globally Homogeneous, Locally Adaptive Sparse Matrix-Vector Multiplication on the GPU,” in Proceedings of the International Conference on Supercomputing, ICS 2017, Chicago, IL, USA, June 14-16, 2017, W. D. Gropp, P. Beckman, Z. Li, and F. J. Cazorla, Eds. ACM, 2017, pp. 13:1–13:11. [Online]. Available: https://doi.org/10.1145/3079079.3079086
- E. Im and K. A. Yelick, “Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY,” in Computational Science - ICCS 2001, International Conference, San Francisco, CA, USA, May 28-30, 2001. Proceedings, Part I, ser. Lecture Notes in Computer Science, V. N. Alexandrov, J. J. Dongarra, B. A. Juliano, R. S. Renner, and C. J. K. Tan, Eds., vol. 2073. Springer, 2001, pp. 127–136. [Online]. Available: https://doi.org/10.1007/3-540-45545-0\_22
- E. Im, K. A. Yelick, and R. W. Vuduc, “Sparsity: Optimization Framework for Sparse Matrix Kernels,” The International Journal of High Performance Computing Applications, vol. 18, no. 1, pp. 135–158, 2004. [Online]. Available: https://doi.org/10.1177/1094342004041296
- R. Vuduc, J. Demmel, K. A. Yelick, S. Kamil, R. Nishtala, and B. C. Lee, “Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply,” in Proceedings of the 2002 ACM/IEEE conference on Supercomputing, Baltimore, Maryland, USA, November 16-22, 2002, CD-ROM, R. C. Giles, D. A. Reed, and K. Kelley, Eds. IEEE Computer Society, 2002, pp. 35:1–35:35. [Online]. Available: https://doi.org/10.1109/SC.2002.10025
- R. Vuduc, J. W. Demmel, and K. A. Yelick, “OSKI: A Library of Automatically Tuned Sparse Matrix Kernels,” in Journal of Physics: Conference Series, vol. 16, no. 1. IOP Publishing, 2005, p. 521.
- R. Nishtala, R. W. Vuduc, J. Demmel, and K. A. Yelick, “When cache blocking of sparse matrix vector multiply works and why,” Appl. Algebra Eng. Commun. Comput., vol. 18, no. 3, pp. 297–311, 2007. [Online]. Available: https://doi.org/10.1007/s00200-007-0038-9
- X. Zhang, Y. Zhang, X. Sun, F. Liu, S. Liu, Y. Tang, and Y. Li, “Automatic performance tuning of spmv on gpgpu,” HPC Asia, Kaohsiung, Taiwan, China, pp. 173–179, 2009.
- P. Guo, L. Wang, and P. Chen, “A Performance Modeling and Optimization Analysis Tool for Sparse Matrix-Vector Multiplication on GPUs,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 5, pp. 1112–1123, 2014. [Online]. Available: https://doi.org/10.1109/TPDS.2013.123
- K. Li, W. Yang, and K. Li, “Performance Analysis and Optimization for SpMV on GPU Using Probabilistic Modeling,” IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 1, pp. 196–205, 2015.
- F. Vázquez, J. Fernández, and E. M. Garzón, “Automatic Tuning of the Sparse Matrix Vector Product on GPUs Based on the ELLR-T Approach,” Parallel Computing, vol. 38, no. 8, pp. 408–420, 2012. [Online]. Available: https://doi.org/10.1016/j.parco.2011.08.003
- F. Vázquez, G. O. López, J. Fernández, and E. M. Garzón, “Improving the performance of the sparse matrix vector product with gpus,” in 10th IEEE International Conference on Computer and Information Technology, CIT 2010, Bradford, West Yorkshire, UK, June 29-July 1, 2010. IEEE Computer Society, 2010, pp. 1146–1151. [Online]. Available: https://doi.org/10.1109/CIT.2010.208
- S. Li, C. Hu, J. Zhang, and Y. Zhang, “Automatic tuning of sparse matrix-vector multiplication on multicore clusters,” Sci. China Inf. Sci., vol. 58, no. 9, pp. 1–14, 2015. [Online]. Available: https://doi.org/10.1007/s11432-014-5254-x
- Y. Chen, G. Xiao, Z. Xiao, and W. Yang, “hpspmv: A heterogeneous parallel computing scheme for spmv on the sunway taihulight supercomputer,” in 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019, Zhangjiajie, China, August 10-12, 2019, Z. Xiao, L. T. Yang, P. Balaji, T. Li, K. Li, and A. Y. Zomaya, Eds. IEEE, 2019, pp. 989–995. [Online]. Available: https://doi.org/10.1109/HPCC/SmartCity/DSS.2019.00142
- P. Guo and L. Wang, “Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs,” in International Conference on Computational and Information Sciences. IEEE, 2010, pp. 1154–1157.
- W. A. Abu-Sufah and A. A. Karim, “Auto-tuning of sparse matrix-vector multiplication on graphics processors,” in Supercomputing - 28th International Supercomputing Conference, ISC 2013, Leipzig, Germany, June 16-20, 2013. Proceedings, ser. Lecture Notes in Computer Science, J. M. Kunkel, T. Ludwig, and H. W. Meuer, Eds., vol. 7905. Springer, 2013, pp. 151–164. [Online]. Available: https://doi.org/10.1007/978-3-642-38750-0\_12
- ——, “An effective approach for implementing sparse matrix-vector multiplication on graphics processing units,” in 14th IEEE International Conference on High Performance Computing and Communication & 9th IEEE International Conference on Embedded Software and Systems, HPCC-ICESS 2012, Liverpool, United Kingdom, June 25-27, 2012, G. Min, J. Hu, L. C. Liu, L. T. Yang, S. Seelam, and L. Lefèvre, Eds. IEEE Computer Society, 2012, pp. 453–460. [Online]. Available: https://doi.org/10.1109/HPCC.2012.68
- W. Armstrong and A. P. Rendell, “Reinforcement Learning for Automated Performance Tuning: Initial Evaluation for Sparse Matrix Format Selection,” in 2008 IEEE International Conference on Cluster Computing, 2008, pp. 411–420.
- J. Li, G. Tan, M. Chen, and N. Sun, “SMAT: An Input Adaptive Auto-Tuner for Sparse Matrix-Vector Multiplication,” ACM SIGPLAN Notices, vol. 48, no. 6, p. 117–126, jun 2013. [Online]. Available: https://doi.org/10.1145/2499370.2462181
- N. Sedaghati, T. Mu, L. Pouchet, S. Parthasarathy, and P. Sadayappan, “Automatic selection of sparse matrix representation on gpus,” in Proceedings of the 29th ACM on International Conference on Supercomputing, ICS’15, Newport Beach/Irvine, CA, USA, June 08 - 11, 2015, L. N. Bhuyan, F. Chong, and V. Sarkar, Eds. ACM, 2015, pp. 99–108. [Online]. Available: https://doi.org/10.1145/2751205.2751244
- S. Chen, J. Fang, D. Chen, C. Xu, and Z. Wang, “Adaptive Optimization of Sparse Matrix-Vector Multiplication on Emerging Many-Core Architectures,” in 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2018, pp. 649–658.
- A. Benatia, W. Ji, Y. Wang, and F. Shi, “Sparse Matrix Format Selection with Multiclass SVM for SpMV on GPU,” in 2016 45th International Conference on Parallel Processing (ICPP), 2016, pp. 496–505.
- I. Mehrez, O. Hamdi-Larbi, T. Dufaud, and N. Emad, “Machine Learning for Optimal Compression Format Prediction on Multiprocessor Platform,” in 2018 International Conference on High Performance Computing Simulation (HPCS), 2018, pp. 213–220.
- K. Hou, W. Feng, and S. Che, “Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors,” in 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017, pp. 713–722.
- A. Benatia, W. Ji, Y. Wang, and F. Shi, “BestSF: A Sparse Meta-Format for Optimizing SpMV on GPU,” ACM Transactions on Architecture and Code Optimization, vol. 15, no. 3, Sep. 2018. [Online]. Available: https://doi.org/10.1145/3226228
- O. Hamdi-Larbi, I. Mehrez, and T. Dufaud, “Machine learning to design an auto-tuning system for the best compressed format detection for parallel sparse computations,” Parallel Process. Lett., vol. 31, no. 4, pp. 2150019:1–2150019:37, 2021. [Online]. Available: https://doi.org/10.1142/S0129626421500195
- I. Nisa, C. Siegel, A. S. Rajam, A. Vishnu, and P. Sadayappan, “Effective Machine Learning Based Format Selection and Performance Modeling for SpMV on GPUs,” in 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2018, pp. 1056–1065.
- H. Cui, S. Hirasawa, H. Kobayashi, and H. Takizawa, “A machine learning-based approach for selecting spmv kernels and matrix storage formats,” IEICE Trans. Inf. Syst., vol. 101-D, no. 9, pp. 2307–2314, 2018. [Online]. Available: https://doi.org/10.1587/transinf.2017EDP7176
- Y. Zhao, J. Li, C. Liao, and X. Shen, “Bridging the Gap Between Deep Learning and Sparse Matrix Format Selection,” in Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 94–108. [Online]. Available: https://doi.org/10.1145/3178487.3178495
- W. Zhou, Y. Zhao, X. Shen, and W. Chen, “Enabling Runtime SpMV Format Selection through an Overhead Conscious Method,” IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 1, pp. 80–93, 2020.
- A. Elafrou, G. Goumas, and N. Koziris, “BASMAT: Bottleneck-Aware Sparse Matrix-Vector Multiplication Auto-Tuning on GPGPUs,” in Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP), Washington, District of Columbia. New York, NY, USA: Association for Computing Machinery, 2019, pp. 423–424. [Online]. Available: https://doi.org/10.1145/3293883.3301490
- E. Dufrechou, P. Ezzatti, and E. S. Quintana-Ortí, “Selecting Optimal SpMV Realizations for GPUs via Machine Learning,” The International Journal of High Performance Computing Applications, vol. 35, no. 3, 2021. [Online]. Available: https://doi.org/10.1177/1094342021990738
- G. Xiao, T. Zhou, Y. Chen, Y. Hu, and K. Li, “Dtspmv: An adaptive spmv framework for graph analysis on gpus,” in 24th IEEE Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application, HPCC/DSS/SmartCity/DependSys 2022, Hainan, China, December 18-20, 2022. IEEE, 2022, pp. 35–42. [Online]. Available: https://doi.org/10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00039
- S. Usman, R. Mehmood, I. Katib, A. Albeshri, and S. M. Altowaijri, “Zaki: A smart method and tool for automatic performance optimization of parallel spmv computations on distributed memory machines,” Mobile Networks and Applications, pp. 1–20, 2019.
- S. Usman, R. Mehmood, I. A. Katib, and A. Albeshri, “ZAKI+: A machine learning based process mapping tool for spmv computations on distributed memory architectures,” IEEE Access, vol. 7, pp. 81279–81296, 2019. [Online]. Available: https://doi.org/10.1109/ACCESS.2019.2923565
- M. Ahmed, S. Usman, N. A. Shah, M. U. Ashraf, A. M. Alghamdi, A. A. Bahadded, and K. A. Almarhabi, “AAQAL: A Machine Learning-Based Tool for Performance Optimization of Parallel SpMV Computations Using Block CSR,” Applied Sciences, vol. 12, no. 14, p. 7073, 2022.
- J. Gao, W. Ji, J. Liu, Y. Wang, and F. Shi, “Revisiting thread configuration of spmv kernels on gpu: A machine learning based approach,” Journal of Parallel and Distributed Computing, vol. 185, p. 104799, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0743731523001697
- A. Benatia, W. Ji, Y. Wang, and F. Shi, “Machine Learning Approach for the Predicting Performance of SpMV on GPU,” in 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS), 2016, pp. 894–901.
- ——, “Sparse Matrix Partitioning for Optimizing SpMV on CPU-GPU Heterogeneous Platforms,” The International Journal of High Performance Computing Applications, vol. 34, no. 1, pp. 66–80, 2020. [Online]. Available: https://doi.org/10.1177/1094342019886628
- M. Barreda, M. F. Dolz, M. A. Castaño, P. Alonso-Jordá, and E. S. Quintana-Orti, “Performance Modeling of the Sparse Matrix-Vector Product via Convolutional Neural Networks,” The Journal of Supercomputing, vol. 76, no. 11, pp. 8883–8900, 2020.
- M. Barreda, M. F. Dolz, and M. A. Castano, “Convolutional Neural Nets for Estimating the Run Time and Energy Consumption of the Sparse Matrix-Vector Product,” The International Journal of High Performance Computing Applications, vol. 35, no. 3, pp. 268–281, 2021.
- E. C. Carson and N. J. Higham, “Accelerating the solution of linear systems by iterative refinement in three precisions,” SIAM J. Sci. Comput., vol. 40, no. 2, 2018. [Online]. Available: https://doi.org/10.1137/17M1140819
- S. Gratton, E. Simon, D. Titley-Péloquin, and P. L. Toint, “Exploiting variable precision in GMRES,” CoRR, vol. abs/1907.10550, 2019. [Online]. Available: http://arxiv.org/abs/1907.10550
- J. I. Aliaga, H. Anzt, T. Grützmacher, E. S. Quintana-Ortí, and A. E. Tomás, “Compressed basis GMRES on high-performance graphics processing units,” Int. J. High Perform. Comput. Appl., vol. 37, no. 2, pp. 82–100, 2023. [Online]. Available: https://doi.org/10.1177/10943420221115140
- J. A. Loe, C. A. Glusa, I. Yamazaki, E. G. Boman, and S. Rajamanickam, “A study of mixed precision strategies for gmres on gpus,” arXiv preprint arXiv:2109.01232, 2021.
- M. Le Gallo, A. Sebastian, R. Mathis, M. Manica, H. Giefers, T. Tuma, C. Bekas, A. Curioni, and E. Eleftheriou, “Mixed-precision in-memory computing,” Nature Electronics, vol. 1, no. 4, pp. 246–253, 2018.
- N. Lindquist, P. Luszczek, and J. J. Dongarra, “Improving the performance of the GMRES method using mixed-precision techniques,” in Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI - 17th Smoky Mountains Computational Sciences and Engineering Conference, SMC 2020, Oak Ridge, TN, USA, August 26-28, 2020, Revised Selected Papers, ser. Communications in Computer and Information Science, J. Nichols, B. Verastegui, A. B. Maccabe, O. R. Hernandez, S. Parete-Koon, and T. Ahearn, Eds., vol. 1315. Springer, 2020, pp. 51–66. [Online]. Available: https://doi.org/10.1007/978-3-030-63393-6_4
- K. Ahmad, H. Sundar, and M. W. Hall, “Data-driven mixed precision sparse matrix vector multiplication for gpus,” ACM Trans. Archit. Code Optim., vol. 16, no. 4, pp. 51:1–51:24, 2020. [Online]. Available: https://doi.org/10.1145/3371275
- D. Mukunoki and T. Ogita, “Performance and energy consumption of accurate and mixed-precision linear algebra kernels on gpus,” J. Comput. Appl. Math., vol. 372, p. 112701, 2020. [Online]. Available: https://doi.org/10.1016/j.cam.2019.112701
- E. Tezcan, T. Torun, F. Kosar, K. Kaya, and D. Unat, “Mixed and multi-precision spmv for gpus with row-wise precision selection,” in 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Bordeaux, France, November 2-5, 2022. IEEE, 2022, pp. 31–40. [Online]. Available: https://doi.org/10.1109/SBAC-PAD55451.2022.00014
- S. Graillat, F. Jézéquel, T. Mary, and R. Molina, “Adaptive precision matrix-vector product,” Feb. 2022, working paper or preprint. [Online]. Available: https://hal.science/hal-03561193
- K. Isupov, “Multiple-precision sparse matrix-vector multiplication on gpus,” J. Comput. Sci., vol. 61, p. 101609, 2022. [Online]. Available: https://doi.org/10.1016/j.jocs.2022.101609
- T. Kouya, “A highly efficient implementation of multiple precision sparse matrix-vector multiplication and its application to product-type krylov subspace methods,” CoRR, vol. abs/1411.2377, 2014. [Online]. Available: http://arxiv.org/abs/1411.2377
- H. Pabst, B. Bachmayer, and M. Klemm, “Performance of a structure-detecting spmv using the CSR matrix representation,” in 11th International Symposium on Parallel and Distributed Computing, ISPDC 2012, Munich, Germany, June 25-29, 2012, M. Bader, H. Bungartz, D. Grigoras, M. Mehl, R. Mundani, and R. Potolea, Eds. IEEE Computer Society, 2012, pp. 3–10. [Online]. Available: https://doi.org/10.1109/ISPDC.2012.9
- Y. Zhang, W. Yang, K. Li, D. Tang, and K. Li, “Performance analysis and optimization for spmv based on aligned storage formats on an ARM processor,” J. Parallel Distributed Comput., vol. 158, pp. 126–137, 2021. [Online]. Available: https://doi.org/10.1016/j.jpdc.2021.08.002
- S. Williams, L. Oliker, R. W. Vuduc, J. Shalf, K. A. Yelick, and J. Demmel, “Optimization of sparse matrix-vector multiplication on emerging multicore platforms,” in Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, SC 2007, November 10-16, 2007, Reno, Nevada, USA, B. Verastegui, Ed. ACM Press, 2007, p. 38. [Online]. Available: https://doi.org/10.1145/1362622.1362674
- ——, “Optimization of sparse matrix-vector multiplication on emerging multicore platforms,” Parallel Comput., vol. 35, no. 3, pp. 178–194, 2009. [Online]. Available: https://doi.org/10.1016/j.parco.2008.12.006
- O. Kislal, W. Ding, M. Kandemir, and I. Demirkiran, “Optimizing sparse matrix vector multiplication on emerging multicores,” in 2013 IEEE 6th International Workshop on Multi-/Many-core Computing Systems (MuCoCoS). IEEE, 2013, pp. 1–10.
- E. Yuan, Y. Zhang, and X. Sun, “Memory access complexity analysis of spmv in RAM(h) model,” in 10th IEEE International Conference on High Performance Computing and Communications, HPCC 2008, 25-27 Sept. 2008, Dalian, China. IEEE Computer Society, 2008, pp. 913–920. [Online]. Available: https://doi.org/10.1109/HPCC.2008.130
- B. C. Lee, R. Vuduc, J. W. Demmel, K. A. Yelick, M. deLorimier, and L. Zhong, “Performance optimizations and bounds for sparse symmetric matrix-multiple vector multiply,” University of California, Berkeley, Berkeley, CA, USA, Tech. Rep. UCB/CSD-03-1297, 2003.
- A. Elafrou, G. I. Goumas, and N. Koziris, “Performance analysis and optimization of sparse matrix-vector multiplication on modern multi- and many-core processors,” in 46th International Conference on Parallel Processing, ICPP 2017, Bristol, United Kingdom, August 14-17, 2017. IEEE Computer Society, 2017, pp. 292–301. [Online]. Available: https://doi.org/10.1109/ICPP.2017.38
- J. D. Trotter, S. Ekmekçibasi, J. Langguth, T. Torun, E. Düzakin, A. Ilic, and D. Unat, “Bringing order to sparsity: A sparse matrix reordering study on multicore cpus,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023, Denver, CO, USA, November 12-17, 2023, D. Arnold, R. M. Badia, and K. M. Mohror, Eds. ACM, 2023, pp. 31:1–31:13. [Online]. Available: https://doi.org/10.1145/3581784.3607046
- X. Yu, H. Ma, Z. Qu, J. Fang, and W. Liu, “Numa-aware optimization of sparse matrix-vector multiplication on armv8-based many-core architectures,” in Network and Parallel Computing - 17th IFIP WG 10.3 International Conference, NPC 2020, Zhengzhou, China, September 28-30, 2020, Revised Selected Papers, ser. Lecture Notes in Computer Science, X. He, E. Shao, and G. Tan, Eds., vol. 12639. Springer, 2020, pp. 231–242. [Online]. Available: https://doi.org/10.1007/978-3-030-79478-1_20
- J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable Parallel Programming With CUDA: Is CUDA the Parallel Programming Model That Application Developers Have Been Waiting For?” Queue, vol. 6, no. 2, pp. 40–53, Mar. 2008.
- M. M. Baskaran and R. Bordawekar, “Optimizing Sparse Matrix-Vector Multiplication on GPUs,” IBM Research Report RC24704, no. W0812–047, 2009.
- Y. Deng, B. D. Wang, and S. Mu, “Taming irregular EDA applications on gpus,” in 2009 International Conference on Computer-Aided Design, ICCAD 2009, San Jose, CA, USA, November 2-5, 2009, J. S. Roychowdhury, Ed. ACM, 2009, pp. 539–546. [Online]. Available: https://doi.org/10.1145/1687399.1687501
- K. He, S. X. Tan, H. Zhao, X. Liu, H. Wang, and G. Shi, “Parallel GMRES solver for fast analysis of large linear dynamic systems on GPU platforms,” Integr., vol. 52, pp. 10–22, 2016. [Online]. Available: https://doi.org/10.1016/j.vlsi.2015.07.005
- E. Karimi, N. B. Agostini, S. Dong, and D. R. Kaeli, “VCSR: an efficient GPU memory-aware sparse format,” IEEE Trans. Parallel Distributed Syst., vol. 33, no. 10, pp. 3977–3989, 2022. [Online]. Available: https://doi.org/10.1109/TPDS.2022.3177291
- Y. Lu and W. Liu, “DASP: specific dense matrix multiply-accumulate units accelerated general sparse matrix-vector multiplication,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023, Denver, CO, USA, November 12-17, 2023, D. Arnold, R. M. Badia, and K. M. Mohror, Eds. ACM, 2023, pp. 73:1–73:14. [Online]. Available: https://doi.org/10.1145/3581784.3607051
- D. Grewe and A. Lokhmotov, “Automatically generating and tuning GPU code for sparse matrix-vector multiplication from a high-level representation,” in Proceedings of 4th Workshop on General Purpose Processing on Graphics Processing Units, GPGPU 2011, Newport Beach, CA, USA, March 5, 2011. ACM, 2011, p. 12. [Online]. Available: https://doi.org/10.1145/1964179.1964196
- A. Cevahir, A. Nukada, and S. Matsuoka, “Fast Conjugate Gradients with Multiple GPUs,” in Computational Science - ICCS, 9th International Conference, Baton Rouge, LA, USA, May 25-27, 2009, Proceedings, Part I, ser. Lecture Notes in Computer Science, G. Allen, J. Nabrzyski, E. Seidel, G. D. van Albada, J. J. Dongarra, and P. M. A. Sloot, Eds., vol. 5544. Springer, 2009, pp. 893–903. [Online]. Available: https://doi.org/10.1007/978-3-642-01970-8_90
- P. Guo and C. Zhang, “Performance Optimization for SpMV on Multi-GPU Systems Using Threads and Multiple Streams,” in International Symposium on Computer Architecture and High Performance Computing Workshops, SBAC-PAD Workshops, Los Angeles, CA, USA, October 26-28, 2016. IEEE Computer Society, 2016, pp. 67–72. [Online]. Available: https://doi.org/10.1109/SBAC-PADW.2016.20
- M. Karwacki, B. Bylina, and J. Bylina, “Multi-GPU Implementation of the Uniformization Method for Solving Markov Models,” in Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE, 2012, pp. 533–537.
- M. Verschoor and A. C. Jalba, “Analysis and Performance Estimation of the Conjugate Gradient Method on Multiple GPUs,” Parallel Computing, vol. 38, no. 10-11, pp. 552–575, 2012. [Online]. Available: https://doi.org/10.1016/j.parco.2012.07.002
- B. Yang, H. Liu, and Z. Chen, “Preconditioned GMRES Solver on Multiple-GPU Architecture,” Computers and Mathematics with Applications, vol. 72, no. 4, pp. 1076–1095, 2016. [Online]. Available: https://doi.org/10.1016/j.camwa.2016.06.027
- G. Karypis and V. Kumar, “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs,” SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359–392, 1998. [Online]. Available: https://doi.org/10.1137/S1064827595287997
- J. Gao, Y. Wang, J. Wang, and R. Liang, “Adaptive Optimization Modeling of Preconditioned Conjugate Gradient on Multi-GPUs,” ACM Transactions on Parallel Computing, vol. 3, no. 3, pp. 16:1–16:33, 2016. [Online]. Available: https://doi.org/10.1145/2990849
- J. Gao, Y. Zhou, G. He, and Y. Xia, “A Multi-GPU Parallel Optimization Model for the Preconditioned Conjugate Gradient Algorithm,” Parallel Computing, vol. 63, pp. 1–16, 2017. [Online]. Available: https://doi.org/10.1016/j.parco.2017.04.003
- J. Gao, Y. Wang, and J. Wang, “A Novel Multi-Graphics Processing Unit Parallel Optimization Framework for the Sparse Matrix-Vector Multiplication,” Concurrency Computation Practice and Experience, vol. 29, no. 5, 2017. [Online]. Available: https://doi.org/10.1002/cpe.3936
- C. Li, M. Tang, R. Tong, M. Cai, J. Zhao, and D. Manocha, “P-cloth: interactive complex cloth simulation on multi-gpu systems using dynamic matrix assembly and pipelined implicit integrators,” ACM Trans. Graph., vol. 39, no. 6, pp. 180:1–180:15, 2020. [Online]. Available: https://doi.org/10.1145/3414685.3417763
- J. Chen, C. Xie, J. S. Firoz, J. Li, S. L. Song, K. J. Barker, M. Raugas, and A. Li, “MSREP: A fast yet light sparse matrix framework for multi-gpu systems,” CoRR, vol. abs/2209.07552, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2209.07552
- D. Schaa and D. R. Kaeli, “Exploring the Multiple-GPU Design Space,” in 23rd IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2009, Rome, Italy, May 23-29, 2009. IEEE, 2009, pp. 1–12. [Online]. Available: https://doi.org/10.1109/IPDPS.2009.5161068
- A. Abdelfattah, H. Ltaief, and D. E. Keyes, “High Performance Multi-GPU SpMV for Multi-Component PDE-Based Applications,” in Euro-Par: Parallel Processing - 21st International Conference on Parallel and Distributed Computing, Vienna, Austria, August 24-28, 2015, Proceedings, ser. Lecture Notes in Computer Science, J. L. Träff, S. Hunold, and F. Versaci, Eds., vol. 9233. Springer, 2015, pp. 601–612. [Online]. Available: https://doi.org/10.1007/978-3-662-48096-0_46
- Y. Shan, T. Wu, Y. Wang, B. Wang, Z. Wang, N. Xu, and H. Yang, “FPGA and GPU implementation of large scale spmv,” in IEEE 8th Symposium on Application Specific Processors, SASP 2010, Anaheim, CA, USA, June 13-14, 2010. IEEE Computer Society, 2010, pp. 64–70. [Online]. Available: https://doi.org/10.1109/SASP.2010.5521144
- Y. Umuroglu and M. Jahre, “An energy efficient column-major backend for FPGA spmv accelerators,” in 32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea, October 19-22, 2014. IEEE Computer Society, 2014, pp. 432–439. [Online]. Available: https://doi.org/10.1109/ICCD.2014.6974716
- J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, “A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication,” in 22nd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2014, Boston, MA, USA, May 11-13, 2014. IEEE Computer Society, 2014, pp. 36–43. [Online]. Available: https://doi.org/10.1109/FCCM.2014.23
- J. Naher, C. Gloster, C. C. Doss, and S. S. Jadhav, “Using machine learning to estimate utilization and throughput for opencl-based matrix-vector multiplication (MVM),” in 10th Annual Computing and Communication Workshop and Conference, CCWC 2020, Las Vegas, NV, USA, January 6-8, 2020. IEEE, 2020, pp. 365–372. [Online]. Available: https://doi.org/10.1109/CCWC47524.2020.9031173
- Y. Umuroglu and M. Jahre, “A vector caching scheme for streaming FPGA spmv accelerators,” in Applied Reconfigurable Computing - 11th International Symposium, ARC 2015, Bochum, Germany, April 13-17, 2015, Proceedings, ser. Lecture Notes in Computer Science, K. Sano, D. Soudris, M. Hübner, and P. C. Diniz, Eds., vol. 9040. Springer, 2015, pp. 15–26. [Online]. Available: https://doi.org/10.1007/978-3-319-16214-0_2
- ——, “Random access schemes for efficient FPGA spmv acceleration,” Microprocess. Microsystems, vol. 47, pp. 321–332, 2016. [Online]. Available: https://doi.org/10.1016/j.micpro.2016.02.015
- F. Sadi, J. Sweeney, T. M. Low, J. C. Hoe, L. T. Pileggi, and F. Franchetti, “Efficient spmv operation for large and highly sparse matrices using scalable multi-way merge parallelization,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2019, Columbus, OH, USA, October 12-16, 2019. ACM, 2019, pp. 347–358. [Online]. Available: https://doi.org/10.1145/3352460.3358330
- M. Hosseinabady and J. L. Núñez-Yáñez, “A streaming dataflow engine for sparse matrix-vector multiplication using high-level synthesis,” IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., vol. 39, no. 6, pp. 1272–1285, 2020. [Online]. Available: https://doi.org/10.1109/TCAD.2019.2912923
- G. Oyarzun, D. Peyrolon, C. Álvarez, and X. Martorell, “An FPGA cached sparse matrix vector product (spmv) for unstructured computational fluid dynamics simulations,” CoRR, vol. abs/2107.12371, 2021. [Online]. Available: https://arxiv.org/abs/2107.12371
- A. Parravicini, L. G. Cellamare, M. Siracusa, and M. D. Santambrogio, “Scaling up HBM efficiency of top-k spmv for approximate embedding similarity on fpgas,” in 58th ACM/IEEE Design Automation Conference, DAC 2021, San Francisco, CA, USA, December 5-9, 2021. IEEE, 2021, pp. 799–804. [Online]. Available: https://doi.org/10.1109/DAC18074.2021.9586203
- B. Liu and D. Liu, “Towards high-bandwidth-utilization spmv on fpgas via partial vector duplication,” in Proceedings of the 28th Asia and South Pacific Design Automation Conference, ASPDAC 2023, Tokyo, Japan, January 16-19, 2023, A. Takahashi, Ed. ACM, 2023, pp. 33–38. [Online]. Available: https://doi.org/10.1145/3566097.3567839
- M. Mahadurkar, N. Sivanandan, and S. Kala, “Hardware acceleration of spmv multiplier for deep learning,” in 25th International Symposium on VLSI Design and Test, VDAT 2021, Surat, India, September 16-18, 2021. IEEE, 2021, pp. 1–6. [Online]. Available: https://doi.org/10.1109/VDAT53777.2021.9600988
- T. Nguyen, C. MacLean, M. Siracusa, D. Doerfler, N. J. Wright, and S. Williams, “Fpga-based HPC accelerators: An evaluation on performance and energy efficiency,” Concurr. Comput. Pract. Exp., vol. 34, no. 20, 2022. [Online]. Available: https://doi.org/10.1002/cpe.6570
- F. Favaro, E. Dufrechou, J. P. Oliver, and P. Ezzatti, “Optimizing the performance of the sparse matrix–vector multiplication kernel in fpga guided by the roofline model,” Micromachines, vol. 14, no. 11, p. 2030, Oct. 2023. [Online]. Available: http://dx.doi.org/10.3390/mi14112030
- X. Xie, Z. Liang, P. Gu, A. Basak, L. Deng, L. Liang, X. Hu, and Y. Xie, “Spacea: Sparse matrix vector multiplication on processing-in-memory accelerator,” in IEEE International Symposium on High-Performance Computer Architecture, HPCA 2021, Seoul, South Korea, February 27 - March 3, 2021. IEEE, 2021, pp. 570–583. [Online]. Available: https://doi.org/10.1109/HPCA51647.2021.00055
- W. Sun, Z. Li, S. Yin, S. Wei, and L. Liu, “ABC-DIMM: alleviating the bottleneck of communication in dimm-based near-memory processing with inter-dimm broadcast,” in 48th ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2021, Valencia, Spain, June 14-18, 2021. IEEE, 2021, pp. 237–250. [Online]. Available: https://doi.org/10.1109/ISCA52012.2021.00027
- C. Giannoula, I. Fernandez, J. Gómez-Luna, N. Koziris, G. I. Goumas, and O. Mutlu, “Sparsep: Towards efficient sparse matrix vector multiplication on real processing-in-memory architectures,” Proc. ACM Meas. Anal. Comput. Syst., vol. 6, no. 1, pp. 21:1–21:49, 2022. [Online]. Available: https://doi.org/10.1145/3508041
- X. Liu, M. Smelyanskiy, E. Chow, and P. Dubey, “Efficient sparse matrix-vector multiplication on x86-based many-core processors,” in International Conference on Supercomputing, ICS’13, Eugene, OR, USA - June 10 - 14, 2013, A. D. Malony, M. Nemirovsky, and S. P. Midkiff, Eds. ACM, 2013, pp. 273–282. [Online]. Available: https://doi.org/10.1145/2464996.2465013
- X. Chen, P. Xie, L. Chi, J. Liu, and C. Gong, “An efficient SIMD compression format for sparse matrix-vector multiplication,” Concurr. Comput. Pract. Exp., vol. 30, no. 23, 2018. [Online]. Available: https://doi.org/10.1002/cpe.4800
- B. Xie, J. Zhan, X. Liu, W. Gao, Z. Jia, X. He, and L. Zhang, “CVR: efficient vectorization of spmv on x86 processors,” in Proceedings of the 2018 International Symposium on Code Generation and Optimization, CGO 2018, Vösendorf / Vienna, Austria, February 24-28, 2018, J. Knoop, M. Schordan, T. Johnson, and M. F. P. O’Boyle, Eds. ACM, 2018, pp. 149–162. [Online]. Available: https://doi.org/10.1145/3168818
- C. Liu, B. Xie, X. Liu, W. Xue, H. Yang, and X. Liu, “Towards efficient spmv on sunway manycore architectures,” in Proceedings of the 32nd International Conference on Supercomputing, ICS 2018, Beijing, China, June 12-15, 2018. ACM, 2018, pp. 363–373. [Online]. Available: https://doi.org/10.1145/3205289.3205313
- Y. Chen, G. Xiao, F. Wu, Z. Tang, and K. Li, “tpspmv: A two-phase large-scale sparse matrix-vector multiplication kernel for manycore architectures,” Inf. Sci., vol. 523, pp. 279–295, 2020. [Online]. Available: https://doi.org/10.1016/j.ins.2020.03.020
- G. Xiao, Y. Chen, C. Liu, and X. Zhou, “ahspmv: An autotuning hybrid computing scheme for spmv on the sunway architecture,” IEEE Internet Things J., vol. 7, no. 3, pp. 1736–1744, 2020. [Online]. Available: https://doi.org/10.1109/JIOT.2019.2947257
- I. Mehrez and O. Hamdi-Larbi, “SMVP distribution using hypergraph model and S-GBNZ algorithm,” in Eighth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 3PGCIC 2013, Compiegne, France, October 28-30, 2013, F. Xhafa, L. Barolli, D. Nace, S. Venticinque, and A. Bui, Eds. IEEE, 2013, pp. 235–241. [Online]. Available: https://doi.org/10.1109/3PGCIC.2013.41
- H. Mi, X. Yu, X. Yu, S. Wu, and W. Liu, “Balancing computation and communication in distributed sparse matrix-vector multiplication,” in 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023, Bangalore, India, May 1-4, 2023, Y. Simmhan, I. Altintas, A. L. Varbanescu, P. Balaji, A. S. Prasad, and L. Carnevale, Eds. IEEE, 2023, pp. 535–544. [Online]. Available: https://doi.org/10.1109/CCGrid57682.2023.00056
- B. A. Page and P. M. Kogge, “Scalability of hybrid spmv with hypergraph partitioning and vertex delegation for communication avoidance,” in International Conference on High Performance Computing & Simulation (HPCS 2020), 2021.
- C. Mayer, R. Mayer, S. Bhowmik, L. Epple, and K. Rothermel, “HYPE: massive hypergraph partitioning with neighborhood expansion,” in IEEE International Conference on Big Data (IEEE BigData 2018), Seattle, WA, USA, December 10-13, 2018, N. Abe, H. Liu, C. Pu, X. Hu, N. K. Ahmed, M. Qiao, Y. Song, D. Kossmann, B. Liu, K. Lee, J. Tang, J. He, and J. S. Saltz, Eds. IEEE, 2018, pp. 458–467. [Online]. Available: https://doi.org/10.1109/BigData.2018.8621968
- S. Lin and Z. Xie, “A Jacobi_PCG Solver for Sparse Linear Systems on Multi-GPU Cluster,” The Journal of Supercomputing, vol. 73, no. 1, pp. 433–454, 2017. [Online]. Available: https://doi.org/10.1007/s11227-016-1887-4
- L. Z. Khodja, R. Couturier, A. Giersch, and J. M. Bahi, “Parallel sparse linear solver with GMRES method using minimization techniques of communications for GPU clusters,” J. Supercomput., vol. 69, no. 1, pp. 200–224, 2014. [Online]. Available: https://doi.org/10.1007/s11227-014-1143-8
- B. Bylina, J. Bylina, P. Stpiczynski, and D. Szalkowski, “Performance analysis of multicore and multinodal implementation of spmv operation,” in Proceedings of the 2014 Federated Conference on Computer Science and Information Systems, Warsaw, Poland, September 7-10, 2014, ser. Annals of Computer Science and Information Systems, M. Ganzha, L. A. Maciaszek, and M. Paprzycki, Eds., vol. 2, 2014, pp. 569–576. [Online]. Available: https://doi.org/10.15439/2014F313
- S. Lee and R. Eigenmann, “Adaptive runtime tuning of parallel sparse matrix-vector multiplication on distributed memory systems,” in Proceedings of the 22nd Annual International Conference on Supercomputing, ICS 2008, Island of Kos, Greece, June 7-12, 2008, P. Zhou, Ed. ACM, 2008, pp. 195–204. [Online]. Available: https://doi.org/10.1145/1375527.1375558
- W. Ma, Y. Hu, W. Yuan, and X. Liu, “Developing a multi-gpu-enabled preconditioned gmres with inexact triangular solves for block sparse matrices,” Mathematical Problems in Engineering, vol. 2021, pp. 1–17, 2021.
- S. R. K. B. Indarapu, M. K. Maramreddy, and K. Kothapalli, “Architecture- and workload- aware heterogeneous algorithms for sparse matrix vector multiplication,” in Proceedings of the 7th ACM India Computing Conference, COMPUTE 2014, Nagpur, India, October 9-11, 2014, P. Bhattacharyya, P. J. Narayanan, and S. Padmanabhuni, Eds. ACM, 2014, pp. 3:1–3:9. [Online]. Available: https://doi.org/10.1145/2675744.2675749
- V. Cardellini, A. Fanfarillo, and S. Filippone, “Heterogeneous sparse matrix computations on hybrid GPU/CPU platforms,” in Parallel Computing: Accelerating Computational Science and Engineering (CSE), Proceedings of the International Conference on Parallel Computing, ParCo 2013, 10-13 September 2013, Garching (near Munich), Germany, ser. Advances in Parallel Computing, M. Bader, A. Bode, H. Bungartz, M. Gerndt, G. R. Joubert, and F. J. Peters, Eds., vol. 25. IOS Press, 2013, pp. 203–212. [Online]. Available: https://doi.org/10.3233/978-1-61499-381-0-203
- W. Yang, K. Li, Z. Mo, and K. Li, “Performance optimization using partitioned spmv on gpus and multicore cpus,” IEEE Trans. Computers, vol. 64, no. 9, pp. 2623–2636, 2015. [Online]. Available: https://doi.org/10.1109/TC.2014.2366731
- W. Yang, K. Li, and K. Li, “A hybrid computing method of spmv on CPU-GPU heterogeneous computing systems,” J. Parallel Distributed Comput., vol. 104, pp. 49–60, 2017. [Online]. Available: https://doi.org/10.1016/j.jpdc.2016.12.023
- T. D. Braun, H. J. Siegel, N. Beck, L. Bölöni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys, B. Yao, D. A. Hensgen, and R. F. Freund, “A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems,” J. Parallel Distributed Comput., vol. 61, no. 6, pp. 810–837, 2001. [Online]. Available: https://doi.org/10.1006/jpdc.2000.1714
- W. Liu and B. Vinter, “Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors,” Parallel Comput., vol. 49, pp. 179–193, 2015. [Online]. Available: https://doi.org/10.1016/j.parco.2015.04.004
- G. Xiao, K. Li, Y. Chen, W. He, A. Y. Zomaya, and T. Li, “Caspmv: A customized and accelerative spmv framework for the sunway taihulight,” IEEE Trans. Parallel Distributed Syst., vol. 32, no. 1, pp. 131–146, 2021. [Online]. Available: https://doi.org/10.1109/TPDS.2019.2907537
- W. Li, H. Cheng, Z. Lu, Y. Lu, and W. Liu, “Haspmv: Heterogeneity-aware sparse matrix-vector multiplication on modern asymmetric multicore processors,” in 2023 IEEE International Conference on Cluster Computing (CLUSTER). IEEE Computer Society, 2023, pp. 209–220.
- T. A. Davis and Y. Hu, “The University of Florida Sparse Matrix Collection,” ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1:1–1:25, 2011. [Online]. Available: https://doi.org/10.1145/2049662.2049663
- NVIDIA, “NVIDIA cuSPARSE Library,” 2023. [Online]. Available: https://docs.nvidia.com/cuda/archive/12.0.0/index.html
- H. Anzt, W. Sawyer, S. Tomov, P. Luszczek, I. Yamazaki, and J. Dongarra, “Optimizing Krylov Subspace Solvers on Graphics Processing Units,” in Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014. Phoenix, AZ: IEEE, May 2014.
- I. Yamazaki, H. Anzt, S. Tomov, M. Hoemmen, and J. Dongarra, “Improving the Performance of CA-GMRES on Multicores with Multiple GPUs,” in IPDPS 2014. Phoenix, AZ: IEEE, May 2014.
- D. Merrill, “Merge-based Parallel Sparse Matrix-Vector Multiplication,” https://github.com/dumerrill/merge-spmv, (Accessed on 10/5/2023).
- T. Muhammed, R. Mehmood, A. Albeshri, and I. Katib, “SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUs,” Applied Sciences, vol. 9, no. 5, p. 947, Mar. 2019. [Online]. Available: http://dx.doi.org/10.3390/app9050947
- J.-H. Byun, R. Lin, K. A. Yelick, and J. Demmel, “Autotuning sparse matrix-vector multiplication for multicore,” EECS, UC Berkeley, Tech. Rep., 2012.
- G. Tan, J. Liu, and J. Li, “Design and Implementation of Adaptive SpMV Library for Multicore and Many-Core Architecture,” ACM Transactions on Mathematical Software, vol. 44, no. 4, pp. 46:1–46:25, 2018. [Online]. Available: https://doi.org/10.1145/3218823
- C. Chen, “Explicit caching HYB: a new high-performance SpMV framework on GPGPU,” arXiv preprint arXiv:2204.06666, 2022.
- Z. Du, J. Li, Y. Wang, X. Li, G. Tan, and N. Sun, “AlphaSparse: Generating High Performance SpMV Codes Directly from Sparse Matrices,” in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA, November 13-18, 2022. IEEE, 2022, pp. 1–15. [Online]. Available: https://doi.org/10.1109/SC41404.2022.00071
Authors: Jianhua Gao, Bingjie Liu, Weixing Ji, Hua Huang