
A Systematic Literature Survey of Sparse Matrix-Vector Multiplication (2404.06047v1)

Published 9 Apr 2024 in cs.DC

Abstract: Sparse matrix-vector multiplication (SpMV) is a crucial computing kernel with widespread applications in iterative algorithms. Over the past decades, research on SpMV optimization has made remarkable strides, giving rise to a wide variety of optimization techniques. However, a comprehensive and systematic literature survey that introduces, analyzes, discusses, and summarizes recent advancements in SpMV has been lacking. Aiming to fill this gap, this paper compares existing techniques and analyzes their strengths and weaknesses. We begin by highlighting two representative applications of SpMV, then conduct an in-depth overview of the important techniques that optimize SpMV on modern architectures, which we classify as classic, auto-tuning, machine-learning-based, and mixed-precision-based optimizations. We also elaborate on hardware platforms, including CPU, GPU, FPGA, processing-in-memory, heterogeneous, and distributed architectures. We present a comprehensive experimental evaluation comparing the performance of state-of-the-art SpMV implementations. Based on our findings, we identify several open challenges and point out future research directions. This survey is intended to give researchers a comprehensive understanding of SpMV optimization on modern architectures and to provide guidance for future work.
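The kernel at the heart of the survey is simple to state: given a sparse matrix A and a dense vector x, compute y = Ax while storing and touching only the nonzeros of A. As a concrete reference point, here is a minimal sequential sketch of SpMV over the standard compressed sparse row (CSR) format; the array names (row_ptr, col_idx, vals) follow common convention, and the 3x3 example matrix is purely illustrative, not taken from the paper.

```c
#include <stdio.h>

/* Baseline SpMV y = A*x with A in compressed sparse row (CSR) format:
 * row_ptr[i]..row_ptr[i+1] indexes the nonzeros of row i, col_idx
 * holds their column positions, and vals holds their values. */
static void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
                     const double *vals, const double *x, double *y) {
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += vals[k] * x[col_idx[k]];  /* indirect access into x */
        y[i] = sum;
    }
}

int main(void) {
    /* Illustrative 3x3 sparse matrix with 5 nonzeros:
     *   [ 4 0 1 ]
     *   [ 0 3 0 ]
     *   [ 2 0 5 ]                                          */
    int row_ptr[] = {0, 2, 3, 5};
    int col_idx[] = {0, 2, 1, 0, 2};
    double vals[] = {4.0, 1.0, 3.0, 2.0, 5.0};
    double x[] = {1.0, 1.0, 1.0};
    double y[3];

    spmv_csr(3, row_ptr, col_idx, vals, x, y);
    for (int i = 0; i < 3; ++i)
        printf("y[%d] = %g\n", i, y[i]);  /* expect 5, 3, 7 */
    return 0;
}
```

The indirect, data-dependent access x[col_idx[k]] is what makes the kernel memory-bound and irregular, and it is precisely this behavior that the storage formats, auto-tuners, machine-learning-based selectors, and mixed-precision schemes covered in the survey aim to mitigate.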

Definition Search Book Streamline Icon: https://streamlinehq.com
References (213)
  1. M. A. V. Kulkarni and P. Barde, “A survey on performance modelling and optimization techniques for spmv on gpus,” Int. J. Comput. Sci. Inf. Technol, vol. 5, pp. 7577–7582, 2014.
  2. M. Grossman, C. Thiele, M. Araya-Polo, F. Frank, F. O. Alpak, and V. Sarkar, “A survey of sparse matrix-vector multiplication performance on large matrices,” arXiv preprint arXiv:1608.00636, 2016.
  3. S. Filippone, V. Cardellini, D. Barbieri, and A. Fanfarillo, “Sparse matrix-vector multiplication on gpgpus,” ACM Trans. Math. Softw., vol. 43, no. 4, pp. 30:1–30:49, 2017. [Online]. Available: https://doi.org/10.1145/3017994
  4. Q. Wang, M. Li, J. Pang, and D. Zhu, “Research on performance optimization for sparse matrix-vector multiplication in multi/many-core architecture,” in 2020 2nd International Conference on Information Technology and Computer Application (ITCA).   IEEE, 2020, pp. 350–362.
  5. G. Xiao, C. Yin, T. Zhou, X. Li, Y. Chen, and K. Li, “A survey of accelerating parallel sparse linear algebra,” ACM Comput. Surv., vol. 56, no. 1, aug 2023. [Online]. Available: https://doi.org/10.1145/3604606
  6. X. Fu, B. Zhang, T. Wang, W. Li, Y. Lu, E. Yi, J. Zhao, X. Geng, F. Li, J. Zhang, Z. Jin, and W. Liu, “Pangulu: A scalable regular two-dimensional block-cyclic sparse direct solver on distributed heterogeneous systems,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023, Denver, CO, USA, November 12-17, 2023, D. Arnold, R. M. Badia, and K. M. Mohror, Eds.   ACM, 2023, pp. 51:1–51:14. [Online]. Available: https://doi.org/10.1145/3581784.3607050
  7. Y. Saad and M. H. Schultz, “Gmres: A generalized minimal residual algorithm for solving nonsymmetric linear systems,” SIAM Journal on scientific and statistical computing, vol. 7, no. 3, pp. 856–869, 1986.
  8. K. Suzuki, T. Fukaya, and T. Iwashita, “A novel ILU preconditioning method with a block structure suitable for SIMD vectorization,” J. Comput. Appl. Math., vol. 419, p. 114687, 2023. [Online]. Available: https://doi.org/10.1016/j.cam.2022.114687
  9. X. Shi, Z. Zheng, Y. Zhou, H. Jin, L. He, B. Liu, and Q. Hua, “Graph processing on gpus: A survey,” ACM Comput. Surv., vol. 50, no. 6, pp. 81:1–81:35, 2018. [Online]. Available: https://doi.org/10.1145/3128571
  10. H. Tong, C. Faloutsos, and J. Pan, “Random walk with restart: fast solutions and applications,” Knowl. Inf. Syst., vol. 14, no. 3, pp. 327–346, 2008. [Online]. Available: https://doi.org/10.1007/s10115-007-0094-2
  11. A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, and P. Sadayappan, “Fast sparse matrix-vector multiplication on gpus for graph applications,” in International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014, New Orleans, LA, USA, November 16-21, 2014, T. Damkroger and J. J. Dongarra, Eds.   IEEE Computer Society, 2014, pp. 781–792. [Online]. Available: https://doi.org/10.1109/SC.2014.69
  12. T. Wu, B. Wang, Y. Shan, F. Yan, Y. Wang, and N. Xu, “Efficient pagerank and spmv computation on AMD gpus,” in 39th International Conference on Parallel Processing, ICPP 2010, San Diego, California, USA, 13-16 September 2010.   IEEE Computer Society, 2010, pp. 81–89. [Online]. Available: https://doi.org/10.1109/ICPP.2010.17
  13. J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” J. ACM, vol. 46, no. 5, pp. 604–632, 1999. [Online]. Available: https://doi.org/10.1145/324133.324140
  14. S. Brin and L. Page, “Reprint of: The anatomy of a large-scale hypertextual web search engine,” Comput. Networks, vol. 56, no. 18, pp. 3825–3833, 2012. [Online]. Available: https://doi.org/10.1016/j.comnet.2012.10.007
  15. G. V. Paolini and G. R. D. Brozolo, “Data structures to vectorize cg algorithms for general sparsity patterns,” BIT Numerical Mathematics, vol. 29, no. 4, pp. 703–718, 1989.
  16. A. Peters, “Sparse matrix vector multiplication techniques on the IBM 3090 VF,” Parallel Comput., vol. 17, no. 12, pp. 1409–1424, 1991. [Online]. Available: https://doi.org/10.1016/S0167-8191(05)80007-9
  17. Y. Saad, “SPARSKIT: A Basic Took Kit for Sparse Matrix Computations, Version 2,” http://www. cs. umn. edu/saad/software/SPARSKIT/sparskit. html, 1994.
  18. N. Bell and M. Garland, “Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors,” in Proceedings of the ACM/IEEE Conference on High Performance Computing, SC, November 14-20, 2009, Portland, Oregon, USA.   ACM, 2009, pp. 1–11. [Online]. Available: https://doi.org/10.1145/1654059.1654078
  19. S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens, “Scan Primitives for GPU Computing,” in Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, ser. GH’07.   Goslar, DEU: Eurographics Association, 2007, p. 97–106.
  20. M. Garland, “Sparse Matrix Computations on Manycore GPU’s,” in Proceedings of the 45th annual Design Automation Conference, ser. DAC ’08.   New York, NY, USA: Association for Computing Machinery, Jun 2008, p. 2–6. [Online]. Available: https://doi.org/10.1145/1391469.1391473
  21. H. Dang and B. Schmidt, “The Sliced COO Format for Sparse Matrix-Vector Multiplication on CUDA-Enabled GPUs,” in Proceedings of the International Conference on Computational Science, ICCS, Omaha, Nebraska, USA, 4-6 June, 2012, ser. Procedia Computer Science, H. H. Ali, Y. Shi, D. Khazanchi, M. Lees, G. D. van Albada, J. J. Dongarra, and P. M. A. Sloot, Eds., vol. 9.   Elsevier, 2012, pp. 57–66. [Online]. Available: https://doi.org/10.1016/j.procs.2012.04.007
  22. E. F. D’Azevedo, M. R. Fahey, and R. T. Mills, “Vectorized sparse matrix multiply for compressed row storage format,” in Computational Science - ICCS 2005, 5th International Conference, Atlanta, GA, USA, May 22-25, 2005, Proceedings, Part I, ser. Lecture Notes in Computer Science, V. S. Sunderam, G. D. van Albada, P. M. A. Sloot, and J. J. Dongarra, Eds., vol. 3514.   Springer, 2005, pp. 99–106. [Online]. Available: https://doi.org/10.1007/11428831\_13
  23. A. Monakov, A. Lokhmotov, and A. Avetisyan, “Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures,” in High Performance Embedded Architectures and Compilers, 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25-27, 2010. Proceedings, ser. Lecture Notes in Computer Science, Y. N. Patt, P. Foglia, E. Duesterwald, P. Faraboschi, and X. Martorell, Eds., vol. 5952.   Springer, 2010, pp. 111–125. [Online]. Available: https://doi.org/10.1007/978-3-642-11515-8\_10
  24. D. Barbieri, V. Cardellini, A. Fanfarillo, and S. Filippone, “Three Storage Formats for Sparse Matrices on GPGPUs,” 2015.
  25. M. Kreutzer, G. Hager, G. Wellein, H. Fehske, A. Basermann, and A. R. Bishop, “Sparse Matrix-Vector Multiplication on GPGPU Clusters: A New Storage Format and a Scalable Implementation,” in 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops PhD Forum, 2012, pp. 1696–1702.
  26. M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, “A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors With Wide SIMD Units,” SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C401–C423, 2014.
  27. L. Yuan, Y. Zhang, X. Sun, and T. Wang, “Optimizing Sparse Matrix Vector Multiplication Using Diagonal Storage Matrix Format,” in 12th IEEE International Conference on High Performance Computing and Communications, HPCC, 1-3 September 2010, Melbourne, Australia.   IEEE, 2010, pp. 585–590. [Online]. Available: https://doi.org/10.1109/HPCC.2010.67
  28. X. Sun, Y. Zhang, T. Wang, G. Long, X. Zhang, and Y. Li, “CRSD: Application Specific Auto-tuning of SpMV for Diagonal Sparse Matrices,” in Euro-Par Parallel Processing - 17th International Conference, Euro-Par, Bordeaux, France, August 29 - September 2, 2011, Proceedings, Part II, ser. Lecture Notes in Computer Science, vol. 6853.   Springer, 2011, pp. 316–327. [Online]. Available: https://doi.org/10.1007/978-3-642-23397-5\_32
  29. X. Sun, Y. Zhang, T. Wang, X. Zhang, L. Yuan, and L. Rao, “Optimizing SpMV for Diagonal Sparse Matrices on GPU,” in International Conference on Parallel Processing, ICPP, Taipei, Taiwan, September 13-16, 2011.   IEEE Computer Society, 2011, pp. 492–501. [Online]. Available: https://doi.org/10.1109/ICPP.2011.53
  30. A. Pinar and M. T. Heath, “Improving performance of sparse matrix-vector multiplication,” in Proceedings of the ACM/IEEE Conference on Supercomputing, SC 1999, November 13-19, 1999, Portland, Oregon, USA.   ACM, 1999, p. 30. [Online]. Available: https://doi.org/10.1145/331532.331562
  31. Y. Li, P. Xie, X. Chen, J. Liu, B. Yang, S. Li, C. Gong, X. Gan, and H. Xu, “VBSF: a new storage format for SIMD sparse matrix-vector multiplication on modern processors,” The Journal of Supercomputing, vol. 76, no. 3, pp. 2063–2081, 2020. [Online]. Available: https://doi.org/10.1007/s11227-019-02835-4
  32. A. Ashari, N. Sedaghati, J. Eisenlohr, and P. Sadayappan, “An Efficient Two-Dimensional Blocking Strategy for Sparse Matrix-Vector Multiplication on GPUs,” in International Conference on Supercomputing, ICS’14, Muenchen, Germany, June 10-13, 2014, A. Bode, M. Gerndt, P. Stenström, L. Rauchwerger, B. P. Miller, and M. Schulz, Eds.   ACM, 2014, pp. 273–282. [Online]. Available: https://doi.org/10.1145/2597652.2597678
  33. S. Jain-Mendon and R. Sass, “A case study of streaming storage format for sparse matrices,” in 2012 International Conference on Reconfigurable Computing and FPGAs, ReConFig 2012, Cancun, Mexico, December 5-7, 2012.   IEEE, 2012, pp. 1–6. [Online]. Available: https://doi.org/10.1109/ReConFig.2012.6416788
  34. S. Yan, C. Li, Y. Zhang, and H. Zhou, “yaSpMV: Yet Another SpMV Framework on GPUs,” in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’14.   New York, NY, USA: ACM, 2014, pp. 107–118. [Online]. Available: http://doi.acm.org/10.1145/2555243.2555255
  35. J. Godwin, J. Holewinski, and P. Sadayappan, “High-Performance Sparse Matrix-Vector Multiplication on GPUs for Structured Grid Computations,” in The 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, GPGPU-5, London, United Kingdom, March 3, D. R. Kaeli, J. Cavazos, and E. Sun, Eds.   ACM, 2012, pp. 47–56. [Online]. Available: https://doi.org/10.1145/2159430.2159436
  36. J. Choi, A. Singh, and R. W. Vuduc, “Model-Driven Autotuning of Sparse Matrix-Vector Multiply on GPUs,” in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, Bangalore, India, January 9-14, 2010.   ACM, 2010, pp. 115–126. [Online]. Available: https://doi.org/10.1145/1693453.1693471
  37. W. Liu and B. Vinter, “CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication,” in Proceedings of the 29th ACM on International Conference on Supercomputing, ICS’15, Newport Beach/Irvine, CA, USA, June 08 - 11, 2015, L. N. Bhuyan, F. Chong, and V. Sarkar, Eds.   ACM, 2015, pp. 339–350. [Online]. Available: https://doi.org/10.1145/2751205.2751209
  38. H. Bian, J. Huang, R. Dong, L. Liu, and X. Wang, “CSR2: A New Format for SIMD-Accelerated SpMV,” in 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020, Melbourne, Australia, May 11-14, 2020.   IEEE, 2020, pp. 350–359. [Online]. Available: https://doi.org/10.1109/CCGrid49817.2020.00-58
  39. K. Kourtis, V. Karakasis, G. I. Goumas, and N. Koziris, “CSX: an extended compression format for spmv on shared memory systems,” in Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2011, San Antonio, TX, USA, February 12-16, 2011, C. Cascaval and P. Yew, Eds.   ACM, 2011, pp. 247–256. [Online]. Available: https://doi.org/10.1145/1941553.1941587
  40. B.-Y. Su and K. Keutzer, “ClSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs,” in Proceedings of the 26th ACM International Conference on Supercomputing, ser. ICS’12.   New York, NY, USA: Association for Computing Machinery, 2012, p. 353–364. [Online]. Available: https://doi.org/10.1145/2304576.2304624
  41. N. Bell and M. Garland, “Efficient Sparse Matrix-Vector Multiplication on CUDA,” NVIDIA Corporation, NVIDIA Technical Report NVR-2008-004, Dec. 2008.
  42. H. Anzt, T. Cojean, C. Yen-Chen, J. J. Dongarra, G. Flegar, P. Nayak, S. Tomov, Y. M. Tsai, and W. Wang, “Load-Balancing Sparse Matrix Vector Product Kernels on GPUs,” ACM Transactions on Parallel Computing, vol. 7, no. 1, pp. 2:1–2:26, 2020. [Online]. Available: https://doi.org/10.1145/3380930
  43. A. Maringanti, V. Athavale, and S. B. Patkar, “Acceleration of Conjugate Gradient Method for Circuit Simulation Using CUDA,” in 16th International Conference on High Performance Computing, December 16-19, 2009, Kochi, India, Proceedings, Y. Yang, M. Parashar, R. Muralidhar, and V. K. Prasanna, Eds.   IEEE Computer Society, 2009, pp. 438–444. [Online]. Available: https://doi.org/10.1109/HIPC.2009.5433184
  44. W. Yang, K. Li, and K. Li, “A parallel computing method using blocked format with optimal partitioning for spmv on GPU,” J. Comput. Syst. Sci., vol. 92, pp. 152–170, 2018. [Online]. Available: https://doi.org/10.1016/j.jcss.2017.09.010
  45. Y. Nagasaka, A. Nukada, and S. Matsuoka, “Adaptive Multi-Level Blocking Optimization for Sparse Matrix Vector Multiplication on GPU,” Procedia Computer Science, vol. 80, pp. 131–142, 2016, international Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, San Diego, California, USA. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S187705091630655X
  46. J. Bolz, I. Farmer, E. Grinspun, and P. Schröder, “Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid,” ACM Transactions on Graphics (TOG), vol. 22, no. 3, pp. 917–924, 2003.
  47. W. Yang, K. Li, Y. Liu, L. Shi, and L. Wan, “Optimization of Quasi-Diagonal Matrix-Vector Multiplication on GPU,” The International Journal of High Performance Computing Applications, vol. 28, no. 2, pp. 183–195, 2014. [Online]. Available: https://doi.org/10.1177/1094342013501126
  48. J. Gao, W. Ji, Z. Tan, Y. Wang, and F. Shi, “Taichi: A hybrid compression format for binary sparse matrix-vector multiplication on GPU,” IEEE Trans. Parallel Distributed Syst., vol. 33, no. 12, pp. 3732–3745, 2022. [Online]. Available: https://doi.org/10.1109/TPDS.2022.3170501
  49. W. T. Tang, R. Zhao, M. Lu, Y. Liang, H. P. Huyng, X. Li, and R. S. M. Goh, “Optimizing and auto-tuning scale-free sparse matrix-vector multiplication on intel xeon phi,” in Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2015, San Francisco, CA, USA, February 07 - 11, 2015, K. Olukotun, A. Smith, R. Hundt, and J. Mars, Eds.   IEEE Computer Society, 2015, pp. 136–145. [Online]. Available: https://doi.org/10.1109/CGO.2015.7054194
  50. A. Monakov and A. Avetisyan, “Implementing blocked sparse matrix-vector multiplication on NVIDIA gpus,” in Embedded Computer Systems: Architectures, Modeling, and Simulation, 9th International Workshop, SAMOS 2009, Samos, Greece, July 20-23, 2009. Proceedings, ser. Lecture Notes in Computer Science, K. Bertels, N. J. Dimopoulos, C. Silvano, and S. Wong, Eds., vol. 5657.   Springer, 2009, pp. 289–297. [Online]. Available: https://doi.org/10.1007/978-3-642-03138-0\_32
  51. Z. Tan, W. Ji, J. Gao, Y. Zhao, A. Benatia, Y. Wang, and F. Shi, “MMSparse: 2D Partitioning of Sparse Matrix Based on Mathematical Morphology,” Future Generation Computer Systems, vol. 108, pp. 521 – 532, 2020. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167739X19327967
  52. Y. Niu, Z. Lu, M. Dong, Z. Jin, W. Liu, and G. Tan, “TileSpMV: A Tiled Algorithm for Sparse Matrix-Vector Multiplication on GPUs,” in 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021, Portland, OR, USA, May 17-21, 2021.   IEEE, 2021, pp. 68–78. [Online]. Available: https://doi.org/10.1109/IPDPS49936.2021.00016
  53. J. Willcock and A. Lumsdaine, “Accelerating sparse matrix computations via data compression,” in Proceedings of the 20th Annual International Conference on Supercomputing, ICS 2006, Cairns, Queensland, Australia, June 28 - July 01, 2006, G. K. Egan and Y. Muraoka, Eds.   ACM, 2006, pp. 307–316. [Online]. Available: https://doi.org/10.1145/1183401.1183444
  54. K. Kourtis, G. I. Goumas, and N. Koziris, “Optimizing sparse matrix-vector multiplication using index and value compression,” in Proceedings of the 5th Conference on Computing Frontiers, 2008, Ischia, Italy, May 5-7, 2008, A. Ramírez, G. Bilardi, and M. Gschwind, Eds.   ACM, 2008, pp. 87–96. [Online]. Available: https://doi.org/10.1145/1366230.1366244
  55. W. T. Tang, W. J. Tan, R. Ray, Y. W. Wong, W. Chen, S. Kuo, R. S. M. Goh, S. J. Turner, and W. Wong, “Accelerating Sparse Matrix-Vector Multiplication on GPUs Using Bit-Representation-Optimized Schemes,” in International Conference for High Performance Computing, Networking, Storage and Analysis, SC’13, Denver, CO, USA - November 17 - 21, 2013, W. Gropp and S. Matsuoka, Eds.   ACM, 2013, pp. 26:1–26:12. [Online]. Available: https://doi.org/10.1145/2503210.2503234
  56. W. T. Tang, W. J. Tan, R. S. M. Goh, S. J. Turner, and W.-F. Wong, “A family of bit-representation-optimized formats for fast sparse matrix-vector multiplication on the gpu,” IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 9, pp. 2373–2385, 2015.
  57. S. Kestur, J. D. Davis, and E. S. Chung, “Towards a universal FPGA matrix-vector multiplication architecture,” in 2012 IEEE 20th Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2012, 29 April - 1 May 2012, Toronto, Ontario, Canada.   IEEE Computer Society, 2012, pp. 9–16. [Online]. Available: https://doi.org/10.1109/FCCM.2012.12
  58. J. M. Mellor-Crummey and J. Garvin, “Optimizing sparse matrix - vector product computations using unroll and jam,” Int. J. High Perform. Comput. Appl., vol. 18, no. 2, pp. 225–236, 2004. [Online]. Available: https://doi.org/10.1177/1094342004038951
  59. F. Vázquez, J. Fernández, and E. M. Garzón, “A new approach for sparse matrix vector product on NVIDIA gpus,” Concurr. Comput. Pract. Exp., vol. 23, no. 8, pp. 815–826, 2011. [Online]. Available: https://doi.org/10.1002/cpe.1658
  60. E. J. Anderson and Y. Saad, “Solving sparse triangular linear systems on parallel computers,” Int. J. High Speed Comput., vol. 1, no. 1, pp. 73–95, 1989. [Online]. Available: https://doi.org/10.1142/S0129053389000056
  61. R. G. Melhem, “Parallel solution of linear systems with striped sparse matrices,” Parallel Comput., vol. 6, no. 2, pp. 165–184, 1988. [Online]. Available: https://doi.org/10.1016/0167-8191(88)90082-8
  62. E. Montagne and A. Ekambaram, “An optimal storage format for sparse matrices,” Inf. Process. Lett., vol. 90, no. 2, pp. 87–92, 2004. [Online]. Available: https://doi.org/10.1016/j.ipl.2004.01.014
  63. M. M. Baskaran and R. Bordawekar, “Optimizing Sparse Matrix-Vector Multiplication on GPUs Using Compile-Time and Run-Time Strategies,” IBM Reserach Report, RC24704 (W0812-047), 2008.
  64. A. H. E. Zein and A. P. Rendell, “From Sparse Matrix to Optimal GPU CUDA Sparse Matrix Vector Product Implementation,” in 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, CCGrid, 17-20 May 2010, Melbourne, Victoria, Australia.   IEEE Computer Society, 2010, pp. 808–813. [Online]. Available: https://doi.org/10.1109/CCGRID.2010.81
  65. ——, “Generating Optimal CUDA Sparse Matrix-Vector Product Implementations for Evolving GPU Hardware,” Concurrency Computation: Practice and Experience, vol. 24, no. 1, pp. 3–13, 2012. [Online]. Available: https://doi.org/10.1002/cpe.1732
  66. S. Dalton, N. Bell, L. Olson, and M. Garland, “Cusp: Generic Parallel Algorithms for Sparse Matrix and Graph Computations,” 2014, version 0.5.0. [Online]. Available: http://cusplibrary.github.io/
  67. D. Mukunoki and D. Takahashi, “Optimization of Sparse Matrix-Vector Multiplication for CRS Format on NVIDIA Kepler Architecture GPUs,” in Computational Science and Its Applications - ICCSA 13th International Conference, Ho Chi Minh City, Vietnam, June 24-27, 2013, Proceedings, Part V, ser. Lecture Notes in Computer Science, vol. 7975.   Springer, 2013, pp. 211–223. [Online]. Available: https://doi.org/10.1007/978-3-642-39640-3\_15
  68. I. Reguly and M. Giles, “Efficient Sparse Matrix-Vector Multiplication on Cache-Based GPUs,” in 2012 Innovative Parallel Computing (InPar), May 2012, pp. 1–12.
  69. H. Yoshizawa and D. Takahashi, “Automatic Tuning of Sparse Matrix-Vector Multiplication for CRS Format on GPUs,” in 15th IEEE International Conference on Computational Science and Engineering, CSE 2012, Paphos, Cyprus, December 5-7, 2012.   IEEE Computer Society, 2012, pp. 130–136. [Online]. Available: https://doi.org/10.1109/ICCSE.2012.28
  70. Z. Koza, M. Matyka, S. Szkoda, and Ł. Mirosław, “Compressed Multirow Storage Format for Sparse Matrices on Graphics Processing Units,” SIAM Journal on Scientific Computing, vol. 36, no. 2, pp. C219–C239, 2014. [Online]. Available: https://doi.org/10.1137/120900216
  71. H. Anzt, T. Cojean, G. Flegar, F. Göbel, T. Grützmacher, P. Nayak, T. Ribizel, Y. M. Tsai, and E. S. Quintana-Ortí, “Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing,” ACM Transactions on Mathematical Software, vol. 48, no. 1, pp. 2:1–2:33, 2022. [Online]. Available: https://doi.org/10.1145/3480935
  72. G. Flegar and H. Anzt, “Overcoming load imbalance for irregular sparse matrices,” in Proceedings of the Seventh Workshop on Irregular Applications: Architectures and Algorithms, IA3@SC 2017, Denver, CO, USA, November 12 - 17, 2017.   ACM, 2017, pp. 2:1–2:8. [Online]. Available: https://doi.org/10.1145/3149704.3149767
  73. S. Liu, Y. Zhang, X. Sun, and R. Qiu, “Performance Evaluation of Multithreaded Sparse Matrix-Vector Multiplication Using OpenMP,” in 11th IEEE International Conference on High Performance Computing and Communications, HPCC, 25-27 June 2009, Seoul, Korea.   IEEE, 2009, pp. 659–665. [Online]. Available: https://doi.org/10.1109/HPCC.2009.75
  74. X. Feng, H. Jin, R. Zheng, K. Hu, J. Zeng, and Z. Shao, “Optimization of sparse matrix-vector multiplication with variant CSR on gpus,” in 17th IEEE International Conference on Parallel and Distributed Systems, ICPADS, Tainan, Taiwan, December 7-9, 2011.   IEEE Computer Society, 2011, pp. 165–172. [Online]. Available: https://doi.org/10.1109/ICPADS.2011.91
  75. J. L. Greathouse and M. Daga, “Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format,” in SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.   IEEE, 2014, pp. 769–780.
  76. M. Daga and J. L. Greathouse, “Structural Agnostic SpMV: Adapting CSR-Adaptive for Irregular Matrices,” in 2015 IEEE 22nd International Conference on High Performance Computing (HiPC), 2015, pp. 64–74.
  77. J. Gao, W. Ji, J. Liu, S. Shao, Y. Wang, and F. Shi, “AMF-CSR: adaptive multi-row folding of CSR for spmv on GPU,” in 27th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2021, Beijing, China, December 14-16, 2021.   IEEE, 2021, pp. 418–425. [Online]. Available: https://doi.org/10.1109/ICPADS53394.2021.00058
  78. Y. Liu and B. Schmidt, “LightSpMV: Faster CSR-Based Sparse Matrix-Vector Multiplication on CUDA-Enabled GPUs,” in 26th IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP 2015, Toronto, ON, Canada, July 27-29, 2015.   IEEE Computer Society, 2015, pp. 82–89. [Online]. Available: https://doi.org/10.1109/ASAP.2015.7245713
  79. ——, “LightSpMV: Faster CUDA-Compatible Sparse Matrix-Vector Multiplication Using Compressed Sparse Rows,” Journal of Signal Processing Systems, vol. 90, no. 1, pp. 69–86, 2018.
  80. D. Merrill and M. Garland, “Merge-Based Parallel Sparse Matrix-Vector Multiplication,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016, Salt Lake City, UT, USA, November 13-18, 2016, J. West and C. M. Pancake, Eds.   IEEE Computer Society, 2016, pp. 678–689. [Online]. Available: https://doi.org/10.1109/SC.2016.57
  81. G. Flegar and E. S. Quintana-Ortí, “Balanced CSR Sparse Matrix-Vector Product on Graphics Processors,” in Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28 - September 1, 2017, Proceedings, ser. Lecture Notes in Computer Science, F. F. Rivera, T. F. Pena, and J. C. Cabaleiro, Eds., vol. 10417.   Springer, 2017, pp. 697–709. [Online]. Available: https://doi.org/10.1007/978-3-319-64203-1\_50
  82. M. Steinberger, R. Zayer, and H. Seidel, “Globally Homogeneous, Locally Adaptive Sparse Matrix-Vector Multiplication on the GPU,” in Proceedings of the International Conference on Supercomputing, ICS 2017, Chicago, IL, USA, June 14-16, 2017, W. D. Gropp, P. Beckman, Z. Li, and F. J. Cazorla, Eds.   ACM, 2017, pp. 13:1–13:11. [Online]. Available: https://doi.org/10.1145/3079079.3079086
  83. E. Im and K. A. Yelick, “Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY,” in Computational Science - ICCS 2001, International Conference, San Francisco, CA, USA, May 28-30, 2001. Proceedings, Part I, ser. Lecture Notes in Computer Science, V. N. Alexandrov, J. J. Dongarra, B. A. Juliano, R. S. Renner, and C. J. K. Tan, Eds., vol. 2073.   Springer, 2001, pp. 127–136. [Online]. Available: https://doi.org/10.1007/3-540-45545-0\_22
  84. E. Im, K. A. Yelick, and R. W. Vuduc, “Sparsity: Optimization Framework for Sparse Matrix Kernels,” The International Journal of High Performance Computing Applications, vol. 18, no. 1, pp. 135–158, 2004. [Online]. Available: https://doi.org/10.1177/1094342004041296
  85. R. Vuduc, J. Demmel, K. A. Yelick, S. Kamil, R. Nishtala, and B. C. Lee, “Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply,” in Proceedings of the 2002 ACM/IEEE conference on Supercomputing, Baltimore, Maryland, USA, November 16-22, 2002, CD-ROM, R. C. Giles, D. A. Reed, and K. Kelley, Eds.   IEEE Computer Society, 2002, pp. 35:1–35:35. [Online]. Available: https://doi.org/10.1109/SC.2002.10025
  86. R. Vuduc, J. W. Demmel, and K. A. Yelick, “OSKI: A Library of Automatically Tuned Sparse Matrix Kernels,” in Journal of Physics: Conference Series, vol. 16, no. 1.   IOP Publishing, 2005, p. 521.
  87. R. Nishtala, R. W. Vuduc, J. Demmel, and K. A. Yelick, “When cache blocking of sparse matrix vector multiply works and why,” Appl. Algebra Eng. Commun. Comput., vol. 18, no. 3, pp. 297–311, 2007. [Online]. Available: https://doi.org/10.1007/s00200-007-0038-9
  88. X. Zhang, Y. Zhang, X. Sun, F. Liu, S. Liu, Y. Tang, and Y. Li, “Automatic performance tuning of spmv on gpgpu,” HPC Asia, Kaohsiung, Taiwan, China, pp. 173–179, 2009.
  89. P. Guo, L. Wang, and P. Chen, “A Performance Modeling and Optimization Analysis Tool for Sparse Matrix-Vector Multiplication on GPUs,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 5, pp. 1112–1123, 2014. [Online]. Available: https://doi.org/10.1109/TPDS.2013.123
  90. K. Li, W. Yang, and K. Li, “Performance Analysis and Optimization for SpMV on GPU Using Probabilistic Modeling,” IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 1, pp. 196–205, 2015.
  91. F. Vázquez, J. Fernández, and E. M. Garzón, “Automatic Tuning of the Sparse Matrix Vector Product on GPUs Based on the ELLR-T Approach,” Parallel Computing, vol. 38, no. 8, pp. 408–420, 2012. [Online]. Available: https://doi.org/10.1016/j.parco.2011.08.003
  92. F. Vázquez, G. O. López, J. Fernández, and E. M. Garzón, “Improving the performance of the sparse matrix vector product with gpus,” in 10th IEEE International Conference on Computer and Information Technology, CIT 2010, Bradford, West Yorkshire, UK, June 29-July 1, 2010.   IEEE Computer Society, 2010, pp. 1146–1151. [Online]. Available: https://doi.org/10.1109/CIT.2010.208
  93. S. Li, C. Hu, J. Zhang, and Y. Zhang, “Automatic tuning of sparse matrix-vector multiplication on multicore clusters,” Sci. China Inf. Sci., vol. 58, no. 9, pp. 1–14, 2015. [Online]. Available: https://doi.org/10.1007/s11432-014-5254-x
  94. Y. Chen, G. Xiao, Z. Xiao, and W. Yang, “hpspmv: A heterogeneous parallel computing scheme for spmv on the sunway taihulight supercomputer,” in 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019, Zhangjiajie, China, August 10-12, 2019, Z. Xiao, L. T. Yang, P. Balaji, T. Li, K. Li, and A. Y. Zomaya, Eds.   IEEE, 2019, pp. 989–995. [Online]. Available: https://doi.org/10.1109/HPCC/SmartCity/DSS.2019.00142
  95. P. Guo and L. Wang, “Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs,” in International Conference on Computational and Information Sciences.   IEEE, 2010, pp. 1154–1157.
  96. W. A. Abu-Sufah and A. A. Karim, “Auto-tuning of sparse matrix-vector multiplication on graphics processors,” in Supercomputing - 28th International Supercomputing Conference, ISC 2013, Leipzig, Germany, June 16-20, 2013. Proceedings, ser. Lecture Notes in Computer Science, J. M. Kunkel, T. Ludwig, and H. W. Meuer, Eds., vol. 7905.   Springer, 2013, pp. 151–164. [Online]. Available: https://doi.org/10.1007/978-3-642-38750-0\_12
  97. ——, “An effective approach for implementing sparse matrix-vector multiplication on graphics processing units,” in 14th IEEE International Conference on High Performance Computing and Communication & 9th IEEE International Conference on Embedded Software and Systems, HPCC-ICESS 2012, Liverpool, United Kingdom, June 25-27, 2012, G. Min, J. Hu, L. C. Liu, L. T. Yang, S. Seelam, and L. Lefèvre, Eds.   IEEE Computer Society, 2012, pp. 453–460. [Online]. Available: https://doi.org/10.1109/HPCC.2012.68
  98. W. Armstrong and A. P. Rendell, “Reinforcement Learning for Automated Performance Tuning: Initial Evaluation for Sparse Matrix Format Selection,” in 2008 IEEE International Conference on Cluster Computing, 2008, pp. 411–420.
  99. J. Li, G. Tan, M. Chen, and N. Sun, “SMAT: An Input Adaptive Auto-Tuner for Sparse Matrix-Vector Multiplication,” ACM SIGPLAN Notices, vol. 48, no. 6, p. 117–126, jun 2013. [Online]. Available: https://doi.org/10.1145/2499370.2462181
  100. N. Sedaghati, T. Mu, L. Pouchet, S. Parthasarathy, and P. Sadayappan, “Automatic selection of sparse matrix representation on gpus,” in Proceedings of the 29th ACM on International Conference on Supercomputing, ICS’15, Newport Beach/Irvine, CA, USA, June 08 - 11, 2015, L. N. Bhuyan, F. Chong, and V. Sarkar, Eds.   ACM, 2015, pp. 99–108. [Online]. Available: https://doi.org/10.1145/2751205.2751244
  101. S. Chen, J. Fang, D. Chen, C. Xu, and Z. Wang, “Adaptive Optimization of Sparse Matrix-Vector Multiplication on Emerging Many-Core Architectures,” in 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2018, pp. 649–658.
  102. A. Benatia, W. Ji, Y. Wang, and F. Shi, “Sparse Matrix Format Selection with Multiclass SVM for SpMV on GPU,” in 2016 45th International Conference on Parallel Processing (ICPP), 2016, pp. 496–505.
  103. I. Mehrez, O. Hamdi-Larbi, T. Dufaud, and N. Emad, “Machine Learning for Optimal Compression Format Prediction on Multiprocessor Platform,” in 2018 International Conference on High Performance Computing Simulation (HPCS), 2018, pp. 213–220.
  104. K. Hou, W. Feng, and S. Che, “Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors,” in 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017, pp. 713–722.
  105. A. Benatia, W. Ji, Y. Wang, and F. Shi, “BestSF: A Sparse Meta-Format for Optimizing SpMV on GPU,” ACM Transactions on Architecture and Code Optimization, vol. 15, no. 3, Sep. 2018. [Online]. Available: https://doi.org/10.1145/3226228
  106. O. Hamdi-Larbi, I. Mehrez, and T. Dufaud, “Machine learning to design an auto-tuning system for the best compressed format detection for parallel sparse computations,” Parallel Process. Lett., vol. 31, no. 4, pp. 2 150 019:1–2 150 019:37, 2021. [Online]. Available: https://doi.org/10.1142/S0129626421500195
  107. I. Nisa, C. Siegel, A. S. Rajam, A. Vishnu, and P. Sadayappan, “Effective Machine Learning Based Format Selection and Performance Modeling for SpMV on GPUs,” in 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2018, pp. 1056–1065.
  108. H. Cui, S. Hirasawa, H. Kobayashi, and H. Takizawa, “A machine learning-based approach for selecting spmv kernels and matrix storage formats,” IEICE Trans. Inf. Syst., vol. 101-D, no. 9, pp. 2307–2314, 2018. [Online]. Available: https://doi.org/10.1587/transinf.2017EDP7176
  109. Y. Zhao, J. Li, C. Liao, and X. Shen, “Bridging the Gap Between Deep Learning and Sparse Matrix Format Selection,” in Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’18.   New York, NY, USA: Association for Computing Machinery, 2018, p. 94–108. [Online]. Available: https://doi.org/10.1145/3178487.3178495
  110. W. Zhou, Y. Zhao, X. Shen, and W. Chen, “Enabling Runtime SpMV Format Selection through an Overhead Conscious Method,” IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 1, pp. 80–93, 2020.
  111. A. Elafrou, G. Goumas, and N. Koziris, “BASMAT: Bottleneck-Aware Sparse Matrix-Vector Multiplication Auto-Tuning on GPGPUs,” in Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP), Washington, District of Columbia.   New York, NY, USA: Association for Computing Machinery, 2019, p. 423–424. [Online]. Available: https://doi.org/10.1145/3293883.3301490
  112. E. Dufrechou, P. Ezzatti, and E. S. Quintana-Ortí, “Selecting Optimal SpMV Realizations for GPUs via Machine Learning,” The International Journal of High Performance Computing Applications, vol. 35, no. 3, 2021. [Online]. Available: https://doi.org/10.1177/1094342021990738
  113. G. Xiao, T. Zhou, Y. Chen, Y. Hu, and K. Li, “Dtspmv: An adaptive spmv framework for graph analysis on gpus,” in 24th IEEE Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application, HPCC/DSS/SmartCity/DependSys 2022, Hainan, China, December 18-20, 2022.   IEEE, 2022, pp. 35–42. [Online]. Available: https://doi.org/10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00039
  114. S. Usman, R. Mehmood, I. Katib, A. Albeshri, and S. M. Altowaijri, “Zaki: A smart method and tool for automatic performance optimization of parallel spmv computations on distributed memory machines,” Mobile Networks and Applications, pp. 1–20, 2019.
  115. S. Usman, R. Mehmood, I. A. Katib, and A. Albeshri, “ZAKI+: A machine learning based process mapping tool for spmv computations on distributed memory architectures,” IEEE Access, vol. 7, pp. 81 279–81 296, 2019. [Online]. Available: https://doi.org/10.1109/ACCESS.2019.2923565
  116. M. Ahmed, S. Usman, N. A. Shah, M. U. Ashraf, A. M. Alghamdi, A. A. Bahadded, and K. A. Almarhabi, “AAQAL: A Machine Learning-Based Tool for Performance Optimization of Parallel SpMV Computations Using Block CSR,” Applied Sciences, vol. 12, no. 14, p. 7073, 2022.
  117. J. Gao, W. Ji, J. Liu, Y. Wang, and F. Shi, “Revisiting thread configuration of spmv kernels on gpu: A machine learning based approach,” Journal of Parallel and Distributed Computing, vol. 185, p. 104799, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0743731523001697
  118. A. Benatia, W. Ji, Y. Wang, and F. Shi, “Machine Learning Approach for the Predicting Performance of SpMV on GPU,” in 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS), 2016, pp. 894–901.
  119. ——, “Sparse Matrix Partitioning for Optimizing SpMV on CPU-GPU Heterogeneous Platforms,” The International Journal of High Performance Computing Applications, vol. 34, no. 1, pp. 66–80, 2020. [Online]. Available: https://doi.org/10.1177/1094342019886628
  120. M. Barreda, M. F. Dolz, M. A. Castaño, P. Alonso-Jordá, and E. S. Quintana-Orti, “Performance Modeling of the Sparse Matrix-Vector Product via Convolutional Neural Networks,” The Journal of Supercomputing, vol. 76, no. 11, pp. 8883–8900, 2020.
  121. M. Barreda, M. F. Dolz, and M. A. Castano, “Convolutional Neural Nets for Estimating the Run Time and Energy Consumption of the Sparse Matrix-Vector Product,” The International Journal of High Performance Computing Applications, vol. 35, no. 3, pp. 268–281, 2021.
  122. E. C. Carson and N. J. Higham, “Accelerating the solution of linear systems by iterative refinement in three precisions,” SIAM J. Sci. Comput., vol. 40, no. 2, 2018. [Online]. Available: https://doi.org/10.1137/17M1140819
  123. S. Gratton, E. Simon, D. Titley-Péloquin, and P. L. Toint, “Exploiting variable precision in GMRES,” CoRR, vol. abs/1907.10550, 2019. [Online]. Available: http://arxiv.org/abs/1907.10550
  124. J. I. Aliaga, H. Anzt, T. Grützmacher, E. S. Quintana-Ortí, and A. E. Tomás, “Compressed basis GMRES on high-performance graphics processing units,” Int. J. High Perform. Comput. Appl., vol. 37, no. 2, pp. 82–100, 2023. [Online]. Available: https://doi.org/10.1177/10943420221115140
  125. J. A. Loe, C. A. Glusa, I. Yamazaki, E. G. Boman, and S. Rajamanickam, “A study of mixed precision strategies for gmres on gpus,” arXiv preprint arXiv:2109.01232, 2021.
  126. M. Le Gallo, A. Sebastian, R. Mathis, M. Manica, H. Giefers, T. Tuma, C. Bekas, A. Curioni, and E. Eleftheriou, “Mixed-precision in-memory computing,” Nature Electronics, vol. 1, no. 4, pp. 246–253, 2018.
  127. N. Lindquist, P. Luszczek, and J. J. Dongarra, “Improving the performance of the GMRES method using mixed-precision techniques,” in Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI - 17th Smoky Mountains Computational Sciences and Engineering Conference, SMC 2020, Oak Ridge, TN, USA, August 26-28, 2020, Revised Selected Papers, ser. Communications in Computer and Information Science, J. Nichols, B. Verastegui, A. B. Maccabe, O. R. Hernandez, S. Parete-Koon, and T. Ahearn, Eds., vol. 1315.   Springer, 2020, pp. 51–66. [Online]. Available: https://doi.org/10.1007/978-3-030-63393-6\_4
  128. K. Ahmad, H. Sundar, and M. W. Hall, “Data-driven mixed precision sparse matrix vector multiplication for gpus,” ACM Trans. Archit. Code Optim., vol. 16, no. 4, pp. 51:1–51:24, 2020. [Online]. Available: https://doi.org/10.1145/3371275
  129. D. Mukunoki and T. Ogita, “Performance and energy consumption of accurate and mixed-precision linear algebra kernels on gpus,” J. Comput. Appl. Math., vol. 372, p. 112701, 2020. [Online]. Available: https://doi.org/10.1016/j.cam.2019.112701
  130. E. Tezcan, T. Torun, F. Kosar, K. Kaya, and D. Unat, “Mixed and multi-precision spmv for gpus with row-wise precision selection,” in 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Bordeaux, France, November 2-5, 2022.   IEEE, 2022, pp. 31–40. [Online]. Available: https://doi.org/10.1109/SBAC-PAD55451.2022.00014
  131. S. Graillat, F. Jézéquel, T. Mary, and R. Molina, “Adaptive precision matrix-vector product,” Feb. 2022, working paper or preprint. [Online]. Available: https://hal.science/hal-03561193
  132. K. Isupov, “Multiple-precision sparse matrix-vector multiplication on gpus,” J. Comput. Sci., vol. 61, p. 101609, 2022. [Online]. Available: https://doi.org/10.1016/j.jocs.2022.101609
  133. T. Kouya, “A highly efficient implementation of multiple precision sparse matrix-vector multiplication and its application to product-type krylov subspace methods,” CoRR, vol. abs/1411.2377, 2014. [Online]. Available: http://arxiv.org/abs/1411.2377
  134. H. Pabst, B. Bachmayer, and M. Klemm, “Performance of a structure-detecting spmv using the CSR matrix representation,” in 11th International Symposium on Parallel and Distributed Computing, ISPDC 2012, Munich, Germany, June 25-29, 2012, M. Bader, H. Bungartz, D. Grigoras, M. Mehl, R. Mundani, and R. Potolea, Eds.   IEEE Computer Society, 2012, pp. 3–10. [Online]. Available: https://doi.org/10.1109/ISPDC.2012.9
  135. Y. Zhang, W. Yang, K. Li, D. Tang, and K. Li, “Performance analysis and optimization for spmv based on aligned storage formats on an ARM processor,” J. Parallel Distributed Comput., vol. 158, pp. 126–137, 2021. [Online]. Available: https://doi.org/10.1016/j.jpdc.2021.08.002
  136. S. Williams, L. Oliker, R. W. Vuduc, J. Shalf, K. A. Yelick, and J. Demmel, “Optimization of sparse matrix-vector multiplication on emerging multicore platforms,” in Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, SC 2007, November 10-16, 2007, Reno, Nevada, USA, B. Verastegui, Ed.   ACM Press, 2007, p. 38. [Online]. Available: https://doi.org/10.1145/1362622.1362674
  137. ——, “Optimization of sparse matrix-vector multiplication on emerging multicore platforms,” Parallel Comput., vol. 35, no. 3, pp. 178–194, 2009. [Online]. Available: https://doi.org/10.1016/j.parco.2008.12.006
  138. O. Kislal, W. Ding, M. Kandemir, and I. Demirkiran, “Optimizing sparse matrix vector multiplication on emerging multicores,” in 2013 IEEE 6th International Workshop on Multi-/Many-core Computing Systems (MuCoCoS).   IEEE, 2013, pp. 1–10.
  139. E. Yuan, Y. Zhang, and X. Sun, “Memory access complexity analysis of spmv in RAM (h) model,” in 10th IEEE International Conference on High Performance Computing and Communications, HPCC 2008, 25-27 Sept. 2008, Dalian, China.   IEEE Computer Society, 2008, pp. 913–920. [Online]. Available: https://doi.org/10.1109/HPCC.2008.130
  140. B. C. Lee, R. Vuduc, J. W. Demmel, K. A. Yelick, M. deLorimier, and L. Zhong, “Performance optimizations and bounds for sparse symmetric matrix-multiple vector multiply,” University of California, Berkeley, Berkeley, CA, USA, Tech. Rep. UCB/CSD-03-1297, 2003.
  141. A. Elafrou, G. I. Goumas, and N. Koziris, “Performance analysis and optimization of sparse matrix-vector multiplication on modern multi- and many-core processors,” in 46th International Conference on Parallel Processing, ICPP 2017, Bristol, United Kingdom, August 14-17, 2017.   IEEE Computer Society, 2017, pp. 292–301. [Online]. Available: https://doi.org/10.1109/ICPP.2017.38
  142. J. D. Trotter, S. Ekmekçibasi, J. Langguth, T. Torun, E. Düzakin, A. Ilic, and D. Unat, “Bringing order to sparsity: A sparse matrix reordering study on multicore cpus,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023, Denver, CO, USA, November 12-17, 2023, D. Arnold, R. M. Badia, and K. M. Mohror, Eds.   ACM, 2023, pp. 31:1–31:13. [Online]. Available: https://doi.org/10.1145/3581784.3607046
  143. X. Yu, H. Ma, Z. Qu, J. Fang, and W. Liu, “Numa-aware optimization of sparse matrix-vector multiplication on armv8-based many-core architectures,” in Network and Parallel Computing - 17th IFIP WG 10.3 International Conference, NPC 2020, Zhengzhou, China, September 28-30, 2020, Revised Selected Papers, ser. Lecture Notes in Computer Science, X. He, E. Shao, and G. Tan, Eds., vol. 12639.   Springer, 2020, pp. 231–242. [Online]. Available: https://doi.org/10.1007/978-3-030-79478-1\_20
  144. J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable Parallel Programming With CUDA: Is CUDA the Parallel Programming Model That Application Developers Have Been Waiting For?” Queue, vol. 6, no. 2, p. 40–53, Mar 2008.
  145. M. M. Baskaran and R. Bordawekar, “Optimizing Sparse Matrix-Vector Multiplication on GPUs,” IBM Research Report RC24704, no. W0812–047, 2009.
  146. Y. Deng, B. D. Wang, and S. Mu, “Taming irregular EDA applications on gpus,” in 2009 International Conference on Computer-Aided Design, ICCAD 2009, San Jose, CA, USA, November 2-5, 2009, J. S. Roychowdhury, Ed.   ACM, 2009, pp. 539–546. [Online]. Available: https://doi.org/10.1145/1687399.1687501
  147. K. He, S. X. Tan, H. Zhao, X. Liu, H. Wang, and G. Shi, “Parallel GMRES solver for fast analysis of large linear dynamic systems on GPU platforms,” Integr., vol. 52, pp. 10–22, 2016. [Online]. Available: https://doi.org/10.1016/j.vlsi.2015.07.005
  148. E. Karimi, N. B. Agostini, S. Dong, and D. R. Kaeli, “VCSR: an efficient GPU memory-aware sparse format,” IEEE Trans. Parallel Distributed Syst., vol. 33, no. 10, pp. 3977–3989, 2022. [Online]. Available: https://doi.org/10.1109/TPDS.2022.3177291
  149. Y. Lu and W. Liu, “DASP: specific dense matrix multiply-accumulate units accelerated general sparse matrix-vector multiplication,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023, Denver, CO, USA, November 12-17, 2023, D. Arnold, R. M. Badia, and K. M. Mohror, Eds.   ACM, 2023, pp. 73:1–73:14. [Online]. Available: https://doi.org/10.1145/3581784.3607051
  150. D. Grewe and A. Lokhmotov, “Automatically generating and tuning GPU code for sparse matrix-vector multiplication from a high-level representation,” in Proceedings of 4th Workshop on General Purpose Processing on Graphics Processing Units, GPGPU 2011, Newport Beach, CA, USA, March 5, 2011.   ACM, 2011, p. 12. [Online]. Available: https://doi.org/10.1145/1964179.1964196
  151. A. Cevahir, A. Nukada, and S. Matsuoka, “Fast Conjugate Gradients with Multiple GPUs,” in Computational Science - ICCS, 9th International Conference, Baton Rouge, LA, USA, May 25-27, 2009, Proceedings, Part I, ser. Lecture Notes in Computer Science, G. Allen, J. Nabrzyski, E. Seidel, G. D. van Albada, J. J. Dongarra, and P. M. A. Sloot, Eds., vol. 5544.   Springer, 2009, pp. 893–903. [Online]. Available: https://doi.org/10.1007/978-3-642-01970-8\_90
  152. P. Guo and C. Zhang, “Performance Optimization for SpMV on Multi-GPU Systems Using Threads and Multiple Streams,” in International Symposium on Computer Architecture and High Performance Computing Workshops, SBAC-PAD Workshops , Los Angeles, CA, USA, October 26-28, 2016.   IEEE Computer Society, 2016, pp. 67–72. [Online]. Available: https://doi.org/10.1109/SBAC-PADW.2016.20
  153. M. Karwacki, B. Bylina, and J. Bylina, “Multi-GPU Implementation of the Uniformization Method for Solving Markov Models,” in Federated conference on computer science and information systems (FedCSIS).   IEEE, 2012, pp. 533–537.
  154. M. Verschoor and A. C. Jalba, “Analysis and Performance Estimation of the Conjugate Gradient Method on Multiple GPUs,” Parallel Computing, vol. 38, no. 10-11, pp. 552–575, 2012. [Online]. Available: https://doi.org/10.1016/j.parco.2012.07.002
  155. B. Yang, H. Liu, and Z. Chen, “Preconditioned GMRES Solver on Multiple-GPU Architecture,” Computers and Mathematics with Applications, vol. 72, no. 4, pp. 1076–1095, 2016. [Online]. Available: https://doi.org/10.1016/j.camwa.2016.06.027
  156. G. Karypis and V. Kumar, “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs,” SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359–392, 1998. [Online]. Available: https://doi.org/10.1137/S1064827595287997
  157. J. Gao, Y. Wang, J. Wang, and R. Liang, “Adaptive Optimization Modeling of Preconditioned Conjugate Gradient on Multi-GPUs,” ACM Transactions on Parallel Computing, vol. 3, no. 3, pp. 16:1–16:33, 2016. [Online]. Available: https://doi.org/10.1145/2990849
  158. J. Gao, Y. Zhou, G. He, and Y. Xia, “A Multi-GPU Parallel Optimization Model for the Preconditioned Conjugate Gradient Algorithm,” Parallel Computing, vol. 63, pp. 1–16, 2017. [Online]. Available: https://doi.org/10.1016/j.parco.2017.04.003
  159. J. Gao, Y. Wang, and J. Wang, “A Novel Multi-Graphics Processing Unit Parallel Optimization Framework for the Sparse Matrix-Vector Multiplication,” Concurrency Computation Practice and Experience, vol. 29, no. 5, 2017. [Online]. Available: https://doi.org/10.1002/cpe.3936
  160. C. Li, M. Tang, R. Tong, M. Cai, J. Zhao, and D. Manocha, “P-cloth: interactive complex cloth simulation on multi-gpu systems using dynamic matrix assembly and pipelined implicit integrators,” ACM Trans. Graph., vol. 39, no. 6, pp. 180:1–180:15, 2020. [Online]. Available: https://doi.org/10.1145/3414685.3417763
  161. J. Chen, C. Xie, J. S. Firoz, J. Li, S. L. Song, K. J. Barker, M. Raugas, and A. Li, “MSREP: A fast yet light sparse matrix framework for multi-gpu systems,” CoRR, vol. abs/2209.07552, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2209.07552
  162. D. Schaa and D. R. Kaeli, “Exploring the Multiple-GPU Design Space,” in 23rd IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2009, Rome, Italy, May 23-29, 2009.   IEEE, 2009, pp. 1–12. [Online]. Available: https://doi.org/10.1109/IPDPS.2009.5161068
  163. A. Abdelfattah, H. Ltaief, and D. E. Keyes, “High Performance Multi-GPU SpMV for Multi-Component PDE-Based Applications,” in Euro-Par: Parallel Processing - 21st International Conference on Parallel and Distributed Computing, Vienna, Austria, August 24-28, 2015, Proceedings, ser. Lecture Notes in Computer Science, J. L. Träff, S. Hunold, and F. Versaci, Eds., vol. 9233.   Springer, 2015, pp. 601–612. [Online]. Available: https://doi.org/10.1007/978-3-662-48096-0\_46
  164. Y. Shan, T. Wu, Y. Wang, B. Wang, Z. Wang, N. Xu, and H. Yang, “FPGA and GPU implementation of large scale spmv,” in IEEE 8th Symposium on Application Specific Processors, SASP 2010, Anaheim, CA, USA, June 13-14, 2010.   IEEE Computer Society, 2010, pp. 64–70. [Online]. Available: https://doi.org/10.1109/SASP.2010.5521144
  165. Y. Umuroglu and M. Jahre, “An energy efficient column-major backend for FPGA spmv accelerators,” in 32nd IEEE International Conference on Computer Design, ICCD 2014, Seoul, South Korea, October 19-22, 2014.   IEEE Computer Society, 2014, pp. 432–439. [Online]. Available: https://doi.org/10.1109/ICCD.2014.6974716
  166. J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, “A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication,” in 22nd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2014, Boston, MA, USA, May 11-13, 2014.   IEEE Computer Society, 2014, pp. 36–43. [Online]. Available: https://doi.org/10.1109/FCCM.2014.23
  167. J. Naher, C. Gloster, C. C. Doss, and S. S. Jadhav, “Using machine learning to estimate utilization and throughput for opencl-based matrix-vector multiplication (MVM),” in 10th Annual Computing and Communication Workshop and Conference, CCWC 2020, Las Vegas, NV, USA, January 6-8, 2020.   IEEE, 2020, pp. 365–372. [Online]. Available: https://doi.org/10.1109/CCWC47524.2020.9031173
  168. Y. Umuroglu and M. Jahre, “A vector caching scheme for streaming FPGA spmv accelerators,” in Applied Reconfigurable Computing - 11th International Symposium, ARC 2015, Bochum, Germany, April 13-17, 2015, Proceedings, ser. Lecture Notes in Computer Science, K. Sano, D. Soudris, M. Hübner, and P. C. Diniz, Eds., vol. 9040.   Springer, 2015, pp. 15–26. [Online]. Available: https://doi.org/10.1007/978-3-319-16214-0\_2
  169. ——, “Random access schemes for efficient FPGA spmv acceleration,” Microprocess. Microsystems, vol. 47, pp. 321–332, 2016. [Online]. Available: https://doi.org/10.1016/j.micpro.2016.02.015
  170. F. Sadi, J. Sweeney, T. M. Low, J. C. Hoe, L. T. Pileggi, and F. Franchetti, “Efficient spmv operation for large and highly sparse matrices using scalable multi-way merge parallelization,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2019, Columbus, OH, USA, October 12-16, 2019.   ACM, 2019, pp. 347–358. [Online]. Available: https://doi.org/10.1145/3352460.3358330
  171. M. Hosseinabady and J. L. Núñez-Yáñez, “A streaming dataflow engine for sparse matrix-vector multiplication using high-level synthesis,” IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., vol. 39, no. 6, pp. 1272–1285, 2020. [Online]. Available: https://doi.org/10.1109/TCAD.2019.2912923
  172. G. Oyarzun, D. Peyrolon, C. Álvarez, and X. Martorell, “An FPGA cached sparse matrix vector product (SpMV) for unstructured computational fluid dynamics simulations,” CoRR, vol. abs/2107.12371, 2021. [Online]. Available: https://arxiv.org/abs/2107.12371
  173. A. Parravicini, L. G. Cellamare, M. Siracusa, and M. D. Santambrogio, “Scaling up HBM efficiency of top-K SpMV for approximate embedding similarity on FPGAs,” in 58th ACM/IEEE Design Automation Conference, DAC 2021, San Francisco, CA, USA, December 5-9, 2021.   IEEE, 2021, pp. 799–804. [Online]. Available: https://doi.org/10.1109/DAC18074.2021.9586203
  174. B. Liu and D. Liu, “Towards high-bandwidth-utilization SpMV on FPGAs via partial vector duplication,” in Proceedings of the 28th Asia and South Pacific Design Automation Conference, ASPDAC 2023, Tokyo, Japan, January 16-19, 2023, A. Takahashi, Ed.   ACM, 2023, pp. 33–38. [Online]. Available: https://doi.org/10.1145/3566097.3567839
  175. M. Mahadurkar, N. Sivanandan, and S. Kala, “Hardware acceleration of SpMV multiplier for deep learning,” in 25th International Symposium on VLSI Design and Test, VDAT 2021, Surat, India, September 16-18, 2021.   IEEE, 2021, pp. 1–6. [Online]. Available: https://doi.org/10.1109/VDAT53777.2021.9600988
  176. T. Nguyen, C. MacLean, M. Siracusa, D. Doerfler, N. J. Wright, and S. Williams, “FPGA-based HPC accelerators: An evaluation on performance and energy efficiency,” Concurr. Comput. Pract. Exp., vol. 34, no. 20, 2022. [Online]. Available: https://doi.org/10.1002/cpe.6570
  177. F. Favaro, E. Dufrechou, J. P. Oliver, and P. Ezzatti, “Optimizing the performance of the sparse matrix–vector multiplication kernel in FPGA guided by the roofline model,” Micromachines, vol. 14, no. 11, p. 2030, Oct. 2023. [Online]. Available: http://dx.doi.org/10.3390/mi14112030
  178. X. Xie, Z. Liang, P. Gu, A. Basak, L. Deng, L. Liang, X. Hu, and Y. Xie, “SpaceA: Sparse matrix vector multiplication on processing-in-memory accelerator,” in IEEE International Symposium on High-Performance Computer Architecture, HPCA 2021, Seoul, South Korea, February 27 - March 3, 2021.   IEEE, 2021, pp. 570–583. [Online]. Available: https://doi.org/10.1109/HPCA51647.2021.00055
  179. W. Sun, Z. Li, S. Yin, S. Wei, and L. Liu, “ABC-DIMM: Alleviating the bottleneck of communication in DIMM-based near-memory processing with inter-DIMM broadcast,” in 48th ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2021, Valencia, Spain, June 14-18, 2021.   IEEE, 2021, pp. 237–250. [Online]. Available: https://doi.org/10.1109/ISCA52012.2021.00027
  180. C. Giannoula, I. Fernandez, J. Gómez-Luna, N. Koziris, G. I. Goumas, and O. Mutlu, “SparseP: Towards efficient sparse matrix vector multiplication on real processing-in-memory architectures,” Proc. ACM Meas. Anal. Comput. Syst., vol. 6, no. 1, pp. 21:1–21:49, 2022. [Online]. Available: https://doi.org/10.1145/3508041
  181. X. Liu, M. Smelyanskiy, E. Chow, and P. Dubey, “Efficient sparse matrix-vector multiplication on x86-based many-core processors,” in International Conference on Supercomputing, ICS’13, Eugene, OR, USA - June 10 - 14, 2013, A. D. Malony, M. Nemirovsky, and S. P. Midkiff, Eds.   ACM, 2013, pp. 273–282. [Online]. Available: https://doi.org/10.1145/2464996.2465013
  182. X. Chen, P. Xie, L. Chi, J. Liu, and C. Gong, “An efficient SIMD compression format for sparse matrix-vector multiplication,” Concurr. Comput. Pract. Exp., vol. 30, no. 23, 2018. [Online]. Available: https://doi.org/10.1002/cpe.4800
  183. B. Xie, J. Zhan, X. Liu, W. Gao, Z. Jia, X. He, and L. Zhang, “CVR: Efficient vectorization of SpMV on x86 processors,” in Proceedings of the 2018 International Symposium on Code Generation and Optimization, CGO 2018, Vösendorf / Vienna, Austria, February 24-28, 2018, J. Knoop, M. Schordan, T. Johnson, and M. F. P. O’Boyle, Eds.   ACM, 2018, pp. 149–162. [Online]. Available: https://doi.org/10.1145/3168818
  184. C. Liu, B. Xie, X. Liu, W. Xue, H. Yang, and X. Liu, “Towards efficient SpMV on Sunway manycore architectures,” in Proceedings of the 32nd International Conference on Supercomputing, ICS 2018, Beijing, China, June 12-15, 2018.   ACM, 2018, pp. 363–373. [Online]. Available: https://doi.org/10.1145/3205289.3205313
  185. Y. Chen, G. Xiao, F. Wu, Z. Tang, and K. Li, “tpSpMV: A two-phase large-scale sparse matrix-vector multiplication kernel for manycore architectures,” Inf. Sci., vol. 523, pp. 279–295, 2020. [Online]. Available: https://doi.org/10.1016/j.ins.2020.03.020
  186. G. Xiao, Y. Chen, C. Liu, and X. Zhou, “ahSpMV: An autotuning hybrid computing scheme for SpMV on the Sunway architecture,” IEEE Internet Things J., vol. 7, no. 3, pp. 1736–1744, 2020. [Online]. Available: https://doi.org/10.1109/JIOT.2019.2947257
  187. I. Mehrez and O. Hamdi-Larbi, “SMVP distribution using hypergraph model and S-GBNZ algorithm,” in Eighth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 3PGCIC 2013, Compiegne, France, October 28-30, 2013, F. Xhafa, L. Barolli, D. Nace, S. Venticinque, and A. Bui, Eds.   IEEE, 2013, pp. 235–241. [Online]. Available: https://doi.org/10.1109/3PGCIC.2013.41
  188. H. Mi, X. Yu, X. Yu, S. Wu, and W. Liu, “Balancing computation and communication in distributed sparse matrix-vector multiplication,” in 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023, Bangalore, India, May 1-4, 2023, Y. Simmhan, I. Altintas, A. L. Varbanescu, P. Balaji, A. S. Prasad, and L. Carnevale, Eds.   IEEE, 2023, pp. 535–544. [Online]. Available: https://doi.org/10.1109/CCGrid57682.2023.00056
  189. B. A. Page and P. M. Kogge, “Scalability of hybrid SpMV with hypergraph partitioning and vertex delegation for communication avoidance,” in International Conference on High Performance Computing & Simulation (HPCS 2020), 2021.
  190. C. Mayer, R. Mayer, S. Bhowmik, L. Epple, and K. Rothermel, “HYPE: massive hypergraph partitioning with neighborhood expansion,” in IEEE International Conference on Big Data (IEEE BigData 2018), Seattle, WA, USA, December 10-13, 2018, N. Abe, H. Liu, C. Pu, X. Hu, N. K. Ahmed, M. Qiao, Y. Song, D. Kossmann, B. Liu, K. Lee, J. Tang, J. He, and J. S. Saltz, Eds.   IEEE, 2018, pp. 458–467. [Online]. Available: https://doi.org/10.1109/BigData.2018.8621968
  191. S. Lin and Z. Xie, “A Jacobi_PCG Solver for Sparse Linear Systems on Multi-GPU Cluster,” The Journal of Supercomputing, vol. 73, no. 1, pp. 433–454, 2017. [Online]. Available: https://doi.org/10.1007/s11227-016-1887-4
  192. L. Z. Khodja, R. Couturier, A. Giersch, and J. M. Bahi, “Parallel sparse linear solver with GMRES method using minimization techniques of communications for GPU clusters,” J. Supercomput., vol. 69, no. 1, pp. 200–224, 2014. [Online]. Available: https://doi.org/10.1007/s11227-014-1143-8
  193. B. Bylina, J. Bylina, P. Stpiczynski, and D. Szalkowski, “Performance analysis of multicore and multinodal implementation of SpMV operation,” in Proceedings of the 2014 Federated Conference on Computer Science and Information Systems, Warsaw, Poland, September 7-10, 2014, ser. Annals of Computer Science and Information Systems, M. Ganzha, L. A. Maciaszek, and M. Paprzycki, Eds., vol. 2, 2014, pp. 569–576. [Online]. Available: https://doi.org/10.15439/2014F313
  194. S. Lee and R. Eigenmann, “Adaptive runtime tuning of parallel sparse matrix-vector multiplication on distributed memory systems,” in Proceedings of the 22nd Annual International Conference on Supercomputing, ICS 2008, Island of Kos, Greece, June 7-12, 2008, P. Zhou, Ed.   ACM, 2008, pp. 195–204. [Online]. Available: https://doi.org/10.1145/1375527.1375558
  195. W. Ma, Y. Hu, W. Yuan, and X. Liu, “Developing a multi-GPU-enabled preconditioned GMRES with inexact triangular solves for block sparse matrices,” Mathematical Problems in Engineering, vol. 2021, pp. 1–17, 2021.
  196. S. R. K. B. Indarapu, M. K. Maramreddy, and K. Kothapalli, “Architecture- and workload-aware heterogeneous algorithms for sparse matrix vector multiplication,” in Proceedings of the 7th ACM India Computing Conference, COMPUTE 2014, Nagpur, India, October 9-11, 2014, P. Bhattacharyya, P. J. Narayanan, and S. Padmanabhuni, Eds.   ACM, 2014, pp. 3:1–3:9. [Online]. Available: https://doi.org/10.1145/2675744.2675749
  197. V. Cardellini, A. Fanfarillo, and S. Filippone, “Heterogeneous sparse matrix computations on hybrid GPU/CPU platforms,” in Parallel Computing: Accelerating Computational Science and Engineering (CSE), Proceedings of the International Conference on Parallel Computing, ParCo 2013, 10-13 September 2013, Garching (near Munich), Germany, ser. Advances in Parallel Computing, M. Bader, A. Bode, H. Bungartz, M. Gerndt, G. R. Joubert, and F. J. Peters, Eds., vol. 25.   IOS Press, 2013, pp. 203–212. [Online]. Available: https://doi.org/10.3233/978-1-61499-381-0-203
  198. W. Yang, K. Li, Z. Mo, and K. Li, “Performance optimization using partitioned SpMV on GPUs and multicore CPUs,” IEEE Trans. Computers, vol. 64, no. 9, pp. 2623–2636, 2015. [Online]. Available: https://doi.org/10.1109/TC.2014.2366731
  199. W. Yang, K. Li, and K. Li, “A hybrid computing method of SpMV on CPU-GPU heterogeneous computing systems,” J. Parallel Distributed Comput., vol. 104, pp. 49–60, 2017. [Online]. Available: https://doi.org/10.1016/j.jpdc.2016.12.023
  200. T. D. Braun, H. J. Siegel, N. Beck, L. Bölöni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys, B. Yao, D. A. Hensgen, and R. F. Freund, “A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems,” J. Parallel Distributed Comput., vol. 61, no. 6, pp. 810–837, 2001. [Online]. Available: https://doi.org/10.1006/jpdc.2000.1714
  201. W. Liu and B. Vinter, “Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors,” Parallel Comput., vol. 49, pp. 179–193, 2015. [Online]. Available: https://doi.org/10.1016/j.parco.2015.04.004
  202. G. Xiao, K. Li, Y. Chen, W. He, A. Y. Zomaya, and T. Li, “CASpMV: A customized and accelerative SpMV framework for the Sunway TaihuLight,” IEEE Trans. Parallel Distributed Syst., vol. 32, no. 1, pp. 131–146, 2021. [Online]. Available: https://doi.org/10.1109/TPDS.2019.2907537
  203. W. Li, H. Cheng, Z. Lu, Y. Lu, and W. Liu, “HASpMV: Heterogeneity-aware sparse matrix-vector multiplication on modern asymmetric multicore processors,” in 2023 IEEE International Conference on Cluster Computing (CLUSTER).   IEEE Computer Society, 2023, pp. 209–220.
  204. T. A. Davis and Y. Hu, “The University of Florida Sparse Matrix Collection,” ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1:1–1:25, 2011. [Online]. Available: https://doi.org/10.1145/2049662.2049663
  205. NVIDIA, “NVIDIA cuSPARSE Library,” 2023. [Online]. Available: https://docs.nvidia.com/cuda/archive/12.0.0/index.html
  206. H. Anzt, W. Sawyer, S. Tomov, P. Luszczek, I. Yamazaki, and J. Dongarra, “Optimizing Krylov Subspace Solvers on Graphics Processing Units,” in Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014.   Phoenix, AZ: IEEE, May 2014.
  207. I. Yamazaki, H. Anzt, S. Tomov, M. Hoemmen, and J. Dongarra, “Improving the Performance of CA-GMRES on Multicores with Multiple GPUs,” in IPDPS 2014.   Phoenix, AZ: IEEE, May 2014.
  208. D. Merrill, “Merge-based Parallel Sparse Matrix-Vector Multiplication,” https://github.com/dumerrill/merge-spmv, (Accessed on 10/5/2023).
  209. T. Muhammed, R. Mehmood, A. Albeshri, and I. Katib, “SURAA: A Novel Method and Tool for Load-balanced and Coalesced SpMV Computations on GPUs,” Applied Sciences, vol. 9, no. 5, p. 947, Mar. 2019. [Online]. Available: http://dx.doi.org/10.3390/app9050947
  210. J.-H. Byun, R. Lin, K. A. Yelick, and J. Demmel, “Autotuning sparse matrix-vector multiplication for multicore,” EECS, UC Berkeley, Tech. Rep., 2012.
  211. G. Tan, J. Liu, and J. Li, “Design and Implementation of Adaptive SpMV Library for Multicore and Many-Core Architecture,” ACM Transactions on Mathematical Software, vol. 44, no. 4, pp. 46:1–46:25, 2018. [Online]. Available: https://doi.org/10.1145/3218823
  212. C. Chen, “Explicit caching HYB: A new high-performance SpMV framework on GPGPU,” arXiv preprint arXiv:2204.06666, 2022.
  213. Z. Du, J. Li, Y. Wang, X. Li, G. Tan, and N. Sun, “AlphaSparse: Generating High Performance SpMV Codes Directly from Sparse Matrices,” in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA, November 13-18, 2022.   IEEE, 2022, pp. 1–15. [Online]. Available: https://doi.org/10.1109/SC41404.2022.00071
Authors (4)
  1. Jianhua Gao (8 papers)
  2. Bingjie Liu (3 papers)
  3. Weixing Ji (9 papers)
  4. Hua Huang (70 papers)
Citations (1)
