IOPS: An Unified SpMM Accelerator Based on Inner-Outer-Hybrid Product (2312.12766v1)

Published 20 Dec 2023 in cs.AR

Abstract: Sparse matrix multiplication (SpMM) is widely applied in numerous domains, such as graph processing, machine learning, and data analytics. However, inner-product-based SpMM performs redundant zero-element computation on mismatched nonzero operands, while the outer-product-based approach lacks input reuse across Processing Elements (PEs) and has poor output locality when accumulating partial-sum (psum) matrices. In addition, existing works focus only on sparse-sparse matrix multiplication (SSMM) or sparse-dense matrix multiplication (SDMM), and rarely perform efficiently for both. To address these problems, this paper proposes a unified SpMM accelerator, called IOPS, that hybridizes inner and outer products: it reuses the input matrix among PEs with an inner-product dataflow and removes zero-element calculations with an outer-product approach within each PE, so it can efficiently process both SSMM and SDMM. Moreover, an address mapping method is designed to accumulate the irregular sparse psum matrices, reducing the latency and DRAM accesses of psum accumulation. Furthermore, an adaptive partition strategy tiles the input matrices according to their sparsity ratios, making effective use of the architecture's storage and reducing DRAM accesses. Compared with the SSMM accelerator SpArch, IOPS achieves 1.7x~6.3x energy efficiency and 1.2x~4.4x resource efficiency, with 1.4x~2.1x DRAM access savings.
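
To make the dataflow contrast in the abstract concrete, the sketch below implements plain inner-product and outer-product SpMM in Python over dictionary-based sparse operands. This is an illustrative sketch of the two baseline dataflows only, not the IOPS architecture or its inner-outer-hybrid design; the data layouts, names (A_rows, B_cols, etc.), and sizes are assumptions chosen for the example.

from collections import defaultdict

def inner_product_spmm(A_rows, B_cols, n, m):
    # Inner product: C[i][j] = dot(row i of A, column j of B).
    # Every (i, j) output position is visited; when the nonzero indices of the
    # row and the column do not overlap, the dot product is empty and the
    # visit is wasted work (the "redundant zero-element computing" above).
    C = defaultdict(float)
    for i in range(n):
        for j in range(m):
            row = A_rows.get(i, {})
            col = B_cols.get(j, {})
            s = sum(row[k] * col[k] for k in row.keys() & col.keys())
            if s != 0.0:
                C[(i, j)] = s
    return C

def outer_product_spmm(A_cols, B_rows):
    # Outer product: C = sum over k of (column k of A) x (row k of B).
    # Only nonzero operand pairs are multiplied, but every k emits an
    # irregular partial-sum (psum) matrix that must be merged into C, which
    # is where the poor output locality for psum accumulation comes from.
    C = defaultdict(float)
    for k in A_cols.keys() & B_rows.keys():
        for i, a in A_cols[k].items():
            for j, b in B_rows[k].items():
                C[(i, j)] += a * b  # psum accumulation
    return C

# Tiny example: A is 2x3 and B is 3x2, both sparse, stored twice so each
# dataflow gets the access order it prefers (row-major vs. column-major).
A_rows = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
A_cols = {0: {0: 1.0}, 1: {1: 3.0}, 2: {0: 2.0}}
B_rows = {0: {1: 4.0}, 2: {0: 5.0}}
B_cols = {0: {2: 5.0}, 1: {0: 4.0}}

assert inner_product_spmm(A_rows, B_cols, 2, 2) == outer_product_spmm(A_cols, B_rows)

The sketch shows only the two baselines that IOPS combines: per the abstract, the accelerator keeps the input-reuse benefit of the inner-product loop structure across PEs while performing the zero-skipping, outer-product-style multiplication inside each PE.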

References (32)
  1. P. D’alberto and A. Nicolau, “R-kleene: A high-performance divide-and-conquer algorithm for the all-pair shortest path for densely connected networks,” Algorithmica, vol. 47, no. 2, pp. 203–213, Feb. 2007.
  2. R. Ying, R. He, et al., “Graph convolutional neural networks for web-scale recommender systems,” in ACM KDD, 2018, pp. 974–983.
  3. S. Goedecker and L. Colombo, “Tight binding molecular dynamics on parallel computers,” USA, Tech. Rep., 1994.
  4. T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in arXiv, 2017. [Online]. Available: https://arxiv.org/abs/1609.02907
  5. J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data, Jun. 2014.
  6. N. P. Jouppi, C. Young, et al., “In-datacenter performance analysis of a tensor processing unit,” in ISCA, 2017, pp. 1–12.
  7. G. Jeong, E. Qin, et al., “Rasa: Efficient register-aware systolic array matrix engine for cpu,” in DAC, 2021, pp. 253–258.
  8. Y. Zhu et al., “Exploiting parallelism with vertex-clustering in processing-in-memory-based gcn accelerators,” in DATE, 2022, pp. 652–657.
  9. ——, “Exploiting parallelism with vertex-clustering in processing-in-memory-based gcn accelerators,” in DATE, 2022, pp. 652–657.
  10. M. Yan, L. Deng, X. Hu, L. Liang, Y. Feng, X. Ye, Z. Zhang, D. Fan, and Y. Xie, “Hygcn: A gcn accelerator with hybrid architecture,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 15–29.
  11. M. Soltaniyeh et al., “An accelerator for sparse convolutional neural networks leveraging systolic general matrix-matrix multiplication,” ACM Trans. Archit. Code Optim., vol. 19, no. 3, May 2022.
  12. S. Dalton, L. Olson, and N. Bell, “Optimizing sparse matrix-matrix multiplication for the gpu,” ACM Trans. Math. Softw., vol. 41, no. 4, Oct. 2015. [Online]. Available: https://doi.org/10.1145/2699470
  13. W. Liu and B. Vinter, “An efficient gpu general sparse matrix-matrix multiplication for irregular data,” in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014, pp. 370–381.
  14. J. Lee, Kang, et al., “Optimization of gpu-based sparse matrix multiplication for large sparse networks,” in ICDE, 2020, pp. 925–936.
  15. N. G, N. S, and K. S, “Moscon: Modified outer product based sparse matrix-matrix multiplication accelerator with configurable tiles,” in VLSID, 2023, pp. 264–269.
  16. S. Pal, J. Beaumont, and et al, “Outerspace: An outer product based sparse matrix multiplication accelerator,” in HPCA, 2018, pp. 724–736.
  17. Z. Zhang, H. Wang, and et al, “Sparch: Efficient architecture for sparse matrix multiplication,” in HPCA, 2020, pp. 261–274.
  18. T. Geng, A. Li, R. Shi, C. Wu, T. Wang, Y. Li, P. Haghi, A. Tumeo, S. Che, S. Reinhardt, and M. C. Herbordt, “Awb-gcn: A graph convolutional network accelerator with runtime workload rebalancing,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 922–936.
  19. J. Li, A. Louri, A. Karanth, and R. Bunescu, “Gcnax: A flexible and energy-efficient accelerator for graph convolutional neural networks,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021, pp. 775–788.
  20. N. Srivastava, H. Jin, et al., “Matraptor: A sparse-sparse matrix multiplication accelerator based on row-wise product,” in MICRO, 2020, pp. 766–780.
  21. H. Kaplan, M. Sharir, and E. Verbin, “Colored intersection searching via sparse rectangular matrix multiplication,” in Proceedings of the Twenty-Second Annual Symposium on Computational Geometry, ser. SCG ’06. New York, NY, USA: Association for Computing Machinery, 2006, pp. 52–60. [Online]. Available: https://doi.org/10.1145/1137856.1137866
  22. J. R. Gilbert, S. Reinhardt, and V. B. Shah, “A unified framework for numerical and combinatorial computing,” Computing in Science and Engg., vol. 10, no. 2, pp. 20–25, Mar. 2008. [Online]. Available: https://doi.org/10.1109/MCSE.2008.45
  23. M. O. Rabin and V. V. Vazirani, “Maximum matchings in general graphs through randomization,” Journal of Algorithms, vol. 10, no. 4, pp. 557–567, 1989. [Online]. Available: https://www.sciencedirect.com/science/article/pii/0196677489900059
  24. W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY, USA: Curran Associates Inc., 2017, pp. 1025–1035.
  25. K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” ArXiv, vol. abs/1810.00826, 2018. [Online]. Available: https://arxiv.org/abs/1810.00826
  26. P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” 2018. [Online]. Available: https://arxiv.org/abs/1710.10903
  27. A. McCallum, “Cora,” http://www.cs.umd.edu/~sen/lbc-proj/LBC.html.
  28. C. L. Giles, K. D. Bollacker, and S. Lawrence, “Citeseer: An automatic citation indexing system,” in Proceedings of the Third ACM Conference on Digital Libraries, ser. DL ’98, 1998, pp. 89–98.
  29. P. Sen, G. Namata, et al., “Collective classification in network data,” AI Magazine, 2008.
  30. A. Carlson, J. Betteridge, et al., “Toward an architecture for never-ending language learning,” in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence. AAAI Press, 2010, pp. 1306–1313.
  31. P. Yanardag and S. Vishwanathan, “Deep graph kernels,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 1365–1374. [Online]. Available: https://doi.org/10.1145/2783258.2783417
  32. O. Villa, D. R. Johnson, M. Oconnor, E. Bolotin, D. Nellans, J. Luitjens, N. Sakharnykh, P. Wang, P. Micikevicius, A. Scudiero, S. W. Keckler, and W. J. Dally, “Scaling the power wall: A path to exascale,” in SC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2014, pp. 830–841.
