PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures (2402.16731v7)

Published 26 Feb 2024 in cs.AR, cs.DC, cs.LG, and cs.PF

Abstract: Graph Neural Networks (GNNs) are emerging ML models for analyzing graph-structured data. GNN execution involves both compute-intensive and memory-intensive kernels; the latter dominate total execution time, being significantly bottlenecked by data movement between memory and processors. Processing-In-Memory (PIM) systems can alleviate this data movement bottleneck by placing simple processors near or inside memory arrays. In this work, we introduce PyGim, an efficient ML library that accelerates GNNs on real PIM systems. We propose intelligent parallelization techniques for the memory-intensive kernels of GNNs tailored to real PIM systems, and develop a handy Python API for them. We provide hybrid GNN execution, in which the compute-intensive and memory-intensive kernels are executed in processor-centric and memory-centric computing systems, respectively. We extensively evaluate PyGim on a real-world PIM system with 1992 PIM cores using emerging GNN models, and demonstrate that it outperforms its state-of-the-art CPU counterpart on Intel Xeon by 3.04x on average, and achieves higher resource utilization than CPU and GPU systems. Our work provides useful recommendations for software, system and hardware designers. PyGim is publicly available at https://github.com/CMU-SAFARI/PyGim.
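
The hybrid execution scheme described in the abstract splits a GNN layer into a memory-intensive aggregation step (a sparse-dense matrix multiplication, which PyGim offloads to the PIM cores) and a compute-intensive combination step (a dense matrix multiplication kept on the host processor or GPU). The sketch below illustrates this split for one GCN-style layer. It is a minimal illustration only: the function names are hypothetical placeholders, and the "PIM" kernel is simply stood in by a host-side sparse multiply so the example runs anywhere; it is not the actual PyGim API.

```python
# Illustrative sketch of hybrid GNN-layer execution (aggregation vs. combination).
# Assumption: function names and the host-side stand-in for the PIM SpMM kernel
# are placeholders, not PyGim's real interface.
import numpy as np
import scipy.sparse as sp

def aggregate_on_pim(adj_csr, features):
    # Memory-intensive step: sparse adjacency times dense feature matrix.
    # In PyGim this kernel would run on the PIM cores; here it runs on the
    # host so the sketch is self-contained and runnable.
    return adj_csr @ features

def combine_on_host(aggregated, weight):
    # Compute-intensive step: dense matmul plus activation, kept on the
    # processor-centric system (CPU/GPU).
    return np.maximum(aggregated @ weight, 0.0)  # ReLU

# Toy graph: 4 nodes with self-loops, 8-dimensional input features.
adj = sp.csr_matrix(np.array([[1, 1, 0, 0],
                              [1, 1, 1, 0],
                              [0, 1, 1, 1],
                              [0, 0, 1, 1]], dtype=np.float32))
x = np.random.rand(4, 8).astype(np.float32)
w = np.random.rand(8, 16).astype(np.float32)

h = combine_on_host(aggregate_on_pim(adj, x), w)  # one GCN-style layer
print(h.shape)  # (4, 16)
```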
