Inclusive-PIM: Hardware-Software Co-design for Broad Acceleration on Commercial PIM Architectures (2309.07984v3)
Abstract: Continual demand for memory bandwidth has made it worthwhile for memory vendors to reassess processing in memory (PIM), which enables higher bandwidth by placing compute units in/near-memory. As such, memory vendors have recently proposed commercially viable PIM designs. However, these proposals are largely driven by the needs of (a narrow set of) ML primitives. While such proposals are reasonable given the growing importance of ML, as memory is a pervasive component, there is a case for a more inclusive PIM design that can accelerate primitives across domains. In this work, we ascertain the capabilities of commercial PIM proposals to accelerate various primitives across domains. We begin by outlining a set of characteristics, termed the PIM-amenability-test, which aid in assessing whether a given primitive is likely to be accelerated by PIM. Next, we apply this test to the primitives under study to ascertain efficient data placement and orchestration for mapping the primitives to the underlying PIM architecture. We observe that, even though the primitives under study are largely PIM-amenable, existing commercial PIM proposals do not realize their performance potential for these primitives. To address this, we identify bottlenecks that arise in PIM execution and propose hardware and software optimizations which stand to broaden the acceleration reach of commercial PIM designs (improving average PIM speedups from 1.12x to 2.49x relative to a GPU baseline). Overall, while we believe emerging commercial PIM proposals add a necessary and complementary design point in the application acceleration space, hardware-software co-design is necessary to deliver their benefits broadly.
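The abstract's core idea, that a primitive benefits from PIM when it is bandwidth-bound enough for in-memory bandwidth to outweigh PIM's weaker compute, can be sketched with a simple roofline-style comparison. The machine parameters and the `Primitive`/`pim_amenable` helpers below are illustrative assumptions, not the paper's actual PIM-amenability-test or measured hardware numbers:

```python
from dataclasses import dataclass

# Hypothetical machine parameters (illustrative, not from the paper):
# a GPU-class host with HBM vs. the larger aggregate bandwidth PIM
# exposes inside the memory, paired with far weaker PIM compute.
HOST_BW_GBS = 1600.0    # host-visible HBM bandwidth, GB/s
PIM_BW_GBS = 6400.0     # aggregate in-memory bandwidth, GB/s
HOST_FLOPS_G = 40000.0  # host peak GFLOP/s
PIM_FLOPS_G = 2000.0    # aggregate PIM GFLOP/s

@dataclass
class Primitive:
    name: str
    flops: float        # total floating-point operations
    bytes_moved: float  # total bytes read/written from memory

def roofline_time(p: Primitive, bw_gbs: float, flops_g: float) -> float:
    """Execution time is bounded by the slower of compute and memory traffic."""
    return max(p.flops / (flops_g * 1e9), p.bytes_moved / (bw_gbs * 1e9))

def pim_amenable(p: Primitive) -> bool:
    """In this sketch, a primitive 'passes' if its roofline time on PIM
    beats the host, i.e. it is bandwidth-bound enough that PIM's extra
    bandwidth outweighs its weaker compute units."""
    return (roofline_time(p, PIM_BW_GBS, PIM_FLOPS_G)
            < roofline_time(p, HOST_BW_GBS, HOST_FLOPS_G))

# SpMV-like streaming kernel: ~0.25 flops/byte, heavily bandwidth-bound.
spmv = Primitive("spmv", flops=2e9, bytes_moved=8e9)
# Dense GEMM: high arithmetic intensity, compute-bound on the host.
gemm = Primitive("gemm", flops=1e12, bytes_moved=4e9)

print(pim_amenable(spmv))  # True: bandwidth-bound, likely PIM-amenable
print(pim_amenable(gemm))  # False: compute-bound, better on the host
```

The paper's actual test also weighs data-placement and orchestration concerns (e.g. cross-bank communication) that a pure roofline model ignores; this sketch captures only the bandwidth-versus-compute trade-off.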
Authors: Johnathan Alsop, Shaizeen Aga, Mohamed Ibrahim, Mahzabeen Islam, Nuwan Jayasena, Andrew McCrabb