Efficient Sparse Processing-in-Memory Architecture (ESPIM) for Machine Learning Inference (2404.04708v1)
Abstract: Emerging ML models (e.g., transformers) involve memory-pin-bandwidth-bound matrix-vector (MV) computation during inference. By avoiding pin crossings, processing in memory (PIM) can improve performance and energy for pin-bound workloads, as evidenced by recent commercial efforts in (digital) PIM. Sparse models can improve the performance and energy of inference without losing much accuracy. However, unstructured sparse inference injects the key challenges of uncertainty, irregularity, and load imbalance into a dense PIM's operation across all the banks, where the dense PIM reads the matrix cells from each bank and broadcasts the vector elements to all the banks, exploiting the DRAM organization. To address these challenges efficiently, we propose ESPIM, which makes four contributions: (1) Because matrix sparsity increases the vector-broadcast bandwidth demand per matrix column-read, ESPIM employs a fine-grained interleaving of the matrix cells so that each vector broadcast is shared among multiple rows in each bank, cutting the bandwidth demand. (2) ESPIM mostly avoids on-chip control's area and energy despite sparsity's uncertainties by exploiting the observation that the sparsity is data-dependent but static and known before inference; accordingly, ESPIM employs static data-dependent scheduling (SDDS). (3) ESPIM decouples the matrix cell values from their indices, placing the indices ahead of the values to enable prefetching of the matching vector elements; we extend SDDS for performance and correctness under the decoupled prefetching. (4) Finally, we simplify the switch required to select the vector elements that match the matrix cells, and extend SDDS to improve performance by reducing conflicts in the simplified switch. In our simulations, ESPIM achieves a 2x average speedup (up to 4.2x) and 34% average (up to 63%) lower energy relative to Newton, while incurring under 5% area overhead.
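To make contribution (3) concrete, below is a minimal behavioral sketch (in Python, not the paper's hardware) of the decoupled value/index layout: a per-bank sparse MV in which the column-index stream runs a fixed distance ahead of the value stream, so the matching vector elements can be fetched before the multiply-accumulates consume them. The CSR-style layout, the `PREFETCH_DEPTH` distance, and the `sparse_mv_bank` interface are illustrative assumptions, not details taken from the paper.

```python
# Behavioral sketch of index-ahead prefetching for a per-bank sparse MV.
# Assumptions (not from the paper): CSR-style streams and a fixed
# PREFETCH_DEPTH; ESPIM's actual layout, buffering, and scheduling (SDDS)
# are determined statically before inference.
from collections import deque

PREFETCH_DEPTH = 4  # assumed distance by which indices precede values


def sparse_mv_bank(values, col_indices, row_ptr, vector):
    """Compute one bank's rows of y = A @ x over a CSR-like stream.

    The index stream is consumed PREFETCH_DEPTH elements ahead of the
    value stream, modeling indices being placed ahead of values so the
    matching vector elements can be prefetched.
    """
    out = [0.0] * (len(row_ptr) - 1)
    nnz = len(values)
    fifo = deque()  # vector elements fetched early via the index stream
    pf = 0          # index-stream pointer (runs ahead of the MAC loop)

    for row in range(len(row_ptr) - 1):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Keep the index stream PREFETCH_DEPTH ahead of the values.
            while pf < nnz and pf - k < PREFETCH_DEPTH:
                fifo.append(vector[col_indices[pf]])  # early vector fetch
                pf += 1
            # MAC: the matching vector element is already buffered.
            out[row] += values[k] * fifo.popleft()
    return out


# Example: a 3x4 sparse matrix in CSR form and a dense length-4 vector.
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
cols = [0, 3, 1, 2, 3]
rptr = [0, 2, 3, 5]
vec = [1.0, 2.0, 3.0, 4.0]
print(sparse_mv_bank(vals, cols, rptr, vec))  # -> [6.0, 6.0, 32.0]
```

In the sketch, the FIFO stands in for the bank's buffered broadcast elements; because the sparsity pattern is static and known before inference, a scheme like SDDS can size and schedule this buffering ahead of time rather than with on-chip control.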
- J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ser. ISCA ’15. New York, NY, USA: ACM, 2015, pp. 105–117. [Online]. Available: http://doi.acm.org/10.1145/2749469.2750386
- B. Akin, F. Franchetti, and J. C. Hoe, “Data reorganization in memory using 3D-stacked DRAM,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ser. ISCA ’15. New York, NY, USA: ACM, 2015, pp. 131–143. [Online]. Available: http://doi.acm.org/10.1145/2749469.2750397
- J. Albericio, P. Judd, T. H. Hetherington, T. M. Aamodt, N. D. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network computing,” in 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18-22, 2016, 2016, pp. 1–13. [Online]. Available: http://dx.doi.org/10.1109/ISCA.2016.11
- A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S. Williams, P. Faraboschi, W.-m. W. Hwu, J. P. Strachan, K. Roy, and D. S. Milojicic, “PUMA: A programmable ultra-efficient memristor-based accelerator for machine learning inference,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’19. New York, NY, USA: ACM, 2019, pp. 715–731. [Online]. Available: http://doi.acm.org/10.1145/3297858.3304049
- A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing CUDA workloads using a detailed GPU simulator,” in 2009 IEEE International Symposium on Performance Analysis of Systems and Software, April 2009, pp. 163–174.
- R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, “CACTI 7: New tools for interconnect exploration in innovative off-chip memories,” ACM Trans. Archit. Code Optim., vol. 14, no. 2, Jun 2017. [Online]. Available: https://doi.org/10.1145/3085572
- J. B. Brockman, S. Thoziyoor, S. K. Kuntz, and P. M. Kogge, “A low cost, multithreaded processing-in-memory system,” in Proceedings of the 3rd Workshop on Memory Performance Issues: In Conjunction with the 31st International Symposium on Computer Architecture, ser. WMPI ’04. New York, NY, USA: ACM, 2004, pp. 16–22. [Online]. Available: http://doi.acm.org/10.1145/1054943.1054946
- “Creating sparse GPT-3 models with iterative pruning,” https://www.cerebras.net/blog/creating-sparse-gpt-3-models-with-iterative-pruning, Cerebras, accessed: 2023-11-14.
- P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory,” in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. Piscataway, NJ, USA: IEEE Press, 2016, pp. 27–39. [Online]. Available: https://doi.org/10.1109/ISCA.2016.13
- B. Y. Cho, W. S. Jeong, D. Oh, and W. W. Ro, “XSD: Accelerating MapReduce by Harnessing GPU inside SSD,” in 1st Workshop on Near Data Processing (WoNDP 2013) In Conjunction with the 46th International Symposium on Microarchitecture, 2013.
- R. Cong, W. He, M. Li, B. Luo, Z. Yang, Y. Yang, R. Huang, and B. Yan, “AttentionLego: An open-source building block for spatially-scalable large language model accelerator with processing-in-memory technology,” CoRR, vol. abs/2401.11459, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2401.11459
- A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, “NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules,” in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, Feb 2015, pp. 283–295.
- E. Frantar and D. Alistarh, “SparseGPT: Massive language models can be accurately pruned in one-shot,” 2023.
- M. Gao, G. Ayers, and C. Kozyrakis, “Practical near-data processing for in-memory analytics frameworks,” in 2015 International Conference on Parallel Architecture and Compilation (PACT). IEEE, 2015, pp. 113–124.
- M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scalable and efficient neural network acceleration with 3D memory,” in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 751–764. [Online]. Available: https://doi.org/10.1145/3037697.3037702
- A. Gondimalla, N. Chesnut, M. Thottethodi, and T. N. Vijaykumar, “SparTen: A sparse tensor accelerator for convolutional neural networks,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52. New York, NY, USA: ACM, 2019, pp. 151–165. [Online]. Available: http://doi.acm.org/10.1145/3352460.3358291
- Q. Guo, X. Guo, R. Patel, E. Ipek, and E. G. Friedman, “AC-DIMM: Associative computing with STT-MRAM,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ser. ISCA ’13. New York, NY, USA: ACM, 2013, pp. 189–200. [Online]. Available: http://doi.acm.org/10.1145/2485922.2485939
- M. Hall, P. Kogge, J. Koller, P. Diniz, J. Chame, J. Draper, J. LaCoss, J. Granacki, J. Brockman, A. Srivastava, W. Athas, V. Freeh, J. Shin, and J. Park, “Mapping irregular applications to DIVA, a PIM-based data-intensive architecture,” in Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, ser. SC ’99. New York, NY, USA: ACM, 1999. [Online]. Available: http://doi.acm.org/10.1145/331532.331589
- S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” 2016.
- S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” ser. NIPS’15. Cambridge, MA, USA: MIT Press, 2015, pp. 1135–1143.
- M. He, C. Song, I. Kim, C. Jeong, S. Kim, I. Park, M. Thottethodi, and T. N. Vijaykumar, “Newton: A DRAM-maker’s accelerator-in-memory (AiM) architecture for machine learning,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 372–385.
- R. Hojabr, A. Sedaghati, A. Sharifian, A. Khonsari, and A. Shriraman, “SPAGHETTI: Streaming accelerators for highly sparse GEMM on FPGAs,” in 2021 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2021, pp. 84–96.
- JEDEC Standard, “High bandwidth memory (HBM) DRAM, JESD235D,” 2015.
- Y. Kang, W. Huang, S.-M. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, and J. Torrellas, “FlexRAM: Toward an advanced intelligent memory system,” in Computer Design, 1999. (ICCD ’99) International Conference on, 1999, pp. 192–201.
- A. Kerr, D. Merrill, J. Demouth, and J. Tran, “CUTLASS: Fast linear algebra in CUDA C++,” 2017. [Online]. Available: https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/
- D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory,” in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. IEEE Press, 2016, pp. 380–392. [Online]. Available: https://doi.org/10.1109/ISCA.2016.41
- J. H. Kim, S.-h. Kang, S. Lee, H. Kim, W. Song, Y. Ro, S. Lee, D. Wang, H. Shin, B. Phuah, J. Choi, J. So, Y. Cho, J. Song, J. Choi, J. Cho, K. Sohn, Y. Sohn, K. Park, and N. S. Kim, “Aquabolt-XL: Samsung HBM2-PIM with in-memory processing for ML accelerators and beyond,” in 2021 IEEE Hot Chips 33 Symposium (HCS), 2021, pp. 1–26.
- Y.-C. Kwon, S. H. Lee, J. Lee, S.-H. Kwon, J. M. Ryu, J.-P. Son, O. Seongil, H.-S. Yu, H. Lee, S. Y. Kim, Y. Cho, J. G. Kim, J. Choi, H.-S. Shin, J. Kim, B. Phuah, H. Kim, M. J. Song, A. Choi, D. Kim, S. Kim, E.-B. Kim, D. Wang, S. Kang, Y. Ro, S. Seo, J. Song, J. Youn, K. Sohn, and N. S. Kim, “25.4 a 20nm 6GB function-in-memory DRAM, based on HBM2 with a 1.2 TFLOPS programmable computing unit using bank-level parallelism, for machine learning applications,” in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, 2021, pp. 350–352.
- D. U. Lee, K. W. Kim, K. W. Kim, H. Kim, J. Y. Kim, Y. J. Park, J. H. Kim, D. S. Kim, H. B. Park, J. W. Shin, J. H. Cho, K. H. Kwon, M. J. Kim, J. Lee, K. W. Park, B. Chung, and S. Hong, “25.2 A 1.2V 8Gb 8-channel 128GB/s high-bandwidth memory (HBM) stacked DRAM with effective microbump I/O test methods using 29nm process and TSV,” in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014, pp. 432–433.
- S. Lee, K. Kim, S. Oh, J. Park, G. Hong, D. Ka, K. Hwang, J. Park, K. Kang, J. Kim, J. Jeon, N. Kim, Y. Kwon, K. Vladimir, W. Shin, J. Won, M. Lee, H. Joo, H. Choi, J. Lee, D. Ko, Y. Jun, K. Cho, I. Kim, C. Song, C. Jeong, D. Kwon, J. Jang, I. Park, J. Chun, and J. Cho, “A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based accelerator-in-memory supporting 1 TFLOPS MAC operation and various activation functions for deep-learning applications,” in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, 2022, pp. 1–3.
- S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An instruction set architecture for neural networks,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), June 2016, pp. 393–405.
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
- Z.-G. Liu, P. N. Whatmough, Y. Zhu, and M. Mattina, “S2TA: Exploiting structured sparsity for energy-efficient mobile CNN acceleration,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022, pp. 573–586.
- R. Nair, S. F. Antao, C. Bertolli, P. Bose, J. R. Brunheroto, T. Chen, C. Y. Cher, C. H. A. Costa, J. Doi, C. Evangelinos, B. M. Fleischer, T. W. Fox, D. S. Gallo, L. Grinberg, J. A. Gunnels, A. C. Jacob, P. Jacob, H. M. Jacobson, T. Karkhanis, C. Kim, J. H. Moreno, J. K. O’Brien, M. Ohmacht, Y. Park, D. A. Prener, B. S. Rosenburg, K. D. Ryu, O. Sallenave, M. J. Serrano, P. D. M. Siegl, K. Sugavanam, and Z. Sura, “Active memory cube: A processing-in-memory architecture for exascale systems,” IBM Journal of Research and Development, vol. 59, no. 2/3, pp. 17:1–17:14, March 2015.
- NCSU, “FreePDK45.” [Online]. Available: https://www.eda.ncsu.edu/wiki/FreePDK45
- Neural Magic, “SparseZoo,” 2021. [Online]. Available: https://docs.neuralmagic.com/sparsezoo/
- Nitin, M. Thottethodi, and T. N. Vijaykumar, “Millipede: Die-stacked memory optimizations for big data machine learning analytics,” in 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018, pp. 160–171.
- S. Pal, J. Beaumont, D. Park, A. Amarnath, S. Feng, C. Chakrabarti, H. Kim, D. Blaauw, T. Mudge, and R. Dreslinski, “OuterSPACE: An outer product based sparse matrix multiplication accelerator,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018, pp. 724–736.
- A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An accelerator for compressed-sparse convolutional neural networks,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA ’17. New York, NY, USA: ACM, 2017, pp. 27–40. [Online]. Available: http://doi.acm.org/10.1145/3079856.3080254
- D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, “A case for intelligent RAM,” IEEE Micro, vol. 17, no. 2, pp. 34–44, Mar. 1997. [Online]. Available: http://dx.doi.org/10.1109/40.592312
- S. H. Pugsley, J. Jestes, H. Zhang, R. Balasubramonian, V. Srinivasan, A. Buyuktosunoglu, A. Davis, and F. Li, “NDC: analyzing the impact of 3d-stacked memory+logic devices on mapreduce workloads,” in 2014 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014, Monterey, CA, USA, March 23-25, 2014, 2014, pp. 190–200. [Online]. Available: http://dx.doi.org/10.1109/ISPASS.2014.6844483
- M. A. Raihan, N. Goli, and T. Aamodt, “Modeling deep learning accelerator enabled GPUs,” 2019.
- P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “DRAMSim2: A cycle accurate memory system simulator,” IEEE Computer Architecture Letters, vol. 10, no. 1, pp. 16–19, 2011.
- A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. Piscataway, NJ, USA: IEEE Press, 2016, pp. 14–26. [Online]. Available: https://doi.org/10.1109/ISCA.2016.12
- H. Shin, D. Kim, E. Park, S. Park, Y. Park, and S. Yoo, “McDRAM: Low latency and energy-efficient matrix computations in DRAM,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2613–2622, 2018.
- K. Sohn, W.-J. Yun, R. Oh, C.-S. Oh, S.-Y. Seo, M.-S. Park, D.-H. Shin, W.-C. Jung, S.-H. Shin, J.-M. Ryu, H.-S. Yu, J.-H. Jung, H. Lee, S.-Y. Kang, Y.-S. Sohn, J.-H. Choi, Y.-C. Bae, S.-J. Jang, and G. Jin, “A 1.2 V 20 nm 307 GB/s HBM DRAM with at-speed wafer-level IO test scheme and adaptive refresh considering temperature distribution,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 250–260, 2017.
- L. Song, X. Qian, H. Li, and Y. Chen, “PipeLayer: A pipelined ReRAM-based accelerator for deep learning,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2017, pp. 541–552.
- N. Srivastava, H. Jin, J. Liu, D. Albonesi, and Z. Zhang, “MatRaptor: A sparse-sparse matrix multiplication accelerator based on row-wise product,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 766–780.
- H. S. Stone, “A logic-in-memory computer,” Computers, IEEE Transactions on, vol. C-19, no. 1, pp. 73–78, Jan 1970.
- M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A simple and effective pruning approach for large language models,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=PxoFut3dWW
- H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “LLaMA: Open and efficient foundation language models,” 2023.
- Y. Wang, C. Zhang, Z. Xie, C. Guo, Y. Liu, and J. Leng, “Dual-side sparse tensor core,” in Proceedings of the 48th Annual International Symposium on Computer Architecture, ser. ISCA ’21. IEEE Press, 2021, pp. 1083–1095. [Online]. Available: https://doi.org/10.1109/ISCA52012.2021.00088
- W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, Eds., vol. 29. Curran Associates, Inc., 2016. [Online]. Available: https://proceedings.neurips.cc/paper/2016/file/41bfd20a38bb1b0bec75acf0845530a7-Paper.pdf
- M. Xia, T. Gao, Z. Zeng, and D. Chen, “Sheared LLaMA: Accelerating language model pre-training via structured pruning,” 2023.
- X. Xie, Z. Liang, P. Gu, A. Basak, L. Deng, L. Liang, X. Hu, and Y. Xie, “SpaceA: Sparse matrix vector multiplication on processing-in-memory accelerator,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021, pp. 570–583.
- G. Zhang, N. Attaluri, J. S. Emer, and D. Sanchez, “Gamma: Leveraging Gustavson’s algorithm to accelerate sparse matrix multiplication,” in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’21. New York, NY, USA: Association for Computing Machinery, 2021, pp. 687–701. [Online]. Available: https://doi.org/10.1145/3445814.3446702
- S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-x: An accelerator for sparse neural networks,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2016, pp. 1–12.
- Z. Zhang, H. Wang, S. Han, and W. J. Dally, “SpArch: Efficient architecture for sparse matrix multiplication,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 261–274.
- M. Zhu, T. Zhang, Z. Gu, and Y. Xie, “Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern GPUs,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52. New York, NY, USA: Association for Computing Machinery, 2019, pp. 359–371. [Online]. Available: https://doi.org/10.1145/3352460.3358269