Balanced Data Placement for GEMV Acceleration with Processing-In-Memory (2403.20297v2)
Abstract: With unprecedented demand for generative AI (GenAI) inference, acceleration of the primitives that dominate GenAI, such as general matrix-vector multiplication (GEMV), is receiving considerable attention. A key challenge with GEMVs is the high memory bandwidth this primitive demands. Multiple memory vendors have proposed commercially viable processing-in-memory (PIM) prototypes that attain a bandwidth boost over the processor by augmenting memory banks with compute capabilities and broadcasting the same command to all banks. While these PIM designs stand to accelerate GEMV, we observe in this work that a key impediment to truly harnessing PIM acceleration is deducing the optimal placement of the matrix across memory banks. To this end, we tease out several factors that impact data placement and propose the PIMnast methodology which, like a gymnast, balances these factors to identify data placements that deliver GEMV acceleration. Across a spectrum of GenAI models, our PIMnast methodology, along with additional orchestration knobs we identify, delivers up to a 6.86$\times$ speedup for GEMVs (of the available 7$\times$ roofline speedup), leading to up to a 5$\times$ speedup in per-token latency.
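Because all PIM banks execute the same broadcast command in lockstep, the most heavily loaded bank gates when a GEMV completes; a placement that spreads the matrix evenly across banks is therefore what lets throughput approach the roofline. The sketch below is a minimal toy model of that idea, assuming a hypothetical 16-bank PIM and a simple round-robin row placement for illustration; it is not the paper's PIMnast algorithm, which balances several additional placement factors.

```python
import numpy as np

# Minimal toy model of bank-level PIM GEMV (illustrative only; the bank
# count and round-robin placement are assumptions, not PIMnast itself).

NUM_BANKS = 16  # hypothetical number of PIM-enabled banks

def place_rows_round_robin(matrix, num_banks=NUM_BANKS):
    # Deal rows to banks round-robin so each bank holds a near-equal
    # share of the GEMV work (a simple "balanced" placement).
    return [matrix[b::num_banks] for b in range(num_banks)]

def pim_gemv(bank_shards, x):
    # Each bank computes partial dot products over its resident rows;
    # since one broadcast command drives every bank, the most loaded
    # bank determines when the GEMV finishes.
    num_banks = len(bank_shards)
    partials = [shard @ x for shard in bank_shards]
    # Reassemble per-bank results into the original row order.
    y = np.empty(sum(p.shape[0] for p in partials), dtype=x.dtype)
    for b, p in enumerate(partials):
        y[b::num_banks] = p
    return y

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 64)).astype(np.float32)
x = rng.standard_normal(64).astype(np.float32)

shards = place_rows_round_robin(A)
assert np.allclose(pim_gemv(shards, x), A @ x, atol=1e-4)
```

In this toy model, a skewed placement (e.g., contiguous row blocks of unequal size) would leave some banks idle while the most loaded one finishes; that imbalance is precisely what a balanced placement methodology aims to avoid.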