WWW: What, When, Where to Compute-in-Memory (2312.15896v2)
Abstract: Compute-in-memory (CiM) has emerged as a highly energy-efficient solution for performing matrix multiplication during Machine Learning (ML) inference. However, integrating compute in memory raises key questions: 1) What type of CiM to use: given the multitude of CiM design characteristics, their suitability must be assessed from an architecture perspective. 2) When to use CiM: ML inference includes workloads with a variety of memory and compute requirements, making it difficult to identify when CiM is more beneficial. 3) Where to integrate CiM: each memory level has different bandwidth and capacity, creating different data-reuse opportunities for CiM integration. To answer these questions about on-chip CiM integration for accelerating ML workloads, we use an analytical architecture-evaluation methodology in which we tailor the dataflow mapping. The mapping algorithm aims to achieve the highest weight reuse and the fewest data movements for a given CiM prototype and workload. Our experiments show that CiM-integrated memory improves energy efficiency by up to 3.4x and throughput by up to 15.6x compared to a tensor-core-like baseline architecture, at INT-8 precision and under iso-area constraints. We believe the proposed work provides insights into what type of CiM to use, and when and where to optimally integrate it in the cache hierarchy for efficient matrix multiplication.
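To make the mapping idea concrete, below is a minimal, hypothetical Python sketch of an analytical mapping search for a single GEMM: it enumerates tile sizes, rejects tilings whose weight tile exceeds an assumed CiM capacity, and picks the tiling with the highest weight reuse (breaking ties on lower data movement). The cost model, tile choices, and function names are illustrative assumptions, not the authors' actual evaluation methodology or tool flow.

```python
# Illustrative sketch only: a toy analytical mapping search in the spirit of the
# paper's methodology. The cost model and parameters below are assumptions.
from itertools import product


def evaluate_mapping(M, K, N, tile_m, tile_k, tile_n, cim_capacity_weights):
    """Return (weight_reuse, data_moved_words) for one GEMM tiling,
    or None if the weight tile does not fit in the CiM-enabled memory level."""
    if tile_k * tile_n > cim_capacity_weights:  # weight tile must fit in CiM
        return None
    macs = M * K * N  # total multiply-accumulates
    # Weight-stationary style reuse: each weight word is fetched once per pass
    # over the M dimension of its tile, so reuse grows with tile_m (assumption).
    m_passes = -(-M // tile_m)  # ceil division
    weight_loads = K * N * m_passes
    # Crude traffic model: inputs stream once per weight-tile column block,
    # outputs are written once; partial sums are assumed to stay local.
    input_moves = M * K * -(-N // tile_n)
    output_moves = M * N
    data_moved = weight_loads + input_moves + output_moves
    return macs / weight_loads, data_moved


def best_mapping(M, K, N, cim_capacity_weights, tile_choices=(16, 32, 64, 128)):
    """Pick the tiling with the highest weight reuse, then the least traffic."""
    best, best_key = None, None
    for tm, tk, tn in product(tile_choices, repeat=3):
        result = evaluate_mapping(M, K, N, tm, tk, tn, cim_capacity_weights)
        if result is None:
            continue
        reuse, traffic = result
        key = (reuse, -traffic)
        if best_key is None or key > best_key:
            best, best_key = (tm, tk, tn, reuse, traffic), key
    return best


if __name__ == "__main__":
    # Example: a transformer-like GEMM mapped onto a hypothetical CiM-enabled
    # SRAM level that can hold 64K INT-8 weight words (illustrative numbers).
    print(best_mapping(M=512, K=768, N=768, cim_capacity_weights=64 * 1024))
```

A real evaluation would replace this crude traffic model with per-memory-level energy and bandwidth numbers and sweep workloads and CiM prototypes, but the structure (enumerate mappings, filter by capacity, rank by reuse and data movement) mirrors the search described in the abstract.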
Authors: Tanvi Sharma, Mustafa Ali, Indranil Chakraborty, Kaushik Roy