
Towards Efficient SRAM-PIM Architecture Design by Exploiting Unstructured Bit-Level Sparsity (2404.09497v1)

Published 15 Apr 2024 in cs.AR

Abstract: Bit-level sparsity in neural network models harbors immense untapped potential. Eliminating redundant calculations of randomly distributed zero-bits significantly boosts computational efficiency. Yet, traditional digital SRAM-PIM architecture, limited by rigid crossbar architecture, struggles to effectively exploit this unstructured sparsity. To address this challenge, we propose Dyadic Block PIM (DB-PIM), a groundbreaking algorithm-architecture co-design framework. First, we propose an algorithm coupled with a distinctive sparsity pattern, termed a dyadic block (DB), that preserves the random distribution of non-zero bits to maintain accuracy while restricting the number of these bits in each weight to improve regularity. Architecturally, we develop a custom PIM macro that includes dyadic block multiplication units (DBMUs) and Canonical Signed Digit (CSD)-based adder trees, specifically tailored for Multiply-Accumulate (MAC) operations. An input pre-processing unit (IPU) further refines performance and efficiency by capitalizing on block-wise input sparsity. Results show that our proposed co-design framework achieves a remarkable speedup of up to 7.69x and energy savings of 83.43%.

Authors (10)
  1. Cenlin Duan
  2. Jianlei Yang
  3. Yiou Wang
  4. Yikun Wang
  5. Yingjie Qi
  6. Xiaolin He
  7. Bonan Yan
  8. Xueyan Wang
  9. Xiaotao Jia
  10. Weisheng Zhao

Summary

Towards Efficient SRAM-PIM Architecture Design by Exploiting Unstructured Bit-Level Sparsity

This paper presents an innovative approach to enhancing the efficiency of SRAM-based processing-in-memory (PIM) architectures by leveraging unstructured bit-level sparsity. Traditional digital SRAM-PIM architectures face significant challenges in exploiting such sparsity due to their inherent crossbar structure, which restricts data routing and leads to inefficient utilization of randomly distributed zero-bits. To address these limitations, the authors propose the Dyadic Block PIM (DB-PIM), a co-design framework that couples algorithms with architectural innovations.

Core Contributions

  1. Algorithmic Innovation:
    • The authors introduce a Fixed Threshold Approximation (FTA) algorithm alongside a distinctive sparsity pattern termed the Dyadic Block (DB): each 8-bit weight is partitioned into four two-bit blocks, which preserves the random placement of non-zero bits while keeping bit-level operations regular. FTA enforces a uniform threshold on the number of non-zero bits per weight, preserving accuracy while improving regularity, and Canonical Signed Digit (CSD) encoding raises the attainable sparsity by minimizing the count of non-zero digits in each weight (a sketch of this flow follows the list).
  2. Architectural Design:
    • The proposed architecture features customized PIM macros that integrate Dyadic Block Multiplication Units (DBMUs) and CSD-based adder trees optimized for Multiply-Accumulate (MAC) operations. An input pre-processing unit (IPU) dynamically detects and bypasses all-zero input blocks, further improving computational efficiency (see the block-skipping sketch after the list). The macro also stores and computes on the complementary states held in its 6T SRAM cells, putting formerly inactive crossbar cells to productive use.
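
To make this flow concrete, here is a minimal Python sketch of the pipeline as the summary describes it: CSD-encode a weight, cap its non-zero digits with a fixed threshold, and group the digits into two-bit dyadic blocks. The function names, the threshold of two non-zero digits, and the example weight are illustrative assumptions, not the paper's reference implementation.

```python
def to_csd(x: int) -> list[int]:
    """Canonical signed-digit (CSD) encoding, least-significant digit first.
    Digits are in {-1, 0, +1} and no two adjacent digits are non-zero,
    which minimizes the number of non-zero digits."""
    digits = []
    while x != 0:
        if x & 1:
            d = 2 - (x & 3)  # +1 if x % 4 == 1, -1 if x % 4 == 3
            x -= d
        else:
            d = 0
        digits.append(d)
        x >>= 1
    return digits

def fta(digits: list[int], max_nonzero: int = 2) -> list[int]:
    """Fixed Threshold Approximation: keep only the `max_nonzero` most
    significant non-zero digits and zero the rest (the threshold value
    here is an illustrative assumption)."""
    out, kept = [0] * len(digits), 0
    for i in reversed(range(len(digits))):  # scan MSB -> LSB
        if digits[i] and kept < max_nonzero:
            out[i], kept = digits[i], kept + 1
    return out

def dyadic_blocks(digits: list[int], width: int = 2) -> list[list[int]]:
    """Group digits into fixed-width blocks (four two-digit blocks for
    an 8-digit weight), padding with zeros if needed."""
    padded = digits + [0] * (-len(digits) % width)
    return [padded[i:i + width] for i in range(0, len(padded), width)]

def value(digits: list[int]) -> int:
    """Reconstruct the integer a digit vector represents."""
    return sum(d << i for i, d in enumerate(digits))

w = 119                    # 0b0111_0111: five non-zero bits in plain binary
csd = to_csd(w)            # only three non-zero CSD digits: 128 - 8 - 1
approx = fta(csd)          # FTA drops the least significant digit: 128 - 8
print(value(csd), value(approx))   # 119 120
print(dyadic_blocks(approx))       # [[0, 0], [0, -1], [0, 0], [0, 1]]
```

After FTA, every weight carries at most two non-zero digits and, in this example, two of the four dyadic blocks are entirely zero; that bounded, block-aligned structure is the regularity the hardware exploits.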

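Likewise, here is a standalone sketch of the block-wise zero skipping attributed to the IPU, assuming inputs arrive in fixed-width blocks; the helper names, block layout, and skip policy are illustrative assumptions rather than the macro's actual datapath.

```python
def ipu_filter(blocks):
    """Yield (index, block) only for blocks containing a non-zero entry,
    mimicking the IPU's detect-and-bypass of all-zero input blocks."""
    for i, blk in enumerate(blocks):
        if any(blk):
            yield i, blk

def blockwise_dot(x_blocks, w_blocks):
    """Accumulate a dot product over surviving blocks only; blocks the
    IPU filters out cost no MAC work at all."""
    return sum(
        sum(x * w for x, w in zip(x_blk, w_blocks[i]))
        for i, x_blk in ipu_filter(x_blocks)
    )

x = [[0, 0], [3, 1], [0, 0], [2, 0]]  # half the input blocks are all zero
w = [[1, 2], [1, 1], [4, 4], [0, 5]]
print(blockwise_dot(x, w))            # 3*1 + 1*1 + 2*0 + 0*5 = 4
```
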
Experimental Evaluation

The authors conducted comprehensive evaluations on several deep neural network (DNN) models, spanning standard architectures such as AlexNet and compact ones such as MobileNetV2. Their results show that the DB-PIM framework achieves up to a 7.69x speedup and energy savings of 83.43% compared to traditional sparse neural network acceleration techniques. The gains stem largely from DB-PIM's substantially higher effective utilization of SRAM cells, which reaches up to 98.42% in dense computation scenarios.

Implications and Future Work

The DB-PIM framework demonstrates substantial improvements in efficiency and utilization, indicating its potential impact on both theoretical and practical aspects of PIM system design. By effectively leveraging bit-level sparsity, this framework offers a pathway to improve processing capabilities in resource-constrained environments, particularly for edge applications where efficiency is critical.

Future developments may focus on integrating this approach with existing value-level sparsity strategies so that multiple dimensions of sparsity can be exploited in combination. Additionally, exploring applications in broader AI contexts, such as natural language processing or multimodal fusion systems, could further demonstrate the framework's versatility and robustness. As the landscape of AI continues to evolve, such synergy between algorithmic ingenuity and architectural design will be pivotal in overcoming emerging computational challenges.
