EIE: Efficient Inference Engine on Compressed Deep Neural Network (1602.01528v2)

Published 4 Feb 2016 in cs.CV and cs.AR

Abstract: State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; exploiting sparsity saves 10×; weight sharing gives 8×; skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS/s working directly on a compressed network, corresponding to 3 TOPS/s on an uncompressed network, and processes FC layers of AlexNet at 1.88×10^4 frames/sec with a power dissipation of only 600 mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency and area efficiency.

EIE: Efficient Inference Engine on Compressed Deep Neural Network

Introduction

The paper "EIE: Efficient Inference Engine on Compressed Deep Neural Network" addresses the challenges of deploying deep neural networks (DNNs) on embedded systems with constrained hardware resources. This paper introduces EIE, an innovative hardware accelerator that significantly reduces energy consumption and improves the performance of DNN inference through the processing of compressed networks.

Background and Motivation

The continued growth in the scale of DNNs has underscored critical issues with their deployment, particularly on energy-constrained mobile and embedded devices. Modern DNNs such as AlexNet and VGG-16 contain tens to hundreds of millions of parameters, making them both computationally and memory intensive. Because fetching weights from off-chip DRAM costs roughly two orders of magnitude more energy than an ALU operation, efficient inference demands keeping data and computation on chip.

Previous approaches, such as Deep Compression, reduce storage requirements through network pruning and weight sharing, enabling large DNNs to fit in on-chip SRAM. However, efficiently utilizing these compressed models to achieve energy and computational efficiency remains an open problem. The paper proposes EIE, a specialized inference engine designed to exploit the sparsity and weight-sharing inherent in deeply compressed networks, thereby achieving significant energy and performance gains.
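 
To make the setup concrete, here is a minimal Python sketch of the two Deep Compression steps that EIE consumes: magnitude pruning and weight sharing through a small codebook. The 4-bit index width matches the 8× weight-sharing saving quoted in the abstract (32-bit floats replaced by 4-bit indices), but the 90% sparsity level, the k-means clustering, and the function name are illustrative assumptions, not the authors' implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def prune_and_share(W, sparsity=0.9, codebook_bits=4):
        """Compress a dense weight matrix W into (codebook, index matrix, mask)."""
        # Pruning: zero out the smallest-magnitude weights.
        threshold = np.quantile(np.abs(W), sparsity)
        mask = np.abs(W) > threshold

        # Weight sharing: cluster the surviving weights into 2^bits centroids, so
        # each non-zero weight is stored as a short codebook index, not a float.
        survivors = W[mask].reshape(-1, 1)
        kmeans = KMeans(n_clusters=2 ** codebook_bits, n_init=10).fit(survivors)
        codebook = kmeans.cluster_centers_.ravel()

        indices = np.zeros(W.shape, dtype=np.uint8)
        indices[mask] = kmeans.predict(survivors)
        return codebook, indices, mask

EIE never reconstructs the dense matrix; its processing elements operate directly on the stored non-zero indices, as sketched in the architecture section below.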

Key Contributions

The paper introduces several key contributions:

  1. Sparse and Weight-Sharing Accelerator: EIE is the first accelerator designed specifically for sparse, weight-sharing DNNs. It processes compressed networks directly, enabling large models to fit in on-chip SRAM and reducing the energy cost of memory accesses by 120×.
  2. Exploiting Activation Sparsity: EIE dynamically exploits the sparsity of activations, achieving additional energy savings by avoiding unnecessary computations; on average, 65.16% of the energy is saved by skipping computations that involve zero activations in typical deep learning workloads.
  3. Efficient Load Balancing: The architecture includes mechanisms to handle the irregular and dynamic nature of compressed DNNs. Distributed storage and computation across multiple processing elements (PEs) ensure load balance and scalability (see the partitioning sketch after this list).
  4. Comprehensive Benchmarking: EIE was evaluated on nine DNN benchmarks, demonstrating substantial improvements in performance and energy efficiency over both CPU- and GPU-based implementations. EIE processes the fully-connected (FC) layers of AlexNet at 1.88×10^4 frames/sec with only 600 mW of power dissipation, showcasing its practical viability.
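 
As a rough illustration of the load-balancing idea in contribution 3, the sketch below interleaves the rows of a weight matrix across PEs. The modulo assignment is one plausible partitioning consistent with the paper's description of distributed storage; the function name is illustrative.

    import numpy as np

    def partition_rows(W, num_pes):
        """Assign row i of W to PE (i % num_pes); return one sub-matrix per PE."""
        return [W[pe::num_pes] for pe in range(num_pes)]

    # Each PE keeps only the non-zeros of its own rows (in compressed form), and
    # every non-zero input activation is broadcast to all PEs, so work stays
    # roughly balanced as long as non-zeros are spread evenly across rows.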

Architecture and Implementation

EIE is built around a scalable array of PEs, each responsible for a partition of the network stored in SRAM. Each PE operates on a compressed representation of its slice of the weight matrix, stored in a compressed sparse column (CSC) format adapted to exploit the dynamic sparsity of the input activations. The architecture includes several components:

  • Pointer Read Unit: Efficiently accesses compressed weight pointers.
  • Sparse Matrix Read Unit: Fetches non-zero weights and handles weight sharing indices.
  • Arithmetic Unit: Performs multiply-accumulate operations only for non-zero values.
  • Activation Read/Write Unit: Manages input and output activations, supporting both register and SRAM storage.

The central control unit orchestrates the system, leveraging a distributed leading non-zero detection network to identify non-zero input activations for broadcasting to all PEs.
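 
The following is a minimal Python sketch of the per-PE inner loop, assuming a CSC-style layout and a codebook of shared weights. The variable names (col_ptr, row_idx, weight_idx) and the use of absolute indices and floating point are illustrative; the hardware's actual data layout and arithmetic differ in detail.

    import numpy as np

    def pe_sparse_matvec(a, col_ptr, row_idx, weight_idx, codebook, num_out_rows):
        """y = W_pe @ a for one PE, with W_pe stored column-wise as
        (local row index, codebook index) pairs."""
        y = np.zeros(num_out_rows)
        for j, a_j in enumerate(a):
            if a_j == 0.0:
                continue                     # dynamic sparsity: skip zero activations
            for k in range(col_ptr[j], col_ptr[j + 1]):
                w = codebook[weight_idx[k]]  # weight sharing: small index -> real value
                y[row_idx[k]] += w * a_j     # MAC only on stored (non-zero) weights
        return y

The two levels of skipping here, zero activations and absent weights, correspond to what the leading non-zero detection network and the sparse matrix read unit implement in hardware.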

Numerical Results

The paper reports strong numerical results for EIE, evaluated against CPU and GPU platforms:

  • Speedup: EIE achieves speedups of 189×, 13×, and 307× over CPU, GPU, and mobile GPU implementations, respectively, on the evaluated benchmarks.
  • Energy Efficiency: EIE is 24,000×, 3,400×, and 2,700× more energy-efficient than the CPU, GPU, and mobile GPU, respectively.
  • Processing Power: The architecture delivers 102 GOPS/s on compressed networks, corresponding to 3 TOPS/s on the uncompressed counterparts (a back-of-the-envelope check follows this list).
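 
As a back-of-the-envelope consistency check (my arithmetic, not the paper's), the 10× weight sparsity and 3× activation sparsity factors quoted in the abstract mean each operation on the compressed network stands in for roughly 30 dense operations:

    \[
      102~\text{GOPS/s (compressed)}
      \times \underbrace{10}_{\text{weight sparsity}}
      \times \underbrace{3}_{\text{activation sparsity}}
      \approx 3~\text{TOPS/s (dense equivalent)}
    \]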

Practical and Theoretical Implications

EIE's ability to execute large DNNs within a constrained energy budget without sacrificing performance has significant practical implications for mobile and embedded AI applications. By directly working on compressed models, EIE opens avenues for real-time inference on edge devices, potentially influencing the deployment strategies of machine learning applications in resource-limited environments.

Theoretically, the paper contributes to the understanding of optimizing computational architectures for sparse and irregular data patterns, which could influence future hardware design principles for AI accelerators.

Future Developments

Advancements in AI hardware will likely continue to focus on handling the growing scale and complexity of DNNs more energy-efficiently. Future work might explore further compression techniques, integration with other accelerators, and support for layer types beyond fully-connected layers, such as convolutional and recurrent layers. Furthermore, scaling EIE to more advanced process nodes and adding architectural parallelism could yield even greater performance and energy efficiency.

Conclusion

The paper presents EIE as a highly efficient inference engine for compressed deep neural networks, addressing the critical needs of energy-efficient DNN deployment on constrained hardware. Its architecture effectively leverages sparsity and weight-sharing, setting a new benchmark for performance and energy efficiency in AI hardware accelerators.

Authors (7)
  1. Song Han
  2. Xingyu Liu
  3. Huizi Mao
  4. Jing Pu
  5. Ardavan Pedram
  6. Mark A. Horowitz
  7. William J. Dally
Citations (2,376)