EIE: Efficient Inference Engine on Compressed Deep Neural Network
Introduction
The paper "EIE: Efficient Inference Engine on Compressed Deep Neural Network" addresses the challenges of deploying deep neural networks (DNNs) on embedded systems with constrained hardware resources. This paper introduces EIE, an innovative hardware accelerator that significantly reduces energy consumption and improves the performance of DNN inference through the processing of compressed networks.
Background and Motivation
The continued growth and deployment of large-scale DNNs have underscored critical issues, particularly on energy-constrained mobile and embedded devices. Modern DNNs such as AlexNet and VGG-16 contain tens to hundreds of millions of parameters, making them both computationally and memory intensive. The high energy cost of computation and, especially, of off-chip DRAM accesses necessitates more efficient solutions.
Previous approaches, such as Deep Compression, reduce storage requirements through network pruning and weight sharing, enabling large DNNs to fit in on-chip SRAM. However, efficiently utilizing these compressed models to achieve energy and computational efficiency remains an open problem. The paper proposes EIE, a specialized inference engine designed to exploit the sparsity and weight-sharing inherent in deeply compressed networks, thereby achieving significant energy and performance gains.
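As a rough illustration of what pruning plus weight sharing buy (a minimal Python sketch with assumed sizes and ratios, not the exact encoding used by Deep Compression), the snippet below replaces a dense fully-connected weight matrix with a small codebook of shared values and 4-bit indices for the surviving weights:

```python
import numpy as np

# Minimal sketch: prune ~90% of the smallest-magnitude weights (illustrative
# ratio), then quantize the survivors to 16 shared values so each weight is
# stored as a 4-bit codebook index instead of a 32-bit float.

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)   # hypothetical FC layer

# Pruning: keep only the top 10% of weights by magnitude.
threshold = np.quantile(np.abs(W), 0.9)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

# Weight sharing: map each surviving weight to the nearest of 16 shared values.
nonzero = W_pruned[W_pruned != 0]
codebook = np.quantile(nonzero, np.linspace(0, 1, 16)).astype(np.float32)
indices = np.abs(nonzero[:, None] - codebook[None, :]).argmin(axis=1)

# Storage: 4-bit indices plus a tiny codebook, versus 32-bit dense floats.
# (This ignores the extra bits needed to encode non-zero weight positions.)
compressed_bits = indices.size * 4 + codebook.size * 32
original_bits = W.size * 32
print(f"~{original_bits / compressed_bits:.1f}x smaller (weights only)")
```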
Key Contributions
The paper introduces several key contributions:
- Sparse and Weight-Sharing Accelerator: EIE is the first accelerator designed specifically for sparse and weight-sharing DNNs. It directly processes compressed networks, enabling large models to fit in on-chip SRAM, which reduces the energy cost of a weight access by a factor of 120 (on-chip SRAM versus off-chip DRAM).
- Exploiting Activation Sparsity: EIE dynamically exploits the sparsity of activations, achieving additional savings by avoiding unnecessary computations. On average, 65.16% of energy is saved by skipping computations involving zero activations in typical deep learning applications (a minimal software sketch of this idea follows the list below).
- Efficient Load Balancing: The architecture includes mechanisms to handle the irregular and dynamic nature of compressed DNNs. Distributed storage and computation across multiple processing elements (PEs) ensure load balance and scalability.
- Comprehensive Benchmarking: EIE was evaluated on nine DNN benchmarks, demonstrating substantial improvements in performance and energy efficiency compared to both CPU and GPU-based implementations. EIE processes fully-connected (FC) layers of AlexNet at a rate of 1.88 × 10⁴ frames/sec with only 600 mW of power dissipation, showcasing its practical viability.
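The activation-sparsity saving mentioned above comes from never touching weight columns whose input activation is zero (e.g., the zeros produced by ReLU). A minimal Python sketch of that principle, as an illustration rather than EIE's actual hardware dataflow:

```python
import numpy as np

def spmv_skip_zero_activations(W: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Compute W @ a, accumulating only columns whose activation is non-zero."""
    out = np.zeros(W.shape[0], dtype=W.dtype)
    for j in np.flatnonzero(a):      # visit only the non-zero activations
        out += W[:, j] * a[j]        # one column update per non-zero a[j]
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)
a = np.maximum(rng.standard_normal(8), 0).astype(np.float32)  # ReLU-like: ~half zeros

assert np.allclose(spmv_skip_zero_activations(W, a), W @ a, atol=1e-5)
```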
Architecture and Implementation
EIE is built around a scalable array of PEs, each responsible for a partition of the network stored in SRAM. Each PE operates on a compressed representation of the weight matrix utilizing a compressed sparse column (CSC) format tailored for dynamic sparsity in input activations. The architecture includes several components:
- Pointer Read Unit: Efficiently accesses compressed weight pointers.
- Sparse Matrix Read Unit: Fetches non-zero weights and handles weight sharing indices.
- Arithmetic Unit: Performs multiply-accumulate operations only for non-zero values.
- Activation Read/Write Unit: Manages input and output activations, supporting both register and SRAM storage.
A central control unit orchestrates the system, using a distributed leading-non-zero detection network to find the next non-zero input activation and broadcast it to all PEs; a simplified software sketch of the resulting per-PE computation follows.
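The sketch below shows one PE's work in Python. The CSC-style layout, 4-bit codebook indices, row-interleaved partitioning across four PEs, and broadcast of non-zero activations are assumed simplifications of the paper's design, not a faithful reproduction of the hardware:

```python
import numpy as np

class PE:
    """One processing element holding a row-interleaved slice of the weights."""

    def __init__(self, W_slice: np.ndarray, codebook: np.ndarray):
        self.codebook = codebook
        self.col_ptr = [0]      # column start pointers (CSC-like)
        self.row_idx = []       # row index of each stored non-zero weight
        self.w_idx = []         # 4-bit index into the shared-value codebook
        for j in range(W_slice.shape[1]):
            for i in np.flatnonzero(W_slice[:, j]):
                self.row_idx.append(i)
                self.w_idx.append(int(np.abs(codebook - W_slice[i, j]).argmin()))
            self.col_ptr.append(len(self.row_idx))
        self.acc = np.zeros(W_slice.shape[0], dtype=np.float32)

    def consume(self, j: int, a_j: float) -> None:
        """Process one broadcast non-zero activation a[j]."""
        for k in range(self.col_ptr[j], self.col_ptr[j + 1]):
            self.acc[self.row_idx[k]] += self.codebook[self.w_idx[k]] * a_j

# Usage: rows of W are interleaved across 4 PEs; only non-zero activations
# are broadcast, mimicking the leading-non-zero detection step in software.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8)).astype(np.float32)
W[np.abs(W) < 1.0] = 0.0                               # pruned (sparse) weights
codebook = np.linspace(W.min(), W.max(), 16).astype(np.float32)
a = np.maximum(rng.standard_normal(8), 0).astype(np.float32)

pes = [PE(W[p::4], codebook) for p in range(4)]        # row-interleaved partitions
for j in np.flatnonzero(a):                            # broadcast non-zeros only
    for pe in pes:
        pe.consume(j, float(a[j]))

out = np.empty(16, dtype=np.float32)
for p, pe in enumerate(pes):
    out[p::4] = pe.acc        # gather; out approximates W @ a up to quantization
```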
Numerical Results
The paper reports strong numerical results for EIE, evaluated against CPU and GPU platforms:
- Speedup: EIE achieves speedups of 189×, 13×, and 307× over a CPU, a GPU, and a mobile GPU, respectively, on the evaluated benchmarks.
- Energy Efficiency: EIE is 24,000×, 3,400×, and 2,700× more energy-efficient than the CPU, GPU, and mobile GPU, respectively.
- Processing Power: The architecture delivers 102 GOPS while working directly on the compressed network, corresponding to roughly 3 TOPS on the uncompressed counterpart; a rough back-of-the-envelope check of this equivalence follows this list.
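To see why throughput on the compressed model translates into a much larger equivalent dense throughput, the relation below uses illustrative densities that are assumptions here, not figures quoted from the paper:

```latex
\text{equivalent dense throughput} \;\approx\; \frac{\text{compressed throughput}}{d_W \, d_a}
\;=\; \frac{102\ \text{GOPS}}{0.10 \times 0.30} \;\approx\; 3.4\ \text{TOPS}
```

Here d_W denotes the fraction of weights retained after pruning and d_a the fraction of non-zero activations; with roughly 10% of weights and 30% of activations surviving (illustrative values), skipping the rest multiplies effective throughput by about 30×.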
Practical and Theoretical Implications
EIE's ability to execute large DNNs within a constrained energy budget without sacrificing performance has significant practical implications for mobile and embedded AI applications. By directly working on compressed models, EIE opens avenues for real-time inference on edge devices, potentially influencing the deployment strategies of machine learning applications in resource-limited environments.
Theoretically, the paper contributes to the understanding of optimizing computational architectures for sparse and irregular data patterns, which could influence future hardware design principles for AI accelerators.
Future Developments
Advancements in AI hardware will likely continue to focus on handling the growing scale and complexity of DNNs with greater energy efficiency. Future developments might explore further compression techniques, integration with other accelerators, and support for additional layer types beyond fully-connected layers, such as convolutional and recurrent layers. Furthermore, scaling EIE to more advanced process nodes and increasing parallelism at the architectural level could yield even greater performance and energy efficiency.
Conclusion
The paper presents EIE as a highly efficient inference engine for compressed deep neural networks, addressing the critical needs of energy-efficient DNN deployment on constrained hardware. Its architecture effectively leverages sparsity and weight-sharing, setting a new benchmark for performance and energy efficiency in AI hardware accelerators.