- The paper proposes a reconfigurable streaming deep convolutional neural network accelerator designed for low-power IoT devices, enabling efficient local AI processing without cloud dependency.
- A novel filter decomposition technique allows the accelerator to handle varying kernel sizes and natively supports pooling functions, enhancing reconfigurability and minimizing data movement.
- Verified with prominent CNNs (AlexNet, ResNet, Inception V3), the accelerator achieves high energy efficiency (434 GOPS/W) and throughput (152 GOPS), proving its viability for edge AI applications like image recognition.
A Reconfigurable Streaming Deep Convolutional Neural Network Accelerator for IoT
This paper proposes a hardware accelerator design tailored for executing convolutional neural network (CNN) workloads within the constraints of Internet of Things (IoT) devices. The authors address key challenges of conventional cloud-based AI processing, namely its energy cost, latency, and dependence on network connectivity. By processing CNN computations locally on the IoT device, the accelerator minimizes data transmission and improves energy efficiency.
The proposed architecture adopts a streaming dataflow, significantly reducing unnecessary data movement and achieving high power efficiency: a peak throughput of 152 GOPS and an energy efficiency of 434 GOPS/W at a power consumption of 350 mW. These figures indicate that the accelerator can operate within the stringent power and performance budgets typical of IoT applications.
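The reported efficiency figure is consistent with the stated throughput and power; a quick arithmetic check using the values quoted above:

```python
# Energy efficiency = throughput / power; figures from the paper's summary.
throughput_gops = 152      # peak throughput, GOPS
power_w = 0.350            # power consumption, W
efficiency = throughput_gops / power_w
print(round(efficiency))   # 434 GOPS/W
```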
A notable contribution of this paper is its filter decomposition technique, which lets the accelerator perform convolutions with varying kernel sizes without additional hardware overhead. The accelerator also natively supports both max and average pooling, so it can compute entire CNNs without intervention from an external processor. This integration eliminates the extra data movement that pooling typically incurs in CNN accelerators while preserving reconfigurability.
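One way such a filter decomposition can work is to zero-pad a larger kernel and split it into fixed-size sub-kernels whose partial results are summed. The sketch below illustrates the idea with a 5×5 kernel split into four 3×3 sub-kernels; it is an illustrative reconstruction of the general technique, not necessarily the paper's exact scheme:

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 'valid' 2-D cross-correlation (reference implementation)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((12, 12))
k5 = rng.standard_normal((5, 5))
ref = conv2d_valid(x, k5)                 # 8x8 reference output

# Zero-pad the 5x5 kernel to 6x6 so it tiles evenly into 3x3 blocks.
k6 = np.zeros((6, 6)); k6[:5, :5] = k5
# Pad the input by one row/column so shifted 3x3 windows stay in bounds.
xp = np.zeros((13, 13)); xp[:12, :12] = x

dec = np.zeros_like(ref)
for a in (0, 1):
    for b in (0, 1):
        sub = k6[3*a:3*a + 3, 3*b:3*b + 3]  # one 3x3 sub-kernel
        # Each sub-kernel convolves a shifted view; partial sums accumulate.
        dec += conv2d_valid(xp[3*a:3*a + 10, 3*b:3*b + 10], sub)

assert np.allclose(ref, dec)              # matches the direct 5x5 convolution
```

Under this scheme only fixed-size compute units are needed in hardware: a larger kernel becomes several small-kernel passes whose partial sums are accumulated, which is what makes the same units reusable across kernel sizes.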
The accelerator was verified using an FPGA prototype and demonstrated its ability to support prominent CNN architectures such as AlexNet, ResNet, and Inception V3. The paper illustrates that with this architecture, IoT devices can achieve locally executed high-performance AI computations, including image recognition and smart security applications, without reliance on cloud services.
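The natively supported pooling modes mentioned above (max and average) amount to simple window reductions. A minimal sketch for non-overlapping 2×2 windows, where the window size is an assumption chosen for illustration:

```python
import numpy as np

def pool2x2(x, mode="max"):
    # Non-overlapping 2x2 pooling via reshape; mode is "max" or "avg".
    h, w = x.shape
    t = x.reshape(h // 2, 2, w // 2, 2)
    return t.max(axis=(1, 3)) if mode == "max" else t.mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(x, "max"))   # [[ 5.  7.] [13. 15.]]
print(pool2x2(x, "avg"))   # [[ 2.5  4.5] [10.5 12.5]]
```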
Comparative analysis with existing works, as presented in the paper, highlights the accelerator's advantage in energy efficiency and area cost. The design integrates a ping-pong buffer and a simplified convolution unit, optimizing operational speed without sacrificing computational capability, which is crucial for real-world IoT deployment.
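A ping-pong (double) buffer hides memory latency by letting one buffer be filled while the other is consumed. A minimal sketch of the general pattern, with placeholder tile contents standing in for feature-map data:

```python
# Minimal double-buffering loop: while the compute stage consumes one
# buffer, the fetch stage fills the other, then the roles swap each tile.
tiles = [[i] * 4 for i in range(6)]        # stand-in for feature-map tiles
buffers = [None, None]                     # the ping-pong pair
fill = 0                                   # index of the buffer being filled

buffers[fill] = tiles[0]                   # prefetch the first tile
results = []
for t in range(len(tiles)):
    compute, fill = fill, 1 - fill         # swap roles
    if t + 1 < len(tiles):
        buffers[fill] = tiles[t + 1]       # fetch the next tile...
    results.append(sum(buffers[compute]))  # ...while computing the current one
print(results)  # [0, 4, 8, 12, 16, 20]
```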
Theoretical implications of this research suggest potential advancements in embedded systems where energy and performance are critical. Practical applications extend to various IoT domains, supporting autonomous operation in environments lacking consistent network connectivity. Future work may explore further optimization of data formats, for example lower-precision arithmetic, to improve efficiency without compromising result fidelity.
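As one example of the lower-precision direction, weights can be quantized to 8-bit integers with a per-tensor scale. This is a generic sketch of symmetric int8 quantization, not a technique claimed by the paper:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w is approximated by q * scale."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(1)
w = rng.standard_normal(8).astype(np.float32)
q, scale = quantize_int8(w)
# Round-trip error is bounded by half a quantization step.
err = np.max(np.abs(q.astype(np.float32) * scale - w))
assert err <= 0.5 * scale + 1e-6
```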
In conclusion, this paper contributes significantly to the field of IoT edge computing by providing a robust and efficient hardware solution for CNN acceleration. The architecture sets a precedent for developing intelligent, autonomous IoT devices capable of local processing, advancing the capability of IoT ecosystems.