
A Reconfigurable Streaming Deep Convolutional Neural Network Accelerator for Internet of Things (1707.02973v1)

Published 8 Jul 2017 in cs.CV and cs.AR

Abstract: Convolutional neural networks (CNNs) offer significant accuracy in image detection. To implement image detection using CNNs in internet of things (IoT) devices, a streaming hardware accelerator is proposed. The proposed accelerator optimizes energy efficiency by avoiding unnecessary data movement. With a unique filter decomposition technique, the accelerator can support arbitrary convolution window sizes. In addition, the max pooling function can be computed in parallel with convolution using a separate pooling unit, thus improving throughput. A prototype accelerator was implemented in TSMC 65 nm technology with a core size of 5 mm². The accelerator supports major CNNs and achieves 152 GOPS peak throughput and 434 GOPS/W energy efficiency at 350 mW, making it a promising hardware accelerator for intelligent IoT devices.

Citations (165)

Summary

  • The paper proposes a reconfigurable streaming deep convolutional neural network accelerator designed for low-power IoT devices, enabling efficient local AI processing without cloud dependency.
  • A novel filter decomposition technique allows the accelerator to handle varying kernel sizes and natively supports pooling functions, enhancing reconfigurability and minimizing data movement.
  • Verified with prominent CNNs (AlexNet, ResNet, Inception V3), the accelerator achieves high energy efficiency (434 GOPS/W) and throughput (152 GOPS), proving its viability for edge AI applications like image recognition.

A Reconfigurable Streaming Deep Convolutional Neural Network Accelerator for IoT

This paper proposes a hardware accelerator design tailored for executing convolutional neural network (CNN) tasks within the constraints of Internet of Things (IoT) devices. The authors address key IoT challenges, namely the high energy consumption, latency, and network-connectivity requirements of conventional cloud-based AI processing. By introducing a localized AI processing scheme, the accelerator performs CNN computations directly on IoT devices, minimizing data transmission and improving energy efficiency.

The proposed architecture leverages a streaming data flow that significantly reduces unnecessary data movement and thereby achieves superior power efficiency: a peak throughput of 152 GOPS and an energy efficiency of 434 GOPS/W at a power consumption of 350 mW. These figures indicate the accelerator's potential to operate within the stringent power and performance budgets typical of IoT applications.
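These figures are internally consistent; the reported efficiency is simply peak throughput divided by power. A quick sanity check in plain Python (no accelerator-specific assumptions):

```python
# Sanity check of the reported figures: GOPS/W = GOPS / W.
peak_throughput_gops = 152   # peak throughput, GOPS
power_w = 0.350              # reported power, 350 mW

efficiency = peak_throughput_gops / power_w
print(round(efficiency))     # 434 GOPS/W, matching the paper
```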

A notable contribution of this paper is its filter decomposition technique, which enables the accelerator to perform convolution operations for varying kernel sizes without additional hardware overhead. The accelerator supports both pooling functions (max and average) natively, enhancing its ability to fully compute CNNs without external processor intervention. This integration not only eliminates the extra data movement typically associated with pooling in CNN accelerators but also offers reconfigurability.
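The paper's decomposition is implemented in hardware; to illustrate the general idea in software, the sketch below (a hypothetical NumPy formulation with function names of our own choosing, not the paper's design) zero-pads a larger kernel to a multiple of a native 3×3 filter size, splits it into sub-filters, applies each to the correspondingly shifted input, and sums the partial results, reproducing the direct convolution:

```python
import numpy as np

def conv2d_valid(x, k):
    """Direct 2-D cross-correlation with 'valid' output."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def conv2d_decomposed(x, k, base=3):
    """Equivalent result computed from base x base sub-filters.

    The kernel is zero-padded to a multiple of `base`, split into
    base x base tiles, and each tile is applied to the input shifted
    by the tile's offset; the partial outputs are summed.
    """
    kh, kw = k.shape
    ph, pw = (-kh) % base, (-kw) % base
    kpad = np.pad(k, ((0, ph), (0, pw)))   # pad kernel with zeros
    xpad = np.pad(x, ((0, ph), (0, pw)))   # keep shifted windows in range
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for ti in range(kpad.shape[0] // base):
        for tj in range(kpad.shape[1] // base):
            sub = kpad[ti * base:(ti + 1) * base, tj * base:(tj + 1) * base]
            partial = conv2d_valid(xpad[ti * base:, tj * base:], sub)
            out += partial[:out.shape[0], :out.shape[1]]
    return out
```

This mirrors how a fixed small compute array can serve arbitrary window sizes: the zero-padded tile entries simply contribute nothing to the sum.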

The accelerator was verified with a prototype implemented in TSMC 65 nm technology and demonstrated support for prominent CNN architectures such as AlexNet, ResNet, and Inception V3. The paper illustrates that, with this architecture, IoT devices can execute high-performance AI computations locally, including image recognition and smart-security applications, without reliance on cloud services.

Comparative analysis with existing works, as presented in the paper, highlights the accelerator's competitive advantage in terms of energy efficiency and area cost. The architecture's design integrates a ping-pong buffer and a simplified convolution unit, optimizing its operational speed without sacrificing computational capability, crucial for real-world IoT deployment.
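A ping-pong (double) buffer lets one buffer be filled while the other is consumed, so data transfer and computation overlap rather than serialize. A minimal software analogue (illustrative only; the names and structure are our own, not the paper's hardware):

```python
import threading
from queue import Queue

def pingpong_pipeline(tiles, load, compute):
    """Software analogue of a ping-pong buffer: a loader thread fills
    one of two buffers while the compute stage drains the other, so
    memory transfer and computation overlap instead of alternating."""
    filled = Queue(maxsize=2)    # at most two buffers in flight

    def loader():
        for t in tiles:
            filled.put(load(t))  # blocks when both buffers are busy
        filled.put(None)         # sentinel: no more tiles

    threading.Thread(target=loader, daemon=True).start()
    results = []
    while (buf := filled.get()) is not None:
        results.append(compute(buf))
    return results
```

The `maxsize=2` bound is what makes this a ping-pong scheme: the loader can run at most one buffer ahead of the consumer.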

Theoretical implications of this research suggest potential advancements in embedded systems where energy and performance are critical. Practical applications extend to various IoT domains, supporting autonomous operation in environments lacking consistent network connectivity. Future work may explore further optimization on data formats, possibly leveraging lower precision arithmetic to enhance efficiency without compromising result fidelity.
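As one concrete direction for such precision reduction, symmetric int8 quantization is a common way to trade numeric precision for datapath energy and area; a brief sketch of the generic technique (not something evaluated in the paper):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization of values `w`."""
    scale = np.max(np.abs(w)) / 127.0           # largest magnitude maps to 127
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes."""
    return q.astype(np.float32) * scale
```

The round-trip error is bounded by half a quantization step (`scale / 2`), which is often tolerable for inference workloads.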

In conclusion, this paper contributes significantly to the field of IoT edge computing by providing a robust and efficient hardware solution for CNN acceleration. The architecture sets a precedent for intelligent, autonomous IoT devices capable of local processing, advancing the capability of IoT ecosystems.