- The paper introduces a high-performance FPGA accelerator that employs depthwise separable convolution to reduce computational load and memory usage.
- The design, exemplified by MobileNetV2 on an Arria 10 SoC, achieves 266.6 fps and 170.6 GOPS, offering a 20x speedup over CPU implementations.
- The accelerator’s adaptable architecture leverages 16-bit quantization and parallel processing, enabling energy-efficient CNN deployment on portable devices.
A CNN Accelerator on FPGA Using Depthwise Separable Convolution
This paper proposes a scalable, high-performance accelerator for convolutional neural networks (CNNs) implemented on Field-Programmable Gate Arrays (FPGAs), centered on depthwise separable convolutional layers. The accelerator's adaptable architecture facilitates deployment on a range of FPGAs by trading hardware usage against processing speed, a significant step toward running CNNs on portable, low-power devices without relying on power-hungry GPUs.
The paper addresses the computational intensity of CNNs, traditionally handled by GPUs whose power consumption is poorly suited to embedded applications. The introduction of depthwise separable convolutions in state-of-the-art models such as MobileNetV2 has reduced computational cost and parameter storage, making CNNs more viable for FPGA deployment. These separable operations split the convolution into a depthwise stage (a spatial filter applied per input channel) and a pointwise stage (a 1x1 convolution that combines channels), drastically reducing the number of arithmetic operations and the memory footprint compared to standard convolutions.
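To make the savings concrete, the following back-of-the-envelope sketch (not from the paper; the layer shape is hypothetical) counts multiply-accumulate operations for a standard convolution versus its depthwise separable counterpart:

```python
def standard_conv_macs(h, w, k, c_in, c_out):
    # MACs for a standard convolution producing an h x w output
    # with a k x k kernel, c_in input and c_out output channels.
    return h * w * k * k * c_in * c_out

def depthwise_separable_macs(h, w, k, c_in, c_out):
    depthwise = h * w * k * k * c_in   # one k x k spatial filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 convolution mixing the channels
    return depthwise + pointwise

# Hypothetical MobileNet-like layer: 112x112 output, 3x3 kernel, 32 -> 64 channels.
std = standard_conv_macs(112, 112, 3, 32, 64)
sep = depthwise_separable_macs(112, 112, 3, 32, 64)
print(f"standard: {std:,} MACs, separable: {sep:,} MACs ({std / sep:.1f}x fewer)")
```

For this shape the separable form needs roughly 8x fewer operations, consistent with the familiar 1/c_out + 1/k² reduction factor.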
A notable application showcased in this research is the implementation of MobileNetV2 on an Arria 10 SoC FPGA platform. The exemplar system performs image classification at 266.6 frames per second (fps), about 3.75 milliseconds per image on the ImageNet dataset, a 20x speedup over CPU implementations. This efficiency is supported by the accelerator's architecture, which includes a matrix multiplication engine (MME) that handles the operations of the various CNN layers, together with a hierarchical memory structure and a ping-pong on-chip buffer that mitigate memory-bandwidth constraints.
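The paper's RTL details are not reproduced here, but the ping-pong idea can be sketched in software: while the compute engine processes the tile held in one buffer, the next tile is fetched into the other, so data transfer and computation overlap. The function and buffer names below are purely illustrative.

```python
def process_tiles(fetch_tile, compute, num_tiles):
    """Software analogy of ping-pong (double) buffering."""
    buffers = [fetch_tile(0), None]            # prefetch the first tile into the "ping" buffer
    results = []
    for i in range(num_tiles):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < num_tiles:
            buffers[nxt] = fetch_tile(i + 1)   # on the FPGA this load runs concurrently...
        results.append(compute(buffers[cur]))  # ...with this compute step
    return results

# Example with stand-in functions: "fetching" returns the tile index, "computing" doubles it.
print(process_tiles(lambda i: i, lambda t: 2 * t, 4))   # [0, 2, 4, 6]
```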
A key result is the measured throughput of 170.6 giga operations per second (GOPS) at a system clock frequency of 133 MHz, demonstrating the architecture's high computational capability. The paper also emphasizes the benefits of a 16-bit quantization scheme, which preserves the accuracy required by CNNs while reducing device resource usage. The FPGA design comprises 4 MMEs, each integrated with line buffers, multipliers, adder trees, and optional stages for normalization, ReLU, and pooling, emphasizing parallel processing and low latency.
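The exact 16-bit format is a design detail of the paper; as an assumed illustration only, a simple signed fixed-point (Q-format) scheme with 8 fractional bits shows how weights and activations can be stored in 16 bits with modest rounding error:

```python
import numpy as np

def quantize_q8_8(x, frac_bits=8, total_bits=16):
    # Assumed fixed-point scheme for illustration; the paper's format may differ.
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return np.clip(np.round(x * scale), lo, hi).astype(np.int16)

def dequantize_q8_8(q, frac_bits=8):
    return q.astype(np.float32) / (1 << frac_bits)

w = np.array([0.1234, -1.5, 0.007], dtype=np.float32)
q = quantize_q8_8(w)
print(q)                   # int16 values, e.g. [  32 -384    2]
print(dequantize_q8_8(q))  # approximately the original weights
```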
The research notably highlights the trade-off between the power efficiency of fixed architectures such as ASICs and the reconfigurability afforded by the FPGA's intrinsic adaptability. It argues for a strategic pivot in CNN deployment, leveraging FPGAs' flexibility for real-time applications such as autonomous driving and mobile devices, where energy efficiency and performance are paramount.
In summary, the proposed FPGA-based accelerator delivers substantial improvements in processing speed and energy efficiency for CNNs that employ depthwise separable convolution. While the primary implementation targets MobileNetV2, the architecture is versatile enough to be adapted to other CNN models. Future research directions may include extending scalability across different FPGA platforms and integrating additional neural-network optimizations to further improve throughput and accuracy as network architectures continue to evolve.