- The paper introduces a high-performance FPGA accelerator that employs depthwise separable convolution to reduce computational load and memory usage.
- The design, exemplified by MobileNetV2 on an Arria 10 SoC, achieves 266.6 fps and 170.6 GOPS, offering a 20x speedup over CPU implementations.
- The accelerator’s adaptable architecture leverages 16-bit quantization and parallel processing, enabling energy-efficient CNN deployment on portable devices.
A CNN Accelerator on FPGA Using Depthwise Separable Convolution
This paper proposes a scalable, high-performance accelerator for convolutional neural networks (CNNs) implemented on Field-Programmable Gate Arrays (FPGAs), centered on depthwise separable convolutional layers. The accelerator's adaptable architecture facilitates deployment on a range of FPGAs by trading hardware usage against processing speed, a significant step toward running CNNs on portable, low-power devices without relying on power-hungry GPUs.
The paper addresses the computational intensity of CNNs, traditionally handled by GPUs whose power consumption is poorly suited to embedded applications. The introduction of depthwise separable convolutions in state-of-the-art models such as MobileNetV2 has reduced computational cost and parameter storage, making CNNs more viable for FPGA deployment. These separable operations split the convolution into a depthwise stage (a spatial filter applied per input channel) and a pointwise stage (a 1x1 convolution that combines channels), drastically reducing the number of arithmetic operations and the memory footprint compared to standard convolutions.
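To make the savings concrete, the following back-of-the-envelope sketch (not from the paper; the layer shape is hypothetical) counts multiply-accumulate operations for a standard convolution versus its depthwise separable counterpart:

```python
def standard_conv_macs(h, w, k, c_in, c_out):
    # MACs for a standard convolution producing an h x w output
    # with a k x k kernel, c_in input and c_out output channels.
    return h * w * k * k * c_in * c_out

def depthwise_separable_macs(h, w, k, c_in, c_out):
    depthwise = h * w * k * k * c_in   # one k x k spatial filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 convolution mixing the channels
    return depthwise + pointwise

# Hypothetical MobileNet-like layer: 112x112 output, 3x3 kernel, 32 -> 64 channels.
std = standard_conv_macs(112, 112, 3, 32, 64)
sep = depthwise_separable_macs(112, 112, 3, 32, 64)
print(f"standard: {std:,} MACs, separable: {sep:,} MACs ({std / sep:.1f}x fewer)")
```

For this shape the separable form needs roughly 8x fewer operations, consistent with the familiar 1/c_out + 1/k² reduction factor.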
A notable application showcased in this research is the implementation of MobileNetV2 on an Arria 10 SoC FPGA platform. The exemplar system performs image classification at 266.6 frames per second (fps), about 3.75 milliseconds per image on the ImageNet dataset, a 20x speedup over CPU implementations. This efficiency is supported by the accelerator's architecture, which includes a matrix multiplication engine (MME) that handles the operations of the various CNN layers, together with a hierarchical memory structure and a ping-pong on-chip buffer that mitigate memory-bandwidth constraints.
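The paper's RTL details are not reproduced here, but the ping-pong idea can be sketched in software: while the compute engine processes the tile held in one buffer, the next tile is fetched into the other, so data transfer and computation overlap. The function and buffer names below are purely illustrative.

```python
def process_tiles(fetch_tile, compute, num_tiles):
    """Software analogy of ping-pong (double) buffering."""
    buffers = [fetch_tile(0), None]            # prefetch the first tile into the "ping" buffer
    results = []
    for i in range(num_tiles):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < num_tiles:
            buffers[nxt] = fetch_tile(i + 1)   # on the FPGA this load runs concurrently...
        results.append(compute(buffers[cur]))  # ...with this compute step
    return results

# Example with stand-in functions: "fetching" returns the tile index, "computing" doubles it.
print(process_tiles(lambda i: i, lambda t: 2 * t, 4))   # [0, 2, 4, 6]
```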
A key result is the measured throughput of 170.6 giga operations per second (GOPS) at a system clock frequency of 133 MHz, demonstrating the architecture's high computational capability. The paper also emphasizes the benefits of a 16-bit quantization scheme, which preserves the accuracy required by CNNs while reducing device resource usage. The FPGA design comprises 4 MMEs, each integrated with line buffers, multipliers, adder trees, and optional stages for normalization, ReLU, and pooling, emphasizing parallel processing and low latency.
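The exact 16-bit format is a design detail of the paper; as an assumed illustration only, a simple signed fixed-point (Q-format) scheme with 8 fractional bits shows how weights and activations can be stored in 16 bits with modest rounding error:

```python
import numpy as np

def quantize_q8_8(x, frac_bits=8, total_bits=16):
    # Assumed fixed-point scheme for illustration; the paper's format may differ.
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return np.clip(np.round(x * scale), lo, hi).astype(np.int16)

def dequantize_q8_8(q, frac_bits=8):
    return q.astype(np.float32) / (1 << frac_bits)

w = np.array([0.1234, -1.5, 0.007], dtype=np.float32)
q = quantize_q8_8(w)
print(q)                   # int16 values, e.g. [  32 -384    2]
print(dequantize_q8_8(q))  # approximately the original weights
```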
The research notably highlights the trade-off between the power efficiency of fixed architectures such as ASICs and the reconfigurability afforded by the FPGA's intrinsic adaptability. It argues for a strategic pivot in CNN deployment, leveraging FPGAs' flexibility for real-time applications such as autonomous driving and mobile devices, where energy efficiency and performance are paramount.
In summary, the proposed FPGA-based accelerator delivers substantial improvements in processing speed and energy efficiency for CNNs that employ depthwise separable convolution. While the primary implementation targets MobileNetV2, the architecture is versatile enough to be adapted to other CNN models. Future research directions may include extending scalability across different FPGA platforms and integrating additional neural-network optimizations to further improve throughput and accuracy as network architectures continue to evolve.