- The paper introduces a novel multi-computing-engine design that distinguishes between shallow and deep network layers to minimize off-chip memory access.
- It implements a balanced dataflow strategy using fine-grained parallel mechanisms and a dataflow-oriented line buffer scheme to mitigate computational bottlenecks.
- Performance evaluations on MobileNetV2 and ShuffleNetV2 show up to 2092.4 FPS and 94.58% MAC efficiency, with a 68.3% reduction in on-chip memory usage.
An FPGA Accelerator for Efficient Execution of Lightweight CNNs
The paper "A High-Throughput FPGA Accelerator for Lightweight CNNs With Balanced Dataflow" addresses the challenges in accelerating lightweight convolutional neural networks (LWCNNs) on FPGAs. The authors propose a novel multi-computing-engine (multi-CE) architecture designed to optimize both memory usage and computational efficiency, highlighting its performance superiority on popular LWCNN models such as MobileNetV2 and ShuffleNetV2.
Research Contributions and Methodology
- Multi-Computing-Engine Architecture: The authors identify inefficiencies in existing single-computing-engine designs, such as high on-chip/off-chip memory overhead and suboptimal computational efficiency. To address these, they propose a multi-CE-based accelerator with a balanced dataflow strategy. The architecture assigns feature-map-reuse computing engines (FRCEs) to shallow network layers, where large activations dominate memory traffic, and weight-reuse computing engines (WRCEs) to deeper layers, where weights dominate. This differentiation minimizes off-chip memory access and reduces the on-chip memory footprint (the first sketch after this list illustrates such a partitioning).
- Balanced Dataflow Strategy: A key feature of the proposed accelerator is its balanced dataflow strategy, which raises computational efficiency. The paper describes a fine-grained parallel mechanism and a dataflow-oriented line buffer scheme that together mitigate data congestion and map resources efficiently across the network's layers (the second sketch after this list gives a behavioral line-buffer model).
- Resource Allocation and Scalability: The authors introduce a resource-aware memory and parallelism allocation methodology. Guided by a performance model, it adjusts DSP utilization and the CE type assigned to each group of layers according to their resource requirements, maximizing throughput and computational efficiency while keeping the design scalable across FPGA sizes (the DSP-allocation heuristic in the first sketch below illustrates the idea).
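To make the FRCE/WRCE split and the resource-aware allocation concrete, the following Python sketch is a simplified model, not the authors' actual design-space exploration: it classifies each layer by whether its feature maps or its weights dominate memory traffic, then hands out DSPs in proportion to each layer's MAC workload so that no pipeline stage becomes the bottleneck. The layer shapes, 8-bit data assumption, and DSP budget are illustrative placeholders.

```python
# Illustrative sketch (not the authors' methodology): split layers between
# feature-map-reuse (FRCE) and weight-reuse (WRCE) engines, then allocate DSPs
# in proportion to each layer's MAC workload so the multi-CE pipeline stays balanced.
# Layer shapes, 8-bit data, and the DSP budget are placeholder assumptions.

from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    h: int        # output feature-map height
    w: int        # output feature-map width
    cin: int      # input channels
    cout: int     # output channels
    k: int        # kernel size (k x k)
    depthwise: bool = False

    def macs(self) -> int:
        per_output = self.k * self.k * (1 if self.depthwise else self.cin)
        return self.h * self.w * self.cout * per_output

    def fmap_bytes(self) -> int:     # output activations, assuming 8-bit values
        return self.h * self.w * self.cout

    def weight_bytes(self) -> int:   # filter weights, assuming 8-bit values
        per_filter = self.k * self.k * (1 if self.depthwise else self.cin)
        return self.cout * per_filter

def partition(layers):
    """Shallow layers (large feature maps, few weights) -> FRCE;
    deep layers (small feature maps, many weights) -> WRCE."""
    frce = [l for l in layers if l.fmap_bytes() >= l.weight_bytes()]
    wrce = [l for l in layers if l.fmap_bytes() < l.weight_bytes()]
    return frce, wrce

def allocate_dsps(layers, dsp_budget=900):
    """Give each per-layer engine a DSP share proportional to its MAC workload,
    so no single pipeline stage dominates the overall latency."""
    total_macs = sum(l.macs() for l in layers)
    return {l.name: max(1, round(dsp_budget * l.macs() / total_macs)) for l in layers}

if __name__ == "__main__":
    layers = [
        Layer("conv1",   112, 112, 3,   32,   3),
        Layer("dw2",     112, 112, 32,  32,   3, depthwise=True),
        Layer("pw3",     112, 112, 32,  64,   1),
        Layer("pw_deep", 7,   7,   320, 1280, 1),
    ]
    frce, wrce = partition(layers)
    print("FRCE layers:", [l.name for l in frce])
    print("WRCE layers:", [l.name for l in wrce])
    print("DSP shares:", allocate_dsps(layers))
```

In the paper the allocation is driven by a performance model over the whole network rather than a simple per-layer proportion, but the proportional rule captures the balancing intuition: the engine handling the largest share of MACs receives the largest slice of the DSP array.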
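The dataflow-oriented line buffer can likewise be illustrated with a behavioral model. The sketch below is not the paper's RTL/HLS implementation; it only shows the underlying principle: storing k-1 image rows on chip is enough to assemble a k x k window for every output pixel as activations stream in, so the feature map never has to be re-fetched from off-chip memory.

```python
# Behavioral model of a line buffer feeding a k x k convolution window (stride 1,
# valid positions only). Only k-1 image rows are kept on chip; pixels stream in
# one per "cycle" in raster order. Illustrative software model, not the paper's RTL/HLS.

def line_buffer_windows(frame, k=3):
    """Yield ((row, col), window) for every valid k x k window of a 2-D frame."""
    height, width = len(frame), len(frame[0])
    row_buf = [[0] * width for _ in range(k - 1)]   # k-1 on-chip row buffers
    win = [[0] * k for _ in range(k)]               # k x k shift-register window

    for r in range(height):
        for c in range(width):
            pixel = frame[r][c]
            # Shift the window one column to the left.
            for i in range(k):
                for j in range(k - 1):
                    win[i][j] = win[i][j + 1]
            # New rightmost column: k-1 pixels from the row buffers plus the live pixel.
            for i in range(k - 1):
                win[i][k - 1] = row_buf[i][c]
            win[k - 1][k - 1] = pixel
            # Age the row buffers: each entry moves up one buffer, newest pixel at the bottom.
            for i in range(k - 2):
                row_buf[i][c] = row_buf[i + 1][c]
            row_buf[k - 2][c] = pixel
            # A complete window exists once k rows and k columns have streamed in.
            if r >= k - 1 and c >= k - 1:
                yield (r, c), [row[:] for row in win]

if __name__ == "__main__":
    frame = [[8 * r + c for c in range(8)] for r in range(8)]
    windows = list(line_buffer_windows(frame))
    (r, c), first = windows[0]
    # The first valid window is anchored at (2, 2) and equals the frame's top-left 3x3 patch.
    assert (r, c) == (2, 2) and first == [[frame[i][j] for j in range(3)] for i in range(3)]
    print(f"{len(windows)} windows generated; first anchored at {(r, c)}")
```

The on-chip cost is only k-1 rows rather than the whole feature map, which is how a line-buffer scheme keeps the computing engines fed from a stream without additional off-chip traffic.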
Performance Evaluation
The proposed accelerator was implemented on a Xilinx ZC706 platform and evaluated on MobileNetV2 and ShuffleNetV2. It significantly outperforms other state-of-the-art LWCNN accelerators, achieving up to 2092.4 frames per second (FPS) and a MAC efficiency of 94.58%, while reducing on-chip memory size by 68.3% and cutting off-chip memory access relative to benchmark architectures.
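As a rough sanity check on how these figures relate, frame rate can be estimated from the DSP array's peak MAC rate scaled by MAC efficiency. In the sketch below, apart from the 900 DSP slices of the ZC706's Zynq-7045 and the reported 94.58% efficiency, every parameter (clock frequency, DSP packing, per-frame workload) is an assumption for illustration, so the result is a ballpark figure rather than a reproduction of the paper's 2092.4 FPS.

```python
# Back-of-the-envelope relation between MAC efficiency and frame rate.
# Clock frequency, DSP packing, and the workload size are assumptions for
# illustration; only the 900 DSP slices (Zynq-7045 on the ZC706) and the
# 94.58% MAC efficiency come from the platform spec and the paper's results.

def estimated_fps(macs_per_frame, num_dsps, freq_hz, macs_per_dsp_per_cycle, mac_efficiency):
    peak_mac_rate = num_dsps * macs_per_dsp_per_cycle * freq_hz  # MACs per second
    return peak_mac_rate * mac_efficiency / macs_per_frame

if __name__ == "__main__":
    fps = estimated_fps(
        macs_per_frame=146e6,      # ~146 MMACs per frame (ShuffleNetV2 1.0x at 224x224, approximate)
        num_dsps=900,              # DSP48 slices available on the ZC706's Zynq-7045
        freq_hz=200e6,             # assumed clock frequency
        macs_per_dsp_per_cycle=2,  # assumed: two low-precision MACs packed per DSP per cycle
        mac_efficiency=0.9458,     # MAC efficiency reported in the paper
    )
    print(f"Estimated throughput: {fps:.1f} FPS")  # ballpark only, not the paper's exact figure
```

The point of the sketch is the dependency structure: for a fixed per-frame workload, throughput scales with clock frequency, usable DSPs, and MAC efficiency, which is why the high MAC efficiency translates directly into frame rate.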
Implications and Future Directions
The architecture not only enhances the performance of current LWCNN models but also provides a scalable framework that can adapt to different FPGA resource budgets and deep learning models. Its balance of memory optimization and computational efficiency points to a promising direction for deploying CNNs in resource-constrained environments, such as edge devices and real-time systems.
In terms of future developments, the ideas in this paper could stimulate further research into hardware-specific optimizations for increasingly diverse neural network architectures. The methodology for resource allocation and parallelism tuning can be expanded to explore new trade-offs between latency, power consumption, and throughput on next-generation FPGAs.
The paper contributes substantively to the discourse on efficient hardware acceleration for machine learning, offering insights that align well with the practical needs of deploying deep learning solutions in constrained hardware environments. The alignment of architecture with algorithmic properties of LWCNNs shows a strategic approach that maximizes both hardware and application-specific performance metrics.