- The paper introduces a novel multi-computing-engine design that distinguishes between shallow and deep network layers to minimize off-chip memory access.
- It implements a balanced dataflow strategy using fine-grained parallel mechanisms and a dataflow-oriented line buffer scheme to mitigate computational bottlenecks.
- Performance evaluations on MobileNetV2 and ShuffleNetV2 show up to 2092.4 FPS and 94.58% MAC efficiency, with a 68.3% reduction in on-chip memory usage.
An FPGA Accelerator for Efficient Execution of Lightweight CNNs
The paper "A High-Throughput FPGA Accelerator for Lightweight CNNs With Balanced Dataflow" addresses the challenges in accelerating lightweight convolutional neural networks (LWCNNs) on FPGAs. The authors propose a novel multi-computing-engine (multi-CE) architecture designed to optimize both memory usage and computational efficiency, highlighting its performance superiority on popular LWCNN models such as MobileNetV2 and ShuffleNetV2.
Research Contributions and Methodology
- Multi-Computing-Engine Architecture: The authors identify inefficiencies in existing single-computing-engine designs, such as high on-chip/off-chip memory overhead and suboptimal computational efficiency. To address these, they propose a multi-CE-based accelerator with a balanced dataflow strategy. The architecture assigns feature-map-reuse computing engines (FRCEs) to shallow network layers, where large activations dominate memory traffic, and weight-reuse computing engines (WRCEs) to deeper layers, where weights dominate. This differentiation minimizes off-chip memory access and reduces the on-chip memory footprint (the first sketch after this list illustrates such a partitioning).
- Balanced Dataflow Strategy: A key feature of the proposed accelerator is its balanced dataflow strategy, which raises computational efficiency. The paper describes a fine-grained parallel mechanism and a dataflow-oriented line buffer scheme that together mitigate data congestion and map resources efficiently across the network's layers (the second sketch after this list gives a behavioral line-buffer model).
- Resource Allocation and Scalability: The authors introduce a resource-aware memory and parallelism allocation methodology. Guided by a performance model, it adjusts DSP utilization and the CE type assigned to each group of layers according to their resource requirements, maximizing throughput and computational efficiency while keeping the design scalable across FPGA sizes (the DSP-allocation heuristic in the first sketch below illustrates the idea).
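To make the FRCE/WRCE split and the resource-aware allocation concrete, the following Python sketch is a simplified model, not the authors' actual design-space exploration: it classifies each layer by whether its feature maps or its weights dominate memory traffic, then hands out DSPs in proportion to each layer's MAC workload so that no pipeline stage becomes the bottleneck. The layer shapes, 8-bit data assumption, and DSP budget are illustrative placeholders.

```python
# Illustrative sketch (not the authors' methodology): split layers between
# feature-map-reuse (FRCE) and weight-reuse (WRCE) engines, then allocate DSPs
# in proportion to each layer's MAC workload so the multi-CE pipeline stays balanced.
# Layer shapes, 8-bit data, and the DSP budget are placeholder assumptions.

from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    h: int        # output feature-map height
    w: int        # output feature-map width
    cin: int      # input channels
    cout: int     # output channels
    k: int        # kernel size (k x k)
    depthwise: bool = False

    def macs(self) -> int:
        per_output = self.k * self.k * (1 if self.depthwise else self.cin)
        return self.h * self.w * self.cout * per_output

    def fmap_bytes(self) -> int:     # output activations, assuming 8-bit values
        return self.h * self.w * self.cout

    def weight_bytes(self) -> int:   # filter weights, assuming 8-bit values
        per_filter = self.k * self.k * (1 if self.depthwise else self.cin)
        return self.cout * per_filter

def partition(layers):
    """Shallow layers (large feature maps, few weights) -> FRCE;
    deep layers (small feature maps, many weights) -> WRCE."""
    frce = [l for l in layers if l.fmap_bytes() >= l.weight_bytes()]
    wrce = [l for l in layers if l.fmap_bytes() < l.weight_bytes()]
    return frce, wrce

def allocate_dsps(layers, dsp_budget=900):
    """Give each per-layer engine a DSP share proportional to its MAC workload,
    so no single pipeline stage dominates the overall latency."""
    total_macs = sum(l.macs() for l in layers)
    return {l.name: max(1, round(dsp_budget * l.macs() / total_macs)) for l in layers}

if __name__ == "__main__":
    layers = [
        Layer("conv1",   112, 112, 3,   32,   3),
        Layer("dw2",     112, 112, 32,  32,   3, depthwise=True),
        Layer("pw3",     112, 112, 32,  64,   1),
        Layer("pw_deep", 7,   7,   320, 1280, 1),
    ]
    frce, wrce = partition(layers)
    print("FRCE layers:", [l.name for l in frce])
    print("WRCE layers:", [l.name for l in wrce])
    print("DSP shares:", allocate_dsps(layers))
```

In the paper the allocation is driven by a performance model over the whole network rather than a simple per-layer proportion, but the proportional rule captures the balancing intuition: the engine handling the largest share of MACs receives the largest slice of the DSP array.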
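The dataflow-oriented line buffer can likewise be illustrated with a behavioral model. The sketch below is not the paper's RTL/HLS implementation; it only shows the underlying principle: storing k-1 image rows on chip is enough to assemble a k x k window for every output pixel as activations stream in, so the feature map never has to be re-fetched from off-chip memory.

```python
# Behavioral model of a line buffer feeding a k x k convolution window (stride 1,
# valid positions only). Only k-1 image rows are kept on chip; pixels stream in
# one per "cycle" in raster order. Illustrative software model, not the paper's RTL/HLS.

def line_buffer_windows(frame, k=3):
    """Yield ((row, col), window) for every valid k x k window of a 2-D frame."""
    height, width = len(frame), len(frame[0])
    row_buf = [[0] * width for _ in range(k - 1)]   # k-1 on-chip row buffers
    win = [[0] * k for _ in range(k)]               # k x k shift-register window

    for r in range(height):
        for c in range(width):
            pixel = frame[r][c]
            # Shift the window one column to the left.
            for i in range(k):
                for j in range(k - 1):
                    win[i][j] = win[i][j + 1]
            # New rightmost column: k-1 pixels from the row buffers plus the live pixel.
            for i in range(k - 1):
                win[i][k - 1] = row_buf[i][c]
            win[k - 1][k - 1] = pixel
            # Age the row buffers: each entry moves up one buffer, newest pixel at the bottom.
            for i in range(k - 2):
                row_buf[i][c] = row_buf[i + 1][c]
            row_buf[k - 2][c] = pixel
            # A complete window exists once k rows and k columns have streamed in.
            if r >= k - 1 and c >= k - 1:
                yield (r, c), [row[:] for row in win]

if __name__ == "__main__":
    frame = [[8 * r + c for c in range(8)] for r in range(8)]
    windows = list(line_buffer_windows(frame))
    (r, c), first = windows[0]
    # The first valid window is anchored at (2, 2) and equals the frame's top-left 3x3 patch.
    assert (r, c) == (2, 2) and first == [[frame[i][j] for j in range(3)] for i in range(3)]
    print(f"{len(windows)} windows generated; first anchored at {(r, c)}")
```

The on-chip cost is only k-1 rows rather than the whole feature map, which is how a line-buffer scheme keeps the computing engines fed from a stream without additional off-chip traffic.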
Performance Evaluation
The proposed accelerator was implemented on a Xilinx ZC706 platform and evaluated on MobileNetV2 and ShuffleNetV2. It significantly outperforms other state-of-the-art LWCNN accelerators, achieving up to 2092.4 frames per second (FPS) and a MAC efficiency of 94.58%, while reducing on-chip memory size by 68.3% and cutting off-chip memory access relative to benchmark architectures.
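As a rough sanity check on how these figures relate, frame rate can be estimated from the DSP array's peak MAC rate scaled by MAC efficiency. In the sketch below, apart from the 900 DSP slices of the ZC706's Zynq-7045 and the reported 94.58% efficiency, every parameter (clock frequency, DSP packing, per-frame workload) is an assumption for illustration, so the result is a ballpark figure rather than a reproduction of the paper's 2092.4 FPS.

```python
# Back-of-the-envelope relation between MAC efficiency and frame rate.
# Clock frequency, DSP packing, and the workload size are assumptions for
# illustration; only the 900 DSP slices (Zynq-7045 on the ZC706) and the
# 94.58% MAC efficiency come from the platform spec and the paper's results.

def estimated_fps(macs_per_frame, num_dsps, freq_hz, macs_per_dsp_per_cycle, mac_efficiency):
    peak_mac_rate = num_dsps * macs_per_dsp_per_cycle * freq_hz  # MACs per second
    return peak_mac_rate * mac_efficiency / macs_per_frame

if __name__ == "__main__":
    fps = estimated_fps(
        macs_per_frame=146e6,      # ~146 MMACs per frame (ShuffleNetV2 1.0x at 224x224, approximate)
        num_dsps=900,              # DSP48 slices available on the ZC706's Zynq-7045
        freq_hz=200e6,             # assumed clock frequency
        macs_per_dsp_per_cycle=2,  # assumed: two low-precision MACs packed per DSP per cycle
        mac_efficiency=0.9458,     # MAC efficiency reported in the paper
    )
    print(f"Estimated throughput: {fps:.1f} FPS")  # ballpark only, not the paper's exact figure
```

The point of the sketch is the dependency structure: for a fixed per-frame workload, throughput scales with clock frequency, usable DSPs, and MAC efficiency, which is why the high MAC efficiency translates directly into frame rate.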
Implications and Future Directions
The architecture not only enhances the performance of current LWCNN models but also provides a scalable framework that can adapt to different FPGA resource budgets and deep learning models. Its balance of memory optimization and computational efficiency points to a promising direction for deploying CNNs in resource-constrained environments, such as edge devices and real-time systems.
In terms of future developments, the ideas in this paper could stimulate further research into hardware-specific optimizations for increasingly diverse neural network architectures. The methodology for resource allocation and parallelism tuning can be expanded to explore new trade-offs between latency, power consumption, and throughput on next-generation FPGAs.
The paper contributes substantively to the discourse on efficient hardware acceleration for machine learning, offering insights that align well with the practical needs of deploying deep learning solutions in constrained hardware environments. The alignment of architecture with algorithmic properties of LWCNNs shows a strategic approach that maximizes both hardware and application-specific performance metrics.