- The paper introduces a stream-based approach for convolutional layers that reduces latency to as low as 5 microseconds.
- It achieves up to 97% FPGA resource savings with no loss of accuracy, and 99% when tolerating a roughly 6% accuracy drop, through pruning and quantization-aware training.
- The method offers actionable insights for deploying efficient, low-latency deep learning inference on edge devices in high-energy physics and beyond.
Fast Convolutional Neural Networks on FPGAs with hls4ml
The paper presents an advancement in deploying convolutional neural networks (CNNs) on Field-Programmable Gate Arrays (FPGAs) using an extension of the hls4ml library. The primary focus is ultra-low-latency, low-power inference suitable for applications such as those at the CERN Large Hadron Collider (LHC), where high event rates make timely data processing crucial. The hls4ml library translates trained neural network models into FPGA firmware, targeting deployments that must meet microsecond latency requirements.
Implementation of Convolutional Layers
The implementation takes a stream-based approach to convolutional and pooling layers. This diverges from conventional deep learning hardware implementations, which often rely on general matrix multiplication (im2col-style) strategies that require significant additional memory. The streaming approach instead suits fully on-chip designs, avoiding the off-chip data-transfer latency typical of FPGA accelerator platforms. By buffering incoming pixels and reusing them across overlapping kernel windows, it enables efficient on-chip processing of convolutional kernels, and it further trims resources by encoding the convolution's control flow in pre-computed instruction masks.
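The buffering idea can be sketched in plain Python: pixels arrive one at a time in raster order, only the last K image rows are stored (rather than the whole image), and one output is emitted per valid K×K window. This is an illustrative model of the streaming scheme, not hls4ml's actual HLS C++ implementation; `stream_conv2d` is a hypothetical helper.

```python
def stream_conv2d(pixels, width, kernel):
    """Stream-style 2D convolution with 'valid' padding.

    Pixels arrive in raster order; only a rolling buffer of at most
    len(kernel) image rows is kept, mimicking on-chip line buffers.
    """
    k = len(kernel)
    line_buf = [[]]  # rolling buffer of at most k image rows
    out = []
    for px in pixels:
        line_buf[-1].append(px)
        c = len(line_buf[-1])  # current column, 1-based
        if len(line_buf) == k and c >= k:
            # Full k x k window ends at column c of the newest row.
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += line_buf[i][c - k + j] * kernel[i][j]
            out.append(acc)
        if c == width:            # row finished: open a new row and
            line_buf.append([])   # retain only the last k-1 full rows
            if len(line_buf) > k:
                line_buf.pop(0)
    return out
```

The key point mirrored here is that memory scales with the image width (a few line buffers), not with the image area, which is what makes a fully on-chip design feasible.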
The paper demonstrates these methods on the Street View House Numbers (SVHN) dataset, achieving latencies as low as 5 microseconds. Combined with model compression, the techniques yield significant resource reductions: up to 97% FPGA resource savings with no loss of accuracy, and 99% when a roughly 6% accuracy degradation is acceptable. These savings come from pruning, quantization-aware training (QAT), and related compression techniques.
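A minimal sketch of the two compression steps, assuming simple magnitude pruning and ap_fixed-style fixed-point rounding. The paper's actual flow applies pruning during training and uses QKeras for quantization-aware training; `prune` and `quantize` here are hypothetical stand-ins for illustration only.

```python
def prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weights."""
    n_zero = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_zero])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def quantize(w, total_bits=4, int_bits=1):
    """Round to a signed fixed-point grid, ap_fixed<total_bits, int_bits> style.

    int_bits includes the sign bit, so e.g. <4,1> covers [-1.0, 0.875]
    in steps of 2**-3.
    """
    frac_bits = total_bits - int_bits
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1)) / scale
    hi = ((1 << (total_bits - 1)) - 1) / scale
    return min(max(round(w * scale) / scale, lo), hi)
```

Pruned (zero) weights cost no multipliers in the generated firmware, and narrower fixed-point weights shrink or eliminate DSP usage, which is where the reported resource savings originate.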
Resources and Efficiency
Various models were examined to characterize the trade-off between resource consumption, latency, and accuracy. Quantization-aware training emerges as particularly effective, maintaining accuracy down to very low precisions (3-4 bits) and substantially lowering DSP usage without sacrificing performance. This resource optimization is notable given the inherent limits on FPGA resources such as DSPs, LUTs, and BRAM.
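As a rough illustration of why precision drives DSP usage: a streamed convolutional layer needs on the order of K·K·C_in·C_out multiplications per output pixel, time-multiplexed by hls4ml's reuse factor; at higher precisions each parallel multiplication typically maps to a DSP, while at the 3-4 bit precisions reached with QAT the synthesis tools map multiplications to LUTs instead. The helper below is a hypothetical first-order estimate, not hls4ml's synthesis report.

```python
import math

def conv_multipliers(k, c_in, c_out, reuse_factor=1):
    """First-order count of parallel multipliers a streamed conv layer
    instantiates: one kernel window's multiplications, divided by the
    reuse factor (how many times each multiplier is reused per pixel)."""
    return math.ceil(k * k * c_in * c_out / reuse_factor)
```

For example, a 3x3 layer with 16 input and 16 output channels needs 2304 multiplications per output pixel; a reuse factor of 4 brings the parallel multiplier count down to 576, at the cost of proportionally more clock cycles.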
Implications and Future Directions
The extension of hls4ml to CNNs on FPGAs not only carries significant implications for real-time deep learning inference in high-energy physics but also promises utility in broader edge-computing applications that demand efficiency and low latency, such as autonomous systems or real-time monitoring equipment. Future work could include further resource optimization, support for more complex architectures, and broader quantization strategies, potentially enabling more sophisticated deep learning models on constrained hardware.
In conclusion, the paper provides a comprehensive examination of streamlined CNN deployment on FPGAs, offering through hls4ml a refined toolset that emphasizes resource efficiency and low latency, and marking an important step for both specialized and general applications of AI on edge devices.