
An OpenCL(TM) Deep Learning Accelerator on Arria 10 (1701.03534v1)

Published 13 Jan 2017 in cs.DC, cs.AR, and cs.CV

Abstract: Convolutional neural nets (CNNs) have become a practical means to perform vision tasks, particularly in the area of image classification. FPGAs are well known to be able to perform convolutions efficiently, however, most recent efforts to run CNNs on FPGAs have shown limited advantages over other devices such as GPUs. Previous approaches on FPGAs have often been memory bound due to the limited external memory bandwidth on the FPGA device. We show a novel architecture written in OpenCL(TM), which we refer to as a Deep Learning Accelerator (DLA), that maximizes data reuse and minimizes external memory bandwidth. Furthermore, we show how we can use the Winograd transform to significantly boost the performance of the FPGA. As a result, when running our DLA on Intel's Arria 10 device we can achieve a performance of 1020 img/s, or 23 img/s/W when running the AlexNet CNN benchmark. This comes to 1382 GFLOPs and is 10x faster with 8.4x more GFLOPS and 5.8x better efficiency than the state-of-the-art on FPGAs. Additionally, 23 img/s/W is competitive against the best publicly known implementation of AlexNet on nVidia's TitanX GPU.

Authors (5)
  1. Utku Aydonat (1 paper)
  2. Shane O'Connell (1 paper)
  3. Davor Capalija (1 paper)
  4. Andrew C. Ling (2 papers)
  5. Gordon R. Chiu (2 papers)
Citations (238)

Summary

Analysis of an OpenCL™ Deep Learning Accelerator on FPGA

The paper "An OpenCL™ Deep Learning Accelerator on Arria 10" addresses the persistent challenge of executing convolutional neural networks (CNNs) efficiently on Field Programmable Gate Arrays (FPGAs). Despite FPGAs' potential for high efficiency in convolutional operations, past efforts have struggled to demonstrate significant performance advantages over GPUs, primarily because they were bound by external memory bandwidth. This work presents a novel Deep Learning Accelerator (DLA) architecture designed specifically to optimize CNN execution on FPGAs, achieving notable performance improvements.

The authors introduce several key contributions aimed at overcoming the limitations of previous FPGA implementations for deep learning. These include a methodology that minimizes bandwidth requirements by caching intermediate feature maps on-chip in stream buffers, significantly reducing dependence on external memory. In addition, a design-space exploration method based on analytical models selects the architecture configuration that best balances resource usage and throughput. Finally, applying the Winograd transform further boosts performance by reducing the number of multiply-accumulate operations required per convolution.
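To illustrate why the Winograd transform cuts the multiply count, here is a minimal sketch of the 1-D F(2,3) minimal-filtering algorithm. This is our own illustrative example, not necessarily the exact Winograd variant or tile size the paper implements: it produces two outputs of a 3-tap filter with 4 elementwise multiplications instead of the 6 a direct computation needs.

```python
import numpy as np

# Winograd minimal filtering F(2,3): two outputs of a 3-tap
# cross-correlation using 4 multiplies instead of 6.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)   # input transform
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                 # filter transform
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)    # output transform

def winograd_f2_3(d, g):
    """Return [y0, y1] where y[i] = sum_k g[k] * d[i + k]."""
    m = (G @ g) * (B_T @ d)  # 4 multiplies in the transform domain
    return A_T @ m

d = np.array([1.0, 2.0, 3.0, 4.0])  # input tile of 4 samples
g = np.array([1.0, 1.0, 1.0])       # 3-tap filter

direct = np.array([d[0:3] @ g, d[1:4] @ g])  # naive: 6 multiplies
print(winograd_f2_3(d, g), direct)           # both give [6. 9.]
```

In a 2-D CNN layer the same idea is applied to small input tiles, and the savings compound across both spatial dimensions, which is what makes the transform attractive for DSP-limited FPGA fabrics.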

The results show a substantial performance boost when running the AlexNet CNN benchmark on an Intel Arria 10 FPGA. The DLA achieves 1020 images per second at an efficiency of 23 images per second per watt, corresponding to a throughput of 1382 GFLOPS. This is 10x faster than the previous state-of-the-art on FPGAs, with 8.4x more GFLOPS and 5.8x better energy efficiency.
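The reported figures are mutually consistent; a quick back-of-envelope check (our own sketch, using only the numbers quoted above) recovers the implied work per image and the implied power draw:

```python
# Sanity-check the reported AlexNet numbers (illustrative only).
throughput_img_s = 1020   # images per second (reported)
throughput_gflops = 1382  # GFLOP/s (reported)
efficiency_img_s_w = 23   # images per second per watt (reported)

# Implied arithmetic work per image: ~1.36 GFLOP, in the same range
# as commonly quoted estimates for an AlexNet forward pass.
gflop_per_image = throughput_gflops / throughput_img_s

# Implied power draw: ~44 W.
power_w = throughput_img_s / efficiency_img_s_w

print(f"{gflop_per_image:.2f} GFLOP/image, {power_w:.1f} W")
```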

Significantly, the DLA architecture's power efficiency is competitive with NVIDIA's TitanX GPU, a noteworthy result given the traditionally superior parallel processing capabilities of GPUs. This demonstrates that with appropriate architecture customization, FPGAs can rival dedicated GPU implementations not only in throughput but also in energy efficiency.

The implications of this research are significant for the deployment of CNNs in environments where power consumption is as critical as computational performance, such as embedded systems and mobile devices. The architecture's use of OpenCL also supports greater portability and ease of adoption across different FPGA devices, facilitating further research and advancements.

Future directions could investigate the adaptability of this architecture to other deep learning models beyond AlexNet, such as VGG or GoogLeNet, potentially benefiting from even greater performance improvements due to their different topological characteristics. The paper also opens avenues for the exploration of runtime reconfigurability as a means to further optimize FPGA resource utilization and application performance.

In conclusion, this research contributes a scalable and efficient approach to utilizing FPGAs for deep learning, leveraging advanced data handling techniques and mathematical transformations to overcome existing hardware limitations. As the demand for efficient AI computation continues to rise, approaches like this set a precedent for further optimizations in deep learning accelerators.