
An FPGA-Based Accelerator Enabling Efficient Support for CNNs with Arbitrary Kernel Sizes (2402.14307v1)

Published 22 Feb 2024 in cs.AR and cs.LG

Abstract: Convolutional neural networks (CNNs) with large kernels, drawing inspiration from the key operations of vision transformers (ViTs), have demonstrated impressive performance in various vision-based applications. To address the computational-efficiency degradation of existing designs when supporting large-kernel convolutions, an FPGA-based inference accelerator is proposed for the efficient deployment of CNNs with arbitrary kernel sizes. First, a Z-flow method is presented to optimize the computing data flow by maximizing data reuse opportunities. In addition, the proposed design incorporates a kernel-segmentation (Kseg) scheme that extends support to large-kernel convolutions while significantly reducing the storage requirements for overlapped data. Moreover, based on an analysis of typical block structures in emerging CNNs, vertical-fused (VF) and horizontal-fused (HF) methods are developed to optimize CNN deployments from both the computation and transmission perspectives. The proposed hardware accelerator, evaluated on an Intel Arria 10 FPGA, achieves up to 3.91 times better DSP efficiency than prior art on the same network. In particular, it demonstrates efficient support for large-kernel CNNs, achieving throughputs of 169.68 GOPS and 244.55 GOPS for RepLKNet-31 and PyConvResNet-50, respectively, both of which are implemented on hardware for the first time.
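The key idea behind a kernel-segmentation scheme like Kseg is that a large-kernel convolution can be computed exactly as the sum of convolutions with smaller kernel tiles applied at shifted input offsets, so the hardware only ever needs buffering for the small tile size. The pure-Python sketch below is an illustrative decomposition under that assumption, not the paper's actual hardware dataflow; it uses a single-channel "valid" convolution and 3x3 segments:

```python
def conv2d_valid(x, k):
    """Direct 'valid' 2-D convolution (cross-correlation, as used in CNNs)."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + a][j + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]

def conv2d_segmented(x, k, seg=3):
    """Same result, computed Kseg-style: split the large kernel into
    seg x seg tiles and accumulate the shifted partial convolutions."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for r0 in range(0, kh, seg):
        for c0 in range(0, kw, seg):
            # tile of the kernel and the correspondingly shifted input view
            tile = [row[c0:c0 + seg] for row in k[r0:r0 + seg]]
            shifted = [row[c0:] for row in x[r0:]]
            part = conv2d_valid(shifted, tile)
            for i in range(oh):          # crop partials to the output size
                for j in range(ow):
                    out[i][j] += part[i][j]
    return out

# A 7x7 kernel (RepLKNet-style kernels go up to 31x31) split into 3x3 tiles:
x = [[(i * 7 + j * 3) % 11 for j in range(12)] for i in range(12)]
k = [[(i + 2 * j) % 5 - 2 for j in range(7)] for i in range(7)]
assert conv2d_valid(x, k) == conv2d_segmented(x, k)
```

Because the decomposition is exact, the per-tile working set (and hence on-chip line-buffer storage for overlapped rows) is bounded by the segment size rather than by the full kernel size, which is what makes arbitrary kernel sizes tractable on fixed hardware.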

