
Fast, Scalable, Energy-Efficient Non-element-wise Matrix Multiplication on FPGA (2407.02362v2)

Published 2 Jul 2024 in cs.AR, cs.AI, and cs.LG

Abstract: Modern Neural Network (NN) architectures rely heavily on vast numbers of multiply-accumulate arithmetic operations, which constitute the predominant computational cost. This paper therefore proposes a high-throughput, scalable, and energy-efficient non-element-wise matrix multiplication unit on FPGAs as a basic component of NNs. We first streamline the inter-layer and intra-layer redundancies of the MADDNESS algorithm, a LUT-based approximate matrix multiplication scheme, to design a fast, efficient, and scalable approximate matrix multiplication module termed the "Approximate Multiplication Unit (AMU)". The AMU further optimizes LUT-based matrix multiplication through dedicated memory management and access design, decoupling computational overhead from input resolution and significantly boosting the efficiency of FPGA-based NN accelerators. Experimental results show that the AMU achieves up to 9x higher throughput and 112x higher energy efficiency than state-of-the-art FPGA-based Quantised Neural Network (QNN) accelerators.
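The AMU builds on MADDNESS-style LUT-based approximate matrix multiplication, in which runtime multiply-accumulates are replaced by table lookups and additions. As a rough illustration of that idea, here is a minimal product-quantization-style sketch in NumPy. It is a simplification under stated assumptions, not the paper's design: MADDNESS encodes inputs with learned balanced hash trees rather than the nearest-prototype search used below, and the AMU's FPGA memory management and access design are not modeled. All names and shapes are illustrative.

```python
import numpy as np

def build_tables(B, prototypes, C):
    """Precompute lookup tables T[c, k, m] = prototypes[c, k] . B[subspace_c, :, m]."""
    D, M = B.shape
    sub = D // C                          # dims per subspace (assumes D divisible by C)
    K = prototypes.shape[1]
    T = np.empty((C, K, M))
    for c in range(C):
        Bc = B[c * sub:(c + 1) * sub, :]  # (sub, M) weight slice for this subspace
        T[c] = prototypes[c] @ Bc         # (K, M) precomputed partial products
    return T

def encode(A, prototypes, C):
    """Map each row's subspace slice to its nearest prototype index.
    (Stand-in for MADDNESS's learned hash-tree encoder.)"""
    N, D = A.shape
    sub = D // C
    codes = np.empty((N, C), dtype=np.int64)
    for c in range(C):
        Ac = A[:, c * sub:(c + 1) * sub]                              # (N, sub)
        d = ((Ac[:, None, :] - prototypes[c][None, :, :]) ** 2).sum(-1)
        codes[:, c] = d.argmin(1)
    return codes

def amm(codes, T):
    """Approximate A @ B by gathering and summing table rows -- no multiplies."""
    N, C = codes.shape
    out = np.zeros((N, T.shape[2]))
    for c in range(C):
        out += T[c, codes[:, c], :]       # gather + accumulate: the only runtime ops
    return out

# Toy usage: random prototypes stand in for learned ones.
rng = np.random.default_rng(0)
N, D, M, C, K = 8, 16, 4, 4, 16
A, B = rng.normal(size=(N, D)), rng.normal(size=(D, M))
prototypes = rng.normal(size=(C, K, D // C))
T = build_tables(B, prototypes, C)
approx = amm(encode(A, prototypes, C), T)
print(np.abs(approx - A @ B).mean())      # approximation error vs. exact matmul
```

The key property, which the AMU exploits in hardware, is that once `T` is built the inference path performs only index gathers and additions, so the runtime cost is governed by the number of codebooks `C` rather than by the arithmetic precision of the inputs; this is what allows computational overhead to be decoupled from input resolution.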

