Fast, Scalable, Energy-Efficient Non-element-wise Matrix Multiplication on FPGA (2407.02362v2)
Abstract: Modern Neural Network (NN) architectures rely heavily on vast numbers of multiply-accumulate arithmetic operations, which constitute the predominant computational cost. This paper therefore proposes a high-throughput, scalable, and energy-efficient non-element-wise matrix multiplication unit on FPGAs as a basic building block of NNs. We first streamline the inter-layer and intra-layer redundancies of the MADDNESS algorithm, a LUT-based approximate matrix multiplication scheme, to design a fast, efficient, and scalable approximate matrix multiplication module termed the "Approximate Multiplication Unit (AMU)". The AMU further optimizes LUT-based matrix multiplication through dedicated memory-management and access design, decoupling the computational overhead from the input resolution and significantly boosting the efficiency of FPGA-based NN accelerators. Experimental results show that the AMU achieves up to 9x higher throughput and 112x higher energy efficiency than state-of-the-art FPGA-based Quantised Neural Network (QNN) accelerators.
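To make the underlying idea concrete, below is a minimal Python sketch of LUT-based approximate matrix multiplication in the style of MADDNESS: the columns of the left operand are split into subspaces, each row is encoded to a prototype index per subspace, and the product is approximated by accumulating precomputed partial dot products fetched from lookup tables, so the inner loop contains no multiplications. This is a simplification under stated assumptions, not the paper's AMU design: it uses plain k-means encoding in place of MADDNESS's learned hash-tree encoder, and all function names and parameters (`train_lut_amm`, `lut_amm`, `n_codebooks`, `n_prototypes`) are illustrative.

```python
import numpy as np

def train_lut_amm(A_train, B, n_codebooks=4, n_prototypes=16, n_iters=10, seed=0):
    """Learn per-subspace prototypes and precompute lookup tables.

    A_train: (N, D) sample of left-operand rows; B: (D, M) fixed right operand.
    Assumes D is divisible by n_codebooks and N >= n_prototypes.
    Returns (prototypes, luts) with luts[c, k, :] = prototypes[c, k] @ B_subspace_c.
    """
    rng = np.random.default_rng(seed)
    N, D = A_train.shape
    d = D // n_codebooks                        # width of each subspace
    prototypes = np.empty((n_codebooks, n_prototypes, d))
    luts = np.empty((n_codebooks, n_prototypes, B.shape[1]))
    for c in range(n_codebooks):
        sub = A_train[:, c * d:(c + 1) * d]
        # Plain k-means (Lloyd's algorithm) as a stand-in for MADDNESS's hash tree.
        protos = sub[rng.choice(N, n_prototypes, replace=False)].copy()
        for _ in range(n_iters):
            ids = np.argmin(((sub[:, None, :] - protos[None]) ** 2).sum(-1), axis=1)
            for k in range(n_prototypes):
                if np.any(ids == k):            # skip empty clusters
                    protos[k] = sub[ids == k].mean(axis=0)
        prototypes[c] = protos
        # Precompute partial dot products: one LUT row per prototype.
        luts[c] = protos @ B[c * d:(c + 1) * d, :]
    return prototypes, luts

def lut_amm(A, prototypes, luts):
    """Approximate A @ B via encode + table lookup + accumulate (no multiplies)."""
    n_codebooks, _, d = prototypes.shape
    out = np.zeros((A.shape[0], luts.shape[2]))
    for c in range(n_codebooks):
        sub = A[:, c * d:(c + 1) * d]
        # Encode: nearest prototype index per row in this subspace.
        ids = np.argmin(((sub[:, None, :] - prototypes[c][None]) ** 2).sum(-1), axis=1)
        out += luts[c, ids, :]                  # gather LUT rows and accumulate
    return out

if __name__ == "__main__":
    # Quick illustrative check of the approximation against the exact product.
    rng = np.random.default_rng(1)
    A, B = rng.standard_normal((256, 64)), rng.standard_normal((64, 32))
    protos, luts = train_lut_amm(A, B)
    exact = A @ B
    rel_err = np.linalg.norm(lut_amm(A, protos, luts) - exact) / np.linalg.norm(exact)
    print(f"relative error: {rel_err:.3f}")
```

Because the lookup tables are precomputed offline for a fixed right operand (e.g. trained NN weights), inference reduces to index computation and table reads, which is what makes a hardware realization such as the AMU attractive on FPGA block RAM and LUT fabric.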