A method for accelerating low precision operations by sparse matrix multiplication (2403.06924v1)
Abstract: In recent years, the growing demand for computational power across many domains has prompted hardware manufacturers to introduce specialized computing hardware aimed at enhancing computational capability. In particular, tensor hardware supporting low-precision arithmetic has gained increasing prominence in scientific research. However, using low-precision tensor hardware for acceleration introduces errors, posing the fundamental challenge of achieving effective acceleration while maintaining computational accuracy. This paper improves on prior methodology by combining low-precision quantization, vector-wise quantization, and a residual matrix for error correction. The key innovation lies in using sparse matrices instead of dense matrices when compensating for errors with the residual matrix: by retaining only the values that may significantly affect the relative error under a specified threshold, the approach controls quantization error while reducing computational complexity. Experimental results demonstrate that this method effectively controls quantization error while maintaining a high acceleration effect: on the CPU, the improved algorithm achieves up to a 15% accuracy improvement together with a 1.46x speedup.
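The core idea in the abstract (vector-wise quantization of the operand, plus a thresholded sparse residual matrix that corrects only the entries likely to dominate the relative error) can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the function names, the 8-bit symmetric quantization scheme, and the relative-error threshold rule are all assumptions made for the example.

```python
import numpy as np

def vectorwise_quantize(A, bits=8):
    """Symmetric per-row (vector-wise) quantization: each row gets its own scale."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(A).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero rows
    Q = np.round(A / scales).astype(np.int8)
    return Q, scales.astype(np.float32)

def sparse_residual_gemm(A, B, threshold=0.02, bits=8):
    """Low-precision GEMM with a thresholded sparse residual correction.

    A is quantized vector-wise; the residual R = A - dequant(Q) is kept only
    where |R| exceeds `threshold` * |A|, so the correction pass is a
    sparse-times-dense product instead of a full dense one.
    """
    Qa, scales = vectorwise_quantize(A, bits)
    A_hat = Qa.astype(np.float32) * scales          # dequantized approximation of A
    R = A - A_hat                                   # dense residual matrix
    # Keep only residual entries that may significantly affect relative error.
    mask = np.abs(R) > threshold * (np.abs(A) + 1e-12)
    R_sparse = np.where(mask, R, 0.0)               # stand-in for a true sparse format
    # Main product (would run on low-precision tensor hardware) + sparse correction.
    main = A_hat @ B
    correction = R_sparse @ B
    return main + correction
```

In a real implementation the main product would use an integer or FP16 tensor-core GEMM and `R_sparse` would be stored in a compressed sparse format (e.g. CSR) so the correction cost scales with the number of retained entries rather than with the full matrix size.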