A method for accelerating low precision operations by sparse matrix multiplication (2403.06924v1)

Published 11 Mar 2024 in math.NA and cs.NA

Abstract: In recent years, the fervent demand for computational power across various domains has prompted hardware manufacturers to introduce specialized hardware aimed at enhancing computational capabilities. In particular, tensor hardware supporting low precision has gained increasing prominence in scientific research. However, using low-precision tensor hardware for computational acceleration often introduces errors, posing the fundamental challenge of achieving effective acceleration while maintaining computational accuracy. This paper improves on existing methodology by combining low-precision quantization and vector-wise quantization with a residual matrix for error correction. The key innovation lies in using sparse matrices, rather than dense matrices, when compensating for errors with the residual matrix: by retaining only the values that may significantly affect the relative error under a specified threshold, the approach controls quantization error while reducing computational complexity. Experimental results demonstrate that the method effectively controls the quantization error while maintaining a high acceleration effect; on the CPU, the improved algorithm achieves up to a 15% accuracy improvement together with a 1.46x speedup.
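The core idea the abstract describes, a low-precision product plus a thresholded (sparse) residual correction, can be illustrated with a short NumPy sketch. This is a minimal sketch under assumed details: the symmetric per-row/per-column int8 scales, the global relative threshold `tau`, and all function names here are illustrative assumptions, not the authors' exact algorithm or implementation.

```python
import numpy as np

def vectorwise_quantize(M, axis):
    """Symmetric int8 quantization with one scale per row (axis=1)
    or per column (axis=0) -- an assumed vector-wise scheme."""
    scale = np.max(np.abs(M), axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, np.float32(1.0), scale)  # guard all-zero vectors
    Mq = np.round(M / scale).astype(np.int8)
    return Mq, scale.astype(np.float32)

def sparse_residual(M, Mq, scale, tau):
    """Quantization residual with entries below the (assumed) relative
    threshold tau dropped, so the correction matrix stays sparse."""
    R = M - Mq.astype(np.float32) * scale
    keep = np.abs(R) >= tau * np.max(np.abs(M))
    return np.where(keep, R, np.float32(0.0))

def quantized_matmul(A, B, tau=1e-3):
    """Low-precision matmul with sparse-residual error compensation."""
    Aq, sa = vectorwise_quantize(A, axis=1)  # row-wise scales for A
    Bq, sb = vectorwise_quantize(B, axis=0)  # column-wise scales for B
    Ad = Aq.astype(np.float32) * sa          # dequantized A
    # Fast path: int8 product accumulated in int32, then rescaled.
    C = (Aq.astype(np.int32) @ Bq.astype(np.int32)).astype(np.float32) * (sa * sb)
    # Compensation terms; a real implementation would store the
    # residuals in a sparse format (e.g. CSR) to make these cheap.
    Ra = sparse_residual(A, Aq, sa, tau)
    Rb = sparse_residual(B, Bq, sb, tau)
    return C + Ra @ B + Ad @ Rb

# Quick check against the full-precision product.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 128)).astype(np.float32)
B = rng.standard_normal((128, 32)).astype(np.float32)
err = np.linalg.norm(quantized_matmul(A, B) - A @ B) / np.linalg.norm(A @ B)
print(f"relative error: {err:.2e}")
```

With dense residuals the decomposition AB = A_d B_d + R_A B + A_d R_B is exact (writing A_d, B_d for the dequantized matrices and R_A = A - A_d, R_B = B - B_d); thresholding the residuals trades a small, bounded error for much cheaper sparse correction products, which is the trade-off the paper exploits.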
