
LRAMM -- Low precision approximates GEMM via RSVD (2405.16917v1)

Published 27 May 2024 in math.NA, cs.NA, and cs.PF

Abstract: Accelerating matrix multiplication has been a research hotspot across many domains. Because many applications tolerate small errors, approximate matrix multiplication can deliver significant performance gains with little loss of precision. In this paper, we propose LRAMM, a high-performance approximate matrix multiplication algorithm that combines mixed-precision quantized matrix multiplication with randomized SVD (RSVD); by exploiting low-rank matrix decomposition, it further improves efficiency while staying within the error range of low-precision matrix multiplication.
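The abstract outlines the core idea: compress each input matrix with a randomized SVD, then perform the remaining products in low precision. Below is a minimal NumPy sketch of that combination, not the paper's exact algorithm; the rank, the oversampling amount, the symmetric per-matrix int8 quantization, and the choice to quantize only the small rank-by-rank core are illustrative assumptions.

```python
import numpy as np

def rsvd(A, rank, oversample=8, seed=0):
    """Randomized SVD: project A onto a random subspace, orthonormalize,
    then take an exact SVD of the small projected matrix."""
    rng = np.random.default_rng(seed)
    k = rank + oversample
    Omega = rng.standard_normal((A.shape[1], k))
    Q, _ = np.linalg.qr(A @ Omega)                 # orthonormal basis for range(A)
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank], Vt[:rank, :]

def quantize_int8(X):
    """Symmetric per-matrix int8 quantization; returns integer values and a scale."""
    scale = max(np.abs(X).max() / 127.0, 1e-12)
    return np.round(X / scale).astype(np.int8), scale

def lramm_like_gemm(A, B, rank):
    """Illustrative approximate GEMM: low-rank factors via RSVD, then a
    low-precision (int8) product of the small rank-by-rank core."""
    Ua, sa, Vta = rsvd(A, rank)
    Ub, sb, Vtb = rsvd(B, rank)
    # A @ B is approximated as Ua diag(sa) (Vta @ Ub) diag(sb) Vtb;
    # only the small core Vta @ Ub is computed with quantized integers here.
    qV, sV = quantize_int8(Vta)
    qU, sU = quantize_int8(Ub)
    core = (qV.astype(np.int32) @ qU.astype(np.int32)) * (sV * sU)
    return (Ua * sa) @ core @ (sb[:, None] * Vtb)

# Usage: relative error of the approximate product on a random instance.
A = np.random.randn(256, 256)
B = np.random.randn(256, 256)
err = np.linalg.norm(lramm_like_gemm(A, B, rank=64) - A @ B) / np.linalg.norm(A @ B)
```

When the rank is well below the matrix dimensions, the dominant cost shifts from the full product to the sketching step and the small factor products, which is where the low-precision arithmetic is applied in this sketch.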

