VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores (2310.02065v1)

Published 3 Oct 2023 in cs.DC and cs.LG

Abstract: The increasing success and scaling of Deep Learning models demand higher computational efficiency and power. Sparsification can lead to both smaller models and higher compute efficiency, and accelerated hardware for it is becoming available. However, exploiting this hardware efficiently requires kernel implementations, pruning algorithms, and storage formats that utilize the specialized sparse vector units. One example is NVIDIA's Sparse Tensor Cores (SPTCs), which promise a 2x speedup. However, SPTCs only support the 2:4 format, limiting achievable sparsity ratios to 50%. We present the V:N:M format, which enables the execution of arbitrary N:M ratios on SPTCs. To efficiently exploit the resulting format, we propose Spatha, a high-performance sparse library for DL routines. We show that Spatha achieves up to 37x speedup over cuBLAS. We also demonstrate a second-order pruning technique that enables sparsification to high sparsity ratios with V:N:M and little to no loss in accuracy in modern transformers.
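The abstract builds on N:M structured sparsity: at most N values out of every M consecutive elements in a row are nonzero, with 2:4 being the only pattern SPTCs accelerate natively. The sketch below is a minimal, hedged illustration of magnitude-based N:M pruning in NumPy; it shows the general N:M concept only, not the paper's V:N:M storage layout or the Spatha kernels, and the function name `nm_prune` is hypothetical.

```python
import numpy as np

def nm_prune(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Illustrative magnitude-based N:M pruning (hypothetical helper, not Spatha).

    Keeps the N largest-magnitude values in every group of M consecutive
    elements along each row and zeroes out the rest.
    """
    rows, cols = weights.shape
    assert cols % m == 0, "row length must be a multiple of M"

    # View each row as groups of M consecutive elements.
    groups = weights.reshape(rows, cols // m, m)

    # Rank elements inside each group by magnitude; keep the N largest.
    order = np.argsort(np.abs(groups), axis=-1)   # ascending order of |w|
    keep = order[..., m - n:]                     # indices of the N largest
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)

    return (groups * mask).reshape(rows, cols)

# Example: 2:8 sparsity (75%) on a small random matrix.
w = np.random.randn(4, 16).astype(np.float32)
w_sparse = nm_prune(w, n=2, m=8)
assert (np.count_nonzero(w_sparse.reshape(4, 2, 8), axis=-1) == 2).all()
```

With n=2 and m=4 this reduces to the hardware-native 2:4 pattern; larger M gives higher sparsity ratios, which is the regime the V:N:M format and Spatha target according to the abstract.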

Authors (6)
  1. Roberto L. Castro (7 papers)
  2. Andrei Ivanov (17 papers)
  3. Diego Andrade (3 papers)
  4. Tal Ben-Nun (53 papers)
  5. Basilio B. Fraguela (1 paper)
  6. Torsten Hoefler (203 papers)
Citations (7)