Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs (2407.20496v1)

Published 30 Jul 2024 in cs.LG and cs.AI

Abstract: N:M sparsity pruning is a powerful technique for compressing deep neural networks, utilizing NVIDIA's Sparse Tensor Core technology. This method benefits from hardware support for sparse indexing, enabling the adoption of fine-grained sparsity to maintain model accuracy while minimizing the overhead typically associated with irregular data access. Although restricted to a fixed level of sparsity due to its reliance on hardware, N:M sparsity can be combined with coarser sparsity techniques to achieve diverse compression ratios. Initially, column-wise vector sparsity is applied to a dense model, followed by row-wise N:M sparsity on the preserved column vectors. We refer to this multi-level approach as hierarchical N:M (HiNM) sparsity. Similar to earlier single-level sparsity techniques, HiNM sparsity necessitates an effective channel permutation strategy to maximize the accuracy of the compressed networks. However, it introduces further complexity by requiring the rearrangement of both input and output channels, raising challenges such as the permutation sequence, HiNM-sparsity-aware permutation, and consistent channel ordering across layers. In this paper, we introduce a channel permutation method designed specifically for HiNM sparsity, named gyro-permutation. This method is crafted to exploit the unique characteristics of HiNM pruning, incorporating a strategic policy in each permutation phase (channel sampling, clustering, and assignment) to circumvent local minima. Additionally, we have developed a GPU kernel that facilitates independent layer permutation during the execution of HiNM sparse networks. Our extensive experimental evaluations on various DNN models demonstrate that our gyro-permutation significantly enhances the accuracy of HiNM sparse networks, allowing them to reach performance levels comparable to those of unstructured sparse networks.
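
To make the two-level scheme concrete, below is a minimal NumPy sketch of magnitude-based HiNM masking: column vectors are pruned first, then N:M sparsity is applied row-wise over the surviving elements. The function name `hinm_prune`, the L2-norm vector score, and the parameters (`vec_len`, `keep_ratio`, `n`, `m`) are illustrative assumptions and do not reproduce the paper's exact procedure or its gyro-permutation step.

```python
import numpy as np


def hinm_prune(weight: np.ndarray, vec_len: int = 4,
               keep_ratio: float = 0.5, n: int = 2, m: int = 4) -> np.ndarray:
    """Illustrative two-stage magnitude-based HiNM masking (not the paper's exact method).

    Stage 1 (column-wise vector sparsity): split the matrix into (vec_len x 1)
    column vectors and keep only the top `keep_ratio` fraction, ranked by L2 norm.
    Stage 2 (row-wise N:M sparsity): within each row of a block, every group of
    `m` consecutive surviving elements keeps its `n` largest magnitudes.
    """
    rows, cols = weight.shape
    assert rows % vec_len == 0, "rows must be divisible by vec_len"

    pruned = weight.copy()
    vecs = pruned.reshape(rows // vec_len, vec_len, cols)   # view into `pruned`

    # Stage 1: rank column vectors by L2 norm and zero out the weakest ones.
    scores = np.linalg.norm(vecs, axis=1)                   # (rows/vec_len, cols)
    k = max(1, int(round(keep_ratio * scores.size)))
    thresh = np.sort(scores, axis=None)[-k]
    vec_mask = scores >= thresh
    vecs *= vec_mask[:, None, :]

    # Stage 2: N:M pruning along each row, over the surviving columns only.
    for b in range(vecs.shape[0]):
        kept_cols = np.flatnonzero(vec_mask[b])
        for r in range(vec_len):
            row = vecs[b, r, kept_cols]                      # fancy index -> copy
            for g in range(0, len(row) - m + 1, m):          # full groups only
                group = row[g:g + m]
                group[np.argsort(np.abs(group))[:m - n]] = 0.0
            vecs[b, r, kept_cols] = row                      # write back
    return pruned


# Example: 50% vector sparsity followed by 2:4 sparsity on the survivors.
W = np.random.randn(8, 16).astype(np.float32)
W_hinm = hinm_prune(W, vec_len=4, keep_ratio=0.5, n=2, m=4)
```

In the paper's pipeline, gyro-permutation would first reorder input and output channels (via channel sampling, clustering, and assignment) so that the selected vectors and N:M groups discard as little salient weight as possible; the sketch above simply masks in the original channel order.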

Authors (4)
  1. Seungmin Yu (1 paper)
  2. Xiaodie Yi (1 paper)
  3. Hayun Lee (3 papers)
  4. Dongkun Shin (4 papers)