Resource Efficient Neural Networks Using Hessian Based Pruning (2306.07030v1)
Abstract: Neural network pruning is a practical way to reduce the size of trained models and the number of floating-point operations they require. One approach is to use the relative Hessian trace to calculate the sensitivity of each channel, as opposed to the more common magnitude-based pruning. However, the stochastic method used to estimate the Hessian trace must run for many iterations before it converges, which is time-consuming for larger models with many millions of parameters. To address this problem, we modify the existing approach by estimating the Hessian trace in FP16 precision instead of FP32. We evaluate the modified approach (EHAP) on ResNet-32, ResNet-56, and WideResNet-28-8 trained on the CIFAR-10 and CIFAR-100 image classification tasks and achieve faster computation of the Hessian trace. Specifically, our modified approach achieves speedups ranging from 17% to as much as 44% across different combinations of model architectures and GPU devices. It also uses around 40% less GPU memory when pruning ResNet-32 and ResNet-56, which allows a larger Hessian batch size to be used for estimating the Hessian trace. We also compare pruning with FP16 and FP32 Hessian trace calculation and find no noticeable accuracy differences between the two. Overall, this is a simple and effective way to compute the relative Hessian trace faster without sacrificing pruned model performance. We additionally present a full pipeline combining EHAP with quantization-aware training (QAT), using INT8 QAT to compress the network further after pruning. In particular, we use symmetric quantization for the weights and asymmetric quantization for the activations.
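For illustration, the sketch below shows how a Hutchinson-style stochastic estimate of the Hessian trace can be run with the forward pass under FP16 autocast in PyTorch. It is a minimal sketch of the general technique, not the paper's EHAP implementation: the function name `hutchinson_trace_fp16`, the fixed iteration count, and the exact placement of `torch.autocast` are assumptions, and the estimator shown computes the trace over all parameters rather than the per-channel (relative) traces used for sensitivity ranking.

```python
import torch

def hutchinson_trace_fp16(model, criterion, inputs, targets, n_iters=100):
    """Minimal sketch: Hutchinson's stochastic Hessian-trace estimator with the
    forward pass run in FP16 via torch.autocast (assumed placement of mixed
    precision; model and data are assumed to be on a CUDA device)."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Forward and first backward pass; create_graph=True keeps the graph so a
    # second differentiation (the Hessian-vector product) is possible.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = criterion(model(inputs), targets)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    trace_estimates = []
    for _ in range(n_iters):
        # Rademacher probe vectors v with entries in {-1, +1}
        vs = [torch.randint_like(p, high=2, dtype=p.dtype) * 2 - 1 for p in params]
        # Hessian-vector product Hv via a second backward pass
        hvs = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        # v^T H v, summed over all parameter tensors
        trace_estimates.append(sum((v * hv).sum() for v, hv in zip(vs, hvs)).item())

    # Average over probes; in practice one iterates until the running mean
    # converges rather than for a fixed n_iters.
    return sum(trace_estimates) / len(trace_estimates)
```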
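The quantization scheme described in the abstract can be summarized by the standard symmetric and asymmetric INT8 quantizers sketched below. This is a generic illustration under the usual definitions (zero point fixed at 0 for the weights, an observed [min, max] range for the activations); the helper names and per-tensor granularity are assumptions, and in QAT these would act as fake-quantization ops trained with a straight-through estimator.

```python
import torch

def quantize_weight_symmetric(w: torch.Tensor, num_bits: int = 8):
    """Symmetric quantization for weights: zero point fixed at 0."""
    qmax = 2 ** (num_bits - 1) - 1                     # 127 for INT8
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q, scale                                     # dequantize with q * scale

def quantize_activation_asymmetric(x: torch.Tensor, num_bits: int = 8):
    """Asymmetric quantization for activations: uses the full [min, max] range."""
    qmax = 2 ** num_bits - 1                            # 255 for unsigned INT8
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    zero_point = torch.clamp(torch.round(-x_min / scale), 0, qmax)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)
    return q, scale, zero_point                         # dequantize with (q - zero_point) * scale
```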