Compressing the Backward Pass of Large-Scale Neural Architectures by Structured Activation Pruning (2311.16883v2)
Abstract: The rise of Deep Neural Networks (DNNs) has led to an increase in model size and complexity, straining the memory capacity of GPUs. Sparsity in DNNs, characterized as structural or ephemeral, has gained attention as a solution. This work focuses on ephemeral sparsity, aiming to reduce memory consumption during training. It emphasizes the significance of activations, an often overlooked component, and their role in memory usage. This work employs structured pruning in Block Sparse Compressed Row (BSR) format in combination with a magnitude-based criterion to efficiently prune activations. We furthermore introduce efficient block-sparse operators for GPUs and showcase their effectiveness, as well as the superior compression offered by block sparsity. We report the effectiveness of activation pruning by evaluating training speed, accuracy, and memory usage of large-scale neural architectures, using ResMLP on image classification tasks as an example. As a result, we observe a memory reduction of up to 32% while maintaining accuracy. Ultimately, our approach aims to democratize large-scale model training, reduce GPU requirements, and address ecological concerns.
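To make the idea concrete, the sketch below shows block-wise, magnitude-based pruning of a 2-D activation followed by compact storage in BSR format, using only NumPy and SciPy. This is a minimal illustration under stated assumptions, not the authors' GPU implementation: the block size of 32, the 50% keep ratio, and the helper name `prune_activation_blocks` are illustrative choices, not values taken from the paper.

```python
# Hedged sketch (not the authors' code): prune the (block x block) tiles of an
# activation with the smallest mean |value| and store the result in BSR format.
import numpy as np
from scipy.sparse import bsr_matrix

def prune_activation_blocks(act: np.ndarray, block: int = 32, keep_ratio: float = 0.5):
    """Zero out low-magnitude tiles of `act`, keeping roughly `keep_ratio` of all
    (block x block) tiles, and return the pruned activation as a BSR matrix."""
    rows, cols = act.shape
    assert rows % block == 0 and cols % block == 0, "pad activations to a multiple of the block size"
    # Mean absolute magnitude per tile, shape (rows // block, cols // block)
    tiles = np.abs(act).reshape(rows // block, block, cols // block, block).mean(axis=(1, 3))
    k = max(1, int(keep_ratio * tiles.size))
    threshold = np.partition(tiles.ravel(), -k)[-k]   # k-th largest tile magnitude
    keep = tiles >= threshold                         # boolean tile-level mask
    # Expand the tile mask to element resolution and apply it
    mask = np.kron(keep, np.ones((block, block), dtype=act.dtype))
    # Only blocks containing nonzeros are materialized in the BSR representation
    return bsr_matrix(act * mask, blocksize=(block, block))

if __name__ == "__main__":
    act = np.random.randn(128, 256).astype(np.float32)
    act_bsr = prune_activation_blocks(act, block=32, keep_ratio=0.5)
    dense_bytes = act.nbytes
    sparse_bytes = act_bsr.data.nbytes + act_bsr.indices.nbytes + act_bsr.indptr.nbytes
    print(f"dense activation: {dense_bytes} B, pruned BSR activation: {sparse_bytes} B")
```

Pruning whole blocks rather than individual elements keeps the index overhead low (one column index per retained block) and maps naturally onto tiled GPU kernels, which is what motivates the BSR layout and the compression advantage of block sparsity highlighted in the abstract.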