Compressing the Backward Pass of Large-Scale Neural Architectures by Structured Activation Pruning (2311.16883v2)

Published 28 Nov 2023 in cs.LG and cs.PF

Abstract: The rise of Deep Neural Networks (DNNs) has led to increasing model size and complexity, straining the memory capacity of GPUs. Sparsity in DNNs, characterized as either structural or ephemeral, has gained attention as a solution. This work focuses on ephemeral sparsity with the aim of reducing memory consumption during training. It emphasizes the significance of activations, an often overlooked component, and their role in memory usage. The approach combines structured pruning in Block Sparse Compressed Row (BSR) format with a magnitude-based criterion to efficiently prune activations. We furthermore introduce efficient block-sparse operators for GPUs and showcase their effectiveness, as well as the superior compression offered by block sparsity. We demonstrate the effectiveness of activation pruning by evaluating the training speed, accuracy, and memory usage of large-scale neural architectures, using ResMLP on image classification tasks as an example. We observe a memory reduction of up to 32% while maintaining accuracy. Ultimately, our approach aims to democratize large-scale model training, reduce GPU requirements, and address ecological concerns.
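
As a rough illustration of the idea, the sketch below applies a magnitude-based, block-granular mask to an activation tensor and stores only the pruned copy, in BSR format, for the backward pass of a linear layer. This is a minimal PyTorch sketch, not the authors' implementation: the block size, keep ratio, and the helper names `block_magnitude_mask` and `PrunedActivationLinear` are illustrative assumptions, and a production version would use dedicated block-sparse GPU kernels (e.g. written in Triton) rather than densifying the activation again in the backward pass.

```python
# Hypothetical sketch of magnitude-based structured activation pruning for the
# backward pass. Block size (32) and keep ratio are illustrative, not taken
# from the paper. Requires a recent PyTorch version with BSR support.
import torch


def block_magnitude_mask(x: torch.Tensor, block: int = 32, keep_ratio: float = 0.7) -> torch.Tensor:
    """Keep the (block x block) tiles of a 2D tensor with the largest L1 norm, zero the rest.

    Assumes both dimensions of `x` are divisible by `block`.
    """
    rows, cols = x.shape
    # View x as a grid of tiles and score each tile by its L1 magnitude.
    tiles = x.reshape(rows // block, block, cols // block, block)
    scores = tiles.abs().sum(dim=(1, 3))            # shape: (rows/block, cols/block)
    k = max(1, int(keep_ratio * scores.numel()))
    # k-th largest score = (n - k + 1)-th smallest score.
    threshold = scores.flatten().kthvalue(scores.numel() - k + 1).values
    keep = (scores >= threshold).to(x.dtype)        # 1 for kept tiles, 0 for pruned tiles
    # Broadcast the tile-level mask back to element granularity.
    return keep.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)


class PrunedActivationLinear(torch.autograd.Function):
    """Linear layer that saves a block-pruned, block-sparse copy of its input for backward."""

    @staticmethod
    def forward(ctx, x, weight):
        mask = block_magnitude_mask(x)
        # Keep the pruned activation in BSR form; blocksize matches the mask's block size.
        ctx.save_for_backward((x * mask).to_sparse_bsr(blocksize=(32, 32)), weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        x_pruned, weight = ctx.saved_tensors
        grad_x = grad_out @ weight                  # gradient w.r.t. the input stays dense
        grad_w = grad_out.t() @ x_pruned.to_dense() # weight gradient uses the pruned activation
        return grad_x, grad_w


if __name__ == "__main__":
    x = torch.randn(128, 256, requires_grad=True)
    w = torch.randn(64, 256, requires_grad=True)
    PrunedActivationLinear.apply(x, w).sum().backward()
    print(x.grad.shape, w.grad.shape)
```

The memory saving comes from replacing the dense activation normally cached for the backward pass with its block-sparse representation; the gradient with respect to the weights is then computed from the pruned activation, which is where the approximation enters.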
