Always-Sparse Training by Growing Connections with Guided Stochastic Exploration (2401.06898v1)

Published 12 Jan 2024 in cs.LG

Abstract: The excessive computational requirements of modern artificial neural networks (ANNs) are posing limitations on the machines that can run them. Sparsification of ANNs is often motivated by time, memory and energy savings only during model inference, yielding no benefits during training. A growing body of work is now focusing on providing the benefits of model sparsification during training as well. While these methods greatly improve the training efficiency, the training algorithms yielding the most accurate models still materialize the dense weights, or compute dense gradients, during training. We propose an efficient, always-sparse training algorithm with excellent scaling to larger and sparser models, supported by its linear time complexity with respect to the model width during training and inference. Moreover, our guided stochastic exploration algorithm improves upon the accuracy of previous sparse training methods. We evaluate our method on CIFAR-10/100 and ImageNet using ResNet, VGG, and ViT models, and compare it against a range of sparsification methods.
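
The method described above belongs to the family of dynamic sparse ("prune-and-grow") training schemes. As a minimal, hypothetical sketch of that general idea (not the authors' implementation; the function name and the prune_frac and num_candidates parameters are assumptions for illustration), the PyTorch snippet below drops the lowest-magnitude active weights of one layer and regrows connections by scoring only a randomly sampled subset of inactive positions, so the per-step cost scales with the number of active and candidate connections rather than with the dense weight count.

```python
import torch

def prune_and_grow(weight, mask, grad, num_candidates, prune_frac=0.3):
    # Illustrative prune-and-grow step for one sparse layer (plain tensors, no autograd).
    # weight, mask, grad share the same shape; mask holds 0/1 entries.
    active = mask.bool()
    n_prune = int(prune_frac * int(active.sum()))

    # Prune: deactivate the lowest-magnitude active weights.
    scores = torch.where(active, weight.abs(), torch.full_like(weight, float("inf")))
    drop = torch.topk(scores.flatten(), n_prune, largest=False).indices
    mask.view(-1)[drop] = 0.0
    weight.view(-1)[drop] = 0.0

    # Grow: sample a random subset of inactive positions (stochastic exploration)
    # and activate the candidates with the largest gradient magnitude. In a true
    # always-sparse implementation only these candidate gradients would be computed.
    inactive = (~mask.bool()).flatten().nonzero(as_tuple=True)[0]
    sample = inactive[torch.randperm(inactive.numel())[:num_candidates]]
    k = min(n_prune, sample.numel())
    grow = sample[torch.topk(grad.flatten()[sample].abs(), k).indices]
    mask.view(-1)[grow] = 1.0  # newly grown connections start from zero weight
    return weight, mask
```

Note that this sketch samples candidate positions uniformly for brevity; the "guided" part of the paper's guided stochastic exploration concerns how that candidate subset is chosen, which is not reproduced here.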

Authors (4)
  1. Mike Heddes (10 papers)
  2. Narayan Srinivasa (4 papers)
  3. Tony Givargis (12 papers)
  4. Alexandru Nicolau (11 papers)