No Free Prune: Information-Theoretic Barriers to Pruning at Initialization (2402.01089v2)

Published 2 Feb 2024 in stat.ML and cs.LG

Abstract: The existence of "lottery tickets" (arXiv:1803.03635) at or near initialization raises the tantalizing question of whether large models are necessary in deep learning, or whether sparse networks can be quickly identified and trained without ever training the dense models that contain them. However, efforts to find these sparse subnetworks without training the dense model ("pruning at initialization") have been broadly unsuccessful (arXiv:2009.08576). We put forward a theoretical explanation for this, based on the model's effective parameter count, $p_\text{eff}$, given by the sum of the number of non-zero weights in the final network and the mutual information between the sparsity mask and the data. We show that the Law of Robustness of arXiv:2105.12806 extends to sparse networks with the usual parameter count replaced by $p_\text{eff}$, meaning a sparse neural network which robustly interpolates noisy data requires a heavily data-dependent mask. We posit that pruning during and after training outputs masks with higher mutual information than those produced by pruning at initialization. Thus two networks may have the same sparsities, but differ in effective parameter count based on how they were trained. This suggests that pruning near initialization may be infeasible and explains why lottery tickets exist, but cannot be found fast (i.e., without training the full network). Experiments on neural networks confirm that information gained during training may indeed affect model capacity.
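As a rough sketch of how these quantities fit together (the symbols $w$ for the final sparse weights, $m$ for the sparsity mask, $S$ for a training set of $n$ noisy samples in $d$ dimensions, and $\mathrm{Lip}(f)$ for the Lipschitz constant of the learned network $f$ are introduced here for illustration, and the inequality mirrors the form of the original Law of Robustness rather than the paper's exact statement):

$$p_\text{eff} = \|w\|_0 + I(m; S), \qquad \mathrm{Lip}(f) \gtrsim \sqrt{\frac{nd}{p_\text{eff}}} \quad \text{for robust interpolation of noisy data}.$$

Read this way, a very sparse network (small $\|w\|_0$) can interpolate noisy data smoothly only if its mask carries substantial information about the data, i.e. only if $I(m; S)$ is large; a mask chosen at initialization, (nearly) independently of the data, keeps $I(m; S)$ small and hence $p_\text{eff}$ close to the raw count of non-zero weights.

The contrast between a data-independent and a data-dependent mask can also be made concrete with a toy sketch. The following Python snippet is an illustration under assumptions chosen here (a linear model fit by least squares and simple magnitude pruning), not a reproduction of the paper's experiments:

# Toy illustration (not the paper's setup) of the two mask-selection regimes the
# abstract contrasts: a mask fixed at initialization, which here is independent of
# the data, versus magnitude pruning after training, whose mask depends on the data
# through the trained weights. (Practical pruning-at-init criteria such as SNIP do
# look at the data, but typically far less than full training does.)
import numpy as np

rng = np.random.default_rng(0)
n, d, sparsity = 200, 50, 0.9                 # samples, input dimension, fraction pruned

X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)   # noisy linear targets
w_init = rng.normal(size=d)                   # dense initialization

def top_k_mask(w, keep):
    """Boolean mask keeping the `keep` largest-magnitude coordinates of w."""
    mask = np.zeros_like(w, dtype=bool)
    mask[np.argsort(np.abs(w))[-keep:]] = True
    return mask

keep = int(round((1 - sparsity) * d))

# "Pruning at initialization": the mask is a function of w_init only, so I(mask; data) = 0 here.
mask_init = top_k_mask(w_init, keep)

# "Pruning after training": fit the dense model, then prune by trained magnitude;
# this mask is a function of the data, so it can carry information about it.
w_trained, *_ = np.linalg.lstsq(X, y, rcond=None)
mask_trained = top_k_mask(w_trained, keep)

print("weights kept per mask:", keep)
print("coordinates on which the two masks agree:", int((mask_init == mask_trained).sum()), "of", d)

Both masks have the same sparsity, but only the second depends on $(X, y)$, which is the sense in which the abstract says two equally sparse networks can differ in effective parameter count.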

References (68)
  1. Prospect pruning: Finding trainable weights at initialization using meta-gradients. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=AIgn9uwfcD1.
  2. A convergence theory for deep learning via over-parameterization. In International conference on machine learning, pp. 242–252. PMLR, 2019.
  3. Chaining mutual information and tightening generalization bounds. Advances in Neural Information Processing Systems, 31, 2018.
  4. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
  5. What is the state of neural network pruning? Proceedings of machine learning and systems, 2:129–146, 2020.
  6. Beyond the universal law of robustness: Sharper laws for random features and neural tangent kernels. Proceedings of the 40th International Conference on Machine Learning, 2023.
  7. Tightening mutual information-based bounds on generalization error. IEEE Journal on Selected Areas in Information Theory, 1(1):121–130, 2020.
  8. A universal law of robustness via isoperimetry. Journal of the ACM, 70(2):1–18, 2023.
  9. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature Communications, 12(1), May 2021. ISSN 2041-1723. doi: 10.1038/s41467-021-23103-1. URL http://dx.doi.org/10.1038/s41467-021-23103-1.
  10. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. The Annals of Statistics, 48(1):27–42, 2020.
  11. The Lottery Ticket Hypothesis for pre-trained BERT networks. Advances in neural information processing systems, 33:15834–15846, 2020.
  12. Thomas M Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE transactions on electronic computers, (3):326–334, 1965.
  13. Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528, 2004.
  14. Progressive skeletonization: Trimming more fat from a network at initialization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=9GsFOUyUPi.
  15. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2018.
  16. Algorithmic pure states for the negative spherical perceptron. Journal of Statistical Physics, 189(2):27, 2022.
  17. The difficulty of training sparse neural networks. arXiv preprint arXiv:1906.10732, 2019.
  18. Vitaly Feldman. Does learning require memorization? A short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 954–959, 2020.
  19. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2018.
  20. Stabilizing the lottery ticket hypothesis. arXiv preprint arXiv:1903.01611, 2019.
  21. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pp. 3259–3269. PMLR, 2020a.
  22. Pruning neural networks at initialization: Why are we missing the mark? arXiv preprint arXiv:2009.08576, 2020b.
  23. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
  24. Demystifying fixed k-nearest neighbor information estimators. IEEE Transactions on Information Theory, 64(8):5629–5661, 2018.
  25. Elizabeth Gardner. The space of interactions in neural network models. Journal of physics A: Mathematical and general, 21(1):257, 1988.
  26. Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Physical Review E, 100(1):012115, 2019.
  27. Linearized two-layers neural networks in high dimension. The Annals of Statistics, 49(2):1029–1054, 2021.
  28. Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 28, 2015.
  29. EIE: Efficient inference engine on compressed deep neural network. ACM SIGARCH Computer Architecture News, 44(3):243–254, 2016.
  30. Second order derivatives for network pruning: Optimal brain surgeon. Advances in neural information processing systems, 5, 1992.
  31. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  32. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European conference on computer vision (ECCV), pp. 784–800, 2018.
  33. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  4700–4708, 2017.
  34. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
  35. Implicit regularization of random feature models. In International Conference on Machine Learning, pp. 4631–4640. PMLR, 2020.
  36. Estimating mutual information. Physical review E, 69(6):066138, 2004.
  37. Optimal brain damage. In Advances in Neural Information Processing Systems (NIPS), 1990.
  38. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.
  39. SNIP: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018.
  40. Network in network. arXiv preprint arXiv:1312.4400, 2013.
  41. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018.
  42. On the computational efficiency of training neural networks. Advances in neural information processing systems, 27, 2014.
  43. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.
  44. Six lectures on linearized neural networks. arXiv preprint arXiv:2308.13431, 2023.
  45. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.
  46. Tractability from overparametrization: The example of the negative perceptron. Probability Theory and Related Fields, pp.  1–106, 2024.
  47. Behnam Neyshabur. Implicit regularization in deep learning. arXiv preprint arXiv:1709.01953, 2017.
  48. Unmasking the lottery ticket hypothesis: What’s encoded in a winning ticket’s mask? arXiv preprint arXiv:2210.03044, 2022.
  49. Understanding pruning at initialization: An effective node-path balancing perspective. 2022.
  50. What’s hidden in a randomly weighted neural network? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  11893–11902, 2020.
  51. How much does your data exploration overfit? Controlling bias via information usage. IEEE Transactions on Information Theory, 66(1):302–323, 2019.
  52. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  53. Rigorous solution of the gardner problem. Communications in mathematical physics, 234:383–422, 2003.
  54. More is better in modern machine learning: when infinite overparameterization is optimal and overfitting is obligatory. arXiv preprint arXiv:2311.14646, 2023.
  55. Rare gems: Finding lottery tickets at initialization. Advances in Neural Information Processing Systems, 35:14529–14540, 2022.
  56. Mihailo Stojnic. Another look at the gardner problem. arXiv preprint arXiv:1306.3979, 2013.
  57. A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29):14516–14525, 2019.
  58. Michel Talagrand. Intersecting random half-spaces: toward the Gardner-Derrida formula. The Annals of Probability, 28(2):725–758, 2000.
  59. Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in neural information processing systems, 33:6377–6389, 2020.
  60. Ramon van Handel. Probability in high dimensions. 2016. URL https://web.math.princeton.edu/~rvan/APC550.pdf.
  61. Roman Vershynin. High-dimensional probability: An introduction with applications in data science. Cambridge Series in Statistical and Probabilistic Mathematics, 47. Cambridge University Press, 2018. ISBN 9781108415194.
  62. Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations, 2020.
  63. Learning structured sparsity in deep neural networks. Advances in neural information processing systems, 29, 2016.
  64. Information-theoretic analysis of generalization capability of learning algorithms. Advances in Neural Information Processing Systems, 30, 2017.
  65. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
  66. SpArch: Efficient architecture for sparse matrix multiplication. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 261–274. IEEE, 2020.
  67. Deconstructing lottery tickets: Zeros, signs, and the supermask. Advances in neural information processing systems, 32, 2019.
  68. To prune, or not to prune: Exploring the efficacy of pruning for model compression, 2018. URL https://openreview.net/forum?id=S1lN69AT-.
Authors (3)
  1. Tanishq Kumar (6 papers)
  2. Kevin Luo (5 papers)
  3. Mark Sellke (57 papers)
Citations (1)