
Masks, Signs, And Learning Rate Rewinding (2402.19262v1)

Published 29 Feb 2024 in cs.LG

Abstract: Learning Rate Rewinding (LRR) has been established as a strong variant of Iterative Magnitude Pruning (IMP) to find lottery tickets in deep overparameterized neural networks. While both iterative pruning schemes couple structure and parameter learning, understanding how LRR excels in both aspects can bring us closer to the design of more flexible deep learning algorithms that can optimize diverse sets of sparse architectures. To this end, we conduct experiments that disentangle the effect of mask learning and parameter optimization and how both benefit from overparameterization. The ability of LRR to flip parameter signs early and stay robust to sign perturbations seems to make it not only more effective in mask identification but also in optimizing diverse sets of masks, including random ones. In support of this hypothesis, we prove in a simplified single hidden neuron setting that LRR succeeds in more cases than IMP, as it can escape initially problematic sign configurations.
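To make the distinction between the two iterative pruning schemes concrete, below is a minimal toy sketch (not the authors' code) of the loop structure the abstract refers to: both IMP and LRR repeatedly prune the smallest-magnitude weights and retrain, but IMP rewinds the surviving parameters to their initial values before each retraining round, whereas LRR keeps the trained parameters (and hence their learned signs) and only restarts the learning-rate schedule. The `train` function here is a stand-in placeholder for a full training run.

```python
# Toy sketch of IMP vs. LRR on a plain weight vector (illustrative only).
import numpy as np

rng = np.random.default_rng(0)


def train(weights, mask, lr_schedule):
    """Placeholder for retraining the pruned network: noisy updates on active weights."""
    w = weights.copy()
    for lr in lr_schedule:
        w += lr * rng.normal(scale=0.1, size=w.shape)  # stand-in for gradient steps
    return w * mask


def magnitude_mask(weights, mask, prune_frac=0.2):
    """Zero out the smallest-magnitude fraction of the still-active weights."""
    active = np.flatnonzero(mask)
    k = int(prune_frac * active.size)
    drop = active[np.argsort(np.abs(weights[active]))[:k]]
    new_mask = mask.copy()
    new_mask[drop] = 0.0
    return new_mask


def iterative_prune(w_init, rounds=3, mode="LRR"):
    lr_schedule = np.linspace(0.1, 0.01, 5)
    mask = np.ones_like(w_init)
    w = train(w_init, mask, lr_schedule)
    for _ in range(rounds):
        mask = magnitude_mask(w, mask)
        if mode == "IMP":
            w = w_init * mask       # IMP: rewind parameters to initialization
        # LRR: keep the trained w (learned signs preserved), only restart the schedule
        w = train(w, mask, lr_schedule)
    return w, mask


w0 = rng.normal(size=20)
for mode in ("IMP", "LRR"):
    w, m = iterative_prune(w0, mode=mode)
    print(mode, "final sparsity:", 1 - m.mean())
```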
