Where Do Large Learning Rates Lead Us? (2410.22113v1)
Abstract: It is generally accepted that starting neural network training with large learning rates (LRs) improves generalization. Following a line of research devoted to understanding this effect, we conduct an empirical study in a controlled setting focusing on two questions: 1) how large an initial LR is required to obtain optimal quality, and 2) what are the key differences between models trained with different LRs? We discover that only a narrow range of initial LRs slightly above the convergence threshold leads to optimal results after fine-tuning with a small LR or weight averaging. By studying the local geometry of the reached minima, we observe that using LRs from this optimal range allows the optimization to locate a basin that contains only high-quality minima. Additionally, we show that these initial LRs result in a sparse set of learned features, with a clear focus on those most relevant to the task. In contrast, starting training with LRs that are too small leads to unstable minima and to attempts to learn all features simultaneously, resulting in poor generalization. Conversely, initial LRs that are too large fail to locate a basin with good solutions or to extract meaningful patterns from the data.
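The abstract describes a two-phase protocol: pre-train with a (possibly large) fixed initial LR, then obtain the final model either by fine-tuning with a small LR or by averaging weights along the large-LR trajectory. Below is a minimal sketch of that protocol, not the authors' code: the toy model, synthetic data, and all hyperparameter values are illustrative assumptions.

```python
# Sketch of the pre-train-then-refine protocol described in the abstract.
# All concrete values (LRs, epochs, architecture, data) are assumptions.
import copy
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel, update_bn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the paper's networks and image datasets.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(), nn.Linear(128, 10))
data = TensorDataset(torch.randn(512, 3, 32, 32), torch.randint(0, 10, (512,)))
loader = DataLoader(data, batch_size=128, shuffle=True)

def train_epochs(model, loader, lr, epochs):
    """Plain SGD training at a fixed learning rate."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

# Phase 1: pre-train with a large initial LR (the value here is arbitrary).
train_epochs(model, loader, lr=0.5, epochs=5)

# Option A: fine-tune a copy of the pre-trained weights with a small LR.
finetuned = train_epochs(copy.deepcopy(model), loader, lr=0.01, epochs=2)

# Option B: SWA-style weight averaging while continuing at the large LR.
swa_model = AveragedModel(model)
for _ in range(5):
    train_epochs(model, loader, lr=0.5, epochs=1)
    swa_model.update_parameters(model)
update_bn(loader, swa_model)  # refresh BatchNorm statistics (no-op for this toy MLP)
```

The paper's finding, in these terms, is that only a narrow band of phase-1 LRs just above the convergence threshold yields optimal quality after either option A or option B.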
Authors: Ildus Sadrtdinov, Maxim Kodryan, Eduard Pokonechny, Ekaterina Lobacheva, Dmitry Vetrov