Where Do Large Learning Rates Lead Us? (2410.22113v1)

Published 29 Oct 2024 in cs.LG and stat.ML

Abstract: It is generally accepted that starting neural networks training with large learning rates (LRs) improves generalization. Following a line of research devoted to understanding this effect, we conduct an empirical study in a controlled setting focusing on two questions: 1) how large an initial LR is required for obtaining optimal quality, and 2) what are the key differences between models trained with different LRs? We discover that only a narrow range of initial LRs slightly above the convergence threshold lead to optimal results after fine-tuning with a small LR or weight averaging. By studying the local geometry of reached minima, we observe that using LRs from this optimal range allows for the optimization to locate a basin that only contains high-quality minima. Additionally, we show that these initial LRs result in a sparse set of learned features, with a clear focus on those most relevant for the task. In contrast, starting training with too small LRs leads to unstable minima and attempts to learn all features simultaneously, resulting in poor generalization. Conversely, using initial LRs that are too large fails to detect a basin with good solutions and extract meaningful patterns from the data.

Authors (5)
  1. Ildus Sadrtdinov (4 papers)
  2. Maxim Kodryan (6 papers)
  3. Eduard Pokonechny (1 paper)
  4. Ekaterina Lobacheva (17 papers)
  5. Dmitry Vetrov (84 papers)

Summary

Understanding the Influence of Initial Learning Rates in Neural Network Training

The paper "Where Do Large Learning Rates Lead Us?" by Sadrtdinov et al. presents an in-depth empirical analysis of the role and impact of large initial learning rates (LRs) in training neural networks. The paper challenges and refines common deep learning practice, focusing specifically on how large the initial LR should be for optimal neural network performance.

Key Findings

The research addresses the ambiguous stance toward large LRs by posing two questions: 1) what range of initial LRs is required for optimal results, and 2) what distinguishes models trained with different LRs. The paper classifies LR behavior into three regimes: convergence, chaotic equilibrium, and divergence.
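
The protocol described in the abstract (pre-train with a given initial LR, then fine-tune with a small LR before evaluating) can be pictured with a short sketch. The snippet below is a minimal PyTorch illustration, not the paper's code: the model, synthetic data, epoch counts, and LR grid are placeholder assumptions chosen only to make the sweep runnable.

```python
# Minimal sketch of an initial-LR sweep with small-LR fine-tuning.
# Model, data, epochs, and LR grid are placeholders, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model():
    # Tiny convnet standing in for the paper's architectures.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        nn.Linear(32 * 16, 10),
    )

def run_epochs(model, loader, lr, epochs, device="cpu"):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()

@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    model.eval()
    correct = total = 0
    for x, y in loader:
        pred = model(x.to(device)).argmax(dim=1).cpu()
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total

def lr_sweep(train_loader, test_loader, initial_lrs, fine_tune_lr=1e-3):
    """Pre-train with each initial LR, then fine-tune with a small LR."""
    results = {}
    for lr in initial_lrs:
        model = make_model()
        run_epochs(model, train_loader, lr, epochs=5)            # large-LR phase
        run_epochs(model, train_loader, fine_tune_lr, epochs=5)  # small-LR fine-tuning
        results[lr] = accuracy(model, test_loader)
    return results

if __name__ == "__main__":
    # Synthetic CIFAR-shaped data; replace with a real dataset for meaningful results.
    x = torch.randn(512, 3, 32, 32)
    y = torch.randint(0, 10, (512,))
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(x, y), batch_size=64, shuffle=True)
    print(lr_sweep(loader, loader, initial_lrs=[1e-3, 1e-2, 1e-1, 1.0]))
```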

  1. Empirical Boundary for Learning Rates: Optimal generalization is achieved with initial LRs slightly above the threshold required for convergence, which places training in the chaotic equilibrium regime. This refines the prevailing wisdom, which emphasizes that large LRs are necessary but says little about how large they should be.
  2. Landscape of Solutions: Models trained with LRs from this optimal range locate loss-landscape basins that contain only high-quality minima. Excessively large LRs instead lead models into broad, high-error basins, whereas LRs slightly above the convergence threshold find regions densely packed with effective solutions (a basin-probing interpolation sketch appears after this list).
  3. Sparse Feature Learning: Training with LRs from the optimal range changes the feature-learning dynamics: models specialize on a sparse set of task-relevant features, which improves generalization. In contrast, too-small LRs attempt to learn all features simultaneously, while too-large LRs fail to extract meaningful patterns from the data.
  4. Practical Implications for Image Classification: Extending the findings from synthetic data to real-world datasets such as CIFAR-10 reveals similar feature-learning dynamics, confirming that the results transfer. Frequency analysis indicates that networks trained in the optimal LR regime rely substantially on mid-frequency image components that are beneficial for classification (a sketch of one such frequency analysis also follows this list).
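
To make the basin claim in point 2 concrete, a standard way to probe whether two solutions share a basin is to evaluate the loss along the straight line between their weight vectors. The sketch below assumes models and a data loader like those in the sweep above; it is an illustrative linear-connectivity probe, not the paper's exact measurement.

```python
# Probe a shared low-loss basin by evaluating loss along the linear path
# between two solutions' parameters. Illustrative, not the paper's code.
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def interpolate_loss(model_a, model_b, loader, n_points=11, device="cpu"):
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    losses = []
    for alpha in torch.linspace(0.0, 1.0, n_points):
        blend = copy.deepcopy(model_a).to(device)
        # Note: integer buffers (e.g. BatchNorm counters) may need special handling.
        blend.load_state_dict({
            k: (1 - alpha) * state_a[k] + alpha * state_b[k] for k in state_a
        })
        blend.eval()
        total, count = 0.0, 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            total += F.cross_entropy(blend(x), y, reduction="sum").item()
            count += y.numel()
        losses.append(total / count)
    return losses  # a flat, low curve suggests both endpoints lie in one basin
```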
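
The frequency analysis mentioned in point 4 can likewise be sketched: keep only one band of spatial frequencies in the test images and measure how accuracy changes. The band edges and masking scheme below are illustrative assumptions, not the paper's exact procedure.

```python
# Band-pass test images in Fourier space and measure per-band accuracy.
# Band edges and normalization are illustrative assumptions.
import torch

def radial_band_mask(h, w, low, high):
    """Boolean mask selecting spatial frequencies with radius in [low, high)."""
    fy = torch.fft.fftfreq(h).reshape(-1, 1)
    fx = torch.fft.fftfreq(w).reshape(1, -1)
    radius = torch.sqrt(fx ** 2 + fy ** 2)  # 0 .. ~0.707 cycles/pixel
    return (radius >= low) & (radius < high)

def keep_band(images, low, high):
    """Zero out all spatial frequencies outside [low, high) for a batch (N, C, H, W)."""
    spectrum = torch.fft.fft2(images)
    mask = radial_band_mask(images.shape[-2], images.shape[-1], low, high)
    return torch.fft.ifft2(spectrum * mask.to(images.dtype)).real

@torch.no_grad()
def band_accuracy(model, loader, bands=((0.0, 0.1), (0.1, 0.25), (0.25, 0.71))):
    """Accuracy when only a single frequency band of the input is kept."""
    model.eval()
    results = {}
    for low, high in bands:
        correct = total = 0
        for x, y in loader:
            pred = model(keep_band(x, low, high)).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
        results[(low, high)] = correct / total
    return results
```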

Implications and Future Directions

The findings have practical implications for optimization strategies in neural network training. They suggest that careful selection of the initial LR can avert poor generalization even when the computational budget stays fixed. Furthermore, understanding these dynamics may inform the theoretical study of non-convex optimization landscapes and point toward more robust architecture design.

Nevertheless, the paper highlights the need for more research into the relationship between feature-learning dynamics and landscape geometry. The sparsification phenomenon raises questions about trade-offs in interpretability and model robustness that merit further theoretical exploration. Future work could also extend these empirical investigations to other architectures and data domains, broadening our understanding of large-LR effects in neural network training.

In conclusion, this work advances the discussion of neural network optimization by clarifying an often-overlooked aspect of model training: how precisely the initial learning rate should be calibrated. Through rigorous empirical methodology, it lays the groundwork for better practices in model training and architecture design, improving both generalization and efficiency in deep learning.