Understanding the Generalization Benefits of Late Learning Rate Decay (2401.11600v1)
Abstract: Why does training neural networks with large learning rates for an extended period often lead to better generalization? In this paper, we examine this question by studying the relation between the training and test losses of neural networks. By visualizing these losses, we observe that a training trajectory run with a large learning rate navigates through the manifold of training-loss minima and eventually approaches a neighborhood of the test-loss minimum. Motivated by these findings, we introduce a nonlinear model whose loss landscape mirrors those observed for real neural networks. Analyzing SGD training on this model, we show that an extended phase with a large learning rate steers the model toward the minimum-norm solution of the training loss, which may achieve near-optimal generalization, thereby confirming the empirically observed benefits of late learning rate decay.
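To make the schedule concrete, below is a minimal sketch of "late learning rate decay" in PyTorch: SGD runs with a large constant learning rate for most of training and the rate is decayed only near the end. The toy regression data, network width, learning rate, and decay point are illustrative assumptions, not the setup analyzed in the paper.

```python
# A minimal sketch (not the paper's exact setup) of late learning rate decay:
# train with a relatively large constant learning rate for most of training,
# then decay it near the end. All hyperparameters below are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data (assumed for illustration).
X = torch.randn(256, 10)
y = torch.sin(X[:, :1]) + 0.1 * torch.randn(256, 1)

model = nn.Sequential(nn.Linear(10, 200), nn.ReLU(), nn.Linear(200, 1))
loss_fn = nn.MSELoss()

# Large initial learning rate, decayed only at 90% of training ("late decay").
total_epochs = 1000
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[int(0.9 * total_epochs)], gamma=0.1)

for epoch in range(total_epochs):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # lr stays at 0.1 until ~epoch 900, then drops to 0.01
    if epoch % 100 == 0:
        print(f"epoch {epoch:4d}  lr={scheduler.get_last_lr()[0]:.3f}  "
              f"train loss={loss.item():.4f}")
```

An "early decay" baseline would simply move the milestone to a small fraction of `total_epochs`; the paper's claim is that keeping the large-learning-rate phase long lets the trajectory travel along the manifold of training-loss minima before the final decay.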
Authors: Yinuo Ren, Chao Ma, Lexing Ying