Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks (1907.04595v2)

Published 10 Jul 2019 in cs.LG and stat.ML

Abstract: Stochastic gradient descent with a large initial learning rate is widely used for training modern neural net architectures. Although a small initial learning rate allows for faster training and better test performance initially, the large learning rate achieves better generalization soon after the learning rate is annealed. Towards explaining this phenomenon, we devise a setting in which we can prove that a two layer network trained with large initial learning rate and annealing provably generalizes better than the same network trained with a small learning rate from the start. The key insight in our analysis is that the order of learning different types of patterns is crucial: because the small learning rate model first memorizes easy-to-generalize, hard-to-fit patterns, it generalizes worse on hard-to-generalize, easier-to-fit patterns than its large learning rate counterpart. This concept translates to a larger-scale setting: we demonstrate that one can add a small patch to CIFAR-10 images that is immediately memorizable by a model with small initial learning rate, but ignored by the model with large learning rate until after annealing. Our experiments show that this causes the small learning rate model's accuracy on unmodified images to suffer, as it relies too much on the patch early on.

Authors (3)
  1. Yuanzhi Li (119 papers)
  2. Colin Wei (17 papers)
  3. Tengyu Ma (117 papers)
Citations (274)

Summary

  • The paper demonstrates that a large initial learning rate delays premature memorization, yielding improved regularization during training.
  • It uses a simplified two-layer network model and theoretical proofs to reveal how SGD noise steers balanced pattern learning.
  • Experiments on a modified CIFAR-10, in which a small label-encoding patch is added to images, confirm that a large-then-annealed learning rate delays memorization of the patch and preserves accuracy on unmodified images.

The Regularization Effect of Initial Large Learning Rate in Neural Network Training

The paper "Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks" investigates the intriguing phenomenon where a large initial learning rate, when annealed, often results in better generalization of neural networks compared to starting with a small learning rate. Despite large learning rates slowing down optimization in the early stages, their longer-term benefits for model generalization have been empirically observed, yet lacked a thorough theoretical understanding. This work seeks to bridge this gap through a detailed theoretical analysis and controlled experimental setups.

The authors construct a simplified two-layer neural network setting to make their hypothesis precise, showing how networks trained with a large initial learning rate can generalize better. The crux of the analysis is the order in which different types of patterns are learned. A network trained with a small learning rate fits the "easy-to-generalize, hard-to-fit" patterns almost immediately, because its low-noise dynamics allow it to memorize them. Having memorized those patterns early, it leans on them and ends up generalizing worse on the "hard-to-generalize, easier-to-fit" patterns than its large-learning-rate counterpart. In contrast, the SGD noise induced by a large learning rate initially prevents this memorization, so the network first learns the "hard-to-generalize, easier-to-fit" patterns; after the rate is annealed, it can still pick up the remaining patterns, yielding a more balanced learning process overall.
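
To make the two pattern types concrete, the following is a minimal, hypothetical sketch of a synthetic dataset in this spirit; the encodings, noise levels, and proportions are illustrative assumptions, not the paper's construction. One feature block leaks the label almost directly (quickly memorizable), while another encodes it through an XOR-style rule that a linear fit cannot capture.

```python
import numpy as np

def make_toy_data(n, d=20, rng=None):
    """Toy data with two pattern types (illustrative only, not the paper's
    exact distribution).

    Block A: the label is written almost verbatim into coordinate 0 for a
    subset of examples, so a network can fit it by latching onto one feature.
    Block B: the label is encoded as an XOR of two noisy coordinates, which
    a linear rule cannot capture.
    """
    rng = rng or np.random.default_rng(0)
    y = rng.integers(0, 2, size=n) * 2 - 1              # labels in {-1, +1}
    x = rng.normal(scale=0.1, size=(n, d))

    # Block A: label leaked into coordinate 0 for half of the examples.
    leak = rng.random(n) < 0.5
    x[leak, 0] = y[leak] + rng.normal(scale=0.01, size=leak.sum())

    # Block B: XOR-style pattern in coordinates 1 and 2.
    s = rng.integers(0, 2, size=n) * 2 - 1
    x[:, 1] = s + rng.normal(scale=0.3, size=n)
    x[:, 2] = s * y + rng.normal(scale=0.3, size=n)
    return x.astype(np.float32), y.astype(np.float32)
```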

The authors back these claims with both theoretical proofs and experimental validation. The theory shows how the larger SGD noise that accompanies a large initial learning rate regularizes the learning dynamics, producing a model that generalizes better. In their simplified setting, they demonstrate the predicted benefit of a large initial learning rate followed by annealing.

Specifically, a large initial learning rate provides implicit regularization by delaying the memorization of the quickly memorizable ("easy-to-generalize, hard-to-fit") patterns. The noise injected by large-learning-rate SGD during early training keeps the model from committing to these patterns prematurely. When the learning rate is subsequently reduced, the model consolidates the patterns that generalize better, delivering a performance boost over models trained with a small learning rate from the start. The proof makes this precise through analytical bounds on the learning dynamics under the two schedules.
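
As a rough illustration of the two training protocols being compared, here is a minimal PyTorch sketch that trains the same two-layer ReLU network once with a large learning rate annealed partway through training and once with a small constant rate. The architecture, learning rates, and anneal point are assumptions chosen for illustration, not the paper's values.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train(x, y, lr_schedule, hidden=256, epochs=60, batch=64, seed=0):
    """Train a two-layer ReLU network with a per-epoch learning-rate schedule.

    x: float tensor (N, d); y: float tensor (N,) with labels in {-1, +1}.
    """
    torch.manual_seed(seed)
    net = nn.Sequential(nn.Linear(x.shape[1], hidden), nn.ReLU(),
                        nn.Linear(hidden, 1))
    opt = torch.optim.SGD(net.parameters(), lr=lr_schedule(0))
    loss_fn = nn.SoftMarginLoss()                     # logistic loss for +/-1 labels
    loader = DataLoader(TensorDataset(x, y), batch_size=batch, shuffle=True)
    for epoch in range(epochs):
        for group in opt.param_groups:                # apply the schedule
            group["lr"] = lr_schedule(epoch)
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(net(xb).squeeze(-1), yb)
            loss.backward()
            opt.step()
    return net

# The two protocols compared in the paper; the rates and anneal epoch
# below are illustrative assumptions.
large_then_annealed = lambda epoch: 0.5 if epoch < 30 else 0.01
small_throughout = lambda epoch: 0.01

# Placeholder data; substitute any dataset with +/-1 labels
# (e.g. the toy generator sketched earlier).
x = torch.randn(2048, 20)
y = torch.sign(torch.randn(2048))
net_a = train(x, y, large_then_annealed)
net_b = train(x, y, small_throughout)
```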

Empirically, the paper modifies CIFAR-10 by adding a small, memorizable patch that encodes the class label, making the learning-order effect directly visible. Networks trained with a small initial learning rate memorize the patch almost immediately, at the expense of the rest of the image, while networks trained with a large initial learning rate ignore the patch until after the rate is annealed. As a result, the small-learning-rate model's accuracy on unmodified images suffers, since it relies too heavily on the patch early in training.
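
A hedged sketch of this kind of dataset modification is shown below: a hypothetical helper that stamps a small patch whose pixel intensity encodes the class label into the corner of each image. The patch size, position, intensity coding, and the fact that every image is patched are illustrative assumptions; the paper's construction differs in its details (for instance, in which images receive the patch).

```python
import torch

def add_label_patch(images, labels, patch=3, num_classes=10):
    """Stamp a patch encoding the label into the top-left corner of each image.

    images: float tensor (N, C, H, W) with values in [0, 1]
    labels: long tensor (N,) with classes 0..num_classes-1
    """
    out = images.clone()
    # One intensity level per class, spread evenly over [0, 1].
    levels = labels.float() / (num_classes - 1)
    out[:, :, :patch, :patch] = levels.view(-1, 1, 1, 1)
    return out

# Example on random stand-in "images"; real CIFAR-10 tensors have the
# same (N, 3, 32, 32) layout.
imgs = torch.rand(8, 3, 32, 32)
lbls = torch.randint(0, 10, (8,))
patched = add_label_patch(imgs, lbls)
```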

The practical implication is that the initial learning rate matters not only for convergence speed but also for controlling overfitting through the order in which patterns are learned. The authors also propose an alternative strategy: adding Gaussian noise to the network's activations and annealing it over the course of training reproduces the regularization benefit of a large initial learning rate even when a small learning rate is used throughout, broadening the practical options for training neural networks.
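
A minimal sketch of such an annealed activation-noise layer follows, assuming a linear annealing schedule and a placement before the hidden-layer nonlinearity; the initial scale, schedule, and placement are assumptions for illustration rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

class AnnealedGaussianNoise(nn.Module):
    """Adds zero-mean Gaussian noise to activations during training; the
    standard deviation is annealed linearly to zero over `anneal_epochs`."""

    def __init__(self, init_std=0.5, anneal_epochs=30):
        super().__init__()
        self.init_std = init_std
        self.anneal_epochs = anneal_epochs
        self.epoch = 0

    def set_epoch(self, epoch):
        # Call once per epoch from the training loop.
        self.epoch = epoch

    def forward(self, x):
        if not self.training:
            return x
        std = self.init_std * max(0.0, 1.0 - self.epoch / self.anneal_epochs)
        if std == 0.0:
            return x
        return x + std * torch.randn_like(x)

# A small-learning-rate model with annealed noise on the hidden layer.
noise = AnnealedGaussianNoise()
net = nn.Sequential(nn.Linear(20, 256), noise, nn.ReLU(), nn.Linear(256, 1))
```

In the training loop, `noise.set_epoch(epoch)` would be called at the start of each epoch so the injected noise shrinks as training proceeds, mirroring the effect of annealing a large learning rate.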

Looking forward, this research could stimulate further exploration into advanced learning rate schedules or noise-induced regularization techniques, potentially extending beyond supervised learning scenarios. Such explorations can further inform robust design heuristics for training deep networks, ultimately aiming for models that generalize better across diverse datasets and tasks without requiring exhaustive hyperparameter tuning. While the paper focuses on two-layer networks for theoretical tractability, extending these insights to deeper architectures and more complex datasets presents a promising avenue for future work.