
Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling (2405.14578v5)

Published 23 May 2024 in cs.LG

Abstract: In current deep learning tasks, Adam style optimizers such as Adam, Adagrad, RMSProp, Adafactor, and Lion have been widely used as alternatives to SGD style optimizers. These optimizers typically update model parameters using the sign of gradients, resulting in more stable convergence curves. The learning rate and the batch size are the most critical hyperparameters for optimizers, which require careful tuning to enable effective convergence. Previous research has shown that the optimal learning rate increases linearly or follows similar rules with batch size for SGD style optimizers. However, this conclusion is not applicable to Adam style optimizers. In this paper, we elucidate the connection between optimal learning rates and batch sizes for Adam style optimizers through both theoretical analysis and extensive experiments. First, we raise the scaling law between batch sizes and optimal learning rates in the sign of gradient case, in which we prove that the optimal learning rate first rises and then falls as the batch size increases. Moreover, the peak value of the surge will gradually move toward the larger batch size as training progresses. Second, we conducted experiments on various CV and NLP tasks and verified the correctness of the scaling law.

Authors (13)
  1. Shuaipeng Li (11 papers)
  2. Penghao Zhao (7 papers)
  3. Hailin Zhang (51 papers)
  4. Xingwu Sun (32 papers)
  5. Hao Wu (625 papers)
  6. Dian Jiao (10 papers)
  7. Weiyan Wang (12 papers)
  8. Chengjun Liu (12 papers)
  9. Zheng Fang (104 papers)
  10. Jinbao Xue (13 papers)
  11. Yangyu Tao (19 papers)
  12. Bin Cui (165 papers)
  13. Di Wang (408 papers)
Citations (5)

Summary

  • The paper establishes a novel scaling law showing that optimal learning rates surge and then decline as batch sizes increase for Adam-style optimizers.
  • It derives theoretical formulations and validates the model through extensive experiments on CNNs, ResNets, and Transformer architectures.
  • The findings offer actionable insights for effective hyperparameter tuning and adaptive training schedules, challenging traditional linear scaling assumptions.

An Analysis of the Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling for Adam-style Optimizers

The paper "Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling" by Li et al. explores the intricate relationship between batch size and optimal learning rates for Adam-style optimizers. This relationship is critically important for effective convergence in training deep learning models. Unlike Stochastic Gradient Descent (SGD)-style optimizers, where linear scaling laws have been well explored, the nuances for Adam-style optimizers have remained less clear. This paper provides theoretical foundations and empirical evidence for understanding these nuances.

Key Insights and Contributions

The paper makes several significant contributions:

  1. New Scaling Law for Adam-style Optimizers: The authors establish a novel scaling law that dictates the relationship between batch size and optimal learning rate in Adam-style optimizers. The law indicates that the optimal learning rate first increases and then decreases as the batch size grows, a behavior termed the "surge" phenomenon. This reveals a non-monotonic trend, contradicting prior approximations that suggested a simple linear or square-root-like scaling.
  2. Theoretical Foundations: Building on empirical models and previous research, the paper derives mathematical expressions that explain this surge. The optimal learning rate is expressed as:

$$\epsilon_{opt}(B) = \frac{\epsilon_{max}}{\frac{1}{2}\left(\sqrt{\frac{\mathcal{B}_{noise}}{B}} + \sqrt{\frac{B}{\mathcal{B}_{noise}}}\right)}$$

where $\mathcal{B}_{noise}$ represents a critical batch size beyond which the optimal learning rate must be scaled back down; a minimal numerical sketch of this curve appears after this list.

  3. Experimental Validation: The authors conduct extensive experiments across various computer vision (CV) and NLP tasks. They validate the theoretical model by showing that the optimal learning rate does indeed follow the predicted trend, contradicting prior models and providing a more accurate depiction of the convergence dynamics.
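To make the shape of this law concrete, the following minimal Python sketch (not taken from the paper) plots $\epsilon_{opt}(B)$ for illustrative placeholder values of $\epsilon_{max}$ and $\mathcal{B}_{noise}$. By the AM-GM inequality the denominator is at least 1, with equality exactly at $B = \mathcal{B}_{noise}$, so the curve peaks at $\epsilon_{max}$ there and declines on either side.

```python
# Minimal sketch (not from the paper): plot the surge-shaped scaling law.
# eps_max and b_noise are illustrative placeholders; the paper estimates
# B_noise from gradient statistics during training.
import numpy as np
import matplotlib.pyplot as plt

def optimal_lr(batch_size, eps_max=1e-3, b_noise=256):
    """eps_opt(B) = eps_max / (0.5 * (sqrt(B_noise/B) + sqrt(B/B_noise))).

    By AM-GM the denominator is >= 1, with equality at B = B_noise,
    so the curve peaks at eps_max there and falls off on either side.
    """
    ratio = batch_size / b_noise
    return eps_max / (0.5 * (np.sqrt(1.0 / ratio) + np.sqrt(ratio)))

batch_sizes = np.logspace(3, 14, num=100, base=2)   # 8 ... 16384
plt.semilogx(batch_sizes, optimal_lr(batch_sizes))
plt.xlabel("batch size B")
plt.ylabel("optimal learning rate")
plt.title("Optimal LR peaks near B_noise, then declines")
plt.show()
```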

Empirical Analysis and Results

The empirical studies were conducted on three typical deep learning tasks: training a CNN on Fashion-MNIST, a ResNet-18 on Tiny-ImageNet, and a Transformer model on the ELI5-Category dataset. Each task confirmed the theoretical predictions (a toy sweep illustrating the procedure follows the list):

  • CNN-FashionMNIST: The experiments demonstrated that the optimal learning rate increases with batch size while batch sizes are small, reaches a peak, and then declines as the batch size continues to grow.
  • ResNet18-TinyImageNet: Measurements taken at different stages of training showed that $\mathcal{B}_{noise}$, and with it the batch size at which the optimal learning rate peaks, increased as training progressed, supporting the dynamic nature of the scaling law.
  • DistilGPT2-ELI5Category: The results showed consistency in both "sign of gradient" and default Adam parameter configurations, further corroborating the universality of the derived scaling law.
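For readers who want to see how such an optimum can be located in practice, the toy sweep below is a hypothetical setup, not the authors' experimental configuration: it optimizes a noisy quadratic with sign-of-gradient updates and grid-searches the learning rate at each batch size. The quadratic objective, `NOISE_SCALE`, and the grid ranges are all illustrative assumptions; a real validation would swap in the actual model and dataset and read off the loss-minimizing learning rate per batch size.

```python
# Hypothetical toy sweep (not the paper's setup): locate the best learning
# rate per batch size for sign-of-gradient updates on a noisy quadratic.
import numpy as np

rng = np.random.default_rng(0)
DIM, STEPS = 50, 200
h = rng.uniform(0.5, 2.0, size=DIM)   # curvatures of the quadratic loss
NOISE_SCALE = 5.0                     # std of per-sample gradient noise

def final_loss(batch_size, lr):
    """Run STEPS sign-SGD updates and return the final loss value."""
    theta = np.ones(DIM)
    for _ in range(STEPS):
        # Averaging a mini-batch shrinks the gradient noise by sqrt(B).
        noise = rng.normal(0.0, NOISE_SCALE / np.sqrt(batch_size), size=DIM)
        grad = h * theta + noise
        theta -= lr * np.sign(grad)   # sign-of-gradient update
    return 0.5 * np.sum(h * theta ** 2)

lrs = np.logspace(-4, -1, 13)
for B in [4, 16, 64, 256, 1024]:
    losses = [np.mean([final_loss(B, lr) for _ in range(3)]) for lr in lrs]
    print(f"batch size {B:5d}: best lr ~ {lrs[np.argmin(losses)]:.2e}")
```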

Implications and Future Directions

The findings have several practical and theoretical implications:

  • Hyperparameter Tuning: This paper provides a more nuanced understanding that can significantly streamline hyperparameter tuning, particularly for batch size and learning rate configurations in Adam-style optimizers.
  • Adaptive Training Schedules: The demonstrated non-monotonic relationship suggests that adapting learning rates and batch sizes during training could optimize the convergence process, potentially reducing training time and improving efficiency (a brief scheduling sketch follows this list).
  • Generalizability to Other Optimizers: Although this paper focuses on Adam-style optimizers, the methods and insights could be extended to analyze and improve other adaptive optimizers.
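As a rough illustration of the scheduling idea, the snippet below is an assumption-laden sketch, not a method proposed in the paper: it recomputes the learning rate from the scaling rule as a stand-in estimate of $\mathcal{B}_{noise}$ grows over training. `estimate_b_noise`, its growth model, and `eps_max` are hypothetical placeholders.

```python
# Sketch of an adaptive schedule built on the scaling rule (hypothetical).
def surge_lr(batch_size, b_noise, eps_max=1e-3):
    ratio = batch_size / b_noise
    return eps_max / (0.5 * ((1.0 / ratio) ** 0.5 + ratio ** 0.5))

def estimate_b_noise(step, b0=128.0, growth=2e-4):
    # Placeholder for a running estimate; B_noise tends to grow as training
    # progresses, which moves the surge peak toward larger batch sizes.
    return b0 * (1.0 + growth * step)

BATCH_SIZE = 512
for step in range(0, 20001, 4000):
    b_noise = estimate_b_noise(step)
    print(f"step {step:6d}: B_noise ~ {b_noise:8.1f}, "
          f"lr = {surge_lr(BATCH_SIZE, b_noise):.2e}")
```

At a fixed batch size, the sketch's learning rate rises while the estimated $\mathcal{B}_{noise}$ is still below the batch size and eases off after it overtakes it, mirroring the moving peak reported in the paper.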

Discussion and Limitations

The paper acknowledges potential limitations, including the influence of additional factors such as weight decay and gradient clipping, which were beyond its scope. Future research may integrate these factors to further refine the scaling laws. Additionally, exploring adaptive scheduling strategies that leverage the identified surge phenomenon could provide practical benefits in real-world training scenarios.

Conclusion

Li et al.'s paper presents a significant advancement in understanding the scaling relationship between batch size and optimal learning rate for Adam-style optimizers. By combining theoretical analysis with empirical validation, the paper provides robust evidence of the non-monotonic scaling law, which challenges previous approximations. This work not only contributes to the theoretical landscape of optimizer scaling laws but also offers practical benefits for the efficient training of deep learning models. Future developments may include adaptive training schedules that dynamically adjust learning rates and batch sizes in response to training progress, leveraging the insights provided by this research.