AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients (2010.07468v5)

Published 15 Oct 2020 in cs.LG, cs.CV, and stat.ML

Abstract: Most popular optimizers for deep learning can be broadly categorized as adaptive methods (e.g. Adam) and accelerated schemes (e.g. stochastic gradient descent (SGD) with momentum). For many models such as convolutional neural networks (CNNs), adaptive methods typically converge faster but generalize worse compared to SGD; for complex settings such as generative adversarial networks (GANs), adaptive methods are typically the default because of their stability. We propose AdaBelief to simultaneously achieve three goals: fast convergence as in adaptive methods, good generalization as in SGD, and training stability. The intuition for AdaBelief is to adapt the stepsize according to the "belief" in the current gradient direction. Viewing the exponential moving average (EMA) of the noisy gradient as the prediction of the gradient at the next time step, if the observed gradient greatly deviates from the prediction, we distrust the current observation and take a small step; if the observed gradient is close to the prediction, we trust it and take a large step. We validate AdaBelief in extensive experiments, showing that it outperforms other methods with fast convergence and high accuracy on image classification and language modeling. Specifically, on ImageNet, AdaBelief achieves comparable accuracy to SGD. Furthermore, in the training of a GAN on Cifar10, AdaBelief demonstrates high stability and improves the quality of generated samples compared to a well-tuned Adam optimizer. Code is available at https://github.com/juntang-zhuang/Adabelief-Optimizer

Overview of the AdaBelief Optimizer

The paper introduces AdaBelief, a novel optimizer designed for training deep neural networks by improving upon existing adaptive optimizers such as Adam. AdaBelief aims to combine the strengths of both adaptive optimizers and stochastic gradient descent (SGD) methods—it seeks to deliver fast convergence and training stability akin to adaptive methods while maintaining the generalization performance typically associated with SGD.

Motivation and Core Idea

Most optimizers for deep learning fall into two camps: adaptive methods, such as Adam, which maintain a separate learning rate for each parameter, and accelerated schemes, such as SGD with momentum, which use a single global learning rate. Adaptive methods usually converge faster but often generalize worse than SGD. AdaBelief adapts its step sizes based on the "belief" in the observed gradient direction.

In AdaBelief, this belief is quantified by viewing the exponential moving average (EMA) of past gradients as a prediction of the next gradient. If the currently observed gradient deviates substantially from this prediction, the belief is low and the optimizer takes a small step; if the observed gradient is close to the prediction, the belief is high and the optimizer takes a large step. This mechanism is intended to address the stability challenges of training complex models such as GANs while remaining competitive across domains.
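Concretely, writing $g_t$ for the stochastic gradient, $m_t$ for its EMA, and $s_t$ for the EMA of the squared deviation, the core AdaBelief update (a sketch that omits the bias-correction factors and optional features such as decoupled weight decay) is

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad s_t = \beta_2 s_{t-1} + (1-\beta_2)\,(g_t - m_t)^2 + \epsilon, \qquad \theta_t = \theta_{t-1} - \alpha\, \frac{m_t}{\sqrt{s_t} + \epsilon}.$$

The only change relative to Adam is in the second accumulator: Adam tracks the EMA of $g_t^2$, whereas AdaBelief tracks the EMA of $(g_t - m_t)^2$, so a gradient that matches its prediction yields a small denominator and hence a large step.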

Methodology

AdaBelief distinguishes itself from other adaptive methods like Adam by the way it treats observed gradients. Adam normalizes the update direction by the EMA of the squared gradients, whereas AdaBelief normalizes by the EMA of the squared deviation of each gradient from its EMA. This subtle change turns the denominator into an estimate of gradient variance, supplying a form of curvature information that Adam's normalization lacks and making the algorithm more robust in "sharp" regions of the loss landscape.
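To make the contrast concrete, here is a minimal PyTorch-style sketch of the update for a single parameter tensor. The function name `adabelief_step`, the `state` dictionary, and the default hyperparameters are illustrative choices, not the interface of the authors' released package (which additionally supports decoupled weight decay and rectification). Replacing the `s` update with an EMA of `grad**2` would recover Adam's denominator.

```python
import torch

def adabelief_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-16):
    """One AdaBelief update for a single tensor (sketch: no weight decay, no rectification)."""
    beta1, beta2 = betas
    state['step'] += 1
    m, s = state['m'], state['s']

    # EMA of gradients: the "prediction" of the next gradient.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)

    # EMA of the squared deviation from that prediction: the "belief" term.
    # (Adam would track the EMA of grad**2 here instead.)
    diff = grad - m
    s.mul_(beta2).addcmul_(diff, diff, value=1 - beta2).add_(eps)

    # Bias correction, exactly as in Adam.
    m_hat = m / (1 - beta1 ** state['step'])
    s_hat = s / (1 - beta2 ** state['step'])

    # Small deviation (high belief) -> small denominator -> large step, and vice versa.
    param.data.addcdiv_(m_hat, s_hat.sqrt() + eps, value=-lr)


# Toy usage: a single step on a quadratic loss.
w = torch.zeros(10, requires_grad=True)
state = {'step': 0, 'm': torch.zeros_like(w), 's': torch.zeros_like(w)}
loss = ((w - 1.0) ** 2).sum()
loss.backward()
adabelief_step(w, w.grad, state)
```

In practice this logic would be wrapped in a `torch.optim.Optimizer` subclass; the authors' repository linked in the abstract provides a full implementation.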

The theoretical analysis in the paper establishes convergence guarantees in both convex and non-convex optimization settings, providing a foundation for the stability observed in the experimental validations.
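In rough terms (stated only up to problem-dependent constants, and assuming the stepsize schedule and boundedness conditions of the paper's theorems), the guarantees take the familiar Adam-type form

$$\frac{R(T)}{T} = O\!\left(\frac{1}{\sqrt{T}}\right) \ \text{(convex, average regret)}, \qquad \min_{t \le T} \mathbb{E}\,\big\|\nabla f(\theta_t)\big\|^2 = O\!\left(\frac{\log T}{\sqrt{T}}\right) \ \text{(non-convex)},$$

i.e. the same rates established for other Adam-type optimizers.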

Experimental Validation

Extensive experiments across tasks such as image classification and language modeling underscore AdaBelief's potential. Notably, on CIFAR image classification with CNN architectures such as ResNet and DenseNet, AdaBelief matches the fast early convergence of Adam and RMSProp while significantly outperforming them in generalization, a gap that practitioners have often bridged by switching from an adaptive optimizer to SGD partway through training.

In the challenging setting of GAN training, where instability and mode collapse plague many existing optimizers, AdaBelief maintains stable training and achieves lower Fréchet Inception Distance (FID) scores than the baselines, indicating that the generated samples capture both quality and diversity.

Implications and Future Work

The evidence supports AdaBelief as a practical tool for training diverse deep learning models, offering stability and robust generalization without extra tuning overhead relative to existing variants such as Adam. It also opens up further exploration of how gradient variance can be exploited when optimizing increasingly complex models and architectures.

As machine learning models grow in depth and breadth, optimizers that combine the speed of adaptive methods with the generalization of SGD become increasingly important, and AdaBelief is a promising stride in that direction. Future work may involve deeper theoretical investigation of the curvature-adaptation mechanism, extensions to other learning paradigms, or hybridization with second-order techniques for very large-scale optimization.

Authors (7)
  1. Juntang Zhuang (24 papers)
  2. Tommy Tang (2 papers)
  3. Yifan Ding (44 papers)
  4. Sekhar Tatikonda (33 papers)
  5. Nicha Dvornek (8 papers)
  6. Xenophon Papademetris (4 papers)
  7. James S. Duncan (67 papers)
Citations (455)