
Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models (2208.06677v5)

Published 13 Aug 2022 in cs.LG and math.OC

Abstract: In deep learning, different kinds of deep networks typically need different optimizers, which have to be chosen after multiple trials, making the training process inefficient. To relieve this issue and consistently improve the model training speed across deep networks, we propose the ADAptive Nesterov momentum algorithm, Adan for short. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra overhead of computing gradient at the extrapolation point. Then, Adan adopts NME to estimate the gradient's first- and second-order moments in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an $\epsilon$-approximate first-order stationary point within $\mathcal{O}(\epsilon^{-3.5})$ stochastic gradient complexity on the non-convex stochastic problems (e.g., deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan consistently surpasses the corresponding SoTA optimizers on vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, e.g., ResNet, ConvNext, ViT, Swin, MAE, DETR, GPT-2, Transformer-XL, and BERT. More surprisingly, Adan can use half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, GPT-2, MAE, etc., and also shows great tolerance to a large range of minibatch size, e.g., from 1k to 32k. Code is released at https://github.com/sail-sg/Adan, and has been used in multiple popular deep learning frameworks or projects.

Authors (5)
  1. Xingyu Xie (13 papers)
  2. Pan Zhou (220 papers)
  3. Huan Li (102 papers)
  4. Zhouchen Lin (158 papers)
  5. Shuicheng Yan (275 papers)
Citations (116)

Summary

  • The paper introduces Adan, a novel optimizer that integrates Nesterov momentum with adaptive gradient methods to improve training efficiency without added overhead.
  • The paper provides a rigorous theoretical analysis showing that Adan reaches an $\epsilon$-approximate first-order stationary point within $\mathcal{O}(\epsilon^{-3.5})$ stochastic gradient complexity, matching the best-known lower bound for nonconvex stochastic problems.
  • Empirical results demonstrate that Adan outperforms traditional methods across domains, setting new state-of-the-art benchmarks in computer vision and NLP models.

Analyzing the Adan Optimization Algorithm

The paper introduces an optimization algorithm named Adan, designed to improve the efficiency of training deep neural networks (DNNs). Adan stands out by integrating Nesterov acceleration into adaptive gradient methods, delivering accelerated convergence without extra gradient computations per step and remaining straightforward to deploy in practice.

Key Contributions

  1. Nesterov Momentum Estimation: Adan reformulates vanilla Nesterov acceleration into a Nesterov momentum estimation (NME) scheme that estimates the first- and second-order moments of the gradient without computing an extra gradient at the extrapolation point. This addresses a common tension in deep learning optimization: gaining convergence speed without paying additional computational cost (a schematic update step is sketched after this list).
  2. Provable Convergence: The authors provide a rigorous theoretical analysis showing that Adan finds an $\epsilon$-approximate first-order stationary point within $\mathcal{O}(\epsilon^{-3.5})$ stochastic gradient evaluations on nonconvex stochastic problems, matching the best-known lower bound and improving on the complexity guarantees of existing adaptive methods such as Adam and AdaBelief.
  3. Robust Performance Across Domains: Empirically, Adan exhibits superior performance across various tasks, including computer vision and natural language processing. It sets new state-of-the-art results on popular architectures such as ResNet, Vision Transformers (ViTs), and BERT, showcasing its versatility and effectiveness.
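
To make the interplay of these pieces concrete, below is a minimal sketch of a single Adan-style update on one parameter tensor, written from the update rule described in the paper. The coefficient defaults and the omission of bias correction and of the optional restart condition are simplifications, so the official implementation at https://github.com/sail-sg/Adan should be treated as authoritative.

```python
import torch

@torch.no_grad()
def adan_step(param, grad, prev_grad, m, v, n,
              lr=1e-3, beta1=0.02, beta2=0.08, beta3=0.01,
              weight_decay=0.02, eps=1e-8):
    """Illustrative single Adan-style step. All arguments are torch.Tensors of
    the same shape and are updated in place. Bias correction and the restart
    condition of the released optimizer are omitted for brevity."""
    diff = grad - prev_grad                              # g_k - g_{k-1}
    m.mul_(1 - beta1).add_(grad, alpha=beta1)            # first moment of gradients
    v.mul_(1 - beta2).add_(diff, alpha=beta2)            # momentum of gradient differences
    corrected = grad + (1 - beta2) * diff                # Nesterov momentum estimation (NME)
    n.mul_(1 - beta3).addcmul_(corrected, corrected, value=beta3)  # second moment of NME
    step = lr / (n.sqrt() + eps)                         # coordinate-wise adaptive step size
    param.sub_(step * (m + (1 - beta2) * v))             # adaptive Nesterov momentum step
    param.div_(1 + lr * weight_decay)                    # proximal (decoupled) weight decay
```

The key point of NME is visible in the `corrected` line: the extrapolation acts on the gradient difference, so no second forward/backward pass at an extrapolated point is needed. The coefficients above follow the paper's convention of small weights on the new terms; the released optimizer parameterizes them differently (as a betas tuple), so consult the repository README for its defaults.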

Numerical Results and Claims

  • Convergence Speed: The improvement in stochastic gradient complexity is established theoretically and borne out empirically: training converges markedly faster for models such as ResNets and ViTs, and the speedup holds across a wide range of minibatch sizes (e.g., from 1k to 32k).
  • Performance and Generalization: Adan's decoupled, proximal-style weight decay (written out below) contributes to strong generalization, allowing it to maintain or improve accuracy even when the number of training epochs is roughly halved or the batch size is scaled up.
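
For intuition, the weight decay referenced above is applied as a proximal, AdamW-like decoupled step after the adaptive momentum update, rather than by adding $\lambda \theta_k$ to the gradient. In the notation of the sketch earlier (a hedged restatement, not a quotation from the paper):

$$\theta_{k+1} = \frac{1}{1 + \lambda \eta}\Big(\theta_k - \eta_k \circ \big(m_k + (1-\beta_2)\, v_k\big)\Big), \qquad \eta_k = \frac{\eta}{\sqrt{n_k} + \epsilon},$$

where $m_k$, $v_k$, and $n_k$ are the first moment, gradient-difference momentum, and NME second moment, $\eta$ is the base learning rate, and $\lambda$ is the weight-decay coefficient. The multiplicative shrinkage $(1+\lambda\eta)^{-1}$ is what decouples the decay from the adaptive gradient term.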

Practical and Theoretical Implications

  • Wide Applicability: The ability to set new benchmarks in multiple domains indicates Adan’s potential as a universal optimizer that simplifies the choice of optimization algorithms in various application scenarios.
  • Theoretical Insights: By achieving convergence rates that match known lower bounds, Adan may inspire further research into first-order methods that combine momentum and adaptive step sizes effectively for large-scale DNNs.

Future Directions

  • Expanding the scope of Adan to explore its applicability and performance in more specialized tasks and architectures, such as graph neural networks or generative models, could be promising.
  • Developing adaptive versions that further minimize manual hyperparameter tuning could enhance usability, especially in environments with limited computational resources.

In summary, Adan represents a significant step forward in optimizer design, blending theoretical robustness with practical effectiveness to enhance DNN training across a spectrum of modern architectures and tasks.
