Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks (1806.06763v3)

Published 18 Jun 2018 in cs.LG and stat.ML

Abstract: Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, despite the nice property of fast convergence, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks. This leaves how to close the generalization gap of adaptive gradient methods an open problem. In this work, we show that adaptive gradient methods such as Adam, Amsgrad, are sometimes "over adapted". We design a new algorithm, called Partially adaptive momentum estimation method, which unifies the Adam/Amsgrad with SGD by introducing a partial adaptive parameter $p$, to achieve the best from both worlds. We also prove the convergence rate of our proposed algorithm to a stationary point in the stochastic nonconvex optimization setting. Experiments on standard benchmarks show that our proposed algorithm can maintain a fast convergence rate as Adam/Amsgrad while generalizing as well as SGD in training deep neural networks. These results would suggest practitioners pick up adaptive gradient methods once again for faster training of deep neural networks.

Insights on Closing the Generalization Gap of Adaptive Gradient Methods

This paper addresses a well-documented challenge in training deep neural networks (DNNs): adaptive gradient methods converge quickly but tend to generalize worse than stochastic gradient descent (SGD) with momentum. The authors propose to bridge this gap by unifying the strengths of Adam and SGD through a new method called Partially Adaptive Momentum Estimation (Padam).

Key Contributions

  1. Introduction of Padam: The methodology central to this paper introduces Padam, an algorithm that combines the rapid convergence of adaptive gradient methods such as Adam and Amsgrad with the strong generalization of SGD with momentum. The novelty lies in a partially adaptive parameter $p$, which controls the degree of adaptiveness applied across the parameter space (a minimal sketch of the resulting update appears after this list).
  2. Addressing the Small Learning Rate Dilemma: Padam confronts the "small learning rate dilemma" of fully adaptive methods, which must use small base learning rates to avoid divergence. By choosing $p$ appropriately, Padam maintains a relatively large effective learning rate without risking gradient explosion, allowing it to explore the parameter space effectively throughout training.
  3. Theoretical Guarantees: The paper lays a rigorous theoretical foundation for Padam in the nonconvex stochastic optimization setting. Notably, the authors derive the convergence rate of Padam to a stationary point, showing that it can improve on the rate of nonconvex SGD under certain conditions, particularly when the cumulative stochastic gradients are sparse.
  4. Experimental Validation: The experimental results presented confirm that Padam not only converges to a solution faster than SGD with momentum but also achieves generalization performance comparable to it. The experiments span well-known datasets such as CIFAR-10, CIFAR-100, ImageNet, and Penn Treebank, using state-of-the-art architectures like VGGNet, ResNet, WideResNet, and LSTM.
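
To make the update concrete, here is a minimal NumPy sketch of a Padam-style step as described above. The function name, the state-dictionary layout, and the hyperparameter defaults (including the value of p) are illustrative assumptions rather than the paper's exact settings; only the overall structure, an Amsgrad-style second-moment estimate raised to the power $p \in (0, 1/2]$, follows the description in the paper.

```python
import numpy as np

def padam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999,
               p=0.125, eps=1e-8):
    """One Padam-style update on a parameter array (illustrative sketch).

    First- and second-moment estimates follow the Adam/Amsgrad pattern,
    but the second moment enters the step as (v_hat ** p) with p in (0, 1/2]:
    p = 0.5 gives an Amsgrad-style step, while p -> 0 approaches SGD with
    momentum (up to scaling). Defaults here are assumptions, not the paper's.
    """
    m = state.get("m", np.zeros_like(theta))
    v = state.get("v", np.zeros_like(theta))
    v_hat = state.get("v_hat", np.zeros_like(theta))

    m = beta1 * m + (1 - beta1) * grad           # momentum (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    v_hat = np.maximum(v_hat, v)                 # Amsgrad-style running max

    theta = theta - lr * m / (v_hat ** p + eps)  # partially adaptive step

    state.update(m=m, v=v, v_hat=v_hat)
    return theta, state
```

Only the exponent on the second-moment term differs from Adam/Amsgrad, which is what allows a larger base learning rate to be used as $p$ shrinks toward the SGD end of the spectrum.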

Numerical Outcomes and Implications

Padam outperforms or matches the test accuracy of SGD in multiple settings. For instance, it achieves a 93.78% accuracy with VGGNet on CIFAR-10, surpassing SGD's 93.71%. On the challenging ImageNet dataset, Padam nearly matches or exceeds the performance of SGD across various networks, achieving the best Top-1 and Top-5 accuracies for certain models.

The results imply that Padam is a viable alternative to SGD with momentum in scenarios demanding fast convergence without compromising generalization. Its tunable adaptiveness provides a more flexible framework for training DNNs, offering practitioners a single tool that combines the benefits of adaptive and non-adaptive training.

Theoretical Implications and Future Directions

The convergence analysis provided not only supports Padam's empirically observed efficiency but also opens avenues for further exploration in stochastic nonconvex optimization. This could spur investigations into the nuances of adaptiveness, potentially leading to more refined algorithms tailored to specific classes of problems or architectures.
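
To orient the reader, the guarantee discussed above can be stated schematically in terms of the standard stationarity measure for nonconvex stochastic optimization. The notation here is generic ($f$ is the training objective, $\theta_t$ the iterates), and constants along with the paper's dimension- and gradient-sparsity-dependent factors are omitted:

$\min_{1 \le t \le T} \mathbb{E}\big[\|\nabla f(\theta_t)\|_2^2\big] = O\big(1/\sqrt{T}\big)$

This matches the generic nonconvex SGD rate, with the paper's analysis indicating improvements when the cumulative stochastic gradients are sparse.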

In conclusion, Padam's introduction and validation through both theoretical and practical lenses provide the machine learning community with an optimizer that mitigates the generalization shortcomings traditionally observed in adaptive gradient methods. As AI systems become increasingly sophisticated, algorithms like Padam can play a crucial role in efficiently harnessing the potential of deep learning models. Future research might extend the partially adaptive idea to other adaptive methods or architectures and explore the trade-offs involved in different application contexts.

Authors (6)
  1. Jinghui Chen (50 papers)
  2. Dongruo Zhou (51 papers)
  3. Yiqi Tang (7 papers)
  4. Ziyan Yang (15 papers)
  5. Yuan Cao (201 papers)
  6. Quanquan Gu (198 papers)
Citations (188)