The Marginal Value of Adaptive Gradient Methods in Machine Learning (1705.08292v2)

Published 23 May 2017 in stat.ML and cs.LG

Abstract: Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.

The Marginal Value of Adaptive Gradient Methods in Machine Learning

Introduction

The paper "The Marginal Value of Adaptive Gradient Methods in Machine Learning" conducted by Ashia C. Wilson et al. investigates the efficacy of adaptive gradient methods for training deep neural networks. Examples of these adaptive methods include AdaGrad, RMSProp, and Adam. The paper juxtaposes these against traditional gradient descent (GD) and stochastic gradient descent (SGD) techniques. Through theoretical constructs and empirical evaluations, the researchers underscore that adaptive gradient methods often lead to solutions with inferior generalization capabilities compared to their non-adaptive counterparts, despite potentially better training performance.

Background

Optimization algorithms like SGD and its momentum variants are pivotal in minimizing risk in machine learning tasks. Typically, these methods follow the general update rule
$$w_{k+1} = w_k - \alpha_k\, \tilde\nabla f(w_k),$$
where $\tilde\nabla f(w_k)$ represents the gradient computed on a batch of data. Stochastic momentum methods, such as Polyak's heavy-ball method and Nesterov's Accelerated Gradient method, seek to expedite convergence by incorporating momentum terms. Notably, these non-adaptive methods optimize using the inherent $\ell_2$ geometry of the parameter space.
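
As a point of reference, the update above (with Polyak's heavy-ball momentum term included) can be written in a few lines of NumPy. This is a minimal sketch rather than code from the paper; `lr` and `momentum` stand in for $\alpha_k$ and the momentum coefficient, and `grad` is the minibatch gradient $\tilde\nabla f(w_k)$.

```python
import numpy as np

def heavy_ball_step(w, w_prev, grad, lr=0.1, momentum=0.9):
    """One heavy-ball (Polyak momentum) step:
    w_{k+1} = w_k - lr * grad + momentum * (w_k - w_{k-1}).
    Setting momentum = 0 recovers plain SGD."""
    w_next = w - lr * grad + momentum * (w - w_prev)
    return w_next, w  # new iterate, plus the old one to pass in as w_prev next time

# Example: one step on f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
w, w_prev = heavy_ball_step(w, w_prev=w, grad=w)
```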

Adaptive gradient methods, however, adjust local learning rates based on the accumulated history of gradients. These methods (e.g., AdaGrad, RMSProp, Adam) generally exhibit the form
$$w_{k+1} = w_k - \alpha_k \mathrm{H}_k^{-1}\,\tilde\nabla f\!\left(w_k + \gamma_k(w_k - w_{k-1})\right) + \beta_k \mathrm{H}_k^{-1} \mathrm{H}_{k-1} (w_k - w_{k-1}),$$
where $\mathrm{H}_k$ is a positive definite matrix, often defined using the past gradients. This approach aims to tailor the algorithm to the geometry of the data.
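
For intuition, here is a minimal sketch of the RMSProp special case of this template ($\gamma_k = \beta_k = 0$, with $\mathrm{H}_k$ a diagonal matrix built from an exponential moving average of squared gradients). The function name and default hyperparameters are illustrative and not taken from the paper.

```python
import numpy as np

def rmsprop_step(w, grad, sq_avg, lr=1e-3, decay=0.9, eps=1e-8):
    """RMSProp-style update: the preconditioner H_k is the diagonal matrix
    diag(sqrt(sq_avg) + eps), applied elementwise to the gradient."""
    sq_avg = decay * sq_avg + (1.0 - decay) * grad**2    # running average of squared gradients
    w_next = w - lr * grad / (np.sqrt(sq_avg) + eps)     # precondition by H_k^{-1}
    return w_next, sq_avg

# Example: one step on f(w) = 0.5 * ||w||^2 (gradient is w), starting from zero statistics.
w = np.array([1.0, -2.0])
w, sq_avg = rmsprop_step(w, grad=w, sq_avg=np.zeros_like(w))
```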

Theoretical Analysis

Non-Adaptive Methods

Non-adaptive methods like SGD and its variants consistently converge to minimum-norm solutions of least-squares classification problems. Their iterates (when initialized appropriately, e.g., at zero) remain in the row span of the data matrix $X$, so they reach the minimum $\ell_2$-norm solution, which is also the maximum-margin solution among all solutions to $Xw = y$.
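
In formula form, assuming $XX^{\top}$ is invertible, this minimum-norm solution can be written as

```latex
w^{\mathrm{SGD}} = X^{\top}\left(XX^{\top}\right)^{-1} y
                 = \operatorname*{arg\,min}_{w} \|w\|_2
                   \quad \text{subject to} \quad Xw = y .
```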

Adaptive Methods

Conversely, the paper illustrates that adaptive methods can converge to suboptimal solutions under specific circumstances. Notably, when a binary least-squares classification problem has sparse features and is overparameterized, adaptive methods may yield solutions characterized by low $\ell_\infty$ norm rather than low $\ell_2$ norm. The researchers present a lemma demonstrating that under certain conditions, adaptive methods such as AdaGrad, RMSProp, and Adam tend toward solutions that assign undue importance to spurious, non-generalizing features.
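
Paraphrasing the paper's lemma informally: if there exists a scalar $c$ such that $X\,\mathrm{sign}(X^{\top}y) = c\,y$, then AdaGrad, RMSProp, and Adam initialized at $w_0 = 0$ all converge to a solution of the form

```latex
w^{\mathrm{ada}} \propto \mathrm{sign}\!\left(X^{\top} y\right),
```

whose entries all share the same magnitude, so rare or spurious features receive as much weight as informative ones; on the paper's constructed distribution this drives the test error toward one half.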

Empirical Evaluation

The empirical results provide concrete evidence that non-adaptive methods generalize better than adaptive methods across multiple deep learning tasks:

  1. Convolutional Neural Network on CIFAR-10
    • Despite AdaGrad and similar methods initially reducing training error faster, SGD and heavy-ball methods eventually achieved lower test error than the adaptive methods. The best test error attained by SGD was 7.65%, compared to RMSProp's 9.60%.
  2. Character-Level Language Modeling
    • On the War and Peace dataset, SGD achieved a test loss of 1.212, outperforming all adaptive methods. RMSProp performed relatively well but showed greater sensitivity to initialization.
  3. Constituency Parsing
    • For both the discriminative and generative parsing tasks, SGD and the heavy-ball method (HB) consistently outperformed adaptive methods. Specifically, SGD achieved the lowest perplexities, translating to better generalization capabilities.

Implications and Future Directions

This research holds significant implications for the training of neural networks. Adaptive gradient methods, despite their popularity and initial rapid training progress, may not be the optimal choice for tasks requiring robust generalization. Practitioners are encouraged to reconsider the default use of Adam or similar algorithms in favor of well-tuned SGD variants. Additionally, the paper speculates that certain domains like GANs and Q-learning might exhibit unique dynamics favoring adaptive methods. Further investigations are warranted to clarify whether these specific advantages are inherent to the tasks or merely artifacts of suboptimal tuning of non-adaptive methods.
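
As a practical illustration of what "well-tuned SGD" might look like, the sketch below uses PyTorch's SGD with heavy-ball momentum and a step-decay learning-rate schedule. The model and all hyperparameter values are placeholders; the paper itself tunes the initial learning rate and the decay scheme on held-out data rather than using fixed defaults.

```python
import torch

model = torch.nn.Linear(1024, 10)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.5)

for epoch in range(100):
    # ... minibatch training loop: optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()  # halve the learning rate every 25 epochs
```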

Conclusion

The paper by Wilson et al. rigorously challenges the efficacy of adaptive gradient methods in machine learning. Through both theoretical constructs and empirical evidence, the researchers demonstrate that non-adaptive methods, despite requiring more meticulous tuning, tend to provide superior generalization performance. This work prompts a reevaluation of optimizer choices in deep learning practice, underscoring the nuanced dynamics underlying optimization and generalization.

Overall, while adaptive methods offer convenient and rapid initial training, their marginal value, particularly in terms of generalization, may necessitate a shift towards more traditional, albeit carefully tuned, non-adaptive methods.

Authors (5)
  1. Ashia C. Wilson (16 papers)
  2. Rebecca Roelofs (19 papers)
  3. Mitchell Stern (18 papers)
  4. Nathan Srebro (145 papers)
  5. Benjamin Recht (105 papers)
Citations (990)