- The paper introduces SWATS, a hybrid strategy that uses Adam for rapid convergence and strategically switches to SGD for better generalization.
- It employs a triggering condition based on the projection of Adam's step on the gradient direction to automate the transition without extra tuning.
- Experiments on image classification and language modeling tasks show SWATS effectively reduces the generalization gap between adaptive and non-adaptive methods.
Improving Generalization Performance by Switching from Adam to SGD
In the paper "Improving Generalization Performance by Switching from Adam to SGD," Keskar and Socher propose an optimization strategy that combines the strengths of adaptive methods, specifically Adam, with those of non-adaptive methods such as Stochastic Gradient Descent (SGD), in order to improve the generalization of machine learning models.
Background
Adaptive methods such as Adam, Adagrad, and RMSprop have gained popularity due to their efficiency in the early phases of training, especially on ill-scaled problems. These methods maintain per-parameter learning rates that are adapted throughout training. Nevertheless, they have been observed to underperform in the later stages of training, particularly when measured by how well the resulting models generalize to unseen data. SGD, although simpler and often slower to make early progress, frequently exhibits superior generalization on the same tasks, and this gap motivates the hybrid strategy below.
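To make the contrast concrete, the sketch below writes out the two update rules in PyTorch-style tensor code. This is a generic textbook formulation, not code from the paper; the function names and default hyperparameters are illustrative.

```python
import torch

def sgd_update(param: torch.Tensor, grad: torch.Tensor, lr: float = 0.1) -> None:
    # SGD scales every coordinate of the gradient by the same learning rate.
    param -= lr * grad

def adam_update(param: torch.Tensor, grad: torch.Tensor,
                m: torch.Tensor, v: torch.Tensor, step: int,
                lr: float = 1e-3, beta1: float = 0.9,
                beta2: float = 0.999, eps: float = 1e-8) -> None:
    # Adam maintains running estimates of the first moment (m) and second
    # moment (v) of the gradient, then takes a step whose effective size
    # lr / (sqrt(v_hat) + eps) differs per parameter.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    m_hat = m / (1 - beta1 ** step)      # bias-corrected first moment
    v_hat = v / (1 - beta2 ** step)      # bias-corrected second moment
    param -= lr * m_hat / (v_hat.sqrt() + eps)
```

The per-coordinate denominator sqrt(v_hat) + eps is what makes Adam well suited to ill-scaled problems early in training, while SGD's single global step size is the behavior SWATS wants to recover later on.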
The SWATS Strategy
The authors introduce a hybrid approach called SWATS (Switching from Adam to SGD) that exploits the rapid convergence of adaptive methods while retaining the robust generalization behavior of SGD. Training starts with Adam and switches to SGD once a triggering condition is met, with minimal changes to the overall hyperparameter configuration.
The triggering condition monitors the alignment of the Adam step with the gradient direction. At each iteration, the projection of Adam's step on the gradient direction yields an estimate of an equivalent SGD learning rate, and a bias-corrected exponential average of these estimates is tracked. When this running estimate stabilizes, the switch to SGD is executed and the stabilized value is used as the SGD learning rate, so no additional hyperparameter tuning is required.
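As a rough illustration of this mechanism, the sketch below estimates an equivalent SGD learning rate from the Adam step and the gradient, averages it, and triggers the switch once the average stabilizes. This is one reading of the criterion described above, not the authors' implementation; the helper name swats_switch_check, the tolerance tol, and the default hyperparameters are assumptions made for the example.

```python
import torch

def swats_switch_check(grads, exp_avgs, exp_avg_sqs, step, lambda_ema,
                       lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-9, tol=1e-9):
    """Illustrative SWATS-style trigger: returns (switch, sgd_lr, lambda_ema)."""
    bc1 = 1 - beta1 ** step
    bc2 = 1 - beta2 ** step

    # Bias-corrected Adam step p_k and gradient g_k, flattened over all parameters.
    p_k = torch.cat([(-lr * (m / bc1) / ((v / bc2).sqrt() + eps)).flatten()
                     for m, v in zip(exp_avgs, exp_avg_sqs)])
    g_k = torch.cat([g.flatten() for g in grads])

    pg = torch.dot(p_k, g_k)
    if pg == 0:
        return False, None, lambda_ema

    # gamma_k: the SGD learning rate suggested by the projection relationship
    # between the Adam step and the gradient direction.
    gamma_k = (torch.dot(p_k, p_k) / -pg).item()

    # Exponential average of the estimates, with bias correction.
    lambda_ema = beta2 * lambda_ema + (1 - beta2) * gamma_k
    lambda_hat = lambda_ema / bc2

    # Trigger the switch once the running estimate has stabilized; lambda_hat
    # then becomes the SGD learning rate for the rest of training.
    if step > 1 and abs(lambda_hat - gamma_k) < tol:
        return True, lambda_hat, lambda_ema
    return False, None, lambda_ema
```

In a training loop, such a check would run after each Adam update, carrying lambda_ema across iterations, and the returned learning rate would be handed to plain SGD (with momentum, if desired) for the remainder of training.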
Results
Experiments were conducted on diverse tasks, including image classification on CIFAR-10, CIFAR-100, and Tiny-ImageNet, and language modeling on the Penn Treebank and WikiText-2 datasets. The architectures explored included ResNet, SENet, DenseNet, PyramidNet, AWD-LSTM, and AWD-QRNN. The results showed that SWATS can reduce the generalization gap between Adam and SGD without complex tuning strategies or additional hyperparameters.
Specifically, on the CIFAR datasets, SWATS harnesses Adam's rapid early progress and then transitions to SGD, which results in superior generalization. Even on the language modeling tasks, where Adam generally excels, SWATS achieves comparable performance, suggesting that it can maintain strong generalization while still benefiting from fast convergence in the early stages of training.
Implications and Future Directions
The paper's exploration of hybrid optimization techniques highlights a central challenge in machine learning: the trade-off between training efficiency and generalization. By offering a method that negotiates this transition automatically, Keskar and Socher point the way toward optimization strategies that are both effective and inexpensive to tune.
The implications of this work extend to domains where data scales vary and rapid model iteration is a prerequisite. For future research, the authors suggest investigating a smoother transition between Adam and SGD, potentially through a weighted blending of their respective step directions; such a refinement might mitigate the short-term dip in performance occasionally observed at the switch. They also note that interleaving multiple switches back and forth between the two methods warrants exploration.
Ultimately, SWATS represents a pragmatic step toward strong model generalization in deep learning. It underscores the value of optimization methods that balance rapid convergence with robust test accuracy across a range of challenging tasks.