Overview of Quasi-Hyperbolic Momentum and Adam for Deep Learning
In optimization for deep learning, momentum-based variants of stochastic gradient descent (SGD) are ubiquitous. The paper "Quasi-hyperbolic momentum and Adam for deep learning" introduces a simple modification of this framework, the quasi-hyperbolic momentum (QHM) algorithm, in which a plain SGD step is averaged with a momentum step. With this single change, the authors unify a range of existing optimization algorithms within one framework and extend the idea to the Adam optimizer, yielding a variant termed QHAdam.
Algorithmic Foundation
The QHM update can be motivated as a form of gradient variance reduction: it interpolates between the immediate stochastic gradient and an exponentially weighted moving average of past gradients (the momentum buffer). Concretely, the parameter update is a weighted average of the momentum step and the plain SGD step, controlled by an immediate discount factor nu alongside the usual momentum coefficient beta, as sketched below. The paper positions QHM in relation to several existing optimization algorithms, including Nesterov's accelerated gradient (NAG), PID controllers, and noise-robust momentum methods, highlighting its versatility.
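To make this concrete, the following is a minimal NumPy sketch of a single QHM step, following the update rule described in the paper; the function name qhm_step, its argument layout, and the learning rate are illustrative choices, while nu = 0.7 and beta = 0.999 reflect the rule of thumb the paper suggests.

    import numpy as np

    def qhm_step(theta, g_buf, grad, lr=0.1, beta=0.999, nu=0.7):
        """One QHM update: a plain SGD step averaged with a momentum step.

        theta : current parameters (NumPy array)
        g_buf : exponential moving average of past gradients (momentum buffer)
        grad  : stochastic gradient evaluated at theta
        nu    : immediate discount factor (nu=0 is plain SGD, nu=1 is momentum SGD)
        """
        g_buf = beta * g_buf + (1.0 - beta) * grad      # update momentum buffer
        update = (1.0 - nu) * grad + nu * g_buf         # interpolate SGD and momentum steps
        return theta - lr * update, g_buf

    # Toy usage: minimize L(theta) = 0.5 * ||theta||^2, whose gradient is theta.
    theta = np.array([1.0, -2.0])
    g_buf = np.zeros_like(theta)
    for _ in range(1000):
        theta, g_buf = qhm_step(theta, g_buf, grad=theta)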
QHAdam extends this idea to adaptive learning rate methods, in particular the Adam optimizer. Separate discount factors weight the bias-corrected first and second moment estimators against the immediate gradient and its square, which the authors argue can improve training stability and convergence.
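In the same spirit, here is a minimal NumPy sketch of one QHAdam step following the update rule in the paper; the function name, argument layout, and default hyperparameter values are illustrative rather than the paper's tuned settings, which vary by task.

    import numpy as np

    def qhadam_step(theta, g_buf, s_buf, grad, t, lr=1e-3,
                    beta1=0.995, beta2=0.999, nu1=0.7, nu2=1.0, eps=1e-8):
        """One QHAdam update: quasi-hyperbolic weighting of Adam's moment estimates.

        t is the 1-based step count, used for bias correction.
        """
        g_buf = beta1 * g_buf + (1.0 - beta1) * grad      # first moment (EMA of gradients)
        s_buf = beta2 * s_buf + (1.0 - beta2) * grad**2   # second moment (EMA of squared gradients)
        g_hat = g_buf / (1.0 - beta1**t)                  # bias-corrected first moment
        s_hat = s_buf / (1.0 - beta2**t)                  # bias-corrected second moment
        num = (1.0 - nu1) * grad + nu1 * g_hat            # interpolate gradient with first moment
        den = np.sqrt((1.0 - nu2) * grad**2 + nu2 * s_hat) + eps
        return theta - lr * num / den, g_buf, s_buf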
Practical Implications
Empirically, the paper demonstrates that QHM and QHAdam can outperform their classical counterparts across a range of deep learning tasks. A notable result is a state-of-the-art BLEU score of 29.45 on the WMT16 EN-DE translation task using QHAdam, underscoring the practical applicability of the proposed algorithms.
Theoretical Insights and Future Directions
From a theoretical standpoint, the paper shows that QHM subsumes a broad family of two-state optimization algorithms as special cases. This unification offers a streamlined path for analyzing and designing new optimizers, and convergence results transferred through connections with algorithms such as Triple Momentum give QHM a solid theoretical grounding.
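As a concrete illustration of this unification, a few QHM hyperparameter settings recover familiar special cases (as noted in the paper; the NAG correspondence holds in the deterministic setting):

    - nu = 0: plain SGD (the momentum buffer has no effect)
    - nu = 1: momentum SGD with exponentially weighted (dampened) momentum
    - nu = beta: Nesterov's accelerated gradient (NAG)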
For future research, the paper hints at several directions: adaptive strategies for setting the QHM and QHAdam hyperparameters, convergence analysis in more general stochastic settings, and extensions to distributed or asynchronous training, where QHM's simple two-state structure could be leveraged at scale.
Conclusion
The paper's contribution lies in a minimal but effective redesign of momentum SGD and Adam, yielding a simple and expressive framework for deep learning optimization. The authors present a cohesive narrative that bridges many existing algorithms, balancing practicality with theoretical rigor, and the work is likely to influence subsequent research and practice in optimization for deep learning.