Overview of the Paper on Momentum and Stochastic Momentum in Optimization
This paper, by Nicolas Loizou and Peter Richtárik, advances the understanding of momentum methods in stochastic optimization. Its primary focus is the incorporation of heavy ball momentum into several stochastic algorithms: stochastic gradient descent (SGD), stochastic Newton (SN), stochastic proximal point (SPP), and stochastic dual subspace ascent (SDSA). The analysis is carried out in a unified setting, quadratic optimization problems, in which these methods are all equivalent, so momentum results proved once apply to each of them and their properties can be compared directly.
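To fix notation, the heavy ball update attached to these methods has the classical Polyak form; written against the stochastic objective f_{S_k} sampled at iteration k (notation approximate, following the standard presentation of the method), it reads:

```latex
% Heavy ball (momentum) step: omega is the step size, beta the momentum
% parameter, and f_{S_k} is the stochastic objective drawn at iteration k.
x_{k+1} = x_k - \omega \nabla f_{S_k}(x_k) + \beta \, (x_k - x_{k-1})
```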
Main Contributions
- Introduction of Momentum Variants: The paper introduces momentum and stochastic momentum variants for a range of stochastic optimization methods. These variants augment the underlying algorithms with Polyak's heavy ball momentum term, a device originally introduced to accelerate gradient-based methods, with the goal of improving their convergence rates (a minimal sketch of both variants is given after this list).
- Theoretical Analysis of Convergence:
- Linear Convergence with Momentum: The authors prove that the momentum-enhanced stochastic methods enjoy global, non-asymptotic linear convergence rates. This is notable for the stochastic heavy ball method in particular, as it constitutes the first rigorous proof of a linear rate for that method, filling a gap in the existing literature.
- Accelerated Linear Convergence: Beyond plain linear convergence, the paper shows that, for suitable parameter choices, an accelerated linear rate holds for the norm of the expected iterates: the rate depends on the square root of the condition number of the problem rather than on the condition number itself. This matches the improvement classically promised by momentum, but had not previously been established in this stochastic setting.
- Sublinear Convergence for Cesàro Averages: Under weaker assumptions, the paper proves a sublinear O(1/k) convergence rate for the expected function values at the Cesàro averages (running averages) of the iterates, showing that the analysis remains informative even when the stronger assumptions fail.
- Stochastic Momentum: A novel concept introduced in the paper is stochastic momentum, which replaces the exact heavy ball term with a cheap stochastic estimate of it, reducing the cost of each iteration. This variant is shown to offer a computational advantage when the data are sparse and the momentum parameter is chosen appropriately (see the sketch after this list for an illustrative version).
- Primal-Dual Correspondence: The paper establishes a direct correspondence between the primal methods and the dual method (SDSA): adding momentum on the dual side yields, through this correspondence, the momentum versions of the primal methods, so convergence results transfer between the two.
- Numerical Validation: Extensive experiments on both synthetic and real-world datasets complement the theory, showing practical improvements in convergence speed that are consistent with the theoretical claims.
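To make the momentum and stochastic momentum bullets above concrete, here is a minimal Python sketch for a consistent linear system Ax = b, using a randomized Kaczmarz-style step as the stochastic gradient. The row sampling, parameter values, and the single-coordinate momentum estimate are illustrative assumptions for exposition, not the paper's exact formulation or tuning.

```python
import numpy as np

def momentum_sgd_linear_system(A, b, omega=1.0, beta=0.5, iters=1000,
                               stochastic_momentum=False, seed=0):
    """Heavy ball SGD sketch for a consistent linear system Ax = b.

    Each iteration takes a randomized Kaczmarz-style step from one sampled
    row, then adds either the full momentum term beta * (x_k - x_{k-1}) or,
    if stochastic_momentum=True, a cheap estimate built from a single random
    coordinate of that difference (an illustrative variant).
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x_prev = np.zeros(n)
    x = np.zeros(n)
    for _ in range(iters):
        i = rng.integers(m)                              # sample one equation
        a_i = A[i]
        grad = ((a_i @ x - b[i]) / (a_i @ a_i)) * a_i    # Kaczmarz-type step direction
        diff = x - x_prev
        if stochastic_momentum:
            j = rng.integers(n)                          # sample one coordinate
            momentum = np.zeros(n)
            momentum[j] = n * diff[j]                    # rescaled so E[momentum] = diff
        else:
            momentum = diff
        x_prev, x = x, x - omega * grad + beta * momentum
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((200, 50))
    x_true = rng.standard_normal(50)
    b = A @ x_true                                       # consistent by construction
    x_hat = momentum_sgd_linear_system(A, b, omega=1.0, beta=0.4, iters=5000)
    print("distance to solution:", np.linalg.norm(x_hat - x_true))
```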
Implications and Future Work
The contributions of this paper have significant implications for stochastic optimization, particularly in machine learning and large-scale data settings. The work opens several directions for further exploration:
- Generalizations to Non-Quadratic Settings: While the current paper focuses on quadratic optimization problems, future work could investigate extensions to more general convex or even non-convex settings.
- Applications to Deep Learning: Given the pivotal role of stochastic gradient descent in training deep networks, integrating these momentum techniques could lead to more efficient training regimes.
- Further Exploration of Stochastic Momentum: Further empirical and theoretical study of stochastic momentum, especially in distributed and parallel computing settings, could unlock additional efficiencies.
In conclusion, this paper provides foundational insights and innovations in the use of momentum-based methods for stochastic optimization, presenting new opportunities for both theoretical exploration and practical implementation.