
SGD-M: Polyak Momentum in Stochastic Optimization

Updated 9 November 2025
  • SGD-M is a stochastic optimization algorithm that integrates Polyak momentum to accelerate convergence and mitigate noise-induced variance.
  • It leverages proximal gradient methods to handle composite objectives, achieving an O(1/√K) convergence rate even with small batch sizes.
  • Empirical and theoretical analyses demonstrate that SGD-M outperforms vanilla SGD in stability and optimization efficiency for nonconvex problems.

Stochastic Gradient Descent with Polyak Momentum (SGD-M), also referred to as the Stochastic Heavy-Ball method (SHB), is a foundational algorithm in large-scale optimization and machine learning. It incorporates the classical momentum term introduced by Boris Polyak to accelerate convergence, reduce the variance of stochastic gradient estimates, and enhance stability in stochastic environments. The following exposition details the key algorithmic properties, convergence theory, variance dynamics, inexactness tolerance, and applications, as established in current research literature (Gao et al., 5 Mar 2024).

1. Problem Formulation and Stochastic Oracle

SGD-M is applied to composite optimization problems of the form

$$\min_{x \in \mathbb{R}^d}\; F(x) = f(x) + \psi(x),$$

where

  • $f: \mathbb{R}^d \to \mathbb{R}$ is differentiable and $L$-smooth (i.e., $\|\nabla f(x) - \nabla f(y)\| \leq L \|x-y\|$),
  • $\psi$ is a "simple" convex function or indicator, admitting an efficient proximal mapping.

The algorithm accesses $f$ through a stochastic first-order oracle
$$g(x;\xi) \quad \text{with} \quad \mathbb{E}[g(x;\xi)] = \nabla f(x), \qquad \mathbb{E}[\|g(x;\xi) - \nabla f(x)\|^2] \leq \sigma^2,$$
which models potentially significant stochastic noise due to small batch sizes, as in modern deep learning practice.
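As a concrete instance of this setup, the sketch below builds a minibatch oracle for a least-squares loss and pairs it with an $\ell_1$ regularizer whose prox is soft-thresholding; the function names, the choice of $f$ and $\psi$, and the regularization weight are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def make_oracle(A, b, batch_size, rng):
    """Minibatch oracle for f(x) = (1/2n) ||A x - b||^2: returns g(x; xi)
    with E[g(x; xi)] = grad f(x) and variance bounded by some sigma^2."""
    n = A.shape[0]

    def oracle(x):
        idx = rng.choice(n, size=batch_size, replace=False)    # draw xi_k
        A_b, b_b = A[idx], b[idx]
        return A_b.T @ (A_b @ x - b_b) / batch_size             # unbiased gradient estimate

    return oracle

def prox_l1(v, t, lam=0.1):
    """prox_{t * psi}(v) for psi(x) = lam * ||x||_1, i.e. soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)
```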

2. SGD-M Algorithm and Proximal Dynamics

SGD-M maintains

  • a sequence of iterates $\{x_k\}$, and
  • a momentum buffer $\{m_k\}$.

At each iteration $k$:

  1. Sample $\xi_k$ and form the stochastic gradient $g_k = g(x_k; \xi_k)$.
  2. Update momentum:

$$m_k = (1-\gamma)\, m_{k-1} + \gamma\, g_k,$$

with momentum coefficient $\gamma \in (0,1)$ (Polyak convention: moving-average style).

  3. Apply the proximal gradient step:

$$x_{k+1} = \mathrm{prox}_{\psi/M}\!\left(x_k - \tfrac{1}{M}\, m_k\right),$$

where $M > 0$ is the step-size parameter. Explicitly, the prox-subproblem is:

$$x_{k+1} = \arg\min_{x} \left\{ \langle m_k, x - x_k \rangle + \psi(x) + \frac{M}{2} \|x - x_k\|^2 \right\}.$$

This formulation generalizes vanilla SGD ($\gamma = 1$, $\psi \equiv 0$) and Prox-SGD ($\gamma = 1$) to nonconvex settings, managing bias and variance via Polyak momentum even when the batch size is limited.
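A minimal sketch of this update loop, assuming an `oracle` and a `prox` callable like the ones sketched in Section 1; the caller supplies `M` and `gamma`, and nothing here is tuned.

```python
import numpy as np

def sgd_m(x0, oracle, prox, M, gamma, num_iters):
    """Prox-SGD with Polyak momentum (SGD-M): moving-average momentum buffer
    followed by a proximal gradient step with step-size parameter M."""
    x = np.array(x0, dtype=float)
    m = oracle(x)                                  # initialize buffer with one stochastic gradient
    for _ in range(num_iters):
        g = oracle(x)                              # g_k = g(x_k; xi_k)
        m = (1.0 - gamma) * m + gamma * g          # m_k = (1 - gamma) m_{k-1} + gamma g_k
        x = prox(x - m / M, 1.0 / M)               # x_{k+1} = prox_{psi/M}(x_k - m_k / M)
    return x
```

Setting `gamma = 1.0` recovers vanilla Prox-SGD (and plain SGD when the prox is the identity), which is the sense in which the scheme above generalizes both.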

3. Variance Analysis and Lyapunov-Based Convergence

The convergence analysis introduces pivotal sequences:

  • $F_k := \mathbb{E}[F(x_k) - F^*]$ (expected suboptimality),
  • $\Delta_k := \mathbb{E}[\|m_k - \nabla f(x_k)\|^2]$ (momentum buffer bias),
  • $R_k := \mathbb{E}[\|x_{k+1} - x_k\|^2]$ (proximal step size squared).

The descent lemmas governing algorithmic progress are:

  • Variance descent for buffer:

$$\Delta_{k+1} \leq (1-\gamma)\, \Delta_k + \frac{L^2}{\gamma} R_k + \gamma^2 \sigma^2,$$

capturing the buffering of stochastic noise.

  • Objective descent:

$$F_{k+1} \leq F_k - \frac{M - L}{4} R_k + \frac{\Delta_k}{M - L},$$

quantifying the trade-off between bias and objective decay.

  • Gradient mapping:

$$R_k \geq \frac{\mathbb{E}[\|\nabla F(x_{k+1})\|^2]}{3(M^2 + L^2)} - \frac{\Delta_k}{M^2 + L^2}.$$

A Lyapunov function $\Phi_k = F_k + a \Delta_k$ (with $a = \Theta(1/L)$) is constructed to admit a contraction:
$$\Phi_{k+1} \leq \Phi_k - \frac{1}{48 M} \mathbb{E}[\|\nabla F(x_{k+1})\|^2] + \frac{27 L}{4 M^2} \sigma^2.$$
Selecting $M = 4L + 3\sqrt{2K L \sigma^2 / \Phi_0}$ and $\gamma = 3L/(M-L)$ balances the bias and variance terms.
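The analysis-driven choices of $M$ and $\gamma$ can be packaged in a small helper, sketched below; in practice $L$, $\sigma^2$, and $\Phi_0$ are rarely known exactly, so these values are typically starting points for tuning rather than final settings.

```python
import math

def theory_params(L, sigma2, Phi0, K):
    """Return (M, gamma) from the bias-variance balancing choice
    M = 4L + 3 sqrt(2 K L sigma^2 / Phi_0), gamma = 3L / (M - L)."""
    M = 4.0 * L + 3.0 * math.sqrt(2.0 * K * L * sigma2 / Phi0)
    gamma = 3.0 * L / (M - L)          # equals 1 in the noiseless limit sigma2 = 0
    return M, gamma
```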

The main theorem asserts that, for $K = O\!\left(L \Phi_0 \sigma^2 / \varepsilon^2 + L \Phi_0 / \varepsilon\right)$,

$$\mathbb{E}\left[ \|\nabla F(x_\kappa)\|^2 \right] \leq \varepsilon, \quad \text{where } \kappa \sim \mathrm{Uniform}\{1, \dots, K\},$$

matching the optimal $O(1/\sqrt{K})$ rate, independent of batch size, a pronounced advantage over vanilla Prox-SGD in high-noise settings.
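As an informal sanity check on where this bound comes from, telescoping the Lyapunov contraction over $K$ iterations (assuming $\Phi_K \geq 0$ and using the uniform draw of $\kappa$) yields
$$\mathbb{E}\left[\|\nabla F(x_\kappa)\|^2\right] = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\left[\|\nabla F(x_k)\|^2\right] \leq \frac{48 M \Phi_0}{K} + \frac{324 L}{M} \sigma^2,$$
and substituting $M = 4L + 3\sqrt{2K L \sigma^2 / \Phi_0}$ makes both terms $O\!\big(L\Phi_0/K + \sqrt{L \Phi_0 \sigma^2 / K}\big)$, which falls below $\varepsilon$ exactly when $K = O(L \Phi_0 \sigma^2 / \varepsilon^2 + L \Phi_0 / \varepsilon)$.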

4. Momentum-Induced Variance Reduction

A salient property of Polyak momentum in this context is explicit variance reduction. The buffer error $\Delta_k$ decreases in tandem with the gradient norm:
$$\mathbb{E}\left[ \|m_\kappa - \nabla f(x_\kappa)\|^2 \right] \leq \varepsilon,$$
for the same $K$ as above. Thus, the bias introduced by the momentum estimate vanishes at the same rate as the optimality gap, without requiring large batches or post hoc variance-reduction corrections.
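To see informally why the buffer error contracts, drop the coupling term $\tfrac{L^2}{\gamma} R_k$ from the variance descent inequality and unroll the resulting recursion (a heuristic calculation, not part of the formal proof):
$$\Delta_{k+1} \leq (1-\gamma)\,\Delta_k + \gamma^2 \sigma^2 \;\Longrightarrow\; \Delta_k \leq (1-\gamma)^k \Delta_0 + \gamma^2 \sigma^2 \sum_{j=0}^{k-1} (1-\gamma)^j \leq (1-\gamma)^k \Delta_0 + \gamma \sigma^2.$$
With $\gamma = 3L/(M-L) = \Theta\!\big(\sqrt{L \Phi_0 / (K \sigma^2)}\big)$ for large $K$, the steady-state term $\gamma \sigma^2 = \Theta\!\big(\sqrt{L \Phi_0 \sigma^2 / K}\big)$ vanishes as $K$ grows, consistent with the rate above.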

5. Inexact Proximal Mapping and Robustness

Practical deployments may solve the proximal step inexactly. To model this, an approximate stationarity criterion is imposed:
$$\mathbb{E}\left[ \|\nabla \Omega_k(x_{k+1})\|^2 \right] \leq \frac{M^2}{16}\, \mathbb{E}\left[ \|x_{k+1} - x_k\|^2 \right] + S_k,$$
where $\Omega_k$ denotes the prox-subproblem objective from Section 2 and $S_k$ is the approximation error in the subproblem.

The result is robust: the Lyapunov contraction persists with an error term:
$$\mathbb{E}\left[ \|\nabla F(x_\kappa)\|^2 \right] \leq \varepsilon/2 + \frac{8}{K} \sum_{k=0}^{K-1} S_k.$$
If $S_k \leq \varepsilon/16$, the optimality guarantee is preserved. In practice, running SGD with a small step size on the subproblem (especially when $\psi$ is smooth) suffices to drive $S_k \to 0$.
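A sketch of one way to realize the inexact regime: when $\psi$ is smooth but lacks a closed-form prox, the subproblem can be driven toward stationarity with a handful of inner gradient steps. The warm start, the inner step size, and the `grad_psi` callable are illustrative assumptions.

```python
def inexact_prox(x_k, m_k, M, grad_psi, inner_steps=20):
    """Approximately minimize Omega_k(x) = <m_k, x - x_k> + psi(x) + (M/2)||x - x_k||^2
    by plain gradient descent; more inner steps push the residual S_k toward zero."""
    x = x_k - m_k / M                                    # warm start: exact prox when psi = 0
    lr = 1.0 / (2.0 * M)                                 # conservative step, assuming psi is at most M-smooth
    for _ in range(inner_steps):
        grad_omega = m_k + grad_psi(x) + M * (x - x_k)   # gradient of the subproblem objective
        x = x - lr * grad_omega
    return x
```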

6. Numerical Experiments and Performance Characterization

Empirical validations are provided for two principal cases:

  • Synthetic quadratic: $f(x) = \frac{1}{2} L \|x\|^2$ plus Gaussian gradient noise. Standard Prox-SGD stalls at an $O(\sigma^2)$ noise floor, failing to decrease the gradient as $K$ increases. SGD-M achieves the theoretically predicted $O(1/\sqrt{K})$ gradient decay to arbitrary precision at a fixed batch size (a toy reproduction sketch follows this list).
  • Image classification (CIFAR-10): Employing $\psi$ as a regularizer derived from a "proxy" loss on a small subset and $f$ as the loss on the full data, SGD-M demonstrates superior convergence in training loss and generalization accuracy compared to vanilla Prox-SGD under the same mini-batch regime.
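The following toy sketch mimics the synthetic comparison with $\psi \equiv 0$ (so the prox is the identity); the dimension, noise level, horizon, and the crude choice $\Phi_0 = 1$ are illustrative and not the paper's exact experimental configuration.

```python
import numpy as np

# f(x) = (L/2) ||x||^2 with additive Gaussian gradient noise; psi = 0.
rng = np.random.default_rng(0)
L, sigma, d, K = 1.0, 1.0, 50, 20000
noisy_grad = lambda x: L * x + sigma * rng.standard_normal(d)

def run(use_momentum):
    x = rng.standard_normal(d)
    m = noisy_grad(x)
    if use_momentum:                               # theory-driven M and gamma (Phi_0 taken as 1)
        M = 4.0 * L + 3.0 * np.sqrt(2.0 * K * L * sigma**2)
        gamma = 3.0 * L / (M - L)
    else:                                          # vanilla Prox-SGD: fixed step 1/(4L), no averaging
        M, gamma = 4.0 * L, 1.0
    for _ in range(K):
        m = (1.0 - gamma) * m + gamma * noisy_grad(x)
        x = x - m / M
    return np.linalg.norm(L * x)                   # ||grad f|| at the final iterate

print("Prox-SGD:", run(False))                     # settles at a sigma-dependent noise floor
print("SGD-M   :", run(True))                      # reaches a much smaller gradient norm at the same K
```

With a fixed step, the no-momentum run settles at a gradient norm dictated by $\sigma$, whereas the momentum run keeps shrinking as $K$ (and hence $M$ and $1/\gamma$) grows.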

These results confirm that Polyak momentum fundamentally alters the noise dynamics; it underpins fast, statistically efficient optimization for nonconvex composite problems, even in settings with severe stochasticity.

7. Algorithmic Trade-Offs, Parameter Tuning, and Deployment Guidelines

Optimal performance of SGD-M hinges on careful parameter selection, as dictated by the analytical bounds:

  • Step size $M$: Must satisfy $M > 4L$ and be scaled as $M = 4L + 3\sqrt{2K L \sigma^2 / \Phi_0}$ for a robust bias-variance trade-off.
  • Momentum weight $\gamma$: Scale with problem smoothness, $\gamma = 3L/(M-L)$.
  • Batch size: The theory allows arbitrary batch size; the convergence rate does not deteriorate as the batch size shrinks, though the variance constant scales with $\sigma^2$ from the stochastic oracle.
  • Proximal accuracy: Keep $S_k$ sufficiently small when solving the subproblem approximately, achievable via small-step inner SGD, to match the theoretical guarantees (an end-to-end sketch follows this list).
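Tying the earlier sketches together, a hypothetical end-to-end call might look as follows; `make_oracle`, `prox_l1`, `theory_params`, and `sgd_m` are the illustrative helpers defined in the previous sections, and `sigma2_hat` / `Phi0_hat` stand in for rough estimates that a practitioner must supply or tune.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 20))
b = A @ rng.standard_normal(20) + 0.5 * rng.standard_normal(1000)

oracle = make_oracle(A, b, batch_size=8, rng=rng)        # stochastic first-order oracle for f
L_hat = np.linalg.norm(A, 2) ** 2 / A.shape[0]           # smoothness estimate for f = (1/2n)||Ax - b||^2
sigma2_hat, Phi0_hat, K = 1.0, 1.0, 5000                 # rough noise / initial-gap estimates
M, gamma = theory_params(L_hat, sigma2_hat, Phi0_hat, K) # analysis-driven step size and momentum
x_final = sgd_m(np.zeros(20), oracle, prox_l1, M, gamma, K)
```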

For nonconvex composite minimization in high-noise regimes, these guidelines guarantee optimal minimization rates, buffer variance suppression, and resilience to subproblem approximation errors. The SGD-M framework is thus highly deployable in practical large-scale and deep learning settings.

8. Context, Impact, and Comparative Perspective

SGD-M, analyzed in (Gao et al., 5 Mar 2024), addresses longstanding issues in stochastic composite optimization:

  • It breaks the batch-size bottleneck of vanilla Prox-SGD, which lacks convergence guarantees in nonconvex settings under high stochastic noise unless large batches are used.
  • Polyak momentum is shown, both theoretically and empirically, to suppress variance and induce stability, unlike generic momentum or acceleration schemes.
  • SGD-M is robust to inexact computation, suitable for resource-constrained or large-scale deep architectures.

The results of (Gao et al., 5 Mar 2024) establish SGD-M as a theoretically optimized and practically potent primitive for modern stochastic, nonconvex composite optimization, with convergence guarantees matching deterministic rates, universal batch size tolerance, and improved generalization confirmed by experiments.
