SGD-M: Polyak Momentum in Stochastic Optimization
- SGD-M is a stochastic optimization algorithm that integrates Polyak momentum to accelerate convergence and mitigate noise-induced variance.
- It leverages proximal gradient methods to handle composite objectives, achieving an O(1/√K) convergence rate even with small batch sizes.
- Empirical and theoretical analyses demonstrate that SGD-M outperforms vanilla SGD in stability and optimization efficiency for nonconvex problems.
Stochastic Gradient Descent with Polyak Momentum (SGD-M), also referred to as the Stochastic Heavy-Ball method (SHB), is a foundational algorithm in large-scale optimization and machine learning. It incorporates the classical momentum term introduced by Boris Polyak to accelerate convergence, improve variance reduction, and enhance stability in stochastic environments. The following exposition rigorously details the key algorithmic properties, convergence theory, variance dynamics, inexactness tolerance, and applications, as established in current research literature (Gao et al., 5 Mar 2024).
1. Problem Formulation and Stochastic Oracle
SGD-M is applied to composite optimization problems of the form

$$\min_{x \in \mathbb{R}^d} F(x) := f(x) + h(x),$$

where
- $f$ is differentiable and $L$-smooth (i.e., $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y$),
- $h$ is a "simple" convex function or indicator, admitting an efficient proximal mapping.

The algorithm accesses $f$ through a stochastic first-order oracle $g(x, \xi)$ satisfying $\mathbb{E}_\xi[g(x,\xi)] = \nabla f(x)$ and $\mathbb{E}_\xi\|g(x,\xi) - \nabla f(x)\|^2 \le \sigma^2$, which models potentially significant stochastic noise due to small batch sizes, as in modern deep learning practice.
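As a concrete illustration, here is a minimal sketch of such an oracle for a toy least-squares objective; the Gaussian noise model and all names are illustrative choices, not taken from the paper:

```python
import numpy as np

def make_oracle(A, b, sigma, rng):
    """Stochastic first-order oracle for f(x) = 0.5*||A x - b||^2.

    Returns a callable g(x) = grad f(x) + noise, with E[noise] = 0 and
    E||noise||^2 <= sigma^2, matching the oracle model in the text.
    """
    d = A.shape[1]
    def oracle(x):
        noise = rng.normal(scale=sigma / np.sqrt(d), size=d)
        return A.T @ (A @ x - b) + noise
    return oracle
```

Averaging many oracle calls at a fixed point recovers the true gradient, confirming unbiasedness.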
2. SGD-M Algorithm and Proximal Dynamics
SGD-M maintains
- a sequence of iterates $x_k$, and
- a momentum buffer $m_k$.

At each iteration $k$:
- Sample $\xi_k$ and form the stochastic gradient $g_k = g(x_k, \xi_k)$.
- Update the momentum buffer:
$$m_k = (1-\beta)\, m_{k-1} + \beta\, g_k,$$
with momentum coefficient $\beta \in (0, 1]$ (Polyak convention: moving-average style).
- Apply the proximal gradient step:
$$x_{k+1} = \operatorname{prox}_{\gamma h}(x_k - \gamma m_k),$$
where $\gamma > 0$ is the step-size parameter. Explicitly, the prox-subproblem is:
$$x_{k+1} = \arg\min_{y} \; h(y) + \frac{1}{2\gamma}\,\|y - (x_k - \gamma m_k)\|^2.$$
This formulation generalizes vanilla SGD and Prox-SGD to nonconvex settings, managing bias and variance via Polyak momentum even when batch size is limited.
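The two updates above fit in a few lines of code. The sketch below assumes the concrete choice $h(x) = \lambda\|x\|_1$, whose proximal mapping is soft-thresholding; the function names are hypothetical, not from the paper:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal mapping of t*||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sgdm_prox(x0, oracle, gamma, beta, lam, num_iters):
    """SGD-M with a proximal step, for h(x) = lam*||x||_1.

    Per iteration:
        m_k     = (1-beta)*m_{k-1} + beta*g_k   (Polyak moving average)
        x_{k+1} = prox_{gamma*h}(x_k - gamma*m_k)
    """
    x = x0.copy()
    m = np.zeros_like(x0)  # momentum buffer, initialized at zero
    for _ in range(num_iters):
        g = oracle(x)                        # stochastic gradient call
        m = (1.0 - beta) * m + beta * g      # momentum update
        x = soft_threshold(x - gamma * m, gamma * lam)  # prox step
    return x
```

Any callable returning a stochastic gradient can serve as `oracle`; with a noiseless oracle the iterates converge to the exact proximal fixed point.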
3. Variance Analysis and Lyapunov-Based Convergence
The convergence analysis introduces pivotal sequences:
- $\Delta_k = \mathbb{E}[F(x_k) - F^\star]$ (expected suboptimality),
- $\varepsilon_k = \mathbb{E}\|m_k - \nabla f(x_k)\|^2$ (momentum buffer bias),
- $\mathbb{E}\|x_{k+1} - x_k\|^2$ (expected squared proximal step length).
The descent lemmas governing algorithmic progress are, up to absolute constants:
- Variance descent for the buffer:
$$\varepsilon_{k+1} \le (1-\beta)\,\varepsilon_k + O(\beta^2\sigma^2) + O\!\big(L^2/\beta\big)\,\mathbb{E}\|x_{k+1}-x_k\|^2,$$
capturing the buffering of stochastic noise.
- Objective descent:
$$\Delta_{k+1} \le \Delta_k - \Omega(1/\gamma)\,\mathbb{E}\|x_{k+1}-x_k\|^2 + O(\gamma)\,\varepsilon_k,$$
quantifying the trade-off between bias and objective decay.
- Gradient mapping: the stationarity measure $G_\gamma(x_k) = \frac{1}{\gamma}\big(x_k - \operatorname{prox}_{\gamma h}(x_k - \gamma\nabla f(x_k))\big)$ satisfies
$$\mathbb{E}\|G_\gamma(x_k)\|^2 \le O(1/\gamma^2)\,\mathbb{E}\|x_{k+1}-x_k\|^2 + O(1)\,\varepsilon_k.$$
A Lyapunov function $\Phi_k = \Delta_k + c\,\varepsilon_k$ (with $c \asymp \gamma/\beta$) is constructed to admit a contraction. Selecting $\gamma \asymp 1/(L\sqrt{K})$ and $\beta \asymp L\gamma$ balances the bias-variance terms.
The main theorem asserts that, for these choices of $\gamma$ and $\beta$,
$$\min_{k < K}\, \mathbb{E}\|G_\gamma(x_k)\|^2 \le O\!\big(1/\sqrt{K}\big),$$
matching the optimal rate, independent of batch size: a pronounced advantage over vanilla Prox-SGD in high-noise settings.
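Up to absolute constants, the telescoping step can be summarized as follows. This is a hedged reconstruction of the standard argument, with $\Delta_k$ the expected suboptimality, $\varepsilon_k$ the buffer bias, and $D_k$ the expected squared step length; the paper's exact constants may differ:

```latex
\Phi_k := \Delta_k + \frac{\gamma}{\beta}\,\varepsilon_k
\quad\Rightarrow\quad
\Phi_{k+1} \le \Phi_k - \Omega\!\Big(\tfrac{1}{\gamma}\Big) D_k + O(\gamma\beta\sigma^2),
\qquad D_k := \mathbb{E}\|x_{k+1}-x_k\|^2 .
% Summing over k = 0, \dots, K-1 and using
% \mathbb{E}\|G_\gamma(x_k)\|^2 \lesssim D_k/\gamma^2 + \varepsilon_k :
\min_{k < K} \mathbb{E}\|G_\gamma(x_k)\|^2
\;\lesssim\; \frac{\Phi_0}{\gamma K} + \beta\sigma^2
\;=\; O\!\Big(\tfrac{1}{\sqrt{K}}\Big)
\quad \text{for } \gamma \asymp \tfrac{1}{L\sqrt{K}},\ \beta \asymp L\gamma .
```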
4. Momentum-Induced Variance Reduction
A salient property of Polyak momentum in this context is explicit variance reduction. The buffer error decreases in tandem with the gradient norm: on average, $\varepsilon_k = O(1/\sqrt{K})$ for the same choices of $\gamma$ and $\beta$ as above. Thus, the bias introduced by the momentum estimate vanishes at the same rate as the optimality gap, without requiring large batches or post hoc variance-reduction corrections.
5. Inexact Proximal Mapping and Robustness
Practical deployments may solve the proximal step inexactly. To model this, an approximate stationarity criterion is imposed: the computed update is required to be a near-stationary point of the prox-subproblem, up to a tolerance $\delta$ that quantifies the approximation error in the subproblem.
The result is robust: the Lyapunov contraction persists with an additive error term driven by $\delta$. If the tolerances are kept small relative to the target accuracy (e.g., driven to zero with $K$), the $O(1/\sqrt{K})$ optimality is preserved. In practice, running SGD with a small step-size on the subproblem (especially when $h$ is smooth) suffices to make $\delta$ negligible.
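For smooth $h$, the inner solver can be plain gradient descent on the strongly convex prox-subproblem; the sketch below is illustrative (the function names and the smoothness assumption on $h$ are my choices, not the paper's):

```python
import numpy as np

def inexact_prox(h_grad, z, gamma, inner_lr, inner_iters):
    """Approximately solve the prox-subproblem
        min_y  h(y) + (1/(2*gamma)) * ||y - z||^2
    by inner gradient descent, assuming h is smooth with gradient h_grad.
    More inner iterations shrink the subproblem error delta."""
    y = z.copy()  # warm start at the prox center
    for _ in range(inner_iters):
        y -= inner_lr * (h_grad(y) + (y - z) / gamma)
    return y
```

For a quadratic $h$ the exact prox is known in closed form, which makes the inner loop easy to sanity-check.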
6. Numerical Experiments and Performance Characterization
Empirical validations are provided for two principal cases:
- Synthetic Quadratic: a least-squares objective with additive Gaussian gradient noise. Standard prox-SGD stalls at a noise floor, failing to decrease the gradient norm as the iteration count grows. SGD-M achieves the theoretically predicted gradient decay to arbitrary precision at a fixed batch size.
- Image Classification (CIFAR-10): employing a regularizer $h$ derived from a "proxy" loss on a small subset, and the loss $f$ on the full data, prox-SGD-M demonstrates superior convergence in training loss and generalization accuracy compared to vanilla prox-SGD under the same mini-batch regime.
These results confirm that Polyak momentum fundamentally alters the noise dynamics; it underpins fast, statistically efficient optimization for nonconvex composite problems, even in settings with severe stochasticity.
7. Algorithmic Trade-Offs, Parameter Tuning, and Deployment Guidelines
Optimal performance of SGD-M hinges on careful parameter selection, as dictated by the analytical bounds:
- Step-size $\gamma$: must satisfy $\gamma \le O(1/L)$ and be scaled as $\gamma \propto 1/\sqrt{K}$ for a robust bias-variance trade-off.
- Momentum weight $\beta$: scale with the problem smoothness and step-size, $\beta \asymp L\gamma$.
- Batch size: the theory allows arbitrary batch size; the convergence rate does not deteriorate as the batch size shrinks, though the variance constant scales with $\sigma^2$ from the stochastic oracle.
- Proximal accuracy: keep the subproblem error sufficiently small in practical subproblem solving, achievable via small-step inner SGD, to match the theoretical guarantees.
For nonconvex composite minimization in high-noise regimes, these guidelines guarantee optimal minimization rates, buffer variance suppression, and resilience to subproblem approximation errors. The SGD-M framework is thus highly deployable in practical large-scale and deep learning settings.
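These rules can be packaged as a tiny helper. The constants below are hypothetical placeholders consistent with the scalings above, not values from the paper:

```python
import math

def suggest_params(L, K):
    """Illustrative parameter schedule following the guidelines above:
    gamma ~ 1/(L*sqrt(K)), capped at 1/(2L), and beta ~ L*gamma.
    The cap and proportionality constants are placeholder choices."""
    gamma = min(1.0 / (L * math.sqrt(K)), 1.0 / (2.0 * L))
    beta = min(1.0, L * gamma)
    return gamma, beta
```

For example, a smoothness constant of 10 and a budget of 10,000 iterations yields a small step-size paired with a proportionally small momentum weight.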
8. Context, Impact, and Comparative Perspective
SGD-M, analyzed in (Gao et al., 5 Mar 2024), addresses longstanding issues in stochastic composite optimization:
- It breaks the batch size bottleneck of vanilla Prox-SGD, which cannot converge in nonconvex settings under high stochastic noise.
- Polyak momentum is shown to be a variance suppressor and stability inducer—empirically and theoretically—unlike generic momentum or acceleration schemes.
- SGD-M is robust to inexact computation, suitable for resource-constrained or large-scale deep architectures.
The results of (Gao et al., 5 Mar 2024) establish SGD-M as a theoretically optimized and practically potent primitive for modern stochastic, nonconvex composite optimization, with convergence guarantees matching deterministic rates, universal batch size tolerance, and improved generalization confirmed by experiments.