Double-Clipped NSGD-MVR
- The paper introduces double-clipped NSGD-MVR, an algorithm that extends NSGD-MVR with dual-level gradient clipping to secure high-probability convergence under heavy-tailed noise conditions.
- It utilizes momentum-variance reduction combined with tailored clipping strategies to balance bias and variance while achieving near-optimal sample complexity under relaxed smoothness assumptions.
- The framework provides rigorous high-probability convergence guarantees that improve on expectation-based methods, notably achieving optimal rates for canonical parameter settings (p=q=2).
Double-Clipped NSGD-MVR is an algorithmic framework for stochastic nonconvex optimization under heavy-tailed noise, where the gradient noise is characterized by a bounded p-th central moment (p-BCM) with p ∈ (1, 2]. The method extends Normalized Stochastic Gradient Descent with Momentum Variance Reduction (NSGD-MVR) by employing dual-level gradient clipping and a tailored high-probability analysis to deliver near-optimal complexity guarantees even in regimes where conventional assumptions do not hold. Double-Clipped NSGD-MVR provides explicit high-probability convergence rates under weak smoothness, namely relaxed mean-squared smoothness (q-WAS) and a similarity-plus-smoothness condition, broadening applicability and improving upon previous lower bounds for stochastic optimization in heavy-tailed scenarios (Fradin et al., 21 Dec 2025).
1. Algorithmic Structure and Update Rules
The Double-Clipped NSGD-MVR algorithm generates iterates x_1, …, x_T over T steps. Its primary parameters are a step size γ, a momentum parameter β, two clipping thresholds λ₁ and λ₂, and access to a stochastic gradient oracle ∇F(x; ξ):
- Gradient clipping at λ₁: the raw stochastic gradient at the current iterate is clipped so that its norm does not exceed λ₁.
- Momentum-variance-reduction (MVR) update with difference clipping at λ₂: the momentum estimate is updated recursively, with the stochastic gradient difference between consecutive iterates clipped at λ₂ before it enters the variance-reduction correction.
- Gradient normalization and descent: the iterate moves a distance γ along the normalized momentum direction.
The composite update ensures stability under heavy-tailed noise by constraining the magnitude of both raw and difference gradients through dual clipping, followed by normalization; an implementation sketch is given below.
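The following Python sketch shows one plausible instantiation of this template. The precise placement of the two clipping operators, the initialization, and the symbols γ, β, λ₁, λ₂ follow the description above but are a reconstruction rather than the paper's verbatim update rules.

```python
import numpy as np

def clip(v, lam):
    """Standard norm clipping: rescale v so that ||v|| <= lam."""
    nv = np.linalg.norm(v)
    return v if nv <= lam else (lam / nv) * v

def double_clipped_nsgd_mvr(stoch_grad, sample, x0, T, gamma, beta, lam1, lam2, rng):
    """Sketch of a double-clipped NSGD-MVR loop (hypothetical reconstruction).

    stoch_grad(x, xi) -- stochastic gradient at x for sample xi; the same xi is
                         reused at two consecutive iterates, as MVR requires.
    sample(rng)       -- draws a fresh sample xi.
    Returns the list of iterates.
    """
    x_prev = np.asarray(x0, dtype=float)
    m = clip(stoch_grad(x_prev, sample(rng)), lam1)            # clipped initial momentum
    x = x_prev - gamma * m / max(np.linalg.norm(m), 1e-12)     # first normalized step
    iterates = [x_prev.copy(), x.copy()]
    for _ in range(T - 1):
        xi = sample(rng)
        g = stoch_grad(x, xi)
        g_clipped = clip(g, lam1)                              # clipping level 1: raw gradient at lam1
        d_clipped = clip(g - stoch_grad(x_prev, xi), lam2)     # clipping level 2: MVR difference at lam2
        m = beta * g_clipped + (1.0 - beta) * (m + d_clipped)  # momentum-variance-reduction update
        x_prev, x = x, x - gamma * m / max(np.linalg.norm(m), 1e-12)  # normalized descent step
        iterates.append(x.copy())
    return iterates
```

A caller supplies stoch_grad(x, xi) and sample(rng); for instance, an additive-noise oracle of the form stoch_grad = lambda x, xi: grad_f(x) + xi with heavy-tailed xi reproduces the p-BCM setting.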
2. Analytical Assumptions and Heavy-Tailed Regime
Double-Clipped NSGD-MVR is analyzed under the following formal model assumptions:
- Bounded p-th Central Moment (Heavy-Tailed Noise): The p-th central moment of the stochastic gradient noise is bounded for some p ∈ (1, 2]; the exponent p quantifies the departure from sub-Gaussianity, with smaller p corresponding to heavier tails.
- Relaxed Mean-Squared Smoothness (q-WAS): A weakened, averaged smoothness condition parameterized by an exponent q, with the canonical choice q = 2 corresponding to standard mean-squared smoothness.
- Similarity Plus Smoothness: An alternative generalized-smoothness regime combining a similarity condition on the stochastic gradient oracle with standard smoothness of the objective.
This generalized smoothness framework extends the analysis to non-standard exponents and similarity regimes, going beyond prior works that restricted attention to canonical exponents such as q = 2 or required stricter gradient/Hessian regularity. Common formalizations of these assumptions are sketched below.
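For concreteness, the display below records standard forms under which p-BCM, relaxed averaged smoothness, and gradient similarity typically appear in the literature; the constants σ, L, δ and the exact quantifiers are placeholders and may differ from the precise conditions used by Fradin et al.

$$
\begin{aligned}
\text{(p-BCM)}\quad & \mathbb{E}_{\xi}\,\|\nabla F(x;\xi) - \nabla f(x)\|^{p} \le \sigma^{p}, && p \in (1,2],\\
\text{(q-WAS)}\quad & \mathbb{E}_{\xi}\,\|\nabla F(x;\xi) - \nabla F(y;\xi)\|^{q} \le L^{q}\,\|x-y\|^{q}, &&\\
\text{(similarity)}\quad & \big\|\nabla F(x;\xi) - \nabla F(y;\xi) - \big(\nabla f(x) - \nabla f(y)\big)\big\| \le \delta\,\|x-y\|, && f \text{ smooth}.
\end{aligned}
$$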
3. High-Probability Convergence Guarantees
Double-Clipped NSGD-MVR supports high-probability convergence to stationary points. With the step size γ, momentum parameter β, and clipping thresholds λ₁, λ₂ set as explicit functions of the horizon T, the target accuracy ε, and the failure probability δ, the algorithm guarantees that the best gradient norm among the iterates falls below ε with probability at least 1 − δ. The number of iterations required to reach accuracy ε scales polynomially in 1/ε, with an exponent governed by the noise exponent p and the smoothness exponent q, and depends on δ only through logarithmic factors. For the canonical parameter choice p = q = 2, the complexity simplifies to O(ε⁻³) up to logarithmic terms, matching the optimal expectation-based rate; the canonical-case statement is summarized below.
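As an illustration of the shape of this guarantee in the canonical case (the general (p, q) exponent and the explicit parameter schedule are not reproduced here), the statement reads, up to constants and the paper's precise logarithmic factors:

$$
\Pr\Big[\min_{1 \le t \le T} \|\nabla f(x_t)\| \le \varepsilon\Big] \;\ge\; 1 - \delta
\qquad\text{whenever}\qquad
T = O\!\big(\varepsilon^{-3}\,\mathrm{polylog}(1/\delta)\big), \quad p = q = 2.
$$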
4. Comparative Analysis with Expectation-Only NSGD-MVR
Expectation-only convergence for NSGD-MVR bounds the expected gradient norm by ε after a number of iterations that is polynomial in 1/ε (O(ε⁻³) in the canonical case p = q = 2). Double-Clipped NSGD-MVR strengthens this result, establishing the same rates (modulo logarithmic factors) with high probability, i.e., the bounds hold on all but a δ-probability tail event rather than merely in expectation. The dual clipping parameters must be set as functions of δ, incurring a logarithmic penalty, but enabling strong probabilistic control; a toy numerical illustration of the distinction follows the table below.
| Algorithm | Guarantee type | Iterations to reach accuracy ε |
|---|---|---|
| NSGD-MVR | Expectation | Polynomial in 1/ε, depending on (p, q); O(ε⁻³) for p = q = 2 |
| Double-Clipped NSGD-MVR | High probability | Same order as the expectation bound (O(ε⁻³) for p = q = 2), with additional logarithmic factors |
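The practical difference between the two guarantee types can be seen in a toy Monte Carlo experiment (not taken from the paper): the quadratic objective, the Student-t noise, and all hyperparameter values below are arbitrary choices used only to show that, under heavy-tailed noise, the tail quantiles of the stationarity measure can sit well above its mean, which is exactly what a high-probability bound must control.

```python
import numpy as np

def clip(v, lam):
    nv = np.linalg.norm(v)
    return v if nv <= lam else (lam / nv) * v

def run_trial(rng, T=300, gamma=0.05, beta=0.2, lam1=5.0, lam2=5.0, dim=10):
    """One run of a double-clipped, normalized MVR loop on f(x) = 0.5*||x||^2
    with additive heavy-tailed noise; returns min_t ||grad f(x_t)||."""
    grad = lambda x: x                                  # exact gradient of the toy objective
    noise = lambda: rng.standard_t(df=1.8, size=dim)    # infinite variance: p-BCM holds only for p < 1.8
    x_prev = rng.normal(size=dim)
    m = clip(grad(x_prev) + noise(), lam1)
    x = x_prev - gamma * m / max(np.linalg.norm(m), 1e-12)
    best = min(np.linalg.norm(grad(x_prev)), np.linalg.norm(grad(x)))
    for _ in range(T - 1):
        xi = noise()                                    # same noise sample reused at both points
        g = grad(x) + xi
        d = clip((grad(x) + xi) - (grad(x_prev) + xi), lam2)
        m = beta * clip(g, lam1) + (1.0 - beta) * (m + d)
        x_prev, x = x, x - gamma * m / max(np.linalg.norm(m), 1e-12)
        best = min(best, np.linalg.norm(grad(x)))
    return best

results = np.array([run_trial(np.random.default_rng(seed)) for seed in range(200)])
print("mean min-gradient norm:", results.mean())              # what an in-expectation bound controls
print("99% quantile          :", np.quantile(results, 0.99))  # what a high-probability bound controls
```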
5. Key Theoretical Lemmas and Analytical Techniques
The convergence analysis relies on several critical results:
- Zero-Chain Progress (Lemma A.4): Clipping in each momentum update can cause zero progress along certain directions (the “zero-chain effect”), implying that, with constant probability, some coordinates remain unchanged; this observation is important for tail control.
- Martingale Deviation Control (Lemma A.12): Freedman's inequality is applied to sums of clipped noise terms to manage the influence of heavy-tailed gradient fluctuations; a standard statement of the inequality is recalled at the end of this section.
- Bias-Variance Decomposition for MVR Error (Lemma B.3): Decomposes the momentum error into a bias component that decays geometrically in the momentum parameter β and residual noise terms that are bounded using the clipping and martingale estimates.
- Refined Descent Lemmas (Lemmas B.1–B.2): These address the interplay between gradient clipping, variance reduction, and progression towards stationary points under weak smoothness.
The parameter selection (step size γ, momentum β, and clipping thresholds λ₁, λ₂) is optimized to balance geometric bias decay, control of heavy-tail effects, and the smoothness constants, leading to high-probability descent of a Lyapunov potential to ε-stationarity within the stated number of steps.
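For reference, Freedman's inequality, the martingale tool invoked in Lemma A.12, can be stated in the following standard form; the clipping steps are what supply the almost-sure bound R on the increments that the inequality requires.

$$
\Pr\Big[\exists\, t \le T:\ \sum_{s=1}^{t} X_s \ge a \ \text{ and } \ \sum_{s=1}^{t}\mathbb{E}\big[X_s^{2}\mid \mathcal{F}_{s-1}\big] \le b\Big]
\;\le\; \exp\!\Big(\!-\frac{a^{2}}{2\,(b + Ra/3)}\Big),
$$

for any martingale difference sequence (X_s) adapted to a filtration (F_s) with X_s ≤ R almost surely, and any a, b > 0.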
6. Context and Significance in Stochastic Nonconvex Optimization
Double-Clipped NSGD-MVR represents a methodological advance for stochastic nonconvex optimization in heavy-tailed regimes. It subsumes and improves upon earlier analyses, both sample-complexity lower bounds and algorithmic upper bounds, under broad smoothness and noise conditions. The dual clipping strategy, gradient normalization, and momentum-variance reduction yield a method that is simple to implement and rigorous in its theoretical guarantees. Furthermore, the approach generalizes beyond standard smoothness, enabling applications to settings previously considered out of scope due to heavy-tailed, non-Gaussian gradient noise (Fradin et al., 21 Dec 2025).