
Double-Clipped NSGD-MVR

Updated 28 December 2025
  • The paper introduces double-clipped NSGD-MVR, an algorithm that extends NSGD-MVR with dual-level gradient clipping to secure high-probability convergence under heavy-tailed noise conditions.
  • It utilizes momentum-variance reduction combined with tailored clipping strategies to balance bias and variance while achieving near-optimal sample complexity under relaxed smoothness assumptions.
  • The framework provides rigorous high-probability convergence guarantees that improve on expectation-based methods, notably achieving optimal rates for the canonical parameter setting ($p=q=2$).

Double-Clipped NSGD-MVR is an algorithmic framework designed for stochastic nonconvex optimization under heavy-tailed noise conditions, where the underlying gradient noise is characterized by a bounded $p$-th central moment ($p$-BCM) for $p \in (1,2]$. The method extends Normalized Stochastic Gradient Descent with Momentum Variance Reduction (NSGD-MVR) by employing dual-level gradient clipping and a tailored high-probability analysis to deliver near-optimal complexity guarantees even in regimes where conventional assumptions do not hold. Double-Clipped NSGD-MVR provides explicit high-probability convergence rates under weak smoothness (relaxed mean-squared smoothness, $q$-WAS with $q \in [1,2]$, and $(q,\delta)$-similarity), broadening applicability and improving upon previous lower bounds for stochastic optimization in heavy-tailed scenarios (Fradin et al., 21 Dec 2025).

1. Algorithmic Structure and Update Rules

The Double-Clipped NSGD-MVR algorithm operates on iterates $x_t$ over $T$ steps, with primary parameters consisting of step-size $\gamma$, momentum parameter $\alpha \in (0,1)$, clipping thresholds $\lambda_1, \lambda_2 > 0$, and access to a stochastic gradient oracle $\nabla f(x,\xi)$:

  • Gradient Clipping at $\lambda_2$:

$$\tilde g_t = \mathrm{clip}\big(\nabla f(x_t,\xi_t),\,\lambda_2\big) = \min\Big\{1,\ \frac{\lambda_2}{\|\nabla f(x_t,\xi_t)\|}\Big\}\,\nabla f(x_t,\xi_t)$$

  • Momentum-Variance-Reduction (MVR) Update with Difference Clipping at $\lambda_1$:

$$\bar\Delta_t = \mathrm{clip}\big(\nabla f(x_t,\xi_t)-\nabla f(x_{t-1},\xi_t),\ \lambda_1\big)$$

$$g_t = (1-\alpha)\big(g_{t-1}+\bar\Delta_t\big) + \alpha\,\tilde g_t$$

  • Gradient Normalization and Descent:

$$x_{t+1} = x_t - \gamma\,\frac{g_t}{\|g_t\|}$$

The composite update ensures stability under heavy-tailed noise by constraining the magnitude of both raw and difference gradients through dual clipping, followed by normalization.
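The loop below is a minimal NumPy sketch of these update rules, not an implementation from the paper: the oracle interface `grad_oracle(x, xi)` (assumed to return $\nabla f(x,\xi)$ for a shared sample $\xi$), the seeding scheme, and the small denominator safeguard in the normalization step are all illustrative assumptions.

```python
import numpy as np

def clip(v, threshold):
    """Norm clipping: rescale v so that ||clip(v)|| <= threshold."""
    norm = np.linalg.norm(v)
    return v if norm <= threshold else (threshold / norm) * v

def double_clipped_nsgd_mvr(grad_oracle, x0, T, gamma, alpha, lam1, lam2, seed=0):
    """Sketch of the double-clipped NSGD-MVR updates.

    grad_oracle(x, xi) is assumed to return a stochastic gradient at x, with xi
    the randomness shared between the x_t and x_{t-1} evaluations in the MVR term.
    """
    rng = np.random.default_rng(seed)
    x_prev = x0.copy()
    x = x0.copy()
    g = np.zeros_like(x0)                           # g_0 = 0, as in Section 3
    for t in range(T):
        xi = rng.integers(0, 2**31 - 1)             # shared sample for both evaluations
        grad_t = grad_oracle(x, xi)
        grad_prev = grad_oracle(x_prev, xi)
        g_tilde = clip(grad_t, lam2)                # clipping at lambda_2
        delta_bar = clip(grad_t - grad_prev, lam1)  # difference clipping at lambda_1
        g = (1 - alpha) * (g + delta_bar) + alpha * g_tilde  # MVR momentum update
        x_prev = x
        x = x - gamma * g / (np.linalg.norm(g) + 1e-12)      # normalized step (safeguarded)
    return x
```

In this sketch the first iteration has $x_{-1}=x_0$, so the difference term vanishes and the initial update reduces to $\alpha$ times the clipped gradient; variants that initialize $g_0$ with a full clipped gradient would behave similarly.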

2. Analytical Assumptions and Heavy-Tailed Regime

Double-Clipped NSGD-MVR is analyzed under the following formal model assumptions:

  • Bounded $p$-th Central Moment (Heavy-Tailed Noise):

$$\mathbb{E}_\xi[\nabla f(x,\xi)] = \nabla F(x), \qquad \mathbb{E}_\xi\|\nabla f(x,\xi)-\nabla F(x)\|^p \le \sigma_1^p$$

where $p\in(1,2]$ quantifies the departure from sub-Gaussianity.

  • Relaxed Mean-Squared Smoothness ($q$-WAS, $q\in[1,2]$):

$$\mathbb{E}_\xi\big\|\nabla f(x,\xi)-\nabla f(y,\xi)\big\|^q \le \bar L^q\,\|x-y\|^q$$

  • $(q,\delta)$-Similarity Plus $L_1$-Smoothness:

$$\|\nabla F(x)-\nabla F(y)\| \le L_1\|x-y\|, \qquad \mathbb{E}_\xi\Big\|\big[\nabla f(x,\xi)-\nabla f(y,\xi)\big]-\big[\nabla F(x)-\nabla F(y)\big]\Big\|^q \le \delta^q\,\|x-y\|^q$$

This generalized smoothness model extends the analysis to non-standard exponents and similarity regimes, going beyond prior works that were restricted to $q=2$ or required stricter gradient/Hessian regularity.
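To build intuition for the $p$-BCM condition, the following sketch (all distributional and sample-size choices are illustrative, not from the paper) simulates heavy-tailed noise with Student-t entries and estimates $\mathbb{E}|\cdot|^p$ for several $p$: moments with $p$ below the tail index stabilize, while the empirical variance keeps growing with the sample size, which is exactly the regime the assumption is designed to cover.

```python
import numpy as np

# Student-t noise with nu = 1.5 degrees of freedom has infinite variance but
# finite p-th moments for p < nu, illustrating why p-BCM is stated for p in (1,2].
rng = np.random.default_rng(0)
nu = 1.5                                       # tail index (illustrative choice)
noise = rng.standard_t(df=nu, size=1_000_000)  # zero-mean heavy-tailed samples

for p in (1.2, 1.4, 2.0):
    moment = np.mean(np.abs(noise) ** p)       # empirical p-th central moment
    print(f"p = {p}: empirical E|noise|^p ~ {moment:.2f}")
# Estimates for p = 1.2 and p = 1.4 stabilize as the sample grows (a finite
# sigma_1 exists), whereas the p = 2.0 estimate diverges with the sample size.
```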

3. High-Probability Convergence Guarantees

Double-Clipped NSGD-MVR supports high-probability convergence to stationary points. Setting

$$\begin{aligned}
\alpha &= \max\left\{T^{-\frac{p}{2p-1}},\ T^{-\frac{pq}{p(2q+1)-2q}}\right\},\\
\lambda_2 &= 4\sqrt{\bar L\Delta_1}\vee\sigma_1\alpha^{-1/p},\\
\lambda_1 &= 2\gamma\bar L\,\alpha^{-1/q},\\
\gamma &= O\left(\min\left\{\sqrt{\frac{\Delta_1}{\bar L T}},\ \alpha\sqrt{\frac{\Delta_1}{\bar L}},\ \frac{1}{\alpha T\ln(T/\beta)}\sqrt{\frac{\Delta_1}{\bar L}}\right\}\right)
\end{aligned}$$

with $g_0=0$, the algorithm achieves:

$$\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(x_t)\| = O\left(\frac{\sqrt{\bar L\Delta_1}+\sigma_1}{T^{\min\left\{\frac{p-1}{2p-1},\ \frac{q(p-1)}{p(2q+1)-2q}\right\}}}\,\ln\frac{T}{\beta}\right)$$

with probability at least $1-\beta$. To guarantee $\min_{t\le T}\|\nabla F(x_t)\|\le\epsilon$, one requires

$$T = \widetilde{O}\left(\epsilon^{-\max\left\{\frac{2p-1}{p-1},\ \frac{p(2q+1)-2q}{q(p-1)}\right\}}\right)$$

For the canonical parameter choice $p=q=2$, this simplifies to $T=\widetilde{O}(\epsilon^{-3})$.
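The helper below is a sketch of how the stated schedule can be instantiated numerically; reading $\vee$ as a maximum and taking the constant inside the $O(\cdot)$ for $\gamma$ as 1 are assumptions made for illustration, and all argument names are hypothetical.

```python
import math

def rate_exponent(p, q):
    """Exponent of 1/T in the high-probability rate above."""
    return min((p - 1) / (2 * p - 1), q * (p - 1) / (p * (2 * q + 1) - 2 * q))

def parameter_schedule(T, p, q, bar_L, Delta_1, sigma_1, beta):
    """Instantiate (alpha, gamma, lambda_1, lambda_2) from the stated formulas."""
    alpha = max(T ** (-p / (2 * p - 1)), T ** (-p * q / (p * (2 * q + 1) - 2 * q)))
    lam2 = max(4 * math.sqrt(bar_L * Delta_1), sigma_1 * alpha ** (-1 / p))
    gamma = min(math.sqrt(Delta_1 / (bar_L * T)),
                alpha * math.sqrt(Delta_1 / bar_L),
                math.sqrt(Delta_1 / bar_L) / (alpha * T * math.log(T / beta)))
    lam1 = 2 * gamma * bar_L * alpha ** (-1 / q)
    return alpha, gamma, lam1, lam2

# Canonical case p = q = 2: the rate exponent is 1/3, i.e. T = O~(eps^{-3}).
print(rate_exponent(2, 2))   # -> 0.333...
```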

4. Comparative Analysis with Expectation-Only NSGD-MVR

Expectation-only convergence for NSGD-MVR yields $\mathbb{E}[\|\nabla F(\bar x)\|]=O(\epsilon)$ after $T=O(\epsilon^{-2})$ or $T=O(\epsilon^{-2-\frac{p}{q(p-1)}})$ iterations. Double-Clipped NSGD-MVR strengthens this result, establishing the same rates (modulo logarithmic factors) with high probability, i.e., the bounds hold for “all tail events” up to probability $1-\beta$ rather than merely in expectation. The dual clipping parameters $\lambda_1,\lambda_2$ must be set as functions of $\beta$, incurring a $\ln(T/\beta)$ penalty, but enabling strong probabilistic control.

| Algorithm | Guarantee Type | Required Iterations ($T$) |
| --- | --- | --- |
| NSGD-MVR | Expectation | $O(\epsilon^{-2})$ or $O(\epsilon^{-2-\frac{p}{q(p-1)}})$ |
| Double-Clipped NSGD-MVR | High probability | $\widetilde{O}(\epsilon^{-3})$ (for $p=q=2$), with log factors |

5. Key Theoretical Lemmas and Analytical Techniques

The convergence analysis relies on several critical results:

  • Zero-Chain Progress (Lemma A.4): Clipping in each momentum update can cause zero progress in certain directions (the “zero-chain effect”), implying that, with constant probability, some coordinates remain unchanged; this is important for tail control.
  • Martingale Deviation Control (Lemma A.12): Freedman’s inequality is used on sums of clipped noise terms, $\sum\theta_t$ and $\sum\omega_t$, to manage the influence of heavy-tailed gradient fluctuations.
  • Bias-Variance Decomposition for MVR Error (Lemma B.3): Decomposes the momentum error into a geometrically decaying bias (in $\alpha$) and residual terms bounded via $\gamma, \lambda_1, \lambda_2$.
  • Refined Descent Lemmas (Lemmas B.1–B.2): These address the interplay between gradient clipping, variance reduction, and progression towards stationary points under weak smoothness.

The parameter selection $(\alpha,\gamma,\lambda_1,\lambda_2)$ is optimized to balance geometric decay, control of heavy-tail effects, and overall smoothness, leading to high-probability Lyapunov descent by $\Delta_1$ in $O(\epsilon^{-\text{exponent}})$ steps.

6. Context and Significance in Stochastic Nonconvex Optimization

Double-Clipped NSGD-MVR represents a methodological advance for stochastic nonconvex optimization in heavy-tailed regimes. It builds on and improves upon earlier analyses, both sample-complexity lower bounds and algorithmic upper bounds, under broad smoothness and noise conditions. The dual clipping strategy, gradient normalization, and momentum-variance reduction yield a method that is both simple to implement and rigorous in its theoretical guarantees. Furthermore, the approach generalizes beyond standard $L_2$ smoothness, enabling applications to settings previously considered out of scope due to heavy-tailed, non-Gaussian gradient noise (Fradin et al., 21 Dec 2025).
