
NSGD-MVR: Normalized SGD with Momentum VR

Updated 28 December 2025
  • The paper introduces NSGD-MVR, a novel algorithm leveraging gradient normalization, momentum, and variance reduction to achieve optimal rates in nonconvex settings.
  • It employs a momentum-transport trick and dual-sample update rules that cancel bias from Hessian terms, ensuring sharper convergence despite heavy-tailed stochastic noise.
  • Empirical evaluations on tasks like BERT pretraining and ResNet-50 show NSGD-MVR matches or slightly outperforms state-of-the-art optimizers in high-dimensional optimization.

Normalized Stochastic Gradient Descent with Momentum Variance Reduction (NSGD-MVR) is a stochastic first-order optimization algorithm for nonconvex objectives that combines gradient normalization, momentum, and a momentum-based variance reduction technique. NSGD-MVR is designed to optimize high-dimensional, nonconvex loss landscapes under possibly heavy-tailed stochastic noise, matching or approaching lower bounds on convergence rates in both bounded-variance and heavy-tailed noise regimes. It achieves convergence rates competitive with the best known dimension-independent guarantees on smooth nonconvex problems and delivers practical performance matching state-of-the-art optimizers across large-scale tasks such as deep network pretraining (Cutkosky et al., 2020, Fradin et al., 21 Dec 2025).

1. Problem Setting and Model Assumptions

The standard problem addressed by NSGD-MVR is

$$\min_{w\in\mathbb{R}^d} F(w) = \mathbb{E}_{\xi\sim\mathcal{D}}[f(w, \xi)],$$

with $F$ possibly nonconvex, and where the only access to $F$ is via unbiased stochastic gradients $\nabla f(w, \xi)$.

The primary assumptions are:

  • $L$-smoothness: $\|\nabla F(w) - \nabla F(u)\| \leq L \|w - u\|$ for all $w, u$.
  • Second-order smoothness: $\|\nabla^2 F(w) - \nabla^2 F(u)\|_{\mathrm{op}} \leq \rho \|w - u\|$.
  • Lower-boundedness: $F(w) \ge 0$ or $F(w) \geq F^{\inf} > -\infty$.
  • Noise regime: either bounded variance ($\mathbb{E}\|\nabla f(w,\xi) - \nabla F(w)\|^2 \leq \sigma^2$) or generalized $p$-th central moment control ($p$-BCM): $\mathbb{E}\|\nabla f(w,\xi) - \nabla F(w)\|^p \leq \sigma_1^p$ for some $p \in (1,2]$ (heavy-tailed).
  • $q$-Weak Average Smoothness ($q$-WAS): $\mathbb{E}\|\nabla f(w,\xi) - \nabla f(u,\xi)\|^q \leq \bar L^q \|w-u\|^q$ for $q \in [1,2]$.

This setting allows NSGD-MVR to operate under controlled heavy-tailed stochasticity and under broader smoothness assumptions than are standard in the literature (Fradin et al., 21 Dec 2025).
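
To make the oracle model concrete, the following is a minimal sketch (not from the papers) of the stochastic first-order access assumed above; the quadratic toy objective, the Gaussian sampling, and the class interface are illustrative assumptions.

```python
import numpy as np

class StochasticOracle:
    """Unbiased stochastic first-order oracle for F(w) = E_xi[f(w, xi)].

    Toy instance: f(w, xi) = 0.5 * ||w - xi||^2 with xi ~ N(mu, sigma^2 I),
    so grad f(w, xi) = w - xi and E[grad f(w, xi)] = w - mu = grad F(w).
    """

    def __init__(self, mu, sigma, rng=None):
        self.mu = np.asarray(mu, dtype=float)
        self.sigma = float(sigma)
        self.rng = rng or np.random.default_rng(0)

    def sample(self):
        # Draw xi ~ D (Gaussian here; heavier-tailed draws would model the p-BCM regime).
        return self.mu + self.sigma * self.rng.standard_normal(self.mu.shape)

    def stochastic_grad(self, w, xi):
        # Unbiased estimate of grad F(w): for this toy instance, grad f(w, xi) = w - xi.
        return w - xi
```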

2. Algorithmic Principles and Update Rules

NSGD-MVR implements a normalized stochastic gradient descent backbone augmented by a variance-reduced momentum update.

For $t \geq 1$, the key steps are:

  1. Momentum-Variance-Reduced Update:

$$m_t = \beta m_{t-1} + (1 - \beta)\, \nabla f(x_t, \xi_t),$$

where the gradient is evaluated not at $w_t$, but at a transported point

$$x_t = w_t + \frac{\beta}{1 - \beta}\,(w_t - w_{t-1}),$$

with $\beta = 1 - \alpha$ and typically $\alpha \approx T^{-4/7}$ for non-adaptive variants.

  2. Dual-sample MVR variant:

$$g_t = (1 - \alpha)\left[\, g_{t-1} + \nabla f(x_t, \xi_t) - \nabla f(x_{t-1}, \xi_t)\, \right] + \alpha\, \nabla f(x_t, \xi_t).$$

  3. Gradient-normalized step:

$$w_{t+1} = w_t - \eta\, \frac{m_t}{\|m_t\|}$$

or

$$x_{t+1} = x_t - \gamma\, \frac{g_t}{\|g_t\|}.$$

This momentum-transport trick cancels leading Hessian terms in bias and ensures that variance in the momentum buffer decays per-iteration, leading to sharper convergence, especially in the presence of non-Gaussian gradient noise (Cutkosky et al., 2020, Fradin et al., 21 Dec 2025).
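
The pieces above can be assembled into a short reference loop. Below is a minimal NumPy sketch of the single-sample, momentum-transport variant; the `oracle` interface matches the toy sketch in Section 1, and the constant step size and momentum weight are illustrative choices rather than the papers' tuned schedules.

```python
import numpy as np

def nsgd_mvr(oracle, w0, T, eta, alpha, eps=1e-12):
    """Minimal sketch of NSGD-MVR (single-sample, momentum-transport variant).

    oracle.sample() draws xi ~ D; oracle.stochastic_grad(w, xi) returns grad f(w, xi).
    eta is a constant step size and alpha = 1 - beta is the momentum averaging weight.
    """
    beta = 1.0 - alpha
    w_prev = w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    for t in range(T):
        # Transported query point: x_t = w_t + beta/(1 - beta) * (w_t - w_{t-1}).
        x = w + (beta / (1.0 - beta)) * (w - w_prev)
        xi = oracle.sample()
        g = oracle.stochastic_grad(x, xi)
        # Momentum-variance-reduced buffer: m_t = beta * m_{t-1} + (1 - beta) * grad f(x_t, xi_t).
        m = beta * m + (1.0 - beta) * g
        # Normalized step: w_{t+1} = w_t - eta * m_t / ||m_t||.
        w_prev, w = w, w - eta * m / (np.linalg.norm(m) + eps)
    return w
```

For example, `nsgd_mvr(StochasticOracle(np.ones(10), 1.0), np.zeros(10), T=1000, eta=0.01, alpha=1000 ** (-4 / 7))` should drive the iterates toward the minimizer $w = \mu$ of the toy objective.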

3. Theoretical Guarantees and Complexity Results

The analysis of NSGD-MVR yields strong dimension-independent convergence guarantees matching optimal lower bounds:

  • Second-order smooth, bounded-variance regime: under the $L$-smoothness, second-order smoothness, lower-boundedness, and bounded-variance assumptions above, with tuned

$$\alpha = \min\!\left(1,\; R^{4/7}\rho^{2/7}\sigma^{-6/7} T^{-4/7}\right), \qquad \eta = \min\!\left( \sqrt{R/(TL)},\; R^{5/7}\rho^{1/7}\sigma^{-4/7} T^{-5/7}\right),$$

the iterates satisfy

$$\frac{1}{T} \sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\| \leq O\!\left( T^{-1/2} + \sigma^{13/7} T^{-4/7} + (R\rho)^{1/7}\sigma^{4/7} T^{-2/7}\right).$$

To drive the norm of the gradient below $\epsilon$, $T = O(\epsilon^{-3.5})$ suffices (Cutkosky et al., 2020).

  • Heavy-tailed ($p$-BCM, i.e., the bounded $p$-th central moment condition above) and $q$-WAS regime: with

$$N = O\!\left( \left(\frac{\sigma_1}{\epsilon}\right)^{\frac{p}{p-1}} + \frac{\bar L\Delta}{\epsilon^2}\left[1 + \left(\frac{\sigma_1}{\epsilon}\right)^{\frac{p}{q(p-1)}}\right] \right)$$

stochastic gradient calls suffice to obtain an $\epsilon$-stationary point in expectation, which exactly matches the lower bound up to constants for $p \in (1,2]$, $q \in [1,2]$ (Fradin et al., 21 Dec 2025).

In particular, for $q = 2$, $p = 2$ (bounded variance):

$$N = O\big( \sigma_1^2/\epsilon^2 + \bar L\Delta/\epsilon^2 + \bar L\Delta\, \sigma_1/\epsilon^3 \big),$$

recovering the known optimal rate in that regime.
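
As a consistency check (not a new result), substituting $p = q = 2$ into the general bound gives $p/(p-1) = 2$ and $p/[q(p-1)] = 1$, so

$$N = O\!\left( \frac{\sigma_1^2}{\epsilon^2} + \frac{\bar L\Delta}{\epsilon^2}\left[1 + \frac{\sigma_1}{\epsilon}\right] \right) = O\!\left( \frac{\sigma_1^2}{\epsilon^2} + \frac{\bar L\Delta}{\epsilon^2} + \frac{\bar L\Delta\,\sigma_1}{\epsilon^3} \right),$$

which is exactly the bounded-variance complexity displayed above.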

Proof techniques are based on one-step normalized-descent lemmas and recursive bias contraction under normalization and momentum transport, together with control of heavy-tailedness via martingale inequalities (Cutkosky et al., 2020, Fradin et al., 21 Dec 2025).

4. Variance-Reduced Adaptive Variants

An adaptive version of NSGD-MVR uses two independent gradients per iteration to estimate the variance on the fly and adjust hyperparameters accordingly.

At each step:

  • Compute $G_{t+1} = G_t + \|\nabla f(x_t,\xi_t) - \nabla f(x_t,\xi'_t)\|^2 + g^2\big[(t+1)^{1/4} - t^{1/4}\big]$.
  • Set the per-iteration step size and momentum decay according to $G_t$:

$$\eta_t = \frac{C}{[G_t^2 (t+1)^3]^{1/7}}, \qquad \alpha_t = \frac{1}{t\, \eta_{t-1}^2\, G_{t-1}}, \qquad \beta_t = 1 - \alpha_t.$$

This adaptive method guarantees

$$\frac{1}{T} \sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\| = \widetilde O\!\left( \frac{1}{\sqrt{T}} + \sigma^{4/7}\, T^{-2/7} \right)$$

(where $\widetilde O$ hides logarithmic factors), even without knowing $\sigma$ in advance (Cutkosky et al., 2020).
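
The adaptive schedule can be sketched directly from the displayed formulas. In the snippet below, the constant `C`, the floor parameter `g`, the handling of the first iteration, and the clamp on $\alpha_t$ are assumptions added to make the example self-contained; the accumulator and the $\eta_t$, $\alpha_t$ expressions follow the equations above under one reading of the indexing.

```python
import numpy as np

def adaptive_nsgd_mvr_schedule(grad_pairs, C=0.1, g=1e-2):
    """Sketch of the adaptive hyperparameter schedule for NSGD-MVR.

    grad_pairs yields pairs (grad f(x_t, xi_t), grad f(x_t, xi'_t)) of independent
    stochastic gradients evaluated at the same point; returns the lists of eta_t, alpha_t.
    """
    G = 0.0
    etas, alphas = [], []
    eta_prev, G_prev = None, None
    for t, (g1, g2) in enumerate(grad_pairs, start=1):
        # Accumulator: G <- G + ||grad f(x_t, xi_t) - grad f(x_t, xi'_t)||^2 + g^2 [(t+1)^{1/4} - t^{1/4}].
        G += float(np.linalg.norm(g1 - g2) ** 2) + g**2 * ((t + 1) ** 0.25 - t**0.25)
        # Step size: eta_t = C / [G^2 (t+1)^3]^{1/7}.
        eta_t = C / (G**2 * (t + 1) ** 3) ** (1.0 / 7.0)
        if eta_prev is None:
            alpha_t = 1.0  # first step: no (eta_{t-1}, G_{t-1}) yet; fall back to a plain gradient (a choice, not from the paper)
        else:
            # alpha_t = 1 / (t * eta_{t-1}^2 * G_{t-1}); the clamp to [0, 1] is an added safeguard.
            alpha_t = min(1.0, 1.0 / (t * eta_prev**2 * G_prev))
        etas.append(eta_t)
        alphas.append(alpha_t)
        eta_prev, G_prev = eta_t, G
    return etas, alphas
```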

5. Implementation and Empirical Evaluation

Empirical studies implemented NSGD-MVR in large-scale settings:

  • Per-layer normalization: in practice, normalization is applied per layer (rather than to the full parameter vector) for efficiency and stability; see the sketch after this list.
  • BatchNorm and bias parameters: these are not normalized and use a step size $10^3$ times the per-layer value.
  • Momentum and learning-rate schedule: fixed $\beta = 0.9$. For BERT, a linear warm-up and decay matching Adam; for ResNet-50, polynomial decay with a 5-epoch warm-up as in LARS/LAMB.
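
The following is a minimal sketch of one per-layer step as described above; the dict-based parameter handling and the `"bn"`/`"bias"` name matching are illustrative assumptions rather than a framework-specific implementation.

```python
import numpy as np

def per_layer_nsgd_mvr_step(params, momenta, grads, eta, beta, eps=1e-12):
    """Sketch of one per-layer NSGD-MVR step.

    params/momenta/grads are dicts mapping layer names to arrays. Layers whose
    names mark them as BatchNorm or bias parameters are left unnormalized and,
    following the setup above, use a step size 10^3 times the per-layer value.
    """
    for name, w in params.items():
        # Momentum buffer update for this layer.
        m = beta * momenta[name] + (1.0 - beta) * grads[name]
        momenta[name] = m
        if "bn" in name or "bias" in name:
            # Unnormalized group: plain momentum step with the enlarged step size.
            params[name] = w - (1e3 * eta) * m
        else:
            # Per-layer normalization: divide by the norm of this layer's buffer only.
            params[name] = w - eta * m / (np.linalg.norm(m) + eps)
    return params, momenta
```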

Empirical benchmarks:

| Task | Baseline | NSGD-MVR Setting | Result (Accuracy) |
|---|---|---|---|
| BERT pretraining | Adam ($\beta_1 = 0.9$, $\beta_2 = 0.99$) | Per-layer norm, $\beta = 0.9$, $\eta = 10^{-3}$, batch 256 | Adam: 70.76%, NSGD-MVR: 70.91% |
| ResNet-50 on ImageNet | SGD+momentum | Per-layer norm, $\beta = 0.9$, base LR 0.01, batch 1024 | SGD: 76.20%, NSGD-MVR: 76.37% |

These results indicate that NSGD-MVR matches or slightly outperforms established optimizers such as Adam and momentum SGD on these large-scale tasks, supporting its practical viability (Cutkosky et al., 2020).

6. Comparative and Complexity-Theoretic Perspective

| Algorithm | Noise Model | Complexity to $\epsilon$-Stationarity | Requires Normalization? | VR Step? |
|---|---|---|---|---|
| SGD+momentum | $p$-BCM | $O(\epsilon^{-(3p-2)/(p-1)})$ | No | No |
| Minibatch NSGD | $q$-WAS, $p$-BCM | Suboptimal (worse dependence on $\bar L$ for $p < 2$) | Yes | No |
| SGD-MVR (STORM) | $p = q = 2$ | Optimal | No | Yes |
| NSGD-MVR | $p$-BCM, $q$-WAS | Optimal across $p, q$ | Yes | Yes |

Omitting normalization degrades control of the stochastic gradient when $p < 2$ or $q < 2$ (i.e., under heavy tails or relaxed smoothness); omitting the VR step in the momentum yields strictly slower rates (Fradin et al., 21 Dec 2025).

A plausible implication is that NSGD-MVR generalizes and interpolates between regimes where classical SGD, momentum, and modern variance reduction succeed, while matching oracle lower bounds in all settings covered.

7. Extensions and High-Probability Guarantees

Beyond expectation, a "Double-Clipped NSGD-MVR" variant introduces per-iteration clipping to deliver high-probability convergence rates under relaxed assumptions. Freedman-type and Rosenthal-type martingale arguments are used to establish sharp probability tails under pp-BCM noise and qq-WAS conditions (Fradin et al., 21 Dec 2025). This extension broadens the applicability of NSGD-MVR for robust optimization in settings with heavy-tailed or adversarial noise.
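
The exact clipping rule of Double-Clipped NSGD-MVR is given in the paper and is not reproduced here; the snippet below is only a hedged sketch of how a standard norm-clipping operator could be applied twice within the dual-sample MVR update (once to the fresh gradient, once to the correction term), with thresholds `tau1` and `tau2` as hypothetical parameters.

```python
import numpy as np

def clip(v, tau, eps=1e-12):
    """Standard norm clipping: scales v so that ||clip(v, tau)|| <= tau."""
    return v * min(1.0, tau / (np.linalg.norm(v) + eps))

def double_clipped_mvr_update(g_prev, grad_new, grad_old, alpha, tau1, tau2):
    """One possible double-clipped MVR estimator update (illustrative only, not the paper's rule).

    grad_new = grad f(x_t, xi_t), grad_old = grad f(x_{t-1}, xi_t); the fresh gradient
    and the variance-reduction correction are clipped separately before being mixed
    as in the dual-sample update.
    """
    correction = clip(grad_new - grad_old, tau2)
    fresh = clip(grad_new, tau1)
    return (1.0 - alpha) * (g_prev + correction) + alpha * fresh
```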


For comprehensive technical details, refer to "Momentum Improves Normalized SGD" (Cutkosky et al., 2020) and "Tight Lower Bounds and Optimal Algorithms for Stochastic Nonconvex Optimization with Heavy-Tailed Noise" (Fradin et al., 21 Dec 2025).
