
NSGD-MVR: Normalized SGD with Momentum VR

Updated 28 December 2025
  • The paper introduces NSGD-MVR, a novel algorithm leveraging gradient normalization, momentum, and variance reduction to achieve optimal rates in nonconvex settings.
  • It employs a momentum-transport trick and dual-sample update rules that cancel bias from Hessian terms, ensuring sharper convergence despite heavy-tailed stochastic noise.
  • Empirical evaluations on tasks like BERT pretraining and ResNet-50 show NSGD-MVR matches or slightly outperforms state-of-the-art optimizers in high-dimensional optimization.

Normalized Stochastic Gradient Descent with Momentum Variance Reduction (NSGD-MVR) is a stochastic first-order optimization algorithm for nonconvex objectives that combines gradient normalization, momentum, and a momentum-based variance reduction technique. NSGD-MVR is designed to optimize high-dimensional, nonconvex loss landscapes under possibly heavy-tailed stochastic noise, matching or approaching lower bounds on convergence rates in both bounded-variance and heavy-tailed noise regimes. It achieves convergence rates competitive with the best known dimension-independent guarantees on smooth nonconvex problems and delivers practical performance matching state-of-the-art optimizers across large-scale tasks such as deep network pretraining (Cutkosky et al., 2020, Fradin et al., 21 Dec 2025).

1. Problem Setting and Model Assumptions

The standard problem addressed by NSGD-MVR is

$$\min_{w\in\mathbb{R}^d} F(w) = \mathbb{E}_{\xi\sim\mathcal{D}}[f(w, \xi)],$$

with $F$ possibly nonconvex, and where the only access to $F$ is via unbiased stochastic gradients $\nabla f(w, \xi)$.

The primary assumptions are:

  • $L$-smoothness: $\|\nabla F(w) - \nabla F(u)\| \leq L \|w - u\|$ for all $w, u$.
  • Second-order smoothness: $\|\nabla^2 F(w) - \nabla^2 F(u)\|_{\mathrm{op}} \leq \rho \|w - u\|$.
  • Lower-boundedness: $F(w) \ge 0$ or $F(w) \geq F^{\inf} > -\infty$.
  • Noise regime: either bounded variance ($\mathbb{E}\|\nabla f(w,\xi) - \nabla F(w)\|^2 \leq \sigma^2$) or generalized $p$-th central moment control ($p$-BCM): $\mathbb{E}\|\nabla f(w,\xi) - \nabla F(w)\|^p \leq \sigma_1^p$ for some $p \in (1,2]$ (heavy-tailed).
  • $q$-Weak Average Smoothness ($q$-WAS): $\mathbb{E}\|\nabla f(w,\xi) - \nabla f(u,\xi)\|^q \leq \bar L^q \|w-u\|^q$ for $q \in [1,2]$.

This setting allows NSGD-MVR to operate under controlled heavy-tailed stochasticity and under broader smoothness assumptions than are standard in the literature (Fradin et al., 21 Dec 2025).
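
To make the oracle model concrete, the following is a minimal sketch (not from the papers) of the stochastic first-order access assumed above; the quadratic toy objective, the Gaussian sampling, and the class interface are illustrative assumptions.

```python
import numpy as np

class StochasticOracle:
    """Unbiased stochastic first-order oracle for F(w) = E_xi[f(w, xi)].

    Toy instance: f(w, xi) = 0.5 * ||w - xi||^2 with xi ~ N(mu, sigma^2 I),
    so grad f(w, xi) = w - xi and E[grad f(w, xi)] = w - mu = grad F(w).
    """

    def __init__(self, mu, sigma, rng=None):
        self.mu = np.asarray(mu, dtype=float)
        self.sigma = float(sigma)
        self.rng = rng or np.random.default_rng(0)

    def sample(self):
        # Draw xi ~ D (Gaussian here; heavier-tailed draws would model the p-BCM regime).
        return self.mu + self.sigma * self.rng.standard_normal(self.mu.shape)

    def stochastic_grad(self, w, xi):
        # Unbiased estimate of grad F(w): for this toy instance, grad f(w, xi) = w - xi.
        return w - xi
```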

2. Algorithmic Principles and Update Rules

NSGD-MVR implements a normalized stochastic gradient descent backbone augmented by a variance-reduced momentum update.

For $t \geq 1$, the key steps are:

  1. Momentum-Variance-Reduced Update:

$$m_t = \beta m_{t-1} + (1 - \beta)\, \nabla f(x_t, \xi_t),$$

where the gradient is evaluated not at $w_t$, but at a transported point

$$x_t = w_t + \frac{\beta}{1 - \beta}\,(w_t - w_{t-1}),$$

with $\beta = 1 - \alpha$ and typically $\alpha \approx T^{-4/7}$ for non-adaptive variants.

  2. Dual-sample MVR variant:

$$g_t = (1 - \alpha)\left[\, g_{t-1} + \nabla f(x_t, \xi_t) - \nabla f(x_{t-1}, \xi_t)\, \right] + \alpha\, \nabla f(x_t, \xi_t).$$

  3. Gradient-normalized step:

$$w_{t+1} = w_t - \eta\, \frac{m_t}{\|m_t\|}$$

or

$$x_{t+1} = x_t - \gamma\, \frac{g_t}{\|g_t\|}.$$

This momentum-transport trick cancels leading Hessian terms in bias and ensures that variance in the momentum buffer decays per-iteration, leading to sharper convergence, especially in the presence of non-Gaussian gradient noise (Cutkosky et al., 2020, Fradin et al., 21 Dec 2025).
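
The pieces above can be assembled into a short reference loop. Below is a minimal NumPy sketch of the single-sample, momentum-transport variant; the `oracle` interface matches the toy sketch in Section 1, and the constant step size and momentum weight are illustrative choices rather than the papers' tuned schedules.

```python
import numpy as np

def nsgd_mvr(oracle, w0, T, eta, alpha, eps=1e-12):
    """Minimal sketch of NSGD-MVR (single-sample, momentum-transport variant).

    oracle.sample() draws xi ~ D; oracle.stochastic_grad(w, xi) returns grad f(w, xi).
    eta is a constant step size and alpha = 1 - beta is the momentum averaging weight.
    """
    beta = 1.0 - alpha
    w_prev = w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    for t in range(T):
        # Transported query point: x_t = w_t + beta/(1 - beta) * (w_t - w_{t-1}).
        x = w + (beta / (1.0 - beta)) * (w - w_prev)
        xi = oracle.sample()
        g = oracle.stochastic_grad(x, xi)
        # Momentum-variance-reduced buffer: m_t = beta * m_{t-1} + (1 - beta) * grad f(x_t, xi_t).
        m = beta * m + (1.0 - beta) * g
        # Normalized step: w_{t+1} = w_t - eta * m_t / ||m_t||.
        w_prev, w = w, w - eta * m / (np.linalg.norm(m) + eps)
    return w
```

For example, `nsgd_mvr(StochasticOracle(np.ones(10), 1.0), np.zeros(10), T=1000, eta=0.01, alpha=1000 ** (-4 / 7))` should drive the iterates toward the minimizer $w = \mu$ of the toy objective.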

3. Theoretical Guarantees and Complexity Results

The analysis of NSGD-MVR yields strong dimension-independent convergence guarantees matching optimal lower bounds:

  • Second-order smooth, bounded-variance regime: under the $L$-smoothness, second-order smoothness, lower-boundedness, and bounded-variance assumptions above, with tuned

$$\alpha = \min\!\left(1,\; R^{4/7}\rho^{2/7}\sigma^{-6/7} T^{-4/7}\right), \qquad \eta = \min\!\left( \sqrt{R/(TL)},\; R^{5/7}\rho^{1/7}\sigma^{-4/7} T^{-5/7}\right),$$

the iterates satisfy

$$\frac{1}{T} \sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\| \leq O\!\left( T^{-1/2} + \sigma^{13/7} T^{-4/7} + (R\rho)^{1/7}\sigma^{4/7} T^{-2/7}\right).$$

To drive the norm of the gradient below $\epsilon$, $T = O(\epsilon^{-3.5})$ suffices (Cutkosky et al., 2020).

  • Heavy-tailed ($p$-BCM, i.e., the bounded $p$-th central moment condition above) and $q$-WAS regime: with

$$N = O\!\left( \left(\frac{\sigma_1}{\epsilon}\right)^{\frac{p}{p-1}} + \frac{\bar L\Delta}{\epsilon^2}\left[1 + \left(\frac{\sigma_1}{\epsilon}\right)^{\frac{p}{q(p-1)}}\right] \right)$$

stochastic gradient calls suffice to obtain an $\epsilon$-stationary point in expectation, which exactly matches the lower bound up to constants for $p \in (1,2]$, $q \in [1,2]$ (Fradin et al., 21 Dec 2025).

In particular, for $q = 2$, $p = 2$ (bounded variance):

$$N = O\big( \sigma_1^2/\epsilon^2 + \bar L\Delta/\epsilon^2 + \bar L\Delta\, \sigma_1/\epsilon^3 \big),$$

recovering the known optimal rate in that regime.
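
As a consistency check (not a new result), substituting $p = q = 2$ into the general bound gives $p/(p-1) = 2$ and $p/[q(p-1)] = 1$, so

$$N = O\!\left( \frac{\sigma_1^2}{\epsilon^2} + \frac{\bar L\Delta}{\epsilon^2}\left[1 + \frac{\sigma_1}{\epsilon}\right] \right) = O\!\left( \frac{\sigma_1^2}{\epsilon^2} + \frac{\bar L\Delta}{\epsilon^2} + \frac{\bar L\Delta\,\sigma_1}{\epsilon^3} \right),$$

which is exactly the bounded-variance complexity displayed above.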

Proof techniques are based on one-step normalized-descent lemmas and recursive bias contraction under normalization and momentum transport, together with control of heavy-tailedness via martingale inequalities (Cutkosky et al., 2020, Fradin et al., 21 Dec 2025).

4. Variance-Reduced Adaptive Variants

An adaptive version of NSGD-MVR uses two independent gradients per iteration to estimate the variance on the fly and adjust hyperparameters accordingly.

At each step:

  • Compute $G_{t+1} = G_t + \|\nabla f(x_t,\xi_t) - \nabla f(x_t,\xi'_t)\|^2 + g^2\big[(t+1)^{1/4} - t^{1/4}\big]$.
  • Set the per-iteration step size and momentum decay according to $G_t$:

$$\eta_t = \frac{C}{[G_t^2 (t+1)^3]^{1/7}}, \qquad \alpha_t = \frac{1}{t\, \eta_{t-1}^2\, G_{t-1}}, \qquad \beta_t = 1 - \alpha_t.$$

This adaptive method guarantees

$$\frac{1}{T} \sum_{t=1}^T \mathbb{E}\|\nabla F(w_t)\| = \widetilde O\!\left( \frac{1}{\sqrt{T}} + \sigma^{4/7}\, T^{-2/7} \right)$$

(where $\widetilde O$ hides logarithmic factors), even without knowing $\sigma$ in advance (Cutkosky et al., 2020).
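
The adaptive schedule can be sketched directly from the displayed formulas. In the snippet below, the constant `C`, the floor parameter `g`, the handling of the first iteration, and the clamp on $\alpha_t$ are assumptions added to make the example self-contained; the accumulator and the $\eta_t$, $\alpha_t$ expressions follow the equations above under one reading of the indexing.

```python
import numpy as np

def adaptive_nsgd_mvr_schedule(grad_pairs, C=0.1, g=1e-2):
    """Sketch of the adaptive hyperparameter schedule for NSGD-MVR.

    grad_pairs yields pairs (grad f(x_t, xi_t), grad f(x_t, xi'_t)) of independent
    stochastic gradients evaluated at the same point; returns the lists of eta_t, alpha_t.
    """
    G = 0.0
    etas, alphas = [], []
    eta_prev, G_prev = None, None
    for t, (g1, g2) in enumerate(grad_pairs, start=1):
        # Accumulator: G <- G + ||grad f(x_t, xi_t) - grad f(x_t, xi'_t)||^2 + g^2 [(t+1)^{1/4} - t^{1/4}].
        G += float(np.linalg.norm(g1 - g2) ** 2) + g**2 * ((t + 1) ** 0.25 - t**0.25)
        # Step size: eta_t = C / [G^2 (t+1)^3]^{1/7}.
        eta_t = C / (G**2 * (t + 1) ** 3) ** (1.0 / 7.0)
        if eta_prev is None:
            alpha_t = 1.0  # first step: no (eta_{t-1}, G_{t-1}) yet; fall back to a plain gradient (a choice, not from the paper)
        else:
            # alpha_t = 1 / (t * eta_{t-1}^2 * G_{t-1}); the clamp to [0, 1] is an added safeguard.
            alpha_t = min(1.0, 1.0 / (t * eta_prev**2 * G_prev))
        etas.append(eta_t)
        alphas.append(alpha_t)
        eta_prev, G_prev = eta_t, G
    return etas, alphas
```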

5. Implementation and Empirical Evaluation

Empirical studies implemented NSGD-MVR in large-scale settings:

  • Per-layer normalization: in practice, normalization is applied per layer (rather than to the full parameter vector) for efficiency and stability; see the sketch after this list.
  • BatchNorm and bias parameters: these are not normalized and use a step size $10^3$ times the per-layer value.
  • Momentum and learning-rate schedule: fixed $\beta = 0.9$. For BERT, a linear warm-up and decay matching Adam; for ResNet-50, polynomial decay with a 5-epoch warm-up as in LARS/LAMB.
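
The following is a minimal sketch of one per-layer step as described above; the dict-based parameter handling and the `"bn"`/`"bias"` name matching are illustrative assumptions rather than a framework-specific implementation.

```python
import numpy as np

def per_layer_nsgd_mvr_step(params, momenta, grads, eta, beta, eps=1e-12):
    """Sketch of one per-layer NSGD-MVR step.

    params/momenta/grads are dicts mapping layer names to arrays. Layers whose
    names mark them as BatchNorm or bias parameters are left unnormalized and,
    following the setup above, use a step size 10^3 times the per-layer value.
    """
    for name, w in params.items():
        # Momentum buffer update for this layer.
        m = beta * momenta[name] + (1.0 - beta) * grads[name]
        momenta[name] = m
        if "bn" in name or "bias" in name:
            # Unnormalized group: plain momentum step with the enlarged step size.
            params[name] = w - (1e3 * eta) * m
        else:
            # Per-layer normalization: divide by the norm of this layer's buffer only.
            params[name] = w - eta * m / (np.linalg.norm(m) + eps)
    return params, momenta
```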

Empirical benchmarks:

| Task | Baseline | NSGD-MVR Setting | Result (Accuracy) |
|---|---|---|---|
| BERT pretraining | Adam ($\beta_1 = 0.9$, $\beta_2 = 0.99$) | Per-layer norm, $\beta = 0.9$, $\eta = 10^{-3}$, batch 256 | Adam: 70.76%, NSGD-MVR: 70.91% |
| ResNet-50 on ImageNet | SGD+momentum | Per-layer norm, $\beta = 0.9$, base LR 0.01, batch 1024 | SGD: 76.20%, NSGD-MVR: 76.37% |

These results indicate that NSGD-MVR matches or slightly outperforms established optimizers such as Adam and momentum SGD on these large-scale tasks, supporting its practical viability (Cutkosky et al., 2020).

6. Comparative and Complexity-Theoretic Perspective

| Algorithm | Noise Model | Complexity to $\epsilon$-Stationarity | Requires Normalization? | VR Step? |
|---|---|---|---|---|
| SGD+momentum | $p$-BCM | $O(\epsilon^{-(3p-2)/(p-1)})$ | No | No |
| Minibatch NSGD | $q$-WAS, $p$-BCM | Suboptimal (worse dependence on $\bar L$ for $p < 2$) | Yes | No |
| SGD-MVR (STORM) | $p = q = 2$ | Optimal | No | Yes |
| NSGD-MVR | $p$-BCM, $q$-WAS | Optimal across $p, q$ | Yes | Yes |

Omitting normalization degrades control of the stochastic gradient when $p < 2$ or $q < 2$ (i.e., under heavy tails or relaxed smoothness); omitting the VR step in the momentum yields strictly slower rates (Fradin et al., 21 Dec 2025).

A plausible implication is that NSGD-MVR generalizes and interpolates between regimes where classical SGD, momentum, and modern variance reduction succeed, while matching oracle lower bounds in all settings covered.

7. Extensions and High-Probability Guarantees

Beyond expectation, a "Double-Clipped NSGD-MVR" variant introduces per-iteration clipping to deliver high-probability convergence rates under relaxed assumptions. Freedman-type and Rosenthal-type martingale arguments are used to establish sharp probability tails under pp-BCM noise and qq-WAS conditions (Fradin et al., 21 Dec 2025). This extension broadens the applicability of NSGD-MVR for robust optimization in settings with heavy-tailed or adversarial noise.
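
The exact clipping rule of Double-Clipped NSGD-MVR is given in the paper and is not reproduced here; the snippet below is only a hedged sketch of how a standard norm-clipping operator could be applied twice within the dual-sample MVR update (once to the fresh gradient, once to the correction term), with thresholds `tau1` and `tau2` as hypothetical parameters.

```python
import numpy as np

def clip(v, tau, eps=1e-12):
    """Standard norm clipping: scales v so that ||clip(v, tau)|| <= tau."""
    return v * min(1.0, tau / (np.linalg.norm(v) + eps))

def double_clipped_mvr_update(g_prev, grad_new, grad_old, alpha, tau1, tau2):
    """One possible double-clipped MVR estimator update (illustrative only, not the paper's rule).

    grad_new = grad f(x_t, xi_t), grad_old = grad f(x_{t-1}, xi_t); the fresh gradient
    and the variance-reduction correction are clipped separately before being mixed
    as in the dual-sample update.
    """
    correction = clip(grad_new - grad_old, tau2)
    fresh = clip(grad_new, tau1)
    return (1.0 - alpha) * (g_prev + correction) + alpha * fresh
```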


For comprehensive technical details, refer to "Momentum Improves Normalized SGD" (Cutkosky et al., 2020) and "Tight Lower Bounds and Optimal Algorithms for Stochastic Nonconvex Optimization with Heavy-Tailed Noise" (Fradin et al., 21 Dec 2025).
