NSGD-MVR: Normalized SGD with Momentum VR
- The paper introduces NSGD-MVR, a novel algorithm leveraging gradient normalization, momentum, and variance reduction to achieve optimal rates in nonconvex settings.
- It employs a momentum-transport trick and dual-sample update rules that cancel bias from Hessian terms, ensuring sharper convergence despite heavy-tailed stochastic noise.
- Empirical evaluations on tasks like BERT pretraining and ResNet-50 show NSGD-MVR matches or slightly outperforms state-of-the-art optimizers in high-dimensional optimization.
Normalized Stochastic Gradient Descent with Momentum Variance Reduction (NSGD-MVR) is a stochastic first-order optimization algorithm for nonconvex objectives that combines gradient normalization, momentum, and a momentum-based variance reduction technique. NSGD-MVR is designed to optimize high-dimensional, nonconvex loss landscapes under possibly heavy-tailed stochastic noise, matching or approaching lower bounds on convergence rates in both bounded-variance and heavy-tailed noise regimes. It achieves convergence rates competitive with the best known dimension-independent guarantees on smooth nonconvex problems and delivers practical performance matching state-of-the-art optimizers across large-scale tasks such as deep network pretraining (Cutkosky et al., 2020, Fradin et al., 21 Dec 2025).
1. Problem Setting and Model Assumptions
The standard problem addressed by NSGD-MVR is
$$\min_{x \in \mathbb{R}^d} F(x) := \mathbb{E}_{\xi \sim \mathcal{D}}\bigl[f(x, \xi)\bigr],$$
with $F$ possibly nonconvex, and where the only access to $F$ is via unbiased stochastic gradients $\nabla f(x, \xi)$ satisfying $\mathbb{E}_{\xi}[\nabla f(x, \xi)] = \nabla F(x)$.
The primary assumptions are:
- $L$-smoothness: $\|\nabla F(x) - \nabla F(y)\| \le L\,\|x - y\|$ for all $x, y$.
- Second-order smoothness: $\|\nabla^2 F(x) - \nabla^2 F(y)\| \le \rho\,\|x - y\|$.
- Lower-boundedness: $F^\star := \inf_x F(x) > -\infty$, or equivalently $\Delta := F(x_1) - F^\star < \infty$.
- Noise regime: EITHER bounded variance ($\mathbb{E}_{\xi}\|\nabla f(x,\xi) - \nabla F(x)\|^2 \le \sigma^2$) OR generalized $p$-th central moment control: $\mathbb{E}_{\xi}\|\nabla f(x,\xi) - \nabla F(x)\|^p \le \sigma^p$ for some $p \in (1, 2]$ (heavy-tailed, $p$-BCM).
- $p$-Weak Average Smoothness ($p$-WAS): an averaged (in expectation over $\xi$) smoothness condition on the stochastic gradients $\nabla f(\cdot, \xi)$, which is weaker than requiring each individual $f(\cdot, \xi)$ to be Lipschitz-smooth.
This setting allows NSGD-MVR to operate under controlled heavy-tailed stochasticity, and broader smoothness than is standard in the literature (Fradin et al., 21 Dec 2025).
2. Algorithmic Principles and Update Rules
NSGD-MVR implements a normalized stochastic gradient descent backbone augmented by a variance-reduced momentum update.
For $t = 1, \dots, T$, the key steps are:
- Momentum-variance-reduced update:
$$m_t = (1 - \beta)\, m_{t-1} + \beta\, \nabla f(z_t, \xi_t),$$
where the gradient is evaluated not at $x_t$, but at a transported point
$$z_t = x_t + \frac{1-\beta}{\beta}\,(x_t - x_{t-1}),$$
with the momentum parameter $\beta \in (0, 1]$ and the step size $\eta$ held constant for non-adaptive variants.
- Dual-sample MVR variant (two gradient evaluations per step, on the same sample $\xi_t$):
$$m_t = \nabla f(x_t, \xi_t) + (1 - \beta)\bigl(m_{t-1} - \nabla f(x_{t-1}, \xi_t)\bigr).$$
- Gradient-normalized step:
$$x_{t+1} = x_t - \eta\, \frac{m_t}{\|m_t\|},$$
or, with a per-iteration step size,
$$x_{t+1} = x_t - \eta_t\, \frac{m_t}{\|m_t\|}.$$
This momentum-transport trick cancels leading Hessian terms in bias and ensures that variance in the momentum buffer decays per-iteration, leading to sharper convergence, especially in the presence of non-Gaussian gradient noise (Cutkosky et al., 2020, Fradin et al., 21 Dec 2025).
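The loop below is a minimal NumPy sketch of the single-sample variant (momentum transport plus normalized step), assuming a generic unbiased oracle `stochastic_grad(x)`; the function name, default hyperparameters, and the small denominator guard are illustrative choices, not taken from the papers.

```python
import numpy as np

def nsgd_mvr(x0, stochastic_grad, T=1000, eta=0.01, beta=0.1):
    """Sketch of NSGD-MVR: normalized SGD with momentum-transported VR.

    stochastic_grad(x) must return an unbiased estimate of grad F(x).
    """
    x_prev = x0.copy()
    x = x0.copy()
    m = stochastic_grad(x)                       # initialize momentum buffer
    for _ in range(T):
        # Transported point: the fresh gradient is taken ahead of x_t so that
        # the leading Hessian term in the momentum bias cancels.
        z = x + ((1.0 - beta) / beta) * (x - x_prev)
        m = (1.0 - beta) * m + beta * stochastic_grad(z)
        # Normalized step: only the direction of m_t is used.
        x_prev, x = x, x - eta * m / (np.linalg.norm(m) + 1e-12)
    return x
```

For example, `nsgd_mvr(np.zeros(10), lambda x: 2.0 * x + 0.1 * np.random.randn(10))` runs the sketch on a noisy quadratic; the normalization keeps the step length fixed at `eta` regardless of the gradient scale.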
3. Theoretical Guarantees and Complexity Results
The analysis of NSGD-MVR yields strong dimension-independent convergence guarantees matching optimal lower bounds:
- Second-order smooth, bounded-variance regime: Under assumptions (A1)–(A4), with $\beta$ and $\eta$ tuned as functions of $L$, $\rho$, $\sigma$, $\Delta$, and the horizon $T$, the iterates satisfy
$$\mathbb{E}\Bigl[\min_{1 \le t \le T} \|\nabla F(x_t)\|\Bigr] = O\!\bigl(T^{-2/7}\bigr)$$
(with problem constants treated as fixed). To drive the norm of the gradient below $\epsilon$, $T = O(\epsilon^{-3.5})$ stochastic gradient evaluations suffice (Cutkosky et al., 2020).
- Heavy-tailed ($p$-BCM) and $p$-WAS regime: With $\beta$ and $\eta$ tuned to the problem parameters and the target accuracy,
$$O\!\left(\epsilon^{-\frac{2p-1}{p-1}}\right)$$
stochastic gradient calls suffice to obtain an $\epsilon$-stationary point in expectation, which exactly matches the lower bound up to constants for every $p \in (1, 2]$ (Fradin et al., 21 Dec 2025).
In particular, for $p = 2$ (bounded variance), this becomes
$$O\!\left(\epsilon^{-3}\right),$$
recovering the optimal variance-reduced rate.
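A quick sanity check of the exponents at the bounded-variance endpoint, contrasting the variance-reduced rate above with the non-variance-reduced rate that appears in the comparison table of Section 6:
$$\left.\frac{2p-1}{p-1}\right|_{p=2} = \frac{3}{1} = 3, \qquad \left.\frac{3p-2}{p-1}\right|_{p=2} = \frac{4}{1} = 4,$$
and more generally $\frac{3p-2}{p-1} = \frac{2p-1}{p-1} + 1$, so the VR step improves the complexity exponent by exactly one for every $p \in (1, 2]$.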
Proof techniques are based on one-step normalized descent lemmas and recursive bias contraction in the presence of normalization and momentum-transport, along with control of heavy-tailness via martingale inequalities (Cutkosky et al., 2020, Fradin et al., 21 Dec 2025).
4. Variance-Reduced Adaptive Variants
An adaptive version of NSGD-MVR uses two independent gradients per iteration to estimate the on-the-fly variance and adjust hyperparameters accordingly.
At each step:
- Compute two independent stochastic gradients $\nabla f(x_t, \xi_t)$ and $\nabla f(x_t, \xi_t')$ at the current point, and use their difference as an on-the-fly estimate of the gradient noise.
- Set the per-iteration step size $\eta_t$ and momentum decay $\beta_t$ according to this noise estimate.
This adaptive method retains the convergence guarantees of the tuned variant up to logarithmic factors ($\tilde{O}(\cdot)$ hides logarithmic factors), even without knowing $\sigma$ in advance (Cutkosky et al., 2020).
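A hedged sketch of the dual-sample adaptive loop; the noise proxy and the specific dependence of `beta_t` and `eta_t` on it are illustrative placeholders, not the tuned schedules derived in the paper.

```python
import numpy as np

def adaptive_nsgd_mvr(x0, stochastic_grad, T=1000, eta0=0.01, c=1.0):
    """Sketch: two independent gradients per step drive on-the-fly tuning.

    The formulas for beta_t and eta_t below are illustrative assumptions;
    only the structure (estimate the noise, then adapt both knobs) follows
    the description above.
    """
    x_prev, x = x0.copy(), x0.copy()
    m = stochastic_grad(x)
    beta_t = 1.0                                  # first step: no transport
    for _ in range(T):
        # Transported point uses the previous iteration's momentum weight.
        z = x + ((1.0 - beta_t) / beta_t) * (x - x_prev)
        g1, g2 = stochastic_grad(z), stochastic_grad(z)   # two independent samples
        sigma2_hat = 0.5 * np.sum((g1 - g2) ** 2)         # unbiased noise proxy
        beta_t = float(np.clip(c / (1.0 + sigma2_hat), 0.05, 1.0))  # noisier -> more averaging (illustrative)
        eta_t = eta0 / (1.0 + np.sqrt(sigma2_hat))                  # noisier -> smaller steps (illustrative)
        m = (1.0 - beta_t) * m + beta_t * 0.5 * (g1 + g2)
        x_prev, x = x, x - eta_t * m / (np.linalg.norm(m) + 1e-12)
    return x
```

The two independent evaluations at the transported point provide both an unbiased gradient (their average) and a noise estimate (their scaled squared difference), so no extra oracle calls beyond the two per iteration are needed.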
5. Implementation and Empirical Evaluation
Empirical studies implemented NSGD-MVR in large-scale settings:
- Per-layer normalization: In practice, normalization is applied per layer (not to the full parameter vector) for efficiency and stability (see the sketch after this list).
- BatchNorm and bias parameters: These are not normalized and use a step size set to a fixed multiple of the per-layer value.
- Momentum and learning-rate schedule: The momentum parameter $\beta$ is held fixed. For BERT, a linear warm-up and decay schedule is used, matching Adam; for ResNet-50, polynomial decay with a 5-epoch warm-up as in LARS/LAMB.
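A minimal sketch of the per-layer step over a dictionary of parameter arrays, consistent with the bullets above; the layer-name test, the `bias_scale` multiplier, and the function name are illustrative assumptions rather than the published implementation.

```python
import numpy as np

def per_layer_nsgd_mvr_step(params, momenta, eta=0.01, bias_scale=0.1):
    """One NSGD-MVR parameter update with per-layer normalization.

    params, momenta: dicts mapping layer names to NumPy arrays; momenta is
    assumed to already hold the momentum-VR buffers m_t for each layer.
    bias_scale is an illustrative multiplier for un-normalized tensors.
    """
    for name, p in params.items():
        m = momenta[name]
        if "bias" in name or "bn" in name:
            # BatchNorm/bias tensors: plain (un-normalized) momentum step
            # with a rescaled step size, as described above.
            p -= eta * bias_scale * m
        else:
            # Weight tensors: normalize the momentum buffer layer-wise.
            p -= eta * m / (np.linalg.norm(m) + 1e-12)
```

In practice the momentum buffers would be maintained with the MVR update from Section 2; only the parameter update differs per layer.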
Empirical benchmarks:
| Task | Baseline | NSGD-MVR Setting | Result (Accuracy) |
|---|---|---|---|
| BERT pretraining | Adam | Per-layer norm, fixed $\beta$ and step size, batch size 256 | Adam: 70.76%, NSGD-MVR: 70.91% |
| ResNet-50 on ImageNet | SGD+momentum | Per-layer norm, fixed $\beta$, base LR 0.01, batch size 1024 | SGD: 76.20%, NSGD-MVR: 76.37% |
These results indicate that NSGD-MVR matches or slightly outperforms established optimizers such as Adam and momentum SGD on these large-scale tasks, supporting its practical viability (Cutkosky et al., 2020).
6. Comparative and Complexity-Theoretic Perspective
| Algorithm | Noise Model | $\epsilon$-Stationarity Complexity | Requires Normalization? | VR Step? |
|---|---|---|---|---|
| SGD+momentum | $p$-BCM | $O(\epsilon^{-4})$ at $p = 2$; no dimension-free guarantee for $p < 2$ | No | No |
| Minibatch NSGD | $p$-WAS, $p$-BCM | $O\bigl(\epsilon^{-(3p-2)/(p-1)}\bigr)$ (worse in $\epsilon$ for $p \in (1, 2]$) | Yes | No |
| SGD-MVR (STORM) | bounded variance ($p = 2$) | $O(\epsilon^{-3})$, optimal | No | Yes |
| NSGD-MVR | $p$-BCM, $p$-WAS | $O\bigl(\epsilon^{-(2p-1)/(p-1)}\bigr)$, optimal across $p \in (1, 2]$ | Yes | Yes |
Omitting normalization degrades control of the stochastic gradient when $p < 2$ or when only weak average smoothness holds (i.e., under heavy tails or relaxed smoothness); omitting the VR correction in the momentum yields strictly slower rates (Fradin et al., 21 Dec 2025).
A plausible implication is that NSGD-MVR generalizes and interpolates between regimes where classical SGD, momentum, and modern variance reduction succeed, while matching oracle lower bounds in all settings covered.
7. Extensions and High-Probability Guarantees
Beyond expectation, a "Double-Clipped NSGD-MVR" variant introduces per-iteration clipping to deliver high-probability convergence rates under relaxed assumptions. Freedman-type and Rosenthal-type martingale arguments are used to establish sharp probability tails under -BCM noise and -WAS conditions (Fradin et al., 21 Dec 2025). This extension broadens the applicability of NSGD-MVR for robust optimization in settings with heavy-tailed or adversarial noise.
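The exact Double-Clipped update rule is specified in Fradin et al. (21 Dec 2025); as a rough, hypothetical illustration of per-iteration clipping in this style, the sketch below clips both the fresh gradient and the momentum buffer before the normalized step (the thresholds `tau_g` and `tau_m`, and the placement of the two clips, are assumptions for illustration only).

```python
import numpy as np

def clip(v, tau):
    """Scale v down so that its norm is at most tau."""
    n = np.linalg.norm(v)
    return v if n <= tau else v * (tau / n)

def double_clipped_step(x, x_prev, m, stochastic_grad, eta=0.01, beta=0.1,
                        tau_g=10.0, tau_m=10.0):
    """Illustrative single step with clipping applied at two places.

    This sketches only the general idea (clip, momentum-VR, normalized step);
    the thresholds and the exact placement of the clips in the published
    algorithm may differ.
    """
    z = x + ((1.0 - beta) / beta) * (x - x_prev)
    g = clip(stochastic_grad(z), tau_g)                # first clip: fresh gradient
    m = clip((1.0 - beta) * m + beta * g, tau_m)       # second clip: momentum buffer
    x_new = x - eta * m / (np.linalg.norm(m) + 1e-12)  # normalized step
    return x_new, x, m
```

Clipping bounds the influence of any single heavy-tailed sample on the iterate, which is what the Freedman- and Rosenthal-type martingale arguments exploit to obtain sharp high-probability tails.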
For comprehensive technical details, refer to "Momentum Improves Normalized SGD" (Cutkosky et al., 2020) and "Tight Lower Bounds and Optimal Algorithms for Stochastic Nonconvex Optimization with Heavy-Tailed Noise" (Fradin et al., 21 Dec 2025).