MARS-AdamW Optimizer for Scalable LLM Training

Updated 9 February 2026
  • MARS-AdamW is an optimizer that unifies AdamW momentum preconditioning with STORM-inspired variance reduction to improve convergence in large-scale neural network training.
  • The algorithm employs gradient clipping and stabilization techniques to manage variance and ensure robust training across different model scales.
  • Empirical evaluations on GPT-2 pretraining show that MARS-AdamW substantially reduces the training tokens and wall-clock time needed to reach a given validation loss compared to vanilla AdamW, while matching or slightly improving the final validation loss.

MARS-AdamW is an optimizer instance derived from the MARS (Make vAriance Reduction Shine) framework, which integrates preconditioned adaptive gradient methods with scalable stochastic variance reduction to improve the efficiency and convergence of large model training. Specifically, MARS-AdamW unifies AdamW-style momentum preconditioning with STORM-inspired variance-reduced gradient estimation and implements explicit stabilization via gradient clipping. Developed to address the observed gap between theoretical advances in variance reduction and their practical adoption in large-scale neural network training, MARS-AdamW demonstrates significant improvements in both token and time efficiency over vanilla AdamW for tasks such as LLM pretraining (Yuan et al., 2024).

1. Mathematical Derivation and Update Steps

MARS-AdamW builds on a variance-reduced preconditioned gradient framework. The core mechanism involves mixing the standard stochastic gradient with a STORM-style recursive momentum correction and then applying AdamW's adaptive moment preconditioning:

  • Variance-Reduced Gradient Estimator:

c_t := \nabla f(x_t, \xi_t) + \gamma_t \frac{\beta_1}{1-\beta_1}\bigl[\nabla f(x_t,\xi_t) - \nabla f(x_{t-1},\xi_t)\bigr]

Here, $x_t \in \mathbb{R}^d$ is the parameter vector at iteration $t$, $\xi_t$ denotes the stochastic mini-batch, and $\gamma_t \in [0,1]$ modulates the variance-reduction intensity (recovering vanilla AdamW at $\gamma_t = 0$ and full STORM at $\gamma_t = 1$).

  • Gradient Clipping:

\tilde{c}_t = \begin{cases} c_t / \|c_t\|_2, & \text{if } \|c_t\|_2 > 1 \\ c_t, & \text{otherwise} \end{cases}

  • Moment Estimation:

m_t = \beta_1 m_{t-1} + (1-\beta_1)\,\tilde{c}_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\tilde{c}_t^2

Bias corrections are then applied:

\hat{m}_t = m_t / (1-\beta_1^t), \qquad \hat{v}_t = v_t / (1-\beta_2^t)

  • Preconditioned Parameter Update (with decoupled weight decay):

x_{t+1} = x_t - \eta_t\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} + \lambda x_t\right)

For $\gamma_t \equiv 0$, the procedure exactly recovers AdamW. When $\gamma_t \equiv 1$ and $\lambda \rightarrow 0$, it recovers Adam combined with the full STORM recursion.
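
The update equations map directly onto array operations. Below is a minimal single-step sketch in NumPy, assuming the caller supplies both gradient evaluations (grad_curr $= \nabla f(x_t,\xi_t)$ and grad_prev $= \nabla f(x_{t-1},\xi_t)$, computed on the same mini-batch); the function and argument names are illustrative and not taken from the reference implementation.

```python
import numpy as np

def mars_adamw_step(x, m, v, t, grad_curr, grad_prev,
                    lr=3e-4, beta1=0.95, beta2=0.99,
                    gamma=0.025, weight_decay=0.1, eps=1e-8):
    """One MARS-AdamW-style update, following the equations above.

    grad_curr = grad f(x_t, xi_t); grad_prev = grad f(x_{t-1}, xi_t),
    both on the same mini-batch. With gamma=0 this reduces to plain AdamW.
    """
    # STORM-style variance-reduced gradient estimator c_t
    c = grad_curr + gamma * (beta1 / (1.0 - beta1)) * (grad_curr - grad_prev)

    # l2-norm clipping at threshold 1
    c_norm = np.linalg.norm(c)
    if c_norm > 1.0:
        c = c / c_norm

    # Exponential moving averages of first and second moments
    m = beta1 * m + (1.0 - beta1) * c
    v = beta2 * v + (1.0 - beta2) * c * c

    # Bias correction
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Preconditioned update with decoupled weight decay
    x = x - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * x)
    return x, m, v
```

Setting gamma=0 makes c equal to grad_curr, and the remaining lines are exactly an AdamW step, matching the reduction noted above.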

2. Algorithmic Workflow and Pseudocode

MARS-AdamW's step-by-step execution is summarized below:

Input: x₀ ∈ ℝᵈ, η_t, λ ≥ 0, Adam β₁, β₂, VR-scale γ_t, clip threshold =1, ε>0
Initialize: m₀ ← 0, v₀ ← 0, x₁ ← x₀
For t = 1, ..., T:
    1. Draw mini-batch ξ_t
    2. Compute gₜ = ∇f(xₜ, ξₜ)
    3. If t>1: δₜ = (β₁/(1−β₁)) · [gₜ − ∇f(xₜ₋₁, ξₜ)]
       Else: δₜ ← 0
    4. cₜ = gₜ + γ_t · δₜ
    5. If ∥cₜ∥₂ > 1: c̃ₜ = cₜ / ∥cₜ∥₂
       Else: c̃ₜ = cₜ
    6. mₜ = β₁·mₜ₋₁ + (1−β₁)·c̃ₜ
    7. vₜ = β₂·vₜ₋₁ + (1−β₂)·(c̃ₜ ⊙ c̃ₜ)
    8. m̂ₜ = mₜ / (1−β₁ᵗ), v̂ₜ = vₜ / (1−β₂ᵗ)
    9. xₜ₊₁ = xₜ − η_t·[ m̂ₜ / (√v̂ₜ + ε) + λ·xₜ ]
EndFor

Key distinguishing feature: steps 3–5 integrate STORM variance reduction into AdamW's preconditioner.
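
As a complement to the pseudocode, the sketch below shows how the update could be packaged as a PyTorch optimizer. It is a hedged illustration rather than the reference implementation: the class name and state layout are invented, clipping is applied per parameter tensor instead of over the full concatenated vector, and the variance-reduction correction uses the single-evaluation approximation (reusing the gradient stored from the previous step in place of ∇f(xₜ₋₁, ξₜ), in the spirit of the approximate variant discussed in Sections 5–6).

```python
import torch


class MARSAdamWSketch(torch.optim.Optimizer):
    """Illustrative MARS-AdamW-style optimizer (sketch, not the reference code).

    The gradient stored from the previous step stands in for the gradient at
    the previous iterate on the current batch, so only one gradient
    evaluation per iteration is required (approximate variant).
    """

    def __init__(self, params, lr=3e-4, betas=(0.95, 0.99), gamma=0.025,
                 weight_decay=0.1, eps=1e-8):
        defaults = dict(lr=lr, betas=betas, gamma=gamma,
                        weight_decay=weight_decay, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            vr_scale = group["gamma"] * beta1 / (1.0 - beta1)
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)
                    state["v"] = torch.zeros_like(p)
                    state["prev_grad"] = g.clone()
                state["step"] += 1
                t = state["step"]

                # Steps 3-4: (approximate) variance-reduced estimator c_t
                c = g + vr_scale * (g - state["prev_grad"])
                state["prev_grad"] = g.clone()

                # Step 5: l2-norm clipping at threshold 1 (per tensor here;
                # the pseudocode clips the full concatenated vector)
                c_norm = torch.linalg.vector_norm(c)
                if c_norm > 1.0:
                    c = c / c_norm

                # Steps 6-8: moment EMAs and bias correction
                m, v = state["m"], state["v"]
                m.mul_(beta1).add_(c, alpha=1.0 - beta1)
                v.mul_(beta2).addcmul_(c, c, value=1.0 - beta2)
                m_hat = m / (1.0 - beta1 ** t)
                v_hat = v / (1.0 - beta2 ** t)

                # Step 9: decoupled weight decay, then preconditioned update
                p.mul_(1.0 - group["lr"] * group["weight_decay"])
                p.add_(m_hat / (v_hat.sqrt() + group["eps"]),
                       alpha=-group["lr"])
        return loss
```

Usage then mirrors AdamW: construct it with `model.parameters()` and the settings from Section 3, call `loss.backward()`, then `opt.step()` and `opt.zero_grad()`.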

3. Hyperparameters and Selection Properties

MARS-AdamW introduces several hyperparameters, some inherited from AdamW and others governing the VR component. The major hyperparameters and empirical guidelines are:

| Hyperparameter | Role in Optimization | Recommended Setting / Range |
| --- | --- | --- |
| $\eta_t$ | Learning rate (schedule) | Cosine decay with linear warmup; peak $\eta$: 6e-4 (GPT-2 small), 3e-4 (medium), 2e-4 (large) |
| $\beta_1$ | Moment-1 EMA | $[0.9,\,0.99]$, best $\approx 0.95$ |
| $\beta_2$ | Moment-2 EMA | $[0.95,\,0.999]$, default $0.99$ |
| $\gamma_t$ | VR scale | Constant $\gamma \in [0.01,\,0.1]$; $0.025$ robust |
| $\epsilon$ | Numerical stability | $10^{-8}$ |
| $\lambda$ | Weight decay | $0.1$–$0.5$ (model-dependent) |
| Clipping | Gradient stabilization | $\ell_2$-norm threshold $=1.0$ |

Batch size (≈480) and warm-up steps (≈2k) follow large-model conventions. A constant VR scale is favored over schedule-based variants.
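
For convenience, the guidelines above can be collected into a single configuration. The dictionary below is an illustrative example using the GPT-2-medium-style values from the table, not a prescribed interface.

```python
# Illustrative MARS-AdamW configuration for a GPT-2-medium-scale run,
# collecting the recommended settings from the table above.
mars_adamw_config = {
    "peak_lr": 3e-4,          # cosine decay with linear warmup
    "warmup_steps": 2000,     # ~2k warmup steps
    "betas": (0.95, 0.99),    # beta1 (moment-1 EMA), beta2 (moment-2 EMA)
    "gamma": 0.025,           # constant VR scale; 0.01-0.1 is the robust range
    "eps": 1e-8,              # numerical stability
    "weight_decay": 0.1,      # 0.1-0.5, model-dependent
    "clip_threshold": 1.0,    # l2-norm clipping of c_t
    "batch_size": 480,        # ~480, large-model convention
}
```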

4. Theoretical Properties and Convergence

Under standard assumptions that FF is LL-smooth, stochastic gradients are unbiased with bounded variance, and preconditioners are positive definite, MARS-AdamW inherits and extends convergence properties of its constituent algorithms:

  • Incremental oracle complexity to find a point with $\|\nabla F(x)\| \leq \epsilon$ is $\mathcal{O}(\epsilon^{-3})$.
  • For $\gamma \rightarrow 1$, the method recovers the nearly-optimal STORM convergence rate for non-convex smooth optimization, as formalized in Arjevani et al. (2023).
  • Full formal proof of convergence for the preconditioned variant is indicated as a direction for future work; empirically, no divergence or instability was observed in large-scale runs (Yuan et al., 2024).

5. Comparative Empirical Evaluation on GPT-2 Pretraining

Performance was assessed on GPT-2 models of varying scales (small: 125M, medium: 355M, large: 770M) using the OpenWebText dataset.

  • Token Efficiency: For GPT-2 large, 27B tokens were required by MARS-AdamW to reach validation loss 2.58, while AdamW required 50B tokens. Final validation losses: 2.53 (MARS-AdamW) vs. 2.56 (AdamW).
  • Wall-Clock Efficiency: The per-iteration cost of MARS-AdamW is approximately 5–10% higher than AdamW (due to the VR correction), but overall wall-clock time to achieve a given loss is reduced by 50–60%.
  • Ablation Studies: Little difference was observed between the exact and approximate VR corrections (MARS vs. MARS-AP), suggesting MARS-AP is preferable when computational cost is a concern. A constant $\gamma \approx 0.025$ outperformed linear scheduling schemes in terms of final validation loss. MARS-AdamW also consistently outperformed MARS-Lion under matched conditions.
| Metric | AdamW | MARS-AdamW |
| --- | --- | --- |
| Tokens to val loss 2.58 (GPT-2 large) | 50B | 27B |
| Final validation loss (GPT-2 large) | 2.56 | 2.53 |
| Wall-clock speed to a given loss | Baseline | 1.5–2× faster |
| HellaSwag 0-shot acc. (GPT-2 medium/large) | 39.5% / 42.3% | 41.8% / 44.2% |
  • Downstream Transfer: For zero-shot transfer on Hellaswag (after 50B tokens), MARS-AdamW outperformed AdamW for both medium and large GPT-2 models.

6. Practical Considerations and Implementation

MARS-AdamW requires two gradient evaluations per iteration in the default setting (for the VR correction), though the approximate variant (MARS-AP) mitigates this overhead with minimal loss in quality. The optimizer introduces a modest (5–10%) per-iteration computational overhead relative to AdamW, but the net speedup from reduced training steps is substantial for large models. Stability is ensured by $\ell_2$-norm gradient clipping and careful hyperparameter tuning. Empirically, the method demonstrated robust convergence and efficiency across all tested LLM-scale configurations (Yuan et al., 2024).
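
To make the two-evaluation cost concrete, the toy sketch below (a least-squares objective standing in for the LLM loss; all names are hypothetical) shows where the extra forward/backward pass of the exact variant occurs within one iteration.

```python
import torch

# Synthetic least-squares problem as a stand-in for the training objective.
torch.manual_seed(0)
A = torch.randn(256, 8)
b = torch.randn(256)

def grad_at(x, idx):
    """Gradient of the mini-batch loss f(x, xi) at parameter vector x."""
    x = x.detach().requires_grad_(True)
    loss = ((A[idx] @ x - b[idx]) ** 2).mean()
    (g,) = torch.autograd.grad(loss, x)
    return g

x, x_prev = torch.zeros(8), torch.zeros(8)
idx = torch.randint(0, 256, (32,))   # current mini-batch xi_t
g_curr = grad_at(x, idx)             # grad f(x_t, xi_t): the usual backward pass
g_prev = grad_at(x_prev, idx)        # grad f(x_{t-1}, xi_t): the extra pass that
                                     # accounts for the 5-10% per-step overhead
# The approximate variant (MARS-AP) would instead reuse the gradient already
# computed at x_{t-1} on the previous batch, skipping this second pass.
```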
