
Momentum-Aligned Gradient Masking (Magma)

Updated 19 February 2026
  • The paper presents Magma, an adaptive optimizer enhancement that uses momentum-gradient alignment to mask parameter updates, reducing validation perplexity by over 19% in LLM pre-training.
  • Magma employs block-wise Bernoulli masking with rescaling to inject curvature-dependent regularization, steering optimization toward flatter regions and improved convergence.
  • Empirical evaluations on LLaMA-style transformers confirm Magma’s efficiency, achieving faster convergence and lower error floors with minimal additional computational overhead.

Momentum-Aligned Gradient Masking (Magma) is an optimization scheme for large-scale neural network training, particularly designed for LLMs. Magma augments adaptive optimizers by randomly masking parameter updates in a manner guided by the alignment between momentum and instantaneous gradients. This method introduces implicit geometric regularization, enhances training stability and generalization, and maintains computational efficiency suitable for large-scale transformer architectures (Joo et al., 17 Feb 2026).

1. Motivation: Curvature-Dependent Regularization via Masked Updates

Conventional adaptive optimizers such as RMSProp and Adam employ dense preconditioning to mitigate curvature heterogeneity by adjusting learning rates per parameter or block. Magma’s foundation is the observation that randomly masking a subset of parameter updates at each step (while still updating momentum and second-moment statistics densely) can yield systematic improvements in optimizer behavior and model generalization. This masking can be formalized by partitioning parameters into $B$ blocks and, at each iteration, sampling independent Bernoulli masks $m_t^{(b)} \sim \mathrm{Bernoulli}(p)$. Masked blocks forgo updates, and surviving updates are rescaled for unbiasedness: $\tilde\Delta_t^{(b)} = \frac{1}{p}\, m_t^{(b)} \Delta_t^{(b)}$.
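As a minimal sketch of this block-wise masking with rescaling (block names, the keep probability, and the unbiasedness check are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # keep probability (illustrative value)

# Hypothetical parameter blocks mapped to their dense update tensors.
updates = {"attn": rng.normal(size=4), "mlp": rng.normal(size=4)}

def mask_updates(updates, p, rng):
    """Sample one Bernoulli mask per block; rescale survivors by 1/p
    so the masked update is unbiased: E[masked] = update."""
    masked = {}
    for name, delta in updates.items():
        m = rng.random() < p          # m_t^{(b)} ~ Bernoulli(p), one draw per block
        masked[name] = (m / p) * delta  # masked blocks get 0, survivors get delta/p
    return masked

# Unbiasedness check: averaging many masked draws recovers the dense update.
avg = np.mean([mask_updates(updates, p, rng)["attn"] for _ in range(20000)], axis=0)
assert np.allclose(avg, updates["attn"], atol=0.05)
```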

A second-order Taylor analysis demonstrates that this random masking injects an additional penalty $\frac{1-p}{2p}\, (\Delta_t^{(b)})^\top H_{bb}\, \Delta_t^{(b)}$ (with $H_{bb}$ the block Hessian), explicitly regularizing updates in directions of high curvature. This steers optimization toward flatter regions, functioning analogously to sharpness-aware methods but derived from stochastic masking rather than explicit regularizers (Joo et al., 17 Feb 2026).
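The Taylor argument can be sketched as follows (a hedged reconstruction from the stated penalty, with the time index $t$ dropped; $f$ is the loss, $H$ its Hessian at $\theta$, $\Delta$ the dense update, and $\tilde\Delta$ the masked one):

```latex
% For m^{(b)} ~ Bernoulli(p): E[m^{(b)}/p] = 1 (unbiased), and since
% (m^{(b)})^2 = m^{(b)}, E[(m^{(b)}/p)^2] = 1/p. Block masks are independent,
% so cross-block Hessian terms match the dense expansion; only the diagonal
% blocks pick up the extra 1/p factor.
\begin{align*}
  \mathbb{E}\bigl[f(\theta - \tilde\Delta)\bigr]
    &\approx f(\theta) - \nabla f(\theta)^\top \mathbb{E}[\tilde\Delta]
       + \tfrac{1}{2}\,\mathbb{E}\bigl[\tilde\Delta^\top H\, \tilde\Delta\bigr] \\
    &= \underbrace{f(\theta) - \nabla f(\theta)^\top \Delta
       + \tfrac{1}{2}\,\Delta^\top H\, \Delta}_{\text{dense-update expansion}}
       \;+\; \sum_b \frac{1-p}{2p}\,(\Delta^{(b)})^\top H_{bb}\, \Delta^{(b)}
\end{align*}
```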

2. SkipUpdate and the Masked RMSProp Baseline

The precursor to Magma, termed “SkipUpdate,” applies masking uniformly across blocks, irrespective of block-specific signals. In the baseline masked RMSProp, block-wise statistics are computed as follows:

  • First moment (momentum): $\mu_t^{(b)} = \beta_1 \mu_{t-1}^{(b)} + (1-\beta_1)\, g_t^{(b)}$
  • Second moment: $v_t^{(b)} = \beta_2 v_{t-1}^{(b)} + (1-\beta_2)\, (g_t^{(b)})^2$
  • Preconditioned update: $\Delta_t^{(b)} = \eta\, \mu_t^{(b)} / \sqrt{v_t^{(b)} + \epsilon}$

After masking and rescaling as above, the first moment is preserved, while the added curvature regularization persists.
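The masked-RMSProp step above can be sketched as follows (block names and hyperparameter defaults are illustrative, not values from the paper):

```python
import numpy as np

def masked_rmsprop_step(theta, g, mu, v, rng,
                        lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, p=0.5):
    """One block-wise masked-RMSProp ("SkipUpdate") step.

    theta, g, mu, v are dicts mapping block name -> array. Moment statistics
    are updated densely; only the parameter update itself is masked.
    """
    for b in theta:
        mu[b] = beta1 * mu[b] + (1 - beta1) * g[b]        # first moment
        v[b]  = beta2 * v[b] + (1 - beta2) * g[b] ** 2    # second moment
        delta = lr * mu[b] / np.sqrt(v[b] + eps)          # preconditioned update
        m = rng.random() < p                              # Bernoulli mask per block
        theta[b] = theta[b] - (m / p) * delta             # rescaled masked update
    return theta, mu, v
```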

While effective, SkipUpdate does not differentiate between blocks according to their optimization state or noisiness, potentially discarding informative updates and retaining erratic ones.

3. Momentum-Gradient Alignment and Adaptive Masking

Magma extends SkipUpdate by modulating the masking probability based on the alignment between the current momentum estimate and the instantaneous gradient for each block. The alignment quality is measured by their cosine similarity:

$$\mathrm{cossim}\left(\mu_t^{(b)}, g_t^{(b)}\right) = \frac{\langle \mu_t^{(b)}, g_t^{(b)} \rangle}{\|\mu_t^{(b)}\|\,\|g_t^{(b)}\|} \in [-1, 1].$$

High (positive) values indicate that current gradients reinforce accumulated momentum, signaling reliable descent directions; low or negative values suggest stochasticity or gradient oscillations.
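A per-block cosine-similarity helper is a one-liner (the small `eps` guard against zero vectors is an implementation detail assumed here, not specified in the paper):

```python
import numpy as np

def block_cossim(mu, g, eps=1e-12):
    """Cosine similarity between a block's momentum and gradient vectors."""
    return float(mu @ g / (np.linalg.norm(mu) * np.linalg.norm(g) + eps))

# Aligned vectors score near +1; opposed ones near -1.
assert block_cossim(np.array([1.0, 0.0]), np.array([2.0, 0.0])) > 0.99
assert block_cossim(np.array([1.0, 0.0]), np.array([-1.0, 0.0])) < -0.99
```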

4. Magma Update Rule and Algorithm

For each block $b$ and iteration $t$, Magma computes an alignment-based score:

  • Alignment score: $\tilde{s}_t^{(b)} = \mathrm{sigmoid}\!\left(\mathrm{cossim}(\mu_t^{(b)}, g_t^{(b)})/\tau\right)$, with $\tau$ a temperature hyperparameter.
  • Exponential smoothing: $s_t^{(b)} = \alpha s_{t-1}^{(b)} + (1-\alpha)\, \tilde{s}_t^{(b)}$, typically with $\alpha = 0.9$.

A Bernoulli mask $m_t^{(b)} \sim \mathrm{Bernoulli}(p)$ is then sampled. The update applied is:

$$\tilde\Delta_t^{(b)} = s_t^{(b)}\, m_t^{(b)}\, \Delta_t^{(b)}$$

and the parameters are updated via $\theta_{t+1}^{(b)} = \theta_t^{(b)} - \tilde\Delta_t^{(b)}$.

This process preferentially transmits updates with strong momentum-gradient alignment, adaptively suppressing those likely to be noisy or unproductive. Magma’s computational burden is minimal, requiring $O(B)$ additional dot-product and scalar operations per step (Joo et al., 17 Feb 2026).
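Putting the pieces together, one Magma iteration on top of block-wise RMSProp might look like this (a sketch; hyperparameter defaults and the zero-vector guard are illustrative assumptions):

```python
import numpy as np

def magma_step(theta, g, mu, v, s, rng,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
               p=0.5, tau=0.1, alpha=0.9):
    """One Magma step. `s` holds the smoothed alignment score per block;
    theta, g, mu, v are dicts mapping block name -> array."""
    for b in theta:
        mu[b] = beta1 * mu[b] + (1 - beta1) * g[b]        # dense moment updates
        v[b]  = beta2 * v[b] + (1 - beta2) * g[b] ** 2
        delta = lr * mu[b] / np.sqrt(v[b] + eps)          # preconditioned update

        # Alignment score: sigmoid of momentum-gradient cosine similarity.
        cos = mu[b] @ g[b] / (np.linalg.norm(mu[b]) * np.linalg.norm(g[b]) + 1e-12)
        s_tilde = 1.0 / (1.0 + np.exp(-cos / tau))
        s[b] = alpha * s[b] + (1 - alpha) * s_tilde       # exponential smoothing

        m = rng.random() < p                              # Bernoulli mask
        theta[b] = theta[b] - s[b] * m * delta            # alignment-scaled update
    return theta, mu, v, s
```

Note that the alignment score and moment statistics are always updated, even on steps where the block's mask comes up zero and the parameters are left untouched.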

5. Mechanisms Behind Improved Optimization

Analysis reveals that Magma, through alignment-aware masking, amplifies optimizer selectivity:

  • Progress is focused on blocks with coherent, low-noise gradient trajectories.
  • Noisy or high-curvature blocks, identified via poor alignment, are down-weighted or skipped.

This mechanism reduces curvature-weighted noise, smooths the effective loss landscape, and widens the range of stable learning-rate schedules. Empirical evidence demonstrates accelerated convergence and lower error floors in transformer architectures with high parameter heterogeneity.

6. Empirical Performance in LLM Pre-Training

Magma’s efficacy has been evaluated on LLaMA-style transformers using the C4 corpus across model scales from 60M to 1B parameters. Integrations with Adam, LaProp, and RMSProp were compared to C-Adam, SGG, Adafactor, APOLLO, SOAP, and matrix-prefactorized Muon. For 1B-parameter models, RMSProp+Magma achieved validation perplexity of 13.19, representing reductions of over 19% compared to Adam (16.35) and over 9% compared to Muon (14.52). These gains are substantial relative to established adaptive optimizers and are consistent across smaller model regimes.

Magma also demonstrated robust performance enhancements in mixture-of-experts (Nano MoE) pre-training and synthetic heavy-tailed, heterogeneous quadratic benchmarks, exhibiting superior stability and final accuracy (Joo et al., 17 Feb 2026).

7. Practical Considerations and Computational Overhead

Magma is implemented as a wrapper around existing adaptive optimizers. The additional computational cost is restricted to $O(B)$ extra inner products, a scalar sigmoid evaluation, an exponential moving average, and a Bernoulli sample per block per iteration. Since $B$ (the block count) is much smaller than the overall parameter count in modern LLMs, the overhead is negligible in practice. Magma requires no extra gradient computations or significant changes to data handling, supporting straightforward integration into large-scale training workflows.


Momentum-Aligned Gradient Masking (Magma) is positioned as a robust, theoretically motivated, and empirically validated drop-in enhancement to adaptive optimization for large-scale neural network models (Joo et al., 17 Feb 2026).

