On the Surprising Effectiveness of Masking Updates in Adaptive Optimizers
This presentation explores a counterintuitive finding in large language model optimization: randomly masking parameter updates during training can actually improve performance. We examine how Magma, a momentum-aligned gradient masking technique, achieves significant perplexity improvements over state-of-the-art optimizers like Adam by introducing geometric regularization through structured stochasticity. The talk covers the theoretical foundations of curvature-aware masking, empirical results across model scales from 60 million to 1 billion parameters, and the practical implications for training stability and generalization in modern foundation models.

Script
What if skipping half your optimizer updates could actually make your model train better? This paper challenges the fundamental assumption that we must apply every gradient we compute, revealing that strategic masking of parameter updates yields surprising benefits for large language model training.
To understand why this matters, we need to examine the challenges facing modern optimizers.
Following from that challenge, the field has settled into a paradigm in which dense gradients from backpropagation are assumed to demand dense optimizer updates. Yet this assumption leaves optimization vulnerable to instability from sharp local curvature and noisy gradients, and few alternatives have been explored.
The authors propose a radically different approach based on strategic randomness.
Building on this foundation, they introduce SkipUpdate, which applies block-wise Bernoulli masking to parameters. Through rigorous analysis, they prove this creates implicit geometric regularization that discourages movement in directions with high curvature, naturally guiding the optimizer toward flatter, more generalizable minima.
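To make the mechanism concrete, here is a minimal sketch of block-wise Bernoulli masking as described above. The function name, block size, and the inverse-probability rescaling (which keeps the masked update unbiased in expectation) are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def skipupdate_step(params, update, block_size=64, keep_prob=0.5, rng=None):
    """Apply an optimizer update with block-wise Bernoulli masking (sketch).

    Each contiguous block of `block_size` parameters is either applied in
    full or skipped entirely, with probability `keep_prob` per block.
    """
    rng = np.random.default_rng() if rng is None else rng
    flat = update.ravel()
    n_blocks = -(-flat.size // block_size)  # ceiling division
    # One Bernoulli draw per block, broadcast to every entry in the block.
    block_mask = rng.random(n_blocks) < keep_prob
    mask = np.repeat(block_mask, block_size)[: flat.size].astype(flat.dtype)
    # Rescale kept blocks so the expected update matches the dense update.
    masked = (flat * mask) / keep_prob
    return params - masked.reshape(update.shape)
```

With `keep_prob=0.5`, roughly half the blocks are skipped on any given step, yet the expected update direction is unchanged; the regularization effect comes from the block-structured variance this injects.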
The empirical evidence is striking. Across model scales from 60 million to 1 billion parameters on the C4 benchmark, SkipUpdate with 50 percent masking probability consistently outperforms dense baseline optimizers. Notice how the performance gap actually widens at larger scales, precisely where optimization stability matters most.
Taking this further, Magma enhances the basic masking approach by computing block-wise cosine similarity between momentum and stochastic gradients. Updates aligned with historical momentum receive higher weight, while misaligned, noisy updates are probabilistically suppressed, all without requiring extra memory.
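A small sketch of the alignment scoring just described, under stated assumptions: the function name, the linear mapping from cosine similarity to a keep probability, and the probability bounds `p_min`/`p_max` are all hypothetical choices for illustration, not Magma's published formula:

```python
import numpy as np

def magma_keep_probs(momentum, grad, block_size=64, p_min=0.1, p_max=0.9):
    """Per-block keep probabilities from momentum-gradient alignment (sketch).

    Blocks whose gradient agrees with the momentum buffer get a keep
    probability near p_max; misaligned (likely noisy) blocks are
    suppressed toward p_min.
    """
    m, g = momentum.ravel(), grad.ravel()
    probs = []
    for start in range(0, m.size, block_size):
        mb = m[start:start + block_size]
        gb = g[start:start + block_size]
        denom = np.linalg.norm(mb) * np.linalg.norm(gb)
        cos = float(mb @ gb / denom) if denom > 0 else 0.0
        # Map cosine similarity in [-1, 1] linearly onto [p_min, p_max].
        probs.append(p_min + (p_max - p_min) * (cos + 1.0) / 2.0)
    return np.array(probs)
```

Because the cosine is computed from the momentum buffer the optimizer already stores, this scoring adds no extra memory beyond the mask itself, consistent with the no-extra-memory claim in the talk.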
Turning to the full empirical validation, the results are comprehensive. On Llama 2 pre-training with 1 billion parameters, Magma achieves 19 percent lower perplexity than Adam and 9 percent lower than Muon. In mixture-of-experts models, with their notoriously unstable training dynamics, Magma maintains the strongest performance and robustness.
Perhaps most practically important is this hyperparameter sensitivity analysis. While Adam and C-Adam exhibit narrow stability windows requiring careful learning rate tuning, Magma maintains robust performance across a significantly wider range. This dramatically reduces the cost and fragility of hyperparameter search in production training runs.
Synthesizing these findings, this work fundamentally questions whether we truly need to apply every gradient we compute. By demonstrating that structured stochasticity can serve as powerful implicit regularization, it suggests an entire class of geometry-aware optimization methods waiting to be explored.
Strategic randomness in optimization may be less about throwing away information and more about listening to the geometry of the loss surface. Visit EmergentMind.com to explore this paper and discover more cutting-edge research reshaping how we train foundation models.