Abstract: Understanding the remarkable efficacy of Adam when training transformer-based LLMs has become a central research topic within the optimization community. To gain deeper insights, several simplifications of Adam have been proposed, such as the signed gradient and signed momentum methods. In this work, we conduct an extensive empirical study - training over 1,300 LLMs across different data configurations and scales - comparing Adam to several known simplified variants. We find that signed momentum methods are faster than SGD, but consistently underperform relative to Adam, even after careful tuning of momentum, clipping settings, and learning rates. However, our analysis reveals a compelling option that preserves near-optimal performance while allowing for new insightful reformulations: constraining the Adam momentum parameters to be equal. Beyond robust performance, this choice affords new theoretical insights, highlights the "secret sauce" on top of signed momentum, and grants a precise statistical interpretation: we show that Adam in this setting implements a natural online algorithm for estimating the mean and variance of gradients, one that arises from a mean-field Gaussian variational inference perspective.
Summary
The paper demonstrates that setting β₁ = β₂ in Adam simplifies hyperparameter tuning while consistently achieving near-optimal performance across diverse language modeling tasks.
It utilizes extensive experiments on over 1,300 models across various datasets and scales to compare Adam against simpler variants like Signum and RMSprop.
The findings link Adam's update rule to adaptive variance estimation, explaining its robustness in managing heterogeneous optimization landscapes in transformers.
Adam has remained the optimizer of choice for training large language models (LLMs), particularly transformers, despite the development of many alternative methods. This paper investigates the reasons behind Adam's efficacy through extensive empirical studies and theoretical analysis, focusing on comparisons with simplified variants.
The core findings are derived from training over 1,300 LLMs on different datasets (SlimPajama, FineWeb) and at different scales (160M and 410M parameters), with detailed hyperparameter tuning.
Key empirical results include:
Adam consistently outperforms established simplified variants such as Signum (SignSGD with momentum) and RMSprop across various language modeling tasks, even after extensive hyperparameter tuning of each method (see the sketch of these update rules after this list). The performance gap, measured in validation perplexity, is substantial, especially at longer sequence lengths.
A notable empirical finding is that constraining Adam's momentum parameters such that β₁ = β₂ results in performance that is consistently optimal or near-optimal across a wide range of experimental settings (different batch sizes, sequence lengths, data sources, and model scales). This suggests that, for practical LM training, Adam can effectively be treated as having a single momentum parameter.
The optimal values of β₁ and β₂ in Adam are empirically found to be correlated, with better performance observed when they are closer to each other.
Ablation studies show that other common components of Adam, such as the epsilon term for numerical stability, moving average initialization, and bias correction, have minimal impact on the final performance compared to the choice of β values.
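To make the comparison concrete, here is a minimal, illustrative sketch of the per-parameter update rules involved. All updates are written with EMA-style momentum for uniformity; the hyperparameter values are placeholders rather than the paper's tuned settings, and the paper's actual implementations (weight decay, clipping, warmup, etc.) may differ in details.

```python
# Illustrative per-tensor update rules (not the paper's exact implementations).
import torch

def sgd_momentum_step(p, g, m, lr=1e-2, beta=0.9):
    """SGD with (EMA-style) momentum: raw, unnormalized step along the momentum."""
    m.mul_(beta).add_(g, alpha=1 - beta)
    p.add_(m, alpha=-lr)

def signum_step(p, g, m, lr=1e-3, beta=0.9):
    """Signum (SignSGD with momentum): only the sign of the momentum is used."""
    m.mul_(beta).add_(g, alpha=1 - beta)
    p.add_(torch.sign(m), alpha=-lr)

def rmsprop_step(p, g, v, lr=1e-3, beta2=0.999, eps=1e-8):
    """RMSprop: second-moment normalization, no first-moment averaging."""
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
    p.addcdiv_(g, v.sqrt().add_(eps), value=-lr)

def adam_step(p, g, m, v, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: both EMAs, plus the eps and bias-correction terms ablated above."""
    m.mul_(beta1).add_(g, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
    m_hat = m / (1 - beta1 ** k)   # bias correction of the first moment
    v_hat = v / (1 - beta2 ** k)   # bias correction of the second moment
    p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)
```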
The theoretical contribution focuses on the empirically supported simplification β₁ = β₂ = β. Under this constraint, the Adam update direction can be reformulated exactly in terms of the first-moment (momentum) term m_k = EMA_β[g_k] and a variance-like term σ_k² = β · EMA_β[(g_k − m_{k−1})²], the exponential moving average of the squared deviation between the current gradient and the previous momentum; the second-moment estimate then decomposes as v_k = m_k² + σ_k².
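A quick numerical sanity check of this decomposition, as a minimal sketch assuming zero-initialized moving averages and no bias correction (the synthetic gradient stream is purely illustrative):

```python
# Check that v_k = m_k^2 + beta * EMA_beta[(g_k - m_{k-1})^2] when beta1 = beta2.
import numpy as np

rng = np.random.default_rng(0)
beta = 0.95
m, v, u = 0.0, 0.0, 0.0   # m: EMA of g, v: EMA of g^2, u: EMA of (g - m_prev)^2
for k in range(1, 200):
    g = rng.normal(loc=1.0, scale=0.5)          # synthetic gradient sample
    u = beta * u + (1 - beta) * (g - m) ** 2    # uses m before updating, i.e. m_{k-1}
    m = beta * m + (1 - beta) * g
    v = beta * v + (1 - beta) * g ** 2
    sigma_sq = beta * u
    assert np.isclose(v, m ** 2 + sigma_sq), "identity v_k = m_k^2 + sigma_k^2 failed"
print("v_k = m_k^2 + sigma_k^2 holds at every step of the run")
```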
The paper demonstrates that this specific formulation allows for a novel interpretation:
Adam with β₁ = β₂ acts as a natural online algorithm for estimating the mean (m_k) and variance (σ_k²) of the gradients. This estimation process arises from a mean-field Gaussian variational inference perspective, in which the objective is to find parameters (m, σ²) of a Gaussian distribution that best fit the new gradient sample while staying close to the previous estimate (m_{k−1}, σ²_{k−1}).
This interpretation linking Adam to variance estimation is shown to be precise only when β₁ = β₂.
The Adam update direction can then be viewed as an adaptively mollified version of SignSGD: the update is sign(m_k) scaled by the mollification factor 1/√(1 + σ_k²/m_k²), which depends on the estimated noise-to-signal ratio σ_k/|m_k|. This implies that Adam adaptively shrinks the step size when the estimated gradient noise is high relative to the signal, and vice versa.
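To illustrate the mollification concretely, a short sketch (eps and bias correction omitted; the specific numbers are illustrative):

```python
# Tied-beta Adam direction m/sqrt(v) rewritten as a mollified sign update.
import numpy as np

def adam_direction(m, sigma_sq):
    """Two equivalent forms of the tied-beta Adam direction (eps ignored)."""
    v = m ** 2 + sigma_sq
    mollified_sign = np.sign(m) / np.sqrt(1.0 + sigma_sq / m ** 2)
    assert np.isclose(m / np.sqrt(v), mollified_sign)
    return mollified_sign

print(adam_direction(m=1.0, sigma_sq=0.01))   # low noise-to-signal: close to sign(m) = 1.0
print(adam_direction(m=1.0, sigma_sq=100.0))  # high noise-to-signal: step shrinks toward 0
```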
A toy quadratic example, designed to mimic the heterogeneous optimization landscapes found in transformers (where different parameter groups have vastly different eigenvalue magnitudes), is used to validate these insights. The example shows that Adam (with β₁ = β₂) significantly outperforms Signum and SGD on such heterogeneous landscapes. The variance term σ_k² in this example adapts differently across parameter groups, illustrating its role in handling landscape heterogeneity, which cannot be replicated by simple clipping or a fixed epsilon.
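The following sketch is in the spirit of that toy example but is not the paper's exact construction: a noisy quadratic with two parameter groups whose curvatures differ by four orders of magnitude, run with the tied-beta recursions so that the per-coordinate variance estimate σ_k² = v_k − m_k² can be inspected across groups during training (curvatures, noise level, and learning rate are illustrative choices).

```python
# Heterogeneous noisy quadratic f(x) = 0.5 * sum_i h_i * x_i^2 with two groups.
import numpy as np

rng = np.random.default_rng(0)
h = np.array([100.0, 100.0, 0.01, 0.01])   # curvatures: stiff group, flat group
x = np.ones_like(h)
beta, lr, noise_std = 0.95, 1e-2, 1.0

m = np.zeros_like(x)
v = np.zeros_like(x)
for k in range(1, 501):
    g = h * x + noise_std * rng.normal(size=x.shape)   # noisy gradient
    m = beta * m + (1 - beta) * g
    v = beta * v + (1 - beta) * g ** 2
    x = x - lr * m / np.sqrt(v + 1e-12)                # tied-beta Adam step (no eps/bias corr.)
    if k in (25, 100, 500):
        # sigma^2 = v - m^2 per coordinate, from the identity derived above
        print(f"step {k:3d}  sigma^2 per coord: {np.round(v - m ** 2, 3)}")
print("final loss:", 0.5 * np.sum(h * x ** 2))
```

In this sketch the stiff group's σ² is typically much larger than the flat group's early in training, while both settle toward the gradient-noise level later on, which is the per-group adaptation the paragraph above describes.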
Based on these findings, the paper recommends adopting Adam with β₁ = β₂ as a robust and simplified default for training LLMs at similar data and parameter scales, which reduces the hyperparameter search space while preserving near-optimal performance. The standard setting (β₁, β₂) = (0.9, 0.999) is shown to be suboptimal in some configurations, while settings around (0.95, 0.95) are empirically strong.
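In standard training code the tied-beta recommendation is a one-line change; a minimal sketch using PyTorch's AdamW, where the model, learning rate, and weight decay are placeholders rather than the paper's tuned values:

```python
import torch

model = torch.nn.Linear(512, 512)   # stand-in for a transformer LM
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                # placeholder; tune per model and data scale
    betas=(0.95, 0.95),     # tied betas, the empirically robust default reported here
    weight_decay=0.1,       # placeholder
)
```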
The paper concludes by noting limitations, including the grid-dependency of hyperparameter search and the fact that the theory explains the estimation of mean and variance but not the specific quotient structure of the Adam update. Nevertheless, the work provides strong empirical evidence for Adam's advantage and a principled explanation for the effectiveness of setting β1=β2 rooted in online gradient variance estimation and adaptive trust region modulation.