In Search of Adam's Secret Sauce (2505.21829v1)

Published 27 May 2025 in cs.LG

Abstract: Understanding the remarkable efficacy of Adam when training transformer-based LLMs has become a central research topic within the optimization community. To gain deeper insights, several simplifications of Adam have been proposed, such as the signed gradient and signed momentum methods. In this work, we conduct an extensive empirical study - training over 1,300 LLMs across different data configurations and scales - comparing Adam to several known simplified variants. We find that signed momentum methods are faster than SGD, but consistently underperform relative to Adam, even after careful tuning of momentum, clipping setting and learning rates. However, our analysis reveals a compelling option that preserves near-optimal performance while allowing for new insightful reformulations: constraining the Adam momentum parameters to be equal. Beyond robust performance, this choice affords new theoretical insights, highlights the "secret sauce" on top of signed momentum, and grants a precise statistical interpretation: we show that Adam in this setting implements a natural online algorithm for estimating the mean and variance of gradients-one that arises from a mean-field Gaussian variational inference perspective.

Summary

  • The paper demonstrates that setting β₁ = β₂ in Adam simplifies hyperparameter tuning while consistently achieving near-optimal performance across diverse language modeling tasks.
  • It utilizes extensive experiments on over 1,300 models across various datasets and scales to compare Adam against simpler variants like Signum and RMSprop.
  • The findings link Adam's update rule to adaptive variance estimation, explaining its robustness in managing heterogeneous optimization landscapes in transformers.

Adam has remained the optimizer of choice for training large LMs, particularly transformers, despite the development of many alternative methods. This paper investigates the reasons behind Adam's efficacy by conducting extensive empirical studies and theoretical analysis, focusing on comparisons with simplified variants.

The core findings are derived from training over 1,300 LLMs on two data sources (SlimPajama, FineWeb) and two model scales (160M and 410M parameters), with detailed hyperparameter tuning for each method.

Key empirical results include:

  • Adam consistently outperforms established simplified variants such as Signum (SignSGD with momentum) and RMSprop across various language modeling tasks, even after extensive hyperparameter tuning for each method. The gap in validation perplexity is substantial, especially at longer sequence lengths (see the sketch of both update rules after this list).
  • A notable empirical finding is that constraining Adam's momentum parameters so that β₁ = β₂ yields performance that is consistently near-optimal or optimal across a wide range of experimental settings (batch sizes, sequence lengths, data sources, and model scales). For practical LM training, Adam can therefore effectively be treated as having a single momentum parameter.
  • The empirically optimal values of β₁ and β₂ are correlated, with better performance observed when they are close to each other.
  • Ablation studies show that other common components of Adam, such as the epsilon term for numerical stability, the moving-average initialization, and bias correction, have minimal impact on final performance compared to the choice of β values.
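
For concreteness, the sketch below (our own NumPy notation, not the paper's code; bias correction and weight decay omitted) spells out the two update rules being compared: Signum applies the sign of an exponential moving average of gradients, while Adam rescales its first moment by the square root of a second-moment estimate.

```python
import numpy as np

def signum_step(x, g, m, lr=1e-3, beta=0.9):
    """Signum: SignSGD with momentum."""
    m = beta * m + (1 - beta) * g          # EMA of gradients (momentum)
    return x - lr * np.sign(m), m          # step along the sign of the momentum

def adam_step(x, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain Adam step (bias correction omitted for brevity)."""
    m = beta1 * m + (1 - beta1) * g        # first moment
    v = beta2 * v + (1 - beta2) * g ** 2   # second moment
    return x - lr * m / (np.sqrt(v) + eps), m, v
```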

The theoretical contribution focuses on the empirically supported simplification β₁ = β₂ = β. Under this constraint, the Adam update direction can be rewritten exactly in terms of the first-moment term m_k = EMA_β[g_k] and a term proportional to the exponential moving average of the squared deviation between the previous momentum and the current gradient, σ_k² = β·EMA_β[(m_{k−1} − g_k)²].
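
One way to see this reformulation is that, under these definitions, the second moment decomposes as v_k = m_k² + σ_k², so the Adam direction m_k/√v_k equals sign(m_k)/√(1 + σ_k²/m_k²). The sketch below is our own numerical check (synthetic gradient stream; bias correction and ε omitted), not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.95
m = v = s = 0.0                                 # first moment, second moment, deviation EMA

for k in range(1000):
    g = rng.normal(loc=1.0, scale=2.0)          # stand-in for a stochastic gradient
    s = beta * s + (1 - beta) * (g - m) ** 2    # EMA of squared deviation from previous m
    m = beta * m + (1 - beta) * g               # Adam first moment
    v = beta * v + (1 - beta) * g ** 2          # Adam second moment (beta2 = beta1 = beta)
    sigma2 = beta * s                           # variance estimate sigma_k^2
    assert np.isclose(v, m ** 2 + sigma2)       # v_k = m_k^2 + sigma_k^2 holds at every step

# Hence the Adam direction equals a "mollified" sign step:
print(m / np.sqrt(v), np.sign(m) / np.sqrt(1 + sigma2 / m ** 2))
```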

The paper demonstrates that this specific formulation allows for a novel interpretation:

  • Adam with β₁ = β₂ acts as a natural online algorithm for estimating the mean (m_k) and variance (σ_k²) of the gradients. This estimation process arises from a mean-field Gaussian variational inference perspective, in which the objective is to find parameters (m, σ²) of a Gaussian distribution that best fit the new gradient sample while staying close to the previous estimate (m_{k−1}, σ_{k−1}²).
  • This interpretation linking Adam to variance estimation is precise only when β₁ = β₂.
  • The Adam update direction can then be viewed as an adaptively mollified version of SignSGD, where the mollification factor √(1 + σ_k²/m_k²) depends on the estimated noise-to-signal ratio σ_k/|m_k|. Adam thus adaptively shrinks the step size when the estimated gradient noise is high relative to the signal, and vice versa, as illustrated below.
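
To make the adaptive shrinking concrete, the short snippet below tabulates the step magnitude relative to a full sign step as the estimated noise-to-signal ratio grows; the ratio values are arbitrary illustrative choices, not numbers from the paper.

```python
import numpy as np

# Mollification factor: |Adam step| relative to a full SignSGD step,
# as a function of the estimated noise-to-signal ratio sigma_k / |m_k|.
for ratio in [0.0, 0.5, 1.0, 3.0, 10.0]:
    rel_step = 1.0 / np.sqrt(1.0 + ratio ** 2)
    print(f"sigma/|m| = {ratio:4.1f}  ->  relative step size = {rel_step:.3f}")
# A ratio near 0 recovers SignSGD's full step; large ratios shrink the step toward zero.
```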

A toy quadratic example, designed to mimic the heterogeneous optimization landscapes found in transformers (where different parameter groups have vastly different eigenvalue magnitudes), is used to validate these insights. On such heterogeneous landscapes, Adam (with β₁ = β₂) significantly outperforms Signum and SGD. The variance term σ_k² adapts differently across parameter groups, illustrating its role in handling landscape heterogeneity, which cannot be replicated by simple clipping or a fixed epsilon.
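
The sketch below is a minimal stand-in for such an experiment, not the paper's setup: the curvatures, noise level, learning rates, and iteration count are our own illustrative choices and would need per-method tuning to reproduce the reported gaps.

```python
import numpy as np

rng = np.random.default_rng(1)
curv = np.array([100.0, 1.0, 0.01])              # heterogeneous curvatures across parameter groups
x0 = np.full(3, 5.0)                             # common starting point

def grad(x):
    return curv * x + 0.1 * rng.normal(size=3)   # noisy gradient of 0.5 * sum(curv * x^2)

def run(update, state, lr, steps=2000):
    x = x0.copy()
    for _ in range(steps):
        x, state = update(x, grad(x), state, lr)
    return 0.5 * np.sum(curv * x ** 2)           # final loss

def sgd(x, g, state, lr):
    return x - lr * g, state

def signum(x, g, m, lr, beta=0.95):
    m = beta * m + (1 - beta) * g
    return x - lr * np.sign(m), m

def adam_eq(x, g, state, lr, beta=0.95, eps=1e-8):   # Adam with beta1 = beta2 = beta
    m, v = state
    m = beta * m + (1 - beta) * g
    v = beta * v + (1 - beta) * g ** 2
    return x - lr * m / (np.sqrt(v) + eps), (m, v)

print("SGD   :", run(sgd, None, lr=1e-3))
print("Signum:", run(signum, np.zeros(3), lr=1e-2))
print("Adam  :", run(adam_eq, (np.zeros(3), np.zeros(3)), lr=1e-2))
```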

Based on these findings, the paper recommends adopting Adam with β₁ = β₂ as a robust and simplified default for training LLMs at similar data and parameter scales, reducing the hyperparameter search space while preserving near-optimal performance. The standard setting (β₁, β₂) = (0.9, 0.999) is shown to be suboptimal in some configurations, whereas settings around (0.95, 0.95) are empirically strong.
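
In practice this is a one-line change in most frameworks; for example, in PyTorch (the model and learning rate below are placeholders, only the betas reflect the recommended setting):

```python
import torch

model = torch.nn.Linear(512, 512)   # placeholder for an actual language model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.95, 0.95))
```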

The paper concludes by noting limitations, including the grid dependence of the hyperparameter search and the fact that the theory explains the estimation of the gradient mean and variance but not the specific quotient structure of the Adam update. Nevertheless, the work provides strong empirical evidence for Adam's advantage and a principled explanation for the effectiveness of setting β₁ = β₂, rooted in online gradient variance estimation and adaptive trust-region modulation.
