- The paper introduces AdEMAMix, an optimizer that combines a fast and a slow exponential moving average of gradients to make better use of older gradient information.
- Experiments show AdEMAMix outperforming AdamW on language modeling and Vision Transformer training, with language models reaching comparable performance on substantially fewer training tokens.
- The study also finds that AdEMAMix models forget their training data more slowly, pointing to practical resource savings and to new questions about long-term retention of gradient information.
Overview of the AdEMAMix Optimizer
The paper "The AdEMAMix Optimizer: Better, Faster, Older" by Pagliardini, Ablin, and Grangier introduces AdEMAMix, a novel optimizer designed to address inherent limitations in momentum-based optimization methods commonly used in machine learning. This work builds on and modifies the Adam optimizer by incorporating a mixture of two Exponential Moving Averages (EMAs) to leverage information from both recent and older gradients more effectively.
Background
Momentum-based optimizers such as SGD with momentum (SGD+M), Adam, and AdamW are standard tools for training deep neural networks across domains such as computer vision and natural language processing. They improve convergence by using momentum to smooth the optimization trajectory, damping small local variations of the loss landscape. Traditional implementations realize momentum as an EMA of past gradients, where a decay factor β controls how much older gradients contribute.
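Unrolling the momentum recursion (assuming $m^{(0)} = 0$) makes the role of β explicit: the EMA is a geometrically weighted sum of all past gradients, with a gradient from $k$ steps ago carrying weight $(1-\beta)\beta^{k}$:
$m^{(t)} = \beta\, m^{(t-1)} + (1-\beta)\, g^{(t)} = (1-\beta) \sum_{k=0}^{t-1} \beta^{k}\, g^{(t-k)}$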
Motivation
The core premise of the paper is that relying on a single EMA, and hence a single momentum term, is sub-optimal. A single EMA with a small β emphasizes recent gradients but discards useful information from older ones; a large β retains more of that older information but reacts slowly to recent changes in the loss landscape. This dichotomy motivates an optimizer that can draw on recent and older gradients at the same time.
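To put numbers on this trade-off: under the geometric weighting above, the number of steps after which a gradient's weight has halved is $t_{1/2} = \ln 2 / \ln(1/\beta)$. For Adam's typical $\beta = 0.9$ this is about 6.6 steps, whereas for $\beta = 0.9999$ (the slow decay rate AdEMAMix uses below) it is roughly 6,900 steps:
$t_{1/2} = \frac{\ln 2}{\ln(1/\beta)} \approx 6.6 \;\; (\beta = 0.9), \qquad t_{1/2} \approx 6931 \;\; (\beta = 0.9999)$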
Methodology
AdEMAMix extends the Adam optimizer by introducing an additional EMA with a much larger decay rate (e.g., β3 = 0.9999). Each update combines a fast-changing EMA (the conventional Adam momentum term) with this slow-changing EMA, whose contribution is weighted by a new parameter α:
$\begin{cases}
m_1^{(t)} = \beta_1 m_1^{(t-1)} + (1-\beta_1)\, g^{(t)}, \quad\quad \hat{m}_1^{(t)} = \frac{m_1^{(t)}}{1-\beta_1^t} \\
m_2^{(t)} = \beta_3 m_2^{(t-1)} + (1-\beta_3)\, g^{(t)} \\
\nu^{(t)} = \beta_2 \nu^{(t-1)} + (1-\beta_2)\, \left(g^{(t)}\right)^2, \quad\quad \hat{\nu}^{(t)} = \frac{\nu^{(t)}}{1-\beta_2^t} \\
\theta^{(t)} = \theta^{(t-1)} - \eta \left( \frac{\hat{m}_1^{(t)} + \alpha\, m_2^{(t)}}{\sqrt{\hat{\nu}^{(t)}} + \epsilon} + \lambda\, \theta^{(t-1)} \right)
\end{cases}$
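Here $g^{(t)}$ is the gradient at step $t$, $m_1$ and $m_2$ are the fast and slow EMAs, $\nu$ is the second-moment estimate, $\eta$ the learning rate, and $\lambda$ the weight-decay coefficient; note that only $m_1$ and $\nu$ are bias-corrected, while the slow EMA $m_2$ is used as-is. The sketch below translates a single update step into PyTorch. It is a minimal illustration rather than the authors' reference implementation, and the default hyperparameter values (e.g., `alpha=5.0`) are placeholders, not the paper's tuned settings.

```python
import torch


@torch.no_grad()
def ademamix_step(param, grad, state, t,
                  lr=1e-3, betas=(0.9, 0.999, 0.9999),
                  alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One AdEMAMix update for a single parameter tensor.

    `state` holds m1 (fast EMA), m2 (slow EMA), and nu (second moment),
    all initialized to zeros with the same shape as `param`; `t` starts at 1.
    """
    beta1, beta2, beta3 = betas
    m1, m2, nu = state["m1"], state["m2"], state["nu"]

    # Fast EMA (standard Adam first moment) and slow EMA with beta3.
    m1.mul_(beta1).add_(grad, alpha=1 - beta1)
    m2.mul_(beta3).add_(grad, alpha=1 - beta3)
    # Second-moment EMA, as in Adam.
    nu.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Bias correction for m1 and nu only; m2 enters the update uncorrected.
    m1_hat = m1 / (1 - beta1 ** t)
    nu_hat = nu / (1 - beta2 ** t)

    # theta <- theta - eta * ((m1_hat + alpha * m2) / (sqrt(nu_hat) + eps) + lambda * theta)
    update = (m1_hat + alpha * m2) / (nu_hat.sqrt() + eps)
    param.add_(update + weight_decay * param, alpha=-lr)
```

Wrapping this per-parameter step in a `torch.optim.Optimizer` subclass, with `m1`, `m2`, and `nu` kept in the optimizer state, would give a drop-in replacement for AdamW.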
Experimental Results
LLM Training
The paper presents extensive experiments on language models of three sizes (110M, 335M, and 1.3B parameters). At every model size, AdEMAMix consistently outperformed AdamW across different token budgets. Notably, a 1.3B-parameter AdEMAMix model trained on 101B tokens matched the performance of an AdamW model trained on 197B tokens, roughly 95% more data.
ViT Training
AdEMAMix was also evaluated on Vision Transformers (ViTs) trained on the ImageNet-21k and ImageNet-1k datasets. The results show that AdEMAMix consistently reduced training loss faster than AdamW, and in the settings where lower training loss translated into better test performance, AdEMAMix outperformed the AdamW baseline on held-out data as well.
Analysis of Forgetting
A key observation is that AdEMAMix models forget training batches more slowly than their AdamW counterparts: the loss on a batch degrades less in the iterations after it has been seen. This slower forgetting illustrates a concrete benefit of leveraging older gradients, namely that information from past data persists in the optimizer's state over many iterations.
Implications
Practical Implications
Because AdEMAMix reaches better minima faster, it offers a more resource-efficient alternative to Adam and AdamW. The substantial reduction in the number of tokens needed to reach a given performance level translates directly into computational savings.
Theoretical Implications
This work underscores the importance of further exploration into the functional forms used to leverage past gradients beyond simple EMAs. It raises interesting questions about the loss landscape and gradient consistency. Additionally, the distinct behavior in the context of model forgetting warrants deeper investigation into the long-term retention of gradient information.
Future Developments
Future avenues of research may involve:
- Scalability analysis of AdEMAMix in large distributed training environments.
- Extending this methodology to other types of optimization landscapes, such as those found in reinforcement learning and generative models.
- Further empirical studies to understand the interaction between older gradient utilization and generalization properties.
In conclusion, AdEMAMix represents a substantial step forward in optimization strategies for deep learning, demonstrating the value of older gradients that previous optimizers largely discarded. The result is faster training and better convergence, making it a valuable addition to the optimizer toolbox in both research and industry practice.