- The paper introduces AdEMAMix, an optimizer that combines a fast and a slow exponential moving average of gradients to make better use of older gradient information.
- Experiments show AdEMAMix outperforming AdamW on language modeling and Vision Transformer training, with language models reaching comparable performance on substantially fewer training tokens.
- The study also finds that AdEMAMix models forget their training data more slowly, pointing to practical resource savings and to new questions about long-term retention of gradient information.
Overview of the AdEMAMix Optimizer
The paper "The AdEMAMix Optimizer: Better, Faster, Older" by Pagliardini, Ablin, and Grangier introduces AdEMAMix, a novel optimizer designed to address inherent limitations in momentum-based optimization methods commonly used in machine learning. This work builds on and modifies the Adam optimizer by incorporating a mixture of two Exponential Moving Averages (EMAs) to leverage information from both recent and older gradients more effectively.
Background
Momentum-based optimizers such as SGD with momentum (SGD+M), Adam, and AdamW are standard tools for training deep neural networks across domains such as computer vision and natural language processing. They improve convergence by using momentum to smooth the optimization trajectory, damping small local variations of the loss landscape. Traditional implementations realize momentum as an EMA of past gradients, where a decay factor β controls how much older gradients contribute.
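Unrolling the momentum recursion (assuming $m^{(0)} = 0$) makes the role of β explicit: the EMA is a geometrically weighted sum of all past gradients, with a gradient from $k$ steps ago carrying weight $(1-\beta)\beta^{k}$:
$m^{(t)} = \beta\, m^{(t-1)} + (1-\beta)\, g^{(t)} = (1-\beta) \sum_{k=0}^{t-1} \beta^{k}\, g^{(t-k)}$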
Motivation
The core premise of the paper is that relying on a single EMA, and hence a single momentum term, is sub-optimal. A single EMA with a small β emphasizes recent gradients but discards useful information from older ones; a large β retains more of that older information but reacts slowly to recent changes in the loss landscape. This dichotomy motivates an optimizer that can draw on recent and older gradients at the same time.
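To put numbers on this trade-off: under the geometric weighting above, the number of steps after which a gradient's weight has halved is $t_{1/2} = \ln 2 / \ln(1/\beta)$. For Adam's typical $\beta = 0.9$ this is about 6.6 steps, whereas for $\beta = 0.9999$ (the slow decay rate AdEMAMix uses below) it is roughly 6,900 steps:
$t_{1/2} = \frac{\ln 2}{\ln(1/\beta)} \approx 6.6 \;\; (\beta = 0.9), \qquad t_{1/2} \approx 6931 \;\; (\beta = 0.9999)$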
Methodology
AdEMAMix extends the Adam optimizer by introducing an additional EMA with a much larger decay rate (e.g., β3 = 0.9999). Each update combines a fast-changing EMA (the conventional Adam momentum term) with this slow-changing EMA, whose contribution is weighted by a new parameter α:
$\begin{cases}
m_1^{(t)} = \beta_1 m_1^{(t-1)} + (1-\beta_1)\, g^{(t)}, \quad\quad \hat{m}_1^{(t)} = \frac{m_1^{(t)}}{1-\beta_1^t} \\
m_2^{(t)} = \beta_3 m_2^{(t-1)} + (1-\beta_3)\, g^{(t)} \\
\nu^{(t)} = \beta_2 \nu^{(t-1)} + (1-\beta_2)\, \left(g^{(t)}\right)^2, \quad\quad \hat{\nu}^{(t)} = \frac{\nu^{(t)}}{1-\beta_2^t} \\
\theta^{(t)} = \theta^{(t-1)} - \eta \left( \frac{\hat{m}_1^{(t)} + \alpha\, m_2^{(t)}}{\sqrt{\hat{\nu}^{(t)}} + \epsilon} + \lambda\, \theta^{(t-1)} \right)
\end{cases}$
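Here $g^{(t)}$ is the gradient at step $t$, $m_1$ and $m_2$ are the fast and slow EMAs, $\nu$ is the second-moment estimate, $\eta$ the learning rate, and $\lambda$ the weight-decay coefficient; note that only $m_1$ and $\nu$ are bias-corrected, while the slow EMA $m_2$ is used as-is. The sketch below translates a single update step into PyTorch. It is a minimal illustration rather than the authors' reference implementation, and the default hyperparameter values (e.g., `alpha=5.0`) are placeholders, not the paper's tuned settings.

```python
import torch


@torch.no_grad()
def ademamix_step(param, grad, state, t,
                  lr=1e-3, betas=(0.9, 0.999, 0.9999),
                  alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One AdEMAMix update for a single parameter tensor.

    `state` holds m1 (fast EMA), m2 (slow EMA), and nu (second moment),
    all initialized to zeros with the same shape as `param`; `t` starts at 1.
    """
    beta1, beta2, beta3 = betas
    m1, m2, nu = state["m1"], state["m2"], state["nu"]

    # Fast EMA (standard Adam first moment) and slow EMA with beta3.
    m1.mul_(beta1).add_(grad, alpha=1 - beta1)
    m2.mul_(beta3).add_(grad, alpha=1 - beta3)
    # Second-moment EMA, as in Adam.
    nu.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Bias correction for m1 and nu only; m2 enters the update uncorrected.
    m1_hat = m1 / (1 - beta1 ** t)
    nu_hat = nu / (1 - beta2 ** t)

    # theta <- theta - eta * ((m1_hat + alpha * m2) / (sqrt(nu_hat) + eps) + lambda * theta)
    update = (m1_hat + alpha * m2) / (nu_hat.sqrt() + eps)
    param.add_(update + weight_decay * param, alpha=-lr)
```

Wrapping this per-parameter step in a `torch.optim.Optimizer` subclass, with `m1`, `m2`, and `nu` kept in the optimizer state, would give a drop-in replacement for AdamW.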
Experimental Results
LLM Training
The paper presents extensive experiments on language models of three sizes (110M, 335M, and 1.3B parameters). At every model size, AdEMAMix consistently outperformed AdamW across different token budgets. Notably, a 1.3B-parameter AdEMAMix model trained on 101B tokens matched the performance of an AdamW model trained on 197B tokens, roughly 95% more data.
ViT Training
AdEMAMix was also evaluated on Vision Transformers (ViTs) trained on the ImageNet-21k and ImageNet-1k datasets. The results show that AdEMAMix consistently reduced training loss faster than AdamW, and in the settings where lower training loss translated into better test performance, AdEMAMix outperformed the AdamW baseline on held-out data as well.
Analysis of Forgetting
A key observation is that AdEMAMix models forget training batches more slowly than their AdamW counterparts: the loss on a batch degrades less in the iterations after it has been seen. This slower forgetting illustrates a concrete benefit of leveraging older gradients, namely that information from past data persists in the optimizer's state over many iterations.
Implications
Practical Implications
Because AdEMAMix reaches better minima faster, it offers a more resource-efficient alternative to Adam and AdamW. The substantial reduction in the number of tokens needed to reach a given performance level translates directly into computational savings.
Theoretical Implications
This work underscores the importance of further exploration into the functional forms used to leverage past gradients beyond simple EMAs. It raises interesting questions about the loss landscape and gradient consistency. Additionally, the distinct behavior in the context of model forgetting warrants deeper investigation into the long-term retention of gradient information.
Future Developments
Future avenues of research may involve:
- Scalability analysis of AdEMAMix in large distributed training environments.
- Extending this methodology to other types of optimization landscapes, such as those found in reinforcement learning and generative models.
- Further empirical studies to understand the interaction between older gradient utilization and generalization properties.
In conclusion, AdEMAMix represents a substantial step forward in optimization strategies for deep learning, demonstrating the value of older gradients that previous optimizers largely discarded. The result is faster training and better convergence, making it a valuable addition to the optimizer toolbox in both research and industry practice.