Deconstructing What Makes a Good Optimizer for LLMs
The paper "Deconstructing What Makes a Good Optimizer for LLMs" by Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, and Sham Kakade provides an extensive evaluation of several optimization algorithms for LLM training. The authors contend that despite the prevailing preference for the Adam optimizer in the community, there is a lack of rigorous comparison among various optimizers under diverse conditions, such as different model sizes, hyperparameters, and architectures. This paper addresses that gap by performing comprehensive sweeps and presenting a granular analysis of the stability and performance of several optimizers.
Methodological Framework
The authors evaluate several optimization algorithms, including SGD, Adafactor, Adam, and Lion, on autoregressive language modeling. They conduct large-scale experiments across model sizes (150M to 1.2B parameters) and hyperparameter settings to measure both the best attainable performance of each optimizer and its robustness to hyperparameter choices.
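As a rough illustration of this kind of protocol, the sketch below runs a grid sweep over optimizers, model sizes, and learning rates and records the final validation loss for each cell. The `train_and_evaluate` helper and the specific grid values are assumptions for illustration, not the paper's actual configuration.

```python
# Hypothetical sweep harness; `train_and_evaluate` is an assumed helper
# that trains one model and returns its final validation loss.
import itertools

OPTIMIZERS = ["sgd", "adam", "adafactor", "lion"]
MODEL_SIZES = ["150m", "300m", "600m", "1.2b"]      # illustrative sizes
LEARNING_RATES = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]     # illustrative grid

def run_sweep(train_and_evaluate):
    """Record final validation loss for every (optimizer, size, lr) cell."""
    results = {}
    for opt, size, lr in itertools.product(OPTIMIZERS, MODEL_SIZES, LEARNING_RATES):
        results[(opt, size, lr)] = train_and_evaluate(opt, size, lr)
    # The best loss per optimizer/size reflects peak performance; the spread
    # of losses across learning rates is a proxy for hyperparameter stability.
    return results
```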
Key findings demonstrate that SGD generally underperforms compared to other optimizers both in terms of stability and final validation loss. Other algorithms, including Adam, Adafactor, and Lion, show comparable performance, challenging the assumption that Adam is universally superior. This equivalence holds across multiple scales and two transformer architecture variants, suggesting that decisions about which optimizer to use can be influenced by practical concerns such as computational efficiency and ease of implementation rather than strict performance metrics.
Dissecting Optimizer Components
To understand the underlying factors contributing to the performance and stability of these optimizers, the authors introduce two variants of Adam—Signum and Adalayer.
- Signum: Signum is a simplified version of Adam that updates parameters using the sign of the gradient momentum. The empirical results show that Signum can recover both the performance and the hyperparameter stability of Adam, indicating that a significant part of Adam's advantage comes from its sign-based, momentum-driven update (a minimal update-rule sketch follows this list).
- Adalayer: Adalayer is a layerwise variant of Adam designed to examine the impact of preconditioning. The investigation reveals that the benefits of preconditioning in Adam are most pronounced in the last layer and LayerNorm parameters. Surprisingly, other parameters can be trained effectively using vanilla SGD, provided that an adaptive mechanism is applied to the last layer and LayerNorms.
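The sketch below shows a Signum-style step, assuming the common signSGD-with-momentum formulation (exponential moving average of gradients, signed step, decoupled weight decay); the paper's exact variant may differ in details such as the momentum convention.

```python
import torch

@torch.no_grad()
def signum_step(params, momenta, lr=1e-3, beta=0.9, weight_decay=0.0):
    """One Signum update: step in the direction of the sign of the momentum
    buffer. Minimal sketch; real implementations add LR schedules etc."""
    for p, m in zip(params, momenta):
        if p.grad is None:
            continue
        m.mul_(beta).add_(p.grad, alpha=1 - beta)   # EMA of gradients
        if weight_decay != 0.0:
            p.add_(p, alpha=-lr * weight_decay)      # decoupled weight decay
        p.add_(torch.sign(m), alpha=-lr)             # signed-momentum step
```

Adam differs from this rule by additionally dividing the momentum by a per-coordinate running estimate of the squared gradient; Signum drops that second-moment preconditioning entirely, which is what makes its ability to match Adam informative.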
Implications
These findings have several crucial implications:
- Practical Optimization: The results suggest that in practical settings, the choice of optimizers should consider computational and memory constraints rather than assuming Adam's superiority.
- Optimizer Design: The comparable performance of Adafactor and Lion with Adam indicates that more efficient optimizers can be designed without significantly compromising performance. This aligns with the trend towards designing scalable and efficient training algorithms.
- Layerwise Adaptivity: The insight that most layers in a transformer model can be effectively trained with SGD, except the last layer and LayerNorm parameters, opens up new avenues for hybrid optimizer strategies (see the sketch after this list). Such strategies could offer a trade-off between stability, performance, and computational efficiency.
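The following is a minimal sketch of such a hybrid scheme in PyTorch: an adaptive optimizer for the last layer and LayerNorm parameters, plain SGD with momentum for everything else. The module-name heuristic (an output head named `lm_head`) and the learning rates are illustrative assumptions, not the paper's Adalayer recipe.

```python
import torch
import torch.nn as nn

def build_hybrid_optimizers(model: nn.Module, sgd_lr=0.1, adam_lr=1e-3):
    """Route LayerNorm and output-head parameters to Adam, the rest to SGD."""
    adaptive_params, sgd_params = [], []
    for name, module in model.named_modules():
        for pname, p in module.named_parameters(recurse=False):
            full_name = f"{name}.{pname}" if name else pname
            # "lm_head" is an assumed name for the final output layer.
            if isinstance(module, nn.LayerNorm) or "lm_head" in full_name:
                adaptive_params.append(p)
            else:
                sgd_params.append(p)
    adam = torch.optim.Adam(adaptive_params, lr=adam_lr)
    sgd = torch.optim.SGD(sgd_params, lr=sgd_lr, momentum=0.9)
    # In the training loop, call .step() and .zero_grad() on both optimizers.
    return adam, sgd
```

The only change relative to a standard setup is the bookkeeping of which parameters go to which optimizer; the memory saving comes from dropping Adam's per-parameter state for the bulk of the network.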
Future Directions
Future research could explore the following avenues:
- Broader Architecture Sweep: Extending this analysis to other architectures and tasks (e.g., masked language modeling, fine-tuning) would provide a more holistic view of optimizer performance.
- 2D Hyperparameter Interactions: Investigating interactions between multiple hyperparameters (e.g., batch size and learning rate) would yield deeper insights into effective hyperparameter tuning strategies.
- Adaptive Metrics: Developing metrics to dynamically adjust hyperparameters based on training feedback could lead to more robust and adaptive optimization techniques.
In conclusion, by rigorously comparing multiple optimizers and dissecting their components, this paper challenges the prevailing notion about the superiority of Adam in LLM training. The insights from this paper can guide practical decisions in model training and stimulate future research in developing more efficient and adaptive optimization algorithms.