SOAP: Improving and Stabilizing Shampoo using Adam (2409.11321v1)

Published 17 Sep 2024 in cs.LG and cs.AI

Abstract: There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor -- a memory-efficient approximation of Adam -- showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo's preconditioner. This insight leads to the design of a simpler and computationally efficient algorithm: $\textbf{S}$hampo$\textbf{O}$ with $\textbf{A}$dam in the $\textbf{P}$reconditioner's eigenbasis (SOAP). With regards to improving Shampoo's computational efficiency, the most straightforward approach would be to simply compute Shampoo's eigendecomposition less frequently. Unfortunately, as our empirical results show, this leads to performance degradation that worsens with this frequency. SOAP mitigates this degradation by continually updating the running average of the second moment, just as Adam does, but in the current (slowly changing) coordinate basis. Furthermore, since SOAP is equivalent to running Adam in a rotated space, it introduces only one additional hyperparameter (the preconditioning frequency) compared to Adam. We empirically evaluate SOAP on LLM pre-training with 360m and 660m sized models. In the large batch regime, SOAP reduces the number of iterations by over 40% and wall clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo. An implementation of SOAP is available at https://github.com/nikhilvyas/SOAP.

Summary

  • The paper introduces SOAP, an optimizer that runs Adam in Shampoo's eigenbasis to accelerate convergence for large-scale language models.
  • It establishes a formal equivalence between Shampoo (run with the 1/2 power) and Adafactor applied in the eigenbasis of Shampoo's preconditioner, yielding an algorithm with only one hyperparameter beyond Adam's.
  • Empirical results on 360M- and 660M-parameter models in the large-batch regime show SOAP reduces iterations by over 40% and wall-clock time by over 35% compared to AdamW, and by roughly 20% compared to Shampoo.

Overview of SOAP: Improving and Stabilizing Shampoo using Adam

The paper "SOAP: Improving and Stabilizing Shampoo using Adam" tackles the issue of optimization efficiency in the training of large-scale machine learning models, particularly LLMs. The authors propose an innovative optimization algorithm termed SOAP (ShampoO with Adam in the Preconditioner's eigenbasis), which builds on the Shampoo optimizer by integrating it with ideas from Adam and Adafactor. SOAP is notably more efficient both in terms of computational time and steps required to reach convergence.

Key Contributions

  1. Theoretical Connections Between Optimizers: The authors derive an equivalence between a modified version of Shampoo (with a power of 1/2) and the Adafactor algorithm when the latter is utilized in the eigenbasis of the Shampoo preconditioner. This conceptual leap allows the authors to reframe the optimization problem and utilize Adam's running averages within a rotated space.
  2. Introduction of SOAP Algorithm: Based on the above theoretical insight, the SOAP algorithm is introduced. It runs Adam in the eigenbasis provided by Shampoo's preconditioner, recomputing that eigenbasis only periodically, which keeps the method computationally feasible while requiring fewer hyperparameters than Shampoo (see the sketch after this list). SOAP introduces only a single additional hyperparameter, the preconditioning frequency, relative to Adam, streamlining the tuning process.
  3. Empirical Evaluation: The paper presents rigorous empirical evaluations demonstrating that SOAP outperforms both Shampoo and AdamW on LLM pre-training. For 360-million and 660-million parameter models in the large-batch regime, SOAP reduces the number of iterations by over 40% and wall-clock time by over 35% compared to AdamW, and offers approximately 20% improvements in both metrics compared to Shampoo.
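
To make the rotated-Adam update concrete, below is a minimal NumPy sketch of one SOAP step for a single matrix-shaped parameter. It is an illustration of the mechanism, not the reference implementation (see the linked repository for that): the function and state names are made up for this sketch, bias correction is omitted, and the re-projection of Adam's moments when the eigenbasis is refreshed is left out for brevity.

```python
import numpy as np

def init_state(m, n):
    """Illustrative optimizer state for an m-by-n parameter."""
    return {"L": np.zeros((m, m)), "R": np.zeros((n, n)),   # Kronecker-factor statistics
            "QL": np.eye(m), "QR": np.eye(n),               # their eigenbases
            "M": np.zeros((m, n)), "V": np.zeros((m, n)),   # Adam moments in the rotated basis
            "t": 0}

def soap_step(W, G, state, lr=3e-3, b1=0.95, b2=0.95, eps=1e-8, precond_freq=10):
    """One illustrative SOAP update for parameter W with gradient G."""
    # Accumulate Shampoo-style Kronecker-factor statistics (as running averages here).
    state["L"] = b2 * state["L"] + (1 - b2) * (G @ G.T)
    state["R"] = b2 * state["R"] + (1 - b2) * (G.T @ G)

    # Refresh the eigenbasis only every `precond_freq` steps; between refreshes,
    # Adam's moments keep being updated in the slowly changing basis.
    if state["t"] % precond_freq == 0:
        _, state["QL"] = np.linalg.eigh(state["L"])
        _, state["QR"] = np.linalg.eigh(state["R"])
    state["t"] += 1

    # Rotate the gradient into the current eigenbasis.
    Gr = state["QL"].T @ G @ state["QR"]

    # Standard Adam moment updates, applied to the rotated gradient.
    state["M"] = b1 * state["M"] + (1 - b1) * Gr
    state["V"] = b2 * state["V"] + (1 - b2) * Gr ** 2

    # Adam step in the rotated space, then rotate the update back to parameter space.
    step = state["M"] / (np.sqrt(state["V"]) + eps)
    return W - lr * (state["QL"] @ step @ state["QR"].T)
```

A larger `precond_freq` amortizes the eigendecomposition cost over more steps; this frequency is the one hyperparameter SOAP adds on top of Adam's.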

Implications

Practical Implications: The practical benefits of adopting SOAP are substantial. In high-performance computing environments where training large-scale LLMs involves significant time and cost, the efficiency gains offered by SOAP can translate into meaningful savings and faster model iteration cycles. Because the extra computational overhead is kept modest by amortizing the eigendecompositions, the optimizer remains practical in real-world settings where compute budgets are tight.

Theoretical Implications: The equivalence shown between Shampoo with power 1/2 and Adafactor in the preconditioner's eigenbasis bridges a notable gap between higher-order preconditioning methods and memory-efficient approximations. This lays the groundwork for potential new optimizers that leverage properties of multiple existing algorithms, thereby enriching the optimization theory landscape.
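
The algebra behind this equivalence can be sketched in standard notation (not necessarily the paper's). Writing the eigendecompositions $L = Q_L \Lambda_L Q_L^\top$ and $R = Q_R \Lambda_R Q_R^\top$ with eigenvalue vectors $\lambda_L$ and $\lambda_R$, the power-1/2 Shampoo update factors as

$$L^{-1/2} G R^{-1/2} = Q_L \Big[ \big(Q_L^\top G Q_R\big) \oslash \sqrt{\lambda_L \lambda_R^\top} \Big] Q_R^\top,$$

where $\oslash$ and the square root act elementwise. The rank-one matrix $\lambda_L \lambda_R^\top$ plays the same role as Adafactor's factored second-moment estimate, but for the rotated gradient $Q_L^\top G Q_R$; this is the sense in which power-1/2 Shampoo amounts to Adafactor in the preconditioner's eigenbasis, and replacing the rank-one estimate with Adam's full elementwise second moment in that basis yields SOAP.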

Future Directions

  1. Scalability and Precision: Further research could explore the scalability of SOAP to even larger models and datasets. Incorporating lower-precision arithmetic for storing and updating preconditioners, as hinted in the paper, can help improve both time and space efficiency.
  2. Generalization to Other Domains: While the paper focuses on LLMs, extending the applicability of SOAP to other domains such as image recognition or reinforcement learning can be a valuable next step.
  3. Device-Specific Optimizations: Implementations tailored to specific hardware architectures, such as GPUs and TPUs, can maximize the performance benefits of SOAP. Future work could focus on optimizing SOAP's implementation for such platforms to fully leverage their computational capabilities.

Conclusion

In summary, the SOAP algorithm represents a meaningful step forward in optimization for deep learning, combining the strengths of several existing optimizers while mitigating their individual shortcomings. The empirical results substantiate its efficacy in reducing both training time and the number of iterations, making it an attractive option for practitioners. The conceptual foundation laid by connecting Shampoo and Adafactor opens up new avenues for future exploration in optimization algorithms.
