- The paper introduces SWAN, an optimizer that preprocesses SGD's raw gradients with two operators, GradNorm and GradWhitening, and matches Adam's performance on LLM training.
- It reports roughly a 50% reduction in memory compared to Adam and more than a two-fold speedup in convergence on LLaMA models up to 1.3B parameters.
- The authors provide theoretical analysis suggesting that SWAN's convergence rate is independent of the Hessian's condition number, while GradNorm stabilizes the gradient distribution against time-varying noise.
An Examination of "SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction"
The paper "SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction" presents a novel approach to optimizing LLMs by introducing the SWAN optimizer. This method aims to address the memory and performance limitations of traditional optimizers like Adam by enhancing stochastic gradient descent (SGD) with innovative preprocessing techniques.
Key Contributions
SWAN Optimizer Design:
The SWAN optimizer enhances SGD by applying two preprocessing operations to the raw gradient: GradNorm and GradWhitening. The first stabilizes the gradient distribution, while the second neutralizes the local curvature of the loss landscape. Because both operations act on the current gradient alone, SWAN is stateless: it does not maintain the first- and second-moment buffers that Adam requires, which is where its memory savings come from, while still achieving comparable or superior performance to adaptive methods (a minimal sketch of the two operators follows the list below).
- GradNorm: This operator standardizes the gradient matrix across its output dimensions, in a manner reminiscent of layer normalization but applied to gradients rather than activations. The resulting scale invariance keeps update magnitudes consistent across layers and over the course of training.
- GradWhitening: This operator whitens the gradient by its own covariance, producing an orthogonalized update whose directions are equally weighted. The optimization therefore becomes far less sensitive to ill-conditioned loss landscapes and to heterogeneous scaling across the gradient space, playing a role analogous to second-order methods without their cost.
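To make the two operators concrete, here is a minimal PyTorch-style sketch of the preprocessing pipeline as described above. The function names (grad_norm, grad_whitening, swan_step), the row-wise normalization layout, and the eigendecomposition-based whitening are illustrative assumptions rather than the authors' implementation, which presumably uses a cheaper iterative scheme for the matrix inverse square root.

```python
import torch

def grad_norm(G: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Standardize each row of a 2-D gradient matrix to zero mean and unit std,
    # analogous to layer normalization but applied to gradients.
    mean = G.mean(dim=1, keepdim=True)
    std = G.std(dim=1, keepdim=True)
    return (G - mean) / (std + eps)

def grad_whitening(G: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Whiten the gradient: (G G^T + eps I)^(-1/2) @ G. An explicit
    # eigendecomposition is used here for readability only.
    n = G.shape[0]
    cov = G @ G.T + eps * torch.eye(n, dtype=G.dtype, device=G.device)
    evals, evecs = torch.linalg.eigh(cov)
    inv_sqrt = evecs @ torch.diag(evals.rsqrt()) @ evecs.T
    return inv_sqrt @ G

@torch.no_grad()
def swan_step(param: torch.Tensor, lr: float) -> None:
    # Stateless SWAN-style step: preprocess the raw gradient, then take a
    # plain SGD step. No momentum or second-moment buffers are stored.
    update = grad_whitening(grad_norm(param.grad))
    param.add_(update, alpha=-lr)
```

Because the update is a pure function of the current gradient, the only per-parameter memory beyond the weights is the gradient itself, which is the source of the reported savings relative to Adam's two moment buffers.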
Empirical Evaluation
The paper reports empirical evaluations on LLaMA models ranging from 60M to 1.3B parameters. The authors claim that SWAN reduces memory overhead by approximately 50% relative to Adam while converging faster, achieving more than a two-fold speedup in some settings. In particular, SWAN reaches the same target perplexity values as Adam using significantly fewer training tokens.
Theoretical Insights
The theoretical analysis suggests that SWAN's robust convergence rests on structured assumptions about the Hessian that are argued to arise in transformer training dynamics. Under these structural conditions, the GradWhitening operator is shown to be equivalent to a non-diagonal second-order update, capturing much of the benefit of second-order methods while avoiding their computational and memory penalties.
- Condition Number Independence: The convergence rate of SWAN is shown to be independent of the condition number of the Hessian, addressing a key weakness of SGD on ill-conditioned problems (an illustrative numerical check follows this list).
- Stability Across Time: The analysis also suggests that GradNorm stabilizes the gradient distribution over time, effectively removing the time-varying noise that complicates plain SGD.
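The curvature-insensitivity claim can be made intuitive with a small numerical check: however an ill-conditioned landscape rescales the raw gradient, the whitened gradient (the gradient multiplied by the inverse square root of its own row covariance) has near-identity row covariance, so every direction receives a comparable step. The snippet below is an illustrative sanity check of that algebraic property under a synthetic scaling that stands in for curvature effects; it is not a reproduction of the paper's analysis.

```python
import torch

torch.manual_seed(0)

# A random gradient whose rows are distorted by a synthetic, badly scaled
# factor spanning four orders of magnitude (a stand-in for ill conditioning).
G = torch.randn(8, 64, dtype=torch.float64)
scales = torch.logspace(-2, 2, steps=8, dtype=torch.float64)
G_ill = scales.unsqueeze(1) * G

def whiten(G: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    # (G G^T + eps I)^(-1/2) @ G via eigendecomposition, for illustration only.
    cov = G @ G.T + eps * torch.eye(G.shape[0], dtype=G.dtype)
    evals, evecs = torch.linalg.eigh(cov)
    return evecs @ torch.diag(evals.rsqrt()) @ evecs.T @ G

print(G_ill.norm(dim=1))  # row norms spread over ~4 orders of magnitude
W = whiten(G_ill)
print(W.norm(dim=1))      # all row norms are ~1 after whitening
print(torch.allclose(W @ W.T, torch.eye(8, dtype=W.dtype), atol=1e-6))  # True
```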
Implications and Future Directions
The introduction of SWAN carries significant implications for both practice and theory in LLM optimization. Practically, the reduced memory overhead makes training feasible on hardware with limited resources, potentially democratizing access to high-performance LLM training. Theoretically, SWAN's approach of reshaping the optimization landscape on the fly, without the heavy machinery of full second-order methods, opens new avenues for research into efficient, scalable neural network training.
Future Work: Future research might refine SWAN's preprocessing operators, conduct broader hyperparameter studies, and assess how well the method generalizes across model architectures. Investigating the interplay between SWAN and sparse or quantized networks could further improve resource efficiency.
In sum, the SWAN optimizer marks a pragmatic shift toward more efficient optimization in large-scale machine learning, offering both practical and theoretical benefits that could shape the evolution of training algorithms for LLMs.