
No More Adam: Learning Rate Scaling at Initialization is All You Need (2412.11768v2)

Published 16 Dec 2024 in cs.LG and cs.AI

Abstract: In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI performs learning rate Scaling at Initialization (SaI) to distinct parameter groups, guided by their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning rates without relying on adaptive second-order momentum, SGD-SaI helps prevent training imbalances from the very first iteration and cuts the optimizer's memory usage by half compared to AdamW. Despite its simplicity and efficiency, SGD-SaI consistently matches or outperforms AdamW in training a variety of Transformer-based tasks, effectively overcoming a long-standing challenge of using SGD for training Transformers. SGD-SaI excels in ImageNet-1K classification with Vision Transformers (ViT) and GPT-2 pretraining for large language models (LLMs, transformer decoder-only), demonstrating robustness to hyperparameter variations and practicality for diverse applications. We further tested its robustness on tasks like LoRA fine-tuning for LLMs and diffusion models, where it consistently outperforms state-of-the-art optimizers. From a memory efficiency perspective, SGD-SaI achieves substantial memory savings for optimizer states, reducing memory usage by 5.93 GB for GPT-2 (1.5B parameters) and 25.15 GB for Llama2-7B compared to AdamW in full-precision training settings.

Summary

  • The paper presents SGD-SaI, a novel SGD variant that leverages gradient signal-to-noise ratios at initialization to eliminate the need for adaptive moment estimates.
  • It achieves up to 50% memory savings by removing second-order momentum, making it highly scalable for large models like GPT-2 and Vision Transformers.
  • Empirical evaluations on diverse vision and language tasks show that SGD-SaI consistently matches or exceeds the performance of traditional adaptive optimizers such as AdamW.

Overview of "No More Adam: Learning Rate Scaling at Initialization is All You Need"

The paper "No More Adam: Learning Rate Scaling at Initialization is All You Need" presents a novel approach to optimizing deep neural networks by challenging the common reliance on adaptive gradient methods such as Adam and AdamW. The authors propose an enhancement to Stochastic Gradient Descent with Momentum (SGDM) called Learning Rate Scaling at Initialization (SGD-SaI), which recalibrates learning rates based on the gradient's signal-to-noise ratio (g-SNR) without employing adaptive second-order momentum. This method is specifically designed to improve memory efficiency and computational cost, and its efficacy is demonstrated across various applications, notably surpassing AdamW on tasks involving Transformer models like Vision Transformers (ViT) and GPT-2.

Key Contributions

  1. Gradient Signal-to-Noise Ratio (g-SNR): The paper introduces the use of g-SNR to guide the assignment of learning rates to different parameter groups at the outset of training. This preemptive adjustment aims to counteract learning imbalances from the very first training iteration, offering a strategic alternative to the per-step adaptation in conventional optimizers (a minimal sketch of this idea follows the list).
  2. Reduction in Memory Usage: By forgoing the need to store and update second-order momentum terms, SGD-SaI significantly reduces memory requirements. This is particularly beneficial as model sizes escalate, such as in GPT-2 (1.5 billion parameters) and Llama2-7B, where optimizer-state memory savings of roughly 50% are reported compared to AdamW in full-precision training.
  3. Empirical Efficacy Across Tasks: The paper reports robust performance across various vision and language tasks, including ImageNet-1K classification with ViTs and LLM pre-training. The method's performance remains consistent even when hyperparameters are varied, demonstrating both robustness and practical applicability in diverse settings.
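
The snippet below is a minimal PyTorch sketch of this idea, not the authors' implementation: it runs a single backward pass at initialization, computes a per-tensor g-SNR, freezes the resulting learning-rate scales, and then trains with plain SGDM. The g-SNR formula and the normalization by the largest group value are illustrative assumptions, and loss_fn and batch are hypothetical placeholders; the paper's exact definitions may differ.

```python
# Hypothetical sketch of SGD-SaI-style learning-rate scaling at initialization.
# The g-SNR formula and normalization below are assumptions for illustration;
# see arXiv:2412.11768 for the authors' exact formulation.
import torch


def gsnr(grad: torch.Tensor, eps: float = 1e-8) -> float:
    """Assumed g-SNR proxy: RMS gradient magnitude relative to the elementwise
    standard deviation within one parameter tensor."""
    g = grad.flatten().float()
    return (g.norm() / (g.std() * g.numel() ** 0.5 + eps)).item()


def scales_at_init(model, loss_fn, batch):
    """One backward pass at initialization; derive a fixed per-tensor scale
    (assumed normalization: divide by the largest g-SNR across tensors)."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    snrs = {n: gsnr(p.grad) for n, p in model.named_parameters() if p.grad is not None}
    max_snr = max(snrs.values())
    return {n: s / max_snr for n, s in snrs.items()}


def build_sgd_sai(model, loss_fn, batch, base_lr=1e-3, momentum=0.9, weight_decay=0.01):
    """Plain SGD with momentum; the only optimizer state is the momentum buffer,
    which is where the roughly 50% saving over AdamW's two moments comes from."""
    scales = scales_at_init(model, loss_fn, batch)
    groups = [{"params": [p], "lr": base_lr * scales[n]}
              for n, p in model.named_parameters() if n in scales]
    return torch.optim.SGD(groups, lr=base_lr, momentum=momentum, weight_decay=weight_decay)
```

After construction, the optimizer is used exactly like standard SGDM; the per-group scales stay fixed for the rest of training, so no per-step gradient statistics need to be maintained.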

Performance and Implications

The introduction of g-SNR effectively bypasses the need for adaptive gradient methods, whose substantial memory overhead comes from storing first- and second-order gradient moments. The authors show that fixed learning-rate scaling based on initial gradient statistics is sufficient to match or even improve upon the performance of established optimizers like AdamW. This finding could streamline training pipelines, reducing resource demands and potentially accelerating training by relying on simpler yet effective optimizers.
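
As a rough illustration of where the savings come from, the sketch below estimates optimizer-state memory under simplifying assumptions: states stored in fp32, two moment tensors per parameter for AdamW versus one momentum buffer for SGDM, and approximate parameter counts (1.5B for GPT-2, about 6.74B for Llama2-7B). The resulting figures land close to the 5.93 GB and 25.15 GB savings reported in the paper.

```python
# Back-of-the-envelope optimizer-state footprint in full precision (4 bytes per
# element). AdamW stores first and second moments (2 states per parameter);
# SGD with momentum stores only the momentum buffer (1 state per parameter).
def state_gib(num_params: float, states_per_param: int, bytes_per_elem: int = 4) -> float:
    return num_params * states_per_param * bytes_per_elem / 2**30

for name, n in [("GPT-2 (1.5B)", 1.5e9), ("Llama2-7B", 6.74e9)]:
    adamw = state_gib(n, 2)  # m and v
    sgdm = state_gib(n, 1)   # momentum only
    print(f"{name}: AdamW ~{adamw:.1f} GiB, SGDM ~{sgdm:.1f} GiB, saving ~{adamw - sgdm:.1f} GiB")
```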

Future Directions

While the results are promising, further exploration is warranted to confirm the scalability of SGD-SaI to even larger models and more complex tasks. Additionally, integrating this method into current deep learning frameworks could facilitate broader adoption. The approach suggests a shift towards simpler, more memory-efficient optimization schemes that retain competitive performance, potentially setting a precedent for future research in optimization techniques for large-scale models.

Overall, "No More Adam: Learning Rate Scaling at Initialization is All You Need" presents a compelling case for revisiting the fundamentals of gradient optimization by emphasizing efficiency and effectiveness in hyperparameter adjustment without heavily relying on adaptive mechanisms. The implications for both theoretical understanding and practical application in the field of AI optimization are substantial, warranting further investigation and potential industry adoption.
