
APOLLO: SGD-like Memory, AdamW-level Performance (2412.05270v4)

Published 6 Dec 2024 in cs.LG, cs.AI, and cs.PF

Abstract: LLMs are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs. Extensive experiments demonstrate that the APOLLO series performs on-par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.

Summary

  • The paper introduces APOLLO, showing that low-rank gradient scaling and structured learning rate updates can achieve AdamW-level performance with SGD-level memory usage.
  • It demonstrates substantial throughput gains, with experiments showing roughly 3× higher training throughput, and enables pre-training of models as large as LLaMA-13B and LLaMA-7B on modest hardware.
  • The approach enables democratized LLM training by substantially reducing memory overhead, offering a scalable solution for resource-constrained environments.

Overview of APOLLO: SGD-like Memory, AdamW-level Performance

The paper "APOLLO: SGD-like Memory, AdamW-level Performance" introduces innovative solutions for enhancing the memory efficiency of training LLMs without sacrificing performance. The proposed approach, termed APOLLO, addresses the significant memory overhead associated with the widely-used AdamW optimizer, attributed to its maintenance of both first and second moments, which can triple the memory requirements.
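
To make that accounting concrete, here is a minimal NumPy sketch of one AdamW step (a standard textbook formulation, not code from the paper). The m and v buffers are full-size copies of the parameter shape, which is where the extra 2× of optimizer state comes from, and the division by sqrt(v_hat) is the element-wise learning-rate adaptation that APOLLO coarsens.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One decoupled-weight-decay AdamW step.
    m and v have the same shape as w, so optimizer state
    alone is 2x the weight memory."""
    m = b1 * m + (1 - b1) * g            # first moment
    v = b2 * v + (1 - b2) * g ** 2       # second moment
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    # element-wise adaptive step: each parameter gets its own
    # effective learning rate lr / (sqrt(v_hat) + eps)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v
```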

Core Contributions

The authors identify redundancy in AdamW's learning rate adaptation rule and leverage this observation to build the APOLLO optimizer. APOLLO approximates gradient scaling with structured updates (channel-wise or tensor-wise) that are highly tolerant of memory reduction, using low-rank auxiliary optimizer states obtained by pure random projection rather than costly SVD. The authors also propose APOLLO-Mini, an extreme rank-1 variant that reaches SGD-level memory costs while still outperforming AdamW in pre-training.

Key contributions of APOLLO include:

  • Structured Learning Rate Update: AdamW's element-wise learning rate is coarsened to a channel-wise or tensor-wise rule, which shrinks the optimizer state that must be stored.
  • Low-Rank Gradient Scaling: APOLLO approximates gradient scaling in a low-rank space using random projections, retaining memory efficiency while maintaining competitive performance (sketched in code after this list).
  • Minimal-Rank Tensor-Wise Gradient Scaling: APOLLO-Mini uses a rank-1 projection with a single tensor-wise scale, matching or exceeding AdamW's pre-training performance at SGD-like memory cost.
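
The following NumPy sketch shows one plausible reading of the core mechanism, based on the abstract's description: project the gradient into a low-rank space with a fixed random matrix, run Adam-style moments there, and use the ratio of Adam-scaled to raw column norms as a channel-wise learning-rate scale for the full gradient. Function and variable names are ours, and the exact scaling formula is an illustrative assumption, not the authors' code.

```python
import numpy as np

def apollo_scale(G, P, m, v, t, b1=0.9, b2=0.999, eps=1e-8):
    """Approximate AdamW's adaptive scaling with moments kept
    only in a low-rank space (illustrative sketch).

    G: full gradient, shape (n, k)
    P: fixed random projection, shape (r, n), r << n
    m, v: low-rank moment buffers, shape (r, k)"""
    R = P @ G                                  # low-rank gradient
    m = b1 * m + (1 - b1) * R
    v = b2 * v + (1 - b2) * R ** 2
    R_adam = (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    # channel-wise scale: how much Adam would stretch each column
    s = np.linalg.norm(R_adam, axis=0) / (np.linalg.norm(R, axis=0) + eps)
    return s * G, m, v                         # scaled full-rank update

# Hypothetical usage for one 4096x4096 layer at rank 256:
n, k, r = 4096, 4096, 256
P = np.random.randn(r, n) / np.sqrt(r)   # pure random projection, no SVD
m, v = np.zeros((r, k)), np.zeros((r, k))
update, m, v = apollo_scale(np.random.randn(n, k), P, m, v, t=1)
```

Under this reading, APOLLO-Mini is the r = 1 case with a single tensor-wise scale (Frobenius norms in place of per-column norms), which is why its optimizer state is effectively negligible.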

Numerical Results

Experiments show that APOLLO and APOLLO-Mini match or exceed AdamW's performance with substantially reduced memory requirements. Notably, the optimizers improve throughput by roughly 3× on an 8×A100-80GB setup by supporting 4× larger batch sizes. APOLLO also pre-trains LLaMA-13B with naive DDP on A100-80GB GPUs, without system-level optimizations. Additionally, APOLLO-Mini enables pre-training LLaMA-7B on a single GPU using less than 12 GB of memory when combined with weight quantization.
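
The single-GPU figure is plausible under simple assumptions; the arithmetic below is our back-of-envelope estimate, not numbers from the paper, and it ignores gradients and activations, which add further overhead.

```python
# Rough memory accounting for LLaMA-7B (our estimates, not the paper's).
params = 7e9

adamw_states_gb = 2 * params * 4 / 1e9   # fp32 m and v buffers: ~56 GB
weights_bf16_gb = params * 2 / 1e9       # ~14 GB, already above 12 GB
weights_int8_gb = params * 1 / 1e9       # ~7 GB with 8-bit weight quantization

# APOLLO-Mini keeps only rank-1 moments per layer, so its optimizer
# state is megabytes rather than tens of gigabytes: SGD-level memory.
print(adamw_states_gb, weights_bf16_gb, weights_int8_gb)  # 56.0 14.0 7.0
```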

Implications and Future Developments

The APOLLO series marks a significant advance in LLM training: it retains AdamW-level performance while drastically reducing memory overhead, with particular value for resource-constrained environments. Future work might further optimize the random projection matrices or extend the structured learning rate update to other stages of the training pipeline.

These results also hint at a broader implication: the feasibility of democratizing access to LLM training, allowing practitioners with limited computational resources to engage in large-scale model training that was previously restricted to those with high-end hardware.

In summary, APOLLO innovatively addresses a pressing challenge in the field of LLM training. By making efficient use of memory, it paves the way for sustained increases in model size and capability, while remaining accessible to a broader audience in the research community.
