When Can You Get Away with Low Memory Adam? (2503.01843v3)

Published 3 Mar 2025 in cs.LG, cond-mat.dis-nn, and stat.ML

Abstract: Adam is the go-to optimizer for training modern machine learning models, but it requires additional memory to maintain the moving averages of the gradients and their squares. While various low-memory optimizers have been proposed that sometimes match the performance of Adam, their lack of reliability has left Adam as the default choice. In this work, we apply a simple layer-wise Signal-to-Noise Ratio (SNR) analysis to quantify when second-moment tensors can be effectively replaced by their means across different dimensions. Our SNR analysis reveals how architecture, training hyperparameters, and dataset properties impact compressibility along Adam's trajectory, naturally leading to $\textit{SlimAdam}$, a memory-efficient Adam variant. $\textit{SlimAdam}$ compresses the second moments along dimensions with high SNR when feasible, and leaves them uncompressed when compression would be detrimental. Through experiments across a diverse set of architectures and training scenarios, we show that $\textit{SlimAdam}$ matches Adam's performance and stability while saving up to $98\%$ of total second moments. Code for $\textit{SlimAdam}$ is available at https://github.com/dayal-kalra/low-memory-adam.

Summary

  • The paper introduces SlimAdam, a memory-efficient Adam variant that adaptively compresses second moments based on a novel layer-wise signal-to-noise ratio (SNR) analysis.
  • SlimAdam achieves performance and stability comparable to standard Adam while saving up to 98% of second moments, demonstrated across various architectures and tasks.
  • Experimental results show SlimAdam maintaining stable training dynamics, outperforming other low-memory Adam variants like AdaLayer and Adam-mini, especially at high learning rates.

This paper introduces SlimAdam, a memory-efficient variant of Adam that adaptively compresses second moments along dimensions with high signal-to-noise ratio (SNR). The authors apply a layer-wise SNR analysis to quantify when second-moment tensors can be effectively replaced by their means across different dimensions, revealing how architecture, training hyperparameters, and dataset properties impact compressibility along Adam's trajectory.

  • The authors define $\text{SNR}_K(V_t) = \mathbb{E}_{K'} \left[\frac{\left(\mathbb{E}_{K}[V_t]\right)^2}{\text{Var}_{K}[V_t]} \right]$ to quantify the feasibility of compression along dimensions $K$ of the second moment matrix $V \in \mathbb{R}^{\text{fan}_{\text{out}} \times \text{fan}_{\text{in}}}$ at step $t$, where $\mathbb{E}_{K}[\cdot]$ and $\text{Var}_{K}[\cdot]$ compute the mean and variance along the specified dimensions $K$, and $\mathbb{E}_{K'}[\cdot]$ averages the ratio over the remaining dimensions (a code sketch of this computation follows the list).
  • The proposed SlimAdam compresses the second moments along dimensions with high SNR when feasible and leaves them uncompressed when compression would be detrimental, matching Adam's performance and stability while saving up to $98\%$ of total second moments in a $\sim 124$M parameter GPT-style Transformer trained on language tasks.
  • Experimental results across diverse architectures (GPT-small, GPT-medium, Llama-3.2, ResNet-18, and ViT) and training scenarios (language pre-training, fine-tuning, and image classification) demonstrate that SlimAdam achieves Adam-level performance and stability, and exhibits more stable training dynamics at large learning rates compared to other low-memory Adam variants like AdaLayer and Adam-mini.
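
To make the SNR criterion concrete, here is a minimal sketch (in PyTorch-style Python) of how the layer-wise SNR could be computed and used to decide whether a second-moment tensor is compressible along a given set of dimensions. It is an illustrative reconstruction from the formula above, not the authors' implementation (the released SlimAdam code is at the linked repository); the function names, the epsilon guard, and the SNR threshold are assumptions.

```python
import torch

def layerwise_snr(v: torch.Tensor, dims: tuple[int, ...]) -> torch.Tensor:
    """SNR_K(V_t): squared mean over dims K divided by the variance over K,
    averaged over the remaining dimensions K'."""
    mean_k = v.mean(dim=dims)                   # E_K[V_t]
    var_k = v.var(dim=dims, unbiased=False)     # Var_K[V_t]
    ratio = mean_k.pow(2) / (var_k + 1e-12)     # small guard against zero variance (assumption)
    return ratio.mean()                         # E_{K'}[ . ]

def maybe_compress(v: torch.Tensor, dims: tuple[int, ...], threshold: float = 1.0) -> torch.Tensor:
    """Replace V_t by its mean along `dims` when the SNR is high; otherwise
    leave it uncompressed. The threshold value here is illustrative."""
    if layerwise_snr(v, dims) >= threshold:
        return v.mean(dim=dims, keepdim=True)   # compressed: one value per remaining index
    return v                                    # compression would be detrimental: keep full tensor

# Toy usage on a fan_out x fan_in second-moment matrix
v = torch.rand(256, 512)
v_row = maybe_compress(v, dims=(1,))   # compress along fan_in  -> shape (256, 1) if SNR is high
v_col = maybe_compress(v, dims=(0,))   # compress along fan_out -> shape (1, 512) if SNR is high
```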

List of variables:

  • $V$: second moment matrix
  • $K$: specified compression dimensions
  • $t$: step
  • $\mathbb{E}$: mean
  • $\text{Var}$: variance
  • $\text{SNR}$: signal-to-noise ratio
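
The memory saving comes from storing only the compressed second moment for layers whose SNR permits it. The sketch below shows one way such a mean-compressed second moment could enter an otherwise standard Adam-style update via broadcasting; it is a hypothetical illustration under the assumption that the statistic is accumulated directly in compressed form, not the SlimAdam implementation from the repository.

```python
import torch

def compressed_adam_step(p, grad, m, v_comp, t, lr=1e-3, beta1=0.9, beta2=0.999,
                         eps=1e-8, dims=(1,)):
    """One Adam-like step where the second moment is kept only as its mean along
    `dims`, e.g. a (fan_out, 1) vector instead of a full (fan_out, fan_in) matrix."""
    m.mul_(beta1).add_(grad, alpha=1 - beta1)                   # first moment, full size
    # Accumulate the mean of grad^2 along the compressed dimensions.
    v_comp.mul_(beta2).add_(grad.pow(2).mean(dim=dims, keepdim=True), alpha=1 - beta2)
    m_hat = m / (1 - beta1 ** t)                                # bias corrections
    v_hat = v_comp / (1 - beta2 ** t)
    p.add_(-lr * m_hat / (v_hat.sqrt() + eps))                  # v_hat broadcasts over `dims`
    return p, m, v_comp
```

For a 4096 x 4096 weight matrix, storing only the (4096, 1) mean reduces that layer's second-moment memory by a factor of 4096, which is where the reported savings of up to 98% of total second moments come from when most layers are compressible.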
