Scaling Laws for Adam Optimizers
- The paper presents a novel scaling law that quantifies the rise and decline of the optimal learning rate as a function of batch size and gradient noise characteristics in Adam-style optimizers.
- It uses theoretical tools such as Taylor expansion and SDE approximations to derive a non-linear "surge" phenomenon distinct from traditional SGD scaling.
- Empirical validations across CNNs, Transformers, and NLP benchmarks provide actionable guidelines for hyperparameter tuning and resource-efficient training.
A new scaling law for Adam-style optimizers refers to mathematical and empirical prescriptions which describe how optimizer hyperparameters and update dynamics—such as learning rate, batch size, damping constants, adaptive scaling, and architectural factors—should be adjusted as a function of model size, data size, iteration count, or architectural details to achieve optimal or robust convergence, stability, and generalization in modern deep learning. This topic has recently seen theoretical, algorithmic, and large-scale empirical advances, reflecting the centrality of Adam-type methods in the training of large neural networks.
1. Distinctive Mechanism of Adam-Style Scaling
Adam-style optimizers differ fundamentally from stochastic gradient descent (SGD) in their coupling of update scaling to adaptive moment estimation, most notably through per-parameter normalization by bias-corrected running averages of squared gradients ($\hat{v}_t$) augmented by a stability term ($\epsilon$). Standard Adam produces updates of the form
$$\theta_{t+1} = \theta_t - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$
where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first and second moment estimates, and $\epsilon$ is a small constant mainly for numerical safety. Recent research has demonstrated that the update's behavior, including its scaling with respect to batch size, learning rate, and model dimension, is distinct from SGD due to the "sign-like" operation inherent in the update formula. For Adam-style optimizers, the update is approximately
$$\Delta\theta_t \approx -\eta\,\mathrm{sign}(g_t),$$
where $g_t$ is a stochastic (mini-batch) estimate of the gradient. This produces a non-linear dependence of the optimal learning rate on the batch size, contrasting with the linear or power-law scaling typical of SGD (Li et al., 23 May 2024).
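For concreteness, here is a minimal PyTorch-style sketch of a single Adam step and of its sign-like approximation; the function names and argument defaults are illustrative rather than taken from any cited implementation.

```python
import torch

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: per-parameter normalization by the bias-corrected second
    moment, with eps as a numerical-safety term. `t` is the 1-indexed step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (EMA of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (EMA of squared gradients)
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (v_hat.sqrt() + eps)
    return param, m, v

def sign_step(param, grad, lr=1e-3):
    """Sign-like approximation of the Adam update direction."""
    return param - lr * torch.sign(grad)
```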
2. Derivation and Formulation of the New Scaling Law
Comprehensive recent theoretical work derives the optimal learning rate for Adam-type optimizers as a function of batch size and gradient statistics. The analysis, using a Taylor expansion of the loss and an explicit computation of the expectation and variance of the signed update $\mathrm{sign}(g_t)$, yields
$$\eta_{\text{opt}}(B) \;\propto\; \frac{\sqrt{B}}{1 + B/\mathcal{B}_{\text{noise}}},$$
where $\mathcal{B}_{\text{noise}}$ is a characteristic noise scale governed by the gradient's covariance and Hessian structure, and the peak occurs at $B = \mathcal{B}_{\text{noise}}$. This law predicts a "surge phenomenon": as $B$ increases, $\eta_{\text{opt}}$ initially increases roughly as $\sqrt{B}$, reaches a maximum at the noise scale, and then declines or saturates for large $B$. This non-monotonicity is absent in SGD, for which the optimal learning rate typically scales linearly or as a simple power law in $B$ (Li et al., 23 May 2024).
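A small sketch of this batch-size dependence follows; the helper name `surge_optimal_lr` and the reference constant `eta_ref` are illustrative assumptions, not values from the cited work.

```python
import math

def surge_optimal_lr(batch_size: float, b_noise: float, eta_ref: float = 1.0) -> float:
    """Optimal learning rate under the surge law: grows roughly as sqrt(B) for
    B << b_noise, peaks near B = b_noise, and declines for B >> b_noise."""
    return eta_ref * math.sqrt(batch_size) / (1.0 + batch_size / b_noise)

# Example: with b_noise = 4096, the predicted optimum rises up to B ~ 4096
# and declines thereafter.
lrs = {b: surge_optimal_lr(b, b_noise=4096) for b in (256, 1024, 4096, 16384, 65536)}
```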
Table: Differences in Learning Rate Scaling Laws
| Optimizer | $\eta_{\text{opt}}(B)$ scaling law | Peak/Surge Behavior |
|---|---|---|
| SGD | $\eta_{\text{opt}} \propto B^{\alpha}$ (usually $\alpha = 1$ or $0.5$) | Monotonic increase |
| Adam-style | $\eta_{\text{opt}} \propto \sqrt{B}\,/\,(1 + B/\mathcal{B}_{\text{noise}})$ | Surge: rise then decline |
This law has been experimentally validated across CNNs, ResNets, and Transformer architectures on vision and NLP benchmarks. The batch size at which the optimal learning rate peaks increases as the loss decreases during training.
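The noise scale that positions the peak can be tracked during training from gradient statistics. Below is a minimal sketch, assuming flattened per-example gradients are available as a `(batch, num_params)` tensor; the function name `simple_noise_scale` and the use of the simple proxy $\mathrm{tr}(\Sigma)/\lVert g\rVert^2$ in place of the exact $\mathcal{B}_{\text{noise}}$ (which also involves Hessian structure) are illustrative assumptions.

```python
import torch

def simple_noise_scale(per_example_grads: torch.Tensor) -> float:
    """Estimate a simple gradient noise scale tr(Sigma) / ||g||^2 from a
    (batch, num_params) tensor of flattened per-example gradients.
    This is only a proxy for the B_noise appearing in the surge law."""
    g_mean = per_example_grads.mean(dim=0)                       # mean gradient g
    trace_sigma = per_example_grads.var(dim=0, unbiased=True).sum()  # ~ tr(Sigma)
    return (trace_sigma / g_mean.pow(2).sum()).item()

# As the loss (and hence the mean gradient norm) decreases during training,
# this estimate grows, consistent with the peak batch size shifting upward.
```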
3. Algorithmic and Architectural Factors Influencing Scaling
Recent advances reveal that scaling laws for Adam-style optimizers are sensitively dependent on hyperparameter settings, architectural details, and normalization mechanisms. For example:
- Epsilon Placement: Placing $\epsilon$ inside the second moment accumulator (as in EAdam) introduces effective "pre-damping" and produces an adaptively scaled stabilization term, changing the denominator from $\sqrt{\hat{v}_t} + \epsilon$ to approximately $\sqrt{\hat{v}_t + \epsilon/(1-\beta_2)}$ and modifying the scaling law, especially near convergence (Yuan et al., 2020); see the sketch after this list.
- Layerwise or Architecture-Aware Scaling: Techniques such as SET-Adam and CaAdam incorporate layerwise statistics or connection counts to compress the variance of per-parameter stepsizes towards that of SGD+momentum, thereby improving generalization (Zhang, 2023, Genet et al., 31 Oct 2024).
- Per-Layer and Scale-Invariant Updates: Adjusting learning rates for each layer according to parameter alignment exponents or width (e.g., scaling hidden- and readout-layer learning rates with negative powers of the layer width while keeping embedding-layer rates roughly constant in standard parameterizations) yields robust hyperparameter transfer across varying model sizes (Everett et al., 8 Jul 2024).
- Role of Epsilon Underflow: At very large scale, the constant $\epsilon$ can become non-negligible relative to $\sqrt{\hat{v}_t}$ as gradient norms diminish, leading to numerical underflow of the adaptive normalization and prompting the introduction of the Adam-atan2 optimizer, which is scale-invariant and eliminates explicit dependence on $\epsilon$ (Everett et al., 8 Jul 2024).
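A minimal sketch contrasting the two epsilon placements is shown below (standard Adam versus an EAdam-style accumulator); the function names are illustrative, and the accumulated-epsilon form is an approximation of the published update.

```python
import torch

def adam_denominator(v_hat: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standard Adam: eps added outside the square root."""
    return v_hat.sqrt() + eps

def eadam_denominator(v_hat: torch.Tensor, eps: float = 1e-8,
                      beta2: float = 0.999) -> torch.Tensor:
    """EAdam-style placement: eps accumulated inside the second-moment estimate,
    which behaves (approximately, after bias correction) like adding
    eps / (1 - beta2) under the square root, i.e. an adaptively scaled
    stabilization term."""
    return (v_hat + eps / (1.0 - beta2)).sqrt()
```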
4. Theoretical and Empirical Justification
The scaling law is anchored in both theoretical analyses and large-scale empirical studies. Theoretical derivations employ:
- Taylor expansion and Gaussian approximations for the signed update (Li et al., 23 May 2024).
- SDE approximations that rigorously relate discrete updates to continuous stochastic processes, revealing that the continuous time covered per Adam step contracts with $\eta^2$ and not $\eta$ as in SGD, implying that hyperparameters such as $\beta_1$, $\beta_2$, the learning rate, and $\epsilon$ require coordinated rescaling as a function of batch size (Malladi et al., 2022).
- Generalization of escape times from local minima using SDE and Lévy process analysis, which also exposes how modifications to second-moment estimation or momentum (as in AdaMomentum) affect the preference for flatter minima (Wang et al., 2021).
On the empirical side, exhaustive grid searches on thousands of LLMs (including dense Transformers and Mixture-of-Experts) yield power-law fits for the optimal learning rate and batch size of the form
$$\eta_{\text{opt}}(N, D) \propto N^{-\alpha} D^{\beta}, \qquad B_{\text{opt}}(D) \propto D^{\gamma},$$
with $N$ the number of parameters and $D$ the dataset size. The optimal learning rate depends on both model and data scale, while the optimal batch size scales sublinearly with data size. The loss landscape is found to be convex in the learning rate and batch size, producing a plateau of near-optimal settings (Li et al., 6 Mar 2025).
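A sketch of how such fits would be applied in practice is given below; the coefficient and exponent values are placeholders to be replaced by the published fitted values, and the function names are assumptions for illustration.

```python
def fitted_lr(n_params: float, n_tokens: float,
              c_lr: float, alpha: float, beta: float) -> float:
    """Optimal learning rate from a power-law fit eta = c_lr * N^(-alpha) * D^(beta).
    Coefficients here are placeholders for the published fitted values."""
    return c_lr * n_params ** (-alpha) * n_tokens ** beta

def fitted_batch_size(n_tokens: float, c_bs: float, gamma: float) -> float:
    """Optimal batch size from a power-law fit B = c_bs * D^(gamma), with gamma < 1
    (sublinear growth in data size)."""
    return c_bs * n_tokens ** gamma

# Usage sketch with made-up placeholder constants:
# eta = fitted_lr(n_params=1e9, n_tokens=1e11, c_lr=1.0, alpha=0.7, beta=0.3)
# bsz = fitted_batch_size(n_tokens=1e11, c_bs=1.0, gamma=0.6)
```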
5. Broader Implications, Extensions, and Deployment
The new scaling law for Adam-style optimizers has multiple implications:
- Resource-Efficient Training: Provides direct plug-and-play formulas for setting optimizer hyperparameters, reducing the cost of exhaustive tuning, and enabling efficient use of massive compute budgets in large-scale LLM pretraining (Li et al., 6 Mar 2025).
- Adaptivity Beyond SGD: Shows that adaptive optimizers can scale to problem settings where gradient variance is highly non-uniform (e.g., under heavy-tailed distributions such as Zipf's law for token frequencies), achieving an iteration complexity whose dependence on the vocabulary size is markedly weaker than that of vanilla gradient descent (Kunstner et al., 25 May 2025).
- Meta-Optimization and Interpolated Optimizers: Meta-adaptive frameworks (e.g., MADA) interpolate between optimizer behaviors via hyper-gradient descent, dynamically adjusting scaling exponents to the current loss landscape, yielding improved generalization and robustness over fixed prescriptions (Ozkara et al., 17 Jan 2024).
- Distributed and Federated Learning: Incorporating local adaptive scaling into communication- and memory-constrained environments allows accelerated convergence and improved robustness in both homogeneous and heterogeneous data partitions, as demonstrated in federated variants of Adam-style updates (Chezhegov et al., 2 Jun 2024).
6. Practical Guidelines for Hyperparameter Tuning
The emergent scaling laws provide actionable prescriptions for practitioners:
- Adjust the learning rate for a given model and data scale according to the fitted power law $\eta_{\text{opt}}(N, D) \propto N^{-\alpha} D^{\beta}$, and the batch size as $B_{\text{opt}}(D) \propto D^{\gamma}$ (Li et al., 6 Mar 2025).
- For varying batch sizes, apply the "square root scaling rule" to jointly rescale the learning rate, momenta, and $\epsilon$:
$$\eta \to \sqrt{\kappa}\,\eta, \qquad 1-\beta_1 \to \kappa(1-\beta_1), \qquad 1-\beta_2 \to \kappa(1-\beta_2), \qquad \epsilon \to \epsilon/\sqrt{\kappa},$$
where $\kappa$ is the batch size multiplier (Malladi et al., 2022); see the sketch after this list.
- For extremely large models or layers with rapidly vanishing gradient scale, mitigate the dominance of the $\epsilon$ term via per-layer $\epsilon$-rescaling or substitution with scale-invariant mechanisms such as the atan2 operation (Everett et al., 8 Jul 2024).
- Exploit architecture-aware or meta-adaptive schemes for per-layer (or per-connection) scaling when feasible to better match the optimizer's dynamics to the network topology (Genet et al., 31 Oct 2024).
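A minimal sketch of the square-root scaling rule as a hyperparameter-rescaling helper follows, together with a scale-invariant atan2-style update direction; both function names are illustrative assumptions, and the atan2 form omits the scale constants used in published Adam-atan2 variants.

```python
import math
import torch

def sqrt_scale_adam_hparams(lr, beta1, beta2, eps, kappa):
    """Square-root scaling rule for a batch size multiplied by kappa:
    lr -> sqrt(kappa)*lr, 1-beta -> kappa*(1-beta), eps -> eps/sqrt(kappa)."""
    return (
        lr * math.sqrt(kappa),
        1.0 - kappa * (1.0 - beta1),
        1.0 - kappa * (1.0 - beta2),
        eps / math.sqrt(kappa),
    )

def atan2_direction(m_hat: torch.Tensor, v_hat: torch.Tensor) -> torch.Tensor:
    """Scale-invariant stand-in for m_hat / (sqrt(v_hat) + eps): atan2 is invariant
    to a common rescaling of both arguments and needs no explicit eps.
    (Scale constants from published Adam-atan2 variants are omitted here.)"""
    return torch.atan2(m_hat, v_hat.sqrt())

# Example: doubling the batch size (kappa = 2)
# lr, b1, b2, eps = sqrt_scale_adam_hparams(3e-4, 0.9, 0.999, 1e-8, kappa=2)
```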
7. Open Issues and Directions
While the current scaling laws achieve high robustness and near-optimality in a variety of regimes, open challenges remain:
- Dynamic Noise Scale Determination: Accurately and efficiently estimating $\mathcal{B}_{\text{noise}}$ during training for truly online scaling.
- Interplay with Advanced Normalization: More general characterization of the effects of normalization layers—now shown to have "meta-adaptive" effects akin to double normalization in optimizers—on optimizer scaling (Gould et al., 8 Nov 2024).
- Unified Theoretical Frameworks: Further formalization and cross-validation across more model classes, including convolutional nets, recurrent architectures, and graph-based models, especially in the presence of heavy-tailed data distributions (Kunstner et al., 25 May 2025).
- Memory-Efficient and Minimalist Adaptive Designs: Ongoing work (e.g., SCALE) indicates that careful localization of adaptivity and normalization can retain empirical scaling while reducing memory and computational costs—in turn influencing broader scaling laws relating optimizer state size and training efficiency (Glentis et al., 20 Jun 2025).
In summary, the recent body of work crystallizes a new generation of scaling laws for Adam-style optimizers. These laws capture the non-monotonic behavior of the optimal learning rate in relation to batch size, prescribe robust per-layer or per-architecture adaptation strategies, and are validated empirically in LLM-scale regimes. They deliver a rigorous basis for principled hyperparameter selection and efficient optimizer deployment in large-scale, heterogeneous, and distributed deep learning systems.