
Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training

Published 19 May 2025 in cs.LG, cs.AI, and cs.CL | (2505.13738v1)

Abstract: Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate η and weight decay λ. We study scaling laws for HPs: formulas for how to scale HPs as we scale model size N, dataset size D, and batch size B. Recent work suggests the AdamW timescale, B/(ηλD), should remain constant across training settings, and we verify the implication that optimal λ scales linearly with B, for a fixed N, D. However, as N, D scale, we show the optimal timescale obeys a precise power law in the tokens-per-parameter ratio, D/N. This law thus provides a method to accurately predict λ_opt in advance of large-scale training. We also study scaling laws for optimal batch size B_opt (the B enabling lowest loss at a given N, D) and critical batch size B_crit (the B beyond which further data parallelism becomes ineffective). In contrast with prior work, we find both B_opt and B_crit scale as power laws in D, independent of model size, N. Finally, we analyze how these findings inform the real-world selection of Pareto-optimal N and D under dual training time and compute objectives.

Summary

  • The paper shows that tuning weight decay via a derived power law on the AdamW timescale is key for efficient LLM pre-training.
  • It finds that optimal and critical batch sizes scale as power laws in dataset size (approximately D^0.4 and D^0.5), independent of model size.
  • The study offers a practical roadmap using μP and loss-data scaling laws to balance training time and compute costs in large-scale training.

Efficiently pre-training LLMs at scale requires careful tuning of hyperparameters like learning rate (η) and weight decay (λ). Traditional hyperparameter sweeping is often infeasible for the largest models due to computational costs. This paper presents scaling laws derived from hundreds of training runs to guide the selection of weight decay, batch size, model size (N), and dataset size (D) for optimal training performance, particularly focusing on practical tradeoffs between training time and total compute.

The paper investigates the scaling behavior of the AdamW timescale, defined as T_epoch = B / (ηλD). This metric represents the effective fraction of the training data over which weight updates are averaged. While previous work suggested keeping this timescale constant for multi-epoch training, this study demonstrates that for LLM pre-training (typically one epoch), the optimal T_epoch is not constant but follows a precise power law in the tokens-per-parameter ratio (TPP = D/N). The fitted law shows T_epoch ∝ TPP^(-0.5), meaning the optimal timescale decreases as models are trained on more data relative to their size.
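The T_epoch ∝ TPP^(-0.5) law can be sketched in a few lines. Note this is an illustrative sketch only: the exponent -0.5 is the paper's fitted value, but the coefficient `coeff` is a hypothetical placeholder that would need to be fit on your own small-scale sweeps.

```python
import math

def optimal_timescale(tokens_per_param: float, coeff: float = 1.0) -> float:
    """Optimal AdamW timescale T_epoch = coeff * TPP^(-0.5).

    The -0.5 exponent is the paper's fitted value; `coeff` is a
    placeholder to be fit from small-scale experiments.
    """
    return coeff * tokens_per_param ** -0.5

# Quadrupling tokens-per-parameter halves the optimal timescale,
# since 4^(-0.5) = 0.5:
t20 = optimal_timescale(20.0)
t80 = optimal_timescale(80.0)
assert math.isclose(t80 / t20, 0.5)
```

The inverse-square-root form means heavily over-trained models (high TPP) want a much shorter averaging window than compute-optimal ones.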

A key practical implication is that, when using the AdamW optimizer and the Maximal Update Parameterization (μP) framework to set the learning rate, weight decay (λ) should be the primary hyperparameter adjusted as batch size (B) and dataset size (D) change. The paper shows empirically that tuning λ to maintain the optimal T_epoch is more effective than tuning η as B or D varies. This provides a concrete recipe for practitioners: use μP to set η based on model size N, and then set λ using the derived scaling law for T_epoch and the formula λ = B / (η · T_epoch · D). The linear relationship between optimal λ and B for fixed N, D holds up to a certain batch size, further supporting adjusting λ with B.
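Solving the timescale definition T_epoch = B / (ηλD) for λ is a one-liner; a minimal sketch, with purely illustrative input values:

```python
def weight_decay_for(batch: float, lr: float, tokens: float,
                     t_epoch: float) -> float:
    """Invert T_epoch = B / (eta * lambda * D) to get lambda.

    `batch` should be in the same units as `tokens` (e.g., tokens/step).
    """
    return batch / (lr * t_epoch * tokens)

# Hypothetical values: doubling the batch doubles lambda, matching the
# linear lambda-B relationship the paper verifies for fixed N, D.
wd1 = weight_decay_for(batch=2**20, lr=0.01, tokens=1e11, t_epoch=0.1)
wd2 = weight_decay_for(batch=2**21, lr=0.01, tokens=1e11, t_epoch=0.1)
assert wd2 == 2 * wd1
```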

The study also provides insights into optimal batch size (B_opt) and critical batch size (B_crit). B_opt is the batch size that achieves the lowest loss for a given N and D. B_crit is defined via an empirical model of the tradeoff between the number of tokens (D) and the number of optimization steps (S) required to reach a target loss L. The model is expressed as (D/D_min − 1)(S/S_min − 1) = 1, where D_min and S_min are the minimum tokens and steps, respectively, and B_crit = D_min / S_min. The paper introduces a novel method to estimate B_crit by fitting batch-size-specific loss-data scaling laws and interpolating the data needed for a target loss, avoiding the need for dense checkpoint evaluation or constant learning rates.
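Taking the data-steps tradeoff in its standard hyperbolic form, (D/D_min − 1)(S/S_min − 1) = 1 with B_crit = D_min/S_min, and substituting S = D/B gives a closed form for the data required at a chosen batch size, D = D_min · (1 + B/B_crit). A sketch assuming that form, with hypothetical D_min and S_min values:

```python
def tokens_required(batch: float, d_min: float, s_min: float) -> float:
    """Data needed to reach a target loss at batch size B.

    From (D/D_min - 1)(S/S_min - 1) = 1 with S = D/B:
        D = D_min * (1 + B / B_crit), where B_crit = D_min / S_min.
    """
    b_crit = d_min / s_min
    return d_min * (1.0 + batch / b_crit)

# Verify the tradeoff identity for hypothetical minima:
d_min, s_min = 1e10, 1e4     # min tokens / min steps for some target loss
D = tokens_required(batch=5e5, d_min=d_min, s_min=s_min)
S = D / 5e5
assert abs((D / d_min - 1) * (S / s_min - 1) - 1) < 1e-9
```

At B = B_crit, the required data is exactly 2·D_min: further parallelism beyond this point buys diminishing reductions in steps at steep data cost.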

Contrary to some prior work that suggested B_opt and B_crit scale primarily with total compute (C) or target loss (L), this paper finds that both B_opt and B_crit scale as power laws in the dataset size D, largely independent of model size N. Specifically, the findings suggest B_opt ∝ D^0.4 and B_crit ∝ D^0.5. This aligns with recent concurrent work (Zhang et al., 2024), reinforcing the fundamental dependence of optimal and critical batch sizes on the amount of data used. Practically, this means practitioners can estimate these values from small-scale runs and extrapolate based on the training data size. The D-S tradeoff equation (D/D_min − 1)(S/S_min − 1) = 1 can then be used to understand the computational cost (proportional to D, for fixed N) and training time (proportional to the number of steps S = D/B) implications of choosing a particular batch size B for a given N and target loss L.
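The "estimate from small runs and extrapolate" step is an ordinary log-log linear fit. A self-contained sketch (the synthetic measurements below are invented to lie exactly on B_crit = 2·D^0.5, purely to show the fit recovering the exponent):

```python
import math

def fit_power_law(ds, bs):
    """Least-squares fit of B = a * D^p in log-log space."""
    lx = [math.log(d) for d in ds]
    ly = [math.log(b) for b in bs]
    n = len(ds)
    mx, my = sum(lx) / n, sum(ly) / n
    p = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    a = math.exp(my - p * mx)
    return a, p

# Synthetic small-run measurements on B_crit = 2 * D^0.5:
ds = [1e9, 4e9, 1.6e10]
bs = [2 * d ** 0.5 for d in ds]
a, p = fit_power_law(ds, bs)
assert math.isclose(p, 0.5, rel_tol=1e-6)
assert math.isclose(a, 2.0, rel_tol=1e-6)
```

Real measurements are noisy, so in practice one would fit over many runs (e.g., with `numpy.polyfit` on the logs) rather than three exact points.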

Using these derived scaling laws, the paper analyzes the Pareto-optimal configurations of N and D to achieve a target loss L while balancing training time and compute. For a fixed total compute budget, traditional advice suggests training at roughly 20 TPP to minimize loss. However, once training time is considered (larger batch sizes B decrease time but increase total D and compute via the D-S tradeoff), the analysis reveals that smaller, over-trained models (TPP > 20) can be Pareto-optimal. This is because over-trained models are trained on larger datasets, leading to higher B_crit and thus allowing more efficient use of larger batch sizes to reduce training time, even if the total compute exceeds the 20-TPP optimum. Targeting 20 TPP is shown to be Pareto-inefficient when using very large batch sizes.
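The time-compute tension behind this Pareto analysis can be made concrete by applying the D-S tradeoff (in the standard form D = D_min · (1 + B/B_crit)) to both objectives at once; the D_min and S_min values below are hypothetical:

```python
def cost_and_time(batch: float, d_min: float, s_min: float):
    """Tokens (∝ compute, for fixed N) and serial steps (∝ wall-clock
    time) needed to hit a target loss at batch size B."""
    b_crit = d_min / s_min
    tokens = d_min * (1 + batch / b_crit)   # compute cost grows with B
    steps = tokens / batch                  # serial steps shrink with B
    return tokens, steps

# Larger batches trade extra compute for shorter training time:
t1, s1 = cost_and_time(batch=2.5e5, d_min=1e10, s_min=1e4)
t2, s2 = cost_and_time(batch=2e6,   d_min=1e10, s_min=1e4)
assert t2 > t1 and s2 < s1
```

Since an over-trained model has a larger D and hence (by B_crit ∝ D^0.5) a larger B_crit, it pays a smaller relative data penalty for any given batch size, which is exactly why it can win on the time axis.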

The core practical takeaways are:

  • When using AdamW and TPP=D/NTPP = D/N7P, fix the learning rate based on model width and tune weight decay (TPP=D/NTPP = D/N8) based on the derived TPP=D/NTPP = D/N9 scaling law (TepochTPP0.5\mathcal{T}_{epoch} \propto TPP^{-0.5}0) and batch size (TepochTPP0.5\mathcal{T}_{epoch} \propto TPP^{-0.5}1).
  • Optimal and critical batch sizes (TepochTPP0.5\mathcal{T}_{epoch} \propto TPP^{-0.5}2) scale with dataset size TepochTPP0.5\mathcal{T}_{epoch} \propto TPP^{-0.5}3, not total compute TepochTPP0.5\mathcal{T}_{epoch} \propto TPP^{-0.5}4 or target loss TepochTPP0.5\mathcal{T}_{epoch} \propto TPP^{-0.5}5. Estimate their scaling from small runs (TepochTPP0.5\mathcal{T}_{epoch} \propto TPP^{-0.5}6 and TepochTPP0.5\mathcal{T}_{epoch} \propto TPP^{-0.5}7) and use these laws to select TepochTPP0.5\mathcal{T}_{epoch} \propto TPP^{-0.5}8 for large-scale training.
  • Leverage the TepochTPP0.5\mathcal{T}_{epoch} \propto TPP^{-0.5}9 tradeoff model and the μ\mu0 scaling law to plan training runs that optimally balance total compute (FLOPs) and training time, especially considering the benefits of larger batch sizes. This may favor smaller, over-trained models for faster training times at a given performance level.

Implementation requires:

  1. Train a proxy model with μP to find base hyperparameters, including the base learning rate.
  2. For the target model width, set the peak learning rate via the μP transfer rule.
  3. From limited small-scale experiments across various batch sizes, dataset sizes, and losses, fit:
    • The optimal T_epoch power law with TPP [(2505.13738), Eq. 3].
    • The batch-size-specific loss-data power laws [(2505.13738), Fig. 4].
    • The D-S tradeoff curve (D/D_min − 1)(S/S_min − 1) = 1 [(2505.13738), Eq. 6], yielding D_min and S_min for various losses. Calculate B_crit = D_min / S_min.
    • The B_crit power law with D [(2505.13738), Eq. 7].
  4. For a desired target loss L, model size N, and dataset size D:
    • Predict the target D_min for loss L at size N (e.g., using a loss scaling law like [(Ouyang et al., 2022), Eq. 1]).
    • Predict B_crit for this D_min using the fitted power law.
    • Choose a batch size B. The required dataset size will be D = D_min · (1 + B/B_crit). Ensure the actual dataset used is at least this size.
    • Calculate the required steps S = D/B. Configure the LR schedule accordingly.
    • Calculate the target T_epoch for the chosen TPP = D/N ratio.
    • Set λ = B / (η · T_epoch · D).

This systematic approach, grounded in empirical scaling laws, offers a more predictable and efficient way to navigate the complex hyperparameter space for large-scale LLM pre-training, enabling better control over training costs and time. The findings also highlight that achieving the fastest training time for a given performance level might involve training models on significantly more data than the compute-optimal minimum.

Some limitations mentioned include the focus on AdamW and a specific LR schedule shape, the need for more data/architectural variations in future work, and the empirical observation that small batches degrade performance more than theory suggests, potentially requiring tuning beyond λ. Practical systems constraints of very large batches are also not explicitly modeled.
