
Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training (2505.13738v1)

Published 19 May 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate $\eta$ and weight decay $\lambda$. We study scaling laws for HPs: formulas for how to scale HPs as we scale model size $N$, dataset size $D$, and batch size $B$. Recent work suggests the AdamW timescale, $B/(\eta\lambda D)$, should remain constant across training settings, and we verify the implication that optimal $\lambda$ scales linearly with $B$, for a fixed $N, D$. However, as $N, D$ scale, we show the optimal timescale obeys a precise power law in the tokens-per-parameter ratio, $D/N$. This law thus provides a method to accurately predict $\lambda_{opt}$ in advance of large-scale training. We also study scaling laws for optimal batch size $B_{opt}$ (the $B$ enabling lowest loss at a given $N, D$) and critical batch size $B_{crit}$ (the $B$ beyond which further data parallelism becomes ineffective). In contrast with prior work, we find both $B_{opt}$ and $B_{crit}$ scale as power laws in $D$, independent of model size, $N$. Finally, we analyze how these findings inform the real-world selection of Pareto-optimal $N$ and $D$ under dual training time and compute objectives.

Summary

  • The paper shows that tuning weight decay via a derived power law for the AdamW timescale is key for efficient LLM pre-training.
  • It finds that optimal and critical batch sizes scale as power laws in dataset size (approximately $D^{0.4}$ and $D^{0.5}$), independent of model size.
  • The study offers a practical roadmap using μP and loss-data scaling laws to balance training time and compute costs in large-scale training.

Efficiently pre-training LLMs at scale requires careful tuning of hyperparameters like learning rate ($\eta$) and weight decay ($\lambda$). Traditional hyperparameter sweeping is often infeasible for the largest models due to computational costs. This paper presents scaling laws derived from hundreds of training runs to guide the selection of weight decay, batch size, model size ($N$), and dataset size ($D$) for optimal training performance, particularly focusing on practical tradeoffs between training time and total compute.

The paper investigates the scaling behavior of the AdamW timescale, defined as $\mathcal{T}_{epoch} = B / (\eta\lambda D)$. This metric represents the effective fraction of the training data over which weight updates are averaged. While previous work suggested keeping this constant for multi-epoch training, this paper demonstrates that for LLM pre-training (typically one epoch), the optimal $\mathcal{T}_{epoch}$ is not constant but follows a precise power law in the tokens-per-parameter ratio ($TPP = D/N$). The fitted law shows $\mathcal{T}_{epoch} \propto TPP^{-0.5}$, meaning the optimal timescale decreases as models are trained on more data relative to their size.
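
To make this concrete, the exponent and coefficient of such a law can be fitted by least squares in log-log space from a small hyperparameter sweep. The sketch below uses invented sweep values purely for illustration; it does not reproduce the paper's fit.

```python
import numpy as np

# Invented small-scale sweep results: for each (N, D) setting, the
# tokens-per-parameter ratio and the empirically best AdamW timescale
# T_epoch = B / (eta * lambda * D).  Placeholder numbers only.
tpp = np.array([5.0, 10.0, 20.0, 40.0, 80.0])
t_epoch_opt = np.array([0.92, 0.66, 0.45, 0.33, 0.23])

# Fit T_epoch_opt = a * TPP^(-b) by linear regression in log-log space.
slope, intercept = np.polyfit(np.log(tpp), np.log(t_epoch_opt), 1)
a, b = np.exp(intercept), -slope
print(f"fitted law: T_epoch ~= {a:.3f} * TPP^(-{b:.3f})")

def predict_timescale(tokens: float, params: float) -> float:
    """Predict the optimal AdamW timescale for a given tokens-per-parameter ratio."""
    return a * (tokens / params) ** (-b)
```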

A key practical implication is that, when using the AdamW optimizer and the Maximal Update Parameterization ($\mu$P) framework to set the learning rate, the weight decay ($\lambda$) should be the primary hyperparameter adjusted as batch size ($B$) and dataset size ($D$) change. The paper shows empirically that tuning $\lambda$ to maintain the optimal $\mathcal{T}_{epoch}$ is more effective than tuning $\eta$ as $B$ or $D$ varies. This provides a concrete recipe for practitioners: use $\mu$P to set $\eta$ based on model size $N$, and then set $\lambda$ using the derived scaling law for $\mathcal{T}_{epoch}$ and the formula $\lambda_{opt} = \frac{B}{\eta \cdot D \cdot \mathcal{T}_{epoch}(D/N)}$. The linear relationship between optimal $\lambda$ and $B$ for fixed $N, D$ holds up to a certain batch size, further supporting adjusting $\lambda$ with $B$.
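
A minimal helper for this recipe might look as follows, with batch size and dataset size both measured in tokens. The fitted constants stand in for a practitioner's own fit (as in the previous sketch) and are not the paper's values.

```python
# Hypothetical fitted constants for T_epoch = A * (D/N)^(-ALPHA); real values
# would come from a fit like the previous sketch, not from the paper.
A, ALPHA = 2.0, 0.5

def weight_decay_for(batch_tokens: float, lr: float, tokens: float,
                     params: float) -> float:
    """Solve T_epoch = B / (eta * lambda * D) for lambda."""
    t_epoch = A * (tokens / params) ** (-ALPHA)
    return batch_tokens / (lr * tokens * t_epoch)

# Example: ~1B parameters, 20B tokens (20 TPP), a muP-transferred lr of 1e-2,
# and a 2M-token batch; all values are placeholders.
print(weight_decay_for(batch_tokens=2**21, lr=1e-2, tokens=20e9, params=1e9))
```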

The paper also provides insights into optimal batch size ($B_{opt}$) and critical batch size ($B_{crit}$). $B_{opt}$ is the batch size that achieves the lowest loss for a given $N$ and $D$. $B_{crit}$ is defined based on an empirical model of the tradeoff between the number of tokens ($D$) and the number of optimization steps ($S$) required to reach a target loss $L$. The model is expressed as $S/S_{min} - 1 = (D/D_{min} - 1)^{-1}$, where $D_{min}$ and $S_{min}$ are the minimum tokens and steps, respectively, and $B_{crit} = D_{min}/S_{min}$. The paper introduces a novel method to estimate $B_{crit}$ by fitting batch-size-specific loss-data scaling laws and interpolating the data needed for a target loss, avoiding the need for dense checkpoint evaluation or constant learning rates.
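
The sketch below shows one way such a tradeoff fit could be set up with scipy, using invented $(D, S)$ pairs for a single target loss; the paper's actual fitting procedure and data are not reproduced here.

```python
import numpy as np
from scipy.optimize import curve_fit

# Invented (tokens, steps) pairs that all reach one target loss at different
# batch sizes; real values would be interpolated from the batch-size-specific
# loss-data scaling laws L_B(D) described above.
D_obs = np.array([2.2e9, 2.5e9, 3.0e9, 4.0e9, 8.0e9])   # tokens consumed
S_obs = np.array([4.4e4, 2.0e4, 1.2e4, 8.0e3, 5.3e3])   # optimizer steps

def tradeoff(D, D_min, S_min):
    # S/S_min - 1 = (D/D_min - 1)^(-1), rearranged for S.
    return S_min * (1.0 + 1.0 / (D / D_min - 1.0))

(D_min, S_min), _ = curve_fit(tradeoff, D_obs, S_obs, p0=[2.0e9, 5.0e3],
                              bounds=([1e8, 1e2], [2.1e9, 1e5]))
B_crit = D_min / S_min
print(f"D_min={D_min:.3g} tokens, S_min={S_min:.3g} steps, B_crit={B_crit:.3g} tokens")
```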

Contrary to some prior work that suggested $B_{opt}$ and $B_{crit}$ scale primarily with total compute ($C$) or target loss ($L$), this paper finds that both $B_{opt}$ and $B_{crit}$ scale as power laws in the dataset size $D$, largely independent of model size $N$. Specifically, the findings suggest $B_{opt} \propto D^{0.4}$ and $B_{crit} \propto D^{0.5}$. This aligns with recent concurrent work (2410.21676), reinforcing the fundamental dependence of optimal and critical batch sizes on the amount of data used. Practically, this means practitioners can estimate these values from small-scale runs and extrapolate based on the training data size. The $D$-$S$ tradeoff equation $D = D_{min}(1 + B/B_{crit})$ can then be used to understand the computational cost (proportional to $D$) and training time (proportional to $S = D/B$ and $N$) implications of choosing a particular batch size $B$ for a given $N$ and target loss $L$.
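
As an illustration of the extrapolation step, the sketch below fits a power law for $B_{crit}$ in $D_{min}$ from invented small-scale fits and then applies the $D$-$S$ tradeoff to price a chosen batch size; all constants are placeholders rather than the paper's fitted values.

```python
import numpy as np

# Invented (D_min, B_crit) pairs fitted at several small-scale target losses,
# e.g. via the curve_fit sketch above; roughly consistent with exponent 0.5.
D_min_pts  = np.array([5.0e8, 2.0e9, 8.0e9, 3.2e10])
B_crit_pts = np.array([2.6e5, 5.2e5, 1.0e6, 2.1e6])

slope, intercept = np.polyfit(np.log(D_min_pts), np.log(B_crit_pts), 1)
print(f"B_crit ~= {np.exp(intercept):.3g} * D_min^{slope:.2f}")

def b_crit(d_min: float) -> float:
    """Extrapolated critical batch size (tokens) for a target-loss data requirement."""
    return float(np.exp(intercept) * d_min ** slope)

def cost_of_batch(d_min: float, batch_tokens: float):
    """Tokens and steps needed at a given batch size via D = D_min * (1 + B/B_crit)."""
    tokens = d_min * (1.0 + batch_tokens / b_crit(d_min))
    return tokens, tokens / batch_tokens

tokens, steps = cost_of_batch(d_min=1.0e11, batch_tokens=2**23)  # placeholder target
print(f"tokens needed ~= {tokens:.3g}, steps ~= {steps:.3g}")
```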

Using these derived scaling laws, the paper analyzes the Pareto-optimal configurations of $N, D, B$ for achieving a target loss $L$ while balancing training time and compute. For a fixed total compute budget, traditional advice suggests training at roughly 20 TPP to minimize loss. However, once training time is considered (larger $B$ reduces the number of steps but increases the total $D$, and hence compute, via the $D$-$S$ tradeoff), the analysis reveals that smaller, over-trained models (TPP > 20) can be Pareto-optimal. Because over-trained models are trained on larger datasets, they have higher $B_{crit}$ and can therefore exploit larger batch sizes more efficiently to reduce training time, even if their total compute exceeds the 20 TPP optimum. The paper shows it is Pareto-inefficient to target an actual training TPP of 20 when using very large batch sizes ($B \gg B_{opt}$).
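
A toy version of this analysis, reusing a hypothetical $B_{crit}$ law like the one in the previous sketch together with invented loss-matched $(N, D_{min})$ pairs, shows how the Pareto frontier over compute (approximated as $6ND$ FLOPs) and a serial-time proxy ($\propto N \cdot D/B$) can shift toward smaller, over-trained models at large batch sizes.

```python
# Hypothetical B_crit law (same placeholder coefficient/exponent as above) and
# invented (N, D_min) pairs that reach the same target loss; the smaller,
# over-trained models need more tokens.  Illustrative numbers only.
b_crit = lambda d_min: 11.5 * d_min ** 0.5
configs = {1.0e9: 8.0e10, 2.0e9: 3.0e10, 4.0e9: 1.6e10}   # N -> D_min (tokens)

candidates = []
for n_params, d_min in configs.items():
    for batch in (2**20, 2**22, 2**24):                   # batch sizes in tokens
        tokens = d_min * (1.0 + batch / b_crit(d_min))    # D-S tradeoff
        flops = 6.0 * n_params * tokens                   # standard compute estimate
        time_proxy = n_params * tokens / batch            # ~ N * steps
        candidates.append((n_params, batch, flops, time_proxy))

# Keep configurations that no other candidate beats on both compute and time.
pareto = [c for c in candidates
          if not any(o[2] <= c[2] and o[3] <= c[3] and o is not c for o in candidates)]
for n_params, batch, flops, t in sorted(pareto, key=lambda c: c[2]):
    print(f"N={n_params:.1e}  B={batch:.2e}  FLOPs={flops:.2e}  time~{t:.2e}")
```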

The core practical takeaways are:

  • When using AdamW and $\mu$P, fix the learning rate based on model width and tune weight decay ($\lambda$) based on the derived $\mathcal{T}_{epoch}$ scaling law ($\propto (D/N)^{-0.5}$) and batch size ($B$).
  • Optimal and critical batch sizes ($B_{opt}$, $B_{crit}$) scale with dataset size $D$, not total compute $C$ or target loss $L$. Estimate their scaling from small runs ($\approx D^{0.4}$ and $\approx D^{0.5}$, respectively) and use these laws to select $B$ for large-scale training.
  • Leverage the $D$-$S$ tradeoff model and the $B_{crit}$ scaling law to plan training runs that optimally balance total compute (FLOPs) and training time, especially considering the benefits of larger batch sizes. This may favor smaller, over-trained models for faster training times at a given performance level.

Implementation requires the following steps (an end-to-end sketch follows the list):

  1. Train a proxy model with $\mu$P to find base hyperparameters, including $\eta_{base}$.
  2. For target model width $W$, set peak $\eta = \eta_{base} \cdot (W_{proxy}/W)$.
  3. From limited small-scale experiments across various $N, D, B, \lambda$ and losses, fit:
    • The optimal $\mathcal{T}_{epoch}$ power law in TPP [(2505.13738), Eq. 3].
    • The $B$-specific loss-data power laws $L_B(D)$ [(2505.13738), Fig. 4].
    • The $D$-$S$ tradeoff curve $S/S_{min} - 1 = (D/D_{min} - 1)^{-1}$ [(2505.13738), Eq. 6], yielding $D_{min}$ and $S_{min}$ for various losses. Calculate $B_{crit} = D_{min}/S_{min}$.
    • The $B_{crit}$ power law in $D_{min}$ [(2505.13738), Eq. 7].
  4. For a desired target loss $L$, model size $N$, and dataset size $D$:
    • Predict the target $D_{min}$ for loss $L$ at size $N$ (e.g., using a loss scaling law like [(2203.02155), Eq. 1]).
    • Predict $B_{crit}$ for this $D_{min}$ using the fitted power law.
    • Choose a batch size $B$. The required dataset size will be $D = D_{min}(1 + B/B_{crit})$. Ensure the actual dataset used is at least this size.
    • Calculate the required steps $S = D/B$. Configure the LR schedule accordingly.
    • Calculate the target $\mathcal{T}_{epoch}$ for the chosen $D/N$ ratio.
    • Set $\lambda = B / (\eta \cdot D \cdot \mathcal{T}_{epoch})$.
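
An end-to-end sketch of this recipe is given below. Every constant (the timescale law, the $B_{crit}$ law, the $\mu$P base learning rate and proxy width) is a placeholder that a practitioner would replace with their own fitted values.

```python
# Placeholder fitted constants; substitute your own small-scale fits.
TIMESCALE_COEF, TIMESCALE_EXP = 2.0, -0.5    # T_epoch = coef * (D/N)^exp
BCRIT_COEF, BCRIT_EXP = 11.5, 0.5            # B_crit  = coef * D_min^exp
ETA_BASE, WIDTH_PROXY = 1e-2, 256            # muP base LR measured at proxy width

def plan_run(n_params, width, d_min, batch_tokens):
    """Given the data requirement d_min for a target loss (from a loss scaling
    law), return dataset size, steps, learning rate, and weight decay."""
    b_crit = BCRIT_COEF * d_min ** BCRIT_EXP
    tokens = d_min * (1.0 + batch_tokens / b_crit)     # D-S tradeoff
    steps = tokens / batch_tokens                      # configure the LR schedule for this
    eta = ETA_BASE * WIDTH_PROXY / width               # muP width scaling
    t_epoch = TIMESCALE_COEF * (tokens / n_params) ** TIMESCALE_EXP
    lam = batch_tokens / (eta * tokens * t_epoch)      # invert the AdamW timescale
    return {"tokens": tokens, "steps": steps, "lr": eta, "weight_decay": lam}

print(plan_run(n_params=3e9, width=4096, d_min=1.2e11, batch_tokens=2**22))
```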

This systematic approach, grounded in empirical scaling laws, offers a more predictable and efficient way to navigate the complex hyperparameter space for large-scale LLM pre-training, enabling better control over training costs and time. The findings also highlight that achieving the fastest training time for a given performance level might involve training models on significantly more data than the compute-optimal minimum.

Some limitations mentioned include the focus on AdamW and a specific LR schedule shape, the need for more data and architectural variations in future work, and the empirical observation that small batches degrade performance more than theory suggests, potentially requiring tuning beyond $\lambda$. The practical systems constraints of very large batch sizes are also not explicitly modeled.