
PLM Scaling Laws: Theory & Practice

Updated 13 July 2025
  • PLM scaling laws are power-law relationships that define how model loss decreases with increased parameters, data, and compute.
  • They enable researchers to predict performance and optimally allocate resources, balancing model size with dataset scale.
  • Empirical and theoretical findings confirm that these laws inform hyperparameter tuning, architecture design, and cross-domain applications.

Pre-trained language model (PLM) scaling laws are empirical and theoretical relationships describing how the performance of large language models (LLMs) depends on training resources such as parameter count, dataset size, and compute. These power-law relationships enable accurate prediction of model quality, guide optimal resource allocation, and inform recommendations for model design across a variety of neural architectures and domains, including language, speech, and multi-modal tasks. The study of PLM scaling laws has shaped contemporary deep learning research by elucidating the trade-offs and limitations inherent in very large-scale learning systems.

1. Mathematical Formulation of PLM Scaling Laws

Scaling laws most commonly manifest as smooth, monotonic power-law relationships between loss and key scaling variables, with an irreducible loss floor limiting best-case performance. The canonical form for language and acoustic models relates development (or test) loss L to the number of model parameters N and training examples D:

L(N, D) = \left( {L_\infty}^{1/\alpha} + \left(\frac{N_C}{N}\right)^{\alpha_N/\alpha} + \left(\frac{D_C}{D}\right)^{\alpha_D/\alpha} \right)^\alpha

  • L_∞ is the irreducible loss (task-intrinsic minimal achievable loss).
  • N_C and D_C are critical parameter and data scales.
  • α_N, α_D, and α are empirical exponents controlling convergence rate with respect to N and D (Droppo et al., 2021, Maloney et al., 2022).
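
As a concrete illustration, the following Python sketch evaluates this canonical form on a grid of model and data sizes. All constants (the loss floor, critical scales, and exponents) are illustrative placeholders, not values fitted in the cited papers.

```python
def plm_loss(N, D, L_inf=1.7, N_C=8.8e13, D_C=5.4e13,
             alpha_N=0.076, alpha_D=0.103, alpha=1.0):
    """Canonical L(N, D) power-law form; all constants are illustrative."""
    inner = (L_inf ** (1.0 / alpha)
             + (N_C / N) ** (alpha_N / alpha)
             + (D_C / D) ** (alpha_D / alpha))
    return inner ** alpha

# Loss shrinks smoothly as parameters and data grow (hypothetical values).
for N, D in [(1e8, 1e10), (1e9, 1e11), (1e10, 1e12)]:
    print(f"N={N:.0e}, D={D:.0e}, L={plm_loss(N, D):.3f}")
```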

Specialized scaling expressions also relate training or validation loss to compute budget C:

L(C) = \beta \cdot C^\alpha

Optimal scaling of model and dataset sizes follows power-law relationships with compute:

N_\text{opt} \propto C^a, \qquad D_\text{opt} \propto C^b

where a and b are empirically determined exponents (Shen et al., 24 Jun 2024, Baniodeh et al., 9 Jun 2025).
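
A minimal sketch of how these allocation exponents are used in practice, assuming hypothetical values of a and b and a hypothetical calibration point; none of the constants below come from the cited studies.

```python
# Hypothetical exponents and calibration point for N_opt ~ C^a, D_opt ~ C^b.
a, b = 0.5, 0.5
C_ref, N_ref, D_ref = 1e21, 1e9, 2e10

def optimal_allocation(C):
    """Scale the reference (N, D) pair to a new compute budget C."""
    return N_ref * (C / C_ref) ** a, D_ref * (C / C_ref) ** b

for C in (1e21, 1e22, 1e23):
    N_opt, D_opt = optimal_allocation(C)
    print(f"C={C:.0e}: N_opt={N_opt:.2e}, D_opt={D_opt:.2e}")
```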

In multilingual settings, scaling laws generalize to include language sampling ratios p_i, yielding

L_i(N, D, p_i) = \left[E_i + \frac{A_i}{N^{\alpha_i}} + \frac{B_i}{D^{\beta_i}}\right] p_i^{-\gamma_i}

where each parameter is specific to a language family (He et al., 15 Oct 2024).
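
A minimal sketch of evaluating this per-family loss; every coefficient below is an arbitrary illustration, not a value fitted by He et al.:

```python
def family_loss(N, D, p_i, E_i=1.5, A_i=4e2, B_i=3e3,
                alpha_i=0.34, beta_i=0.28, gamma_i=0.2):
    """Per-language-family loss L_i(N, D, p_i); constants are illustrative."""
    return (E_i + A_i / N ** alpha_i + B_i / D ** beta_i) * p_i ** (-gamma_i)

# Doubling a family's sampling ratio scales its loss by 2**(-gamma_i).
print(family_loss(1e9, 1e11, p_i=0.1))
print(family_loss(1e9, 1e11, p_i=0.2))
```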

2. Mechanisms Underlying Scaling Laws

Theoretical understanding of scaling laws is grounded in statistical models and random matrix theory. If the eigenvalue spectrum of input data covariance follows a power-law decay (i.e., natural signals exhibit heavy-tailed spectral statistics), then nonlinear networks can progressively learn new modes as model or dataset size increases (Maloney et al., 2022). The application of a nonlinear random feature map in neural architectures extends this spectrum, yielding the characteristic power-law scaling in loss as a function of resources until a bottleneck is encountered (e.g., latent dimension or available data).
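
A toy calculation (not the full random-feature analysis of Maloney et al.) illustrates the mechanism: if the covariance eigenvalues decay as a power law and a model of capacity k captures only the top-k modes, the residual loss is the tail mass of the spectrum, which itself decays as a power law in k.

```python
import numpy as np

alpha = 1.5                                                 # assumed spectral decay exponent
eigvals = np.arange(1, 100_001, dtype=float) ** (-alpha)    # power-law covariance spectrum

def residual_loss(k):
    """Variance left in the modes a capacity-k model has not yet learned."""
    return eigvals[k:].sum()

for k in (10, 100, 1_000, 10_000):
    print(k, residual_loss(k))   # decays roughly as k**(1 - alpha)
```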

Scaling law breakdowns occur when the available resources surpass the intrinsic dimensionality or richness of the task data, at which point further scaling yields diminishing returns and the loss plateaus (Maloney et al., 2022, Baniodeh et al., 9 Jun 2025).

3. Empirical Validation and Cross-Domain Generality

Empirical studies have confirmed the validity of scaling laws over multiple orders of magnitude in both N and D for LLMs, acoustic models, and beyond (Droppo et al., 2021, Shen et al., 24 Jun 2024, Baniodeh et al., 9 Jun 2025). In all cases, loss decreases predictably as a power law until the irreducible loss is approached.

For example, reducing loss by 5% might require a 14-fold increase in data in the data-limited regime or a 25-fold increase in parameter count in the parameter-limited regime (Droppo et al., 2021). Optimal scaling requires jointly increasing N and D in a roughly fixed ratio, often with data size increasing sub-linearly relative to parameter count.
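
Such fold-increases can be read off the power-law form when training far from the irreducible loss. The back-of-envelope below inverts the relation L ∝ D^(-α_D) to show the data exponent implied by a 14-fold requirement; it deliberately ignores the loss floor, so it is an approximation rather than a reproduction of the fitted result.

```python
import math

r = 0.05            # target relative loss reduction
data_factor = 14.0  # quoted data fold-increase

# In the pure power-law regime, (1 - r) = data_factor**(-alpha_D).
alpha_D = -math.log(1 - r) / math.log(data_factor)
print(alpha_D)                       # ~0.019
print((1 - r) ** (-1.0 / alpha_D))   # recovers ~14.0
```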

Experimental results demonstrate similar scaling behavior for efficient architectures such as linear-complexity transformers, where the loss follows a power-law trend with compute and optimal N and D allocations (Shen et al., 24 Jun 2024). For tasks outside language modeling, such as autonomous driving motion forecasting, similar scaling laws govern both model loss and downstream performance metrics (Baniodeh et al., 9 Jun 2025).

4. Implications for Model Design and Hyperparameter Optimization

Scaling laws enable practitioners to:

  • Predict achievable model performance under specific resource constraints without exhaustive searching.
  • Allocate compute budget optimally between model size and dataset size; e.g., optimal training of motion forecasting models requires parameter count to grow 1.5× faster than data size with increasing compute (Baniodeh et al., 9 Jun 2025).
  • Identify and avoid resource bottlenecks, e.g., ensuring D exceeds a threshold that grows with N so that parameters are not wasted (Droppo et al., 2021).

Hyperparameter optimization can thus be formulated as a constrained maximization guided by empirical exponents derived from scaling fits, greatly reducing the need for extensive grid searches.
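
A minimal sketch of such a constrained search, assuming the illustrative loss surface from Section 1 and the common approximation C ≈ 6·N·D for training FLOPs; every constant here is hypothetical.

```python
from scipy.optimize import minimize_scalar

def plm_loss(N, D, L_inf=1.7, N_C=8.8e13, D_C=5.4e13,
             alpha_N=0.076, alpha_D=0.103):
    """Illustrative loss surface; constants are placeholders, not fitted values."""
    return L_inf + (N_C / N) ** alpha_N + (D_C / D) ** alpha_D

C = 1e21                       # assumed compute budget in FLOPs
FLOPS_PER_PARAM_TOKEN = 6.0    # assumed C ≈ 6 * N * D approximation

def loss_at_budget(log10_N):
    N = 10.0 ** log10_N
    D = C / (FLOPS_PER_PARAM_TOKEN * N)   # spend the entire budget
    return plm_loss(N, D)

res = minimize_scalar(loss_at_budget, bounds=(6, 12), method="bounded")
N_star = 10.0 ** res.x
D_star = C / (FLOPS_PER_PARAM_TOKEN * N_star)
print(f"N*={N_star:.2e}, D*={D_star:.2e}, loss={res.fun:.3f}")
```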

5. Generalization, Overparameterization, and Regression Theory

Scaling law phenomena challenge the classical expectation that overparameterization induces overfitting. Recent work shows that, under certain spectral decay assumptions, generalization error decreases at a power-law rate with effective model dimension—even in highly overparameterized regimes (Chen et al., 3 Mar 2025). In multiple and kernel regression, excess risk decomposes as:

\mathbb{E}[L_R(V_n)] = \sigma^2 + \Theta\left(\frac{1}{M^{a-1}}\right) + \Theta\left(\frac{1}{(N_\text{eff} \gamma)^{(a-1)/a}}\right)

Here, M is the sketch dimension and a is the eigenvalue decay exponent. This result extends power-law scaling from deep networks to linear, kernel, and sketched regression techniques (Chen et al., 3 Mar 2025).
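
Up to the constants hidden by the Θ notation, the rates in this decomposition can be tabulated directly; the noise level, decay exponent, sketch dimensions, and effective sample sizes below are assumptions chosen only to show how the two error terms shrink.

```python
sigma2, a, gamma = 0.01, 2.0, 1.0   # assumed noise level, decay exponent, scale factor

for M, N_eff in [(100, 1_000), (1_000, 10_000), (10_000, 100_000)]:
    approx_term = 1.0 / M ** (a - 1)                     # Θ(1 / M^(a-1))
    estim_term = 1.0 / (N_eff * gamma) ** ((a - 1) / a)  # Θ(1 / (N_eff γ)^((a-1)/a))
    print(M, N_eff, sigma2 + approx_term + estim_term)
```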

Further, analysis of training dynamics in quadratically parameterized regression settings demonstrates that feature learning and adaptation are essential to achieving fast scaling behavior. Stochastic gradient descent, particularly with nonlinear parameterizations, enables identification and exploitation of key data modes, yielding generalization error decay rates that tightly match information-theoretic lower bounds (Ding et al., 13 Feb 2025).

6. Resource-Efficient and Domain-Specific Scaling

Recent research has extended PLM scaling laws to specialized domains, edge devices, and multilingual environments:

  • Multilingual PLMs: The loss for each language family in a multilingual model depends on the sampling mixture only through that family's own ratio, yielding a scaling law with sampling-ratio exponent γ_i independent of N and D. Optimal mixtures can therefore be derived on small models and transferred robustly to larger settings, reducing tuning cost (He et al., 15 Oct 2024); a sketch of this mixture derivation follows this list.
  • Edge and Peripheral Devices: Models co-designed for hardware constraints (e.g., peripheral language models) maintain scaling law benefits by minimizing the number of activated parameters and leveraging techniques like multi-head latent attention and squared ReLU activation, while performance still tracks parameter and data increases (Deng et al., 15 Mar 2025).
  • Domain-Specific Tasks: In autonomous driving, both open-loop and closed-loop task metrics (such as minADE and collision rates) improve as power-laws of compute, supporting scaling law application in robotics and planning domains (Baniodeh et al., 9 Jun 2025).
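
A sketch of the mixture derivation mentioned above: given per-family coefficients fitted on a small proxy model (the values below are arbitrary stand-ins, not He et al.'s fits), the optimal sampling ratios can be found by minimizing the aggregate loss subject to the ratios summing to one.

```python
import numpy as np
from scipy.optimize import minimize

# Arbitrary stand-in coefficients for three language families.
E = np.array([1.4, 1.6, 1.8])
A = np.array([3e2, 4e2, 5e2])
B = np.array([2e3, 3e3, 4e3])
alpha = np.array([0.32, 0.34, 0.36])
beta = np.array([0.26, 0.28, 0.30])
gamma = np.array([0.15, 0.20, 0.25])
N_small, D_small = 1e8, 1e10   # small proxy scale used for fitting/transfer

def aggregate_loss(p):
    """Sum of per-family losses L_i(N, D, p_i) at the proxy scale."""
    return float(np.sum((E + A / N_small**alpha + B / D_small**beta) * p**(-gamma)))

constraints = ({"type": "eq", "fun": lambda p: p.sum() - 1.0},)
bounds = [(1e-3, 1.0)] * 3
res = minimize(aggregate_loss, x0=np.full(3, 1 / 3),
               bounds=bounds, constraints=constraints)
print("optimal mixture:", res.x)
```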

7. Limitations and Future Directions

Scaling laws are valid only within certain regimes—typically when neither model capacity nor data volume has saturated the intrinsic structure of the task data. Once a bottleneck or irreducible loss is reached, further scaling yields little benefit (Maloney et al., 2022, Baniodeh et al., 9 Jun 2025). Additionally, factors such as architectural “shape,” context length, data imbalance (in multilingual models), and memory-compute tradeoffs introduce domain-specific modifications to canonical scaling laws (Shen et al., 24 Jun 2024, He et al., 15 Oct 2024, Deng et al., 15 Mar 2025).

Future research directions include:

  • Investigating scaling behavior in extreme low-resource scenarios and for low-resource languages.
  • Integrating scaling law predictions with architectural innovations, regularization schemes, and reinforcement learning objectives.
  • Developing more refined theoretical analysis of deep, nonlinear networks with realistic data spectra to explain scaling law emergence in modern PLMs.

Empirical success and recent theoretical work strongly suggest that scaling laws will remain fundamental for navigating and optimizing resource allocation, architecture selection, and training strategy design for increasingly large and diverse LLMs and related systems.
