
Empirical Scaling Laws: Theory and Practice

Updated 23 March 2026
  • Empirical scaling laws are power-law relationships that quantify how observables change with system size, data, or compute.
  • They are extracted via log–log linear fits and rigorous statistical methods that yield key exponents, informing performance forecasting and resource allocation.
  • Applications span deep learning, astrophysics, ecology, and more, providing actionable insights for predictive modeling and efficient system design.

Empirical scaling laws are quantitative relationships, commonly power laws, that describe how a system’s observable properties change as critical parameters—such as system size, resource allocation, or environmental factors—are varied across orders of magnitude. Originally identified in statistical physics and the natural sciences, empirical scaling laws now underpin predictive understanding in domains ranging from astrophysics to deep learning, recommender systems, plasma physics, ecology, and global weather modeling. Their universality and quantitative rigidity enable extrapolation, optimal resource allocation, and insights into mechanistic principles.

1. Mathematical Forms and Ubiquity of Scaling Laws

Empirical scaling laws frequently manifest as power-law or power-law-plus-constant relationships between an observable—such as error or efficiency—and scaling variables such as the number of parameters $N$, dataset size $D$, or training compute $C$:

$$L(X) = a_X\,X^{-\alpha_X} + b_X$$

where $L(X)$ is the observable (e.g., test loss, error, performance), $X$ is the scaling variable, $a_X$ is a prefactor, $\alpha_X$ is the scaling exponent, and $b_X$ is an irreducible offset (often negligible in deep-learning contexts) (Kaplan et al., 2020, Henighan et al., 2020, Lin et al., 2024, Ardalani et al., 2022, Trikha et al., 26 Sep 2025, Yu et al., 26 Feb 2026). The scaling exponent $\alpha_X$ quantifies how rapidly performance improves as resources increase.
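
As an illustration, the single-variable form above can be fitted to measured (resource, loss) pairs with a standard nonlinear least-squares routine. The data below are synthetic, with made-up constants, purely to show the mechanics:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, alpha, b):
    # L(X) = a * X^(-alpha) + b, the power-law-plus-constant form above
    return a * x ** (-alpha) + b

# Synthetic "loss vs. dataset size" measurements (illustrative values only)
rng = np.random.default_rng(0)
X = np.logspace(3, 8, 20)                       # e.g. dataset sizes 1e3 .. 1e8
L_true = power_law(X, a=5.0, alpha=0.30, b=0.10)
L_obs = L_true * (1 + 0.01 * rng.standard_normal(X.size))  # small noise

# Log-spaced samples keep every decade represented in the fit
params, _ = curve_fit(power_law, X, L_obs, p0=(1.0, 0.5, 0.0))
a_hat, alpha_hat, b_hat = params
print(f"alpha ≈ {alpha_hat:.3f}, irreducible offset b ≈ {b_hat:.3f}")
```

Sampling the resource axis in log space, as here, matters in practice: uniform linear sampling would concentrate nearly all points in the final decade and leave the exponent poorly constrained.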

More complex settings, such as transfer learning or ecological systems, can involve multi-variable scaling laws with additive or multiplicative structure, often reflecting fundamental constraints:

$$L(p, f) = (A\,p^{-\alpha} + G)\,f^{-\beta} + E$$

where $p$ and $f$ are the pre-training and fine-tuning data volumes, $G$ captures transfer inefficiency, and $E$ is the irreducible loss (Barnett, 2024).
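
A quick numerical sketch (with invented constants, not fitted values from Barnett, 2024) shows how a nonzero transfer gap $G$ bounds the benefit of pre-training data alone:

```python
# L(p, f) = (A p^(-alpha) + G) f^(-beta) + E, evaluated with made-up
# constants to show that scaling p alone bottoms out at G*f^(-beta) + E.
def transfer_loss(p, f, A=10.0, alpha=0.4, G=0.5, beta=0.3, E=0.05):
    return (A * p ** (-alpha) + G) * f ** (-beta) + E

# Fine-tuning data held fixed; pre-training data varied over six decades
for p in (1e3, 1e6, 1e9):
    print(f"p={p:.0e}: L = {transfer_loss(p, f=1e4):.4f}")
```

With $G > 0$, the first and second decades of extra pre-training data help far more than the third: the loss approaches the floor $G\,f^{-\beta} + E$, and only more fine-tuning data can lower it further.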

Historically, empirical scaling laws were first noted in physical and biological systems—e.g., metabolic scaling in biology, $B = B_0\,M^\beta$ (Ribeiro et al., 2021), or power-law systematics in isotopic abundances (0901.3592)—but analogous forms now govern statistical and machine-learning systems, and have been theoretically and numerically validated in linear regression (Lin et al., 2024), random-feature models (Maloney et al., 2022), and neural architectures (Kaplan et al., 2020, Ivgi et al., 2022, Ngo et al., 10 Oct 2025, Trikha et al., 26 Sep 2025).

2. Methodological Approaches: Fitting and Interpreting Scaling Laws

Determining scaling laws requires systematic variation of the resource of interest (model size, data, compute, or energy), control of confounding factors, and measurement of performance metrics over wide dynamic ranges. Standard practice is to plot the observable against the scaling variable on log–log axes, where a pure power law appears as a straight line whose slope gives the exponent, and to extract $a_X$, $\alpha_X$, and $b_X$ by regression.

The fitting process is robust only when the system remains in its pre-saturation regime; at sufficiently large resource values, performance often asymptotes to an irreducible loss floor, shifting the curve from steep to flat (e.g., in deep CTR models (Ardalani et al., 2022)).
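
This saturation can be made concrete via the local log–log slope of the power-law-plus-constant form; a minimal sketch with illustrative constants:

```python
import numpy as np

# L(X) = a X^(-alpha) + b with an irreducible floor b (values illustrative)
a, alpha, b = 5.0, 0.30, 0.10
X = np.logspace(2, 12, 200)
L = a * X ** (-alpha) + b

# Local slope d(log L)/d(log X); a pure power law would sit at -alpha everywhere
slope = np.gradient(np.log(L), np.log(X))
s_early = np.interp(np.log(1e3), np.log(X), slope)   # pre-saturation regime
s_late = np.interp(np.log(1e11), np.log(X), slope)   # floor-dominated regime
print(f"slope at X=1e3: {s_early:.3f}, slope at X=1e11: {s_late:.3f}")
```

The slope starts near $-\alpha$ while the reducible term dominates, then decays toward zero as $b$ takes over; fits restricted to the flat regime would badly underestimate the true exponent.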

3. Canonical Domains and Quantitative Regularities

Deep Neural Networks and Generative Modeling

Transformer-based LLMs, image/video autoregressive generative models, recommendation systems, and neural material models display remarkably consistent scaling trends for loss and accuracy:

  • Language modeling loss: $L(N) \sim N^{-0.076}$, $L(D) \sim D^{-0.095}$, $L(C) \sim C^{-0.057}$ (Kaplan et al., 2020, Henighan et al., 2020).
  • Image/video modeling: exponents of $0.11$–$0.24$ for $N$, with similar structure for $D$ and $C$ (Henighan et al., 2020).
  • Neural material modeling: $L(P) \sim P^{-0.383}$ for equivariant architectures (EquiformerV2), versus $L(P) \sim P^{-0.120}$ for conventional transformers—a roughly threefold difference in exponent (Trikha et al., 26 Sep 2025).
  • Recommender systems: $L(D) \sim D^{-0.10} + 0.98$ and $L(N) \sim N^{-0.37} + 0.98$ (saturated). Returns from increasing $N$ eventually flatten, while data scaling maintains steady improvements (Ardalani et al., 2022).
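
Such exponents support quick back-of-envelope extrapolation: in the reducible-loss regime (offset $b$ negligible), scaling a resource by a factor $k$ multiplies the loss by $k^{-\alpha}$. Using the language-modeling exponents quoted above:

```python
# Back-of-envelope extrapolation in the reducible-loss regime, where
# L ∝ X^(-alpha) and the irreducible offset is assumed negligible.
def loss_ratio(scale_factor, alpha):
    """Multiplicative change in loss when the resource grows by scale_factor."""
    return scale_factor ** (-alpha)

per_decade_N = loss_ratio(10, 0.076)   # 10x parameters
per_decade_D = loss_ratio(10, 0.095)   # 10x data
print(f"10x N: loss x {per_decade_N:.3f}; 10x D: loss x {per_decade_D:.3f}")
```

Each decade of parameters buys roughly a 16% loss reduction, and each decade of data roughly 20%—small per-step gains whose compounding across many decades is what makes scaling worthwhile.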

Scientific and Natural Systems

  • Biological allometry: $B = B_0\,M^\beta$ with $\beta \approx 2/3$ (Rubner, surface-area hypothesis) or $\beta \approx 3/4$ (West–Brown–Enquist, fractal branching networks), depending on taxon or body size (Ribeiro et al., 2021).
  • Ecological scaling: power-law distributions of species abundance, body size, and species–area relationships, with tightly linked exponents (e.g., $\eta = \delta + \gamma$) reflecting community dynamics under resource constraints (Zaoli et al., 2017).
  • Isotopic abundances in nucleosynthesis: two empirical abundance regularities for p- and s-nuclei (first scaling: $R_{s/p}(Z) \equiv N_s(Z)/N_p(Z) \approx 23$; second scaling: $R_{p/p}(Z) \approx 1$ over a wide range of $Z$) (0901.3592).

Physics and Other Regimes

  • Self-focused laser-plasma interactions: Nonlinear relationships between plasma density, laser energy, maximum normalized vector potential, depletion length, channel radius, and wakefield amplitude—each described by empirical power laws validated via particle-in-cell simulations (Martelli et al., 5 Jun 2025).
  • Weather modeling: Aurora's validation loss obeys $L(D) \sim D^{-0.51}$ and $L(N) \sim N^{-0.3}$; width scaling is paramount in meteorological models, unlike in language modeling (Yu et al., 26 Feb 2026).

4. Structural Variations and Theoretical Accounts

Power-law exponents and forms depend on domain, architecture, inductive bias, and symmetry:

  • Equivariant architectures such as EGNN, GemNet-OC, and eSEN exhibit substantially steeper data and parameter scaling exponents than non-equivariant models, with performance differentials increasing at larger scales (Ngo et al., 10 Oct 2025, Trikha et al., 26 Sep 2025).
  • Theoretical models attribute neural scaling laws to statistical properties of the data: e.g., power-law latent spectrum in random-feature models (Maloney et al., 2022), percolation-based power-law-distributed subtasks in realistic data (Brill, 2024), and power-law covariance spectra in linear regression with sketched covariates (Lin et al., 2024).
  • Finite latent dimension, spectral support, or irreducible entropy can induce breakdown of scaling at large scales (flattening/plateaus) (Maloney et al., 2022, Henighan et al., 2020, Ardalani et al., 2022).
  • Linked ecological exponents illustrate that constraints (resource, energy, space) produce multiple scaling relationships that co-vary deterministically (Zaoli et al., 2017).

5. Practical Implications: Prediction, Resource Allocation, and Extrapolation

Empirical scaling laws enable high-fidelity extrapolation and resource optimization in model-building and scientific experimentation:

  • Accurate performance prediction for larger (yet-untrained) models, enabling cost-effective model selection and debugging strategies (Ivgi et al., 2022).
  • Compute-optimal rules for resource allocation: in language modeling, allocating most compute to model size ($N \propto C^{0.7}$) and less to data ($D \propto C^{0.3}$) (Kaplan et al., 2020, Henighan et al., 2020); in weather modeling, data scaling dominates over parameter scaling (Yu et al., 26 Feb 2026).
  • Identification of diminishing returns ("saturation regime"): e.g., parameter-scaling efficiency in deep recommender models is exhausted far before data-scaling efficiency, compelling a pivot toward ingesting more data rather than expanding $N$ (Ardalani et al., 2022).
  • In transfer learning, scaling laws quantify when further pre-training is effective (small transfer gap $G$) versus when downstream data acquisition is necessary (large $G$) (Barnett, 2024, Hernandez et al., 2021).
  • Real-time model development efficiency: pilot experiments with small-scale models can reveal scaling exponents that drive architecture, hyperparameter, or dataset size choices (Ivgi et al., 2022, Barnett, 2024).
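
The compute-optimal allocation rule quoted above can be sketched numerically; the prefactors below are placeholders, not fitted constants from the cited papers:

```python
# Sketch of the compute-optimal split for language models (N ∝ C^0.7,
# D ∝ C^0.3); the prefactors N0 and D0 are arbitrary placeholders.
def compute_optimal(C, N0=1.0, D0=1.0, a_N=0.7, a_D=0.3):
    """Split a compute budget C between model size N and data D."""
    return N0 * C ** a_N, D0 * C ** a_D

N1, D1 = compute_optimal(1.0)
N2, D2 = compute_optimal(100.0)   # budget grows 100x
print(f"N grows {N2 / N1:.1f}x, D grows {D2 / D1:.2f}x")
```

A 100× larger budget thus goes mostly into model size (~25× more parameters) and only modestly into data (~4× more tokens) under these exponents—whereas a domain like weather modeling, with its data-dominated scaling, would invert that split.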

6. Open Problems, Limitations, and Generalization

Empirical scaling laws, while highly predictive within observed regimes, are subject to various domain- and regime-dependent limitations:

  • They often break down upon reaching the system’s inherent entropy, finite latent dimensionality, or resource-imposed ceilings, resulting in performance plateaus (Maloney et al., 2022, Henighan et al., 2020, Ardalani et al., 2022).
  • Exponents are architecture- and task-dependent; optimal scaling for one modality or architecture may not transfer to another (Ngo et al., 10 Oct 2025, Trikha et al., 26 Sep 2025, Brill, 2024).
  • There remain open questions about the theoretical origin and universality of observed exponents (e.g., why transformer language models favor $N \propto C^{0.7}$ and $D \propto C^{0.3}$ (Henighan et al., 2020), or why scaling economies exist in biological systems (Ribeiro et al., 2021)).
  • Predicting the regime and value of scaling exponents from first principles remains a frontier, and devising interventions (pruning, active learning, symmetry injection) that fundamentally alter scaling trajectories is an active area (Brill, 2024, Ngo et al., 10 Oct 2025).

7. Summary Table of Exemplary Scaling Laws Across Domains

Domain/Setting | Observable | Scaling Law | Source
Transformer LMs | Cross-entropy loss | $L(N) \sim N^{-0.076}$, $L(D) \sim D^{-0.095}$ | (Kaplan et al., 2020)
Gen. image (8×8) | CE loss / image | $L(N) = 3.12 + (N_0/N)^{0.24}$ | (Henighan et al., 2020)
Material modeling | MSE loss | $L(P) = 776\,P^{-0.383}$ (EquiformerV2); $175\,P^{-0.120}$ (transformer) | (Trikha et al., 26 Sep 2025)
Neural force fields | MAE | $L(N) = A\,N^{-\alpha}$, $\alpha$ up to $0.82$ (eSEN) | (Ngo et al., 10 Oct 2025)
Rec. systems (DLRM) | Normalized log-loss | $L(D) = 0.07\,D^{-0.10} + 0.98$; $L(N) = 0.45\,N^{-0.37} + 0.98$ | (Ardalani et al., 2022)
Plasma physics | $a_{P,\max}$, $E_c$ | $a_{P,\max} \sim \sqrt{(n_e/n_c)\,E_L}\,(1 - n_e/n_c)$ | (Martelli et al., 5 Jun 2025)
Weather models | Validation loss | $L(D) \sim D^{-0.5}$, $L(N) \sim N^{-0.3}$ (Aurora) | (Yu et al., 26 Feb 2026)
Biology (metabolic) | $B$ | $B = B_0\,M^{2/3}$ or $B = B_0\,M^{3/4}$ | (Ribeiro et al., 2021)
Ecology (SAR etc.) | $S(A)$, $P(m \mid A)$ | $S(A) = A^z$, $P(m \mid A) = m^{-\delta}$, exponents linked | (Zaoli et al., 2017)

Exponents and scaling structure are task- and architecture-dependent, providing both constraints and opportunities for model optimization, domain-invariant prediction, and deeper theoretical understanding.
