
Empirical Scaling Laws: Theory and Practice

Updated 23 March 2026
  • Empirical scaling laws are power-law relationships that quantify how observables change with system size, data, or compute.
  • They are extracted via log–log linear fits and rigorous statistical methods that yield key exponents, informing performance forecasting and resource allocation.
  • Applications span deep learning, astrophysics, ecology, and more, providing actionable insights for predictive modeling and efficient system design.

Empirical scaling laws are quantitative relationships, commonly power laws, that describe how a system’s observable properties change as critical parameters—such as system size, resource allocation, or environmental factors—are varied across orders of magnitude. Originally identified in statistical physics and the natural sciences, empirical scaling laws now underpin predictive understanding in domains ranging from astrophysics to deep learning, recommender systems, plasma physics, ecology, and global weather modeling. Their universality and quantitative rigidity enable extrapolation, optimal resource allocation, and insights into mechanistic principles.

1. Mathematical Forms and Ubiquity of Scaling Laws

Empirical scaling laws frequently manifest as power-law or power-law-plus-constant relationships between an observable—such as error or efficiency—and scaling variables such as the number of parameters $N$, dataset size $D$, or training compute $C$:

$$L(X) = a_X\,X^{-\alpha_X} + b_X$$

where $L(X)$ is the observable (e.g., test loss, error, performance), $X$ is the scaling variable, $a_X$ is a prefactor, $\alpha_X$ is the scaling exponent, and $b_X$ is an irreducible offset (often negligible in deep-learning contexts) (Kaplan et al., 2020, Henighan et al., 2020, Lin et al., 2024, Ardalani et al., 2022, Trikha et al., 26 Sep 2025, Yu et al., 26 Feb 2026). The scaling exponent $\alpha_X$ quantifies how rapidly performance improves as resources increase.
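
As an illustration, the single-variable form above can be fitted to measured (resource, loss) pairs with a standard nonlinear least-squares routine. The data below are synthetic, with made-up constants, purely to show the mechanics:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, alpha, b):
    # L(X) = a * X^(-alpha) + b, the power-law-plus-constant form above
    return a * x ** (-alpha) + b

# Synthetic "loss vs. dataset size" measurements (illustrative values only)
rng = np.random.default_rng(0)
X = np.logspace(3, 8, 20)                       # e.g. dataset sizes 1e3 .. 1e8
L_true = power_law(X, a=5.0, alpha=0.30, b=0.10)
L_obs = L_true * (1 + 0.01 * rng.standard_normal(X.size))  # small noise

# Log-spaced samples keep every decade represented in the fit
params, _ = curve_fit(power_law, X, L_obs, p0=(1.0, 0.5, 0.0))
a_hat, alpha_hat, b_hat = params
print(f"alpha ≈ {alpha_hat:.3f}, irreducible offset b ≈ {b_hat:.3f}")
```

Sampling the resource axis in log space, as here, matters in practice: uniform linear sampling would concentrate nearly all points in the final decade and leave the exponent poorly constrained.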

More complex settings, such as transfer learning or ecological systems, can involve multi-variable scaling laws with additive or multiplicative structure, often reflecting fundamental constraints:

$$L(p, f) = (A\,p^{-\alpha} + G)\,f^{-\beta} + E$$

where $p$ and $f$ are the pre-training and fine-tuning data volumes, $G$ captures transfer inefficiency, and $E$ is the irreducible loss (Barnett, 2024).
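
A quick numerical sketch (with invented constants, not fitted values from Barnett, 2024) shows how a nonzero transfer gap $G$ bounds the benefit of pre-training data alone:

```python
# L(p, f) = (A p^(-alpha) + G) f^(-beta) + E, evaluated with made-up
# constants to show that scaling p alone bottoms out at G*f^(-beta) + E.
def transfer_loss(p, f, A=10.0, alpha=0.4, G=0.5, beta=0.3, E=0.05):
    return (A * p ** (-alpha) + G) * f ** (-beta) + E

# Fine-tuning data held fixed; pre-training data varied over six decades
for p in (1e3, 1e6, 1e9):
    print(f"p={p:.0e}: L = {transfer_loss(p, f=1e4):.4f}")
```

With $G > 0$, the first and second decades of extra pre-training data help far more than the third: the loss approaches the floor $G\,f^{-\beta} + E$, and only more fine-tuning data can lower it further.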

Historically, empirical scaling laws were first noted in physical and biological systems—e.g., metabolic scaling in biology, $B = B_0\,M^\beta$ (Ribeiro et al., 2021), or power-law systematics in isotopic abundances (0901.3592)—but analogous forms now govern statistical and machine-learning systems, and have been theoretically and numerically validated in linear regression (Lin et al., 2024), random-feature models (Maloney et al., 2022), and neural architectures (Kaplan et al., 2020, Ivgi et al., 2022, Ngo et al., 10 Oct 2025, Trikha et al., 26 Sep 2025).

2. Methodological Approaches: Fitting and Interpreting Scaling Laws

Determining scaling laws requires systematic variation of the resource of interest (model size, data, compute, or energy), control of confounding factors, and measurement of performance metrics over wide dynamic ranges. Standard practice is to plot the observable against the scaling variable on log–log axes, where a pure power law appears as a straight line whose slope gives the exponent, and to extract $a_X$, $\alpha_X$, and $b_X$ by regression.

The fitting process is robust only when the system remains in its pre-saturation regime; at sufficiently large resource values, performance often asymptotes to an irreducible loss floor, shifting the curve from steep to flat (e.g., in deep CTR models (Ardalani et al., 2022)).
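
This saturation can be made concrete via the local log–log slope of the power-law-plus-constant form; a minimal sketch with illustrative constants:

```python
import numpy as np

# L(X) = a X^(-alpha) + b with an irreducible floor b (values illustrative)
a, alpha, b = 5.0, 0.30, 0.10
X = np.logspace(2, 12, 200)
L = a * X ** (-alpha) + b

# Local slope d(log L)/d(log X); a pure power law would sit at -alpha everywhere
slope = np.gradient(np.log(L), np.log(X))
s_early = np.interp(np.log(1e3), np.log(X), slope)   # pre-saturation regime
s_late = np.interp(np.log(1e11), np.log(X), slope)   # floor-dominated regime
print(f"slope at X=1e3: {s_early:.3f}, slope at X=1e11: {s_late:.3f}")
```

The slope starts near $-\alpha$ while the reducible term dominates, then decays toward zero as $b$ takes over; fits restricted to the flat regime would badly underestimate the true exponent.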

3. Canonical Domains and Quantitative Regularities

Deep Neural Networks and Generative Modeling

Transformer-based LLMs, image/video autoregressive generative models, recommendation systems, and neural material models display remarkably consistent scaling trends for loss and accuracy:

  • Language modeling loss: $L(N) \sim N^{-0.076}$, $L(D) \sim D^{-0.095}$, $L(C) \sim C^{-0.057}$ (Kaplan et al., 2020, Henighan et al., 2020).
  • Image/video modeling: exponents of $0.11$–$0.24$ for $N$, with similar structure for $D$ and $C$ (Henighan et al., 2020).
  • Neural material modeling: $L(P) \sim P^{-0.383}$ for equivariant architectures (EquiformerV2), versus $L(P) \sim P^{-0.120}$ for conventional transformers—a roughly threefold difference in exponent (Trikha et al., 26 Sep 2025).
  • Recommender systems: $L(D) \sim D^{-0.10} + 0.98$ and $L(N) \sim N^{-0.37} + 0.98$ (saturated). Returns from increasing $N$ eventually flatten, while data scaling maintains steady improvements (Ardalani et al., 2022).
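
Such exponents support quick back-of-envelope extrapolation: in the reducible-loss regime (offset $b$ negligible), scaling a resource by a factor $k$ multiplies the loss by $k^{-\alpha}$. Using the language-modeling exponents quoted above:

```python
# Back-of-envelope extrapolation in the reducible-loss regime, where
# L ∝ X^(-alpha) and the irreducible offset is assumed negligible.
def loss_ratio(scale_factor, alpha):
    """Multiplicative change in loss when the resource grows by scale_factor."""
    return scale_factor ** (-alpha)

per_decade_N = loss_ratio(10, 0.076)   # 10x parameters
per_decade_D = loss_ratio(10, 0.095)   # 10x data
print(f"10x N: loss x {per_decade_N:.3f}; 10x D: loss x {per_decade_D:.3f}")
```

Each decade of parameters buys roughly a 16% loss reduction, and each decade of data roughly 20%—small per-step gains whose compounding across many decades is what makes scaling worthwhile.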

Scientific and Natural Systems

  • Biological allometry: $B = B_0\,M^\beta$ with $\beta \approx 2/3$ (Rubner, surface-area hypothesis) or $\beta \approx 3/4$ (West–Brown–Enquist, fractal branching networks), depending on taxon or body size (Ribeiro et al., 2021).
  • Ecological scaling: power-law distributions of species abundance, body size, and species–area relationships, with tightly linked exponents (e.g., $\eta = \delta + \gamma$) reflecting community dynamics under resource constraints (Zaoli et al., 2017).
  • Isotopic abundances in nucleosynthesis: two empirical abundance regularities for p- and s-nuclei (first scaling: $R_{s/p}(Z) \equiv N_s(Z)/N_p(Z) \approx 23$; second scaling: $R_{p/p}(Z) \approx 1$ over a wide range of $Z$) (0901.3592).

Physics and Other Regimes

  • Self-focused laser-plasma interactions: Nonlinear relationships between plasma density, laser energy, maximum normalized vector potential, depletion length, channel radius, and wakefield amplitude—each described by empirical power laws validated via particle-in-cell simulations (Martelli et al., 5 Jun 2025).
  • Weather modeling: Aurora's validation loss obeys $L(D) \sim D^{-0.51}$ and $L(N) \sim N^{-0.3}$; width scaling is paramount in meteorological models, unlike in language modeling (Yu et al., 26 Feb 2026).

4. Structural Variations and Theoretical Accounts

Power-law exponents and forms depend on domain, architecture, inductive bias, and symmetry:

  • Equivariant architectures such as EGNN, GemNet-OC, and eSEN exhibit substantially steeper data and parameter scaling exponents than non-equivariant models, with performance differentials increasing at larger scales (Ngo et al., 10 Oct 2025, Trikha et al., 26 Sep 2025).
  • Theoretical models attribute neural scaling laws to statistical properties of the data: e.g., power-law latent spectrum in random-feature models (Maloney et al., 2022), percolation-based power-law-distributed subtasks in realistic data (Brill, 2024), and power-law covariance spectra in linear regression with sketched covariates (Lin et al., 2024).
  • Finite latent dimension, spectral support, or irreducible entropy can induce breakdown of scaling at large scales (flattening/plateaus) (Maloney et al., 2022, Henighan et al., 2020, Ardalani et al., 2022).
  • Linked ecological exponents illustrate that constraints (resource, energy, space) produce multiple scaling relationships that co-vary deterministically (Zaoli et al., 2017).

5. Practical Implications: Prediction, Resource Allocation, and Extrapolation

Empirical scaling laws enable high-fidelity extrapolation and resource optimization in model-building and scientific experimentation:

  • Accurate performance prediction for larger (yet-untrained) models, enabling cost-effective model selection and debugging strategies (Ivgi et al., 2022).
  • Compute-optimal rules for resource allocation: in language modeling, allocating most compute to model size ($N \propto C^{0.7}$) and less to data ($D \propto C^{0.3}$) (Kaplan et al., 2020, Henighan et al., 2020); in weather modeling, data scaling dominates over parameter scaling (Yu et al., 26 Feb 2026).
  • Identification of diminishing returns ("saturation regime"): e.g., parameter-scaling efficiency in deep recommender models is exhausted far before data-scaling efficiency, compelling a pivot toward ingesting more data rather than expanding $N$ (Ardalani et al., 2022).
  • In transfer learning, scaling laws quantify when further pre-training is effective (small transfer gap $G$) versus when downstream data acquisition is necessary (large $G$) (Barnett, 2024, Hernandez et al., 2021).
  • Real-time model development efficiency: pilot experiments with small-scale models can reveal scaling exponents that drive architecture, hyperparameter, or dataset size choices (Ivgi et al., 2022, Barnett, 2024).
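
The compute-optimal allocation rule quoted above can be sketched numerically; the prefactors below are placeholders, not fitted constants from the cited papers:

```python
# Sketch of the compute-optimal split for language models (N ∝ C^0.7,
# D ∝ C^0.3); the prefactors N0 and D0 are arbitrary placeholders.
def compute_optimal(C, N0=1.0, D0=1.0, a_N=0.7, a_D=0.3):
    """Split a compute budget C between model size N and data D."""
    return N0 * C ** a_N, D0 * C ** a_D

N1, D1 = compute_optimal(1.0)
N2, D2 = compute_optimal(100.0)   # budget grows 100x
print(f"N grows {N2 / N1:.1f}x, D grows {D2 / D1:.2f}x")
```

A 100× larger budget thus goes mostly into model size (~25× more parameters) and only modestly into data (~4× more tokens) under these exponents—whereas a domain like weather modeling, with its data-dominated scaling, would invert that split.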

6. Open Problems, Limitations, and Generalization

Empirical scaling laws, while highly predictive within observed regimes, are subject to various domain- and regime-dependent limitations:

  • They often break down upon reaching the system’s inherent entropy, finite latent dimensionality, or resource-imposed ceilings, resulting in performance plateaus (Maloney et al., 2022, Henighan et al., 2020, Ardalani et al., 2022).
  • Exponents are architecture- and task-dependent; optimal scaling for one modality or architecture may not transfer to another (Ngo et al., 10 Oct 2025, Trikha et al., 26 Sep 2025, Brill, 2024).
  • There remain open questions about the theoretical origin and universality of observed exponents (e.g., why transformer language models favor $N \propto C^{0.7}$ and $D \propto C^{0.3}$ (Henighan et al., 2020), or why scaling economies exist in biological systems (Ribeiro et al., 2021)).
  • Predicting the regime and value of scaling exponents from first principles remains a frontier, and devising interventions (pruning, active learning, symmetry injection) that fundamentally alter scaling trajectories is an active area (Brill, 2024, Ngo et al., 10 Oct 2025).

7. Summary Table of Exemplary Scaling Laws Across Domains

Domain/Setting | Observable | Scaling Law | Source
Transformer LMs | Cross-entropy loss | $L(N) \sim N^{-0.076}$, $L(D) \sim D^{-0.095}$ | (Kaplan et al., 2020)
Gen. image (8×8) | CE loss / image | $L(N) = 3.12 + (N_0/N)^{0.24}$ | (Henighan et al., 2020)
Material modeling | MSE loss | $L(P) = 776\,P^{-0.383}$ (EquiformerV2); $175\,P^{-0.120}$ (transformer) | (Trikha et al., 26 Sep 2025)
Neural force fields | MAE | $L(N) = A\,N^{-\alpha}$, $\alpha$ up to $0.82$ (eSEN) | (Ngo et al., 10 Oct 2025)
Rec. systems (DLRM) | Normalized log-loss | $L(D) = 0.07\,D^{-0.10} + 0.98$; $L(N) = 0.45\,N^{-0.37} + 0.98$ | (Ardalani et al., 2022)
Plasma physics | $a_{P,\max}$, $E_c$ | $a_{P,\max} \sim \sqrt{(n_e/n_c)\,E_L}\,(1 - n_e/n_c)$ | (Martelli et al., 5 Jun 2025)
Weather models | Validation loss | $L(D) \sim D^{-0.5}$, $L(N) \sim N^{-0.3}$ (Aurora) | (Yu et al., 26 Feb 2026)
Biology (metabolic) | $B$ | $B = B_0\,M^{2/3}$ or $B = B_0\,M^{3/4}$ | (Ribeiro et al., 2021)
Ecology (SAR etc.) | $S(A)$, $P(m \mid A)$ | $S(A) = A^z$, $P(m \mid A) = m^{-\delta}$, exponents linked | (Zaoli et al., 2017)

Exponents and scaling structure are task- and architecture-dependent, providing both constraints and opportunities for model optimization, domain-invariant prediction, and deeper theoretical understanding.
