Empirical Scaling Law Overview
- An empirical scaling law is a principle describing how system performance scales with changes in resources such as model size, data volume, and compute, typically following a power-law relationship.
- It is derived from systematic experimental and simulation data using methods such as statistical validation and parameter fitting, revealing universal trends across diverse scientific domains.
- These laws guide optimal resource allocation, performance extrapolation, and model design in fields ranging from physics and ecology to modern machine learning.
Empirical scaling law describes the quantitative, often universal, relationship between critical performance metrics of a physical, biological, or algorithmic system and the scale of fundamental resources such as size, parameter count, data amount, or compute. In the context of contemporary scientific research, these laws are typically established from systematic experimental or simulation data and reveal how system behavior improves, saturates, or transitions as scaling variables are varied across orders of magnitude. The empirical scaling law is foundational in high-energy particle physics, ecology, nuclear fragmentation, atomic physics, and especially in the analysis, prediction, and optimization of large-scale machine learning models.
1. General Formulation and Universality
Empirical scaling laws often manifest as power-law or power-law-plus-constant relationships. In their canonical form, performance is described as

$$L(x) = a\,x^{-\alpha} + L_\infty,$$

where $x$ is the resource parameter (e.g., model size, dataset size, atomic number), $\alpha$ is the scaling exponent, $L_\infty$ is the asymptotic or irreducible error/floor, and $a$ is a system- and observable-dependent constant. A minimal fitting sketch of this form is given at the end of this section.
This basic structure appears broadly—spanning contexts as diverse as:
- Machine learning generalization error vs. training set size or model size (Hestness et al., 2017, Kaplan et al., 2020, Tissue et al., 20 Aug 2024, Li et al., 12 Jun 2025)
- Particle cross-sections or amplitude features vs. energy or geometric quantities (Pancheri et al., 2014, Harman et al., 2018)
- Biological metabolic rates vs. body mass (Ribeiro et al., 2021)
- Ecological abundance/area relationships (Zaoli et al., 2017)
The universality and predictive accuracy across domains support the centrality of empirical scaling laws in both theory and practice.
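As a concrete illustration of the canonical form, the sketch below fits $L(x) = a\,x^{-\alpha} + L_\infty$ to synthetic measurements with SciPy's `curve_fit`; the data, constants, and starting values are illustrative assumptions rather than values from any cited study.

```python
# Minimal sketch: fitting the canonical power-law-plus-constant form
# L(x) = a * x**(-alpha) + L_inf to synthetic loss-vs-resource measurements.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, a, alpha, l_inf):
    """Irreducible floor l_inf plus a decaying power law in the resource x."""
    return a * x ** (-alpha) + l_inf

# Synthetic measurements spanning several orders of magnitude in x
# (hypothetical numbers, chosen only to illustrate the fit).
x = np.logspace(3, 9, 13)                      # e.g., parameter counts
rng = np.random.default_rng(0)
y = 5.0 * x ** (-0.08) + 1.7 + rng.normal(0, 0.01, x.size)

# p0 keeps the optimizer in a sensible region of (a, alpha, floor) space.
popt, pcov = curve_fit(scaling_law, x, y, p0=[1.0, 0.1, 1.0], maxfev=10000)
a_hat, alpha_hat, floor_hat = popt
print(f"a = {a_hat:.3f}, alpha = {alpha_hat:.3f}, floor = {floor_hat:.3f}")
```

When measurements span many orders of magnitude, fitting in log space or weighting points is often preferable, but the direct fit above conveys the basic procedure.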
2. Methodological Paradigms, Derivation, and Measurement
Empirical scaling laws are obtained via systematic, typically large-scale, experimental or computational campaigns. The methodology generally involves:
- Varying one or more resource dimensions (e.g., $N$ for model size, $D$ for data, $C$ for compute) across several orders of magnitude
- Fitting observed system performance (e.g., loss, cross-section, yield, metabolic rate) to parameterized models—commonly power laws, sometimes with corrections for higher-order or relativistic/percolation effects
- Statistical validation (e.g., goodness-of-fit tests, bootstrapping for confidence intervals (Ivgi et al., 2022), or cross-validation on extrapolated scales (Li et al., 12 Jun 2025)); a bootstrap sketch follows this list
- Where possible, controlling or minimizing confounders such as hyperparameter differences, and approximating the "converged" or "sufficiently large" regime in the other scaling variables
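As a minimal sketch of the statistical-validation step referenced in the list above, the following code bootstraps a confidence interval for the fitted exponent; the measurements are synthetic and the whole setup is an assumption for illustration.

```python
# Minimal sketch: bootstrap confidence interval for a fitted scaling exponent.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, a, alpha, l_inf):
    return a * x ** (-alpha) + l_inf

# Synthetic measurements (illustrative only).
x = np.logspace(3, 8, 11)
rng = np.random.default_rng(1)
y = 4.0 * x ** (-0.1) + 2.0 + rng.normal(0, 0.02, x.size)

def fit_alpha(xs, ys):
    popt, _ = curve_fit(scaling_law, xs, ys, p0=[1.0, 0.1, 1.0], maxfev=10000)
    return popt[1]

# Resample (x, y) pairs with replacement and refit to build an exponent distribution.
alphas = []
for _ in range(500):
    idx = rng.integers(0, x.size, x.size)
    try:
        alphas.append(fit_alpha(x[idx], y[idx]))
    except RuntimeError:      # a degenerate resample may fail to converge; skip it
        pass

lo, hi = np.percentile(alphas, [2.5, 97.5])
print(f"alpha = {fit_alpha(x, y):.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```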
Table: Common formalisms and empirical scaling law variables in different domains
Domain | Scaling Law Example | Resource Variables |
---|---|---|
Deep Learning | $L(N) \propto N^{-\alpha_N}$, $L(D) \propto D^{-\alpha_D}$ | Parameters $N$, dataset size $D$, compute $C$ |
High-Energy Scattering | Geometrical scaling of dip position and cross-sections (see Section 4) | Energy, interaction radius, amplitude/phase parameters |
OCR | Power-law error rate in model size, data, and compute (Rang et al., 2023) | Model size $N$, data volume $D$, compute $C$ |
Nuclear Fragmentation | Multi-factor scaled cross-section formula (see Section 4) | Projectile and fragment mass/nucleon numbers, isospin-related exponents |
Atomic Physics (DR) | $Z$-scaling of recombination strengths with relativistic corrections (see Section 4) | Atomic number $Z$, parametrized fit constants |
Biology/Ecology | $B \propto M^{b}$ (metabolic), $S \propto A^{z}$ (species–area) | Mass $M$, area $A$, scaling exponents $b$, $z$ |
3. Scaling Laws in Machine Learning and Data-driven Sciences
Neural scaling laws have become a key tool for understanding, predicting, and optimizing large-scale AI and deep learning systems:
- Power-law scaling in model and data size: Validation loss or error tends to decrease predictably as $L(N) \propto N^{-\alpha_N}$ and $L(D) \propto D^{-\alpha_D}$ for model size $N$ and dataset size $D$, often for many orders of magnitude (Kaplan et al., 2020, Hestness et al., 2017, Tissue et al., 20 Aug 2024, Li et al., 12 Jun 2025).
- Joint scaling surfaces: Recent refinements model the loss as a joint function $L(N, D)$ rather than along single axes, revealing non-constant exponents and regime shifts (e.g., Farseer (Li et al., 12 Jun 2025)); a worked allocation example using such a surface follows this list.
- Limiting constants and saturation: Performance converges to a finite floor under unlimited scaling—observed empirically for cross-entropy and error rates in both LLMs and recommendation systems (Ardalani et al., 2022).
- Sample efficiency and compute-optimality: Larger models are not only more accurate but also more sample efficient, achieving target loss with fewer data or steps (Kaplan et al., 2020).
- Transfer and pre-training scaling: The "transfer gap" quantifies the limit of downstream improvement achievable by upstream pre-training; empirical scaling laws for transfer surface support principled allocation between pre-training and fine-tuning (Barnett, 30 Aug 2024).
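To make compute-optimal allocation concrete, the sketch below assumes a hypothetical joint surface $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$ together with the common approximation of roughly $6ND$ training FLOPs; every coefficient is an illustrative placeholder, not a fitted value from the cited papers.

```python
# Minimal sketch: choosing a compute-optimal (N, D) split from an assumed joint
# scaling surface L(N, D) = E + A / N**alpha + B / D**beta under C ≈ 6 * N * D FLOPs.
import numpy as np

E, A, B = 2.0, 500.0, 600.0       # illustrative placeholder constants
alpha, beta = 0.35, 0.30          # illustrative placeholder exponents

def loss(N, D):
    return E + A / N ** alpha + B / D ** beta

def optimal_split(C, num=2000):
    """Sweep candidate model sizes; the token budget follows from D = C / (6 N)."""
    N = np.logspace(6, 13, num)   # candidate parameter counts
    D = C / (6.0 * N)             # tokens implied by the fixed compute budget
    L = loss(N, D)
    i = np.argmin(L)
    return N[i], D[i], L[i]

for C in [1e21, 1e23, 1e25]:      # compute budgets in FLOPs
    N_opt, D_opt, L_opt = optimal_split(C)
    print(f"C={C:.0e}: N*≈{N_opt:.2e}, D*≈{D_opt:.2e}, predicted loss≈{L_opt:.3f}")
```

With a surface actually fitted to pilot runs, the same sweep yields the parameter/token split for any budget, which is how joint scaling laws inform allocation decisions in practice.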
Table: Select empirical scaling exponents (compiled from data)
Task/Domain | Scaling Exponent |
---|---|
Language modeling (model size) | $\alpha_N \approx 0.076$ (Kaplan et al., 2020) |
Language modeling (data size) | $\alpha_D \approx 0.095$ (Kaplan et al., 2020) |
Image classification (model size) | $\approx 4/d$ for intrinsic data dimension $d$ (Hestness et al., 2017, Sharma et al., 2020) |
OCR error rate (model/data) | see (Rang et al., 2023) |
Recommendation models (data) | see (Ardalani et al., 2022) |
4. Scaling Laws in Physics, Ecology, and Other Domains
Empirical scaling law principles extend well beyond machine learning.
- High-energy scattering: Dip position and cross-sections obey geometrical scaling controlled by energy and characteristic interaction radii; the dip position in elastic scattering is set by these geometric quantities, and at accessible energies the black disk limit is not yet reached (Pancheri et al., 2014).
- Nuclear fragmentation: Scaled fragment cross sections in projectile fragmentation collapse onto universal curves when corrected for system and fragment size/asymmetry via a multi-factor empirical scaling formula (see details in (Song et al., 2017)).
- Atomic physics: Dielectronic recombination strengths scale nontrivially with atomic number $Z$, with higher-order relativistic corrections leading to empirically motivated fit forms in powers of $Z$ (Harman et al., 2018).
- Ecology: Patterns such as the species–area relationship, abundance spectra, and allometric metabolic scaling are described by power laws. Empirical scaling exponents exhibit covariation and are subject to constraints arising from finite resource supply, as formalized in theoretical frameworks linking exponents algebraically (Zaoli et al., 2017, Ribeiro et al., 2021); a log–log fitting sketch follows this list.
- Percolation theory: Scaling laws can be rooted in the data distribution, with percolation-derived models generating critical regimes that align either with manifold approximation or discrete quantization views of neural scaling (Brill, 10 Dec 2024).
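As a small illustration of how pure power-law exponents such as the species–area $z$ are typically estimated, the sketch below performs an ordinary least-squares fit in log–log space; the area and species counts are invented for illustration.

```python
# Minimal sketch: estimating a power-law exponent, e.g. a species-area
# relationship S = c * A**z, by least squares in log-log space.
import numpy as np

area = np.array([1.0, 10.0, 100.0, 1e3, 1e4, 1e5])      # habitat area (arbitrary units)
species = np.array([12, 23, 48, 95, 180, 370], float)   # invented species counts

# log S = log c + z * log A, so a straight-line fit recovers the exponent z.
z, log_c = np.polyfit(np.log10(area), np.log10(species), 1)
print(f"estimated exponent z ≈ {z:.2f}, prefactor c ≈ {10 ** log_c:.1f}")
```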
5. Empirical Scaling Law as Predictive, Diagnostic, and Design Tool
Empirical scaling laws are practically indispensable for:
- Extrapolation: Predicting large-scale model performance from small-scale experiments with high reliability and robustness; improved scaling surfaces (e.g., Farseer (Li et al., 12 Jun 2025)) enable confident ablation studies and protocol selection for large LLMs.
- Optimal resource allocation: Determining compute-optimal model and data size balancing, particularly where cost or availability strongly constrain one axis (Kaplan et al., 2020, Li et al., 12 Jun 2025). In pre-training and embodied AI, scaling exponents shift with tokenizer compression rates and architecture choices, altering optimal D/N or compute ratios (Pearce et al., 7 Nov 2024).
- Debugging and model selection: Scaling trends let practitioners detect unusual training dynamics, e.g., underfitting due to optimizer faults when empirical curves deviate from the predicted law (Ivgi et al., 2022), as sketched after this list, and compare the long-term scaling merits of alternative architectures or objectives.
- Cross-domain design insight: The broad applicability of scaling law schemas—from scattering to genome annotation—highlights generalizable resource-performance relationships and supports the hypothesis that empirical scaling law is a unifying principle in complex systems.
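The sketch below illustrates the extrapolation-and-diagnosis pattern from the list above: fit a scaling law on small pilot runs, predict a larger run, and flag a suspicious deviation. All model sizes, losses, and the tolerance are illustrative assumptions.

```python
# Minimal sketch: extrapolate a fitted scaling law and flag deviations from it.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, a, alpha, l_inf):
    return a * x ** (-alpha) + l_inf

# Small-scale pilot runs (hypothetical model sizes and validation losses).
n_small = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
loss_small = np.array([4.81, 4.34, 3.89, 3.53, 3.19])

popt, _ = curve_fit(scaling_law, n_small, loss_small, p0=[10.0, 0.1, 1.0], maxfev=10000)

# Predict the loss of a much larger run and compare against what was observed.
n_big, observed = 1e10, 2.60
predicted = scaling_law(n_big, *popt)
rel_dev = abs(observed - predicted) / predicted
print(f"predicted={predicted:.2f}, observed={observed:.2f}, deviation={rel_dev:.1%}")
if rel_dev > 0.05:   # illustrative tolerance; a large gap hints at a training issue
    print("Observed loss deviates from the fitted scaling trend; investigate.")
```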
6. Theoretical Underpinnings and Regime Taxonomy
Multiple theoretical frameworks underpin empirical scaling law observations and suggest mechanisms for their universality and limitations:
- Power-law spectra and random feature models: Scaling exponents can be derived from the spectral decay of data covariance or kernel matrices, with neural networks and nonlinear random features amplifying and extending the effective power-law regime (Maloney et al., 2022).
- Manifold dimension: The scaling exponent $\alpha \approx 4/d$ is predicted for smooth regression on a data manifold of intrinsic dimension $d$, bounding the exponent observable in real tasks (Sharma et al., 2020); a worked instance follows this list.
- Dual regimes in neural models: Variance-limited and resolution-limited regimes yield different scaling behavior (exponents of 1 in the variance-limited regime versus dimension-dependent exponents in the resolution-limited regime), and their transitions are governed by the interplay of model/data scale with data smoothness and structure (Bahri et al., 2021, Brill, 10 Dec 2024).
- Percolation-theoretic models: Data generated near criticality leads to power-law–distributed clusters whose learning recapitulates empirical scaling law exponents; both quantization and manifold-view scaling emerge as limits (Brill, 10 Dec 2024).
- Saturation and breakdown: As scaling variables reach the limits set by the data’s intrinsic structure (e.g., finite spectrum, latent manifold size), test loss plateaus, and scaling laws break down (Maloney et al., 2022).
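As a worked instance of the manifold-dimension relation, assuming the approximate $\alpha \approx 4/d$ rule from Sharma et al. (2020) and round illustrative dimensions:

```latex
\alpha \approx \frac{4}{d}:
\qquad d = 16 \;\Rightarrow\; \alpha \approx 0.25,
\qquad d = 40 \;\Rightarrow\; \alpha \approx 0.10,
\qquad d = 100 \;\Rightarrow\; \alpha \approx 0.04.
```

Small empirical exponents such as those tabulated in Section 3 are therefore consistent, under this relation, with data of relatively high intrinsic dimension.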
7. Limitations, Controversies, and Future Directions
Empirical scaling laws, while robust, exhibit context-dependent exponents and domain-specific inflection points:
- Saturation floors or diminishing returns set practical scaling limits, as seen in recommendation systems (parameter scaling is “out of steam” for DLRM architectures (Ardalani et al., 2022)).
- Deviations between empirical and theoretically predicted exponents remain for complex tasks (Hestness et al., 2017, Kaplan et al., 2020).
- Refinements resolve some of these issues: learning rate annealing is now incorporated into dynamic loss-surface scaling laws allowing fine-grained prediction throughout training (Tissue et al., 20 Aug 2024).
- Transfer scaling laws expose “transfer gaps” affecting pre-training value; optimizing joint resource allocation for pre-training versus fine-tuning is now quantifiable (Barnett, 30 Aug 2024).
Active research aims to:
- Theoretically unify and explain empirical exponents in deep, real-world systems
- Extend scaling law surfaces (e.g., $L(N, D)$) to more modalities, architectures, and downstream task types (Li et al., 12 Jun 2025)
- Incorporate dynamic training effects—such as data ordering, learning rate scheduling—into unified scaling frameworks (Tissue et al., 20 Aug 2024)
- Connect scaling law plateaus to underlying data geometry and spectral distributions
Empirical scaling law thus constitutes a cornerstone of predictive science, linking quantitative resource scaling to attainable performance, guiding design, and constraining theoretical interpretation across disparate disciplines.