
Empirical Neural Scaling Laws

Updated 30 March 2026
  • Empirical neural scaling laws are mathematical models that describe how loss and error metrics decrease according to power-law relationships as dataset size, model capacity, and compute increase.
  • They classify scaling into regime types—dataset-limited, model-limited, and compute-limited—with distinct exponents that guide practical resource allocation and model design.
  • Architectural inductive biases, such as symmetry equivariance, steepen the scaling curves, offering higher returns per resource and influencing optimal scaling strategies.

Empirical neural scaling laws describe how the performance of deep neural networks varies systematically as critical resource variables—primarily dataset size, model capacity, and compute—are increased, usually following power-law or piecewise power-law relationships over wide dynamic ranges. These laws are foundational to contemporary scientific and industrial practice in machine learning, guiding optimal allocation of compute, informing empirical model selection, and providing quantitative benchmarks for extrapolating to novel scales. Recent empirical work has also illuminated the dependence of scaling exponents on architecture class and inductive bias.

1. Canonical Power-Law Forms and Empirical Regimes

Empirical scaling laws most commonly express the predictive loss $L$ (or, equivalently, an error metric) as a function of a single scaling variable $N$ (which may be the number of model parameters, dataset size, or compute) via a simple power law,

$$L = \alpha \cdot N^{-\beta}$$

where $\alpha$ and $\beta$ are fitted constants and $N$ denotes the currently limiting resource, while all other axes are held at surplus. This scaling persists over many orders of magnitude, subject to eventual saturation or phase transitions at extremely large scales (Trikha et al., 26 Sep 2025).
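Because the power law is linear in log-log space, the constants $\alpha$ and $\beta$ are typically recovered by ordinary least squares on logged measurements. A minimal sketch on synthetic data (the values and noise level are illustrative, not from any cited paper):

```python
import numpy as np

# Synthetic loss measurements over four orders of magnitude of a resource N
# (parameters, data points, or FLOPs). Values are illustrative.
N = np.logspace(3, 7, 9)
rng = np.random.default_rng(0)
L = 5.0 * N**-0.3 * np.exp(rng.normal(0, 0.02, N.size))  # noisy power law

# L = alpha * N**-beta is linear in log-log space:
# log L = log alpha - beta * log N, so degree-1 least squares recovers both.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
beta, alpha = -slope, np.exp(intercept)

print(f"fitted beta  = {beta:.3f}")   # close to the true 0.3
print(f"fitted alpha = {alpha:.3f}")  # close to the true 5.0
```

In practice, fits of this kind are only trusted over the dynamic range actually measured; Section 3 below discusses what happens when the curve breaks.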

Data, Model, and Compute Axes

Scaling can be empirically characterized along three principal axes:

  • Dataset-scaling (fixed model, fixed compute): $L \sim D^{-\beta_D}$
  • Model-scaling (fixed data, fixed compute): $L \sim M^{-\beta_M}$
  • Compute-scaling (Pareto frontier): $L \sim C^{-\beta_C}$

Each regime exhibits a distinct scaling exponent. Experiments in neural material modeling, for example, revealed $\beta_D \approx 0.242$, $\beta_M \approx 0.383$, $\beta_C \approx 0.339$ for an equivariant GNN (EquiformerV2), substantially outpacing transformer baselines, which exhibited much lower exponents ($\beta_D \approx 0.052$, $\beta_M \approx 0.120$) (Trikha et al., 26 Sep 2025).
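The practical meaning of an exponent is easiest to see as the resource multiplier needed to halve the loss: from $L \sim N^{-\beta}$, halving $L$ requires growing $N$ by $2^{1/\beta}$. A back-of-the-envelope illustration using the exponents quoted above (the derived multipliers are this article's arithmetic, not figures from the cited paper):

```python
# Resource multiplier needed to halve the loss under L ~ N**-beta:
# (N'/N)**-beta = 1/2  =>  N'/N = 2**(1/beta)
exponents = {
    "EquiformerV2, data  (beta_D ~ 0.242)": 0.242,
    "EquiformerV2, model (beta_M ~ 0.383)": 0.383,
    "Transformer,  data  (beta_D ~ 0.052)": 0.052,
    "Transformer,  model (beta_M ~ 0.120)": 0.120,
}
for name, beta in exponents.items():
    print(f"{name}: x{2 ** (1 / beta):,.1f} more resource to halve loss")
```

The contrast is stark: with $\beta_M \approx 0.383$ a roughly 6x model scale-up halves the loss, whereas $\beta_D \approx 0.052$ implies a hopeless factor of several hundred thousand in data.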

Scaling Regimes

  • Dataset-limited: Small datasets yield overfitting, with loss decreasing monotonically with increasing data until hitting the model/compute bottleneck.
  • Model-limited: At fixed data, increasing model size initially reduces loss, but beyond a certain model scale, gains saturate unless data and compute are increased concomitantly.
  • Compute-limited: For fixed $D, M$, increasing compute reduces loss along a Pareto frontier, typically following a clear power-law for architectures with appropriate inductive bias.

Transitions ("knees" or "plateaus") often appear at scale points where the regime bottleneck migrates from data- to model- to compute-limited.

2. Architecture Dependence and Inductive Bias

Empirical exponents are not universal; they are highly architecture-dependent. Inductive biases such as symmetry equivariance steepen scaling exponents, providing greater "loss reduction per resource" (Ngo et al., 10 Oct 2025, Trikha et al., 26 Sep 2025).

Empirical exponents by architecture (Ngo et al., 10 Oct 2025):

| Architecture | $\alpha$ (Params) | $\beta$ (Data) | $\gamma$ (Compute) |
|-------------------------------|------|------|------|
| MPNN (no symmetry) | 0.28 | 0.31 | 0.14 |
| EGNN (E(n)-equivariant) | 0.39 | 0.39 | 0.17 |
| GemNet-OC (body order 4) | 0.52 | 0.50 | 0.26 |
| eSEN (high-order equivariant) | 0.82 | 0.75 | 0.40 |

Higher-order equivariant representations (e.g., eSEN with higher $\ell_{\max}$) yield strictly better exponents. This results in increasing returns to scale, and the gap between simple and strongly inductive architectures (e.g., transformers vs. equivariant GNNs in materials) widens with scale (Trikha et al., 26 Sep 2025).

3. Fitting, Regime Changes, and Extrapolation

While single power-law fits are dominant, real neural learning curves may exhibit broken power-law behavior, inflections, or abrupt transitions, especially near emergent or phase-like phenomena ("double descent", rapid accuracy jumps). The "Broken Neural Scaling Law" (BNSL) functional form introduces smoothly joined power-law segments across $n$ breaks:

$$y(x) = a + b x^{-c_0} \prod_{i=1}^{n} \left[1 + (x/d_i)^{1/f_i}\right]^{-c_i f_i}$$

accommodating regimes of different exponents and allowing for accurate extrapolation even through non-monotonicities and inflection points, as illustrated in vision, language, and generative models (Caballero et al., 2022). BNSL is crucial when forecasting across orders of magnitude, especially when large-scale phase transitions are anticipated.
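The BNSL form is straightforward to evaluate directly. A minimal sketch with one break (all parameter values are illustrative, chosen only to make the two asymptotic slopes visible):

```python
import numpy as np

def bnsl(x, a, b, c0, breaks):
    """Broken Neural Scaling Law (Caballero et al., 2022):
    y = a + b * x**-c0 * prod_i [1 + (x/d_i)**(1/f_i)]**(-c_i * f_i),
    where `breaks` is a list of (c_i, d_i, f_i) triples."""
    y = b * x ** -c0
    for c, d, f in breaks:
        y *= (1 + (x / d) ** (1 / f)) ** (-c * f)
    return a + y

# One break at d = 1e4: slope -c0 before the break, -(c0 + c1) after it.
x = np.logspace(2, 7, 6)
y = bnsl(x, a=0.0, b=1.0, c0=0.1, breaks=[(0.3, 1e4, 0.2)])

# Local log-log slopes well below vs. well above the break:
slope_lo = np.log(y[1] / y[0]) / np.log(x[1] / x[0])
slope_hi = np.log(y[-1] / y[-2]) / np.log(x[-1] / x[-2])
print(slope_lo, slope_hi)  # roughly -0.1 and -0.4
```

The sharpness parameter $f_i$ controls how abruptly the curve transitions between the two exponents; small $f_i$ approaches a hard kink.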

4. Practical Guidelines and Compute-Optimal Scaling

Empirical scaling laws inform "compute-optimal" recipes. For most overparameterized regimes in contemporary deep learning,

  • Compute-optimal allocation: Data size $D$ and model size $M$ should scale together (equivalently, $M \propto D$), as justified both empirically and by rigorous information-theoretic analysis (Jeon et al., 2024, Ngo et al., 10 Oct 2025). Practically, this means that to minimize loss at fixed compute $C$, one should allocate $C$ so that $M$ and $D$ grow jointly subject to $C \approx \kappa M D$ (Trikha et al., 26 Sep 2025, Ngo et al., 10 Oct 2025).

Empirical exponents directly determine resource allocation strategies. For example, if $\beta_M > \beta_D$, prioritizing model scaling yields faster returns than increasing dataset size, and vice versa (Trikha et al., 26 Sep 2025). For architectures with strong physical priors (e.g., equivariant GNNs), scaling model size typically offers the highest ROI.
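The allocation trade-off can be sketched numerically. The two-term loss below is an assumed Chinchilla-style additive form (not taken from the cited papers), and the constants `a`, `b`, `kappa`, and the budget `C` are hypothetical:

```python
import numpy as np

# Assumed additive loss: L(M, D) = a * M**-beta_M + b * D**-beta_D,
# minimized under the compute constraint C = kappa * M * D.
a, b = 10.0, 10.0
beta_M, beta_D = 0.383, 0.242   # EquiformerV2-like exponents from above
kappa, C = 6.0, 1e18            # hypothetical cost constant and budget

# Grid-search the split of compute between model size M and data size D.
M = np.logspace(6, 11, 2001)
D = C / (kappa * M)
L = a * M**-beta_M + b * D**-beta_D
best = np.argmin(L)
print(f"compute-optimal M ~= {M[best]:.2e}, D ~= {D[best]:.2e}, "
      f"loss ~= {L[best]:.4f}")
```

At the optimum the marginal loss reductions from the two terms balance ($\beta_M a M^{-\beta_M} = \beta_D b D^{-\beta_D}$), which is the first-order condition behind the joint-scaling recipe above.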

Task and label diversity as well as in-domain versus out-of-distribution adaptation significantly affect scaling, with cross-domain few-shot tasks often converging substantially faster in data than within-domain tasks (Prato et al., 2021).

5. Beyond Parametric Laws: Task Diversity and Predictive Methodologies

While aggregate loss metrics display power-law scaling, the landscape of downstream tasks is highly heterogeneous, with some tasks plateauing, improving non-monotonically, or even degrading at scale. Predicting downstream task performance from aggregate metrics (such as overall validation loss) is limited due to two phenomena:

  • Averaging token-level losses obscures subtle distributional effects
  • No simple parametric family faithfully describes all observed scaling behaviors

Neural-based meta-predictors, such as the NeuNeu model, frame scaling prediction as time-series extrapolation, leveraging both token-level loss distributions and temporal accuracy trends. This approach achieves significantly lower mean absolute error and better calibration on held-out tasks than power-law or logistic parametric fits, and generalizes zero-shot to new tasks and architectures (Hu et al., 27 Jan 2026).

6. Empirical Results: Quantitative Laws and Regime Analysis

The magnitude and nature of exponents vary by domain and resource, sometimes by an order of magnitude:

| Domain | Exponent Type | Range | Notable Reference |
|--------|---------------|-------|-------------------|
| Language modeling | $\beta_D, \beta_M$ | 0.07–0.1 (Kaplan et al., Chinchilla) | (Kaplan et al., 2020) |
| Materials | $\beta_D, \beta_M, \beta_C$ | 0.24–0.38 (EquiformerV2, GNN) | (Trikha et al., 26 Sep 2025) |
| Force fields | $\alpha, \beta, \gamma$ | 0.28 (MPNN) up to 0.82 (eSEN) | (Ngo et al., 10 Oct 2025) |
| Vision | $\gamma_{tot}$ (CIFAR-10) | 0.25–0.54 (CNNs, ResNets) | (D'Amico et al., 19 May 2025) |
| Few-shot | $|\alpha|$ (cross-domain) | 0.7–1.1 | (Prato et al., 2021) |

Empirical laws remain robust under several protocol and task variations, provided data and models are sufficiently large and regimes are properly identified.

7. Limitations, Open Questions, and Future Directions

Empirical scaling law fits are valid only within their measured span; extrapolation beyond observed "breaks" suffers intrinsic unpredictability, particularly at sharp transitions or emergent thresholds (Caballero et al., 2022). For architectures like transformers in materials domains, empirical power laws break for $C < 10^{15}$ FLOPs and may resume with new exponents above this threshold (Trikha et al., 26 Sep 2025).

Significant open directions include:

  • Assessing scaling exponents for yet-untested model classes (e.g., GemNet, SchNet, fully connected networks in materials science).
  • Understanding the impact of data-augmentation and physics-informed regularization on scaling in weakly inductive domains.
  • Extending empirical laws and theory to other data-rich scientific fields or generative regimes.
  • Deriving exponents from first-principles statistical structure of real-world data, as attempted for LLMs by mapping token-correlation and conditional-entropy decays to exponents (Cagnetta et al., 7 Feb 2026).
  • Developing predictive, uncertainty-aware meta-models for scaling law extrapolation (e.g., NSL-PFN, NeuNeu) for principled decision-making (Hu et al., 27 Jan 2026).

In conclusion, empirical neural scaling laws provide quantifiable, architecture- and domain-dependent frameworks for forecasting model performance, planning compute investments, and understanding the role of inductive bias at large scale (Trikha et al., 26 Sep 2025, Ngo et al., 10 Oct 2025, Caballero et al., 2022, Hu et al., 27 Jan 2026). Their ongoing refinement has shifted practice toward both data- and theory-driven resource allocation in modern machine learning.
