
Neural Scaling Laws

Updated 30 November 2025
  • Neural scaling laws are quantitative relationships where deep network performance metrics improve predictably with increased parameters, training data, or compute resources.
  • They enable researchers to forecast model accuracy and generalization by analyzing empirical power-law trends across various architectures and learning tasks.
  • They guide optimal resource allocation in large-scale AI systems, helping to identify regimes of diminishing returns and design cost-effective, high-performing models.

Neural scaling laws describe the empirical observation that the performance metrics—most commonly generalization error, loss, or accuracy—of deep neural networks improve predictably as a power-law function of key system resources, such as the number of model parameters, training dataset size, or the available compute budget. Across modalities and architectures, these power laws hold over multiple orders of magnitude, providing a foundational principle for guiding large-scale model design, resource allocation, and forecasting achievable performance. While scaling laws were initially discovered in large language and vision models, rigorous theoretical and empirical work demonstrates their broad applicability and reveals that the scaling exponents and limits are governed by intrinsic properties of the data distribution and learning task.

1. Formal Definition and Empirical Universality

A neural scaling law expresses a performance metric (e.g., test loss $\mathcal{L}$) as a function of a scaling variable such as model size $N$ or dataset size $\mathcal{D}$:

$$\mathcal{L}(N) \sim N^{-\alpha}, \qquad \mathcal{L}(\mathcal{D}) \sim \mathcal{D}^{-\beta}$$

where the exponents $\alpha, \beta > 0$ are empirically determined and encode how effectively the system benefits from additional resources. Multiple studies, including exhaustive empirical analyses of LLMs, vision transformers, diffusion models, and operator networks, consistently report clean power-law fits across regimes of practical interest. This universality suggests a deeper connection to the structure of natural data and underlying learning dynamics (Brill, 10 Dec 2024).

Scaling behaviors extend to composite dependencies on data, model, and compute, of the form

$$\mathcal{L}(N, \mathcal{D}) \sim A N^{-\alpha} + B \mathcal{D}^{-\beta} + C$$

with $C$ an irreducible noise floor. More sophisticated forms, such as smoothly broken power laws, capture observed inflections and multi-phase regimes in large-scale practical settings (Caballero et al., 2022).
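To make the fitting procedure concrete, the sketch below fits the composite form above to synthetic loss measurements with `scipy.optimize.curve_fit`. The constants, the grid of runs, and the noise level are illustrative assumptions, not values taken from any cited paper.

```python
# A minimal sketch of fitting L(N, D) = A*N^(-alpha) + B*D^(-beta) + C
# to measured losses; all data here are synthetic and illustrative.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(ND, A, alpha, B, beta, C):
    """Composite power law in model size N and dataset size D."""
    N, D = ND
    return A * N**(-alpha) + B * D**(-beta) + C

# Synthetic training runs: log-spaced model and dataset sizes.
rng = np.random.default_rng(0)
N = np.logspace(6, 9, 20)    # parameter counts
D = np.logspace(8, 11, 20)   # training tokens
losses = scaling_law((N, D), A=400.0, alpha=0.34, B=410.0, beta=0.28, C=1.7)
losses *= 1 + 0.01 * rng.standard_normal(len(losses))  # measurement noise

# Log-spaced runs and sane initial guesses matter in practice, since the
# objective is poorly conditioned in (A, alpha, B, beta, C).
popt, _ = curve_fit(scaling_law, (N, D), losses,
                    p0=[100.0, 0.3, 100.0, 0.3, 1.0], maxfev=20000)
A, alpha, B, beta, C = popt
print(f"alpha={alpha:.3f}, beta={beta:.3f}, noise floor C={C:.3f}")
```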

2. Theoretical Foundations: Data-Distribution Rooted Models

Recent theoretical developments frame neural scaling laws as emerging from the interaction between model expressivity and the statistical structure of the learning task. Brill (10 Dec 2024) introduces a percolation-theoretic model of natural data, in which effective learning decomposes into "quanta" associated with discrete subtasks or semantic clusters. The model posits:

  • After factoring out invariances, data reside on an effective high-dimensional lattice.
  • Local learning subtasks arise as clusters of lattice sites sharing the same target function, with site connectivity governed by a bond percolation process.
  • At the percolation threshold $p_c$, the cluster size distribution follows a power law, $n_s \propto s^{-\tau}$.

Criticality in this generative model gives rise to two universal scaling regimes for error as a function of model degrees of freedom $N$ or dataset size $\mathcal{D}$ (the exponent arithmetic is sketched in code after this list):

  • Quanta/subtask regime (critical percolation, $p \approx p_c$):
    • Model-limited: $\mathcal{L} \propto N^{-\alpha}$, with $\alpha = (3-\tau)/(\tau-2)$. For the Bethe lattice ($\tau = 5/2$), $\alpha = 1$.
    • Data-limited: $\mathcal{L} \propto \mathcal{D}^{-\alpha/(1+\alpha)}$.
  • Manifold regime (supercritical, $p > p_c$):
    • Error scaling governed by manifold approximation: $\mathcal{L} \propto N^{-c/D}$ or $\mathcal{L} \propto \mathcal{D}^{-c/D}$, where $D$ is the intrinsic cluster dimension and $c$ is an architecture-specific constant.
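As a concrete illustration of the exponent bookkeeping above, the short sketch below tabulates the model-limited and data-limited exponents predicted for a few values of the cluster-size exponent $\tau$; the $\tau$ values other than the Bethe-lattice case, and the function names, are illustrative choices.

```python
# A minimal sketch of the quanta/subtask-regime exponents in the
# percolation model (Brill, 10 Dec 2024).
def model_limited_alpha(tau: float) -> float:
    """alpha = (3 - tau) / (tau - 2), the model-limited exponent."""
    return (3.0 - tau) / (tau - 2.0)

def data_limited_exponent(tau: float) -> float:
    """Data-limited exponent alpha / (1 + alpha)."""
    a = model_limited_alpha(tau)
    return a / (1.0 + a)

# Bethe-lattice criticality: tau = 5/2 gives alpha = 1 and a
# data-limited exponent of 1/2.
for tau in (2.2, 2.5, 2.8):
    print(f"tau={tau:.1f}: alpha={model_limited_alpha(tau):.3f}, "
          f"data exponent={data_limited_exponent(tau):.3f}")
```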

This framework unifies and grounds previous phenomenological scaling models, relating "discrete quanta" scaling (Michaud et al.) and data-manifold approximation (Bahri and Sharma) to underlying percolation criticality (Brill, 10 Dec 2024).

3. Scaling Exponents, Architectural Factors, and Data Geometry

The exponents $\alpha$ and $\beta$ reflect the geometric and statistical structure of the data as well as properties of the architecture:

  • In the limit where the data manifold has intrinsic dimension $d$ and ReLU networks provide piecewise-linear fits, the scaling exponent is $\alpha \approx 4/d$ for both cross-entropy and MSE losses (Sharma et al., 2020); a small sketch of this relation follows this list.
  • For composite or multi-subtask tasks, modular resource allocation and subtask loss additivity yield $\mathcal{L} \sim 1/N$ scaling in the number of neurons $N$ and, via typical deep network parameterizations, $\mathcal{L} \sim N_p^{-1/3}$ for parameter count $N_p$ (Song et al., 7 Feb 2024, Liu et al., 2023).
  • The "criticality" class of scaling laws subsumes a spectrum of behavior from strong manifold effects ($\alpha < 1$) to subtask-dominated regimes ($\alpha \approx 1$), seamlessly interpolated by the percolation-theoretic framework (Brill, 10 Dec 2024).

Task and architecture variation influence the exponents: vision and deep regression tasks can show data-scaling exponents $\alpha$ in the $1$–$2$ range (Cadez et al., 12 Sep 2025), while LLMs exhibit model exponents $\alpha \sim 0.07$–$0.3$ and data exponents $\beta \sim 0.1$–$0.25$ depending on the pretraining regime (Sengupta et al., 17 Feb 2025).

4. Emergent Regimes, Limitations, and Extension to Practice

Scaling laws can break, saturate, or transition across regimes:

  • Empirical curves display "smoothly broken" behavior with inflections at critical data or compute thresholds, motivating the broken neural scaling law (BNSL) formalism to model multi-phase, nonmonotonic trends (Caballero et al., 2022); the BNSL functional form is sketched after this list.
  • Fine-tuning, multimodal, reinforcement learning, and data-limited settings can induce regime changes (e.g., plateaux giving way to power-law decay at larger scale) (Sengupta et al., 17 Feb 2025).
  • Rigorous theory predicts when to expect sharp transitions, irreducible error floors, or diminishing returns due to finite data or model bottlenecks (Caballero et al., 2022, Brill, 10 Dec 2024).
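As referenced above, the sketch below implements the BNSL functional form reported in Caballero et al. (2022), in which each "break" multiplies the base power law by a smooth transition factor; the parameter values in the usage example are illustrative assumptions, not fitted quantities.

```python
# A sketch of the smoothly broken power law (BNSL) of Caballero et al. (2022):
# y = a + b * x^(-c0) * prod_i (1 + (x/d_i)^(1/f_i))^(-c_i * f_i)
import numpy as np

def bnsl(x, a, b, c0, breaks):
    """`breaks` is a list of (c_i, d_i, f_i) triples, one per break:
    c_i shifts the slope after the break, d_i locates it on the x-axis,
    and f_i controls how sharp the transition is."""
    y = b * x**(-c0)
    for c_i, d_i, f_i in breaks:
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y

x = np.logspace(6, 10, 5)
# One break at d_1 = 1e8 where the effective slope steepens by c_1 = 0.3.
print(bnsl(x, a=1.7, b=1e3, c0=0.2, breaks=[(0.3, 1e8, 0.5)]))
```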

A summary of typical scaling exponents by domain:

| Domain | Model Exponent $\alpha$ | Data Exponent $\beta$ |
|---|---|---|
| Language | 0.05–0.30 | 0.05–0.25 |
| Vision | 0.20–0.30 | ~0.10 |
| Regression | 1.0–2.3 | 0.8–2.3 |

Empirically, progress along a scaling-law curve reliably predicts achievable performance so long as the system remains in a regime where the data distribution, task, and architecture are consistent with prior power-law fits (Sengupta et al., 17 Feb 2025).

5. Metrics Beyond Cross-Entropy and Rank-Based Scaling

Most scaling studies focus on cross-entropy loss, but this omits aspects vital for deployment—such as the rank ordering of correct predictions. The Relative-Based Scaling Law, defined via the Relative-Based Probability (RBP) metric

$$\mathrm{RBP}_k = \Pr(\text{true token is among the top-}k\ \text{predictions}),$$

obeys its own power-law scaling:

$$-\log \mathrm{RBP}_k \propto S^{-\alpha_k},$$

with $S$ the non-embedding parameter count and $\alpha_k$ increasing with $k$ (Yue et al., 23 Oct 2025). RBP scaling closely tracks cross-entropy-based scaling, but governs phenomena related to emergence in sequence prediction (e.g., sharp thresholding behavior for long-range accuracy). This provides a quantitative framework for predicting the onset and shape of "emergent" capabilities as models grow.
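To make the metric concrete, here is a minimal sketch of estimating $\mathrm{RBP}_k$ empirically from a batch of logits: the fraction of positions whose true next token falls in the model's top-$k$ predictions. The array names, shapes, and random inputs are illustrative assumptions.

```python
# A minimal sketch of an empirical RBP_k estimate from model logits.
import numpy as np

def rbp_at_k(logits: np.ndarray, targets: np.ndarray, k: int) -> float:
    """logits: (num_positions, vocab_size); targets: (num_positions,)."""
    # Indices of the k highest-scoring tokens at each position (unordered).
    topk = np.argpartition(logits, -k, axis=-1)[:, -k:]
    hits = (topk == targets[:, None]).any(axis=-1)
    return float(hits.mean())

rng = np.random.default_rng(0)
logits = rng.standard_normal((1000, 32000))  # fake 32k-token vocabulary
targets = rng.integers(0, 32000, size=1000)
print(rbp_at_k(logits, targets, k=10))       # ~ 10/32000 for random logits
```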

6. Practical Implications, Fitting, and Automated Law Discovery

Neural scaling laws now guide the design and resource allocation for large-scale AI systems. Key implications include:

  • Compute allocation: theory predicts compute-optimal tradeoffs between data size and model size, e.g., scaling both linearly for fixed compute budgets ($N \propto \mathcal{D}$), with the law's exponents dictating the cost/benefit (Jeon et al., 28 Jun 2024); a compute-optimal split under the composite form is sketched after this list.
  • Design: empirical or theory-driven exponent estimates allow practitioners to forecast the future gains from increasing data or model capacity and to avoid regimes with sharp diminishing returns (Sengupta et al., 17 Feb 2025).
  • Automated discovery: frameworks such as EvoSLD co-evolve symbolic law expressions with domain-specific optimizers, yielding parsimonious, generalizable, and highly accurate scaling laws across experimental settings (Lin et al., 27 Jul 2025).
  • Caution: Scaling laws are not universal—critical regime changes, architectural innovations (sparse, retrieval-augmented, multimodal models), or domain shifts may break existing power-law trends (Caballero et al., 2022, Sengupta et al., 17 Feb 2025). Rigorous uncertainty estimation and cross-validation are required.
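As referenced in the compute-allocation bullet, the sketch below finds a compute-optimal split under the composite law $\mathcal{L}(N, \mathcal{D}) = A N^{-\alpha} + B \mathcal{D}^{-\beta}$ and the common $C \approx 6 N \mathcal{D}$ FLOPs approximation. The constants are Chinchilla-like values used purely as illustrative assumptions, and the brute-force scan stands in for the closed-form optimum.

```python
# A sketch of compute-optimal allocation: minimize the composite loss over
# model size N with D pinned by the budget C ~ 6*N*D. Constants assumed.
import numpy as np

A, alpha, B, beta = 406.4, 0.34, 410.7, 0.28  # illustrative, Chinchilla-like

def optimal_split(C: float):
    """Scan model sizes N, set D = C / (6N), and pick the loss minimizer."""
    N = np.logspace(7, 12, 2000)
    D = C / (6.0 * N)
    loss = A * N**(-alpha) + B * D**(-beta)
    i = int(np.argmin(loss))
    return N[i], D[i], loss[i]

for C in (1e21, 1e23, 1e25):
    N, D, L = optimal_split(C)
    print(f"C={C:.0e}: N~{N:.2e} params, D~{D:.2e} tokens, loss~{L:.2f}")
```

With $\alpha \approx \beta$, the optimal $N$ and $\mathcal{D}$ each grow roughly as $\sqrt{C}$, which is the $N \propto \mathcal{D}$ linear-scaling regime noted above.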

7. Open Questions and Directions

Fundamental directions for neural scaling law research remain open, including the theoretical origins of the observed exponents, the limits of power-law extrapolation across regime changes, and scaling behavior under new architectures and metrics.
