Neural Scaling Laws
- Neural scaling laws are quantitative relationships where deep network performance metrics improve predictably with increased parameters, training data, or compute resources.
- They enable researchers to forecast model accuracy and generalization by analyzing empirical power-law trends across various architectures and learning tasks.
- They guide optimal resource allocation in large-scale AI systems, helping to identify regimes of diminishing returns and design cost-effective, high-performing models.
Neural scaling laws describe the empirical observation that the performance metrics—most commonly generalization error, loss, or accuracy—of deep neural networks improve predictably as a power-law function of key system resources, such as the number of model parameters, training dataset size, or the available compute budget. Across modalities and architectures, these power laws hold over multiple orders of magnitude, providing a foundational principle for guiding large-scale model design, resource allocation, and forecasting achievable performance. While scaling laws were initially discovered in large language and vision models, rigorous theoretical and empirical work demonstrates their broad applicability and reveals that the scaling exponents and limits are governed by intrinsic properties of the data distribution and learning task.
1. Formal Definition and Empirical Universality
A neural scaling law expresses a performance metric (e.g., test loss $L$) as a function of a scaling variable $x$, such as model size $N$ or dataset size $D$:

$$L(x) = a\,x^{-\alpha},$$

where the prefactor $a$ and exponent $\alpha > 0$ are empirically determined and encode how effectively the system benefits from additional resources. Multiple studies, including exhaustive empirical analyses of LLMs, vision transformers, diffusion models, and operator networks, consistently report clean power-law fits across regimes of practical interest. This universality suggests a deeper connection to the structure of natural data and underlying learning dynamics (Brill, 10 Dec 2024).
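In practice, the exponent $\alpha$ is usually estimated by a linear fit in log-log coordinates. A minimal sketch on synthetic data (NumPy only; the constants and noise level are illustrative, not drawn from any cited study):

```python
# Minimal sketch: recover a scaling exponent alpha from (x, loss) pairs by
# ordinary least squares in log-log space. Assumes the noise floor is
# negligible over the fitted range; symbols (a, alpha) follow the text.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic measurements obeying L(x) = a * x**(-alpha), with mild noise.
a_true, alpha_true = 3.0, 0.35
x = np.logspace(6, 9, 12)                       # e.g., parameter counts
loss = a_true * x**(-alpha_true) * rng.lognormal(0.0, 0.02, x.size)

# OLS fit of log L = log a - alpha * log x.
slope, intercept = np.polyfit(np.log(x), np.log(loss), 1)
print(f"alpha ≈ {-slope:.3f}, a ≈ {np.exp(intercept):.3f}")
```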
Scaling behaviors extend to composite dependencies on data, model, and compute, of the form

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},$$

with $E$ an irreducible noise floor. More sophisticated forms, such as smoothly broken power laws, capture observed inflections and multi-phase regimes in large-scale practical settings (Caballero et al., 2022).
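A hedged sketch of fitting this composite form to a sweep of training runs with SciPy's `curve_fit`; the run sizes and constants below are hypothetical stand-ins for real measurements, and the fit is done in log space for numerical stability:

```python
# Minimal sketch: fit L(N, D) = E + A*N**(-alpha) + B*D**(-beta) to a grid of
# (model size, data size, loss) triples. E, A, B are kept positive by fitting
# their logs; all numbers here are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def composite_law(ND, logE, logA, alpha, logB, beta):
    N, D = ND
    return np.log(np.exp(logE) + np.exp(logA) * N**(-alpha)
                  + np.exp(logB) * D**(-beta))

# Hypothetical measured losses from a sweep over model and dataset sizes.
N = np.array([1e7, 1e7, 1e8, 1e8, 1e9, 1e9, 1e10, 1e10])
D = np.array([1e9, 1e10, 1e9, 1e10, 1e10, 1e11, 1e10, 1e11])
L = 1.7 + 4e2 * N**-0.34 + 4e3 * D**-0.28    # noiseless stand-in data

p0 = [np.log(1.0), np.log(100.0), 0.3, np.log(1000.0), 0.3]
popt, _ = curve_fit(composite_law, (N, D), np.log(L), p0=p0, maxfev=20000)
logE, logA, alpha, logB, beta = popt
print(f"E≈{np.exp(logE):.2f}  alpha≈{alpha:.2f}  beta≈{beta:.2f}")
```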
2. Theoretical Foundations: Data-Distribution Rooted Models
Recent theoretical developments frame neural scaling laws as emerging from the interaction between model expressivity and the statistical structure of the learning task. Brill (2024) introduces a percolation-theoretic model of natural data, in which effective learning decomposes into "quanta" associated with discrete subtasks or semantic clusters (Brill, 10 Dec 2024). The model posits:
- After factoring invariances, data reside on an effective high-dimensional lattice.
- Local learning subtasks arise as clusters of lattice sites sharing the same target function, with site connectivity governed by a bond percolation process.
- At the percolation threshold $p = p_c$, the cluster-size distribution is a power law, $P(s) \sim s^{-\tau}$.
Criticality in this generative model gives rise to two universal scaling regimes for the error $E$ as a function of model degrees of freedom $N$ or dataset size $D$:
- Quanta/subtask regime (critical percolation, $p = p_c$):
- Model-limited: $E(N) \sim N^{-\alpha_N}$, with $\alpha_N$ fixed by the cluster-size exponent $\tau$; for the Bethe lattice ($\tau = 5/2$), the exponent takes its mean-field value.
- Data-limited: $E(D) \sim D^{-\alpha_D}$, with $\alpha_D$ determined by the same critical cluster statistics.
- Manifold regime (supercritical, $p > p_c$):
- Error scaling is governed by manifold approximation: $E \sim N^{-c/d}$ or $E \sim D^{-c/d}$, where $d$ is the intrinsic cluster dimension and $c$ is an architecture-specific constant.
This framework unifies and grounds previous phenomenological scaling models, relating "discrete quanta" scaling [Michaud et al.] and data-manifold approximation [Bahri & Sharma] to underlying percolation criticality (Brill, 10 Dec 2024).
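A toy illustration of the discrete-quanta mechanism underlying the subtask regime above: if subtask frequencies follow a Zipf law $p_k \propto k^{-\zeta}$ with $\zeta > 1$, a model that masters the $n$ most frequent subtasks leaves a residual error $\propto n^{1-\zeta}$. The exponent value below is illustrative, not taken from the cited papers:

```python
# Toy sketch of the quanta picture: subtask frequencies follow a Zipf law
# p_k ∝ k**(-zeta), a model with capacity for the n most frequent subtasks
# answers those correctly, and the residual error is the mass of the
# unlearned tail, which decays as n**(1 - zeta).
import numpy as np

zeta = 1.5                                  # illustrative Zipf exponent
K = 1_000_000                               # total number of subtasks
p = np.arange(1, K + 1, dtype=float) ** -zeta
p /= p.sum()                                # normalize to a probability mass

def residual_error(n_learned: int) -> float:
    """Error of a model that has mastered the n most frequent subtasks."""
    return p[n_learned:].sum()

ns = np.array([10, 100, 1_000, 10_000])
errs = np.array([residual_error(n) for n in ns])
slope = np.polyfit(np.log(ns), np.log(errs), 1)[0]
print(f"measured exponent ≈ {slope:.2f} (theory: {1 - zeta:.2f})")
```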
3. Scaling Exponents, Architectural Factors, and Data Geometry
The exponents $\alpha_N$ and $\alpha_D$ reflect the geometric and statistical structure of the data as well as properties of the architecture:
- In the limit where the data manifold has intrinsic dimension $d$ and ReLU networks provide piecewise-linear fits, the scaling exponent is $\alpha \approx 4/d$ for both cross-entropy and MSE losses (Sharma et al., 2020); a numerical sketch follows after this section's closing paragraph.
- For composite or multi-subtask tasks, modular resource allocation and subtask loss additivity yield a power law in the allocated neuron count and, via typical deep network parameterizations, a corresponding power law in the total parameter count $N$ (Song et al., 7 Feb 2024, Liu et al., 2023).
- The "Criticality" class of scaling laws subsumes a spectrum of behavior from strong manifold effects () to subtask-dominated regimes (), seamlessly interpolated by the percolation-theoretic framework (Brill, 10 Dec 2024).
Task and architecture variation influence exponents: vision and deep regression tasks can show data-scaling exponents in the 1–2 range (Cadez et al., 12 Sep 2025), while LLMs exhibit much smaller model exponents ($\alpha_N \approx 0.05$–$0.30$) and data exponents ($\alpha_D \approx 0.05$–$0.25$) depending on the pretraining regime (Sengupta et al., 17 Feb 2025).
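To make the manifold picture concrete, one can estimate the intrinsic dimension $d$ of a dataset and read off the predicted exponent $\alpha \approx 4/d$. A minimal sketch using the TwoNN estimator (Facco et al.) on a synthetic 2-d manifold; the dataset, ambient dimension, and noise level are assumptions for illustration:

```python
# Minimal sketch: estimate intrinsic dimension d with the TwoNN estimator,
# then report the implied data-manifold exponent alpha ≈ 4/d (Sharma et al.).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# A 2-d "Swiss roll" manifold embedded in 10-d ambient space.
t = rng.uniform(0, 3 * np.pi, 5000)
h = rng.uniform(0, 10, 5000)
X = np.zeros((5000, 10))
X[:, 0], X[:, 1], X[:, 2] = t * np.cos(t), h, t * np.sin(t)
X += 0.01 * rng.normal(size=X.shape)         # slight ambient noise

# TwoNN: the ratio of second- to first-nearest-neighbor distances is
# Pareto-distributed with shape d; the MLE for d is n / sum(log ratios).
dists, _ = cKDTree(X).query(X, k=3)          # column 0 is the point itself
mu = dists[:, 2] / dists[:, 1]
d_hat = len(mu) / np.sum(np.log(mu))
print(f"intrinsic dimension ≈ {d_hat:.2f}, predicted alpha ≈ {4 / d_hat:.2f}")
```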
4. Emergent Regimes, Limitations, and Extension to Practice
Scaling laws can break, saturate, or transition across regimes:
- Empirical curves display "smoothly broken" behavior with inflections at critical data or compute thresholds, motivating the Broken Neural Scaling Law (BNSL) formalism to model multi-phase, nonmonotonic trends (Caballero et al., 2022); a one-break sketch follows after this list.
- Fine-tuning, multimodal, reinforcement learning, and data-limited settings can induce regime changes (e.g., plateaux giving way to power-law decay at larger scale) (Sengupta et al., 17 Feb 2025).
- Rigorous theory predicts when to expect sharp transitions, irreducible error floors, or diminishing returns due to finite data or model bottlenecks (Caballero et al., 2022, Brill, 10 Dec 2024).
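As a concrete reference point, a smoothly broken power law with a single break can be written down and inspected directly. The parameterization below is patterned after the BNSL family (see Caballero et al., 2022 for the exact form), and all constants are illustrative:

```python
# Sketch of a smoothly broken power law with one break. Roughly: a is the
# limiting value, b and c0 set the initial power law, d1 is the break
# location, f1 its sharpness, and c1 the change in slope after the break.
import numpy as np

def broken_power_law(x, a, b, c0, c1, d1, f1):
    """Smoothly broken power law with a single break at x ≈ d1."""
    return a + b * x**(-c0) * (1 + (x / d1)**(1 / f1))**(-c1 * f1)

x = np.logspace(5, 11, 7)
y = broken_power_law(x, a=0.1, b=50.0, c0=0.2, c1=0.3, d1=1e8, f1=0.5)
for xi, yi in zip(x, y):
    print(f"x={xi:.0e}  L={yi:.4f}")
# Well before d1 the curve behaves like x**(-c0); well past it, like
# x**(-(c0 + c1)), so the log-log slope steepens smoothly across the break.
```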
A summary of typical scaling exponents by domain:
| Domain | Model exponent $\alpha_N$ | Data exponent $\alpha_D$ |
|---|---|---|
| Language | 0.05–0.30 | 0.05–0.25 |
| Vision | 0.20–0.30 | ~0.10 |
| Regression | 1.0–2.3 | 0.8–2.3 |
Empirically, progress along a scaling law curve reliably predicts achievable performance so long as the system remains in a regime where the data model, task, and architecture are consistent with prior power-law fits (Sengupta et al., 17 Feb 2025).
5. Metrics Beyond Cross-Entropy and Rank-Based Scaling
Most scaling studies focus on cross-entropy loss, but this omits aspects vital for deployment, such as the rank ordering of correct predictions. The Relative-Based Scaling Law is defined via the Relative-Based Probability (RBP) metric

$$\mathrm{RBP}_k = \Pr(\text{true token is among the top-}k\text{ predictions}),$$

which obeys its own power-law scaling in the non-embedding parameter count $N$, with an exponent that increases with $k$ (Yue et al., 23 Oct 2025). RBP scaling closely tracks cross-entropy-based scaling, but governs phenomena related to emergence in sequence prediction (e.g., sharp thresholding behavior for long-range accuracy). This provides a quantitative framework for predicting the onset and shape of "emergent" capabilities as models grow.
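Measuring $\mathrm{RBP}_k$ itself is straightforward given a model's output scores. A minimal sketch on hypothetical logits (random here, so the measured values sit at chance level):

```python
# Minimal sketch of measuring RBP_k on a batch: the fraction of positions
# where the true next token is among the model's top-k predictions. The
# `logits` and `targets` arrays are stand-ins for a real model's outputs.
import numpy as np

def rbp(logits: np.ndarray, targets: np.ndarray, k: int) -> float:
    """RBP_k = Pr(true token ranks in the model's top-k predictions)."""
    # Indices of the k highest-scoring tokens at each position (unsorted).
    topk = np.argpartition(logits, -k, axis=-1)[:, -k:]
    hits = (topk == targets[:, None]).any(axis=-1)
    return hits.mean()

rng = np.random.default_rng(0)
vocab, positions = 5_000, 1_000
logits = rng.normal(size=(positions, vocab))
targets = rng.integers(0, vocab, size=positions)
print(f"RBP_1  ≈ {rbp(logits, targets, 1):.5f}  (chance: {1 / vocab:.5f})")
print(f"RBP_10 ≈ {rbp(logits, targets, 10):.5f} (chance: {10 / vocab:.5f})")
```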
6. Practical Implications, Fitting, and Automated Law Discovery
Neural scaling laws now guide the design and resource allocation for large-scale AI systems. Key implications include:
- Compute allocation: theory predicts compute-optimal tradeoffs between data size and model size, e.g., growing both in fixed proportion for a given compute budget ($C \approx 6ND$), with the law's exponents dictating the cost/benefit (Jeon et al., 28 Jun 2024); see the numerical sketch after this list.
- Design: empirical or theory-driven exponent estimates allow practitioners to forecast the future gains from increasing data or model capacity and to avoid regimes with sharp diminishing returns (Sengupta et al., 17 Feb 2025).
- Automated discovery: frameworks such as EvoSLD co-evolve symbolic law expressions with domain-specific optimizers, yielding parsimonious, generalizable, and highly accurate scaling laws across experimental settings (Lin et al., 27 Jul 2025).
- Caution: Scaling laws are not universal—critical regime changes, architectural innovations (sparse, retrieval-augmented, multimodal models), or domain shifts may break existing power-law trends (Caballero et al., 2022, Sengupta et al., 17 Feb 2025). Rigorous uncertainty estimation and cross-validation are required.
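To illustrate the compute-allocation point above: given a fitted composite law, the loss can be minimized along a fixed-compute contour $C \approx 6ND$. A sketch with hypothetical fitted constants (none taken from any paper):

```python
# Sketch of compute-optimal allocation: minimize
# L(N, D) = E + A*N**(-alpha) + B*D**(-beta) subject to C ≈ 6*N*D,
# by a 1-d search over log N at each FLOP budget C.
import numpy as np
from scipy.optimize import minimize_scalar

E, A, alpha, B, beta = 1.7, 4e2, 0.34, 4e3, 0.28   # hypothetical fit

def loss_at_budget(logN: float, C: float) -> float:
    N = np.exp(logN)
    D = C / (6.0 * N)                 # tokens implied by the budget
    return E + A * N**(-alpha) + B * D**(-beta)

for C in [1e21, 1e23, 1e25]:          # FLOP budgets
    res = minimize_scalar(loss_at_budget, bounds=(np.log(1e6), np.log(1e13)),
                          args=(C,), method="bounded")
    N_opt = np.exp(res.x)
    print(f"C={C:.0e}  N*≈{N_opt:.2e}  D*≈{C / (6 * N_opt):.2e}  "
          f"L≈{res.fun:.3f}")
```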
7. Open Questions and Directions
Fundamental directions for neural scaling law research include:
- Extending theoretical frameworks beyond percolation and manifold models to account for data heterogeneity, compositionality, and information-theoretic constraints (Brill, 10 Dec 2024, Jeon et al., 28 Jun 2024).
- Formalizing the limits of predictability due to sharp phase transitions and lawful extrapolation beyond existing data (Caballero et al., 2022).
- Integrating fairness, robustness, and inference-time scaling into the core scaling law formalism (Sengupta et al., 17 Feb 2025).
- Linking finite-width corrections, training dynamics, and optimization artifacts with asymptotic scaling exponents (Bordelon et al., 2 Feb 2024).
- Measuring and modeling the percolation structure or intrinsic dimension in practical LLM corpora and scientific datasets to forecast scaling performance in new domains (Brill, 10 Dec 2024, Sharma et al., 2020).
References
- (Brill, 10 Dec 2024) Neural Scaling Laws Rooted in the Data Distribution
- (Yue et al., 23 Oct 2025) Relative-Based Scaling Law for Neural LLMs
- (Cadez et al., 12 Sep 2025) Neural Scaling Laws for Deep Regression
- (Lin et al., 27 Jul 2025) EvoSLD: Automated Neural Scaling Law Discovery With LLMs
- (Sengupta et al., 17 Feb 2025) How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines
- (Liu et al., 1 Oct 2024) Neural Scaling Laws of Deep ReLU and Deep Operator Network: A Theoretical Study
- (Jeon et al., 28 Jun 2024) Information-Theoretic Foundations for Neural Scaling Laws
- (Song et al., 7 Feb 2024) A Resource Model For Neural Scaling Law
- (Bordelon et al., 2 Feb 2024) A Dynamical Model of Neural Scaling Laws
- (Maloney et al., 2022) A Solvable Model of Neural Scaling Laws
- (Caballero et al., 2022) Broken Neural Scaling Laws
- (Sharma et al., 2020) A Neural Scaling Law from the Dimension of the Data Manifold
- (Liu et al., 2023) A Neural Scaling Law from Lottery Ticket Ensembling