
Scaling Laws Re-evaluated

Updated 6 March 2026
  • Scaling laws are mathematical relationships that quantify how performance metrics (typically loss or error) decrease as model size, data size, or compute increases, most often following power-law behavior.
  • They integrate empirical and theoretical insights from domains like language modeling, vision, and retrieval to enable accurate forecasting and resource optimization.
  • Recent advances address regime shifts, double descent, and sociotechnical limitations, prompting refinements such as broken neural scaling laws and automated law discovery.

Scaling laws in machine learning and the physical sciences establish quantitative relationships between measures of scale (such as model size, data size, or available compute) and empirical or theoretical performance. Over the past several years, empirical scaling laws—most commonly power-law relationships—have been extensively documented in modern deep learning, statistical physics, and complex systems. This encyclopedia entry provides a re-evaluation of scaling laws, integrating recent developments across application domains, highlighting both universal findings and emerging limits or caveats.

1. Mathematical Foundations of Scaling Laws

Scaling laws traditionally describe how a performance metric $L$ (error, loss, or success probability) varies as a function of a resource or design parameter $x$ (model size $N$, dataset size $D$, compute $C$, etc.). The canonical form is a power law

L(x) = A x^{-\alpha} + B

where $A, B > 0$ and $\alpha > 0$ is the scaling exponent. This model predicts that increasing $x$ yields diminishing improvements. In practice, dependencies on multiple factors are modeled additively or multiplicatively, such as

L(N, D, C) = A_N N^{-\alpha_N} + A_D D^{-\alpha_D} + A_C C^{-\alpha_C} + L_\infty

for language and vision models (Rosenfeld, 2021, Nezhurina et al., 5 Jun 2025).
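As a concrete illustration, the canonical form above can be fit to loss measurements with an off-the-shelf nonlinear least-squares routine. This is a minimal sketch on synthetic data; the parameter values are invented for illustration, not taken from any cited paper:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, A, alpha, B):
    """Canonical scaling law L(x) = A * x^(-alpha) + B."""
    return A * np.power(x, -alpha) + B

# Synthetic losses generated from a known law (A=5, alpha=0.3, B=1.7).
x = np.logspace(0, 4, 20)          # scale variable, e.g. relative model size
y = power_law(x, 5.0, 0.3, 1.7)

# Fit in the original space; a sensible initial guess p0 matters because
# the (A, alpha, B) problem is ill-conditioned.
popt, _ = curve_fit(power_law, x, y, p0=(1.0, 0.5, 1.0), maxfev=10000)
A_hat, alpha_hat, B_hat = popt
```

The recovered exponent `alpha_hat` and irreducible term `B_hat` are what extrapolation to larger scales depends on most.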

However, empirical studies reveal that this simple form is insufficient in several settings: phase transitions, inflection points, or nonmonotonic “double descent” phenomena necessitate more general functional forms. The Broken Neural Scaling Law (BNSL) is designed to address these cases:

L(x) = C x^{-\alpha_1} \left[ 1 + (x / x_b)^s \right]^{(\alpha_1 - \alpha_2)/s} + D

allowing for smoothly broken power laws and “regime” changes: the effective exponent transitions from $\alpha_1$ below the break scale $x_b$ to $\alpha_2$ above it, with $s$ controlling the sharpness of the transition (Caballero et al., 2022).
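The two-regime behavior of the BNSL form can be checked numerically by computing the local log-log slope on either side of the break. A sketch with illustrative parameter values (not fitted to any dataset):

```python
import numpy as np

def bnsl(x, C, alpha1, alpha2, x_b, s, D):
    """Smoothly broken power law from the text:
    L(x) = C * x^(-alpha1) * [1 + (x/x_b)^s]^((alpha1 - alpha2)/s) + D
    For x << x_b the log-log slope is -alpha1; for x >> x_b it is -alpha2.
    """
    return C * x**(-alpha1) * (1 + (x / x_b)**s)**((alpha1 - alpha2) / s) + D

# Evaluate on a log-spaced grid and measure the effective local exponent.
x = np.logspace(0, 8, 200)
L = bnsl(x, C=10.0, alpha1=0.1, alpha2=0.5, x_b=1e4, s=4.0, D=0.0)
slope = -np.gradient(np.log(L), np.log(x))   # local exponent at each x
```

Near the left end of the grid the slope sits at `alpha1` (0.1 here); past the break scale `x_b` it smoothly approaches `alpha2` (0.5), which is exactly the regime change a single-slope power law cannot represent.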

2. Universal and Domain-Specific Findings

Empirical Universality

Scaling laws with power-law form have been empirically observed in a wide range of domains:

  • Language Modeling: Cross-entropy loss decreases as a power-law in compute and model/data size, with fitted exponents in the range $0.04$–$0.1$ for models like GPT-3, LLaMA, and their derivatives (Rosenfeld, 2021, Shen et al., 2024).
  • Vision: Image classification and zero-shot/few-shot adaptation exhibit power-law scaling in data and model size, with exponents typically between $0.3$ and $1.1$ (steeper for few-shot, out-of-distribution tasks) (Prato et al., 2021).
  • Dense Retrieval and Reranking: Retrieval loss and reranker metrics (NDCG, MAP) obey saturating power laws in model and data scale, even for discontinuous downstream metrics (Fang et al., 2024, Seetharaman et al., 5 Mar 2026).
  • Symbolic Regression: Deep transformer models for symbolic regression display steep scaling laws, with loss $L(C) \approx 1.3 \times 10^{3}\, C^{-0.21}$, a steeper improvement per FLOP than analogous LLM exponents (Otte et al., 30 Oct 2025).

Theory and Mechanistic Insights

Mathematical results establish that power-law scaling exponents are determined by the spectral (eigenvalue) decay of data or feature covariances:

  • In kernel and linear regression:

\mathbb{E}[L] = \sigma^2 + \Theta(m^{1-a}) + \Theta((N_{\rm eff}\gamma)^{-(a-1)/a})

where $a$ is the exponent of the covariance eigenvalue tail $\lambda_i \sim i^{-a}$ (Chen et al., 3 Mar 2025).

  • In kernel ridge regression, the learning curve exponent is

\alpha = \frac{2s}{2s + 1/\beta}

where $\beta$ controls the tail of the covariance spectrum and $s$ is a source-smoothness parameter; this explicit dependence formalizes the "redundancy law" for learning efficiency (Bi et al., 25 Sep 2025).

  • The "redundancy law" further predicts that systems with flatter (slower-decaying) spectra see slower scaling improvements; highly compressible data enables faster scaling.
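The closed-form exponent above makes the redundancy law easy to check numerically. The specific $(s, \beta)$ values below are illustrative only, under the convention in the text that larger $\beta$ corresponds to a faster-decaying (more compressible) spectrum:

```python
def krr_exponent(s: float, beta: float) -> float:
    """Kernel ridge regression learning-curve exponent from the text:
    alpha = 2s / (2s + 1/beta), where s is source smoothness and
    beta parametrizes the covariance spectral tail."""
    return 2 * s / (2 * s + 1 / beta)

# Faster spectral decay (larger beta) yields a larger exponent, i.e. faster
# scaling improvements; flatter (slower-decaying) spectra scale more slowly.
fast_decay = krr_exponent(s=1.0, beta=2.0)   # 2 / (2 + 0.5) = 0.8
slow_decay = krr_exponent(s=1.0, beta=0.5)   # 2 / (2 + 2.0) = 0.5
```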

Robustness and Practical Use

Scaling laws robustly extrapolate future gains from smaller experiments in many regimes (Ivgi et al., 2022, Jones, 2021, Seetharaman et al., 5 Mar 2026). For instance, NDCG and MAP for billion-parameter rerankers can be predicted accurately from models with fewer than 400M parameters (Seetharaman et al., 5 Mar 2026). In pre-training/fine-tuning, performance extrapolation with high $R^2$ is possible provided careful hyperparameter tuning and sufficient scale diversity (Ivgi et al., 2022).

3. Limits, Deviations, and Interpretability

Regularity Violations and Non-Power-Law Behavior

Not all domains or metrics exhibit smooth scaling laws:

  • Certain tasks or evaluation metrics (e.g., MRR, contrastive entropy) can show non-monotonic or erratic scaling behavior, breaking power-law predictability even as other metrics scale reliably in the same system (Seetharaman et al., 5 Mar 2026).
  • "Double descent" and regime-shift phenomena (common in overparameterized models) appear as sharp inflection points or nonmonotonicity in the scaling law. These cannot be captured by single-slope power laws and necessitate a smoothly broken form (Caballero et al., 2022).

Influence of Data, Architecture, and Optimization

  • Redundancy bottlenecks: Mixed-domain pretraining, dataset overlap, and multi-modal training can sharply reduce scaling exponents if the heaviest-tailed (most redundant) component dominates (Bi et al., 25 Sep 2025, Shukor et al., 12 Jul 2025).
  • Architectural changes: Mixture-of-Experts, attention mechanisms, and feature learning can quantitatively shift scaling behavior; there remains ongoing work to link such changes to precise exponent shifts (Otte et al., 30 Oct 2025).
  • Hyperparameters, especially batch size and learning rate, may scale nontrivially with model or compute size. In symbolic regression, optimal learning rate increases with scale, in contrast to the decreasing trend in standard LMs (Otte et al., 30 Oct 2025).

Theoretical Challenges in Explanation

Multiple works have shown that traditional VC-based theory is largely vacuous in the scaling regime; instead, random matrix theory, spectral analysis, and deterministic equivalence are key technical tools to explain observed learning curves and error plateaus (Chen et al., 3 Mar 2025, Wang et al., 3 Feb 2025). The norm-based capacity theory predicts that when model norm is the relevant complexity measure, classical U-shaped learning curves reappear, and double descent disappears (Wang et al., 3 Feb 2025).

4. Comparison, Optimization, and Automated Discovery

Model and Dataset Comparison

Cross-scale scaling law derivation enables principled comparison between models and pre-training procedures:

  • In vision-language pre-training, scaling-law fits reveal systematically better scaling exponents and lower irreducible errors for generative+contrastive training (MaMMUT) versus contrastive-only (CLIP), consistently across classification, retrieval, and segmentation tasks (Nezhurina et al., 5 Jun 2025).
  • Data mixture optimization: Closed-form (or numerically tractable) scaling law formulas allow direct calculation of the performance-maximizing domain mixture for any compute budget, outperforming costly trial-and-error approaches (Shukor et al., 12 Jul 2025).
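Mixture optimization of this kind can be sketched as constrained minimization of an assumed additive per-domain law. The law form and all parameter values below are hypothetical placeholders, not values from the cited papers:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical per-domain scaling parameters (A_k, alpha_k): loss contribution
# of domain k given its share w_k of a fixed token budget.
A = np.array([4.0, 6.0, 2.0])
alpha = np.array([0.30, 0.15, 0.45])
D_total = 1e9   # fixed pretraining token budget

def mixture_loss(w):
    """Assumed additive law: L(w) = sum_k A_k * (w_k * D_total)^(-alpha_k)."""
    return np.sum(A * (w * D_total) ** (-alpha))

# Minimize over the probability simplex: w_k >= 0, sum_k w_k = 1.
res = minimize(
    mixture_loss,
    x0=np.full(3, 1 / 3),
    bounds=[(1e-6, 1.0)] * 3,
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
    method="SLSQP",
)
w_opt = res.x
```

Because the objective and constraint are smooth and low-dimensional, the optimal mixture is obtained directly rather than by sweeping candidate mixtures with full training runs.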

Automated Law Discovery

Evolutionary algorithms guided by LLMs can now autonomously rediscover and in some cases surpass human-derived scaling laws, co-optimizing symbolic expressions and fitting routines for best cross-group fit. The EvoSLD system yields parsimonious, interpretable forms and outperforms traditional symbolic regression or naive approaches by orders of magnitude on held-out NMSE in multiple real-world settings (Lin et al., 27 Jul 2025).

5. Caveats, Critiques, and Social-Scientific Reappraisal

Sociotechnical Limitations of "Universal" Scaling

Recent research raises critical concerns regarding the universal applicability of scaling laws:

  • Metrics as proxies: The metrics used in most scaling analyses may not adequately capture the plural notions of "quality" relevant to diverse user communities; large models can degrade on minoritized subgroups or alternate constructs as size increases (Diaz et al., 2023).
  • Inverse scaling and subgroup degradation: Empirical work demonstrates that beyond a critical point, models may plateau or even worsen on specific constructs (truthfulness, fairness, low-resource language coverage), contradicting the simplicity of universal scaling curves (Diaz et al., 2023).
  • Need for localized scaling laws: Disaggregated scaling curves per subgroup, participatory metric design, and non-scalable, community-specific architectures are recommended to avoid misleading averages masking harm (Diaz et al., 2023).

Predictability and Fundamental Uncertainty

A limit to predictability arises when sharp phase transitions or inflection points fall outside the training regime; BNSL fitting can only extrapolate such features if the experimental design samples candidate break points (Caballero et al., 2022).

6. Practical Methodologies and Implementation Guidance

Scaling-law guided experimentation benefits from precise protocol design:

  • Fit in (log,log) space after extracting the compute-optimal Pareto frontier.
  • Use rigorous cross-validation and, for automated discovery, co-evolve both algebraic form and optimization subroutine (Lin et al., 27 Jul 2025).
  • Optimize data/model allocation for fixed budget by closed-form constraint minimization, incorporating costs for annotation, training, and even inference (Fang et al., 2024).
  • For new tasks or domains, use multiple small-scale runs to probe for regime breaks or anomalous scaling, revising law forms as needed (Ivgi et al., 2022, Caballero et al., 2022).
  • Quantify and report uncertainty by bootstrapping and out-of-sample extrapolation errors; $R^2$ values above $0.95$ indicate reliable forecasting, but special attention is needed near regime shifts or when observed performance departs systematically from the predicted law (Ivgi et al., 2022).
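A minimal sketch of the first and last recommendations, combining a log-log fit with a bootstrap estimate of exponent uncertainty. The data are synthetic and follow a pure power law with no irreducible term, so the log-log relationship is exactly linear:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic (compute, loss) observations: L = 8 * C^-0.25 with mild
# multiplicative noise, mimicking measurements along a compute frontier.
C = np.logspace(0, 6, 30)
L = 8.0 * C**-0.25 * np.exp(0.02 * rng.standard_normal(30))

# Step 1: fit a line in (log C, log L) space; the slope is -alpha.
logC, logL = np.log(C), np.log(L)
slope, intercept = np.polyfit(logC, logL, 1)
alpha_hat = -slope

# Step 2: bootstrap the fit to quantify uncertainty in the exponent.
boot = []
for _ in range(500):
    idx = rng.integers(0, len(C), len(C))
    s, _ = np.polyfit(logC[idx], logL[idx], 1)
    boot.append(-s)
lo, hi = np.quantile(boot, [0.025, 0.975])   # 95% interval for alpha
```

If the law has a nonzero irreducible term $L_\infty$, the log-log fit is biased at large scales and a nonlinear fit of the full form is required instead.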

7. Future Directions and Open Questions

  • Extension of scaling-law analysis to new domains including symbolic regression with more variables, denser/correlated regimes in kernel methods, DNA denaturation transitions with environmental/sequence heterogeneity, or complex multi-stage retrieval pipelines (Otte et al., 30 Oct 2025, Chen et al., 3 Mar 2025, Honchar et al., 2021, Seetharaman et al., 5 Mar 2026).
  • Active design of experiments for optimal law estimation and detection of regime changes.
  • Direct measurement and control of feature covariance spectra in LLMs as a predictor of scaling exponents (Chen et al., 3 Mar 2025, Bi et al., 25 Sep 2025).
  • Determination of scaling behavior in models beyond current compute regimes, including multi-billion/multi-trillion parameter models, and test of law extrapolation beyond currently observed domains.
  • Sociotechnical integration: development of plural, community-specific scaling laws and metrics, acknowledging irreducible value tensions in large-scale AI deployment (Diaz et al., 2023).
