
Broken Neural Scaling Laws in Deep Learning

Updated 27 January 2026
  • Broken Neural Scaling Laws (BNSL) are functional forms that capture non-monotonic deviations from classical power-law scaling, evidenced by plateaus, double descent, and abrupt transitions.
  • BNSL extend traditional scaling models by introducing piecewise power-law frameworks to more accurately represent performance across diverse architectures and tasks.
  • Empirical diagnostics and fitting methods for BNSL enable refined resource allocation strategies, highlighting optimal model sizes and training duration to avoid performance pitfalls.

Broken Neural Scaling Laws (BNSL) are functional relationships that model how neural network performance metrics (e.g., loss, accuracy) vary in a non-uniform, often non-monotonic way as model scale, training data, or computational resources are increased. Unlike classical neural scaling laws, which posit smooth power-law decay of loss with increased model size or data, BNSLs capture systematic deviations—manifesting as plateaus, inflections, double descent phenomena, or sharp transitions—across a broad array of neural architectures, modalities, and tasks. These broken scaling behaviors are observed universally, including in large-scale language and vision models, pruned or sparse models, mixture-of-experts architectures, retrieval-augmented systems, and during domain adaptation or fine-tuning, necessitating a richer, more expressive analytical framework (Caballero et al., 2022, Sengupta et al., 17 Feb 2025).

1. Standard Neural Scaling Laws and Their Limitations

Traditional neural scaling laws predict that a model's evaluation loss L obeys a parametric power law as a function of the number of parameters (N), dataset size (D), and compute budget (C):

L(N, D, C) \approx A N^{-\alpha} + B D^{-\beta} + E C^{-\gamma}

with empirically determined exponents α, β, γ > 0 and positive constants A, B, E. In practice, exponents vary by domain: for example, α ≈ 0.07–0.25 and β ≈ 0.05–0.3 for LLMs. The Chinchilla scaling law refines this by advocating D ∝ N for compute-optimal scaling (Sengupta et al., 17 Feb 2025). On log–log plots, these laws yield straight lines, readily fit and extrapolated via standard regression.
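As a minimal sketch of how such a separable law is evaluated (all coefficients and exponents below are illustrative placeholders, not fitted values from any paper):

```python
import numpy as np

def power_law_loss(N, D, C, A=400.0, B=600.0, E=1.7,
                   alpha=0.34, beta=0.28, gamma=0.10):
    """Separable scaling law L ≈ A N^-alpha + B D^-beta + E C^-gamma.
    All coefficients and exponents here are illustrative placeholders."""
    return A * N ** -alpha + B * D ** -beta + E * C ** -gamma

# Under a separable law, loss falls monotonically as any one resource grows.
losses = [power_law_loss(n, 1e9, 1e18) for n in (1e6, 1e8, 1e10)]
```

Because each term is a straight line on log–log axes, such fits extrapolate trivially; the sections below concern the regimes where that extrapolation fails.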

While such laws are valid over several orders of magnitude and facilitate automated resource allocation, empirical evidence across domains shows persistent breakdowns—regions where loss curves bend, plateau, or even rise, violating monotonicity and the separability assumption inherent in single power-law forms (Caballero et al., 2022, Sengupta et al., 17 Feb 2025). In these regions, classical scaling laws provide misleading or failed extrapolations.

2. Empirical Manifestations of Broken Scaling

Multiple large-scale studies have established BNSL as a general phenomenon. Key patterns include:

  • Double descent: U-shaped curves in test loss as a function of width, data, or training duration, particularly pronounced in overparameterized or noisy regimes (Caballero et al., 2022, Boopathy et al., 2024).
  • Plateaus: Loss ceases to decrease, even with further scale increases, typically when an intrinsic spectral limit of data or model capacity is reached (Maloney et al., 2022, Caballero et al., 2022).
  • Transition thresholds: Sharp or gradual inflections mark the onset of emergent capabilities (e.g., arithmetic reasoning), non-monotonicities in transfer or domain adaptation, or late recovery from double descent (Caballero et al., 2022).
  • Architecture-specific anomalies: High sparsity, mixture-of-experts routing, and retrieval-augmented systems introduce additional scaling axes and break standard predictions. For mixture-of-experts, unified laws such as

\log L(N, E) = a \log N + b \log E + c\,(\log N)(\log E) + d

more accurately model observed loss than purely separable power laws (Sengupta et al., 17 Feb 2025).
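Because this interaction law is linear in its coefficients once logs are taken, it can be fitted with ordinary least squares; the grid of (N, E, loss) observations below is hypothetical, chosen only to make the shape of the fit concrete:

```python
import numpy as np

# Hypothetical grid of (dense parameter count N, expert count E, observed loss);
# these numbers are placeholders, not measurements from any paper.
N = np.array([1e7, 1e7, 1e8, 1e8, 1e9, 1e9])
E = np.array([4.0, 16.0, 4.0, 16.0, 4.0, 16.0])
L = np.array([3.2, 3.0, 2.7, 2.5, 2.3, 2.2])

# log L = a log N + b log E + c (log N)(log E) + d is linear in (a, b, c, d),
# so ordinary least squares on the design matrix suffices.
logN, logE = np.log(N), np.log(E)
X = np.column_stack([logN, logE, logN * logE, np.ones_like(logN)])
coef, *_ = np.linalg.lstsq(X, np.log(L), rcond=None)
a, b, c, d = coef
```

A nonzero interaction coefficient c is exactly what a purely separable power law cannot express.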

Empirically, these behaviors have been documented across CV (e.g., CIFAR-100, ImageNet), NLP (BIG-Bench, machine translation), generative modeling (diffusion models, FID), and multi-agent reinforcement learning (Caballero et al., 2022, Sengupta et al., 17 Feb 2025).

3. Theoretical Formulations of BNSL

To capture these phenomena, Broken Neural Scaling Laws generalize the classical power law into a smoothly broken, piecewise form on log–log axes. For a scaling variable x (parameters, data, compute) and a performance metric y, the n-break BNSL is:

y = a + b x^{-c_0} \prod_{i=1}^n \left[ 1 + \left(\frac{x}{d_i}\right)^{1/f_i} \right]^{-c_i f_i}

or equivalently,

y = a + b x^{-c_0} \exp\left(-\sum_{i=1}^n c_i f_i \log\left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)\right)

where breaks at x ≈ d_i change the slope of log y vs. log x by −c_i, with f_i controlling the smoothness of each transition. For n = 0, this reduces to the classical power law (Caballero et al., 2022).

This functional form allows for:

  • Multiple linear regimes in log–log, corresponding to distinct scaling "phases"
  • Inflection points and non-monotonicity (e.g., double descent)
  • Sharp or soft transition control via f_i
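A direct transcription of the n-break product form as a sketch (for very sharp breaks, the equivalent exp–log form is numerically safer when (x/d_i)^{1/f_i} grows large):

```python
import numpy as np

def bnsl(x, a, b, c0, c, d, f):
    """n-break BNSL: y = a + b x^{-c0} * prod_i [1 + (x/d_i)^{1/f_i}]^{-c_i f_i}.
    c, d, f are equal-length sequences of per-break parameters (n = len(c));
    with n = 0 this reduces to the plain power law a + b x^{-c0}."""
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for ci, di, fi in zip(c, d, f):
        # Each factor bends the log-log slope by an extra -ci around x ≈ di,
        # with fi setting how gradual the bend is.
        y = y * (1.0 + (x / di) ** (1.0 / fi)) ** (-ci * fi)
    return a + y
```

Well below a break (x ≪ d_i) each factor is ≈ 1 and the curve follows the slope of the preceding regime; well above it, the factor behaves as (x/d_i)^{−c_i}, steepening the slope by c_i.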

Refined scaling forms for special architectures and training regimes appear in sparse/pruned models, retrieval-augmented learning, and multimodal systems. For example, a "P² law" models the non-monotonicities arising from high sparsity:

L(N0,D,ρ,L0)=L0+(1/ρ)γ(1/N0)δ[NcN0α+DcDβ+E]L(N_0, D, \rho, L_0) = L_0 + (1/\rho)^{\gamma} (1/N_0)^{\delta} \left[ \frac{N_c}{N_0^{\alpha}} + \frac{D_c}{D^{\beta}} + E \right]

with ρ\rho as retained parameter fraction (Sengupta et al., 17 Feb 2025).

4. Mechanistic Origins and Universality

Breakdown of classical scaling arises from several mechanisms:

  • Spectral bottlenecks: In statistical models with a power-law spectrum (λ_I ∼ I^{−(1+α)}), scaling holds only as long as N, D ≪ M (the latent feature dimension). Once N or D approaches M, scaling breaks and the loss plateaus (Maloney et al., 2022).
  • Lottery ticket ensembling: For very wide networks, performance transitions from an approximation-theoretic to a statistical ensembling regime. The loss can evolve from the expected N^{−4/d} law (classical) to an N^{−1} regime determined by the number of independent lottery tickets, with variance reductions governed by central-limit behavior (Liu et al., 2023).
  • Optimization-induced transitions: Overparameterization or overtraining in noisy conditions leads to double descent and eventual broken scaling, where error increases with additional parameters or epochs unless regularized or stopped optimally (Boopathy et al., 2024, Caballero et al., 2022).
  • Renormalization group (RG) effects: RG analysis reveals that non-Gaussian corrections (e.g., quartic interactions, spectrum discreteness) induce "scaling intervals" and non-universal transient exponents. Universal scaling is recovered only in the infinite data/model-width limit (the Gaussian Process fixed point), while at finite P substantial systematic deviations (broken scaling) are observed (Coppola et al., 29 Oct 2025).

The universality of exponents at large scale is thus contingent on minimality of architectural and data-induced perturbations; otherwise, BNSL prevails in finite-sample, finite-width settings.

5. Diagnostics, Fitting Methodology, and Extrapolation

BNSL requires rigorous diagnostic and fitting procedures:

  • Residual analysis: Fitting a power law and plotting residuals vs. log N (or other axes); systematic curvature or deviations signal a breakdown (Sengupta et al., 17 Feb 2025).
  • Local exponent drift: Estimating α(N) = −d log L / d log N; significant variation with N indicates non-uniform scaling (Sengupta et al., 17 Feb 2025).
  • Goodness-of-fit metrics: Comparing R², AIC, and BIC for single- vs. broken-power-law fits; smoothly broken forms generally yield superior out-of-sample extrapolation (Caballero et al., 2022).
  • Piecewise/broken power law fitting: Application of grid search and nonlinear least squares (e.g., scipy.optimize.curve_fit) with mean squared log error (MSLE) as the objective, and cross-validation for the number of breaks (Caballero et al., 2022).
  • Validation protocols: Holding out the largest (in-distribution) points to test for the required number of breaks; examining extrapolation error on points 1.3×–2× beyond the fit domain (Caballero et al., 2022).
  • Code and tools: Open-source implementations—e.g., the Python package at https://github.com/ethancaballero/broken_neural_scaling_laws—provide function definitions, gradient computation, break-detection utilities, and plotting scripts (Caballero et al., 2022).
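A sketch of that fitting recipe for a one-break law: fitting in log space makes the least-squares objective coincide with MSLE. The data here are synthetic, generated from a known ground truth purely for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def log_bnsl_one_break(x, a, b, c0, c1, d1, f1):
    """Log of a one-break BNSL; least squares on log-targets equals MSLE."""
    y = a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)
    return np.log(y)

# Synthetic measurements from a known ground truth (illustrative only).
rng = np.random.default_rng(0)
x = np.logspace(2, 8, 40)
true_params = (0.5, 20.0, 0.15, 0.25, 1e5, 0.3)
y_log = log_bnsl_one_break(x, *true_params) + rng.normal(0.0, 0.01, x.size)

# Bounds keep the break location and smoothness in a numerically safe range;
# in practice a grid search over initializations guards against local minima.
popt, _ = curve_fit(
    log_bnsl_one_break, x, y_log,
    p0=(0.4, 10.0, 0.10, 0.20, 5e4, 0.5),
    bounds=([0.0, 0.0, 0.0, 0.0, 1e3, 0.05],
            [10.0, 1e3, 2.0, 2.0, 1e7, 5.0]),
    maxfev=20000,
)
msle = np.mean((log_bnsl_one_break(x, *popt) - y_log) ** 2)
```

Cross-validating the number of breaks and the held-out-largest-points check then follow the validation protocols listed above.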

For adaptive allocation under BNSL, iterative planners update (N, D) based on observed exponent drift, dynamically switch to broken-power models, and reallocate resources for near-optimality within observed scaling regimes (Sengupta et al., 17 Feb 2025).
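The exponent-drift signal such planners monitor can be estimated with simple finite differences; the plateauing loss curve below is hypothetical, constructed to show the drift:

```python
import numpy as np

def local_exponents(N, L):
    """Finite-difference estimate of the local exponent
    alpha(N) = -d log L / d log N between consecutive measurements."""
    logN, logL = np.log(np.asarray(N)), np.log(np.asarray(L))
    return -np.diff(logL) / np.diff(logN)

# Hypothetical loss curve decaying toward an irreducible floor of 0.8:
# the local exponent drifts toward zero as the floor starts to dominate,
# which is the cue to switch from a single power law to a broken model.
N = np.logspace(6, 10, 9)
L = 0.8 + 300.0 * N ** -0.3
alphas = local_exponents(N, L)
```

A roughly constant alphas sequence supports a single power law; a systematic drift, as here, triggers refitting with breaks and a revised (N, D) allocation.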

6. Practical Guidelines and Implications

Practical strategies based on BNSL include:

  • Under fixed compute, model size and training duration can be freely traded (scale–time equivalence); practitioners can choose based on resource or scheduling constraints (Boopathy et al., 2024).
  • Observation of double descent or rising loss signals regimes where further scaling is deleterious; optimality is achieved near inflection points of E(pT) (Boopathy et al., 2024, Caballero et al., 2022).
  • Adding more data is consistently non-harmful (global error never rises), though with diminishing returns (Boopathy et al., 2024).
  • In the presence of noise, large p or T can hurt rather than help, necessitating careful selection of termination points, typically identified via small-scale, long-horizon probes (Boopathy et al., 2024).
  • Residual and local-slope-based diagnostics should guide the choice of scaling laws and highlight regions where BNSL capture unpredictable or sharp transitions.
  • BNSL-fitted forms outperform classical models in extrapolation, notably on tasks with non-monotonic transitions—e.g., few-shot learning, arithmetic grokking, emergent chain-of-thought, and OOD generalization (Caballero et al., 2022).

7. Open Challenges and Future Directions

Several open challenges remain for BNSL research, spanning theory and fitting methodology.

BNSL thus represents a necessary generalization of neural scaling law theory, providing both a descriptive and predictive framework for the rich, multi-phase behaviors manifested across the breadth of modern deep learning (Caballero et al., 2022, Boopathy et al., 2024, Sengupta et al., 17 Feb 2025, Maloney et al., 2022, Liu et al., 2023, Coppola et al., 29 Oct 2025).
