
Broken Neural Scaling Laws in Deep Learning

Updated 27 January 2026
  • Broken Neural Scaling Laws (BNSL) are functional forms that capture non-monotonic deviations from classical power-law scaling, evidenced by plateaus, double descent, and abrupt transitions.
  • BNSL extend traditional scaling models by introducing piecewise power-law frameworks to more accurately represent performance across diverse architectures and tasks.
  • Empirical diagnostics and fitting methods for BNSL enable refined resource allocation strategies, highlighting optimal model sizes and training duration to avoid performance pitfalls.

Broken Neural Scaling Laws (BNSL) are functional relationships that model how neural network performance metrics (e.g., loss, accuracy) vary in a non-uniform, often non-monotonic way as model scale, training data, or computational resources are increased. Unlike classical neural scaling laws, which posit smooth power-law decay of loss with increased model size or data, BNSLs capture systematic deviations—manifesting as plateaus, inflections, double descent phenomena, or sharp transitions—across a broad array of neural architectures, modalities, and tasks. These broken scaling behaviors are observed universally, including in large-scale language and vision models, pruned or sparse models, mixture-of-experts architectures, retrieval-augmented systems, and during domain adaptation or fine-tuning, necessitating a richer, more expressive analytical framework (Caballero et al., 2022, Sengupta et al., 17 Feb 2025).

1. Standard Neural Scaling Laws and Their Limitations

Traditional neural scaling laws predict that a model's evaluation loss L obeys a parametric power law as a function of the number of parameters (N), dataset size (D), and compute budget (C):

L(N, D, C) \approx A N^{-\alpha} + B D^{-\beta} + E C^{-\gamma}

with empirically determined exponents α, β, γ > 0 and positive constants A, B, E. In practice, exponents vary by domain: for example, α ≈ 0.07–0.25 and β ≈ 0.05–0.3 for LLMs. The Chinchilla scaling law refines this by advocating D ∝ N for compute-optimal scaling (Sengupta et al., 17 Feb 2025). On log–log plots, these laws yield straight lines, readily fit and extrapolated via standard regression.
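As a minimal sketch of how such a separable law is evaluated (all coefficients and exponents below are illustrative placeholders, not fitted values from any paper):

```python
import numpy as np

def power_law_loss(N, D, C, A=400.0, B=600.0, E=1.7,
                   alpha=0.34, beta=0.28, gamma=0.10):
    """Separable scaling law L ≈ A N^-alpha + B D^-beta + E C^-gamma.
    All coefficients and exponents here are illustrative placeholders."""
    return A * N ** -alpha + B * D ** -beta + E * C ** -gamma

# Under a separable law, loss falls monotonically as any one resource grows.
losses = [power_law_loss(n, 1e9, 1e18) for n in (1e6, 1e8, 1e10)]
```

Because each term is a straight line on log–log axes, such fits extrapolate trivially; the sections below concern the regimes where that extrapolation fails.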

While such laws are valid over several orders of magnitude and facilitate automated resource allocation, empirical evidence across domains shows persistent breakdowns—regions where loss curves bend, plateau, or even rise, violating monotonicity and the separability assumption inherent in single power-law forms (Caballero et al., 2022, Sengupta et al., 17 Feb 2025). In these regions, classical scaling laws provide misleading or failed extrapolations.

2. Empirical Manifestations of Broken Scaling

Multiple large-scale studies have established BNSL as a general phenomenon. Key patterns include:

  • Double descent: U-shaped curves in test loss as a function of width, data, or training duration, particularly pronounced in overparameterized or noisy regimes (Caballero et al., 2022, Boopathy et al., 2024).
  • Plateaus: Loss ceases to decrease, even with further scale increases, typically when an intrinsic spectral limit of data or model capacity is reached (Maloney et al., 2022, Caballero et al., 2022).
  • Transition thresholds: Sharp or gradual inflections mark the onset of emergent capabilities (e.g., arithmetic reasoning), non-monotonicities in transfer or domain adaptation, or late recovery from double descent (Caballero et al., 2022).
  • Architecture-specific anomalies: High sparsity, mixture-of-experts routing, and retrieval-augmented systems introduce additional scaling axes and break standard predictions. For mixture-of-experts, unified laws such as

\log L(N, E) = a \log N + b \log E + c\,(\log N)(\log E) + d

more accurately model observed loss than purely separable power laws (Sengupta et al., 17 Feb 2025).
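Because this interaction law is linear in its coefficients once logs are taken, it can be fitted with ordinary least squares; the grid of (N, E, loss) observations below is hypothetical, chosen only to make the shape of the fit concrete:

```python
import numpy as np

# Hypothetical grid of (dense parameter count N, expert count E, observed loss);
# these numbers are placeholders, not measurements from any paper.
N = np.array([1e7, 1e7, 1e8, 1e8, 1e9, 1e9])
E = np.array([4.0, 16.0, 4.0, 16.0, 4.0, 16.0])
L = np.array([3.2, 3.0, 2.7, 2.5, 2.3, 2.2])

# log L = a log N + b log E + c (log N)(log E) + d is linear in (a, b, c, d),
# so ordinary least squares on the design matrix suffices.
logN, logE = np.log(N), np.log(E)
X = np.column_stack([logN, logE, logN * logE, np.ones_like(logN)])
coef, *_ = np.linalg.lstsq(X, np.log(L), rcond=None)
a, b, c, d = coef
```

A nonzero interaction coefficient c is exactly what a purely separable power law cannot express.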

Empirically, these behaviors have been documented across CV (e.g., CIFAR-100, ImageNet), NLP (BIG-Bench, machine translation), generative modeling (diffusion models, FID), and multi-agent reinforcement learning (Caballero et al., 2022, Sengupta et al., 17 Feb 2025).

3. Theoretical Formulations of BNSL

To capture these phenomena, Broken Neural Scaling Laws generalize the classical power law into a smoothly broken, piecewise form on log–log axes. For a scaling variable x (parameters, data, compute) and a performance metric y, the n-break BNSL is:

y = a + b x^{-c_0} \prod_{i=1}^n \left[ 1 + \left(\frac{x}{d_i}\right)^{1/f_i} \right]^{-c_i f_i}

or equivalently,

y = a + b x^{-c_0} \exp\left(-\sum_{i=1}^n c_i f_i \log\left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)\right)

where breaks at x ≈ d_i change the slope of log y vs. log x by −c_i, with f_i controlling the smoothness of each transition. For n = 0, this reduces to the classical power law (Caballero et al., 2022).

This functional form allows for:

  • Multiple linear regimes in log–log, corresponding to distinct scaling "phases"
  • Inflection points and non-monotonicity (e.g., double descent)
  • Sharp or soft transition control via f_i
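A direct transcription of the n-break product form as a sketch (for very sharp breaks, the equivalent exp–log form is numerically safer when (x/d_i)^{1/f_i} grows large):

```python
import numpy as np

def bnsl(x, a, b, c0, c, d, f):
    """n-break BNSL: y = a + b x^{-c0} * prod_i [1 + (x/d_i)^{1/f_i}]^{-c_i f_i}.
    c, d, f are equal-length sequences of per-break parameters (n = len(c));
    with n = 0 this reduces to the plain power law a + b x^{-c0}."""
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for ci, di, fi in zip(c, d, f):
        # Each factor bends the log-log slope by an extra -ci around x ≈ di,
        # with fi setting how gradual the bend is.
        y = y * (1.0 + (x / di) ** (1.0 / fi)) ** (-ci * fi)
    return a + y
```

Well below a break (x ≪ d_i) each factor is ≈ 1 and the curve follows the slope of the preceding regime; well above it, the factor behaves as (x/d_i)^{−c_i}, steepening the slope by c_i.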

Refined scaling forms for special architectures and training regimes appear in sparse/pruned models, retrieval-augmented learning, and multimodal systems. For example, a "P² law" models the non-monotonicities arising from high sparsity:

L(N0,D,ρ,L0)=L0+(1/ρ)γ(1/N0)δ[NcN0α+DcDβ+E]L(N_0, D, \rho, L_0) = L_0 + (1/\rho)^{\gamma} (1/N_0)^{\delta} \left[ \frac{N_c}{N_0^{\alpha}} + \frac{D_c}{D^{\beta}} + E \right]

with ρ\rho as retained parameter fraction (Sengupta et al., 17 Feb 2025).

4. Mechanistic Origins and Universality

Breakdown of classical scaling arises from several mechanisms:

  • Spectral bottlenecks: In statistical models with a power-law spectrum (λ_I ∼ I^{−(1+α)}), scaling holds only as long as N, D ≪ M (the latent feature dimension). Once N or D approaches M, scaling breaks and the loss plateaus (Maloney et al., 2022).
  • Lottery ticket ensembling: For very wide networks, performance transitions from an approximation-theoretic to a statistical ensembling regime. The loss can evolve from the expected N^{−4/d} law (classical) to an N^{−1} regime determined by the number of independent lottery tickets, with variance reductions governed by central-limit behavior (Liu et al., 2023).
  • Optimization-induced transitions: Overparameterization or overtraining in noisy conditions leads to double descent and eventual broken scaling, where error increases with additional parameters or epochs unless regularized or stopped optimally (Boopathy et al., 2024, Caballero et al., 2022).
  • Renormalization group (RG) effects: RG analysis reveals that non-Gaussian corrections (e.g., quartic interactions, spectrum discreteness) induce "scaling intervals" and non-universal transient exponents. Universal scaling is recovered only in the infinite data/model-width limit (the Gaussian Process fixed point), while at finite P substantial systematic deviations (broken scaling) are observed (Coppola et al., 29 Oct 2025).

The universality of exponents at large scale is thus contingent on minimality of architectural and data-induced perturbations; otherwise, BNSL prevails in finite-sample, finite-width settings.

5. Diagnostics, Fitting Methodology, and Extrapolation

BNSL requires rigorous diagnostic and fitting procedures:

  • Residual analysis: Fitting a power law and plotting residuals vs. log N (or other axes); systematic curvature or deviations signal a breakdown (Sengupta et al., 17 Feb 2025).
  • Local exponent drift: Estimating α(N) = −d log L / d log N; significant variation with N indicates non-uniform scaling (Sengupta et al., 17 Feb 2025).
  • Goodness-of-fit metrics: Comparing R², AIC, and BIC for single- vs. broken-power-law fits; smoothly broken forms generally yield superior out-of-sample extrapolation (Caballero et al., 2022).
  • Piecewise/broken power law fitting: Application of grid search and nonlinear least squares (e.g., scipy.optimize.curve_fit) with mean squared log error (MSLE) as the objective, and cross-validation for the number of breaks (Caballero et al., 2022).
  • Validation protocols: Holding out the largest (in-distribution) points to test for the required number of breaks; examining extrapolation error on points 1.3×–2× beyond the fit domain (Caballero et al., 2022).
  • Code and tools: Open-source implementations—e.g., the Python package at https://github.com/ethancaballero/broken_neural_scaling_laws—provide function definitions, gradient computation, break-detection utilities, and plotting scripts (Caballero et al., 2022).
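A sketch of that fitting recipe for a one-break law: fitting in log space makes the least-squares objective coincide with MSLE. The data here are synthetic, generated from a known ground truth purely for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def log_bnsl_one_break(x, a, b, c0, c1, d1, f1):
    """Log of a one-break BNSL; least squares on log-targets equals MSLE."""
    y = a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)
    return np.log(y)

# Synthetic measurements from a known ground truth (illustrative only).
rng = np.random.default_rng(0)
x = np.logspace(2, 8, 40)
true_params = (0.5, 20.0, 0.15, 0.25, 1e5, 0.3)
y_log = log_bnsl_one_break(x, *true_params) + rng.normal(0.0, 0.01, x.size)

# Bounds keep the break location and smoothness in a numerically safe range;
# in practice a grid search over initializations guards against local minima.
popt, _ = curve_fit(
    log_bnsl_one_break, x, y_log,
    p0=(0.4, 10.0, 0.10, 0.20, 5e4, 0.5),
    bounds=([0.0, 0.0, 0.0, 0.0, 1e3, 0.05],
            [10.0, 1e3, 2.0, 2.0, 1e7, 5.0]),
    maxfev=20000,
)
msle = np.mean((log_bnsl_one_break(x, *popt) - y_log) ** 2)
```

Cross-validating the number of breaks and the held-out-largest-points check then follow the validation protocols listed above.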

For adaptive allocation under BNSL, iterative planners update (N, D) based on observed exponent drift, dynamically switch to broken-power models, and reallocate resources for near-optimality within observed scaling regimes (Sengupta et al., 17 Feb 2025).
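The exponent-drift signal such planners monitor can be estimated with simple finite differences; the plateauing loss curve below is hypothetical, constructed to show the drift:

```python
import numpy as np

def local_exponents(N, L):
    """Finite-difference estimate of the local exponent
    alpha(N) = -d log L / d log N between consecutive measurements."""
    logN, logL = np.log(np.asarray(N)), np.log(np.asarray(L))
    return -np.diff(logL) / np.diff(logN)

# Hypothetical loss curve decaying toward an irreducible floor of 0.8:
# the local exponent drifts toward zero as the floor starts to dominate,
# which is the cue to switch from a single power law to a broken model.
N = np.logspace(6, 10, 9)
L = 0.8 + 300.0 * N ** -0.3
alphas = local_exponents(N, L)
```

A roughly constant alphas sequence supports a single power law; a systematic drift, as here, triggers refitting with breaks and a revised (N, D) allocation.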

6. Practical Guidelines and Implications

Practical strategies based on BNSL include:

  • Under fixed compute, model size and training duration can be freely traded (scale–time equivalence); practitioners can choose based on resource or scheduling constraints (Boopathy et al., 2024).
  • Observation of double descent or rising loss signals regimes where further scaling is deleterious; optimality is achieved near inflection points of E(pT) (Boopathy et al., 2024, Caballero et al., 2022).
  • Adding more data is consistently non-harmful (global error never rises), though with diminishing returns (Boopathy et al., 2024).
  • In the presence of noise, large p or T can hurt rather than help, necessitating careful selection of termination points, typically identified via small-scale, long-horizon probes (Boopathy et al., 2024).
  • Residual and local-slope-based diagnostics should guide the choice of scaling laws and highlight regions where BNSL capture unpredictable or sharp transitions.
  • BNSL-fitted forms outperform classical models in extrapolation, notably on tasks with non-monotonic transitions—e.g., few-shot learning, arithmetic grokking, emergent chain-of-thought, and OOD generalization (Caballero et al., 2022).

7. Open Challenges and Future Directions

Several open challenges remain for BNSL research, spanning theory and fitting methodology.

BNSL thus represents a necessary generalization of neural scaling law theory, providing both a descriptive and predictive framework for the rich, multi-phase behaviors manifested across the breadth of modern deep learning (Caballero et al., 2022, Boopathy et al., 2024, Sengupta et al., 17 Feb 2025, Maloney et al., 2022, Liu et al., 2023, Coppola et al., 29 Oct 2025).
