Broken Neural Scaling Laws (BNSLs)

Updated 1 July 2025
  • Broken Neural Scaling Laws (BNSLs) are systematic deviations from classic power-law scaling that exhibit non-monotonic trends, phase transitions, and plateaus in neural network performance.
  • They leverage a flexible, parameterized framework to capture double descent, sharp inflections, and delayed improvements across diverse architectures and learning tasks.
  • BNSLs inform optimal model design and resource allocation by diagnosing regime-dependent scaling behaviors that traditional models often overlook.

Broken Neural Scaling Laws (BNSLs) refer to systematic and principled departures from classical power-law scaling relationships that govern the performance of neural networks as a function of model size, dataset size, compute, or other key variables. While traditional neural scaling laws suggest a smooth, monotonic improvement in error or loss—often characterized as a power law—across increasing resources, BNSLs encapsulate the broad spectrum of non-monotonic trends, sharp phase transitions, plateaus, and regime-dependent scaling that emerge in modern architectures, tasks, and data regimes. These deviations have profound consequences for model design, resource budgeting, and the extrapolation of expected performance at scale.

1. Classical Scaling Laws and Motivation for BNSLs

Classic neural scaling laws predict that as a resource—such as the number of model parameters $N$, data points $D$, or compute $C$—grows, the evaluation metric (e.g., test error $\mathcal{L}$) falls according to a power law:

$$\mathcal{L}(P_1, \ldots, P_n) = \sum_{i=1}^{n} \alpha_i P_i^{-\beta_i}$$

where $P_i$ are scaling variables and $\alpha_i, \beta_i$ are empirically derived coefficients. This relationship underpins much of the strategy behind the upscaling of LLMs, vision transformers, and a host of other architectures, guiding compute-optimal scaling and resource allocation (2502.12051).
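
As a concrete reference point, the following minimal Python sketch evaluates a two-variable instance of this additive power law. The coefficient values are illustrative placeholders, not fitted constants from any cited work.

```python
import numpy as np

def power_law_loss(params, alphas, betas):
    """Additive multi-variable power law:
    L(P_1, ..., P_n) = sum_i alpha_i * P_i^(-beta_i)."""
    params = np.asarray(params, dtype=float)
    return float(np.sum(alphas * params ** (-betas)))

# Illustrative placeholder coefficients for model size N and data size D.
alphas = np.array([400.0, 410.0])
betas = np.array([0.34, 0.28])

for N in (1e8, 1e9, 1e10):
    print(f"N={N:.0e}, D=1e10 -> L ~ {power_law_loss([N, 1e10], alphas, betas):.4f}")
```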

However, empirical findings across domains such as vision, language, reinforcement learning, fine-tuning, and within contemporary sparse, modular, and retrieval-augmented models demonstrate regular and reproducible breakdowns of this universal law. This breakdown is often non-random: it is shaped by architecture, learning regime, data structure, regularization, and other contextual factors (2210.14891, 2502.12051, 2210.16859).

2. Formalizing Broken Neural Scaling Laws

To account for deviations from simple power-law trends, BNSLs introduce a more flexible, parametrized framework that can express piecewise, non-monotonic, or inflected scaling curves observed in practice. The canonical functional form for a BNSL, as formalized in Caballero et al., is:

$$y = a + b x^{-c_0} \prod_{i=1}^{n} \left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)^{-c_i f_i}$$

with $y$ as the evaluation metric (e.g., test error), $x$ as the scaled resource (e.g., model size), $a$ as irreducible error, $b, c_0$ as the initial scaling offset and exponent, and the product terms parameterizing break locations ($d_i$), changes in slope ($c_i$), and break sharpness ($f_i$) (2210.14891). This structure recovers the classic power law in the absence of breaks ($n = 0$), but can flexibly accommodate phenomena such as the following (a minimal implementation sketch appears after the list):

  • Double descent: Error or loss may first increase, then decrease with increasing scale—producing a non-monotonic "bump."
  • Sharp inflection points: Abrupt transitions in scaling slope, sometimes corresponding to emergent capabilities (e.g., arithmetic, OOD detection).
  • Plateaus: Regions in scaling where additional resources do not produce further improvement.
  • Delayed improvements: Scaling benefits initiate only after surpassing a critical resource threshold.
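
The NumPy sketch below implements the BNSL functional form above; the parameter values in the example are invented for illustration and do not come from any fitted curve.

```python
import numpy as np

def bnsl(x, a, b, c0, d, c, f):
    """BNSL of Caballero et al. (2210.14891):
    y = a + b*x^(-c0) * prod_i (1 + (x/d_i)^(1/f_i))^(-c_i*f_i).
    d, c, f hold one entry per break; empty sequences recover the
    classic power law y = a + b*x^(-c0)."""
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for d_i, c_i, f_i in zip(d, c, f):
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y

# One break at x = 1e6; the slope steepens from -0.1 to -0.4 past the break.
xs = np.logspace(3, 9, 7)
for x_val, y_val in zip(xs, bnsl(xs, a=0.05, b=5.0, c0=0.1, d=[1e6], c=[0.3], f=[0.2])):
    print(f"x={x_val:.0e}  y={y_val:.4f}")
```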

These features, consistently demonstrated in large benchmarks across vision, language, reinforcement learning, multimodal, and emergent-task domains, make BNSL fits empirically more accurate than monotonic power-law fits for both interpolation and extrapolation.

3. Practical Manifestations and Empirical Regimes

Broken scaling behaviors have been documented in:

  • Sparse models: Effective improvements plateau earlier, as increasing sparsity leads to diminishing returns (2502.12051).
  • Mixture-of-Experts (MoE) and modular routing models: Scaling the total parameter count via expert multiplication only improves performance up to a threshold, beyond which further gains saturate or diverge from dense-law predictions. Performance depends on both total parameters and per-example "activated" parameters (2502.12051).
  • Retrieval-augmented models: These models fundamentally alter scaling properties—the effective scaling variable includes both model parameters and the size of retrieval memory, breaking standard saturation and potentially beating standalone model scaling (2502.12051).
  • Multimodal and vision-language models: Competition between modalities for representational capacity yields non-additive scaling, with phase transitions between competing and synergistic scaling regimes that invalidate naive power-law superposition (2502.12051).
  • Pruning and data efficiency interventions: Data selection or pruning can yield better-than-power-law scaling (even exponential in ideal cases), in explicit contradiction to classic laws (2206.14486).
  • Architectural and optimization changes: Substituting training algorithms (e.g., replacing backpropagation with direct feedback alignment) can break scaling behavior, leading to higher loss at all scales and changing the fundamental efficiency curve (2210.14593).

Table: Examples of Empirical BNSLs and Causal Factors

| Domain/Setting | Broken Scaling Phenomenon | Causal Mechanism or Condition |
|---|---|---|
| MoE/Routed Networks | Plateau after expert scaling | Limited per-path parameterization |
| Retrieval-Augmented | Sub-linear model scaling, no saturation | Dynamic external memory function |
| Multimodal | Competition/synergy regimes | Modality interaction, phase competition |
| Data Pruning | Exponential (not power-law) scaling | Optimal information retention |
| DFA vs. Backpropagation | Shallower scaling slope, higher offset | Algorithm inefficiency |
| Downstream/OOD generalization | Non-monotonic/delayed scaling | Transfer phase, misalignment |

4. Predictability, Extrapolation, and Fundamental Limits

While BNSLs provide a superior fit for observed scaling phenomena, their increased flexibility warrants theoretical and practical caution when extrapolating:

  • Prediction limits: For extremely sharp regime changes (large $|c_i|$, small $f_i$), accurate extrapolation across future transitions is fundamentally impossible without post-transition data, marking a hard limit on the foreseeability of emergent behaviors or new phases (2210.14891); a toy demonstration follows this list.
  • Fidelity of extrapolation estimators: Methods such as the M4 estimator are designed to detect and accommodate BNSL regimes by modeling both sigmoidal transitions and asymptotic power laws, outperforming naive fits in predicting downstream or in-the-wild metrics (2209.06640).
  • Implications for AI forecasting: Accurate modeling of BNSLs is critical for projecting new capability onsets, resource budgeting, and risk management in alignment and safety contexts (2210.14891).
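
As a toy demonstration of the prediction limit above (not an implementation of the M4 estimator), the sketch below fits a plain power law to synthetic pre-break data generated from a one-break BNSL and shows the extrapolation diverging past the break. All parameter values are invented.

```python
import numpy as np
from scipy.optimize import curve_fit

def bnsl_one_break(x, a, b, c0, d, c, f):
    """One-break BNSL: y = a + b*x^(-c0) * (1 + (x/d)^(1/f))^(-c*f)."""
    return a + b * x**(-c0) * (1.0 + (x / d)**(1.0 / f))**(-c * f)

def power_law(x, a, b, c0):
    """Classic three-parameter power law."""
    return a + b * x**(-c0)

# Synthetic ground truth with one break at x = 1e6 (all values invented).
true = dict(a=0.02, b=4.0, c0=0.15, d=1e6, c=0.25, f=0.3)
x = np.logspace(3, 9, 30)
y = bnsl_one_break(x, **true)

# Fit a plain power law on pre-break data only, then extrapolate.
pre = x < 1e5
popt, _ = curve_fit(power_law, x[pre], y[pre], p0=[0.0, 1.0, 0.2], maxfev=10000)

for xq in (1e4, 1e6, 1e8):
    print(f"x={xq:.0e}: true={bnsl_one_break(xq, **true):.4f}  "
          f"pre-break power-law fit={power_law(xq, *popt):.4f}")
# The fit matches below the break but increasingly overestimates the loss
# beyond it: the break is invisible in pre-break data alone.
```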

5. Theoretical Origins and Diagnostic Use

Theoretical work situates the emergence and breakdown of scaling laws in:

  • Statistical properties of data: Scaling relies on the power-law spectral structure of empirical covariance matrices or data manifolds. Once model/data size outgrows the effective latent dimension, performance plateaus and scaling laws fail (BNSL onset) (2210.16859); a toy illustration follows this list.
  • Resource allocation: In composite or modular tasks, additive and proportional allocation of resources to subtasks underlies scaling; deviations (e.g., bottlenecks, abrupt module emergence) cause scaling breaks (2402.05164).
  • Implicit optimization bias: The learning trajectory and generalization, influenced by the implicit bias of the optimizer, drive the emergence or breakdown of scaling laws, with different architectural classes exhibiting variable transitions (2505.13230).
  • Symmetry/duality breaking: In large-N field-theoretic models, breaking symmetry between sample and feature scaling (e.g., via noise, regularization, or rich representation learning) can precipitate BNSLs, including double descent (2405.19398).
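
A deliberately simplified toy model of the first mechanism (not a derivation from the cited work): feature variances follow a truncated power-law spectrum, and a "model" capturing the top-$k$ directions leaves the tail variance as error. The spectrum exponent, cutoff, and noise floor below are invented for illustration.

```python
import numpy as np

# Feature variances follow a power law lambda_i ~ i^(-alpha), truncated at
# d_latent: beyond that, the data manifold has no further directions.
alpha, d_latent, noise_floor = 1.5, 200, 1e-3   # invented toy values
spectrum = np.arange(1, 10_001, dtype=float) ** (-alpha)
spectrum[d_latent:] = 0.0

# A "model" capturing the top-k directions leaves the tail variance as error.
for k in (10, 50, 100, 200, 400, 800):
    err = noise_floor + spectrum[k:].sum()
    print(f"capacity k={k:4d}  test error ~ {err:.5f}")
# Error falls roughly as k^(1 - alpha) while k < d_latent, then plateaus at
# the noise floor once capacity outgrows the effective latent dimension.
```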

Diagnostically, BNSLs are revealed by the following signatures (a simple detection sketch follows the list):

  • Non-monotonicity in learning curves (e.g., double descent).
  • Abrupt or sharp regime transitions (inflection or phase change).
  • Plateaus or unexpected saturation in error upon further scaling.
  • Shifts in scaling exponents associated with architecture or data change.
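
These signatures can be screened for mechanically. The sketch below applies simple heuristics to local log-log slopes of a scaling curve; the thresholds are arbitrary illustrative choices, not validated diagnostics.

```python
import numpy as np

def diagnose_scaling(x, y, slope_jump=0.15):
    """Flag simple BNSL signatures in a scaling curve (x: resource, y: error).
    Operates on local log-log slopes; thresholds are heuristic."""
    lx, ly = np.log(x), np.log(y)
    slopes = np.diff(ly) / np.diff(lx)          # local scaling exponents
    flags = []
    if np.any(np.diff(y) > 0):
        flags.append("non-monotonic (possible double descent)")
    if np.any(np.abs(np.diff(slopes)) > slope_jump):
        flags.append("abrupt slope shift (possible break/phase change)")
    if np.any(np.abs(slopes[-3:]) < 0.01):
        flags.append("plateau / saturation at large scale")
    return flags or ["consistent with a single power law"]

x = np.logspace(3, 9, 13)
y = 0.05 + 5.0 * x**-0.1 * (1 + (x / 1e6)**10)**-0.04   # curve with one break
print(diagnose_scaling(x, y))
```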

6. Impact on Model Development and Adaptive Strategies

The presence of BNSLs compels moving from universal scaling heuristics to adaptive, architecture- and context-tailored scaling strategies:

  • Compute/inference-aware scaling: When inference costs are non-trivial, optimal model sizing may deviate from power-law expectations (2502.12051); a toy sizing sketch follows this list.
  • Data curation and mixture-aware approaches: Proper data mixture and curation can enable better-than-power-law (in ideal cases exponential) scaling, while additional data of poor utility can break expected laws (2206.14486, 2502.12051).
  • Joint resource allocation: Scaling must take into account the interplay of model size, data, compute, architecture, and external resources (e.g., retrieval memory) for optimal outcomes.
  • Monitoring and diagnosing breaks: Continuous empirical assessment for phase changes, plateau onset, or non-monotonicity is essential, along with tailored remediation (e.g., reallocation, regularization, architecture modification).
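
As a toy illustration of inference-aware sizing, the sketch below picks the model size minimizing loss under a fixed FLOP budget split between training and lifetime inference. The $\sim 6ND$ training-FLOP and $\sim 2N$ per-token inference-FLOP approximations are common rules of thumb; the scaling-law constants and workload numbers are invented placeholders.

```python
import numpy as np

# Assumed fitted scaling law L(N, D) = E + A*N^(-alpha) + B*D^(-beta);
# all constants below are illustrative placeholders.
A, B, E, alpha, beta = 400.0, 410.0, 1.7, 0.34, 0.28
budget, queries, tokens_per_query = 1e23, 1e10, 500   # invented workload

def loss(N, D):
    return E + A * N**(-alpha) + B * D**(-beta)

best = None
for N in np.logspace(8, 12, 200):
    serve = 2 * N * queries * tokens_per_query   # lifetime inference FLOPs
    train = budget - serve                       # FLOPs left for training
    if train <= 0:
        continue                                 # model too big to serve
    D = train / (6 * N)                          # tokens trainable at size N
    cand = (loss(N, D), N, D)
    best = min(best, cand) if best else cand

L_opt, N_opt, D_opt = best
print(f"N* ~ {N_opt:.2e} params, D* ~ {D_opt:.2e} tokens, loss ~ {L_opt:.3f}")
```

When serving costs dominate, the optimum shifts toward smaller, longer-trained models than a pure training-compute power law would suggest, which is the deviation the bullet above describes.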

In summary, Broken Neural Scaling Laws provide a unifying formalism and predictive diagnostic for the diverse and often non-universal patterns observed in the scaling behavior of modern neural networks. Accurately modeling and responding to BNSLs is essential for effective scaling, principled extrapolation, and robust deployment across the rapidly expanding spectrum of neural architectures and learning paradigms.