Broken Neural Scaling Laws (BNSLs)

Updated 1 July 2025
  • Broken Neural Scaling Laws (BNSLs) are systematic deviations from classic power-law scaling that exhibit non-monotonic trends, phase transitions, and plateaus in neural network performance.
  • They leverage a flexible, parameterized framework to capture double descent, sharp inflections, and delayed improvements across diverse architectures and learning tasks.
  • BNSLs inform optimal model design and resource allocation by diagnosing regime-dependent scaling behaviors that traditional models often overlook.

Broken Neural Scaling Laws (BNSLs) refer to systematic and principled departures from classical power-law scaling relationships that govern the performance of neural networks as a function of model size, dataset size, compute, or other key variables. While traditional neural scaling laws suggest a smooth, monotonic improvement in error or loss—often characterized as a power law—across increasing resources, BNSLs encapsulate the broad spectrum of non-monotonic trends, sharp phase transitions, plateaus, and regime-dependent scaling that emerge in modern architectures, tasks, and data regimes. These deviations have profound consequences for model design, resource budgeting, and the extrapolation of expected performance at scale.

1. Classical Scaling Laws and Motivation for BNSLs

Classic neural scaling laws predict that as a resource—such as the number of model parameters $N$, data points $D$, or compute $C$—grows, the evaluation metric (e.g., test error $\mathcal{L}$) falls according to a power law:

$$\mathcal{L}(P_1, \ldots, P_n) = \sum_{i=1}^{n} \alpha_i P_i^{-\beta_i}$$

where $P_i$ are scaling variables and $\alpha_i, \beta_i$ are empirically derived coefficients. This relationship underpins much of the strategy behind the upscaling of LLMs, vision transformers, and a host of other architectures, guiding compute-optimal scaling and resource allocation (2502.12051).
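
As a concrete reference point, the following minimal Python sketch evaluates a two-variable instance of this additive power law. The coefficient values are illustrative placeholders, not fitted constants from any cited work.

```python
import numpy as np

def power_law_loss(params, alphas, betas):
    """Additive multi-variable power law:
    L(P_1, ..., P_n) = sum_i alpha_i * P_i^(-beta_i)."""
    params = np.asarray(params, dtype=float)
    return float(np.sum(alphas * params ** (-betas)))

# Illustrative placeholder coefficients for model size N and data size D.
alphas = np.array([400.0, 410.0])
betas = np.array([0.34, 0.28])

for N in (1e8, 1e9, 1e10):
    print(f"N={N:.0e}, D=1e10 -> L ~ {power_law_loss([N, 1e10], alphas, betas):.4f}")
```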

However, empirical findings across domains such as vision, language, reinforcement learning, fine-tuning, and within contemporary sparse, modular, and retrieval-augmented models demonstrate regular and reproducible breakdowns of this universal law. This breakdown is often non-random: it is shaped by architecture, learning regime, data structure, regularization, and other contextual factors (2210.14891, 2502.12051, 2210.16859).

2. Formalizing Broken Neural Scaling Laws

To account for deviations from simple power-law trends, BNSLs introduce a more flexible, parametrized framework that can express piecewise, non-monotonic, or inflected scaling curves observed in practice. The canonical functional form for a BNSL, as formalized in Caballero et al., is:

$$y = a + b x^{-c_0} \prod_{i=1}^{n} \left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)^{-c_i f_i}$$

with $y$ as the evaluation metric (e.g., test error), $x$ as the scaled resource (e.g., model size), $a$ as irreducible error, $b, c_0$ as the initial scaling offset and exponent, and the product terms parameterizing break locations ($d_i$), changes in slope ($c_i$), and break sharpness ($f_i$) (2210.14891). This structure recovers the classic power law in the absence of breaks ($n = 0$), but can flexibly accommodate phenomena such as the following (a minimal implementation sketch appears after the list):

  • Double descent: Error or loss may first increase, then decrease with increasing scale—producing a non-monotonic "bump."
  • Sharp inflection points: Abrupt transitions in scaling slope, sometimes corresponding to emergent capabilities (e.g., arithmetic, OOD detection).
  • Plateaus: Regions in scaling where additional resources do not produce further improvement.
  • Delayed improvements: Scaling benefits initiate only after surpassing a critical resource threshold.
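
The NumPy sketch below implements the BNSL functional form above; the parameter values in the example are invented for illustration and do not come from any fitted curve.

```python
import numpy as np

def bnsl(x, a, b, c0, d, c, f):
    """BNSL of Caballero et al. (2210.14891):
    y = a + b*x^(-c0) * prod_i (1 + (x/d_i)^(1/f_i))^(-c_i*f_i).
    d, c, f hold one entry per break; empty sequences recover the
    classic power law y = a + b*x^(-c0)."""
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for d_i, c_i, f_i in zip(d, c, f):
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y

# One break at x = 1e6; the slope steepens from -0.1 to -0.4 past the break.
xs = np.logspace(3, 9, 7)
for x_val, y_val in zip(xs, bnsl(xs, a=0.05, b=5.0, c0=0.1, d=[1e6], c=[0.3], f=[0.2])):
    print(f"x={x_val:.0e}  y={y_val:.4f}")
```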

These features, consistently demonstrated in large benchmarks across vision, language, reinforcement learning, multimodal, and emergent-task domains, make BNSL fits empirically more accurate than monotonic power-law fits for both interpolation and extrapolation.

3. Practical Manifestations and Empirical Regimes

Broken scaling behaviors have been documented in:

  • Sparse models: Effective improvements plateau earlier, as increasing sparsity leads to diminishing returns (2502.12051).
  • Mixture-of-Experts (MoE) and modular routing models: Scaling the total parameter count via expert multiplication only improves performance up to a threshold, beyond which further gains saturate or diverge from dense-law predictions. Performance depends on both total parameters and per-example "activated" parameters (2502.12051).
  • Retrieval-augmented models: These models fundamentally alter scaling properties—the effective scaling variable includes both model parameters and the size of retrieval memory, breaking standard saturation and potentially beating standalone model scaling (2502.12051).
  • Multimodal and vision-language models: Competition between modalities for representational capacity yields non-additive scaling, with phase transitions between competing and synergistic scaling regimes that invalidate naive power-law superposition (2502.12051).
  • Pruning and data efficiency interventions: Data selection or pruning can yield better-than-power-law scaling (even exponential in ideal cases), in explicit contradiction to classic laws (2206.14486).
  • Architectural and optimization changes: Substituting training algorithms (e.g., replacing backpropagation with direct feedback alignment) can break scaling behavior, leading to higher loss at all scales and changing the fundamental efficiency curve (2210.14593).

Table: Examples of Empirical BNSLs and Causal Factors

| Domain/Setting | Broken Scaling Phenomenon | Causal Mechanism or Condition |
|---|---|---|
| MoE/Routed Networks | Plateau after expert scaling | Limited per-path parameterization |
| Retrieval-Augmented | Sub-linear model scaling, no saturation | Dynamic external memory function |
| Multimodal | Competition/synergy regimes | Modality interaction, phase competition |
| Data Pruning | Exponential (not power-law) scaling | Optimal information retention |
| DFA vs. Backpropagation | Shallower scaling slope, higher offset | Algorithm inefficiency |
| Downstream/OOD generalization | Non-monotonic/delayed scaling | Transfer phase, misalignment |

4. Predictability, Extrapolation, and Fundamental Limits

While BNSLs provide a superior fit for observed scaling phenomena, their increased flexibility warrants theoretical and practical caution when extrapolating:

  • Prediction limits: For extremely sharp regime changes (large $|c_i|$, small $f_i$), accurate extrapolation across future transitions is fundamentally impossible without post-transition data, marking a hard limit on the foreseeability of emergent behaviors or new phases (2210.14891); a toy demonstration follows this list.
  • Fidelity of extrapolation estimators: Methods such as the M4 estimator are designed to detect and accommodate BNSL regimes by modeling both sigmoidal transitions and asymptotic power laws, outperforming naive fits in predicting downstream or in-the-wild metrics (2209.06640).
  • Implications for AI forecasting: Accurate modeling of BNSLs is critical for projecting new capability onsets, resource budgeting, and risk management in alignment and safety contexts (2210.14891).
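
As a toy demonstration of the prediction limit above (not an implementation of the M4 estimator), the sketch below fits a plain power law to synthetic pre-break data generated from a one-break BNSL and shows the extrapolation diverging past the break. All parameter values are invented.

```python
import numpy as np
from scipy.optimize import curve_fit

def bnsl_one_break(x, a, b, c0, d, c, f):
    """One-break BNSL: y = a + b*x^(-c0) * (1 + (x/d)^(1/f))^(-c*f)."""
    return a + b * x**(-c0) * (1.0 + (x / d)**(1.0 / f))**(-c * f)

def power_law(x, a, b, c0):
    """Classic three-parameter power law."""
    return a + b * x**(-c0)

# Synthetic ground truth with one break at x = 1e6 (all values invented).
true = dict(a=0.02, b=4.0, c0=0.15, d=1e6, c=0.25, f=0.3)
x = np.logspace(3, 9, 30)
y = bnsl_one_break(x, **true)

# Fit a plain power law on pre-break data only, then extrapolate.
pre = x < 1e5
popt, _ = curve_fit(power_law, x[pre], y[pre], p0=[0.0, 1.0, 0.2], maxfev=10000)

for xq in (1e4, 1e6, 1e8):
    print(f"x={xq:.0e}: true={bnsl_one_break(xq, **true):.4f}  "
          f"pre-break power-law fit={power_law(xq, *popt):.4f}")
# The fit matches below the break but increasingly overestimates the loss
# beyond it: the break is invisible in pre-break data alone.
```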

5. Theoretical Origins and Diagnostic Use

Theoretical work situates the emergence and breakdown of scaling laws in:

  • Statistical properties of data: Scaling relies on the power-law spectral structure of empirical covariance matrices or data manifolds. Once model/data size outgrows the effective latent dimension, performance plateaus and scaling laws fail (BNSL onset) (2210.16859); a toy illustration follows this list.
  • Resource allocation: In composite or modular tasks, additive and proportional allocation of resources to subtasks underlies scaling; deviations (e.g., bottlenecks, abrupt module emergence) cause scaling breaks (2402.05164).
  • Implicit optimization bias: The learning trajectory and generalization, influenced by the implicit bias of the optimizer, drive the emergence or breakdown of scaling laws, with different architectural classes exhibiting variable transitions (2505.13230).
  • Symmetry/duality breaking: In large-N field-theoretic models, breaking symmetry between sample and feature scaling (e.g., via noise, regularization, or rich representation learning) can precipitate BNSLs, including double descent (2405.19398).
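
A deliberately simplified toy model of the first mechanism (not a derivation from the cited work): feature variances follow a truncated power-law spectrum, and a "model" capturing the top-$k$ directions leaves the tail variance as error. The spectrum exponent, cutoff, and noise floor below are invented for illustration.

```python
import numpy as np

# Feature variances follow a power law lambda_i ~ i^(-alpha), truncated at
# d_latent: beyond that, the data manifold has no further directions.
alpha, d_latent, noise_floor = 1.5, 200, 1e-3   # invented toy values
spectrum = np.arange(1, 10_001, dtype=float) ** (-alpha)
spectrum[d_latent:] = 0.0

# A "model" capturing the top-k directions leaves the tail variance as error.
for k in (10, 50, 100, 200, 400, 800):
    err = noise_floor + spectrum[k:].sum()
    print(f"capacity k={k:4d}  test error ~ {err:.5f}")
# Error falls roughly as k^(1 - alpha) while k < d_latent, then plateaus at
# the noise floor once capacity outgrows the effective latent dimension.
```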

Diagnostically, BNSLs are revealed by the following signatures (a simple detection sketch follows the list):

  • Non-monotonicity in learning curves (e.g., double descent).
  • Abrupt or sharp regime transitions (inflection or phase change).
  • Plateaus or unexpected saturation in error upon further scaling.
  • Shifts in scaling exponents associated with architecture or data change.
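
These signatures can be screened for mechanically. The sketch below applies simple heuristics to local log-log slopes of a scaling curve; the thresholds are arbitrary illustrative choices, not validated diagnostics.

```python
import numpy as np

def diagnose_scaling(x, y, slope_jump=0.15):
    """Flag simple BNSL signatures in a scaling curve (x: resource, y: error).
    Operates on local log-log slopes; thresholds are heuristic."""
    lx, ly = np.log(x), np.log(y)
    slopes = np.diff(ly) / np.diff(lx)          # local scaling exponents
    flags = []
    if np.any(np.diff(y) > 0):
        flags.append("non-monotonic (possible double descent)")
    if np.any(np.abs(np.diff(slopes)) > slope_jump):
        flags.append("abrupt slope shift (possible break/phase change)")
    if np.any(np.abs(slopes[-3:]) < 0.01):
        flags.append("plateau / saturation at large scale")
    return flags or ["consistent with a single power law"]

x = np.logspace(3, 9, 13)
y = 0.05 + 5.0 * x**-0.1 * (1 + (x / 1e6)**10)**-0.04   # curve with one break
print(diagnose_scaling(x, y))
```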

6. Impact on Model Development and Adaptive Strategies

The presence of BNSLs compels moving from universal scaling heuristics to adaptive, architecture- and context-tailored scaling strategies:

  • Compute/inference-aware scaling: When inference costs are non-trivial, optimal model sizing may deviate from power-law expectations (2502.12051); a toy sizing sketch follows this list.
  • Data curation and mixture-aware approaches: Proper data mixture and curation can enable better-than-power-law (in ideal cases exponential) scaling, while additional data of poor utility can break expected laws (2206.14486, 2502.12051).
  • Joint resource allocation: Scaling must take into account the interplay of model size, data, compute, architecture, and external resources (e.g., retrieval memory) for optimal outcomes.
  • Monitoring and diagnosing breaks: Continuous empirical assessment for phase changes, plateau onset, or non-monotonicity is essential, along with tailored remediation (e.g., reallocation, regularization, architecture modification).
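
As a toy illustration of inference-aware sizing, the sketch below picks the model size minimizing loss under a fixed FLOP budget split between training and lifetime inference. The $\sim 6ND$ training-FLOP and $\sim 2N$ per-token inference-FLOP approximations are common rules of thumb; the scaling-law constants and workload numbers are invented placeholders.

```python
import numpy as np

# Assumed fitted scaling law L(N, D) = E + A*N^(-alpha) + B*D^(-beta);
# all constants below are illustrative placeholders.
A, B, E, alpha, beta = 400.0, 410.0, 1.7, 0.34, 0.28
budget, queries, tokens_per_query = 1e23, 1e10, 500   # invented workload

def loss(N, D):
    return E + A * N**(-alpha) + B * D**(-beta)

best = None
for N in np.logspace(8, 12, 200):
    serve = 2 * N * queries * tokens_per_query   # lifetime inference FLOPs
    train = budget - serve                       # FLOPs left for training
    if train <= 0:
        continue                                 # model too big to serve
    D = train / (6 * N)                          # tokens trainable at size N
    cand = (loss(N, D), N, D)
    best = min(best, cand) if best else cand

L_opt, N_opt, D_opt = best
print(f"N* ~ {N_opt:.2e} params, D* ~ {D_opt:.2e} tokens, loss ~ {L_opt:.3f}")
```

When serving costs dominate, the optimum shifts toward smaller, longer-trained models than a pure training-compute power law would suggest, which is the deviation the bullet above describes.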

In summary, Broken Neural Scaling Laws provide a unifying formalism and predictive diagnostic for the diverse and often non-universal patterns observed in the scaling behavior of modern neural networks. Accurately modeling and responding to BNSLs is essential for effective scaling, principled extrapolation, and robust deployment across the rapidly expanding spectrum of neural architectures and learning paradigms.