
Power-Law Generalization Error Analysis

Updated 23 March 2026
  • Power-law generalization error is a phenomenon where the error decays as a power law of training samples, model parameters, or compute, mirroring the decay of data covariance spectra.
  • It leverages spectral methods like kernel ridge regression and NTK to determine bias-variance trade-offs and optimal regularization, underscoring the role of eigenstructure.
  • Recent studies extend these insights to deep neural networks and transformers, guiding resource allocation and explaining error saturation and scaling behaviors.

Power-law generalization error refers to the empirically and theoretically observed phenomenon that, in a wide range of machine learning models—notably deep neural networks and kernel methods—the generalization error decays as a power law of relevant resources such as the number of training samples, model parameters, training time, or total computation. This scaling is particularly pronounced when the spectrum of the data covariance (or the associated kernel in kernel methods) decays as a power law, imprinting its spectral exponent onto the achievable generalization curve. Recent theoretical advances and empirical studies have elucidated the universality, mechanisms, exponents, and limitations of such power-law scaling in diverse settings ranging from SGD-trained two-layer networks to infinitely wide neural networks in the kernel regime, deep regression architectures, and transformers.

1. Theoretical Origins: Spectral Power Laws and Their Consequences

The foundational mechanism behind power-law generalization error lies in the eigenstructure of the data covariance or the kernel operator. Specifically, if the spectrum $\{\lambda_i\}$ of the data covariance or reproducing kernel Hilbert space (RKHS) operator satisfies

$$\lambda_i \sim i^{-\alpha} \quad (\alpha > 1),$$

then any regression or classification error decomposes into contributions from each eigendirection, with the overall error dominated by the slowest-decaying (i.e., most “informative” or “least regularized”) tail of the spectrum.

In kernel ridge regression (KRR), with a “source” or smoothness condition on the target $f^*$ (e.g., coefficients $a_i \sim \lambda_i^{s/2}$), the optimal bias–variance trade-off for regularization parameter $\lambda \sim n^{-\theta}$ yields the minimax excess-risk rate

$$R(n) \sim n^{-\frac{\min(s,2)\,\alpha}{\min(s,2)\,\alpha+1}},$$

where $n$ is the sample size and $s$ is the Sobolev smoothness, with saturation at $s=2$ for classical KRR (Li et al., 2023, Li et al., 2024, Velikanov et al., 2024, Jin et al., 2021). For more general “analytic spectral” algorithms, including kernel gradient descent and gradient flow (corresponding in the infinite-width limit to neural networks in the NTK regime), the power-law exponents are directly governed by the spectrum and source conditions without saturation (Li et al., 2024, Velikanov et al., 2024).
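
As a sanity check on this trade-off, the rate can be reproduced numerically in a fixed-design proxy: with eigenvalues $\lambda_i = i^{-\alpha}$ and one convenient choice of source-condition coefficients ($f_i^2 = \lambda_i^s/i$, which gives bias $\sim \lambda^s$ for $s < 2$), the error at the best regularization should scale as $n^{-s\alpha/(s\alpha+1)}$. The sketch below uses illustrative constants and is not taken from any of the cited papers:

```python
import numpy as np

# Fixed-design proxy for KRR with a power-law spectrum. Illustrative
# assumptions: lambda_i = i^{-alpha}, source coefficients f_i^2 =
# lambda_i^s / i (one choice giving bias(reg) ~ reg^s), unit label noise.
# Predicted optimal rate: R(n) ~ n^{-s*alpha/(s*alpha + 1)}.
alpha, s, sigma2 = 2.0, 1.0, 1.0
i = np.arange(1, 100_001, dtype=float)
lam = i ** -alpha
f2 = lam ** s / i

def risk(n, reg):
    """Bias + variance of the ridge-regularized spectral filter."""
    bias = np.sum((reg / (lam + reg)) ** 2 * f2)
    var = sigma2 / n * np.sum((lam / (lam + reg)) ** 2)
    return bias + var

ns = np.logspace(3, 7, 9)
regs = np.logspace(-8, 0, 200)
best = np.array([min(risk(n, r) for r in regs) for n in ns])

slope = np.polyfit(np.log(ns), np.log(best), 1)[0]
predicted = -s * alpha / (s * alpha + 1)  # -2/3 for alpha=2, s=1
print(f"fitted slope {slope:.3f}, predicted {predicted:.3f}")
```

For $\alpha = 2$, $s = 1$ the fitted slope lands near the predicted $-2/3$, illustrating how the spectral exponent is imprinted on the learning curve.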

2. Power-Law Decay in Learning Curves: Regimes and Exponents

When a power-law data or kernel spectrum is present, the generalization error curve $E(n)$ or $R(n)$ as a function of sample size, model size, or compute follows explicitly computable power laws, contingent on noise and target smoothness. Representative cases include:

  • Noisy learning, optimal spectral methods: For a target with coefficient decay $c_i^2 \sim i^{-\beta-1}$, the optimal generalization rate is

$$R_n \asymp n^{-\frac{\beta}{\beta+1}}$$

universally for a broad class of spectral estimators including KRR and kernel gradient descent, both in Gaussian/Wishart and structured translation-invariant data models (Velikanov et al., 2024).

  • Saturation phenomenon: In the noiseless case ($\sigma^2 = 0$) and when target smoothness $\beta > 2\alpha$, even the optimal spectral filter cannot beat the rate $R_n \sim n^{-2\alpha}$ for large $n$; for lower smoothness $\beta < 2\alpha$, $R_n \sim n^{-\beta}$ (Velikanov et al., 2024).
  • Universal constants and regimes: Empirically, the leading-order coefficient in the power law is universal with respect to detailed model structure in the noisy case, but not in the noiseless setting, suggesting statistical universality (Velikanov et al., 2024).
  • Kernel Ridge Regression (KRR) and NTK: With eigenvalue decay $\lambda_i \sim i^{-\beta}$, the NTK (“lazy-training”) regime inherits the same bias–variance trade-off curves and exponents; benign overfitting (i.e., interpolation plus vanishing test error) is only possible if the noise $\sigma^2 = o(1)$ as $n \to \infty$ (Li et al., 2023, Li et al., 2024).
  • Dynamical regimes in SGD and early stopping: Dynamical mean-field theory (DMFT) for gradient flow or SGD on random feature models with a power-law kernel spectrum ($\rho(\lambda) \sim \lambda^{-\alpha}$) yields a generalization error decay $E_g(t) \propto t^{-(2-\alpha)}$ as a function of time/compute, up to a variance-stabilized floor (Kramp et al., 26 Feb 2026).
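
The time-decay picture in the last bullet can be illustrated with an even simpler toy: under gradient flow on a quadratic loss, the error in eigendirection $i$ decays as $e^{-2\lambda_i t}$, so with $\lambda_i = i^{-a}$ and target power $f_i^2 = i^{-b}$ the residual error scales as $t^{-(b-1)/a}$. This is a hand-rolled spectral-dynamics sketch with its own parametrization, not the DMFT computation of the cited work:

```python
import numpy as np

# Toy spectral dynamics under gradient flow: mode i has eigenvalue
# lam_i = i^{-a} and target power f_i^2 = i^{-b}; its residual error is
# f_i^2 * exp(-2 * lam_i * t). Predicted total: E(t) ~ t^{-(b-1)/a}.
# Parametrization is illustrative, not that of the cited DMFT analysis.
a, b = 2.0, 2.0
i = np.arange(1, 1_000_001, dtype=float)
lam = i ** -a
f2 = i ** -b

ts = np.logspace(2, 6, 9)
err = np.array([np.sum(f2 * np.exp(-2.0 * lam * t)) for t in ts])

slope = np.polyfit(np.log(ts), np.log(err), 1)[0]
predicted = -(b - 1) / a  # -1/2 for a = b = 2
print(f"fitted slope {slope:.3f}, predicted {predicted:.3f}")
```

The power law arises because by time $t$ only the modes with $\lambda_i \gtrsim 1/t$ have been fit, leaving a tail of unlearned directions whose mass shrinks algebraically.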

3. Power-Law Scaling Laws in Deep Neural Networks and Transformers

Empirical and theoretical studies in modern neural networks corroborate, extend, and contextualize the classic spectral results:

  • Two-layer ReLU and erf Networks: For SGD-trained two-layer networks in the student–teacher framework and power-law data spectra $\lambda_\ell \sim \ell^{-\alpha}$, the generalization error decays in training time $t$ as

$$E_{\text{gen}}(t) \sim t^{-\gamma_{\text{time}}}, \quad \gamma_{\text{time}} = \frac{\alpha-1}{\alpha},$$

after a crossover from early exponential to power-law learning (Worschech et al., 2024). For nonlinear (erf) activations with $M$ hidden units, similar exponents arise asymptotically post-symmetry-breaking, with critical crossovers and plateau behaviors governed by the spectrum and model size.

  • Deep Regression Architectures: Data-scaling exponents $\alpha_D$ in deep regression (e.g., fully connected networks, ResNets, transformers) empirically span $0.8$ to over $2$, sometimes substantially steeper than in LLMs ($<1$) (Cadez et al., 12 Sep 2025). This steeper scaling is linked to low-noise, highly structured regression targets.
  • Transformers and Compute Scaling: The generalization curve for transformers trained via SGD exhibits two regimes: an initial exponential risk decay in total compute $C$, and a late-phase statistical bound of $\Theta(C^{-1/6})$, derived via ODE/NTK approximations and optimal balancing of sample/model/training resources (Yang, 26 Dec 2025).
  • Unified scaling forms: Joint model–data scaling is captured by additive power-law forms,

$$\epsilon(M,N) = aN^{-\alpha} + bM^{-\beta} + c_\infty,$$

robust over model architectures, dataset sizes, and optimization procedures (Rosenfeld et al., 2019). These forms accurately predict error across orders of magnitude and underpin optimal resource allocation.
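
A minimal way to fit this additive form to measured errors is a grid search over the exponents with linear least squares for the coefficients; the sketch below does so on synthetic data, with all ground-truth constants chosen purely for illustration:

```python
import numpy as np

# Fit eps(M, N) = a*N^{-alpha} + b*M^{-beta} + c_inf to synthetic error
# measurements: exponents by grid search, (a, b, c_inf) by least squares.
# All ground-truth constants below are illustrative, not from the papers.
rng = np.random.default_rng(0)
a_t, b_t, al_t, be_t, c_t = 2.0, 5.0, 0.5, 0.7, 0.05
N, M = np.meshgrid(np.logspace(3, 7, 12), np.logspace(4, 8, 12))
N, M = N.ravel(), M.ravel()
eps = a_t * N ** -al_t + b_t * M ** -be_t + c_t
eps *= 1.0 + 0.01 * rng.standard_normal(eps.size)  # 1% measurement noise

grid = np.arange(0.1, 1.51, 0.02)
best_res, best_fit = np.inf, None
for al in grid:
    for be in grid:
        X = np.stack([N ** -al, M ** -be, np.ones_like(N)], axis=1)
        coef = np.linalg.lstsq(X, eps, rcond=None)[0]
        res = np.sum((X @ coef - eps) ** 2)
        if res < best_res:
            best_res, best_fit = res, (al, be, *coef)
al_hat, be_hat, a_hat, b_hat, c_hat = best_fit
print(f"alpha~{al_hat:.2f}, beta~{be_hat:.2f}, c_inf~{c_hat:.3f}")
```

Because the exponents enter nonlinearly but the coefficients enter linearly, this two-level scheme avoids a full nonlinear solver while still recovering the scaling parameters from modest pilot-run grids.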

| Scenario/domain | Error scaling behavior | Reference |
|---|---|---|
| KRR/NTK, smooth target | $R_n \sim n^{-\min(\gamma_1,\gamma_2)}$ | (Li et al., 2023) |
| KRR with power-law spectrum | $R(n) \sim n^{-\frac{\min(s,2)\,\alpha}{\min(s,2)\,\alpha+1}}$ | (Li et al., 2023) |
| SGD (quadratic feature-learning) | $R_n \sim n^{-\gamma_{\text{qua}}}$, $\gamma_{\text{qua}} = 1 - 1/\beta$ | (Ding et al., 13 Feb 2025) |
| Two-layer SGD, power-law spectrum | $E_{\text{gen}}(t) \sim t^{-(\alpha-1)/\alpha}$ | (Worschech et al., 2024) |
| Transformer, compute scaling | excess risk $\sim C^{-1/6}$ (statistical phase) | (Yang, 26 Dec 2025) |

4. Power-law Attunement in ERM and Hypothesis Spaces

Beyond spectral characterizations, power-law rates in empirical risk minimization (ERM) arise from the “attunement” exponent of the risk density,

$$\rho(r) \sim C\, r^{\alpha-1}, \quad r \to 0^{+},$$

where $\rho(r)$ is the density of hypotheses at population risk $r$ (Marcu et al., 2019). For large sample size $m$, in the realizable binary-loss case, the expected ERM error decays as $E[R_{\text{erm}}] \propto m^{-1}$ universally, with the attunement exponent $\alpha$ affecting only pre-factors. Thus, the density of low-risk hypotheses, rather than the raw VC dimension, controls the rate at which ERM generalizes, explaining the common empirical $1/m$-type laws.
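
This $1/m$ behavior is easy to probe by simulation: draw a finite hypothesis class whose risks have density $\rho(r) \propto r^{\alpha-1}$ near zero, include one zero-risk hypothesis (the realizable case), and let ERM minimize the empirical 0-1 risk. The class size, $\alpha$, and sample sizes below are arbitrary illustrative choices:

```python
import numpy as np

# Monte Carlo sketch of the attunement picture: hypothesis risks drawn
# with density rho(r) ~ r^{alpha-1} near 0 (here r = U^{1/alpha}), plus
# one zero-risk hypothesis (realizable case). ERM minimizes empirical
# 0-1 risk; ties are broken uniformly at random. The mean population
# risk of the selected hypothesis should shrink roughly like 1/m.
rng = np.random.default_rng(0)
alpha, n_hyp, n_trials = 2.0, 2000, 2000

def mean_erm_risk(m):
    total = 0.0
    for _ in range(n_trials):
        risks = rng.random(n_hyp) ** (1.0 / alpha)
        risks[0] = 0.0                       # a perfect hypothesis exists
        emp = rng.binomial(m, risks) / m     # empirical 0-1 risks, m samples
        winners = np.flatnonzero(emp == emp.min())
        total += risks[rng.choice(winners)]  # random tie-break among minima
    return total / n_trials

r_small, r_large = mean_erm_risk(50), mean_erm_risk(400)
print(f"E[R_erm]: m=50 -> {r_small:.4f}, m=400 -> {r_large:.4f}")
```

The decay rate is insensitive to $\alpha$, which only rescales how many near-zero-risk competitors survive the empirical tie with the perfect hypothesis.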

5. Implications, Practical Optimization, and Limitations

  • Resource allocation and optimization: The analytic scaling forms permit principled budgeting between data and model size. For target error $\epsilon^*$, the optimal allocation satisfies $N^{\alpha}/M^{\beta} = a\alpha/(b\beta)$ (equalizing the marginal returns $a\alpha N^{-\alpha}$ and $b\beta M^{-\beta}$), allowing for minimum-FLOP solutions, maximal useful model/data size at fixed resource, and accurate scaling projections from small pilot runs (Rosenfeld et al., 2019).
  • Role of noise and regularization: In kernel and infinite-width regimes, a fixed level of label noise sets an irreducible error (no further decay possible via regularization or increased data), and benign overfitting requires vanishing noise with $n$ (Li et al., 2023, Li et al., 2024).
  • Saturation and loss localization: When the target function is too smooth relative to the spectrum ($\beta > 2\alpha$), the achievable error rate is capped (saturation), and the error curve localizes on large-eigenvalue directions; power-law scaling is otherwise observed in the middle spectrum (Velikanov et al., 2024, Li et al., 2024).
  • Early stopping and training dynamics: In dynamical settings (SGD, Langevin dynamics), power-law decays are observed in time/sample number up to a crossover set by variance growth, leading to an optimal stopping point that achieves the predicted exponent (Kramp et al., 26 Feb 2026).
  • Empirical universality and architecture-independence: Observed exponents and scaling forms are robust across architectures (FCN, ResNet, Transformer), training algorithms (SGD, Adam), and data/label-noise regimes, reflecting a universal behavior rooted in the spectral tail (Cadez et al., 12 Sep 2025, Rosenfeld et al., 2019, Kramp et al., 26 Feb 2026).
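
The allocation rule from the first bullet can be turned into a small solver. Assuming (purely for illustration) a compute budget proportional to $N \cdot M$, minimizing $\epsilon(M,N) = aN^{-\alpha} + bM^{-\beta} + c_\infty$ subject to $NM = C$ gives the balance condition $a\alpha N^{-\alpha} = b\beta M^{-\beta}$, which a bisection finds directly; all constants below are made up:

```python
import math

# Optimal data/model split for eps(M, N) = a*N^-alpha + b*M^-beta + c_inf
# under a budget C ~ N * M (cost model assumed for illustration).
# First-order condition: a*alpha*N^-alpha == b*beta*M^-beta.
a, b, alpha, beta, c_inf = 2.0, 5.0, 0.5, 0.7, 0.05
C = 1e12  # budget in "samples times parameters" units (illustrative)

def err(N):
    return a * N ** -alpha + b * (C / N) ** -beta + c_inf

def balance(N):  # marginal-return gap; strictly decreasing in N
    return a * alpha * N ** -alpha - b * beta * (C / N) ** -beta

lo, hi = 1.0, C
for _ in range(200):          # bisection on the balance condition
    mid = math.sqrt(lo * hi)  # geometric midpoint: N spans many decades
    lo, hi = (mid, hi) if balance(mid) > 0 else (lo, mid)
N_opt = math.sqrt(lo * hi)
M_opt = C / N_opt
print(f"N* = {N_opt:.3e}, M* = {M_opt:.3e}, eps* = {err(N_opt):.4f}")
```

The geometric midpoint keeps the bisection well conditioned when $N$ ranges over many orders of magnitude; the resulting split reproduces the stated condition $N^{\alpha}/M^{\beta} = a\alpha/(b\beta)$.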

6. Open Questions and Frontiers

  • Exponents $>1$ in regression: Empirically observed exponents in deep regression ($\alpha_D > 1$) exceed those predicted by prevailing theories based on LLMs or shallow kernels, raising questions about the impact of data smoothness, model expressivity, and structured target functions in regression tasks (Cadez et al., 12 Sep 2025).
  • Beyond kernel and NTK regimes: Classical scaling law theory is best understood for infinite-width/lazy-training settings (NTK/kernel). The extension to strongly feature-learning, highly non-linear, or deep networks with non-trivial inductive biases remains an active and only partially understood research frontier (Worschech et al., 2024, Ding et al., 13 Feb 2025).
  • Interplay of resource scaling and spectral characteristics: Simultaneous scaling of model size, data, and compute, especially in regimes where the spectrum itself may evolve with these factors, is under investigation (Yang, 26 Dec 2025, Rosenfeld et al., 2019).
  • Universality in high-dimensional, structured, and real-data regimes: While theoretical universality has been demonstrated in model/dataset settings with explicitly constructed spectra, its validity in realistic, heterogeneous, or evolving data distributions is the subject of ongoing work (Velikanov et al., 2024, Kramp et al., 26 Feb 2026).

7. Summary Table of Power-Law Regimes

| Learning Setting | Error Decay Law | Exponent Formula | Reference |
|---|---|---|---|
| KRR/NTK, noisy, smooth | $R_n \sim n^{-\frac{\beta}{\beta+1}}$ | $\beta/(\beta+1)$ | (Velikanov et al., 2024) |
| SGD, two-layer, power-law | $E_{\text{gen}}(t) \sim t^{-\frac{\alpha-1}{\alpha}}$ | $(\alpha-1)/\alpha$ | (Worschech et al., 2024) |
| Kernel spectral, source $s$ | $R_n \sim n^{-s\beta/(s\beta+1)}$ | $s\beta/(s\beta+1)$ | (Li et al., 2023) |
| Transformer, compute scaling | $\Theta(C^{-1/6})$ (data-limited) | $1/6$ | (Yang, 26 Dec 2025) |
| Quadratic SGD (feature-learning) | $R_n \sim n^{-(1-1/\beta)}$ | $1-1/\beta$ | (Ding et al., 13 Feb 2025) |
| ERM, 0-1 loss, “attunement” | $E_{\text{ERM}} \sim m^{-1}$ | $1$ (prefactor varies) | (Marcu et al., 2019) |
| Noiseless, smooth, saturating | $R_n \sim n^{-2\alpha}$ | $2\alpha$ (saturation) | (Velikanov et al., 2024) |

These regimes highlight the centrality of spectral decay and resource allocation in governing generalization performance and provide an analytic framework for prediction and optimization in the era of large-scale machine learning.
