Power-Law Generalization Error Analysis
- Power-law generalization error is a phenomenon where the error decays as a power law of training samples, model parameters, or compute, mirroring the decay of data covariance spectra.
- It leverages spectral methods like kernel ridge regression and NTK to determine bias-variance trade-offs and optimal regularization, underscoring the role of eigenstructure.
- Recent studies extend these insights to deep neural networks and transformers, guiding resource allocation and explaining error saturation and scaling behaviors.
Power-law generalization error refers to the empirically and theoretically observed phenomenon that, in a wide range of machine learning models—notably deep neural networks and kernel methods—the generalization error decays as a power law of relevant resources such as the number of training samples, model parameters, training time, or total computation. This scaling is particularly pronounced when the spectrum of the data covariance (or the associated kernel in kernel methods) decays as a power law, imprinting its spectral exponent onto the achievable generalization curve. Recent theoretical advances and empirical studies have elucidated the universality, mechanisms, exponents, and limitations of such power-law scaling in diverse settings ranging from SGD-trained two-layer networks to infinitely wide neural networks in the kernel regime, deep regression architectures, and transformers.
1. Theoretical Origins: Spectral Power Laws and Their Consequences
The foundational mechanism behind power-law generalization error lies in the eigenstructure of the data covariance or the kernel operator. Specifically, if the spectrum of the data covariance or reproducing kernel Hilbert space (RKHS) operator satisfies a power-law decay
$$\lambda_k \asymp k^{-\beta}, \qquad \beta > 1,$$
then any regression or classification error decomposes into contributions from each eigendirection, with the overall error dominated by the slowest-decaying (i.e., most “informative” or “least regularized”) tail of the spectrum.
In kernel ridge regression (KRR), with a “source” or smoothness condition of order $s$ on the target (e.g., coefficients $\langle f, \phi_k \rangle^2 \asymp k^{-s\beta-1}$), the optimal bias-variance trade-off for the regularization parameter $\lambda \asymp n^{-\beta/(s\beta+1)}$ yields the minimax excess-risk rate
$$\varepsilon(n) \asymp n^{-s\beta/(s\beta+1)},$$
where $n$ is the sample size and $s$ is the Sobolev smoothness, with saturation at $s = 2$ for classical KRR (Li et al., 2023, Li et al., 2024, Velikanov et al., 2024, Jin et al., 2021). For more general “analytic spectral” algorithms, including kernel gradient descent and gradient flow (corresponding in the infinite-width limit to neural networks in the NTK regime), the power-law exponents are directly governed by the spectrum and source conditions without saturation (Li et al., 2024, Velikanov et al., 2024).
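The bias-variance mechanism behind this rate can be checked numerically. The sketch below is an illustrative toy model, not code from any of the cited works: it assumes kernel eigenvalues $\lambda_k = k^{-\beta}$ and borderline source coefficients $c_k^2 = k^{-(s\beta+1)}$, evaluates the ridge-shrinkage risk mode by mode in the eigenbasis, tunes the regularization by grid search, and fits the log-log slope of the resulting learning curve, which should approach $s\beta/(s\beta+1)$:

```python
import numpy as np

# Toy check of the KRR minimax rate n^{-s*beta/(s*beta+1)}.
# Assumptions (illustrative): eigenvalues lam_k = k^{-beta}, target
# coefficients c_k^2 = k^{-(s*beta+1)} (borderline source condition of
# order s), label noise variance sigma2.
beta, s, sigma2 = 2.0, 1.0, 1.0
k = np.arange(1, 100_001, dtype=float)
lam = k ** -beta
c2 = k ** -(s * beta + 1)

def risk(n, reg):
    """Bias-variance decomposition of ridge shrinkage in the eigenbasis."""
    bias2 = np.sum(c2 * (reg / (lam + reg)) ** 2)
    var = (sigma2 / n) * np.sum((lam / (lam + reg)) ** 2)
    return bias2 + var

ns = np.logspace(2, 5, 10)
# Oracle-tuned regularization: best lambda on a log grid for each n.
errs = [min(risk(n, r) for r in np.logspace(-6, 0, 60)) for n in ns]
slope, _ = np.polyfit(np.log(ns), np.log(errs), 1)
print(f"fitted exponent {-slope:.2f}; predicted {s*beta/(s*beta+1):.2f}")
```

With $\beta = 2$, $s = 1$ the predicted exponent is $2/3$, and the fitted slope should land close to that value.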
2. Power-Law Decay in Learning Curves: Regimes and Exponents
When a power-law data or kernel spectrum is present, the generalization error $\varepsilon$, as a function of sample size, model size, or compute, follows explicitly computable power laws, contingent on noise and target smoothness. Representative cases include:
- Noisy learning, optimal spectral methods: For a target with coefficient decay of source order $s$, the optimal generalization rate is $\varepsilon(n) \asymp n^{-s\beta/(s\beta+1)}$, universally for a broad class of spectral estimators including KRR and kernel gradient descent, both in Gaussian/Wishart and structured translation-invariant data models (Velikanov et al., 2024).
- Saturation phenomenon: In the noiseless case ($\sigma^2 = 0$) and when the target smoothness exceeds the saturation threshold ($s \geq 2$), even the optimal spectral filter cannot improve on the rate attained at $s = 2$ for large $n$; for lower smoothness $s < 2$, the exponent continues to grow with $s$ (Velikanov et al., 2024).
- Universal constants and regimes: Empirically, the leading-order coefficient in the power law is universal with respect to detailed model structure in the noisy case, but not in the noiseless setting, suggesting statistical universality (Velikanov et al., 2024).
- Kernel Ridge Regression (KRR) and NTK: With eigenvalue decay $\lambda_k \asymp k^{-\beta}$, the NTK (“lazy-training”) regime inherits the same bias–variance trade-off curves and exponents; benign overfitting (i.e., interpolation + vanishing test error) is only possible if the noise level vanishes as $n \to \infty$ (Li et al., 2023, Li et al., 2024).
- Dynamical regimes in SGD and early stopping: Dynamical mean-field theory (DMFT) for gradient flow or SGD on random feature models with a power-law kernel spectrum ($\lambda_k \asymp k^{-\beta}$) yields a power-law generalization error decay as a function of time/compute, up to a variance-stabilized floor (Kramp et al., 26 Feb 2026).
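The dynamical power law can be illustrated with a deliberately simplified model (my own construction, not the DMFT analysis of the cited work): noiseless kernel gradient flow diagonalizes in the eigenbasis, so mode $k$ relaxes as $e^{-\lambda_k t}$, and with $\lambda_k = k^{-\beta}$ and $c_k^2 = k^{-(s\beta+1)}$ the loss decays as $t^{-s}$ until finite-size or variance effects set a floor:

```python
import numpy as np

# Toy model of dynamical power-law decay: noiseless kernel gradient flow
# in the eigenbasis.  Mode k has residual c_k * exp(-lam_k * t), so the
# loss is sum_k c_k^2 exp(-2 lam_k t), which decays ~ t^{-s}.
beta, s = 2.0, 1.0
k = np.arange(1, 200_001, dtype=float)
lam = k ** -beta
c2 = k ** -(s * beta + 1)

ts = np.logspace(1, 4, 12)
loss = [np.sum(c2 * np.exp(-2.0 * lam * t)) for t in ts]
slope, _ = np.polyfit(np.log(ts), np.log(loss), 1)
print(f"time exponent {-slope:.2f} (predicted s = {s})")
```

The fitted time exponent should be close to $s$ (here $1$), mirroring the sample-size exponent up to the usual time/compute bookkeeping.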
3. Power-Law Scaling Laws in Deep Neural Networks and Transformers
Empirical and theoretical studies in modern neural networks corroborate, extend, and contextualize the classic spectral results:
- Two-layer ReLU and erf Networks: For SGD-trained two-layer networks in the student-teacher framework and power-law data spectra $\lambda_k \asymp k^{-\beta}$, the generalization error decays as a power law in the number of training examples after a crossover from early exponential to power-law learning (Worschech et al., 2024). For nonlinear (erf) activations with a finite number of hidden units, similar exponents arise asymptotically post-symmetry-breaking, with critical crossovers and plateau behaviors governed by the spectrum and model size.
- Deep Regression Architectures: Data-scaling exponents in deep regression (e.g., fully connected networks, ResNets, transformers) empirically span $0.8$ to over $2$, sometimes substantially steeper than in LLMs (where exponents are typically well below $1$) (Cadez et al., 12 Sep 2025). This steeper scaling is linked to low-noise, highly structured regression targets.
- Transformers and Compute Scaling: The generalization curve for transformers trained via SGD exhibits two regimes: an initial exponential risk decay in total compute $C$, and a late-phase statistical bound of order $C^{-1/6}$, derived via ODE/NTK approximations and optimal balancing of sample/model/training resources (Yang, 26 Dec 2025).
- Unified scaling forms: Joint model–data scaling is captured by additive power-law forms such as
$$\varepsilon(n, m) \approx a\, n^{-\alpha} + b\, m^{-\beta} + c_{\infty},$$
with $n$ the dataset size and $m$ the model size, robust over model architectures, dataset sizes, and optimization procedures (Rosenfeld et al., 2019). These forms accurately predict error across orders of magnitude and underpin optimal resource allocation.
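Such additive forms can be fit from a handful of pilot measurements. The following sketch uses synthetic data with made-up constants (not values from any cited study) and recovers the exponents by a grid search over $(\alpha, \beta)$ with a linear least-squares solve for the amplitudes at each grid point:

```python
import numpy as np

# Recover eps(n, m) ~ a*n^-alpha + b*m^-beta + c_inf from noisy
# synthetic "pilot run" measurements.  All constants are illustrative.
rng = np.random.default_rng(0)
a, alpha, b, beta, c_inf = 5.0, 0.5, 3.0, 0.3, 0.05

ns = np.logspace(3, 6, 8)                      # dataset sizes
ms = np.logspace(4, 7, 8)                      # model sizes
N, M = [x.ravel() for x in np.meshgrid(ns, ms)]
eps = a * N**-alpha + b * M**-beta + c_inf
eps = eps * (1 + 0.005 * rng.standard_normal(eps.shape))  # 0.5% noise

best = (np.inf, None)
for al in np.arange(0.1, 1.01, 0.05):          # grid over exponents
    for be in np.arange(0.1, 1.01, 0.05):
        X = np.column_stack([N**-al, M**-be, np.ones_like(N)])
        coef, *_ = np.linalg.lstsq(X, eps, rcond=None)  # fit (a, b, c_inf)
        sse = np.sum((X @ coef - eps) ** 2)
        if sse < best[0]:
            best = (sse, (al, be, *coef))

al_hat, be_hat = best[1][0], best[1][1]
print("recovered exponents:", round(float(al_hat), 2), round(float(be_hat), 2))
```

Separating the nonlinear exponents (grid search) from the linear amplitudes (least squares) keeps the fit deterministic and avoids the local minima of a fully nonlinear solver.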
| Scenario/domain | Error scaling behavior | Reference |
|---|---|---|
| KRR/NTK, smooth target | $\varepsilon(n) \asymp n^{-s\beta/(s\beta+1)}$ | (Li et al., 2023) |
| KRR with power-law spectrum | $\varepsilon(n) \asymp n^{-\min(s,2)\beta/(\min(s,2)\beta+1)}$ | (Li et al., 2023) |
| SGD (quadratic feature-learning) | power law in samples/steps, spectrum-dependent exponent | (Ding et al., 13 Feb 2025) |
| Two-layer SGD, power-law data | power law in sample size after exponential transient | (Worschech et al., 2024) |
| Transformer, compute scaling | $C^{-1/6}$ excess risk (statistical phase) | (Yang, 26 Dec 2025) |
4. Power-Law Attunement in ERM and Hypothesis Spaces
Beyond spectral characterizations, power-law rates in empirical risk minimization (ERM) arise from the “attunement” exponent $\eta$ of the risk density, $\rho(\varepsilon) \propto \varepsilon^{\eta}$ as $\varepsilon \to 0$, where $\rho(\varepsilon)$ is the density of hypotheses with population risk $\varepsilon$ (Marcu et al., 2019). For large sample size $m$, in the realizable binary-loss case, the expected ERM error decays as $\Theta(1/m)$ universally, with the attunement exponent affecting only pre-factors. Thus, the density of low-risk hypotheses, rather than the raw VC dimension, controls the rate at which ERM generalizes, explaining the common empirical $1/m$-type laws.
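A toy numerical check of this claim (a simplified construction, not the cited paper's derivation): take hypothesis risks distributed with density $\rho(r) \propto r^{\eta}$, note that a hypothesis of risk $r$ is consistent with $m$ i.i.d. examples with probability $(1-r)^m$ in the realizable 0-1 loss setting, and compute the expected risk of a surviving hypothesis by quadrature. The product $m \cdot \mathbb{E}[\mathrm{risk}]$ should approach $\eta + 1$: a universal $1/m$ law with the attunement exponent only in the prefactor:

```python
import numpy as np

# Toy attunement model: risk density rho(r) ∝ r^eta near 0; in the
# realizable 0-1 loss setting a hypothesis of risk r survives m iid
# examples with probability (1-r)^m.  The expected surviving risk is
# ~ (eta+1)/m, i.e. a 1/m law with eta only in the prefactor.
r = np.linspace(0.0, 1.0, 200_001)[1:]      # quadrature grid on (0, 1]
for eta in [0.0, 1.0, 3.0]:
    rho = (eta + 1.0) * r**eta              # normalized risk density
    for m in [100, 1000, 10000]:
        w = rho * (1.0 - r) ** m            # surviving-hypothesis weight
        exp_risk = np.sum(r * w) / np.sum(w)  # E[risk | consistent]
        print(f"eta={eta:.0f}  m={m:>5}  m*E[risk]={m * exp_risk:.2f}")
```

Exactly, the ratio of Beta-function moments gives $(\eta+1)/(\eta+m+2)$, so $m \cdot \mathbb{E}[\mathrm{risk}] \to \eta + 1$ as $m$ grows.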
5. Implications, Practical Optimization, and Limitations
- Resource allocation and optimization: The analytic scaling forms permit principled budgeting between data and model size. For a target error $\varepsilon$, the optimal allocation balances the data and model terms of the additive scaling law so that neither resource is the sole bottleneck, allowing for minimum-FLOP solutions, maximal useful model/data size at fixed resources, and accurate scaling projections from small pilot runs (Rosenfeld et al., 2019).
- Role of noise and regularization: In kernel and infinite-width regimes, a fixed level of label noise sets an irreducible error floor (no further decay is possible via regularization or increased data), and benign overfitting requires the noise level to vanish as $n \to \infty$ (Li et al., 2023, Li et al., 2024).
- Saturation and loss localization: When the target function is too smooth relative to the spectrum ($s \geq 2$), the achievable error rate is capped (saturation), and the error curve localizes on large-eigenvalue directions; power-law scaling is otherwise observed in the middle of the spectrum (Velikanov et al., 2024, Li et al., 2024).
- Early stopping and training dynamics: In dynamical settings (SGD, Langevin dynamics), power-law decays are observed in time/sample number up to a crossover set by variance growth, leading to an optimal stopping point that achieves the predicted exponent (Kramp et al., 26 Feb 2026).
- Empirical universality and architecture-independence: Observed exponents and scaling forms are robust across architectures (FCN, ResNet, Transformer), training algorithms (SGD, Adam), and data/label-noise regimes, reflecting a universal behavior rooted in the spectral tail (Cadez et al., 12 Sep 2025, Rosenfeld et al., 2019, Kramp et al., 26 Feb 2026).
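As a concrete (hypothetical) instance of the budgeting argument, suppose error follows the additive form $\varepsilon = a n^{-\alpha} + b m^{-\beta}$ and compute cost is $C = n \cdot m$; all constants below are made up for illustration. Sweeping the data/model split for each budget, the numerical optimum should satisfy the first-order balance condition $\alpha a n^{-\alpha} \approx \beta b m^{-\beta}$, i.e. neither term dominates:

```python
import numpy as np

# Compute-optimal allocation under an additive scaling law, assuming
# (hypothetically) cost C = n * m and eps = a*n^-alpha + b*m^-beta.
a, alpha, b, beta = 5.0, 0.5, 3.0, 0.3

for C in [1e8, 1e10, 1e12]:
    n = np.logspace(1, np.log10(C) - 1, 400)  # candidate sample counts
    m = C / n                                  # rest of the budget -> model
    eps = a * n**-alpha + b * m**-beta
    i = int(np.argmin(eps))                    # best split on the grid
    # Lagrange balance condition: should be ~1 at the optimum.
    ratio = (alpha * a * n[i]**-alpha) / (beta * b * m[i]**-beta)
    print(f"C={C:.0e}: n*={n[i]:.2e}  m*={m[i]:.2e}  "
          f"eps*={eps[i]:.4f}  balance={ratio:.2f}")
```

The balance ratio hovering near $1$ across budgets is exactly the "neither resource is the bottleneck" rule that underlies minimum-FLOP allocation.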
6. Open Questions and Frontiers
- Exponents in regression: Empirically observed exponents in deep regression ($0.8$ to over $2$) exceed those predicted by prevailing theories based on LLMs or shallow kernels, raising questions about the impact of data smoothness, model expressivity, and structured target functions in regression tasks (Cadez et al., 12 Sep 2025).
- Beyond kernel and NTK regimes: Classical scaling law theory is best understood for infinite-width/lazy-training settings (NTK/kernel). The extension to strongly feature-learning, highly non-linear, or deep networks with non-trivial inductive biases remains an active and only partially understood research frontier (Worschech et al., 2024, Ding et al., 13 Feb 2025).
- Interplay of resource scaling and spectral characteristics: Simultaneous scaling of model size, data, and compute, especially in regimes where the spectrum itself may evolve with these factors, is under investigation (Yang, 26 Dec 2025, Rosenfeld et al., 2019).
- Universality in high-dimensional, structured, and real-data regimes: While theoretical universality has been demonstrated in model/dataset settings with explicitly constructed spectra, its validity in realistic, heterogeneous, or evolving data distributions is the subject of ongoing work (Velikanov et al., 2024, Kramp et al., 26 Feb 2026).
7. Summary Table of Power-Law Regimes
| Learning Setting | Error Decay Law | Exponent Formula | Reference |
|---|---|---|---|
| KRR/NTK, noisy, smooth target | $\varepsilon(n) \asymp n^{-s\beta/(s\beta+1)}$ | $s\beta/(s\beta+1)$ | (Velikanov et al., 2024) |
| SGD, two-layer, power-law data | power law in sample size after exponential transient | spectrum-dependent | (Worschech et al., 2024) |
| Kernel spectral, source condition | $\varepsilon(n) \asymp n^{-s\beta/(s\beta+1)}$, saturating for $s \geq 2$ | $s\beta/(s\beta+1)$ | (Li et al., 2023) |
| Transformer, compute scaling | $\varepsilon \asymp C^{-1/6}$ (data-limited) | $1/6$ | (Yang, 26 Dec 2025) |
| Quadratic SGD (feature-learning) | power law in samples/steps | spectrum-dependent | (Ding et al., 13 Feb 2025) |
| ERM, 0-1 loss, “attunement” | $\Theta(1/m)$ in sample size $m$ | $1$ (prefactor varies) | (Marcu et al., 2019) |
| Noiseless, smooth, saturating | rate capped at its $s = 2$ value (saturation) | saturated exponent | (Velikanov et al., 2024) |
These regimes highlight the centrality of spectral decay and resource allocation in governing generalization performance and provide an analytic framework for prediction and optimization in the era of large-scale machine learning.