Spectral Properties & NTK Analysis
- Spectral Properties and NTK Analysis is a framework that decomposes kernels into eigenmodes to understand learning dynamics and generalization in wide neural networks.
- The analysis shows that eigenvalue decay drives spectral bias, with low-frequency modes learned first and high-frequency modes determining phase transitions.
- Studies across architectures like fully connected networks, ResNets, and CNNs demonstrate how spectral properties govern training stages and inform generalization bounds.
Spectral properties and Neural Tangent Kernel (NTK) analysis provide a rigorous mathematical framework for understanding learning dynamics, generalization, and inductive biases in wide neural networks. This perspective revolves around the eigendecomposition of kernels—especially the NTK—associated with neural architectures, which governs how different components of target functions are learned under gradient-based optimization. Key phenomena such as spectral bias, learning stages, and generalization bounds are direct consequences of the spectral properties of the underlying kernels. Below, the main principles, methods, results, and their implications are synthesized from contemporary research.
1. Kernel Spectral Decomposition and Eigenstructure
The NTK, like other associated kernels (e.g., the Conjugate Kernel and general Mercer kernels), admits a spectral decomposition governed by Mercer's theorem when defined on a compact domain. For a rotation-invariant kernel on the sphere:
$$K(x, x') = \sum_{k=0}^{\infty} \lambda_k \sum_{m=1}^{N(d,k)} Y_{km}(x)\, Y_{km}(x')$$
where $Y_{km}$ are degree-$k$ spherical harmonics on $\mathbb{S}^{d-1}$, $N(d,k)$ their multiplicity (degeneracy), and $\lambda_k$ the eigenvalue for the $k$th frequency (Bordelon et al., 2020, Cao et al., 2019). For dot-product and neural tangent kernels, the spectral decomposition aligns with classical harmonics (e.g., Hermite polynomials for Gaussian input, Gegenbauer polynomials for the sphere, Boolean harmonics for the hypercube) (Yang et al., 2019, Dandi et al., 2021). The eigenspectrum quantitatively encodes which function spaces are "preferred" (fit more rapidly) by a given kernel.
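As a minimal numerical sketch of such a decomposition, the snippet below builds the Gram matrix of a two-layer ReLU NTK on equally spaced points of the unit circle (the $d = 2$ case) and checks that its eigenvalues coincide with the discrete Fourier transform of one row, so that Fourier modes play the role of the spherical harmonics. The kernel normalization, the function name `relu_ntk_two_layer`, and the grid size are assumptions made for illustration and are not taken from the cited papers.

```python
import numpy as np

def relu_ntk_two_layer(u):
    """Bias-free two-layer ReLU NTK as a function of u = <x, x'> on the unit
    sphere (one common normalization, assumed here for illustration)."""
    u = np.clip(u, -1.0, 1.0)
    theta = np.arccos(u)
    kappa0 = (np.pi - theta) / np.pi                                # degree-0 arc-cosine kernel
    kappa1 = (u * (np.pi - theta) + np.sqrt(1.0 - u**2)) / np.pi    # degree-1 arc-cosine kernel
    return u * kappa0 + kappa1

n = 256
angles = 2.0 * np.pi * np.arange(n) / n
# Gram matrix on the circle: entry (i, j) depends only on the angle difference.
K = relu_ntk_two_layer(np.cos(angles[:, None] - angles[None, :]))

# K is circulant, so its eigenvectors are Fourier modes and its eigenvalues
# are the DFT of its first row.
fourier_eigs = np.fft.fft(K[0]).real
dense_eigs = np.linalg.eigvalsh(K)   # ascending order
```

Dividing `fourier_eigs` by `n` approximates the eigenvalues of the kernel integral operator under the uniform measure. Entries `k` and `n - k` agree, reflecting the two-fold (cosine and sine) degeneracy of each nonzero frequency on the circle.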
2. Spectral Bias and Learning Dynamics
Wide neural networks, and their NTK-driven linearized training, exhibit a characteristic spectral bias: eigenmodes (features) corresponding to larger kernel eigenvalues are fit more rapidly under gradient descent. Concretely, for kernel regression or NTK gradient flow with learning rate $\eta$, the coefficient $c_k(t)$ on the $k$th eigenfunction relaxes toward its target value $c_k^{*}$ as:
$$c_k(t) - c_k^{*} = \left(c_k(0) - c_k^{*}\right) e^{-\eta \lambda_k t}$$
where learning speed in each direction is dictated by $\lambda_k$ (Cao et al., 2019, Bordelon et al., 2020). Empirically and theoretically, low-frequency components (larger $\lambda_k$) are fit and generalized earlier, whereas the error on high-frequency components (small $\lambda_k$) decays slowly, leading to distinct learning stages or phase transitions as sample size or training time grows (Bordelon et al., 2020, Fan et al., 2020).
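The mode-wise dynamics can be checked numerically. The sketch below runs function-space gradient descent on a small Gram matrix and compares the residual along each eigenvector with the closed-form per-step shrink factor $(1 - \eta\lambda_k)$, the discrete-time counterpart of exponential relaxation at rate $\eta\lambda_k$. The RBF kernel is only a stand-in for an NTK Gram matrix, and the data, step size, and step count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
X = rng.normal(size=(n, 3))
# Any symmetric PSD Gram matrix works; an RBF kernel stands in for the NTK here.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)
y = rng.normal(size=n)

eigvals, eigvecs = np.linalg.eigh(K)     # ascending eigenvalues
eta = 1.0 / eigvals.max()                # step size that keeps every mode stable
steps = 50

# Gradient descent on the training outputs: f <- f - eta * K (f - y), from f = 0.
f = np.zeros(n)
for _ in range(steps):
    f = f - eta * K @ (f - y)

# Closed form: the residual along eigenvector k shrinks by (1 - eta * lambda_k) per step.
shrink = (1.0 - eta * eigvals) ** steps
predicted = shrink * (eigvecs.T @ (np.zeros(n) - y))
simulated = eigvecs.T @ (f - y)
```

After the loop, `shrink` is close to zero for the largest eigenvalue and close to one for the smallest, which is the spectral bias in its simplest form.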
3. Influence of Network Architecture on Spectral Properties
The NTK spectrum is controlled jointly by depth, architecture, and activation function:
- Fully Connected ReLU Networks: The NTK is a rotationally invariant kernel whose spectrum decays polynomially in frequency. For the $k$th spherical harmonic mode, $\lambda_k \sim d^{-k}$ in the high-dimensional regime (Bordelon et al., 2020, Yang et al., 2019). Increasing network depth "whitens" the spectrum, slowing the eigenvalue decay and enabling faster learning of high-frequency modes.
- Residual Networks: The ResNet-NTK possesses the same eigenfunctions (spherical harmonics) and retains polynomial decay $\lambda_k \sim k^{-d}$. The spectrum's spikiness (localization near the diagonal and underrepresentation of mid-frequencies) can be controlled by adjusting the skip-connection hyperparameter (Belfer et al., 2021).
- Polynomial Nets (PNNs): PNNs equipped with Hadamard products display a much slower eigenvalue decay than standard fully connected networks (the degree-dependent rates are given in Wu et al., 2022), leading to more efficient learning of high-frequency modes and enhanced extrapolation capabilities beyond the support of training data (Wu et al., 2022).
- Convolutional Architectures: For neural tangent kernels derived from CNNs, the eigenfunctions are products of spherical harmonics over the channel and spatial dimensions. Eigenvalue decay can be quantified and is slower for spatially localized (few-pixel) patterns, giving CNNs a localized, hierarchical spectral bias and superior sample efficiency for local dependencies (Geifman et al., 2022).
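Decay rates such as those listed above are often compared through a log-log fit. The following sketch estimates a power-law exponent from a list of eigenvalues by least squares; the helper name `decay_exponent` and the two synthetic spectra are hypothetical and do not correspond to any specific architecture.

```python
import numpy as np

def decay_exponent(freqs, eigvals):
    """Least-squares slope of log(eigenvalue) against log(frequency), sign-flipped,
    so that eigvals ~ freqs**(-a) returns approximately a."""
    slope, _intercept = np.polyfit(np.log(freqs), np.log(eigvals), deg=1)
    return -slope

k = np.arange(1, 200, dtype=float)
slow = k ** -2.0     # hypothetical spectrum with slower decay
fast = k ** -4.0     # hypothetical spectrum with faster decay

slow_rate = decay_exponent(k, slow)
fast_rate = decay_exponent(k, fast)
```

On these exact power laws the fit recovers exponents of approximately 2 and 4. On empirical spectra the estimate depends on the frequency range chosen, so it is best read as a rough comparison tool.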
4. Spectral Decomposition and Generalization Error
The generalization error (for kernel regression or NTK-based training) admits an explicit mode-wise decomposition via the kernel eigenspectrum (Bordelon et al., 2020, Mysore et al., 9 Dec 2025):
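$$E_g = \sum_{\rho} E_\rho, \qquad E_\rho = \frac{\bar{w}_\rho^{2}}{\lambda_\rho}\left(\frac{1}{\lambda_\rho} + \frac{p}{\lambda + t}\right)^{-2}\left(1 - \frac{p\,\gamma}{(\lambda + t)^{2}}\right)^{-1}$$
$$t = \sum_{\rho}\left(\frac{1}{\lambda_\rho} + \frac{p}{\lambda + t}\right)^{-1}, \qquad \gamma = \sum_{\rho}\left(\frac{1}{\lambda_\rho} + \frac{p}{\lambda + t}\right)^{-2}$$
where $\rho$ indexes individual eigenmodes, $\bar{w}_\rho$ is the target's coefficient on mode $\rho$, $\lambda$ is the ridge parameter, each mode's error decays with sample size $p$ governed by its eigenvalue $\lambda_\rho$, and $t$, $\gamma$ are self-consistently determined by the spectrum. In the ridgeless limit ($\lambda \to 0$), mode-wise error simplifies:
$$E_\rho = \frac{\bar{w}_\rho^{2}}{\lambda_\rho}\left(\frac{1}{\lambda_\rho} + \frac{p}{t}\right)^{-2}\left(1 - \frac{p\,\gamma}{t^{2}}\right)^{-1}$$
yielding a $(t/p)^{2}$ decay factor per mode once $p/t \gg 1/\lambda_\rho$, and precise learning curves tied to the kernel and target function power spectra. These results, corroborated by empirical studies on synthetic data and real datasets such as MNIST, show that the NTK spectrum orders generalization dynamics: modes with larger $\lambda_\rho$ are learned first as training/sample size increases (Bordelon et al., 2020, Takeuchi et al., 24 Jul 2025, Mysore et al., 9 Dec 2025).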
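A mode-wise decomposition of this kind can be evaluated numerically. The sketch below assumes the self-consistent form stated in its docstring (the form commonly attributed to Bordelon et al., 2020) and solves for $t$ by bisection. The function name `mode_errors`, the synthetic power-law spectrum, and the ridge value are illustrative assumptions rather than settings from the cited work.

```python
import numpy as np

def mode_errors(eigvals, target_w2, p, ridge):
    """Mode-wise errors E_rho under the assumed formula
        E_rho = (w_rho^2 / lam_rho) * (1/lam_rho + p/(ridge + t))**-2
                / (1 - p * gamma / (ridge + t)**2),
    where t = sum_rho (1/lam_rho + p/(ridge + t))**-1 and
    gamma = sum_rho (1/lam_rho + p/(ridge + t))**-2.
    """
    def t_rhs(t):
        return np.sum(1.0 / (1.0 / eigvals + p / (ridge + t)))

    # Bisection on h(t) = t - t_rhs(t): h <= 0 at t = 0 and h >= 0 at t = sum(eigvals).
    lo, hi = 0.0, float(np.sum(eigvals))
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid - t_rhs(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)

    inv = 1.0 / (1.0 / eigvals + p / (ridge + t))
    gamma = np.sum(inv ** 2)
    return (target_w2 / eigvals) * inv ** 2 / (1.0 - p * gamma / (ridge + t) ** 2)

# Hypothetical spectrum lambda_rho = rho^-2 with unit target weights.
eigvals = np.arange(1, 501, dtype=float) ** -2.0
w2 = np.ones_like(eigvals)
e0 = mode_errors(eigvals, w2, p=0, ridge=1e-2)
e200 = mode_errors(eigvals, w2, p=200, ridge=1e-2)
relative = e200 / e0     # remaining fraction of each mode's initial error
```

With no samples the error in each mode equals the target power `w2 * eigvals`. At `p = 200` the entries of `relative` increase along the array, so modes with larger eigenvalues have lost a larger share of their initial error.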
5. Learning Stages, Architectural Effects, and Phase Transitions
Spectral analysis identifies learning stages corresponding to transitions where additional frequency bands are learned as the training set size or network capacity crosses certain thresholds. In high-dimensional spaces for dot product kernels (including NTK), $\lambda_k \sim d^{-k}$ and $N(d,k) \sim d^{k}$, so as the sample size scales as $p \sim d^{\ell}$ with $d \to \infty$:
- Modes $k < \ell$: "perfectly learned" (modewise error approaches zero)
- Modes $k = \ell$: learning in transition
- Modes $k > \ell$: not yet fit, error remains near initial value
Consequently, learning proceeds as a series of spectral transitions, successively fitting higher-frequency modes. This phenomenon is robust to moderate deviations in data distribution and persists across architectures with the same underlying symmetry (Bordelon et al., 2020, Cao et al., 2019, Wang et al., 2022).
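The growth of the mode count with degree can be made concrete. The sketch below computes the number of degree-$k$ spherical harmonics on $\mathbb{S}^{d-1}$ from the standard binomial formula and pairs it with a crude counting heuristic for the highest degree a given sample size could cover. The helper `fitted_degree` is a hypothetical illustration of the threshold picture, not the self-consistent theory from the cited papers.

```python
from math import comb

def harmonic_multiplicity(d, k):
    """Number of linearly independent degree-k spherical harmonics on S^(d-1), d >= 2."""
    if k == 0:
        return 1
    return comb(k + d - 1, d - 1) - comb(k + d - 3, d - 1)

def fitted_degree(d, p):
    """Largest degree L with sum_{k <= L} N(d, k) <= p (counting heuristic only).
    Returns -1 when p is smaller than the single degree-0 mode."""
    total, degree = 0, -1
    while total + harmonic_multiplicity(d, degree + 1) <= p:
        degree += 1
        total += harmonic_multiplicity(d, degree)
    return degree

counts = [harmonic_multiplicity(10, k) for k in range(4)]   # [1, 10, 54, 210]
stage = fitted_degree(10, 65)                                # 2
```

In dimension 10 the multiplicities grow roughly like $d^{k}$, so each additional degree requires about a factor of $d$ more samples under this heuristic.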
In linear-width regimes (where network width scales with dataset size), the empirical spectral distribution (ESD) of kernel matrices obeys deterministic limiting laws (e.g., Marčenko–Pastur), and bulk invariance is maintained under small learning rates. Large step sizes or adaptive optimization induce phase transitions: isolated "spike" eigenvalues emerge, corresponding to feature learning and alignment with target or spurious directions (Wang et al., 2022, Fan et al., 2020).
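The bulk-plus-spike picture can be illustrated with a generic random-matrix example. The sketch below compares the eigenvalues of a sample covariance matrix of i.i.d. Gaussian data against the Marčenko–Pastur edges, then inflates the variance along one coordinate to produce an isolated outlier. This is an analogy built on a spiked covariance model, with arbitrary sizes and spike strength chosen as assumptions; it does not simulate network training.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 500                 # samples and dimension, aspect ratio d / n = 0.25
ratio = d / n
X = rng.normal(size=(n, d))

# Bulk: eigenvalues of the sample covariance of i.i.d. unit-variance data.
bulk = np.linalg.eigvalsh(X.T @ X / n)
mp_lower = (1.0 - np.sqrt(ratio)) ** 2
mp_upper = (1.0 + np.sqrt(ratio)) ** 2

# Spike: inflate the variance along one coordinate and recompute the spectrum.
spike_strength = 4.0             # hypothetical; population eigenvalue is 1 + spike_strength
X_spiked = X.copy()
X_spiked[:, 0] *= np.sqrt(1.0 + spike_strength)
spiked = np.linalg.eigvalsh(X_spiked.T @ X_spiked / n)
```

The eigenvalues in `bulk` stay close to the interval `[mp_lower, mp_upper]`, while the largest entry of `spiked` separates clearly from the bulk and the remaining entries do not.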
6. Spectral Generalization Bounds, Random Features, and Operator-Valued Extensions
Spectral properties directly control generalization bounds. For finite samples and regularization, the expected generalization error admits a spectral expansion over the kernel eigenmodes, with a residual term accounting for finite-width or stochastic effects (Mysore et al., 9 Dec 2025, Bordelon et al., 2020). Enriching the spectrum (e.g., through Fourier features or residual scaling) increases the smallest eigenvalues and tightens the resulting bounds.
Random feature methods can be analyzed through the same spectral lens. Given appropriate spectral filters, minimax optimal rates for regression in the RKHS determined by the kernel hold provided the number of random features matches the effective dimension. The required number is stated for standard Tikhonov regularization and, more generally, as a function of the source regularity and the eigenvalue decay rate (Nguyen et al., 19 Jun 2025, Nguyen et al., 1 Mar 2026, Han et al., 2021).
These results extend to operator-valued kernels and neural operators, with Mercer-type expansions and tight control of the minimax learning rate, even in misspecified regimes (Nguyen et al., 1 Mar 2026).
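For readers unfamiliar with random feature methods, the sketch below shows the classical random Fourier feature construction for a Gaussian kernel and measures how closely the feature inner products reproduce the exact Gram matrix. It illustrates the approximation only, not the regression rates discussed above; the bandwidth, sample size, and feature count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, num_features = 50, 3, 5000
bandwidth = 1.0
X = rng.normal(size=(n, d))

# Exact Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 * bandwidth^2)).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / (2.0 * bandwidth ** 2))

# Random Fourier features: frequencies drawn from the kernel's spectral density,
# phases drawn uniformly from [0, 2*pi).
W = rng.normal(scale=1.0 / bandwidth, size=(d, num_features))
b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
Z = np.sqrt(2.0 / num_features) * np.cos(X @ W + b)
K_approx = Z @ Z.T

max_error = np.abs(K_exact - K_approx).max()
```

The entrywise error shrinks roughly like the inverse square root of `num_features`, which is why the number of features needed is tied to how much of the kernel spectrum has to be resolved.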
7. Architectural and Optimization Mechanisms for Spectral Control
Recent work has established mechanisms to control and exploit spectral properties through informed architectural or algorithmic modifications:
- Fourier feature embeddings increase spectral support in high-frequency modes, reducing spectral bias and improving convergence on rapidly varying targets (Mysore et al., 9 Dec 2025).
- Residual scaling and stochastic depth can be tuned to control the growth of maximal kernel eigenvalues (stability) and prevent departures from the linearized NTK regime (Mysore et al., 9 Dec 2025, Belfer et al., 2021).
- Adaptive optimizers (e.g., Adam) generate heavy-tailed spectra, learning multiple target directions and often correlating with improved test accuracy (Wang et al., 2022).
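The first mechanism in the list above can be sketched in a few lines. A Fourier feature embedding of scalar inputs induces a stationary kernel whose spectral mass sits on the embedding frequencies, so adding higher frequencies adds support there. The function name `fourier_embed` and the frequency set are hypothetical choices made for illustration.

```python
import numpy as np

def fourier_embed(x, freqs):
    """Map scalars x to [cos(2*pi*b_j*x), sin(2*pi*b_j*x)] / sqrt(m) over the m frequencies b_j."""
    phase = 2.0 * np.pi * np.outer(x, freqs)
    return np.concatenate([np.cos(phase), np.sin(phase)], axis=1) / np.sqrt(len(freqs))

freqs = np.array([1.0, 2.0, 4.0, 8.0])            # hypothetical frequency set
x = np.linspace(0.0, 1.0, 16, endpoint=False)
features = fourier_embed(x, freqs)
K = features @ features.T

# The induced kernel depends only on x - x': it is the mean of cos(2*pi*b_j*(x - x')).
diff = x[:, None] - x[None, :]
K_expected = np.cos(2.0 * np.pi * diff[..., None] * freqs).mean(-1)
```

Because `K` has rank at most `2 * len(freqs)`, only the chosen frequencies carry nonzero eigenvalues in this toy setting; composing the embedding with a network changes the details but follows the same idea.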
For physics-informed neural networks (PINNs), it is established that the introduction of a differential operator in the loss does not in general accelerate tail decay of the NTK spectrum or enhance learning of high-frequency modes. Advanced activation functions (e.g., periodic, SIREN-type) and loss-balancing heuristics can partially alleviate spectral bias in these setups (Saadat et al., 2022, Gan et al., 14 Mar 2025, Faroughi et al., 9 Jun 2025).
References Table
| Topic | Key Results and Methods | References |
|---|---|---|
| Mercer decomposition, spectral decay, learning stages | Analytical formulae for mode-wise error, "successive mode fit", distinct learning stages | (Bordelon et al., 2020) |
| NTK spectral bias theory | Decomposition of training process along kernel eigenfunctions; low-frequency fit first | (Cao et al., 2019) |
| Architectural effects: Residuals, PNNs, convolution | ResNTK and FC-NTK share eigenfunctions and polynomial eigenvalue decay; PNNs have heavier spectral tails; CNNs exhibit spatially localized spectra | (Belfer et al., 2021, Wu et al., 2022, Geifman et al., 2022) |
| Spectrum and generalization bounds | Explicit spectral error bounds; enrichment increases min-eigenvalue, improves error | (Mysore et al., 9 Dec 2025) |
| Linear-width & high-dim spectral analysis | Phase transition, bulk invariance, Marčenko–Pastur, emergence of spikes and heavy tails | (Fan et al., 2020, Wang et al., 2022) |
| Random features and spectral analysis | Finite-sample, RF, and operator-valued kernel generalization rates, spectral approximation | (Nguyen et al., 19 Jun 2025, Nguyen et al., 1 Mar 2026, Han et al., 2021) |
| Physics-informed networks & differential operators | Differential operators do not induce faster decay; spectral bias is robust; periodic activations flatten spectrum | (Saadat et al., 2022, Gan et al., 14 Mar 2025, Faroughi et al., 9 Jun 2025) |
Spectral properties and NTK analysis provide a unified, quantitative, and predictive toolkit for understanding the differential learnability of features in wide neural networks. In the kernel regime, the eigenspectrum determines the learning curve, orders convergence by "simplicity," and dictates both algorithmic limitations and pathways for architectural improvement. The core principle is that kernel spectra act as an "inductive filter," enforcing a bias toward low-complexity solutions, structuring generalization, and, through explicit manipulation, enabling informed control of network learning dynamics.