NTK Spectral Decomposition: Analysis & Insights
- Spectral decomposition of the NTK is a framework that expresses the kernel in terms of its eigenvalues and eigenfunctions, elucidating the internal dynamics of deep networks.
- The method leverages integral operator techniques and harmonic analysis to quantify convergence rates and feature complexity across varied input domains.
- It contrasts NTK with conjugate kernels, highlighting how eigenvalue distributions inform implicit regularization and guide hyperparameter tuning in overparameterized models.
The spectral decomposition of the Neural Tangent Kernel (NTK) provides a rigorous framework to analyze the learning dynamics, generalization properties, and functional biases of deep neural networks—especially in the infinite-width or overparameterized regimes. The NTK describes the evolution of the network function under gradient descent, with its eigenspectrum offering deep insights into implicit regularization, feature complexity bias, and optimization efficiency.
1. Integral Operator Framework and Eigen-decomposition
The NTK induces a symmetric kernel (integral) operator on the space of square-integrable functions relative to the input distribution. Its action can be written as

$$(T_K f)(x) = \int K(x, x')\, f(x')\, d\mu(x'),$$

where $K$ is the NTK and $\mu$ the input measure. The kernel admits a spectral (Mercer) decomposition

$$K(x, x') = \sum_{i \ge 1} \lambda_i\, \phi_i(x)\, \phi_i(x'),$$

with $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$ the non-increasingly ordered eigenvalues and $\{\phi_i\}$ the corresponding orthonormal eigenfunctions. For multilayer perceptrons (MLPs) parametrized in the NTK regime, $K$ can often be written as a zonal function of the normalized inner product,

$$K(x, x') = \Phi\!\left(\frac{\langle x, x' \rangle}{\|x\|\,\|x'\|}\right),$$

enabling explicit harmonic-analytic computations on isotropic domains (Yang et al., 2019).
On the Boolean cube $\{\pm 1\}^d$, the Fourier (parity) basis $\{\chi_S\}_{S \subseteq [d]}$ diagonalizes any kernel of the form $K(x, y) = \Phi(\langle x, y \rangle / d)$, with explicit eigenvalues depending only on the degree $k = |S|$ and given by

$$\mu_k = 2^{-d} \sum_{i=0}^{d} \Phi\!\left(\frac{d - 2i}{d}\right) \sum_{j} (-1)^j \binom{k}{j} \binom{d-k}{i-j}.$$
On the $(d-1)$-dimensional sphere $S^{d-1}$, the eigenfunctions are spherical harmonics and the expansion involves Gegenbauer polynomials; the $k$-th degree eigenvalues scale in high dimension as

$$\mu_k = \Theta\!\left(d^{-k}\right) \quad \text{as } d \to \infty,$$

with the constant determined by the degree-$k$ coefficient of $\Phi$.
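As a quick numerical sanity check of the Boolean-cube formula, the following minimal NumPy sketch (using a toy quadratic dual function `phi` in place of an actual NTK, and a small dimension so brute-force diagonalization is feasible) computes the degree-$k$ eigenvalues from the combinatorial sum and compares them with a direct eigendecomposition of the kernel Gram matrix on the full cube.

```python
import itertools
from math import comb

import numpy as np

def boolean_cube_eigenvalues(phi, d):
    """Degree-k eigenvalues of K(x, y) = phi(<x, y>/d) on the cube {+-1}^d.

    Implements mu_k = 2^{-d} * sum_i phi((d - 2i)/d) * sum_j (-1)^j C(k, j) C(d-k, i-j).
    """
    mu = np.zeros(d + 1)
    for k in range(d + 1):
        total = 0.0
        for i in range(d + 1):
            coeff = sum((-1) ** j * comb(k, j) * comb(d - k, i - j)
                        for j in range(max(0, i - (d - k)), min(k, i) + 1))
            total += phi((d - 2 * i) / d) * coeff
        mu[k] = total / 2 ** d
    return mu

# Toy dual function of the normalized inner product; a real NTK Phi would come
# from the architecture's kernel recursion instead.
phi = lambda t: (1 + t) ** 2 / 4

d = 8
mu = boolean_cube_eigenvalues(phi, d)

# Brute-force check: under the uniform measure the operator matrix is K / 2^d,
# and each mu_k should appear with multiplicity C(d, k).
cube = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
K = phi(cube @ cube.T / d)
eigs = np.sort(np.linalg.eigvalsh(K / 2 ** d))[::-1]
predicted = np.sort(np.repeat(mu, [comb(d, k) for k in range(d + 1)]))[::-1]
print("max abs deviation:", np.max(np.abs(eigs - predicted)))
```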
2. Spectral Bias, Simplicity, and Feature Complexity
Spectral analysis reveals an "implicit bias" or "simplicity bias": the tendency for the NTK to place more variance (or larger eigenvalues) on eigenfunctions encoding lower-complexity (lower-frequency) functions. This manifests in the dynamics of gradient descent: each eigencomponent of the target function is learned at a rate determined by the corresponding kernel eigenvalue. Specifically, under kernel gradient descent with learning rate $\eta$, the residual along the $i$-th eigenfunction evolves as

$$\langle f_t - f^{*}, \phi_i \rangle = (1 - \eta \lambda_i)^{t}\, \langle f_0 - f^{*}, \phi_i \rangle,$$

so components with larger NTK eigenvalues converge more rapidly (Yang et al., 2019, Cao et al., 2019).
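A minimal simulation makes this mode-by-mode picture concrete. The NumPy sketch below uses a synthetic positive semi-definite Gram matrix as a stand-in for an NTK (an illustrative assumption, not a computed NTK) and verifies that kernel gradient descent shrinks the residual along each eigenvector at the rate $(1 - \eta\lambda_i)^t$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic positive semi-definite Gram matrix standing in for an NTK on n points.
n = 50
A = rng.standard_normal((n, n))
K = A @ A.T / n
lam, V = np.linalg.eigh(K)            # ascending eigenvalues, orthonormal columns

# Kernel gradient descent on the squared loss: f_{t+1} = f_t - eta * K @ (f_t - y).
y = rng.standard_normal(n)
eta = 1.0 / lam[-1]                   # safely below the 2 / lambda_max threshold
f = np.zeros(n)
steps = 200
for _ in range(steps):
    f = f - eta * K @ (f - y)

# The residual along the i-th eigenvector should equal (1 - eta * lam_i)^steps
# times its initial value, since the update is linear in the residual.
measured = V.T @ (f - y)
predicted = (1.0 - eta * lam) ** steps * (V.T @ (np.zeros(n) - y))
print("max abs deviation from per-mode prediction:", np.max(np.abs(measured - predicted)))
```

Because the update is linear in the residual, the per-mode prediction holds up to floating-point error; with a real NTK the same decomposition explains why large-eigenvalue (low-complexity) components are fit first.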
For inputs on the sphere, the NTK decomposes into spherical harmonics, and lower degree harmonics (i.e., smoother, less oscillatory features) are preferentially fit—explaining empirical observations of spectral bias in deep learning. Theoretical and numerical studies confirm that low-frequency components dominate early learning, and the rate of resolution of higher-complexity structure is bottlenecked by the decay of the NTK eigenvalues (Cao et al., 2019).
However, this bias can be modulated: with nonlinearities such as erf combined with large weight variance, or with increased depth, the dominant eigenvalues no longer necessarily correspond to the simplest functions, which reduces or removes the simplicity bias (Yang et al., 2019).
3. Comparisons with Conjugate Kernel (CK/NNGP) and Layerwise Training
The conjugate kernel (CK), also known as the Neural Network Gaussian Process (NNGP) kernel, captures network behavior at initialization or under last-layer-only training, while the NTK governs training when all layers are updated. Both admit similar spectral decompositions, but significant differences arise (Yang et al., 2019, Fan et al., 2020):
- CK spectra often place most of their energy on low-degree (simple) components, reflecting a strong simplicity bias.
- NTK spectra distribute a larger fraction of their "trace" onto higher-degree (complex) features, especially as network depth increases (made quantitative in the sketch following this list).
- Thus, all-layer (NTK) training allows better learning of complex targets that are invisible to the CK.
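This contrast can be made quantitative with the Boolean-cube eigenvalue formula above. The sketch below assumes the standard arc-cosine kernel recursions for a bias-free ReLU MLP in the NTK parametrization (one convenient setting, not the only one analyzed in the cited papers) and computes, for several depths, the fraction of the kernel trace that the CK and the NTK place on Fourier degrees three and higher; the degree-3 cutoff is an arbitrary illustrative choice.

```python
from math import comb

import numpy as np

def cube_eigenvalues(phi, d):
    """Degree-k eigenvalues of K(x, y) = phi(<x, y>/d) on the Boolean cube {+-1}^d."""
    grid = (d - 2 * np.arange(d + 1)) / d
    vals = phi(grid)
    mu = np.empty(d + 1)
    for k in range(d + 1):
        mu[k] = sum(vals[i] * sum((-1) ** j * comb(k, j) * comb(d - k, i - j)
                                  for j in range(min(k, i) + 1))
                    for i in range(d + 1)) / 2 ** d
    return mu

# Arc-cosine ("dual") recursions for a bias-free ReLU MLP in the NTK parametrization
# (sigma_w^2 = 2, unit-norm inputs), as scalar functions of the input correlation c.
def kappa0(rho):
    rho = np.clip(rho, -1.0, 1.0)
    return (np.pi - np.arccos(rho)) / np.pi

def kappa1(rho):
    rho = np.clip(rho, -1.0, 1.0)
    return (np.sqrt(1.0 - rho ** 2) + rho * (np.pi - np.arccos(rho))) / np.pi

def ck_and_ntk(c, depth):
    """Return the CK (NNGP) and NTK dual functions of a depth-`depth` ReLU MLP at c."""
    sigma, theta = c, c
    for _ in range(depth - 1):
        sigma_next = kappa1(sigma)
        theta = sigma_next + theta * kappa0(sigma)
        sigma = sigma_next
    return sigma, theta

d = 20
mult = np.array([comb(d, k) for k in range(d + 1)])
trace_fraction = lambda mu: mult * mu / np.sum(mult * mu)

for depth in (2, 4, 8):
    mu_ck = cube_eigenvalues(lambda c: ck_and_ntk(c, depth)[0], d)
    mu_ntk = cube_eigenvalues(lambda c: ck_and_ntk(c, depth)[1], d)
    print(f"depth {depth}: trace fraction on degrees >= 3  "
          f"CK {trace_fraction(mu_ck)[3:].sum():.3f}  "
          f"NTK {trace_fraction(mu_ntk)[3:].sum():.3f}")
```

If the claims above hold in this setting, the NTK's high-degree fraction should exceed the CK's and the gap should widen with depth.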
The NTK also robustly predicts the highest safe learning rate for gradient descent, matching empirical and theoretical results that the allowable learning rate often scales as $2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest NTK eigenvalue.
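A quick numerical check of this threshold, again using a synthetic positive semi-definite Gram matrix as a hypothetical stand-in for the NTK: under the linearized (kernel) gradient-descent dynamics on a squared loss, the residual stays bounded for step sizes below $2/\lambda_{\max}$ and blows up above it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
A = rng.standard_normal((n, n))
K = A @ A.T / n                          # synthetic PSD stand-in for the NTK Gram matrix
lam_max = np.linalg.eigvalsh(K)[-1]
y = rng.standard_normal(n)

def residual_norm(eta, steps=500):
    """Run f_{t+1} = f_t - eta * K @ (f_t - y) and return ||f - y|| afterwards."""
    f = np.zeros(n)
    for _ in range(steps):
        f = f - eta * K @ (f - y)
    return np.linalg.norm(f - y)

for scale in (0.5, 0.9, 0.99, 1.01, 1.1):
    print(f"eta = {scale:.2f} * 2/lambda_max  ->  residual {residual_norm(scale * 2.0 / lam_max):.3e}")
```

This is the same quantity that the binary-search experiments mentioned in Section 6 estimate empirically on real networks.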
4. Efficient Spectral Computation in Symmetric Domains
The analysis of NTK spectra is tractable when the input distribution is highly symmetric. On the Boolean cube, explicit combinatorial formulas allow rapid determination of all eigenvalues:

$$\mu_k = 2^{-d} \left( (I - T)^{k} (I + T)^{d-k}\, \Phi \right)(1),$$

where $T$ is the shift operator $(T\Phi)(t) = \Phi\!\left(t - \tfrac{2}{d}\right)$. This approach yields efficient and numerically stable computation compared to integration against spherical harmonics or Gegenbauer polynomials (Yang et al., 2019).
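In code, the operator form reduces to iterated two-point stencils on the grid of attainable correlations. The sketch below is a minimal NumPy rendering under the shift-operator convention $(T\Phi)(t) = \Phi(t - 2/d)$ stated above; the exponential dual function `phi` is just a placeholder for an actual NTK $\Phi$.

```python
import numpy as np

def cube_eigenvalue_shift(phi_vals, k):
    """mu_k = 2^{-d} ((I - T)^k (I + T)^{d-k} Phi)(1) with (T Phi)(t) = Phi(t - 2/d).

    `phi_vals[i]` holds Phi((d - 2i)/d) for i = 0, ..., d, i.e. Phi on the grid
    1, 1 - 2/d, ..., -1. Each operator application is a two-point stencil that
    shortens the array by one; the surviving entry (index 0) is the value at t = 1.
    """
    d = len(phi_vals) - 1
    v = np.asarray(phi_vals, dtype=float)
    for _ in range(k):                  # apply (I - T): v[i] - v[i + 1]
        v = v[:-1] - v[1:]
    for _ in range(d - k):              # apply (I + T): v[i] + v[i + 1]
        v = v[:-1] + v[1:]
    return v[0] / 2 ** d

d = 10
phi = lambda t: np.exp(t)               # placeholder dual function for an NTK Phi
grid = (d - 2 * np.arange(d + 1)) / d   # attainable correlations <x, y>/d on the cube
mu = [cube_eigenvalue_shift(phi(grid), k) for k in range(d + 1)]
print(np.round(mu, 6))
```

Expanding $(I - T)^{k}(I + T)^{d-k}$ term by term recovers the explicit combinatorial sum from Section 1, so the two computations agree exactly.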
In high dimensions, the kernel spectra turn out to be asymptotically equivalent on the Boolean cube, the sphere, or for isotropic Gaussian data—justifying the universal application of these algorithms to many practical settings.
5. Deterministic and Random Matrix Theory Limits
Extending to the regime where network width scales linearly with training sample size ("linear width" regime), rigorous random matrix theory analyses show that the NTK eigenvalue distribution converges to a deterministic limit. The limit for the CK is described by recursive Marčenko-Pastur maps across hidden layers; for the NTK it is a linear combination of CK matrices across layers, given by fixed-point equations extending the Marčenko-Pastur law (Fan et al., 2020, Wang et al., 2021).
In the "ultra-wide" regime (width sample size), the NTK’s centered and normalized empirical spectral distribution converges to a deformed semicircle law—the spectrum being fully characterized by a self-consistent equation involving the input Gram structure and nonlinearity parameters. This provides quantitative estimates for eigenvalue support and conditions for global convergence during training (Wang et al., 2021).
6. Experimental Validation and Practical Implications
Empirical studies confirm theoretical predictions about NTK spectra:
- Fractional variance (proportion of trace) assigned to high-degree functions increases with network depth, enhancing the network’s expressivity for complex functions.
- When training on Boolean or real datasets, the empirically measured eigenvalue spectra of the NTK closely match analytic predictions from the relevant formulae or random matrix limits, provided the data distribution is close to uniform or orthogonalized.
- Maximal nondiverging learning rates for gradient descent (estimated by binary search) empirically agree with theory ($\eta_{\max} \approx 2/\lambda_{\max}$), both for synthetic inputs and datasets like MNIST/CIFAR10 (Yang et al., 2019).
These insights have direct implications for hyperparameter selection, e.g., selecting suitable network depth for complex targets and tuning learning rates to maximize convergence without instability.
7. Summary of Key Formulas and Theoretical Results
The comprehensive spectral decomposition reduces the problem to familiar harmonic analysis:
- On the Boolean cube: $\mu_k = 2^{-d} \sum_{i=0}^{d} \Phi\!\left(\frac{d - 2i}{d}\right) \sum_{j} (-1)^j \binom{k}{j} \binom{d-k}{i-j}$
- Equivalent operator form: $\mu_k = 2^{-d} \left( (I - T)^{k} (I + T)^{d-k}\, \Phi \right)(1)$, with $(T\Phi)(t) = \Phi(t - 2/d)$
- High-dimensional asymptotics: the degree-$k$ eigenvalues satisfy $\mu_k = \Theta(d^{-k})$, with equivalent scaling on the Boolean cube, the sphere, and isotropic Gaussian inputs
The NTK spectrum not only underlies implicit regularization (a preference for low-complexity features); fine-grained spectral analysis also explains how architecture choices, nonlinearity selection, and hyperparameter variation impact both trainability and generalization.
This spectral viewpoint provides a unifying language linking neural function class, learning algorithm, and complexity-generalization tradeoffs through the precise behavior of kernel eigenvalues and associated eigenspaces.