
NTK Spectral Decomposition: Analysis & Insights

Updated 25 July 2025
  • Spectral decomposition of the NTK is a framework that expresses the kernel in terms of its eigenvalues and eigenfunctions, elucidating the internal dynamics of deep networks.
  • The method leverages integral operator techniques and harmonic analysis to quantify convergence rates and feature complexity across varied input domains.
  • It contrasts NTK with conjugate kernels, highlighting how eigenvalue distributions inform implicit regularization and guide hyperparameter tuning in overparameterized models.

The spectral decomposition of the Neural Tangent Kernel (NTK) provides a rigorous framework to analyze the learning dynamics, generalization properties, and functional biases of deep neural networks—especially in the infinite-width or overparameterized regimes. The NTK describes the evolution of the network function under gradient descent, with its eigenspectrum offering deep insights into implicit regularization, feature complexity bias, and optimization efficiency.

1. Integral Operator Framework and Eigen-decomposition

The NTK induces a symmetric kernel operator on the space of square-integrable functions (relative to the input distribution). Its action can be written as

(K f)(x) = \int K(x, x') f(x')\,d\mu(x'),

where K is the NTK and \mu is the input measure. The kernel admits a spectral decomposition

K(x, x') = \sum_{i} \lambda_i u_i(x) u_i(x'),

with \{\lambda_i\} the non-increasingly ordered eigenvalues and \{u_i\} the corresponding orthonormal eigenfunctions. For multilayer perceptrons (MLPs) parametrized in the NTK regime, K(x, x') can often be written as a zonal function of the normalized inner product:

K(x, y) = \Phi\left(\frac{\langle x, y \rangle}{\|x\|\|y\|}\right),

enabling explicit harmonic analytic computations on isotropic domains (Yang et al., 2019).
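
As a concrete illustration, the operator spectrum can be approximated from samples by eigendecomposing the scaled kernel Gram matrix (a Nyström-type approximation). The sketch below is a minimal example rather than code from the cited work: it assumes unit-norm Gaussian inputs as a stand-in for an isotropic input measure, and takes \Phi to be the widely used closed form for a two-layer ReLU NTK built from the arc-cosine functions \kappa_0 and \kappa_1 (an illustrative choice, up to an overall scale).

```python
import numpy as np

def kappa0(t):
    return (np.pi - np.arccos(t)) / np.pi

def kappa1(t):
    return (np.sqrt(1.0 - t * t) + t * (np.pi - np.arccos(t))) / np.pi

def ntk_phi(t):
    # Assumed zonal NTK of a two-layer ReLU network on unit-norm inputs (up to scale).
    t = np.clip(t, -1.0, 1.0)
    return t * kappa0(t) + kappa1(t)

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs roughly uniform on the sphere S^{d-1}

K = ntk_phi(X @ X.T)                            # K_ij = Phi(<x_i, x_j>)
lam = np.linalg.eigvalsh(K / n)[::-1]           # Nystrom estimate of the operator eigenvalues
print(lam[:12])                                 # leading eigenvalues; low-degree harmonics dominate
```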

On the Boolean cube \{\pm 1\}^d, the Fourier basis \chi_S(x) = \prod_{i \in S} x_i diagonalizes any kernel of the form K(x, y) = \Phi(\langle x, y \rangle / d), with explicit eigenvalues given by

\mu_{|S|} = \mathbb{E}_{x}[x^S \cdot \Phi(\tfrac{1}{d}\sum_{i} x_i)].
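
This expectation can be evaluated exactly for moderate d by enumerating the cube, since by symmetry the eigenvalue depends only on |S|. The sketch below is a minimal illustration using the same assumed two-layer ReLU NTK zonal function \Phi as above (an illustrative choice, not a formula prescribed by the source).

```python
import itertools
import numpy as np

def kappa0(t): return (np.pi - np.arccos(t)) / np.pi
def kappa1(t): return (np.sqrt(1.0 - t * t) + t * (np.pi - np.arccos(t))) / np.pi
def phi(t):
    t = np.clip(t, -1.0, 1.0)
    return t * kappa0(t) + kappa1(t)            # assumed zonal NTK Phi (illustrative)

def mu(d, k):
    """mu_{|S|} for |S| = k: average of x^S * Phi((1/d) sum_i x_i) over the cube, with S = {1, ..., k}."""
    acc = 0.0
    for x in itertools.product((-1.0, 1.0), repeat=d):
        acc += np.prod(x[:k]) * phi(sum(x) / d)
    return acc / 2 ** d

d = 12
print([mu(d, k) for k in range(4)])             # eigenvalues decay quickly with the degree k
```
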

On the d-dimensional sphere, the eigenfunctions are spherical harmonics and the expansion involves Gegenbauer polynomials; the k-th degree eigenvalues scale in high dimension as

\lim_{d \to \infty} d^k \cdot (\text{eigenvalue at degree } k) = \Phi^{(k)}(0).

2. Spectral Bias, Simplicity, and Feature Complexity

Spectral analysis reveals an "implicit bias" or "simplicity bias"—the tendency for the NTK to place more variance (or larger eigenvalues) on eigenfunctions encoding lower-complexity (lower-frequency) functions. This manifests in the dynamics of gradient descent: each eigencomponent of the target function is learned at a rate determined by the corresponding kernel eigenvalue. Specifically, under kernel gradient descent,

g^{(t+1)} = g^{(t)} - 2\alpha K (g^{(t)} - g^*),

components with larger NTK eigenvalues converge more rapidly (Yang et al., 2019, Cao et al., 2019).
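
A minimal simulation of this iteration, assuming random unit-norm inputs and the same illustrative two-layer ReLU zonal \Phi as before: after a few hundred steps the residual has essentially vanished on the top eigenmodes while it has barely decayed on the bottom ones.

```python
import numpy as np

def kappa0(t): return (np.pi - np.arccos(t)) / np.pi
def kappa1(t): return (np.sqrt(1.0 - t * t) + t * (np.pi - np.arccos(t))) / np.pi
def phi(t):
    t = np.clip(t, -1.0, 1.0)
    return t * kappa0(t) + kappa1(t)            # assumed zonal NTK Phi (illustrative)

rng = np.random.default_rng(1)
n, d = 300, 10
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
K = phi(X @ X.T) / n                            # normalized kernel Gram matrix
lam, U = np.linalg.eigh(K)                      # ascending eigenvalues, orthonormal eigenvectors

g_star = rng.standard_normal(n)                 # target values on the sample
g = np.zeros(n)
alpha = 0.9 / (2.0 * lam[-1])                   # stable step size (alpha < 1 / lambda_max)
for _ in range(200):
    g = g - 2.0 * alpha * K @ (g - g_star)      # kernel gradient descent step

res = U.T @ (g - g_star)                        # residual in the kernel eigenbasis
print("residual on 5 largest-eigenvalue modes: ", np.abs(res[-5:]))   # each mode decays as (1 - 2*alpha*lam_i)^t
print("residual on 5 smallest-eigenvalue modes:", np.abs(res[:5]))
```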

For inputs on the sphere, the NTK decomposes into spherical harmonics, and lower degree harmonics (i.e., smoother, less oscillatory features) are preferentially fit—explaining empirical observations of spectral bias in deep learning. Theoretical and numerical studies confirm that low-frequency components dominate early learning, and the rate of resolution of higher-complexity structure is bottlenecked by the decay of the NTK eigenvalues (Cao et al., 2019).

However, this bias can be modulated: with nonlinearities such as erf combined with large weight variance, or with increased depth, the dominant eigenvalues no longer necessarily correspond to the simplest functions, which reduces or removes the simplicity bias (Yang et al., 2019).

3. Comparisons with Conjugate Kernel (CK/NNGP) and Layerwise Training

The conjugate kernel (CK), also called the Neural Network Gaussian Process (NNGP) kernel, captures network behavior at initialization or under last-layer-only training, while the NTK governs training when all layers are updated. Both admit similar spectral decompositions, but significant differences arise (Yang et al., 2019, Fan et al., 2020):

  • CK spectra often place most of their energy on low-degree (simple) components, reflecting a pronounced simplicity bias.
  • NTK spectra distribute a larger fraction of their "trace" onto higher-degree (complex) features, especially as network depth increases.
  • Thus, all-layer (NTK) training allows better learning of complex targets that are effectively invisible to the CK; a numerical comparison of the two trace profiles follows below.
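
The following sketch makes this contrast concrete for a two-layer ReLU model on the Boolean cube. It assumes the commonly quoted closed forms \Phi_{CK}(t) = \kappa_1(t) and \Phi_{NTK}(t) = t\,\kappa_0(t) + \kappa_1(t) (up to an overall constant, which cancels in trace fractions); these closed forms, and the brute-force eigenvalue computation from Section 1, are illustrative assumptions rather than formulas from the summarized papers. The fraction of the trace at degree k is \binom{d}{k}\mu_k / \sum_j \binom{d}{j}\mu_j.

```python
import itertools
from math import comb
import numpy as np

def kappa0(t): return (np.pi - np.arccos(t)) / np.pi
def kappa1(t): return (np.sqrt(1.0 - t * t) + t * (np.pi - np.arccos(t))) / np.pi
def phi_ck(t):
    return kappa1(np.clip(t, -1.0, 1.0))                  # assumed two-layer ReLU CK (up to scale)
def phi_ntk(t):
    t = np.clip(t, -1.0, 1.0)
    return t * kappa0(t) + kappa1(t)                      # assumed two-layer ReLU NTK (up to scale)

def mu(phi, d, k):
    """Degree-k eigenvalue on the cube: average of x^S * phi((1/d) sum_i x_i), with |S| = k."""
    acc = 0.0
    for x in itertools.product((-1.0, 1.0), repeat=d):
        acc += np.prod(x[:k]) * phi(sum(x) / d)
    return acc / 2 ** d

d = 10
for name, phi in (("CK ", phi_ck), ("NTK", phi_ntk)):
    mus = np.array([mu(phi, d, k) for k in range(d + 1)])
    mult = np.array([comb(d, k) for k in range(d + 1)])   # multiplicity of the degree-k eigenspace
    frac = mult * mus / np.sum(mult * mus)                # fraction of the trace at each degree
    print(name, np.round(frac[:5], 4))                    # the NTK shifts trace toward degrees >= 1
```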

The NTK also robustly predicts the highest safe learning rate for gradient descent, matching empirical and theoretical results that the allowable learning rate often scales as 1/\Phi(0).

4. Efficient Spectral Computation in Symmetric Domains

The analysis of NTK spectra is tractable when the input distribution is highly symmetric. On the Boolean cube, explicit combinatorial formulas allow rapid determination of all eigenvalues:

\mu_k = 2^{-d}(I - T_\Delta)^k (I + T_\Delta)^{d-k} \Phi(1),

where T_{\Delta} is a shift operator. This approach yields efficient and numerically stable computation compared to integration against spherical harmonics or Gegenbauer polynomials (Yang et al., 2019).
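
A short sketch of this computation, under the reading that T_{\Delta} acts on the profile \Phi by (T_{\Delta}\Phi)(t) = \Phi(t - 2/d); expanding the operator expression then gives a double binomial sum, which the code below cross-checks against the brute-force expectation from Section 1 (the zonal \Phi is again the assumed two-layer ReLU NTK).

```python
import itertools
from math import comb
import numpy as np

def kappa0(t): return (np.pi - np.arccos(t)) / np.pi
def kappa1(t): return (np.sqrt(1.0 - t * t) + t * (np.pi - np.arccos(t))) / np.pi
def phi(t):
    t = np.clip(t, -1.0, 1.0)
    return t * kappa0(t) + kappa1(t)            # assumed zonal NTK Phi (illustrative)

def mu_operator(d, k):
    """2^{-d} (I - T)^k (I + T)^{d-k} Phi(1), assuming (T Phi)(t) = Phi(t - 2/d)."""
    acc = 0.0
    for a in range(k + 1):                      # expand (I - T)^k
        for b in range(d - k + 1):              # expand (I + T)^{d-k}
            acc += (-1) ** a * comb(k, a) * comb(d - k, b) * phi(1.0 - 2.0 * (a + b) / d)
    return acc / 2 ** d

def mu_bruteforce(d, k):
    acc = 0.0
    for x in itertools.product((-1.0, 1.0), repeat=d):
        acc += np.prod(x[:k]) * phi(sum(x) / d)
    return acc / 2 ** d

d = 12
for k in range(4):
    print(k, mu_operator(d, k), mu_bruteforce(d, k))    # the two computations agree to numerical precision
```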

In high dimensions, the kernel spectra turn out to be asymptotically equivalent on the Boolean cube, on the sphere, and for isotropic Gaussian data, which justifies applying these algorithms across many practical settings.

5. Deterministic and Random Matrix Theory Limits

Extending to the regime where network width scales linearly with training sample size ("linear width" regime), rigorous random matrix theory analyses show that the NTK eigenvalue distribution converges to a deterministic limit. The limit for the CK is described by recursive Marčenko-Pastur maps across hidden layers; for the NTK it is a linear combination of CK matrices across layers, given by fixed-point equations extending the Marčenko-Pastur law (Fan et al., 2020, Wang et al., 2021).

In the "ultra-wide" regime (width \gg sample size), the NTK’s centered and normalized empirical spectral distribution converges to a deformed semicircle law—the spectrum being fully characterized by a self-consistent equation involving the input Gram structure and nonlinearity parameters. This provides quantitative estimates for eigenvalue support and conditions for global convergence during training (Wang et al., 2021).

6. Experimental Validation and Practical Implications

Empirical studies confirm theoretical predictions about NTK spectra:

  • Fractional variance (proportion of trace) assigned to high-degree functions increases with network depth, enhancing the network’s expressivity for complex functions.
  • When training on Boolean or real datasets, the empirically measured eigenvalue spectra of the NTK closely match analytic predictions from the relevant formulae or random matrix limits, provided the data distribution is close to uniform or orthogonalized.
  • Maximal nondiverging learning rates for gradient descent (estimated by binary search) empirically agree with the theoretical 1/\Phi(0) scaling, both for synthetic inputs and for datasets like MNIST and CIFAR10 (Yang et al., 2019); a minimal version of this binary-search estimate is sketched below.
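
A minimal version of that binary-search estimate, applied to the kernel gradient-descent iteration from Section 2 with the assumed two-layer ReLU zonal \Phi and a 1/n-normalized Gram matrix: under this normalization, and for high-dimensional, nearly orthogonal inputs, the largest eigenvalue is roughly \Phi(0), so the estimated threshold lands near 1/\Phi(0); the precise constant depends on the normalization used in the cited work.

```python
import numpy as np

def kappa0(t): return (np.pi - np.arccos(t)) / np.pi
def kappa1(t): return (np.sqrt(1.0 - t * t) + t * (np.pi - np.arccos(t))) / np.pi
def phi(t):
    t = np.clip(t, -1.0, 1.0)
    return t * kappa0(t) + kappa1(t)            # assumed zonal NTK Phi (illustrative)

def diverges(K, g_star, alpha, steps=500):
    """Run kernel gradient descent for a while and report whether the residual blows up."""
    g = np.zeros_like(g_star)
    for _ in range(steps):
        g = g - 2.0 * alpha * K @ (g - g_star)
        if not np.isfinite(g).all() or np.linalg.norm(g - g_star) > 1e8:
            return True
    return False

rng = np.random.default_rng(3)
n, d = 200, 50
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # high-dimensional, nearly orthogonal inputs
K = phi(X @ X.T) / n
g_star = rng.standard_normal(n)

lam_max = np.linalg.eigvalsh(K)[-1]
lo, hi = 0.0, 10.0 / lam_max                    # bracket: lo is stable, hi clearly diverges
for _ in range(40):                             # binary search on the divergence boundary
    mid = 0.5 * (lo + hi)
    lo, hi = (lo, mid) if diverges(K, g_star, mid) else (mid, hi)

print("estimated max stable rate :", round(lo, 3))               # finite-horizon estimate
print("1 / lambda_max            :", round(1.0 / lam_max, 3))
print("1 / Phi(0)                :", round(1.0 / phi(0.0), 3))   # rough high-dimensional prediction
```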

These insights have direct implications for hyperparameter selection, e.g., selecting suitable network depth for complex targets and tuning learning rates to maximize convergence without instability.

7. Summary of Key Formulas and Theoretical Results

The comprehensive spectral decomposition reduces the problem to familiar harmonic analysis:

  • On the Boolean cube:

\mu_k = \mathbb{E}_{x} [x^S \cdot \Phi(\frac{1}{d} \sum_i x_i)], \quad |S| = k

  • Equivalent operator form:

\mu_k = 2^{-d} (I - T_{\Delta})^k (I + T_{\Delta})^{d-k} \Phi(1)

  • High-dimensional asymptotics:

\lim_{d\to\infty} d^k \cdot (\text{eigenvalue at degree } k) = \Phi^{(k)}(0)

The NTK spectrum not only underlies implicit regularization (a preference for low-complexity features); fine-grained spectral analysis also explains how architecture choices, nonlinearity selection, and hyperparameter variation impact both trainability and generalization.

This spectral viewpoint provides a unifying language linking neural function class, learning algorithm, and complexity-generalization tradeoffs through the precise behavior of kernel eigenvalues and associated eigenspaces.