
Conjugate Kernel: Theory & Applications

Updated 5 June 2026
  • Conjugate Kernel is defined as the covariance (Gram) matrix of neural network feature maps, revealing the structure and expressivity of each layer.
  • It leverages random matrix theory to derive deterministic equivalents and spectral laws, uncovering phase transitions and outlier eigenvalues.
  • CK offers computational efficiency in kernel learning and has broad applications in neural generalization, operator extensions in PDEs, and superspace analysis.

The Conjugate Kernel (CK) is a central mathematical object in the theory of neural networks, kernel machines, and partial differential equations. In deep learning, CK arises as the covariance (Gram) matrix of the feature map induced by a randomly initialized, feed-forward neural network, capturing the structure and expressivity of the network’s layers. In classical and superspace analysis, the CK-operator encodes Bessel-type functional extensions and plane wave decompositions associated with Dirac and Laplace operators. The modern theory of CK connects random matrix theory, kernel learning, and representations of high-dimensional data, providing key insights into both the spectral properties and learnability of neural architectures.

1. Formal Definitions and Constructions

The Conjugate Kernel associated to a neural network is defined by propagating input vectors through a sequence of linear and nonlinear maps, capturing the feature covariance at each layer. For a fully-connected network with $L$ hidden layers, weight matrices $W_\ell\in \mathbb{R}^{d_\ell\times d_{\ell-1}}$, activation $\sigma:\mathbb{R}\to\mathbb{R}$ with $E_{\xi\sim N(0,1)}[\sigma(\xi)]=0$, $E[\sigma(\xi)^2]=1$, and $n$ inputs $X_0\in\mathbb{R}^{d_0\times n}$:

$H_\ell = W_\ell X_{\ell-1},\quad X_\ell = \frac{1}{\sqrt{d_\ell}}\, \sigma(H_\ell)$

with $\sigma$ applied entrywise. The Gram (CK) matrix at layer $\ell$ is

$K_\ell = X_\ell^\top X_\ell \in \mathbb{R}^{n\times n}$

The output-layer CK is $K_L = X_L^\top X_L$. In the “skeleton” formalism (Daniely, 2017), the CK is recursively built from a computation skeleton by propagating “conjugate activations”

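$\hat\sigma(\rho) = E_{(X,Y)\sim N_\rho}[\sigma(X)\sigma(Y)]$

where $N_\rho$ is the centered bivariate Gaussian with unit variances and correlation $\rho$, together with composition/averaging rules.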
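The layerwise recursion and Gram matrix above can be sketched numerically. The following is a minimal NumPy sketch, not a reference implementation: the function names `conjugate_kernel` and `normalized_he2`, the widths, and the activation $\sigma(h) = (h^2-1)/\sqrt{2}$ are illustrative assumptions. That activation is chosen only because it satisfies $E[\sigma(\xi)]=0$, $E[\sigma(\xi)^2]=1$ and, by the standard Hermite-polynomial identity, gives $E[\sigma(X)\sigma(Y)] = \rho^2$ for standard Gaussians $X, Y$ with correlation $\rho$, so a wide one-layer CK entry can be compared against a known value.

```python
import numpy as np

def normalized_he2(h):
    """Second Hermite polynomial, scaled so E[s(xi)] = 0 and E[s(xi)^2] = 1 for xi ~ N(0, 1)."""
    return (h ** 2 - 1.0) / np.sqrt(2.0)

def conjugate_kernel(X0, weights, sigma):
    """Propagate the columns of X0 through the layers and return the n x n Gram matrix.

    Implements H_l = W_l X_{l-1}, X_l = sigma(H_l) / sqrt(d_l), K = X_L^T X_L.
    """
    X = X0
    for W in weights:
        H = W @ X
        X = sigma(H) / np.sqrt(W.shape[0])
    return X.T @ X

rng = np.random.default_rng(0)
d0, d1, rho = 8, 50_000, 0.6          # illustrative sizes (assumptions)

# Two unit-norm inputs with inner product rho, stored as columns.
X0 = np.zeros((d0, 2))
X0[0, 0] = 1.0
X0[0, 1] = rho
X0[1, 1] = np.sqrt(1.0 - rho ** 2)

W1 = rng.standard_normal((d1, d0))    # i.i.d. N(0, 1) weights
K = conjugate_kernel(X0, [W1], normalized_he2)

print("K[0, 1] =", K[0, 1], " target rho^2 =", rho ** 2)
print("K[0, 0] =", K[0, 0], " target 1")
```

With Gaussian weights and unit-norm inputs, the off-diagonal entry `K[0, 1]` should approach $\rho^2$ and the diagonal should approach 1 as the width grows. Deeper kernels follow by passing more weight matrices in the `weights` list.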

In the context of Dirac/Laplacian operators, the “CK-extension” operator is defined (classical setting) as a Bessel-type functional extension, expressed through Bessel functions, which enables analytic continuation in both ordinary and superspace settings (Adán, 2020).

2. Asymptotic Spectral Laws and Deterministic Equivalents

In the linear-width limit ($n\to\infty$, $n/d_\ell\to\gamma_\ell\in(0,\infty)$), the spectrum of the CK converges to deterministic limits described by random matrix theory (Fan et al., 2020, Chouard, 2023). The empirical spectral distribution (ESD) of the CK at each layer is described via the Marčenko–Pastur (MP) map and free convolution:

  • For initial $X_0^\top X_0$ with limit law $\mu_0$,
  • Inductively, if $X_{\ell-1}^\top X_{\ell-1}$ has limit law $\mu_{\ell-1}$, then

$\mu_\ell = \rho^{\mathrm{MP}}_{\gamma_\ell} \boxtimes \big((1-b_\sigma^2) + b_\sigma^2\,\mu_{\ell-1}\big)$

with $b_\sigma = E_{\xi\sim N(0,1)}[\sigma'(\xi)]$, where $\rho^{\mathrm{MP}}_{\gamma}$ is the MP law with aspect ratio $\gamma$ and $\boxtimes$ denotes free multiplicative convolution.

A deterministic equivalent (DE) for the Stieltjes transform of the CK matrix can be constructed using recursive free multiplicative convolution, with explicit operator norm bounds and polynomial convergence rates for resolvents and transforms (Chouard, 2023).

This approach remains valid under various model generalizations: inclusion of biases, layerwise activations, and i.i.d. or weakly correlated data inputs.
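The objects in this section (the ESD and its Stieltjes transform) can be computed directly for a finite network. The sketch below is illustrative only: it assumes Gaussian weights, unit-norm Gaussian inputs, a tanh activation rescaled by Gauss-Hermite quadrature to meet the normalization $E[\sigma(\xi)]=0$, $E[\sigma(\xi)^2]=1$, and hypothetical sizes `n`, `d0`, `d1`. It evaluates the empirical Stieltjes transform $m(z) = \frac{1}{n}\operatorname{tr}(K - zI)^{-1}$ in two equivalent ways, and it does not implement the deterministic equivalent itself.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gaussian_expectation(f, deg=80):
    """E[f(xi)] for xi ~ N(0, 1), via Gauss-Hermite quadrature (weight exp(-x^2 / 2))."""
    nodes, weights = hermegauss(deg)
    return np.sum(weights * f(nodes)) / np.sqrt(2.0 * np.pi)

# Rescale tanh so that E[sigma] = 0 and E[sigma^2] = 1 under N(0, 1).
mean = gaussian_expectation(np.tanh)
second = gaussian_expectation(lambda x: np.tanh(x) ** 2)
scale = np.sqrt(second - mean ** 2)

def sigma(h):
    return (np.tanh(h) - mean) / scale

rng = np.random.default_rng(0)
n, d0, d1 = 300, 400, 600             # illustrative sizes; n / d1 is the aspect ratio

X0 = rng.standard_normal((d0, n))
X0 /= np.linalg.norm(X0, axis=0)      # unit-norm input columns
W1 = rng.standard_normal((d1, d0))
X1 = sigma(W1 @ X0) / np.sqrt(d1)
K = X1.T @ X1                         # one-hidden-layer CK, n x n

eigs = np.linalg.eigvalsh(K)          # the ESD puts mass 1/n on each eigenvalue

def stieltjes(z):
    """Empirical Stieltjes transform of the ESD at a complex point z."""
    return np.mean(1.0 / (eigs - z))

z = 1.0 + 0.1j
m_from_eigs = stieltjes(z)
m_from_resolvent = np.trace(np.linalg.inv(K - z * np.eye(n))) / n
print(m_from_eigs, m_from_resolvent)
```

A histogram of `eigs` gives the finite-size ESD, and evaluating `stieltjes` on a grid just above the real axis gives a smoothed density to compare against any candidate limit law.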

3. Spectral Phenomena: Bulk, Outliers, and Phase Transitions

The spectrum of the CK is characterized not only by its bulk (support specified by the MP law for light-tailed weights) but also by outlier (spike) eigenvalues linked to data and activation structures (Benigni et al., 2022, Cranston et al., 28 May 2026):

  • Bulk Law: For an analytic, centered activation $\sigma$ and i.i.d. weight and data entries, the empirical law of CK eigenvalues converges to a compactly supported measure (a rescaled MP law when certain order parameters vanish).
  • Outliers: Outliers (“spikes”) emerge due to higher Hermite coefficients of $\sigma$ and non-Gaussian (kurtosis) features of the weights and the data. For pure even-monomial activations, an explicit rank-one spiked “information-plus-noise” approximation describes a BBP-type phase transition, with the spike location governed by the coupling between the activation and the kurtosis.
  • Phase Transitions: The separation between the regime where the top eigenvalue sticks to the bulk and the regime where it detaches as an outlier can be precisely characterized. For example, an outlier separates if and only if a single parameter (depending on the activation, the kurtosis, and the sample ratios) exceeds an explicit threshold. For general activations, the outlier structure is accurately described by quadratic deterministic-equivalent surrogates (Cranston et al., 28 May 2026).

Spiked deterministic equivalents allow the tracking of eigenvector-label alignment (overlap), providing a sharp tool for predicting nonlinear learnability (e.g., in separating XOR data).
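As a point of reference for the “BBP-type” transition mentioned above, the classical Baik-Ben Arous-Péché result for a rank-one spiked sample covariance can be simulated in a few lines. This sketch covers the textbook spiked covariance model, not the CK-specific threshold of the cited papers. The function names and sizes are assumptions. For population covariance $I + \theta e_1 e_1^\top$ and aspect ratio $\gamma = p/n$, the top sample eigenvalue detaches from the MP edge $(1+\sqrt{\gamma})^2$ only when $\theta > \sqrt{\gamma}$, and then sits near $(1+\theta)(1+\gamma/\theta)$.

```python
import numpy as np

def top_sample_eigenvalue(theta, p, n, rng):
    """Largest eigenvalue of a sample covariance with population I + theta * e1 e1^T."""
    Z = rng.standard_normal((p, n))
    Z[0, :] *= np.sqrt(1.0 + theta)   # inflate the variance along the spike direction e1
    S = Z @ Z.T / n
    return np.linalg.eigvalsh(S)[-1]

def bbp_limit(theta, gamma):
    """Classical BBP limit of the top eigenvalue for spike strength theta."""
    if theta > np.sqrt(gamma):
        return (1.0 + theta) * (1.0 + gamma / theta)   # outlier regime
    return (1.0 + np.sqrt(gamma)) ** 2                 # stuck to the MP bulk edge

rng = np.random.default_rng(0)
p, n = 400, 1600                      # illustrative sizes, gamma = 0.25
gamma = p / n

top_strong = top_sample_eigenvalue(2.0, p, n, rng)   # theta above sqrt(gamma)
top_weak = top_sample_eigenvalue(0.2, p, n, rng)     # theta below sqrt(gamma)

print("supercritical:", top_strong, "predicted", bbp_limit(2.0, gamma))
print("subcritical:  ", top_weak, "predicted", bbp_limit(0.2, gamma))
```

In the CK setting the role of $\theta$ is played by a model-dependent parameter, so only the qualitative picture (a threshold, then a detached eigenvalue) carries over from this toy model.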

4. Role in Learning and Generalization

CK defines the function class efficiently learnable by gradient-based optimization. Standard stochastic gradient descent (SGD) on randomly initialized deep nets outputs predictors that are competitive (in population loss) with the best function in the RKHS defined by the CK, for networks of up to logarithmic depth and a rich class of architectures (Daniely, 2017).

Key results include:

  • Polynomial-time learning for constant-degree polynomials via SGD on shallow nets with suitable activation.
  • Universal approximation: For any continuous target function, SGD on a sufficiently wide/deep network learns it to a prescribed accuracy, although the required size and number of steps are no longer guaranteed to be polynomial.
  • Optimization/feature stability: In these regimes, non-final-layer weights remain close to their initializations, keeping the feature map stable and justifying the CK approximation.
  • NTK vs CK: The NTK “full-Jacobian” kernel always subsumes the CK in function space. However, for smooth targets, the performance gap (test loss) between NTK and CK is at most a constant factor. In some regimes, CK generalizes better due to improved conditioning and regularity of features (Qadeer et al., 2023).
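The relation between the two kernels in the last item can be made concrete for a one-hidden-layer network $f(x) = \frac{1}{\sqrt{d}}\, a^\top \sigma(Wx)$. The sketch below is a hedged illustration with assumed sizes, variable names, and an unnormalized tanh activation. It builds the Jacobian blocks with respect to the output weights $a$ and the hidden weights $W$. The first block reproduces the CK, and the NTK is the CK plus a positive semidefinite term coming from the hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d0, d = 20, 5, 50                       # illustrative sizes (assumptions)

X0 = rng.standard_normal((d0, n))          # inputs as columns
W = rng.standard_normal((d, d0))           # hidden-layer weights
a = rng.standard_normal(d)                 # output weights

H = W @ X0                                 # pre-activations, shape (d, n)
act = np.tanh(H)
dact = 1.0 - act ** 2                      # derivative of tanh

# Jacobian w.r.t. a: row j is sigma(W x_j) / sqrt(d).
J_a = act.T / np.sqrt(d)                   # shape (n, d)

# Jacobian w.r.t. W: row j is vec((a * sigma'(W x_j)) x_j^T) / sqrt(d).
G = a[:, None] * dact                      # shape (d, n)
J_W = np.einsum("ij,kj->jik", G, X0).reshape(n, d * d0) / np.sqrt(d)

K_ck = J_a @ J_a.T                         # last-layer features only: the CK
K_ntk = K_ck + J_W @ J_W.T                 # full Jacobian: the NTK

gap = K_ntk - K_ck
gap_closed_form = (X0.T @ X0) * (G.T @ G) / d   # Hadamard product form of the W-block

print("min eigenvalue of NTK - CK:", np.linalg.eigvalsh(gap).min())
```

Because the difference is a Gram matrix of hidden-layer gradients, it is positive semidefinite by construction. The same block structure holds layer by layer in deeper networks.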

5. Heavy-Tailed Regimes and Non-Gaussian Theory

When the weights $W$ in the CK ensemble are drawn from heavy-tailed distributions (e.g., symmetric $\alpha$-stable or sparse heavy-tailed laws), the spectral law of CK deviates markedly from the MP paradigm (Guionnet et al., 25 Feb 2025). Higher-order cumulants persist in the limit, allowing additional graph structures (“cactus graphs”) in moment expansions. As a consequence, random features extracted via CK exhibit:

  • Clustered correlations across large swaths of the feature space,
  • Spectral mass away from MP bulk,
  • New regimes for generalization (potential explicit tuning via architecture design),
  • Impacts on kernel ridge regression predictions and NTK linearized training.

This theory extends the scope of random matrix modeling in modern networks, accurately describing regimes with non-Gaussian heavy-tailed initialization or data.

6. Efficient Computation and Applications in Modern ML

The CK is a computationally efficient surrogate for full NTK-based kernel learning. In finite-width regimes, CK-based regression and classification nearly match NTK performance and sometimes yield superior generalization:

  • Efficient extraction: The CK can be computed by a single forward pass and a small-dimension linear solve with last-layer activations, unlike the NTK, which requires assembling a full Gram matrix from the network Jacobian.
  • CK-based linear probing: In large-scale transfer settings (e.g., with GPT-2, Falcon models), retraining only the last layer via CK achieves 90–96% of full fine-tuning accuracy at drastically reduced computational/memory cost.
  • Applications: Physics-informed operator learning, convolutional neural network classification, and reweighted loss optimization demonstrate the versatility of CK-based surrogates (Qadeer et al., 2023).

Smooth activations (e.g., Tanh, erf) ensure Lipschitz or monotonic test residuals, maximizing the approximation quality of the CK relative to NTK.
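The “single forward pass plus small linear solve” recipe from the list above can be sketched as ridge regression on last-layer features. This is a minimal illustration under assumptions: a random one-hidden-layer tanh feature map stands in for a trained network's last hidden layer, and the names `last_layer_features`, `lam`, and the toy target are hypothetical. It shows that the $d\times d$ primal solve on features gives the same predictions as the $n\times n$ kernel ridge solve with the CK Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d0, d = 200, 50, 10, 64   # illustrative sizes (assumptions)
lam = 1e-3                                 # ridge parameter (assumption)

X_train = rng.standard_normal((d0, n_train))   # inputs as columns
X_test = rng.standard_normal((d0, n_test))
y_train = np.sin(X_train[0])                   # toy regression target

# Random hidden layer, scaled by 1 / sqrt(d0) because inputs are not unit-norm here.
W = rng.standard_normal((d, d0)) / np.sqrt(d0)

def last_layer_features(X):
    """One forward pass: scaled last-layer activations, shape (d, n)."""
    return np.tanh(W @ X) / np.sqrt(d)

F_train = last_layer_features(X_train)
F_test = last_layer_features(X_test)

# Primal form: a d x d solve, cheap when the width d is small compared to n.
w = np.linalg.solve(F_train @ F_train.T + lam * np.eye(d), F_train @ y_train)
pred_primal = F_test.T @ w

# Dual form: kernel ridge regression with the n x n CK Gram matrix.
K = F_train.T @ F_train
alpha = np.linalg.solve(K + lam * np.eye(n_train), y_train)
pred_dual = (F_test.T @ F_train) @ alpha

print("max difference:", np.max(np.abs(pred_primal - pred_dual)))
```

The two forms agree up to floating-point error, so the cheaper of the two solves can be chosen according to whether the feature width or the sample count is smaller.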

7. Classical and Superspace CK-Extensions

In analysis and mathematical physics, CK-extensions solve systems such as the bi-axial Dirac equation by analytic continuation:

  • Classical case: The CK-operator is given by a Bessel-type power series.
  • Superspace: When the superdimension is a negative even integer, the CK-extension splits into a finite Appell-polynomial series and an infinite Bessel series, acting on two distinct sets of initial superfunctions.
  • Integral representations: CK-extensions correspond to (normalized) integrals of plane waves over the (super)sphere, establishing connections to the Pizzetti formula for spherical integration and enabling explicit decompositions in supersymmetric PDEs (Adán, 2020).

This generalizes covariance and propagation mechanisms from neural random feature maps to spectral methods in mathematical analysis.


References:

  • "Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks" (Fan et al., 2020)
  • "SGD Learns the Conjugate Kernel Class of the Network" (Daniely, 2017)
  • "Global law of conjugate kernel random matrices with heavy-tailed weights" (Guionnet et al., 25 Feb 2025)
  • "Largest Eigenvalues of the Conjugate Kernel of Single-Layered Neural Networks" (Benigni et al., 2022)
  • "Deterministic equivalent of the Conjugate Kernel matrix associated to Artificial Neural Networks" (Chouard, 2023)
  • "Eigen-Spike Emergence and Quadratic Equivalents for Conjugate Kernels on Nonlinearly Separable Data" (Cranston et al., 28 May 2026)
  • "Efficient kernel surrogates for neural network-based regression" (Qadeer et al., 2023)
  • "Generalized Cauchy-Kovalevskaya extension and plane wave decompositions in superspace" (Adán, 2020)
