Conjugate Kernel: Theory & Applications
- The Conjugate Kernel (CK) is defined as the covariance (Gram) matrix of neural network feature maps, revealing the structure and expressivity of each layer.
- Random matrix theory yields deterministic equivalents and spectral laws for the CK, uncovering phase transitions and outlier eigenvalues.
- The CK offers computational efficiency in kernel learning and has broad applications in neural generalization, operator extensions in PDEs, and superspace analysis.
The Conjugate Kernel (CK) is a central mathematical object in the theory of neural networks, kernel machines, and partial differential equations. In deep learning, CK arises as the covariance (Gram) matrix of the feature map induced by a randomly initialized, feed-forward neural network, capturing the structure and expressivity of the network’s layers. In classical and superspace analysis, the CK-operator encodes Bessel-type functional extensions and plane wave decompositions associated with Dirac and Laplace operators. The modern theory of CK connects random matrix theory, kernel learning, and representations of high-dimensional data, providing key insights into both the spectral properties and learnability of neural architectures.
1. Formal Definitions and Constructions
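The Conjugate Kernel associated to a neural network is defined by propagating input vectors through a sequence of linear and nonlinear maps, capturing the feature covariance at each layer. For a fully-connected network with $L$ hidden layers, weight matrices $W_\ell \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$, activation $\sigma$ with $\mathbb{E}[\sigma(\xi)] = 0$, $\mathbb{E}[\sigma(\xi)^2] = 1$ for $\xi \sim \mathcal{N}(0,1)$, and inputs $X_0 \in \mathbb{R}^{d_0 \times n}$ (one sample per column): $X_\ell = \frac{1}{\sqrt{d_\ell}}\,\sigma(W_\ell X_{\ell-1})$ for $\ell = 1, \dots, L$, applying $\sigma$ entrywise, so that each column is propagated independently. The Gram (CK) matrix at layer $\ell$ is
$$K_\ell = X_\ell^\top X_\ell \in \mathbb{R}^{n \times n}.$$
The output-layer CK is $K^{\mathrm{CK}} = X_L^\top X_L$. In the “skeleton” formalism (Daniely, 2017), the CK is recursively built from a computation skeleton by propagating “conjugate activations”
$$\hat{\sigma}(\rho) = \mathbb{E}_{(u,v) \sim \mathcal{N}_\rho}\left[\sigma(u)\,\sigma(v)\right]$$
(with $\mathcal{N}_\rho$ the centered bivariate Gaussian with unit variances and correlation $\rho$) together with composition/averaging rules.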
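The layerwise construction can be sketched numerically. The snippet below is a minimal sketch, assuming the common convention $X_\ell = \sigma(W_\ell X_{\ell-1})/\sqrt{d_\ell}$ with i.i.d. standard Gaussian weights and unit-norm input columns; the function name `conjugate_kernels` and its parameters are illustrative, and `tanh` is used as is rather than rescaled to unit Gaussian second moment.

```python
import numpy as np

def conjugate_kernels(X0, widths, sigma=np.tanh, seed=0):
    """Return the CK matrices X_l^T X_l for l = 1..L.

    X0     : (d0, n) array, one input per column.
    widths : hidden widths [d1, ..., dL] (illustrative parameter name).
    sigma  : activation, applied entrywise.
    """
    rng = np.random.default_rng(seed)
    X, kernels = X0, []
    for d in widths:
        W = rng.standard_normal((d, X.shape[0]))  # i.i.d. N(0, 1) weights (assumed)
        X = sigma(W @ X) / np.sqrt(d)             # X_l = sigma(W_l X_{l-1}) / sqrt(d_l)
        kernels.append(X.T @ X)                   # n x n Gram matrix of layer-l features
    return kernels

# Usage: 20 unit-norm inputs in dimension 50, two hidden layers of width 300.
rng = np.random.default_rng(1)
X0 = rng.standard_normal((50, 20))
X0 /= np.linalg.norm(X0, axis=0)
K1, K2 = conjugate_kernels(X0, [300, 300])
```

Each returned matrix is symmetric positive semidefinite by construction, and the last one plays the role of the output-layer CK.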
In the context of Dirac/Laplacian operators, the “CK-extension” operator is defined (classical setting) as a Bessel-type functional extension, expressed through Bessel functions of the first kind $J_\nu$, enabling analytic continuation in both ordinary and superspace settings (Adán, 2020).
2. Asymptotic Spectral Laws and Deterministic Equivalents
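In the linear-width limit ($n, d_1, \dots, d_L \to \infty$ with $n/d_\ell \to \gamma_\ell \in (0, \infty)$), the spectrum of the CK converges to deterministic limits described by random matrix theory (Fan et al., 2020, Chouard, 2023). The empirical spectral distribution (ESD) of the CK at each layer is described via the Marčenko–Pastur (MP) map and free convolution:
- For initial $X_0^\top X_0$ with limit law $\mu_0$,
- Inductively, if $X_{\ell-1}^\top X_{\ell-1}$ has limit law $\mu_{\ell-1}$, then $X_\ell^\top X_\ell$ has limit law
$$\mu_\ell = \rho^{\mathrm{MP}}_{\gamma_\ell} \boxtimes \left((1 - b_\sigma^2) + b_\sigma^2 \cdot \mu_{\ell-1}\right),$$
with $b_\sigma = \mathbb{E}[\sigma'(\xi)]$, $\xi \sim \mathcal{N}(0,1)$.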
A deterministic equivalent (DE) for the Stieltjes transform of the CK matrix can be constructed using recursive free multiplicative convolution, with explicit operator norm bounds and polynomial convergence rates for resolvents and transforms (Chouard, 2023).
This approach remains valid under various model generalizations: inclusion of biases, layerwise activations, and i.i.d. or weakly correlated data inputs.
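The activation enters such spectral recursions only through a few Gaussian averages. As a rough sketch, assuming the usual linear-width parameter $b_\sigma = \mathbb{E}[\sigma'(\xi)]$ with $\xi \sim \mathcal{N}(0,1)$ and an activation normalized to zero Gaussian mean and unit second moment, these averages can be approximated by Gauss-Hermite quadrature. All function names below are illustrative, and `linearized_eigs` reflects the assumed reading that $(1 - b_\sigma^2) + b_\sigma^2 \cdot \mu$ is the law of $(1 - b_\sigma^2) + b_\sigma^2 \lambda$ for $\lambda \sim \mu$.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

_NODES, _WEIGHTS = hermegauss(64)           # quadrature for the weight exp(-x^2 / 2)
_WEIGHTS = _WEIGHTS / np.sqrt(2 * np.pi)    # rescale to the N(0, 1) density

def gaussian_mean(g):
    """Approximate E[g(xi)] for xi ~ N(0, 1)."""
    return float(np.sum(_WEIGHTS * g(_NODES)))

def normalize_activation(sigma):
    """Shift and scale sigma so that E[sigma] = 0 and E[sigma^2] = 1."""
    mean = gaussian_mean(sigma)
    std = np.sqrt(gaussian_mean(lambda x: (sigma(x) - mean) ** 2))
    return lambda x: (sigma(x) - mean) / std

def b_sigma(sigma):
    """E[sigma'(xi)], computed as E[xi * sigma(xi)] (Gaussian integration by parts)."""
    return gaussian_mean(lambda x: x * sigma(x))

def linearized_eigs(eigs, b):
    """Push eigenvalues through lam -> (1 - b^2) + b^2 * lam."""
    return (1 - b ** 2) + b ** 2 * np.asarray(eigs)

tanh_normalized = normalize_activation(np.tanh)
b = b_sigma(tanh_normalized)   # strictly below 1 for a nonlinear activation
```

For a linear activation the parameter equals 1 and the affine map is the identity; the more nonlinear the activation, the more spectral mass is pulled toward the constant $1 - b_\sigma^2$.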
3. Spectral Phenomena: Bulk, Outliers, and Phase Transitions
The spectrum of the CK is characterized not only by its bulk (support specified by the MP law for light-tailed weights) but also by outlier (spike) eigenvalues linked to data and activation structures (Benigni et al., 2022, Cranston et al., 28 May 2026):
- Bulk Law: For analytic, centered $\sigma$ and i.i.d. entries of $W$ and $X$, the empirical law of CK eigenvalues converges to a compactly supported measure (a rescaled MP law when certain order parameters vanish).
- Outliers: Outliers (“spikes”) emerge due to higher Hermite coefficients of $\sigma$ and non-Gaussian (kurtosis) features of $W$, $X$. For pure even-monomial activations, an explicit rank-one spiked “information-plus-noise” approximation describes a BBP-type (Baik, Ben Arous, Péché) phase transition, with the spike location governed by the coupling of the activation and the kurtosis.
- Phase Transitions: The separation between the regime where the top eigenvalue sticks to the bulk and the regime where it detaches as an outlier can be precisely characterized. For example, an outlier separates if and only if a parameter quantifying the activation, kurtosis, and sample ratios exceeds an explicit threshold. For general activations, the outlier structure is accurately described by quadratic deterministic-equivalent surrogates (Cranston et al., 28 May 2026).
Spiked deterministic equivalents allow the tracking of eigenvector-label alignment (overlap), providing a sharp tool for predicting nonlinear learnability (e.g., in separating XOR data).
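The BBP mechanism itself is easiest to see in the classical spiked covariance model, which is used here as a stand-in and is not the CK ensemble. In that model the population covariance is $I + \theta e_1 e_1^\top$, the aspect ratio is $\gamma = p/n$, and the top sample eigenvalue detaches from the MP edge $(1 + \sqrt{\gamma})^2$ exactly when $\theta > \sqrt{\gamma}$. The function names below are illustrative.

```python
import numpy as np

def bbp_limit(gamma, theta):
    """Limiting top eigenvalue for population covariance I + theta * e1 e1^T."""
    if theta <= np.sqrt(gamma):
        return (1 + np.sqrt(gamma)) ** 2        # sticks to the MP bulk edge
    return (1 + theta) * (1 + gamma / theta)    # separates as an outlier

def top_sample_eigenvalue(p, n, theta, seed=0):
    """Largest eigenvalue of the sample covariance of n spiked Gaussian samples."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((p, n))
    Z[0] *= np.sqrt(1 + theta)                  # plant the rank-one spike
    return float(np.linalg.eigvalsh(Z @ Z.T / n)[-1])

p, n = 300, 1200                                # gamma = 0.25, threshold sqrt(gamma) = 0.5
for theta in (0.1, 2.0):
    print(theta, top_sample_eigenvalue(p, n, theta), bbp_limit(p / n, theta))
```

Below the threshold the empirical top eigenvalue stays near the bulk edge; above it, the eigenvalue moves to the predicted outlier location, up to finite-size fluctuations.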
4. Role in Learning and Generalization
The CK defines a function class that is efficiently learnable by gradient-based optimization. Standard stochastic gradient descent (SGD) on randomly initialized deep nets outputs predictors that are competitive (in population loss) with the best function in the reproducing kernel Hilbert space (RKHS) defined by the CK, for networks of up to logarithmic depth and a rich class of architectures (Daniely, 2017).
Key results include:
- Polynomial-time learning for constant-degree polynomials via SGD on shallow nets with suitable activation.
- Universal approximation: For arbitrary continuous target functions, SGD on sufficiently wide and deep networks learns to a prescribed accuracy (with a more general, non-polynomial dependence on network size and number of steps).
- Optimization/feature stability: In these regimes, non-final-layer weights remain close to their initializations, keeping the feature map stable and justifying the CK approximation.
- NTK vs CK: The NTK “full-Jacobian” kernel always subsumes the CK in function space. However, for smooth targets, the performance gap (test loss) between NTK and CK is at most a constant factor. In some regimes, CK generalizes better due to improved conditioning and regularity of features (Qadeer et al., 2023).
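The relation between the two kernels can be made concrete on a toy model. The sketch below assumes a two-layer network $f(x) = a^\top \tanh(Wx)/\sqrt{d}$ (an illustrative choice, not a setup taken from the cited works): the gradient with respect to the last layer $a$ gives the CK, the gradient with respect to $W$ adds a second positive semidefinite term, and the NTK is their sum.

```python
import numpy as np

def two_layer_kernels(X, W, a):
    """Return (CK, NTK) for f(x) = a^T tanh(W x) / sqrt(d) on the columns of X."""
    d = W.shape[0]
    pre = W @ X                                        # (d, n) pre-activations
    phi = np.tanh(pre) / np.sqrt(d)                    # df/da: last-layer features
    ck = phi.T @ phi
    dphi = a[:, None] * (1 - np.tanh(pre) ** 2) / np.sqrt(d)
    hidden = (dphi.T @ dphi) * (X.T @ X)               # contribution of df/dW
    return ck, ck + hidden

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
W, a = rng.standard_normal((16, 5)), rng.standard_normal(16)
ck, ntk = two_layer_kernels(X, W, a)
print(np.linalg.eigvalsh(ntk - ck).min())              # nonnegative up to rounding
```

Because the difference `ntk - ck` is positive semidefinite (a Hadamard product of two Gram matrices), the RKHS of the CK is contained in that of the NTK for this model.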
5. Heavy-Tailed Regimes and Non-Gaussian Theory
When the weights $W$ in the CK ensemble are drawn from heavy-tailed distributions (e.g., symmetric $\alpha$-stable, sparse heavy), the spectral law of CK deviates markedly from the MP paradigm (Guionnet et al., 25 Feb 2025). Higher-order cumulants of the weights persist, allowing additional graph structures (“cactus graphs”) in moment expansions. As a consequence, random features extracted via CK exhibit:
- Clustered correlations across large swaths of the feature space,
- Spectral mass away from MP bulk,
- New regimes for generalization (potential explicit tuning via architecture design),
- Impacts on kernel ridge regression predictions and NTK linearized training.
This theory extends the scope of random matrix modeling in modern networks, accurately describing regimes with non-Gaussian heavy-tailed initialization or data.
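A heavy-tailed CK ensemble can be simulated directly. The sketch below draws symmetric $\alpha$-stable weights with the Chambers-Mallows-Stuck method and forms a one-layer CK with a bounded activation. The $d_0^{-1/\alpha}$ weight scaling and the choice of `tanh` are assumptions made for illustration, not the normalization of any particular paper, and the function names are hypothetical.

```python
import numpy as np

def symmetric_stable(alpha, size, rng):
    """Symmetric alpha-stable samples (0 < alpha <= 2), Chambers-Mallows-Stuck method."""
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)
    E = rng.exponential(1.0, size)
    return (np.sin(alpha * U) / np.cos(U) ** (1 / alpha)
            * (np.cos((1 - alpha) * U) / E) ** ((1 - alpha) / alpha))

def heavy_tailed_ck(X0, d, alpha, seed=0):
    """One-layer CK with alpha-stable weights; X0 has one input per column."""
    rng = np.random.default_rng(seed)
    d0 = X0.shape[0]
    W = symmetric_stable(alpha, (d, d0), rng) / d0 ** (1 / alpha)  # assumed scaling
    Phi = np.tanh(W @ X0) / np.sqrt(d)
    return Phi.T @ Phi

rng = np.random.default_rng(0)
X0 = rng.standard_normal((100, 50))
eigs_heavy = np.linalg.eigvalsh(heavy_tailed_ck(X0, 400, alpha=1.5))
eigs_gauss = np.linalg.eigvalsh(heavy_tailed_ck(X0, 400, alpha=2.0))
```

With `alpha=2.0` the sampler returns Gaussian weights of variance 2, so comparing histograms of `eigs_heavy` and `eigs_gauss` gives a quick visual check of how far the spectrum moves away from the Gaussian case.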
6. Efficient Computation and Applications in Modern ML
The CK is a computationally efficient surrogate for full NTK-based kernel learning. In finite-width regimes, CK-based regression and classification nearly match NTK performance and sometimes yield superior generalization:
- Efficient extraction: The CK can be computed by a single forward pass and a small-dimension linear solve with last-layer activations, unlike the NTK, which requires assembling the Gram matrix of the full parameter Jacobian.
- CK-based linear probing: In large-scale transfer settings (e.g., with GPT-2, Falcon models), retraining only the last layer via CK achieves 90–96% of full fine-tuning accuracy at drastically reduced computational/memory cost.
- Applications: Physics-informed operator learning, convolutional neural network classification, and reweighted loss optimization demonstrate the versatility of CK-based surrogates (Qadeer et al., 2023).
Smooth activations (e.g., Tanh, erf) ensure Lipschitz or monotonic test residuals, maximizing the approximation quality of the CK relative to NTK.
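The "forward pass plus small linear solve" recipe amounts to ridge regression on frozen last-layer features. The following is a minimal sketch under stated assumptions: a single random `tanh` hidden layer stands in for the network, the regularization strength `lam` is arbitrary, and the function names are illustrative.

```python
import numpy as np

def last_layer_features(X, W):
    """Frozen random features tanh(W x) / sqrt(d), one column per input."""
    return np.tanh(W @ X) / np.sqrt(W.shape[0])

def ck_ridge_fit(Phi, y, lam=1e-2):
    """Solve the d x d system (Phi Phi^T + lam I) w = Phi y."""
    d = Phi.shape[0]
    return np.linalg.solve(Phi @ Phi.T + lam * np.eye(d), Phi @ y)

def ck_ridge_predict(Phi_test, w):
    return Phi_test.T @ w

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 4))
X_train, X_test = rng.standard_normal((4, 100)), rng.standard_normal((4, 10))
y_train = np.sin(X_train[0])
w = ck_ridge_fit(last_layer_features(X_train, W), y_train)
y_pred = ck_ridge_predict(last_layer_features(X_test, W), w)
```

The solve involves a matrix of size width by width rather than samples by samples, yet it yields the same predictions as kernel ridge regression with the CK Gram matrix $\Phi^\top \Phi$.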
7. Classical and Superspace CK-Extensions
In analysis and mathematical physics, CK-extensions (where CK abbreviates Cauchy-Kovalevskaya; Adán, 2020) solve systems such as the bi-axial Dirac equation by analytic continuation of initial data:
- Classical case: The CK-operator is a Bessel-type power series.
- Superspace: With superdimension $M = m - 2n$, the CK-extension splits (when $M$ is a negative even integer) into a finite Appell-polynomial series and an infinite Bessel series, acting on two distinct sets of initial superfunctions.
- Integral representations: CK-extensions correspond to (normalized) integrals of plane waves over the (super)sphere, establishing connections to the Pizzetti formula for spherical integration and enabling explicit decompositions in supersymmetric PDEs (Adán, 2020).
This generalizes covariance and propagation mechanisms from neural random feature maps to spectral methods in mathematical analysis.
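The Pizzetti formula mentioned above can be checked on polynomials in the purely bosonic case. The sketch below assumes the classical statement $\int_{S^{m-1}} f \, d\sigma = \sum_{k \ge 0} \frac{2\pi^{m/2}}{4^k \, k! \, \Gamma(k + m/2)} (\Delta^k f)(0)$ and compares it with the closed form for monomial integrals over the unit sphere. Polynomials are stored as dictionaries from exponent tuples to coefficients, an illustrative representation.

```python
from math import factorial, gamma, pi

def laplacian(poly):
    """Laplacian of a polynomial given as {exponent tuple: coefficient}."""
    out = {}
    for exps, c in poly.items():
        for i, e in enumerate(exps):
            if e >= 2:
                new = exps[:i] + (e - 2,) + exps[i + 1:]
                out[new] = out.get(new, 0.0) + c * e * (e - 1)
    return out

def pizzetti(poly, m):
    """Integral of poly over the unit sphere S^{m-1} via the Pizzetti series."""
    total, k = 0.0, 0
    while poly:
        value_at_zero = poly.get((0,) * m, 0.0)      # (Delta^k f)(0)
        total += 2 * pi ** (m / 2) / (4 ** k * factorial(k) * gamma(k + m / 2)) * value_at_zero
        poly, k = laplacian(poly), k + 1
    return total

def monomial_sphere_integral(exps):
    """Closed form for the integral of x^exps over the unit sphere."""
    if any(e % 2 for e in exps):
        return 0.0
    num = 2.0
    for e in exps:
        num *= gamma((e + 1) / 2)
    return num / gamma((sum(exps) + len(exps)) / 2)

print(pizzetti({(2, 2, 0): 1.0}, 3), monomial_sphere_integral((2, 2, 0)))
```

The series terminates for polynomials because repeated Laplacians eventually vanish, which is why the formula reduces spherical integration to differentiation at the origin.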
References:
- "Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks" (Fan et al., 2020)
- "SGD Learns the Conjugate Kernel Class of the Network" (Daniely, 2017)
- "Global law of conjugate kernel random matrices with heavy-tailed weights" (Guionnet et al., 25 Feb 2025)
- "Largest Eigenvalues of the Conjugate Kernel of Single-Layered Neural Networks" (Benigni et al., 2022)
- "Deterministic equivalent of the Conjugate Kernel matrix associated to Artificial Neural Networks" (Chouard, 2023)
- "Eigen-Spike Emergence and Quadratic Equivalents for Conjugate Kernels on Nonlinearly Separable Data" (Cranston et al., 28 May 2026)
- "Efficient kernel surrogates for neural network-based regression" (Qadeer et al., 2023)
- "Generalized Cauchy-Kovalevskaya extension and plane wave decompositions in superspace" (Adán, 2020)