Learnable Kernel Functions
- Learnable kernel functions are parameterized PSD kernels optimized from data, enabling the construction of an adaptive reproducing kernel Hilbert space.
- They integrate multiple kernel learning, Bayesian spectral methods, and neural parameterizations to yield flexible and scalable modeling.
- Their applications span clustering, regression, and domain adaptation, supported by theoretical guarantees of approximation and statistical generalization.
A learnable kernel function is a parameterized positive semidefinite (PSD) kernel whose parameters are adapted during data-driven optimization, inducing an adaptive reproducing kernel Hilbert space (RKHS) rather than relying on a fixed similarity measure such as the classical Gaussian or polynomial kernel. Learnable kernels unify and generalize approaches from multiple kernel learning (MKL), Bayesian nonparametric inference in the kernel space, spectral methods, neural-parameterized similarity functions, and localized/adaptive techniques. Their theoretical and algorithmic development underpins recent advances in flexible, scalable, and domain-adaptive kernel-based learning.
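As a minimal illustration of this definition (a generic sketch, not the procedure of any cited work; the function names and candidate bandwidth grid are illustrative), even the bandwidth of a Gaussian kernel can be treated as a learnable parameter, selected by minimizing the validation loss of kernel ridge regression:

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 * bandwidth^2))."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def fit_krr(K_train, y_train, lam=1e-2):
    """Kernel ridge regression dual coefficients alpha = (K + lam I)^{-1} y."""
    n = K_train.shape[0]
    return np.linalg.solve(K_train + lam * np.eye(n), y_train)

# Toy 1-D regression data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(200)
Xtr, ytr, Xva, yva = X[:150], y[:150], X[150:], y[150:]

# "Learn" the kernel parameter: pick the bandwidth minimizing validation error.
best = None
for bw in [0.05, 0.1, 0.3, 1.0, 3.0]:
    alpha = fit_krr(rbf_kernel(Xtr, Xtr, bw), ytr)
    pred = rbf_kernel(Xva, Xtr, bw) @ alpha
    err = np.mean((pred - yva) ** 2)
    if best is None or err < best[1]:
        best = (bw, err)
print(f"selected bandwidth={best[0]}, validation MSE={best[1]:.4f}")
```

The frameworks surveyed below replace this one-dimensional selection by optimization over far richer kernel parameterizations.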
1. Mathematical Formulations of Learnable Kernel Functions
A learnable kernel function is generally specified as a symmetric PSD (or, in recent extensions, possibly asymmetric) function whose form or parameters are learned from data. Representative architectures include:
- Finite/Infinite Combinations: $k_\theta(x, x') = \sum_{j} \mu_j\, k_j(x, x')$ with $\mu_j \ge 0$ (as in MKL (Dinuzzo, 2010)), or more generally any convex or conic combination of base kernels (see the sketch following this list).
- Functional Parametrization: $k_\theta(x, x') = \langle \phi_\theta(x), \phi_\theta(x') \rangle$ defined via a feature map $\phi_\theta$, with $\phi_\theta$ typically being a neural network or a random-feature model whose weights/activation classes are trainable (Ren et al., 2021, Ma et al., 17 Oct 2025).
- Spectral Parameterizations: For stationary kernels on $\mathbb{R}^d$, Bochner's theorem gives $k(x - x') = \int_{\mathbb{R}^d} e^{i\omega^\top (x - x')}\, p(\omega)\, d\omega$ with a learnable spectral density $p(\omega)$ (Oliva et al., 2015, Yang et al., 2014, Benton et al., 2019).
- Structured Forms: Quadratic forms parameterized by a learnable PSD matrix $M \succeq 0$ (possibly sample-dependent) (Lu et al., 2021), or tessellation-based universal kernel parameterizations linearly indexed by positive matrices (Colbert et al., 2017).
- Localized/Adaptive Kernels: combinations $k(x, x') = \sum_j g_j(x)\, g_j(x')\, k_j(x, x')$ with input-dependent local weights $g_j$ (Moeller et al., 2016), or locally-adaptive bandwidth (LAB) kernels with trainable, data-dependent kernel widths (He et al., 2023).
- Neural/Transformer Kernels: Explicit learnable similarity measures in architectures such as ReBased (parameterized, normalized quadratic feature maps), replacing fixed dot-product attention (Aksenov et al., 16 Feb 2024).
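Each of these constructions enforces positive semidefiniteness by design. The sketch below (assuming NumPy; function names and parameter values are illustrative, not drawn from the cited papers) builds a conic combination of base kernels and an explicit feature-map kernel, then checks PSD numerically via eigenvalues:

```python
import numpy as np

def linear_kernel(X, Y):
    return X @ Y.T

def rbf_kernel(X, Y, bw=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bw ** 2))

def conic_combination(X, weights, base_kernels):
    """MKL-style kernel: sum_j mu_j * k_j(X, X); nonnegative weights preserve PSD."""
    assert np.all(np.asarray(weights) >= 0)
    return sum(w * k(X, X) for w, k in zip(weights, base_kernels))

def feature_map_kernel(X, W, b):
    """k(x, x') = <phi(x), phi(x')> with an explicit (in principle trainable) feature map phi."""
    phi = np.tanh(X @ W + b)
    return phi @ phi.T

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))

K_mkl = conic_combination(X, [0.7, 0.3], [linear_kernel, rbf_kernel])
K_phi = feature_map_kernel(X, rng.standard_normal((3, 64)), rng.standard_normal(64))

for name, K in [("conic combination", K_mkl), ("feature map", K_phi)]:
    min_eig = np.linalg.eigvalsh((K + K.T) / 2).min()
    print(f"{name}: smallest eigenvalue = {min_eig:.2e}")  # >= 0 up to round-off
```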
In all settings, the learnable nature of the kernel derives from the (often high-dimensional, possibly non-convex) optimization over the kernel parameters $\theta$, carried out jointly with supervised or unsupervised objectives such as SVM margins, GP marginal likelihoods, reconstruction loss in autoencoders, domain adaptation metrics, or cross-entropy in neural architectures.
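A minimal sketch of such joint optimization (assuming PyTorch; a generic illustration rather than any specific cited algorithm): per-dimension log-bandwidths of a Gaussian kernel are trained by gradient descent together with the dual coefficients of a regularized least-squares predictor.

```python
import torch

torch.manual_seed(0)
X = torch.rand(128, 2) * 4 - 2
y = torch.sin(3 * X[:, 0]) * torch.cos(X[:, 1]) + 0.05 * torch.randn(128)

log_bw = torch.zeros(2, requires_grad=True)      # learnable per-dimension bandwidths
alpha = torch.zeros(128, requires_grad=True)     # dual coefficients of the predictor
opt = torch.optim.Adam([log_bw, alpha], lr=0.05)

def kernel(A, B, log_bw):
    # ARD Gaussian kernel with learnable per-dimension bandwidths.
    scale = torch.exp(log_bw)
    d = ((A[:, None, :] - B[None, :, :]) / scale).pow(2).sum(-1)
    return torch.exp(-0.5 * d)

lam = 1e-3
for step in range(500):
    K = kernel(X, X, log_bw)
    pred = K @ alpha
    loss = ((pred - y) ** 2).mean() + lam * alpha @ (K @ alpha)   # data fit + RKHS-norm penalty
    opt.zero_grad()
    loss.backward()
    opt.step()

print("learned bandwidths:", torch.exp(log_bw).detach().numpy())
```

The same pattern applies when the squared-error objective is replaced by a margin, marginal-likelihood, or cross-entropy criterion.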
2. Major Architectures and Algorithmic Strategies
The primary classes of learnable kernel frameworks can be distinguished as follows:
- Multiple Kernel Learning (MKL) and Two-Layer Kernels: The objective is joint optimization over convex combinations of basis kernels. The two-layer RKHS representer theorem guarantees that the learned function remains a finite expansion in the training points, with the kernel parameters corresponding to secondary weights. The RLS2 algorithm realizes efficient block-coordinate solutions for regression/classification, with feature selection arising in the linear base-kernel case (Dinuzzo, 2010).
- Spectral Learning and Bayesian Nonparametrics: Approaches such as BaNK (Oliva et al., 2015) and Functional Kernel Learning (Benton et al., 2019) exploit Bochner's theorem to place nonparametric (mixture or GP) priors over the spectral domain, allowing the kernel's spectrum to adaptively match empirical statistics. Inference leverages random Fourier features, MCMC/variational updates, and slice sampling; a random-feature sketch of this spectral parameterization appears after this list.
- Localized and Sample-Dependent Kernels: LKL (Moeller et al., 2016) allows kernel weights to be functions of the input, optimized using alternating block-coordinate descent. LAB RBF kernels introduce per-sample bandwidths (possibly per-dimension) with optimization via KRR-like closed forms and outer SGD or dynamic support selection (He et al., 2023). PDQK (Lu et al., 2021) enables general PSD quadratic kernel parameterization for domain adaptation, optimizing over the symmetric matrix using Riemannian methods.
- Nonlinear/Neural Parameterizations: DAE-PCA (Ren et al., 2021) interprets the encoder of a deep autoencoder as an explicit data-dependent feature map $\phi_\theta$, with the kernel defined by inner products of the learned features. In transformers, all-pairs softmax attention has been replaced by trainable kernel similarity functions, e.g., the ReBased quadratic parameterization with per-feature scale and shift (Aksenov et al., 16 Feb 2024).
- Universal Kernel Parameterizations: Tessellated Kernel (TK) class provides a universal, convexly parameterized family, with the kernel space linearly indexed by positive matrices. TK kernels can approximate any PSD kernel on finite data to arbitrary precision, offering dense coverage and tractable SDP or MKL-style optimization (Colbert et al., 2017).
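As a hedged illustration of the Bochner-based spectral parameterization above (the mixture values below are hand-set placeholders; BaNK and functional kernel learning infer them from data via MCMC or variational updates, which is omitted here), a stationary kernel can be realized through random Fourier features drawn from a learnable Gaussian-mixture spectral density:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_spectral_frequencies(means, scales, weights, num_features, dim):
    """Draw frequencies omega ~ p(omega), a Gaussian mixture acting as the learnable spectral density."""
    comps = rng.choice(len(weights), size=num_features, p=weights)
    return means[comps] + scales[comps] * rng.standard_normal((num_features, dim))

def random_fourier_features(X, omegas, phases):
    """z(x) such that z(x) . z(x') approximates the stationary kernel k(x - x')."""
    D = omegas.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ omegas.T + phases)

# Learnable spectral parameters (illustrative placeholders).
dim, D = 2, 2000
means = np.array([[0.0, 0.0], [3.0, 0.0]])
scales = np.array([[1.0, 1.0], [0.5, 0.5]])
weights = np.array([0.6, 0.4])

omegas = sample_spectral_frequencies(means, scales, weights, D, dim)
phases = rng.uniform(0, 2 * np.pi, size=D)

X = rng.standard_normal((100, dim))
Z = random_fourier_features(X, omegas, phases)
K_approx = Z @ Z.T   # approximates the kernel induced by the chosen spectral density
print(K_approx.shape, K_approx[0, 0])
```

Adapting the means, scales, and mixture weights adapts the smoothness and (quasi-)periodic structure of the induced kernel.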
3. Theoretical Properties and Guarantees
- Positive Definiteness and Universality: Across architectures, PSD (or, for recent asymmetric extensions, certain relaxed conditions) is enforced via construction: inner products in Hilbert spaces, explicit PSD matrix constraints, nonparametric spectral densities, or composition of known PSD kernels.
- Approximation Power and Density: Universal kernels such as the TK family or sufficiently rich spectral mixtures can approximate arbitrary continuous kernels or their associated RKHSs on compact domains (Colbert et al., 2017, Kothari et al., 2019). Density guarantees also arise in functional kernel learning via sufficiently rich priors (Benton et al., 2019).
- Statistical Generalization: Rademacher complexity and spectral decay analyses provide feature-complexity bounds. In random-feature frameworks with trained activation functions, data-dependent (leverage-score) sampling strategies reduce the number of features required to reach a given target error when the kernel spectrum decays sufficiently fast (Ma et al., 17 Oct 2025); a leverage-score sketch appears after this list.
- Convexity: Many learnable kernel settings admit convex or block-convex optimization, e.g., convexity in MKL simplex weights or positive matrices, or alternating convexity in sample-dependent and locally-adaptive bandwidth (LAB) formulations (Dinuzzo, 2010, Moeller et al., 2016, He et al., 2023, Lu et al., 2021, Colbert et al., 2017).
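The data-dependent feature sampling mentioned above can be sketched empirically as follows (a simplified illustration under common ridge-leverage-score conventions, not the exact procedure of Ma et al.): oversample random Fourier features, compute column ridge leverage scores, and resample a smaller feature set proportionally with importance reweighting.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_leverage_scores(Z, lam):
    """Column ridge leverage scores tau_j = [Z^T Z (Z^T Z + lam I)^{-1}]_{jj} of the feature matrix Z (n x D)."""
    M = Z.T @ Z
    return np.diag(np.linalg.solve(M + lam * np.eye(M.shape[0]), M))

# Oversample random Fourier features for a Gaussian kernel, then keep a leverage-sampled subset.
n, d, D_big, D_keep = 300, 5, 1024, 128
X = rng.standard_normal((n, d))
omegas = rng.standard_normal((D_big, d))
phases = rng.uniform(0, 2 * np.pi, size=D_big)
Z = np.sqrt(2.0 / D_big) * np.cos(X @ omegas.T + phases)

tau = ridge_leverage_scores(Z, lam=1e-2)
probs = tau / tau.sum()
keep = rng.choice(D_big, size=D_keep, replace=True, p=probs)
Z_small = Z[:, keep] / np.sqrt(D_keep * probs[keep])   # importance reweighting keeps E[Z_small @ Z_small.T] = Z @ Z.T

rel_err = np.linalg.norm(Z_small @ Z_small.T - Z @ Z.T) / np.linalg.norm(Z @ Z.T)
print(f"relative kernel approximation error with {D_keep}/{D_big} features: {rel_err:.3f}")
```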
4. Algorithmic and Computational Considerations
| Approach/Class | Optimization Strategy | Scalability/Remarks |
|---|---|---|
| Two-layer RKHS/MKL | Block-coordinate descent over expansion and kernel weights | Efficient for small or sparse kernel combinations |
| Spectral GP/BaNK | MCMC/variational updates over the spectral density | Random features enable large-sample, scalable inference |
| Random-feature models | Gradient training of feature weights/activations | Fastfood expansions and leverage-score sampling reduce feature counts |
| LAB, PDQK, LKL | Closed-form or Riemannian inner solves with SGD outer loop | Dynamic support, online update, local adaptivity |
| Transformer kernels | End-to-end gradient training | Linear-time attention via learned kernel similarity |
| DAE-PCA | Offline autoencoder training, fast online encoding | Drastic online speedup over classical KPCA |
| TK (SDP/MKL) | Convex SDP or MKL-style programs | Scalable with the SMKL-TK variant |
Techniques such as landmark selection, dynamic support augmentation, Kronecker/Tensor product acceleration, and efficient eigendecomposition underpin computational tractability for large data.
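For example, landmark (Nyström-style) approximation replaces the full kernel matrix with a low-rank factor built from a small set of landmark points; the sketch below is a generic illustration independent of any particular learnable-kernel method.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(X, Y, bw=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bw ** 2))

n, d, m = 2000, 3, 100                       # m landmarks << n samples
X = rng.standard_normal((n, d))
landmarks = X[rng.choice(n, size=m, replace=False)]

K_nm = rbf_kernel(X, landmarks)              # n x m cross-kernel
K_mm = rbf_kernel(landmarks, landmarks)      # m x m landmark kernel

# Nystrom approximation: K ~= K_nm K_mm^{-1} K_nm^T, stored via an n x m factor.
C = np.linalg.cholesky(np.linalg.pinv(K_mm + 1e-8 * np.eye(m)))
L = K_nm @ C
# L @ L.T approximates the full n x n kernel matrix without ever forming it.
print("factor shape:", L.shape)
```

The factor L can then replace the full kernel matrix in ridge regression or kernel PCA at roughly O(nm^2) cost.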
5. Representative Applications and Empirical Results
- Clustering and Ordinal Embedding: Triplet-based learnable kernels recover cluster geometry with reduced sample requirements and enable efficient application of kernel PCA, SVM, and $k$-means (Kleindessner et al., 2016).
- Domain Adaptation: Sample-dependent learnable quadratic kernels (PDQK) minimize mean discrepancy, yielding significant accuracy gains over fixed-kernel DAL baselines across image, text, and digit-recognition settings (Lu et al., 2021).
- Regression/Classification: BaNK and fast kernel learning frameworks outperform classical kernel regression and MKL on UCI benchmarks at large sample sizes, matching or exceeding neural network accuracy, with fully Bayesian posterior inference and uncertainty quantification (Oliva et al., 2015, Yang et al., 2014, Benton et al., 2019, He et al., 2023).
- High-dimensional Feature Selection: RLS2 and linear DAE-PCA enable embedded variable selection and rapid nonlinear fault detection, with 100-fold speed improvements in online deployment (Dinuzzo, 2010, Ren et al., 2021).
- Neural Sequence Models: Learnable transformer kernels employing adaptive quadratic maps close much of the performance gap with dot-product attention on in-context learning and language modeling benchmarks (e.g., the Pile), at linear time complexity (Aksenov et al., 16 Feb 2024); a simplified sketch follows this list.
- Universal Approximation and SVM Classification: TK kernels obtain or exceed MKL and neural net accuracy on benchmark UCI classification datasets, especially in high sample-to-feature regimes (Colbert et al., 2017).
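A simplified sketch of such a learnable-kernel linear attention step (in the spirit of, but not identical to, the ReBased parameterization; single head, no masking or normalization tricks, and all names are illustrative):

```python
import torch

class QuadraticKernelAttention(torch.nn.Module):
    """Linear attention with a learnable quadratic feature map phi(x) = (gamma * x + beta)^2."""

    def __init__(self, dim):
        super().__init__()
        self.gamma = torch.nn.Parameter(torch.ones(dim))   # learnable per-feature scale
        self.beta = torch.nn.Parameter(torch.zeros(dim))   # learnable per-feature shift

    def phi(self, x):
        return (self.gamma * x + self.beta) ** 2            # nonnegative feature map -> valid attention weights

    def forward(self, q, k, v):
        # q, k: (batch, seq, dim); v: (batch, seq, dim_v)
        q, k = self.phi(q), self.phi(k)
        kv = torch.einsum("bsd,bse->bde", k, v)              # sum_s phi(k_s) v_s^T, linear in sequence length
        z = k.sum(dim=1)                                     # normalizer sum_s phi(k_s)
        num = torch.einsum("bsd,bde->bse", q, kv)
        den = torch.einsum("bsd,bd->bs", q, z).clamp_min(1e-6).unsqueeze(-1)
        return num / den                                     # kernel-weighted average of the values

attn = QuadraticKernelAttention(dim=16)
q, k, v = torch.randn(2, 10, 16), torch.randn(2, 10, 16), torch.randn(2, 10, 32)
print(attn(q, k, v).shape)   # (2, 10, 32)
```

Because the feature map is nonnegative, the induced similarity weights are nonnegative and the normalizer stays positive, which is what permits reassociating the computation into linear time.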
Empirical studies repeatedly demonstrate (i) improved data fit due to adaptive kernel geometry, (ii) robustness to over/under-smoothing compared to fixed kernels, and (iii) competitive or superior accuracy versus both classical and deep learning baselines.
6. Limitations, Trade-Offs, and Open Directions
- Computational Overhead: While random-feature and Fastfood-based models scale to large sample sizes and input dimensions, SDP and dense-parameter universal kernel approaches can incur significant cost for very large datasets or high-dimensional inputs, necessitating approximate methods or structural constraints (Yang et al., 2014, Colbert et al., 2017).
- Hyperparameter Selection: Many learnable kernel systems require nontrivial choices of support size, basis number, bandwidth initialization, or spectral grouping strategy; meta-optimization or automated model selection remains active research.
- Nonstationary and Asymmetric Extensions: Stationary kernel frameworks dominate, but locally-adaptive, sample-dependent, and asymmetric kernels have gained prominence for domain adaptation and flexible generalization (He et al., 2023, Lu et al., 2021).
- Theoretical Expressiveness: While universal approximation is established for classes such as TK and functional kernel learning, lower bounds expose limitations for certain high-complexity Boolean functions unless model size or RKHS norm becomes exponential (Kothari et al., 2019).
- Interpretability and Domain Knowledge: Parametric forms and Bayesian posteriors enable interpretable priors (e.g., spectra), but high-depth neural and nonparametric models can become opaque, paralleling issues in deep learning.
- Deployment Considerations: Hardware-realizable kernels, as in FRI adaptive-kernel learning, show promise for efficient edge deployment when kernels are parametrized as low-order exponentials (Nitsure et al., 28 Sep 2025).
7. Unifying Principles and Conceptual Significance
Learnable kernel functions operationalize the notion that the hypothesis space—the geometry and smoothness of functions permitted—is itself a subject of optimization. By casting the kernel as a parameterized, data-driven object, these methods generalize the classical kernel trick, constructively realize universal approximation, unify parametric and nonparametric methodologies, and adapt to domain heterogeneity, task structure, and computational constraints. The ongoing evolution of this field links traditional statistical learning, functional analysis, nonparametric Bayes, modern neural architectures, and real-world system deployment, continually expanding the expressive capacity and practical impact of kernel methods in contemporary machine learning.