Movable Kernels in Adaptive Kernel Methods

Updated 5 February 2026
  • Movable kernels are parameter-dependent methods that adapt both the basis functions and coefficients, enabling efficient, tailored RKHS constructions.
  • They employ a family of RKHSs with learnable parameters to jointly minimize supervised loss, improving approximation error and reducing computational costs.
  • Empirical evaluations show that adaptive movable kernels yield higher accuracy and model compactness compared to traditional fixed-kernel approaches.

Movable kernels, also termed parameter-dependent kernels, refer to a class of kernel methods in which both the solution space and its underlying reproducing kernel are determined not only by the dataset and a fixed ambient RKHS, but also by a set of learnable, dataset-independent parameters. This framework, formalized in recent work on adaptive kernel methods, generalizes traditional fixed-kernel models by introducing families of RKHSs indexed by a parameter vector. The resulting models exhibit enhanced flexibility, scalability, and performance relative to classical kernel approximations, enabling joint learning of both basis and coefficients—a formulation that unifies and significantly extends earlier approaches such as Random Fourier Features (Dózsa et al., 29 Jan 2026).

1. Definition and Mathematical Framework

A movable kernel is defined via a family of RKHSs $\{\mathcal{H}(\theta)\}_{\theta\in\Theta}$, each specified by a parameter vector $\theta\in\Theta\subset\mathbb{R}^p$. The reproducing kernels in this family take the form

$$k_\theta(x, x') = \langle \Phi_\theta(x), \Phi_\theta(x') \rangle_{\mathbb{C}^D},$$

where $\Phi_\theta: \mathcal{X} \to \mathbb{C}^D$ is a parameterized feature map, often constructed such that the RKHS $\mathcal{H}(\theta)$ is $D$-dimensional with $D \ll q$ (the number of samples). Alternatively, one specifies an orthonormal basis $\{\varphi_j^\theta\}_{j=0}^{D-1}$, yielding

$$k_\theta(x, x') = \sum_{j=0}^{D-1} \varphi_j^\theta(x)\,\overline{\varphi_j^\theta(x')}.$$

This construction allows both the feature map and, hence, the solution space to be adapted through optimization over $\theta$, rather than being tied to a single choice of basis or feature space (Dózsa et al., 29 Jan 2026).
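As a minimal numerical sketch (not code from the paper; the exponential feature map and the specific parameter values are illustrative assumptions), a $D$-dimensional parameterized feature map and the kernel it induces can be written as:

```python
import numpy as np

def feature_map(x, theta):
    """Parameterized feature map Phi_theta : R -> C^D.

    Illustrative choice: theta holds D learnable frequencies, and
    Phi_theta(x) = [exp(i * theta_j * x)]_j / sqrt(D).
    """
    return np.exp(1j * theta * x) / np.sqrt(len(theta))

def kernel(x, xp, theta):
    """k_theta(x, x') = <Phi_theta(x), Phi_theta(x')> in C^D.

    np.vdot conjugates its first argument, matching the convention
    k_theta(x, x') = sum_j phi_j(x) * conj(phi_j(x')).
    """
    return np.vdot(feature_map(xp, theta), feature_map(x, theta))

theta = np.array([0.5, 1.0, 2.0, 4.0])  # D = 4 learnable parameters
k = kernel(0.3, 0.7, theta)
```

Optimizing over `theta` then moves the whole $D$-dimensional feature space, not just the coefficients living in it.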

2. Adaptive Projection and Joint Optimization

Given a dataset $S = \{(t_k, y_k)\}_{k=1}^q$ and a loss functional

$$E\big((t_1, f(t_1), y_1), \ldots, (t_q, f(t_q), y_q)\big),$$

the projection operator in adaptive kernel methods is

$$P_{\theta, S}F := \arg\min_{g \in \mathcal{H}(\theta)} E\big((t_k, g(t_k), y_k)\big),$$

which expands as $g(x) = \sum_{k=1}^{q} c_k\, k_\theta(x, t_k)$. Via the feature-map representation, this becomes $f_\theta(x) = \langle \Phi_\theta(x), w \rangle$ with $w \in \mathbb{C}^D$. The model thus decouples the coefficient optimization (over $w$) from basis selection (over $\theta$), resulting in the joint minimization problem

$$\min_{\theta\in\Theta}\, \min_{w\in\mathbb{C}^D} E\big((t_k, f_\theta(w; t_k), y_k)\big).$$

For each fixed $\theta$, the inner minimization is a standard supervised problem in $\mathcal{H}(\theta)$; subsequently, $\theta$ is updated using gradients back-propagated through $w^*_\theta$. This structure generalizes the Golub–Pereyra variable-projection framework and enables direct learning of feature spaces tailored to the prediction task (Dózsa et al., 29 Jan 2026).
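The inner/outer split can be illustrated on a toy 1-D regression problem (a sketch under assumed choices: cosine features, squared loss, and a single true frequency; `inner_solve` and `outer_loss` are names introduced here, not the paper's API). For fixed $\theta$ the inner problem is plain least squares; the outer objective then depends on $\theta$ alone:

```python
import numpy as np

# Toy data: q samples of a single cosine with true frequency omega_true.
q = 200
t = np.linspace(0.0, 1.0, q)
omega_true = 2 * np.pi * 3.0
y = 0.8 * np.cos(omega_true * t)

def inner_solve(theta):
    """Inner problem: for fixed theta, w*(theta) = argmin_w ||A(theta) w - y||^2."""
    A = np.cos(np.outer(t, np.atleast_1d(theta)))  # features phi_j(x) = cos(theta_j x)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w, A

def outer_loss(theta):
    """Variable-projection objective in theta alone: residual after solving for w."""
    w, A = inner_solve(theta)
    r = A @ w - y
    return 0.5 * float(r @ r)

# At the true frequency the model space contains y exactly, so the loss
# vanishes; a mismatched frequency leaves a large residual.
loss_true = outer_loss(omega_true)
loss_off = outer_loss(omega_true * 1.3)
```

Gradient updates of $\theta$ would then descend `outer_loss`, with the inner least-squares solve replayed at every step, exactly the variable-projection pattern described above.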

3. Kernel Approximation in Infinite-Dimensional Settings

Many kernels of practical relevance—such as the Gaussian RBF kernel—are associated with infinite-dimensional RKHSs. Classical finite-basis approximations, such as those derived from Mercer's theorem or Random Fourier Features (RFF), yield truncated expansions

$$\xi_D(x, t) = \sum_{j=0}^{D-1} \varphi_j(x)\, \overline{\varphi_j(t)},$$

with the approximation error decaying as a function of the tail of the basis. Movable kernels extend this concept by parameterizing the basis itself: one optimizes over $\theta$ to directly minimize the supervised loss, rather than minimizing an intermediate approximation error in the feature space. For example, in the Hardy space $H_2(\mathbb{D})$, the Cauchy kernel $\xi(z, t) = 1/(1 - \overline{t} z)$ is approximated via a Takenaka–Malmquist (TM) basis indexed by a movable pole $a \in \mathbb{D}$:

$$\xi_D^a(z, t) = \sum_{j=0}^{D-1} L_j^a(z)\, \overline{L_j^a(t)},$$

with explicit error bounds as a function of $|a|$ and $D$. By learning the pole parameters via empirical loss minimization, the resulting model spaces can exactly capture rational transfer functions, yielding interpretable and parsimonious representations (Dózsa et al., 29 Jan 2026).
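This convergence can be checked numerically. The sketch below (illustrative points and pole, chosen here rather than taken from the paper) uses the repeated-pole special case of the TM system, $L_j^a(z) = \sqrt{1-|a|^2}/(1-\bar{a}z)\cdot\big((z-a)/(1-\bar{a}z)\big)^j$, and compares the rank-$D$ sum against the exact Cauchy kernel:

```python
import numpy as np

def tm_basis(z, a, D):
    """Takenaka-Malmquist functions with one repeated pole a in the unit disk:
    L_j^a(z) = sqrt(1-|a|^2)/(1 - conj(a) z) * ((z - a)/(1 - conj(a) z))^j.
    """
    blaschke = (z - a) / (1 - np.conj(a) * z)
    lead = np.sqrt(1 - abs(a) ** 2) / (1 - np.conj(a) * z)
    return np.array([lead * blaschke ** j for j in range(D)])

def tm_kernel(z, t, a, D):
    """Rank-D approximation xi_D^a(z, t) = sum_j L_j^a(z) * conj(L_j^a(t))."""
    return np.sum(tm_basis(z, a, D) * np.conj(tm_basis(t, a, D)))

def cauchy_kernel(z, t):
    """Reproducing kernel of the Hardy space H^2(D): 1 / (1 - conj(t) z)."""
    return 1.0 / (1.0 - np.conj(t) * z)

z, t, a = 0.4 + 0.2j, 0.1 - 0.3j, 0.35 + 0.1j
approx = tm_kernel(z, t, a, D=30)
exact = cauchy_kernel(z, t)
```

The truncation error decays geometrically in $D$, governed by the Blaschke factor $|(z-a)/(1-\bar{a}z)|$, which is why moving the pole $a$ toward the relevant region of the disk buys accuracy at small $D$.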

4. Generalization of Random Fourier Features (RFF)

Random Fourier Features approximate shift-invariant kernels $\xi(x - t)$ via Monte Carlo sampling of frequencies $\{\omega_j\}_{j=0}^{D-1}$:

$$\xi_D(x, t) = \langle \Phi_\omega(x), \Phi_\omega(t) \rangle, \quad \Phi_\omega(x) = [e^{i \omega_j \cdot x}]_{j=0, \ldots, D-1}.$$

Movable kernels interpret RFF as the special case in which the parameter vector $\theta = (\omega_0, \dots, \omega_{D-1})$ is chosen at random. By making these frequencies learnable and jointly optimizing them with the model coefficients, movable kernel methods can significantly outperform classic RFF in both predictive accuracy and compactness. Empirical results show that this joint optimization delivers substantial accuracy improvements with only marginal additional computational overhead compared to fixed-feature RFF (Dózsa et al., 29 Jan 2026).
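The fixed-feature baseline can be sketched as follows (a standard RFF construction, not code from the paper; dimensions and test points are arbitrary). For $\omega_j \sim \mathcal{N}(0, I)$, the feature-map inner product is a Monte Carlo estimate of the Gaussian RBF kernel $e^{-\|x-t\|^2/2}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_map(x, omegas):
    """Complex RFF feature map Phi_omega(x) = [exp(i omega_j . x)]_j / sqrt(D)."""
    return np.exp(1j * omegas @ x) / np.sqrt(len(omegas))

def rff_kernel(x, t, omegas):
    """xi_D(x, t) = <Phi_omega(x), Phi_omega(t)>, a Monte Carlo kernel estimate."""
    return np.vdot(rff_map(t, omegas), rff_map(x, omegas))

# E[exp(i omega . (x - t))] = exp(-||x - t||^2 / 2) for omega ~ N(0, I),
# so the estimate converges at the O(1/sqrt(D)) Monte Carlo rate.
d, D = 3, 20000
omegas = rng.standard_normal((D, d))   # randomly drawn (fixed) frequencies
x = np.array([0.2, -0.1, 0.4])
t = np.array([0.5, 0.3, 0.1])
approx = rff_kernel(x, t, omegas).real
exact = np.exp(-np.sum((x - t) ** 2) / 2)
```

A movable-kernel variant would simply treat `omegas` as the learnable $\theta$ and update it by gradient descent on the supervised loss instead of leaving it frozen at its random draw.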

5. Training Algorithms and Computational Complexity

Gradient-based optimization is employed for joint learning of $\theta$ and $w$, as detailed in the following pseudocode:

Inputs:
  S = {(t_k, y_k)}_{k=1}^q, feature order D,
  initial θ^(0), w^(0) ∈ ℂ^D, step sizes η_θ, η_w, tolerance ε.
for i = 0, 1, 2, … until convergence do
  1) Build the basis {φ_j^{θ^(i)}} or feature map Φ_{θ^(i)}.
  2) Compute f^(i)(x) = ⟨Φ_{θ^(i)}(x), w^(i)⟩.
  3) Evaluate the loss E_i = E((t_k, f^(i)(t_k), y_k)).
  4) Compute gradients
       g_w = ∂E_i/∂w,
       g_θ = ∂E_i/∂θ
     (back-propagating through Φ_θ if needed).
  5) Update
       w^(i+1) = w^(i) − η_w g_w,
       θ^(i+1) = θ^(i) − η_θ g_θ.
  6) Stop if ‖g_w‖ + ‖g_θ‖ < ε.
end
Output: θ*, w*.
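The loop above can be instantiated directly for a toy 1-D problem (a sketch under assumed choices: real cosine features, mean squared loss, $D = 1$, and hand-picked step sizes; none of these settings come from the paper):

```python
import numpy as np

# Toy regression target generated by a single "true" frequency.
t = np.linspace(0.0, 2.0, 50)
y = np.cos(3.0 * t)
q = len(t)

theta = np.array([2.7])            # initial basis parameter (a frequency)
w = np.array([0.8])                # initial coefficient
eta_theta, eta_w = 0.2, 0.5        # step sizes (illustrative)

losses = []
for i in range(500):
    # Steps 1-2: build features A_{kj} = cos(theta_j t_k); then f = A w.
    A = np.cos(np.outer(t, theta))
    r = A @ w - y
    # Step 3: mean squared loss.
    E = 0.5 * np.mean(r ** 2)
    losses.append(E)
    # Step 4: analytic gradients (chain rule through the feature map).
    g_w = A.T @ r / q
    dA = -t[:, None] * np.sin(np.outer(t, theta))   # dA/dtheta_j, column-wise
    g_theta = (dA * w).T @ r / q
    # Step 5: simultaneous gradient updates of w and theta.
    w = w - eta_w * g_w
    theta = theta - eta_theta * g_theta
    # Step 6: stopping criterion on the combined gradient norm.
    if np.abs(g_w).sum() + np.abs(g_theta).sum() < 1e-12:
        break
```

With these settings the learned frequency drifts from its initialization toward the true value, driving the loss toward zero, which is the behavior the pseudocode's joint updates are designed to produce.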

The dominant per-iteration cost is $O(qD)$ (building and processing $D$-dimensional features for $q$ samples) plus an $O(qD)$ gradient computation. If $w$ is computed directly (e.g., via least squares), the additional cost is $O(qD^2 + D^3)$. Inference on a new point requires $O(D)$ operations, a significant reduction compared to the $O(q)$ cost of classical full-rank kernel models (Dózsa et al., 29 Jan 2026).

6. Empirical Evaluation and Applications

Movable kernel methods have demonstrated high efficacy in diverse numerical experiments. In SISO LTI system identification, using a TM basis of order $D = 4$ with learned movable poles, the adaptive model recovered both the poles and coefficients of the true transfer function exactly (zero error). In contrast, RFF-based identification with $Q = 200$ or $400$ frequencies resulted in persistently large errors and learned frequencies unrelated to the system poles.

For large-scale classification (e.g., ForestCover with $q \approx 5.8 \times 10^5$, $p = 54$), adaptive trigonometric and “arctan” bases outperformed standard RFF in both compactness and accuracy. Test-set accuracy improved from approximately 81% using RFF with $D = 5000$ to about 92% for the adaptive trigonometric basis of the same size; the adaptive trigonometric basis reached 80% accuracy already at $D = 1000$, whereas RFF required $D = 5000$. Training times for adaptive methods were comparable or lower, with faster convergence in terms of epochs (Dózsa et al., 29 Jan 2026).

7. Connections, Generalizations, and Significance

Movable kernels unify and extend classical fixed-kernel approaches, including all finite-basis methods such as RFF, by enabling the simultaneous learning of model coefficients and basis functions. This greater flexibility yields models that are more compact, interpretable, and suited to large-scale data, while achieving higher predictive performance. Empirical results indicate dramatic reductions in both model complexity and computational costs. The method provides an efficient alternative to traditional infinite-dimensional kernel approximations and offers theoretical connections to adaptive signal representations and model spaces in system theory (Dózsa et al., 29 Jan 2026).
