Movable Kernels in Adaptive Kernel Methods
- Movable kernels are parameter-dependent methods that adapt both the basis functions and coefficients, enabling efficient, tailored RKHS constructions.
- They employ a family of RKHSs with learnable parameters to jointly minimize supervised loss, improving approximation error and reducing computational costs.
- Empirical evaluations show that adaptive movable kernels yield higher accuracy and model compactness compared to traditional fixed-kernel approaches.
Movable kernels, also termed parameter-dependent kernels, refer to a class of kernel methods in which both the solution space and its underlying reproducing kernel are determined not only by the dataset and a fixed ambient RKHS, but also by a set of learnable, dataset-independent parameters. This framework, formalized in recent work on adaptive kernel methods, generalizes traditional fixed-kernel models by introducing families of RKHSs indexed by a parameter vector. The resulting models exhibit enhanced flexibility, scalability, and performance relative to classical kernel approximations, enabling joint learning of both basis and coefficients—a formulation that unifies and significantly extends earlier approaches such as Random Fourier Features (Dózsa et al., 29 Jan 2026).
1. Definition and Mathematical Framework
A movable kernel is defined via a family of RKHSs {ℋ_θ}, each specified by a parameter vector θ. The reproducing kernels in this family take the form

K_θ(x, y) = ⟨Φ_θ(x), Φ_θ(y)⟩,

where Φ_θ is a parameterized feature map, often constructed such that the RKHS is D-dimensional with D ≪ q (the number of samples). Alternatively, one specifies an orthonormal basis {ϕ_j^θ}_{j=1}^D, yielding

K_θ(x, y) = Σ_{j=1}^{D} ϕ_j^θ(x) ϕ_j^θ(y)*  (with * denoting complex conjugation).
This construction allows both the feature map and, hence, the solution space to be adapted through optimization over θ, rather than being tied to a single choice of basis or feature space (Dózsa et al., 29 Jan 2026).
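As a concrete sketch of this construction (the trigonometric feature map and all names below are illustrative assumptions, not the paper's definitions), a parameter vector θ of learnable frequencies induces a finite-dimensional feature map Φ_θ and, through it, the reproducing kernel K_θ:

```python
import numpy as np

def phi(theta, x):
    """Parameterized feature map Phi_theta(x): trigonometric features with
    learnable frequencies theta (an illustrative choice, not the paper's)."""
    return np.concatenate([np.cos(theta * x), np.sin(theta * x)]) / np.sqrt(len(theta))

def kernel(theta, x, y):
    """Induced reproducing kernel K_theta(x, y) = <Phi_theta(x), Phi_theta(y)>."""
    return phi(theta, x) @ phi(theta, y)

rng = np.random.default_rng(0)
theta = rng.normal(size=8)   # the parameter vector indexing the RKHS family
val = kernel(theta, 0.3, 0.9)
```

Changing θ changes both the feature map and the kernel, i.e., the entire solution space moves with the parameters.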
2. Adaptive Projection and Joint Optimization
Given a dataset S = {(t_k, y_k)}_{k=1}^q and a loss functional E measuring the data misfit, the projection operator in adaptive kernel methods is

P_θ y = argmin_{f ∈ ℋ_θ} E((t_k, f(t_k), y_k)_{k=1}^q),

which expands in the orthonormal basis as f = Σ_{j=1}^{D} c_j ϕ_j^θ. Via the feature-map representation, this becomes f(x) = ⟨Φ_θ(x), w⟩, with w ∈ ℂ^D. The model thus decouples the coefficient optimization (over w) from basis selection (over θ), resulting in the joint minimization problem

min_θ min_{w ∈ ℂ^D} E(θ, w).

For each fixed θ, the inner minimization is a standard supervised problem in w; subsequently, θ is updated using gradients back-propagated through Φ_θ. This structure generalizes the variable-projection framework of Golub and Pereyra and enables direct learning of feature spaces tailored to the prediction task (Dózsa et al., 29 Jan 2026).
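The variable-projection structure can be sketched as follows (a minimal illustration under assumed trigonometric features; the inner problem in w is solved exactly by least squares, and the outer gradient over θ is approximated by finite differences):

```python
import numpy as np

def features(theta, t):
    """Phi_theta evaluated on all sample points t (illustrative trigonometric basis)."""
    return np.concatenate([np.cos(np.outer(t, theta)),
                           np.sin(np.outer(t, theta))], axis=1)

def outer_loss(theta, t, y):
    """Variable-projection objective: loss at the inner optimum w*(theta)."""
    A = features(theta, t)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)   # inner problem: least squares in w
    return np.mean((A @ w - y) ** 2)

t = np.linspace(0.0, 2.0 * np.pi, 200)
y = np.sin(3.0 * t)                  # toy target with one true frequency
theta = np.array([2.5])              # initial frequency guess
for _ in range(100):                 # outer loop: gradient descent on theta only
    eps = 1e-4                       # central finite difference for d(loss)/d(theta)
    g = (outer_loss(theta + eps, t, y) - outer_loss(theta - eps, t, y)) / (2 * eps)
    theta = theta - 0.05 * g
```

In this toy problem the outer iteration drives the learnable frequency toward the true value 3, at which point the inner least-squares fit becomes exact.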
3. Kernel Approximation in Infinite-Dimensional Settings
Many kernels of practical relevance—such as the Gaussian RBF kernel—are associated with infinite-dimensional RKHSs. Classical finite-basis approximations, such as those derived from Mercer's theorem or Random Fourier Features (RFF), yield truncated expansions

K(x, y) ≈ Σ_{j=1}^{D} ϕ_j(x) ϕ_j(y)*,

with the approximation error decaying as a function of the tail of the basis. Movable kernels extend this concept by parameterizing the basis itself: one optimizes over θ to directly minimize the supervised loss, rather than minimizing intermediate approximation error in the feature space. For example, in the Hardy space H²(𝔻), the Cauchy kernel K(z, w) = 1/(1 − w̄z) is approximated via a Takenaka–Malmquist (TM) basis indexed by a movable pole a ∈ 𝔻,

ϕ_j^a(z) = (√(1 − |a|²)/(1 − āz)) · ((z − a)/(1 − āz))^{j−1},   j = 1, …, D,

with explicit error bounds as a function of D and a. By learning the pole parameters via empirical loss minimization, the resulting model spaces can exactly capture rational transfer functions, yielding interpretable and parsimonious representations (Dózsa et al., 29 Jan 2026).
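For the single-pole (Laguerre-type) case, the TM construction can be sketched numerically; the function names below are hypothetical, and orthonormality in H² is checked by quadrature on the unit circle:

```python
import numpy as np

def tm_basis(a, z, D):
    """First D Takenaka-Malmquist functions for a single repeated pole a (|a| < 1),
    evaluated at points z; row j holds phi_j."""
    blaschke = (z - a) / (1.0 - np.conj(a) * z)              # elementary Blaschke factor
    lead = np.sqrt(1.0 - abs(a) ** 2) / (1.0 - np.conj(a) * z)
    return np.stack([lead * blaschke ** j for j in range(D)])

# Numerical check of orthonormality: integrate over the unit circle.
n = 4096
z = np.exp(2j * np.pi * np.arange(n) / n)
B = tm_basis(0.5 + 0.2j, z, 4)
gram = (B @ B.conj().T) / n          # Gram matrix w.r.t. normalized arclength measure
```

The Gram matrix is (numerically) the identity for any pole inside the disk, which is what makes the pole a freely movable parameter of the model space.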
4. Generalization of Random Fourier Features (RFF)
Random Fourier Features approximate shift-invariant kernels via Monte Carlo sampling of frequencies ω_j from the kernel's spectral density (Bochner's theorem):

k(x, y) ≈ ⟨z(x), z(y)⟩,   z(x) = √(2/D) (cos(ω_jᵀx + b_j))_{j=1}^{D}.

Movable kernels interpret RFF as the special case where the parameter vector θ = (ω_1, …, ω_D) is randomly chosen and then frozen. By making these frequencies learnable and jointly optimizing them with the model coefficients, movable kernel methods can significantly outperform classic RFF in both predictive accuracy and compactness. Empirical results show that this joint optimization delivers substantial accuracy improvements with only marginal additional computational overhead compared to fixed-feature RFF (Dózsa et al., 29 Jan 2026).
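A minimal sketch of the fixed-frequency RFF baseline that movable kernels generalize (the Gaussian kernel and all variable names are illustrative); making `omega` learnable instead of frozen recovers the movable-kernel view:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 3, 4000
omega = rng.normal(size=(D, d))            # frequencies from the Gaussian kernel's spectral density
b = rng.uniform(0.0, 2.0 * np.pi, size=D)  # random phases

def z(x):
    """Random Fourier feature map: z(x) @ z(y) approximates exp(-||x-y||^2 / 2)."""
    return np.sqrt(2.0 / D) * np.cos(omega @ x + b)

x = np.array([0.1, -0.4, 0.3])
y = np.array([0.5, 0.2, -0.1])
exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2.0)   # target Gaussian kernel value
approx = z(x) @ z(y)                                # Monte Carlo approximation
```

The approximation error here decays only at the Monte Carlo rate in D, which is precisely the slack that optimizing the frequencies is able to remove.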
5. Training Algorithms and Computational Complexity
Gradient-based optimization is employed for joint learning of θ and w, as detailed in the following pseudocode:
    Inputs:
        S = {(t_k, y_k)}_{k=1}^q, feature-order D,
        initial θ^(0), w^(0) ∈ ℂ^D, step-sizes η_θ, η_w, tol ε.

    for i = 0, 1, 2, … until convergence do
        1) Build basis {ϕ_j^{θ^(i)}} or feature map Φ_{θ^(i)}.
        2) Compute f^(i)(x) = ⟨Φ_{θ^(i)}(x), w^(i)⟩.
        3) Evaluate loss E_i = E((t_k, f^(i)(t_k), y_k)).
        4) Compute gradients
               g_w = ∂E_i/∂w,
               g_θ = ∂E_i/∂θ
           (back-propagate through Φ_θ if needed).
        5) Update
               w^(i+1) = w^(i) − η_w · g_w,
               θ^(i+1) = θ^(i) − η_θ · g_θ.
        6) Stop if ||g_w|| + ||g_θ|| < ε.
    end
    Output θ*, w*.
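The pseudocode above can be translated into a runnable sketch (illustrative trigonometric features with hand-derived gradients; the step sizes and toy target are assumptions, not values from the paper):

```python
import numpy as np

def loss_and_grads(theta, w, t, y):
    """Squared loss E and its analytic gradients g_w, g_theta (steps 2-4)."""
    q, D = len(t), len(theta)
    T = np.outer(t, theta)
    A = np.concatenate([np.cos(T), np.sin(T)], axis=1)      # feature matrix, shape (q, 2D)
    r = A @ w - y                                           # residuals
    E = 0.5 * np.mean(r ** 2)
    g_w = A.T @ r / q
    # d cos(theta_j t)/d theta_j = -t sin(theta_j t);  d sin(theta_j t)/d theta_j = t cos(theta_j t)
    dA = -t[:, None] * np.sin(T) * w[:D] + t[:, None] * np.cos(T) * w[D:]
    g_theta = dA.T @ r / q
    return E, g_w, g_theta

t = np.linspace(0.0, 2.0 * np.pi, 200)
y = np.sin(3.0 * t)                          # toy target with one true frequency
theta, w = np.array([2.7]), np.zeros(2)      # theta^(0), w^(0)
for i in range(5000):                        # steps 1-6 of the pseudocode
    E, g_w, g_theta = loss_and_grads(theta, w, t, y)
    w = w - 1.0 * g_w                        # eta_w
    theta = theta - 0.05 * g_theta           # eta_theta
    if np.linalg.norm(g_w) + np.linalg.norm(g_theta) < 1e-10:
        break
```

Unlike the variable-projection variant, both parameter blocks are updated simultaneously here, exactly as in steps 5)-6); on this toy problem the iteration still recovers the true frequency.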
The dominant per-iteration cost is O(qD) for building and processing the D-dimensional features of all q samples, plus an O(qD) gradient computation. If w is computed directly (e.g., via least squares), the additional cost is O(qD²). Inference on a new point requires O(D) operations, a significant reduction compared to the O(q) cost of classical full-rank kernel models (Dózsa et al., 29 Jan 2026).
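The inference-cost gap can be illustrated with two toy predictors (all data and coefficients below are random placeholders, not trained models):

```python
import numpy as np

rng = np.random.default_rng(3)
q, D = 100_000, 64
theta = rng.normal(size=D)            # learned frequencies (illustrative)
w = rng.normal(size=2 * D)            # learned primal coefficients
t_train = rng.normal(size=q)          # training inputs stored by a classical kernel model
alpha = rng.normal(size=q) / q        # its dual coefficients

def predict_movable(x):
    """O(D) inference: one feature-map evaluation plus a dot product."""
    feats = np.concatenate([np.cos(theta * x), np.sin(theta * x)])
    return feats @ w

def predict_full_kernel(x):
    """O(q) inference: one kernel evaluation against every stored training point."""
    return np.exp(-(x - t_train) ** 2 / 2.0) @ alpha
```

With D ≪ q, the movable-kernel predictor touches only the 2D learned features, while the full-rank model must revisit all q training points at every query.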
6. Empirical Evaluation and Applications
Movable kernel methods have demonstrated high efficacy in diverse numerical experiments. In SISO LTI system identification using a TM basis with learned movable poles, the adaptive model recovered both the poles and coefficients of the true transfer function exactly (zero error). In contrast, RFF-based identification with up to 400 frequencies resulted in persistently large errors, with learned frequencies unrelated to the system poles.
For large-scale classification (e.g., the ForestCover dataset), adaptive trigonometric and “arctan” bases outperformed standard RFF in both compactness and accuracy. Test-set accuracy improved from approximately 81% with RFF to about 92% for an adaptive trigonometric basis of the same size; the adaptive trigonometric basis also reached 80% accuracy at a substantially smaller feature dimension than RFF required. Training times for adaptive methods were comparable or lower, with faster convergence in terms of epochs (Dózsa et al., 29 Jan 2026).
7. Connections, Generalizations, and Significance
Movable kernels unify and extend classical fixed-kernel approaches, including all finite-basis methods such as RFF, by enabling the simultaneous learning of model coefficients and basis functions. This greater flexibility yields models that are more compact, interpretable, and suited to large-scale data, while achieving higher predictive performance. Empirical results indicate dramatic reductions in both model complexity and computational costs. The method provides an efficient alternative to traditional infinite-dimensional kernel approximations and offers theoretical connections to adaptive signal representations and model spaces in system theory (Dózsa et al., 29 Jan 2026).