Iteratively reweighted kernel machines efficiently learn sparse functions (2505.08277v1)
Abstract: The impressive practical performance of neural networks is often attributed to their ability to learn low-dimensional data representations and hierarchical structure directly from data. In this work, we argue that these two phenomena are not unique to neural networks, and can be elicited from classical kernel methods. Namely, we show that the derivative of the kernel predictor can detect the influential coordinates with low sample complexity. Moreover, by iteratively using the derivatives to reweight the data and retrain kernel machines, one is able to efficiently learn hierarchical polynomials with finite leap complexity. Numerical experiments illustrate the developed theory.
Summary
- The paper introduces IRKM, showing that iterative gradient reweighting in kernel methods efficiently identifies influential features for sparse function learning.
- It provides theoretical guarantees linking empirical squared gradients to orthogonal polynomial expansions, ensuring consistent identification of key coordinates.
- Experimental results demonstrate that IRKM outperforms standard KRR and neural networks on both synthetic and real-world datasets, confirming its practical effectiveness.
This paper, "Iteratively reweighted kernel machines efficiently learn sparse functions" (2505.08277), challenges the notion that feature learning and hierarchical learning are exclusive to neural networks. It proposes that classical kernel methods, when combined with an iterative reweighting scheme inspired by Iteratively Reweighted Least Squares (IRLS) and Recursive Feature Machines (RFM), can effectively elicit these phenomena and efficiently learn sparse and hierarchical functions.
Motivation:
Deep learning has achieved remarkable success, often attributed to its ability to learn relevant features and hierarchical structure directly from data. However, training deep neural networks can be computationally expensive and brittle, requiring careful tuning of numerous hyperparameters. This motivates the exploration of simpler, more interpretable models that might possess similar learning capabilities. Kernel Ridge Regression (KRR) is a powerful classical method, but standard KRR is known to struggle with high-dimensional data, especially when the true function is sparse or low-dimensional, as its generalization error doesn't typically adapt to the intrinsic complexity of the target function.
Proposed Method: Iteratively Reweighted Kernel Machines (IRKM)
The core contribution is the Iteratively Reweighted Kernel Machines (IRKM) algorithm (Algorithm 2 in the paper). It operates iteratively, aiming to identify and prioritize influential coordinates (features) for the learning task.
The algorithm proceeds as follows:
- Initialization: Start with a uniform weight vector $w$ (e.g., all ones).
- Iterative Refinement (for $T$ steps):
- Data Sampling: Sample a fresh dataset $\{(x^{(i)}, y_i)\}_{i=1}^{n}$. The use of fresh data at each step is highlighted as important.
- Reweighted Kernel: Define a kernel $K_w(x, z) = K(w \odot x, w \odot z)$. This effectively reweights the dimensions of the input data based on the current weight vector $w$.
- Kernel Ridge Regression: Train a KRR model using the sampled data and the reweighted kernel $K_w$. This yields a predictor $\hat{f}_t$:
```python
# lam is the ridge parameter; K_w is the reweighted kernel defined above
beta = np.linalg.solve(K_w(X, X) + lam * np.eye(n), y)   # beta = (K_w(X, X) + lam * I)^{-1} y
hat_f_t = lambda Z: K_w(Z, X) @ beta                      # hat_f_t(z) = K_w(z, X) beta
```
- Weight Update: Calculate the empirical squared gradient norm for each coordinate $r$ based on the current predictor $\hat{f}_t$: $\mathbb{E}_n(\partial_r \hat{f}_t)^2 = \frac{1}{n} \sum_{i=1}^{n} \big(\partial_r \hat{f}_t(x^{(i)})\big)^2$. Update the weight vector by adding these squared gradients to a base vector (e.g., the vector of ones), including a small safeguard $\epsilon > 0$. The safeguard prevents weights from becoming exactly zero, allowing recovery from early misidentification of non-influential features.
```python
# grad_f_t_samples: n x d matrix, row i = gradient of hat_f_t at x^(i)
empirical_grad_sq = (grad_f_t_samples ** 2).mean(axis=0)  # element-wise square, then average over samples
w = np.ones(d) + empirical_grad_sq                        # base vector of ones plus squared gradients (safeguard)
```
- Normalization: Normalize the weight vector $w$ (e.g., to have an $L_1$ norm equal to the dimension $d$):
```python
w = (d / w.sum()) * w   # rescale so that the L1 norm of w equals d
```
This iterative process leverages the intuition that the gradient of the prediction function should be large for features important to the target function. By reweighting dimensions based on gradient magnitudes, the algorithm effectively focuses subsequent learning steps on the relevant features.
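For concreteness, here is a minimal self-contained sketch of this loop in NumPy. The Gaussian base kernel, its bandwidth, the ridge value, the closed-form gradient used for that kernel, and the names `gaussian_kernel`, `irkm`, and `sample_batch` are illustrative assumptions, not the paper's exact Algorithm 2.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)) for all pairs of rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * bandwidth**2))

def irkm(sample_batch, d, T=5, lam=1e-3, bandwidth=1.0):
    """Sketch of the iterative loop: fit KRR with a reweighted kernel, measure
    coordinate-wise squared gradients, reweight, and repeat.

    sample_batch() should return a fresh dataset (X, y) with X of shape (n, d)."""
    w = np.ones(d)                                        # uniform initial weights
    for _ in range(T):
        X, y = sample_batch()                             # fresh data at every iteration
        n = X.shape[0]
        w_fit = w.copy()                                  # weights used for this fit
        Xw = X * w_fit                                    # reweighted inputs w ⊙ x
        K = gaussian_kernel(Xw, Xw, bandwidth)            # K_w(X, X)
        beta = np.linalg.solve(K + lam * np.eye(n), y)    # KRR coefficients

        # Gradients of the predictor at the training points; for the Gaussian kernel,
        # d/dz_r K(w⊙z, w⊙x_i) = K(w⊙z, w⊙x_i) * w_r^2 * (x_{i,r} - z_r) / bandwidth^2.
        grads = np.empty((n, d))
        for j in range(n):
            diff = (Xw - Xw[j]) * w_fit                   # row i: w_r^2 * (x_{i,r} - x_{j,r})
            grads[j] = (beta * K[:, j]) @ diff / bandwidth**2

        emp_grad_sq = (grads**2).mean(axis=0)             # E_n(∂_r f̂_t)^2 for each coordinate r
        w = np.ones(d) + emp_grad_sq                      # base vector plus squared gradients
        w = (d / w.sum()) * w                             # normalize so ||w||_1 = d

    def predict(Z):
        """Predictor from the final iteration."""
        return gaussian_kernel(Z * w_fit, Xw, bandwidth) @ beta

    return predict, w
```

For instance, `sample_batch` could draw fresh points uniformly from the hypercube and label them with a sparse polynomial such as $x_1 x_2$; after a few iterations the returned weight vector should concentrate most of its mass on the first two coordinates.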
Theoretical Contributions:
The paper provides theoretical guarantees, primarily for data sampled from the Gaussian distribution or uniformly from the hypercube, focusing on learning polynomials.
- Gradient Consistency for Feature Learning (Theorems 1 and 2):
- Even standard KRR (the first step of IRKM) can identify influential coordinates (those with $\mathbb{E}[\partial_r f^*]^2 > 0$), even if its overall generalization error for the target function $f^*$ is large.
- Specifically, the empirical squared gradient $\mathbb{E}_n(\partial_r \hat{f})^2$ consistently estimates the expected squared gradient of a truncated version of the target function, $\mathbb{E}[\partial_r f^*_{\le p}]^2$, where $p$ is related to the sample size $n$ via $n \approx d^{p+\delta}$.
- Theorem 2 for hypercube data shows a more nuanced picture: $\mathbb{E}_n(\partial_r \hat{f})^2 \approx \mathbb{E}[\partial_r f^*_{\le p}]^2 + c \cdot d^{2\delta - 2}\, \mathbb{E}[\partial_r f^*_{p+1}]^2$. This implies that in the "larger oversampling" regime ($\delta > 1/2$), KRR can start detecting the next order of important features ($f^*_{p+1}$) even if the lower-order part ($f^*_{\le p}$) does not depend on them. This phenomenon underlies the hierarchical learning capability.
- Efficient Hierarchical Learning via Leap Complexity (Theorem 4):
- The paper introduces the concept of "leap complexity" (Definition 5), which quantifies how features in a function's orthogonal polynomial expansion build upon each other. A low leap complexity means the terms can be ordered so that each one adds only a few new coordinates relative to the ones before it (e.g., $x_1 + x_1 x_2 + x_1 x_2 x_3$ has leap complexity 1); see the sketch after this list.
- Theorem 4 states that IRKM can learn sparse polynomials with leap complexity $k$ efficiently. With $n = d^{k+\delta}$ samples (for small $\delta \in (-1/2, 1/2) \setminus \{0\}$), IRKM requires only a constant number of iterations ($T = O_d(1)$) to learn the target function $f^*$ up to its maximal $p$-leap component, where $p$ is related to the sample size exponent.
- This demonstrates that IRKM leverages the hierarchical structure encoded by leap complexity. The iterative reweighting allows the algorithm to gradually identify features involved in higher-order interactions, enabling learning beyond what standard KRR can achieve at a given sample size. Theorem 3 is a special case where the learning happens in just two steps.
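As a concrete illustration of the notion (a brute-force sketch based on the informal description above, not the paper's Definition 5 verbatim), the following computes the leap of a set of monomial supports by searching over orderings for the one that minimizes the largest number of newly introduced coordinates.

```python
from itertools import permutations

def leap(supports):
    """Minimum over orderings of the largest number of new coordinates that any
    term introduces relative to the terms placed before it (brute force, so only
    practical for a handful of terms)."""
    best = float("inf")
    for order in permutations(supports):
        seen, worst = set(), 0
        for s in order:
            worst = max(worst, len(set(s) - seen))
            seen |= set(s)
        best = min(best, worst)
    return best

print(leap([{1}, {1, 2}, {1, 2, 3}]))  # x1 + x1*x2 + x1*x2*x3 -> 1
print(leap([{1, 2, 3}]))               # a lone cubic term x1*x2*x3 -> 3
```

Low leap complexity is exactly the structure the iterative reweighting exploits: each round's gradients only need to reveal a few new coordinates at a time.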
Relation to Other Methods:
- IRLS: If the kernel is linear and resampling/normalization are omitted, IRKM reduces to IRLS for sparse vector recovery. IRKM extends this concept to nonlinear function learning in kernel spaces.
- RFM: IRKM is a diagonal variant of the Recursive Feature Machines (RFM) algorithm [rfm_science], which learns features by estimating the Average Gradient Outer Product (AGOP) matrix. IRKM uses only the diagonal of this matrix (corresponding to coordinate-wise gradient magnitudes). The paper also proposes an "Online RFM" (Algorithm 3) which incorporates fresh sampling and normalization of the AGOP matrix, showing it can learn sparse functions in an unknown basis (e.g., a randomly rotated one) in experiments.
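The diagonal relationship can be checked in a few lines of NumPy (a sketch assuming the AGOP is the empirical average of gradient outer products at the sampled points; the placeholder gradients below are random, purely for illustration):

```python
import numpy as np

grads = np.random.randn(100, 8)                # placeholder: n x d gradients of the predictor
agop = grads.T @ grads / grads.shape[0]        # Average Gradient Outer Product, d x d
coordinate_signal = (grads**2).mean(axis=0)    # E_n(∂_r f̂)^2 used by IRKM, one value per coordinate

assert np.allclose(np.diag(agop), coordinate_signal)  # IRKM keeps only the diagonal of the AGOP
```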
Practical Implementation Considerations:
- Computational Cost: Each step of IRKM involves training a KRR model, which typically requires solving a linear system of size $n \times n$. The cost is $O(n^3)$ naively, or $O(n^2)$ or faster with specialized solvers or kernel structures. The total cost depends on the number of iterations $T$, which is theoretically $O_d(1)$ but might be larger in practice depending on the task and desired accuracy. The need for fresh data at each step means $T \times n$ total samples are needed, or $T$ datasets of size $n$.
- Hyperparameters: IRKM introduces new hyperparameters: the number of iterations $T$, the kernel type and its parameters, the ridge regularization $\lambda$, and the safeguard $\epsilon$. Cross-validation or heuristic tuning is required. The paper suggests $T = O_d(1)$ is sufficient theoretically, and experiments use small $T$ (e.g., 5-10).
- Kernel Choice: The theory relies on specific kernel properties (Assumptions 1 and 2). Inner-product kernels such as polynomial and exponential kernels are suitable. Gaussian and Laplacian kernels work well experimentally on data with roughly constant norm; see the sketch after this list.
- Data Distribution: Theoretical guarantees require strong assumptions (Gaussian, Hypercube). However, experiments suggest the method performs well on other datasets like CIFAR-10 (subset of pixels) and UCI tabular data, indicating potential broader applicability.
- Sparsity/Hierarchy: The method is specifically designed for functions that are sparse or have low leap complexity. Its performance advantage is most pronounced in these settings compared to standard KRR or NNs in certain sample size regimes.
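As a sketch of the kernel choices mentioned above (the specific kernels, their hyperparameters, and the `reweight` helper are illustrative, not prescribed by the paper), any base kernel can be turned into its reweighted form $K_w(x, z) = K(w \odot x, w \odot z)$ by rescaling the inputs before evaluation:

```python
import numpy as np

def reweight(base_kernel, w):
    """K_w(x, z) = K(w ⊙ x, w ⊙ z): rescale coordinates, then apply the base kernel."""
    return lambda A, B: base_kernel(A * w, B * w)

def polynomial_kernel(A, B, degree=3, c=1.0):   # inner-product kernel
    return (A @ B.T + c) ** degree

def gaussian_kernel(A, B, bandwidth=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * bandwidth**2))

def laplacian_kernel(A, B, bandwidth=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.sqrt(np.maximum(sq, 0.0)) / bandwidth)

w = np.ones(10)                        # current weight vector for d = 10
K_w = reweight(gaussian_kernel, w)     # drop-in replacement for the kernel in the KRR step
```

Only the KRR fit is agnostic to the base kernel; the closed-form gradient of the predictor does depend on the chosen kernel, so the derivative formula in the earlier sketch would need to be adapted accordingly.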
Numerical Results Summary:
The paper includes several experiments demonstrating IRKM's practical effectiveness:
- Synthetic Data (Hypercube/Gaussian): IRKM significantly outperforms standard KRR and Adam-trained neural networks (often extensively tuned) on learning sparse polynomials (low intrinsic complexity) and hierarchical polynomials (leap complexity 1 and 2) across various sample sizes. The gradient-based reweighting clearly helps in identifying important features.
- CIFAR-10 (Pixels): Using a subset of pixels, IRKM and Online RFM perform comparably to or slightly better than tuned NNs.
- UCI Tabular Data: Benchmarked on 121 classification tasks, IRKM shows performance comparable to RFM and significantly better than Random Forests and Neural Networks, suggesting that sparsity is a key characteristic of functions learned on these datasets.
Conclusion:
The paper successfully demonstrates that classical kernel methods, augmented with iterative gradient-based reweighting, normalization, and fresh data sampling, can achieve strong feature learning and hierarchical learning capabilities. This challenges the common narrative that these are unique strengths of neural networks and positions iteratively reweighted kernel machines as powerful contenders, particularly effective for learning sparse functions and functions with low leap complexity. The theoretical analysis provides insights into why the method works by connecting empirical gradients to properties of orthogonal polynomial expansions. While the theory has specific distributional assumptions, the experimental results suggest the practical utility might extend beyond these assumptions.