Recursive Feature Machines (RFMs)

Updated 23 October 2025
  • Recursive Feature Machines (RFMs) are algorithms that recursively update feature representations using gradient and risk measures to enhance model performance.
  • They employ mechanisms such as Risk-RFE and the AGOP to iteratively eliminate or adapt features, improving scalability and interpretability.
  • RFMs find applications in kernel machines, tabular data segmentation, molecular modeling, and controlled generative tasks such as music production.

Recursive Feature Machines (RFMs) are a class of algorithms that implement feature learning by recursively updating feature representations using principled, mathematically transparent mechanisms tied to gradient information or empirical risk changes. RFMs have emerged in multiple contexts across machine learning, including kernel machines, neural networks, code decoding, conformal learning, low-rank recovery, interpretable modeling, and controlled generation. The unifying principle is the iterative adjustment or elimination of features based on their impact, as measured by gradients, empirical risk, or non-conformity scores.

1. Core Mechanisms and Mathematical Foundations

RFMs are built upon iterative procedures that optimize feature sets or reweight feature directions to enhance model prediction or representation. Two major classes of RFM mechanisms exist, along with several algorithmic extensions:

  • Recursive Feature Elimination by Empirical Risk ("Risk-RFE"):

The earliest RFM approach (Dasgupta et al., 2013) operates in the context of kernel machines. At each step, features are removed one at a time based on their marginal impact on the regularized empirical risk:

$$\mathcal{R}^{\text{reg},\lambda}_{L,D,H}(f) = \lambda \|f\|_H^2 + \mathcal{R}_{L,D}(f)$$

Here, $\mathcal{R}_{L,D}(f)$ is the empirical risk, $\lambda$ is a regularization parameter, and $H$ is the RKHS induced by the kernel $k$. For each feature $i$, a projected RKHS $H^{J \cup \{i\}}$ is formed by zeroing out feature $i$ (with $J$ the set of features already eliminated), and the resulting change in regularized empirical risk is

$$\Delta_i = \mathcal{R}^{\text{reg},\lambda}_{L,D,H^{J \cup \{i\}}}(f_{D,\lambda,H^{J \cup \{i\}}}) - \mathcal{R}^{\text{reg},\lambda}_{L,D,H^J}(f_{D,\lambda,H^J})$$

The feature $i$ with the smallest $\Delta_i$ is eliminated, and the process repeats until the smallest risk increase exceeds a stopping threshold $\delta_n$.
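
The following minimal sketch illustrates this elimination loop for kernel ridge regression with a plain Laplace kernel; the kernel choice, the squared loss, and dropping a column (equivalent to zeroing that feature for a distance-based kernel) are illustrative assumptions rather than the exact construction of Dasgupta et al. (2013).

```python
import numpy as np

def laplace_kernel(X, Z, gamma=1.0):
    """Laplace kernel K(x, z) = exp(-gamma * ||x - z||_1)."""
    dists = np.abs(X[:, None, :] - Z[None, :, :]).sum(axis=-1)
    return np.exp(-gamma * dists)

def regularized_risk(X, y, lam=1e-3, gamma=1.0):
    """Regularized empirical risk of the kernel ridge solution on (X, y)."""
    n = len(y)
    K = laplace_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)   # kernel ridge coefficients
    preds = K @ alpha
    emp_risk = np.mean((preds - y) ** 2)                  # R_{L,D}(f), squared loss
    rkhs_norm_sq = alpha @ K @ alpha                      # ||f||_H^2
    return lam * rkhs_norm_sq + emp_risk

def risk_rfe(X, y, delta_n=1e-3):
    """Remove features one at a time by smallest increase in regularized risk."""
    active = list(range(X.shape[1]))
    base = regularized_risk(X[:, active], y)
    while len(active) > 1:
        # risk change from removing each remaining feature (one refit per candidate)
        deltas = [regularized_risk(X[:, [j for j in active if j != i]], y) - base
                  for i in active]
        k = int(np.argmin(deltas))
        if deltas[k] > delta_n:        # stop once every removal costs more than delta_n
            break
        active.pop(k)
        base = regularized_risk(X[:, active], y)
    return active
```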

  • Recursive Feature Learning using the Average Gradient Outer Product (AGOP):

In modern RFMs (Radhakrishnan et al., 2022), feature adaptation is performed by recursively updating a learnable metric $M$ in the kernel via the AGOP:

$$M \leftarrow \frac{1}{n} \sum_{p=1}^n [\nabla f(x_p)][\nabla f(x_p)]^T$$

The corresponding kernel is a Mahalanobis kernel (e.g., Laplace kernel):

$$K_M(x, z) = \exp\left(-\gamma \sqrt{(x-z)^T M (x-z)}\right)$$

The AGOP highlights directions in input space that most affect predictions, thus continually adapting the kernel (or metric) in feature space.
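
A minimal sketch of this recursion for kernel ridge regression with the Mahalanobis Laplace kernel above follows; the analytic gradient formula, the plain ridge solve, and the omission of refinements such as gradient centering or square-root normalization of the AGOP are simplifying assumptions of this illustration.

```python
import numpy as np

def laplace_M(X, Z, M, gamma=1.0, eps=1e-12):
    """Mahalanobis Laplace kernel K_M(x, z) = exp(-gamma * sqrt((x - z)^T M (x - z)))."""
    diff = X[:, None, :] - Z[None, :, :]                                  # (n, m, d)
    dist = np.sqrt(np.maximum(np.einsum('nmd,de,nme->nm', diff, M, diff), eps))
    return np.exp(-gamma * dist), diff, dist

def rfm_fit(X, y, T=5, lam=1e-3, gamma=1.0):
    """Minimal RFM loop: kernel ridge solve, then AGOP update of the metric M."""
    n, d = X.shape
    M = np.eye(d)
    for _ in range(T):
        K, diff, dist = laplace_M(X, X, M, gamma)
        alpha = np.linalg.solve(K + lam * np.eye(n), y)    # ridge solution in the kernel
        # gradient of f(x) = sum_p alpha_p K_M(x, x_p) at each training point x_q:
        #   grad f(x_q) = -gamma * sum_p alpha_p K(x_q, x_p) * M (x_q - x_p) / dist(x_q, x_p)
        W = -gamma * alpha[None, :] * K / dist             # (n, n) per-pair weights
        grads = np.einsum('qp,qpd,de->qe', W, diff, M)     # (n, d); M is symmetric
        M = grads.T @ grads / n                            # AGOP update
    return M, alpha
```

The number of iterations and the ridge parameter are typically chosen on held-out data; each pass reweights the directions along which the fitted function varies most.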

  • Algorithmic Formulation and Extensions:

In linear and low-rank recovery settings (Radhakrishnan et al., 9 Jan 2024), the recursion takes the form

$$W_t = \operatorname{argmin}_W \|W\|_F^2 \ \text{subject to} \ \langle A_i M_t, W \rangle = y_i$$

$$M_{t+1} = \phi(M_t^T W_t^T W_t M_t)$$

with $\phi$ a spectral function, often a matrix power.
  - For interpretable QSPR modeling (Shen et al., 21 Nov 2024), feature importance per sample is calculated as $x_i^T M x_i$, with global importance taken as the mean over all samples.
  - In xRFM for scalable tabular learning (Beaglehole et al., 12 Aug 2025), partitioning is performed by median splits along directions of maximal AGOP eigenvectors, yielding local feature learning at each leaf node.
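
As a concrete illustration of the xRFM split rule in the last item, the sketch below performs one median split along the top eigenvector of a node's AGOP matrix; the recursive tree construction, leaf-level RFM fits, and stopping criteria of Beaglehole et al. (12 Aug 2025) are omitted.

```python
import numpy as np

def xrfm_split(X, M_node):
    """One xRFM-style partition step: median split along the top AGOP eigenvector.

    X      : (n, d) samples reaching this tree node.
    M_node : (d, d) AGOP matrix from the node's local RFM fit (assumed given).
    """
    _, eigvecs = np.linalg.eigh(M_node)       # eigh: M_node is symmetric PSD
    v = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
    proj = X @ v                              # project samples onto the split direction
    threshold = np.median(proj)               # median split
    left = proj <= threshold
    return (X[left], X[~left]), (v, threshold)
```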

2. Theoretical Guarantees and Analytical Results

RFMs provide theoretical consistency and expressivity beyond classical algorithms:

  • Consistency of Risk-RFE (Dasgupta et al., 2013): Under convex, locally Lipschitz losses and bounded, measurable kernels, Risk-RFE is uniformly consistent in identifying the correct feature subspace, provided RKHS complexity is appropriately controlled (via entropy numbers) and eliminating any relevant feature incurs a minimum risk gap $\varepsilon_0 > 0$.
  • Duality and Function Space Extension (Chen et al., 2023): A duality framework links estimation in function spaces $\mathcal{F}_{p,\pi}$ to approximation in conjugate feature spaces, with sharp, dimension-independent approximation rates for $p > 1$:

$$\inf_{\{c_j\}} \left\| f - \frac{1}{m}\sum_j c_j \varphi(\cdot,v_j) \right\|_{q,\rho} \lesssim \left[ \frac{M_q^{p'-2} R_q^2 \log^3 m + M_q \log(1/\delta)}{m} \right]^{1/p'}$$

For $p < 2$, RFMs can approximate functions outside the RKHS, broadening learning capacity beyond kernel methods.

  • Dimensionality Reduction and Sparse Recovery (Radhakrishnan et al., 9 Jan 2024): In linear regimes, RFM generalizes IRLS algorithms, recovering low-rank matrices and sparse vectors efficiently. Fixed points of the iterative update correspond to critical points of regularized spectral objectives, with efficient (SVD-free) scaling to matrices with millions of entries.
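
To make the linear recursion from Section 1 concrete, here is a minimal matrix-sensing sketch; the least-norm solve via lstsq, the choice of $\phi$ as a matrix square root, and reporting $W_t M_t^T$ as the running estimate (which follows from $\langle A_i M_t, W \rangle = \langle A_i, W M_t^T \rangle$) are assumptions of this illustration rather than the exact algorithm of Radhakrishnan et al. (9 Jan 2024).

```python
import numpy as np

def psd_power(S, p):
    """phi(S) = S^p for a symmetric PSD matrix S, via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.clip(vals, 0.0, None) ** p) @ vecs.T

def linear_rfm(A_list, y, d, T=10, p=0.5):
    """Linear RFM recursion (sketch): least-norm W_t under <A_i M_t, W> = y_i,
    then M_{t+1} = phi(M_t^T W_t^T W_t M_t) with phi a matrix power."""
    M = np.eye(d)
    estimate = np.zeros((d, d))
    for _ in range(T):
        # constraints are linear in vec(W): rows of B are vec(A_i M)
        B = np.stack([(A @ M).ravel() for A in A_list])
        W = np.linalg.lstsq(B, y, rcond=None)[0].reshape(d, d)  # min-Frobenius-norm solution
        estimate = W @ M.T        # since <A_i M, W> = <A_i, W M^T>, this matches the data
        M = psd_power(M.T @ W.T @ W @ M, p)
    return estimate
```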

3. Practical Applications and Experimental Results

RFMs have been applied across:

  • Feature Selection and Elimination (Dasgupta et al., 2013, López-De-Castro et al., 29 May 2024): RFMs eliminate features by risk or non-conformity, producing concise feature sets with automatic stopping and achieving higher accuracy than traditional RFE or LASSO across tabular, protein, and image datasets.
  • Feature Learning for Kernel Machines and Tabular Data (Radhakrishnan et al., 2022, Beaglehole et al., 12 Aug 2025): By recursive metric learning via the AGOP, RFMs achieve state-of-the-art (SOTA) results on over 150 tabular datasets, outperforming both kernel and deep methods. xRFM combines these advances with tree-based partitioning for scalability and local adaptation.
  • QSPR Molecular Modeling (Shen et al., 21 Nov 2024): RFMs provide interpretable modeling for molecular properties using tailored fingerprints (MACCS, Morgan, hybrid). Both local and global feature importance are extracted from AGOP-weighted kernels, facilitating molecule design by identifying actionable substructures.
  • Music Generation and Controllability (Zhao et al., 21 Oct 2025): MusicRFM adapts RFMs to steer frozen autoregressive music models by injecting concept directions (AGOP eigenvectors) into activation flows, achieving fine-grained control over musical attributes and measurable improvements in generation accuracy while preserving prompt fidelity (a generic steering sketch follows this list).
  • Emergence and Generalization Phenomena (Mallinar et al., 29 Jul 2024, Gupta et al., 2023): RFMs replicate ‘grokking’ phenomena (sharp generalization phase transitions) observed in neural networks by progressively learning block-circulant features for modular arithmetic tasks, closely mirroring the learning process theorized to underlie deep learning’s emergent behavior. This aligns with the double descent MSE pattern observed under controlled noise and overparameterization (Gupta et al., 2023).
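
As a generic illustration of the activation-steering idea in the MusicRFM item above, the PyTorch hook below adds a scaled concept direction to a chosen module's output; the hook mechanism, the constant scale eta, and the layer choice are assumptions of this sketch, and MusicRFM's per-step schedule $\eta_\ell(t)$ and concept-direction extraction are not reproduced.

```python
import torch

def make_steering_hook(q: torch.Tensor, eta: float):
    """Forward hook adding eta * q to a module's output along the hidden dimension.

    Assumes the hooked module returns a plain tensor of shape (..., hidden_dim)
    and that q has shape (hidden_dim,).
    """
    def hook(module, inputs, output):
        return output + eta * q.to(dtype=output.dtype, device=output.device)
    return hook

# Hypothetical usage on a transformer block `block` of a frozen music model:
# handle = block.register_forward_hook(make_steering_hook(q_concept, eta=0.5))
# ...generate as usual, then call handle.remove() to stop steering.
```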

4. Interpretability and Model Analysis

A major strength of RFMs is inherent interpretability due to gradient-based feature importance and structural transparency:

  • Feature Matrix Analysis:

The learned metric $M$ (whether obtained via Risk-RFE, AGOP, or conformal non-conformity scores) can be analyzed directly for diagonal dominance (per-feature relevance), eigenstructure (feature combinations), and block structure (as in modular arithmetic with circulant blocks) (Mallinar et al., 29 Jul 2024).

  • Feature Importance Scoring:

RFMs support both local analysis ($x_i^T M x_i$ for individual samples) and global ranking (mean score across the dataset), with scores correlating with physically or chemically meaningful descriptors, enabling actionable feedback for applications such as drug design or data screening (Shen et al., 21 Nov 2024).
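
A minimal sketch of these scores follows, using the per-sample form $x_i^T M x_i$ stated above and the diagonal of $M$ for per-feature relevance as in the previous item; the exact per-feature attribution of Shen et al. (21 Nov 2024) may differ.

```python
import numpy as np

def rfm_importance(X, M):
    """Local and global importance scores from a learned RFM metric M (sketch)."""
    local = np.einsum('id,de,ie->i', X, M, X)   # per-sample score x_i^T M x_i
    global_score = local.mean()                 # mean score across the dataset
    per_feature = np.diag(M)                    # diagonal of M as per-feature relevance
    return local, global_score, per_feature
```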

  • Uncertainty Quantification and Consistency:

Conformal Recursive Feature Elimination (CRFE) extends the RFM paradigm into conformal prediction, enabling recursive elimination based on non-conformity measures tied directly to sample ‘strangeness’ and introducing robust stopping rules and consistency indices across data splits (López-De-Castro et al., 29 May 2024).

5. Scalability, Efficiency, and Broader Connections

RFMs have been designed for computational efficiency, and their recursive structure connects to broader methodological developments:

  • Algorithmic Complexity:

Risk-RFE exhibits linear-to-quadratic scaling, while tree-partitioned xRFM trains in $O(n \log n)$ time and performs inference in $O(\log n)$ per query (Beaglehole et al., 12 Aug 2025). SVD-free matrix completion provides practical scaling for linear RFM variants (Radhakrishnan et al., 9 Jan 2024).

  • Connections to Other Recursive Paradigms:

The recursive structure in Reed–Muller (RM) code decoding (Dumer et al., 2017) offers a blueprint for layered, candidate-maintaining, and Bayesian-updating strategies, a conceptual precursor to RFM architectures.

  • Kernel Extension and Model Generalization:

RFMs are compatible with diverse kernels, including Laplace, Matérn, Gaussian, and rational quadratic (Shen et al., 21 Nov 2024). AGOP-based adaptation introduces feature learning into previously static kernel machines and is generalizable to non-neural, non-gradient-based models.

6. Open Questions and Future Directions

Research into RFMs suggests several future directions and open problems:

  • Theoretical Analysis:

Further theoretical work is needed to fully understand the double-descent MSE pattern and emergent phase transitions in RFMs, especially their parallels with neural networks and random feature models (Gupta et al., 2023, Mallinar et al., 29 Jul 2024).

  • Algorithmic Extension:

Investigating RFMs with richer recursive structures (multi-way splits, deeper trees, higher-order coupling), broader kernel types, and alternative domains (e.g., modular learning, sequence modeling, generative models) presents promising avenues for performance and interpretability enhancement.

  • Practical Deployment:

Expansion into real-time and controlled generation, large-scale data applications, and high-dimensional function approximation (Zhao et al., 21 Oct 2025, Shen et al., 21 Nov 2024, Beaglehole et al., 12 Aug 2025) is ongoing.

  • Unification with Neural Feature Learning:

The alignment between AGOP and neural feature matrices (NFM), as formalized in the Deep Neural Feature Ansatz (Radhakrishnan et al., 2022), merits further empirical and theoretical exploration—potentially informing broader principles behind feature learning and generalization in machine learning.

7. Representative Mathematical Formulas and Algorithmic Components

| Mechanism | Formula / Algorithmic Step | Context |
|---|---|---|
| Risk-RFE | $\mathcal{R}^{\text{reg},\lambda}_{L,D,H}(f)$ with $H^J$ projection | Feature Elimination in Kernel Machines (Dasgupta et al., 2013) |
| AGOP Update | $M \leftarrow \frac{1}{n}\sum_{p=1}^n \nabla f(x_p)\nabla f(x_p)^T$ | Kernel Feature Learning (Radhakrishnan et al., 2022); Interpretable Modeling (Shen et al., 21 Nov 2024) |
| Tree Split Direction | $v =$ top eigenvector of $\mathrm{AGOP}(f, S)$ | xRFM Tabular Partitioning (Beaglehole et al., 12 Aug 2025) |
| Music Generation Control | $h'_{t,\ell} = h_{t,\ell} + \eta_\ell(t)\, q_{\ell,j^*}$ | Steering MusicGen via RFMs (Zhao et al., 21 Oct 2025) |
| Low-rank Matrix Filter | $M_{t+1} = \phi(M_t^T W_t^T W_t M_t)$, power spectral update | Matrix Recovery (Radhakrishnan et al., 9 Jan 2024) |
| Conformal Elimination | Remove feature with max $\beta_j$, where $\beta_j = \sum_i \phi(z_i, \mathcal{Z}, j)$ | Conformal RFMs (López-De-Castro et al., 29 May 2024) |

Conclusion

Recursive Feature Machines comprise a systematic and theoretically grounded family of algorithms for feature selection, adaptation, and learning. By leveraging empirical risk increments, gradient outer products, and recursive mechanisms, RFMs achieve strong generalization, interpretability, and scalability across a range of machine learning applications. Their mathematical transparency and adaptability make RFMs broadly useful tools for modern machine learning, with ongoing research continuing to expand their theoretical and practical reach.
