
Penalized mRMR for Sparse Feature Selection

Updated 31 August 2025
  • Penalized mRMR is a feature selection framework that integrates target relevance and inter-feature redundancy control via a continuous, penalized optimization formulation.
  • It reformulates the discrete mRMR criterion using dependency measures like mutual information, applying LASSO, SCAD, or MCP penalties to encourage sparsity.
  • By leveraging nonconvex penalties and a knockoff filter for FDR control, the method robustly identifies informative features in high-dimensional settings.

Penalized Minimum Redundancy Maximum Relevance (mRMR) refers to a family of feature selection methodologies for high-dimensional data that aim to extract subsets of features that are simultaneously maximally relevant to the target variable and minimally redundant with respect to each other. The penalized mRMR principle extends classical mRMR by incorporating explicit penalization—typically via continuous optimization with regularization or via explicit penalty parameters—so as to provide sharper control over feature sparsity, redundancy, and stability, including guarantees such as false discovery rate (FDR) control.

1. Mathematical Formulation

At the core of penalized mRMR methods is the reinterpretation of the discrete mRMR objective as a continuous penalized optimization. The standard mRMR criterion selects a subset $S$ of features maximizing the trade-off between relevance and redundancy:

$$\mathrm{mRMR}(S) = \frac{1}{|S|} \sum_{j\in S} D(X_j, Y) - \frac{1}{|S|^2} \sum_{j,k\in S} D(X_j, X_k)$$

where $D(\cdot,\cdot)$ is a dependency measure (e.g., mutual information, HSIC).
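As a concrete illustration, the following minimal sketch evaluates this score for a candidate subset using mutual information as $D$. The helper name is hypothetical and not part of the SmRMR package; scikit-learn's MI estimator stands in for $D$:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mrmr_score(X, y, S):
    """Discrete mRMR score of a feature subset S (list of column indices),
    with mutual information playing the role of the dependency measure D."""
    # Relevance: average MI between each selected feature and the target.
    relevance = np.mean([mutual_info_regression(X[:, [j]], y)[0] for j in S])
    # Redundancy: average pairwise MI among the selected features.
    redundancy = np.mean([mutual_info_regression(X[:, [j]], X[:, k])[0]
                          for j in S for k in S])
    return relevance - redundancy
```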

The penalized mRMR procedure introduces a vector of relaxation parameters $\theta \in \mathbb{R}_+^p$. The loss to be minimized is

$$L_{v,n}(\theta) = \frac{1}{2}\,\tau D_v(Y, Y) - \sum_k \theta_k\, \tau D_v(X_k, Y) + \frac{1}{2} \sum_{k,l} \theta_k \theta_l\, \tau D_v(X_k, X_l)$$

with $\tau D_v(\cdot,\cdot)$ a V-statistic estimator of an association measure and $\theta_k$ (continuous, non-negative) representing the importance of feature $k$. Sparsity is induced by a penalty $\sum_k p_\lambda(\theta_k)$, where $p_\lambda(\cdot)$ is, e.g., the LASSO ($\lambda\theta$), SCAD, or MCP regularizer:

$$\min_{\theta\in\mathbb{R}_+^p}\; n \cdot L_{v,n}(\theta) + n \sum_k p_\lambda(\theta_k)$$

This framework, convex under the LASSO penalty and nonconvex under SCAD or MCP, ensures that features with insufficient marginal utility or redundant information are assigned zero coefficients.
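In matrix form, with precomputed association estimates $d_k = \tau D_v(X_k, Y)$ and $Q_{kl} = \tau D_v(X_k, X_l)$, the LASSO-penalized objective reads as follows (a minimal NumPy sketch; the names d and Q are illustrative):

```python
import numpy as np

def penalized_mrmr_objective(theta, d, Q, n, lam):
    """n * L_{v,n}(theta) + n * lam * sum_k theta_k (LASSO case),
    dropping the theta-independent constant (1/2) * tau D_v(Y, Y).

    d[k]    : precomputed tau D_v(X_k, Y)   (relevance)
    Q[k, l] : precomputed tau D_v(X_k, X_l) (redundancy)
    """
    quadratic = 0.5 * theta @ Q @ theta - d @ theta
    # theta >= 0 by construction, so |theta_k| = theta_k under the LASSO.
    return n * (quadratic + lam * np.sum(theta))
```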

The table below summarizes typical choices:

| Term | Description | Typical choices |
|------|-------------|-----------------|
| $D(\cdot,\cdot)$ | Dependency measure | MI, HSIC, projection correlation |
| $p_\lambda(\cdot)$ | Penalty | LASSO / SCAD / MCP |
| $\theta_k$ | Feature coefficient (relaxation) | Continuous, $\geq 0$ |

2. Feature Selection Mechanism and Sparsity

This penalized framework achieves feature selection by driving many $\theta_k$ to zero. Nonconvex penalties (such as SCAD or MCP) are explicitly designed to ensure:

  • Small coefficients are shrunk towards zero (eliminating inactive features).
  • Large, informative coefficients face negligible penalty (avoiding estimation bias).
  • Sparsistency: Under appropriate regularity conditions, the method consistently identifies the true set of non-informative features (i.e., those with $\theta_k = 0$).

Features with high relevance $D(X_k, Y)$ and low redundancy $D(X_k, X_l)$ with respect to the selected features $l$ will be retained. Inactive or highly redundant features with low added value relative to the penalty threshold are systematically eliminated.
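To make these properties concrete, the two nonconvex penalties can be written in their standard closed forms (a minimal NumPy sketch; the defaults $a = 3.7$ and $\gamma = 3$ are conventional choices from the penalization literature, not values prescribed by the paper):

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty of Fan & Li (2001) for t >= 0; linear near zero,
    flat (constant) beyond a*lam, so large coefficients are barely biased."""
    t = np.asarray(t, dtype=float)
    return np.where(
        t <= lam, lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                 lam**2 * (a + 1) / 2))

def mcp(t, lam, gamma=3.0):
    """MCP penalty of Zhang (2010) for t >= 0; shrinkage tapers off and
    vanishes entirely beyond gamma*lam."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= gamma * lam,
                    lam * t - t**2 / (2 * gamma),
                    gamma * lam**2 / 2)
```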

3. FDR Control via Knockoff Multi-Stage Selection

To control false discoveries, the penalized mRMR pipeline incorporates a multi-stage procedure using the knockoff filter. The procedure is:

(a) Knockoff construction: Generate, for each feature $X_k$, an auxiliary knockoff $\tilde{X}_k$ with coordinated statistical properties (same means and covariances).

(b) Statistic computation: For each feature, compute the knockoff statistic $W_k = \hat{\theta}_k - \tilde{\theta}_k$, where $\hat{\theta}_k$ is the estimated coefficient for $X_k$ and $\tilde{\theta}_k$ the one for $\tilde{X}_k$.

(c) Thresholding: Set a level $\alpha$ for FDR control, and select the features $k$ with $W_k \geq T(\alpha)$, where $T(\alpha)$ is the minimal threshold ensuring

$$\mathrm{FDP}(T) = \frac{1 + \#\{k : W_k \leq -T\}}{\#\{k : W_k \geq T\} \vee 1} \leq \alpha$$

Conditional on the screening step, this adaptive procedure ensures that the expected FDR among the selected features does not exceed the user-specified level $\alpha$.
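A minimal sketch of the thresholding step, assuming the statistics $W_k$ from step (b) are already computed (function names are illustrative, not from the SmRMR package):

```python
import numpy as np

def knockoff_threshold(W, alpha):
    """Smallest T with (1 + #{W_k <= -T}) / max(#{W_k >= T}, 1) <= alpha."""
    for T in np.sort(np.abs(W[W != 0])):  # candidate thresholds, ascending
        fdp = (1 + np.sum(W <= -T)) / max(np.sum(W >= T), 1)
        if fdp <= alpha:
            return T
    return np.inf  # no threshold achieves the target level

def knockoff_select(W, alpha):
    """Indices of features passing the knockoff filter at level alpha."""
    return np.where(W >= knockoff_threshold(W, alpha))[0]
```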

In high-dimensional settings ($p \gg n$), a data splitting step is used: pre-screening is run on a subset to ensure $2p < n$ before knockoff construction, and selected features are merged with the main data for final selection.

4. Comparison with HSIC-LASSO

Penalized mRMR shares a conceptual structure with other dependency-based sparse methods such as HSIC-LASSO. Both:

  • Use a kernel-based measure of feature–target dependence.
  • Penalize redundancy via pairwise similarity in the quadratic term.
  • Rely on $\ell_1$ or nonconvex penalties for sparsity.

Distinctive aspects of penalized mRMR:

  • Tends to be more conservative in the number of retained features, yielding sparser models.
  • Feature selection depends only on specifying an FDR threshold, not the explicit number of features.
  • Use of nonconvex penalties ensures better support recovery compared to HSIC-LASSO's LASSO-only regime, especially under strong feature correlations.

Simulation studies and real high-dimensional biological data show that penalized mRMR usually selects a smaller (or comparable) set of active features with similar classification accuracy and improved FDR control compared to HSIC-LASSO.

5. Practical Implementation and Usage

Penalty/solver: LASSO-penalized variants are convex and can be implemented using standard solvers (e.g., CVXPY). Nonconvex penalties (SCAD, MCP) require custom routines and are commonly handled with the local linear approximation (LLA) algorithm, which can be initialized from the LASSO path.
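For the convex LASSO variant, the whole program can be stated directly in CVXPY. This is a sketch under the assumption that the estimated redundancy matrix Q is (at least numerically) positive semidefinite; psd_wrap tells CVXPY to accept it without re-verifying convexity:

```python
import cvxpy as cp
import numpy as np

def lasso_penalized_mrmr(d, Q, n, lam):
    """Minimize n * (-d'theta + 0.5 * theta'Q theta) + n * lam * sum(theta)
    over theta >= 0, where d and Q are precomputed association estimates."""
    theta = cp.Variable(len(d), nonneg=True)
    loss = -d @ theta + 0.5 * cp.quad_form(theta, cp.psd_wrap(Q))
    cp.Problem(cp.Minimize(n * loss + n * lam * cp.sum(theta))).solve()
    return theta.value  # sparse, non-negative feature importances
```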

Parameter selection:

  • $\lambda$ (regularization strength) and $\alpha$ (FDR threshold) are chosen via cross-validation or a hold-out validation set.
  • If the knockoff filter finds no active features at a candidate FDR level, the threshold is relaxed and the procedure is repeated (see the retry sketch below).
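The relaxation step can be expressed as a simple retry loop. This is illustrative: the grid of candidate levels is an assumption, and knockoff_select refers to the sketch in Section 3:

```python
import numpy as np

def select_with_relaxed_fdr(W, alphas=(0.05, 0.10, 0.20, 0.30)):
    """Retry the knockoff filter at progressively looser FDR levels
    until at least one feature survives."""
    for alpha in alphas:
        selected = knockoff_select(W, alpha)  # sketch from Section 3
        if selected.size > 0:
            return selected, alpha
    return np.array([], dtype=int), None  # nothing selected at any level
```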

Association measure: The method is agnostic to the association measure $D(\cdot,\cdot)$; projection correlation and normalized HSIC are both shown to work well.

High-dimensional adaptation: For $p \gg n$, data splitting and screening are essential to satisfy the model-X knockoff's requirement that $2p < n$ (see Section 3).
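One way to realize this adaptation is sketched below; the 50/50 split and marginal mutual-information screening are illustrative assumptions, not the paper's prescribed pipeline:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def split_and_screen(X, y, seed=0):
    """Split samples, rank features marginally on the screening half, and
    keep few enough features that 2p' < n holds on the remaining half."""
    n, _ = X.shape
    idx = np.random.default_rng(seed).permutation(n)
    screen, main = idx[: n // 2], idx[n // 2:]
    scores = mutual_info_regression(X[screen], y[screen])
    n_keep = max(1, len(main) // 2 - 1)  # guarantees 2 * n_keep < len(main)
    keep = np.argsort(scores)[::-1][:n_keep]
    return keep, main  # screened feature indices, samples for knockoffs
```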

Software: Reference implementation is provided at https://github.com/PeterJackNaylor/SmRMR (Naylor et al., 26 Aug 2025).

6. Empirical Performance and Applications

Empirical evaluation on synthetic processes (linear, nonlinear, discrete response) and real-world datasets (gene expression, GWAS, high-dimensional images) demonstrates:

  • FDR-controlled selection: The method attains targeted levels of FDR, with true positive rate (TPR), false positive rate (FPR), and predictive accuracy comparable to or better than existing sparse methods.
  • Parsimony: Fewer, more interpretable features are chosen.
  • Generalization: Comparable predictive performance to HSIC-LASSO with a smaller, more conservative feature set.
  • Robustness: Maintains selection validity across data modalities and sample sizes.

This suggests penalized mRMR is especially suitable in scientific settings where controlling the number of discoveries is critical and model parsimony is valued.

7. Limitations and Considerations

  • The necessity of data splitting to accommodate the model-X knockoff constraint can reduce statistical power, though recycling strategies may ameliorate this.
  • Greedy approximations may be needed when extremely high dimensionality impedes convergence of the continuous optimization.
  • The method's effectiveness depends on the association measure, penalty parameterization, and FDR threshold.

A plausible implication is that advances in knockoff generation for $p > n$ and scalable nonconvex optimization will further enhance the applicability of penalized mRMR to ultra-high-dimensional biological and sensor data.


In summary, the penalized mRMR approach recasts feature selection as a continuous, sparsity-inducing optimization problem in which feature importance is determined by jointly maximizing target relevance and minimizing inter-feature redundancy, augmented with explicit FDR control in the presence of correlation structure. The result is a robust platform for discovery-oriented variable selection in modern high-dimensional data regimes (Naylor et al., 26 Aug 2025).

References (1)