
Penalized mRMR for Sparse Feature Selection

Updated 31 August 2025
  • Penalized mRMR is a feature selection framework that integrates target relevance and inter-feature redundancy control via a continuous, penalized optimization formulation.
  • It reformulates the discrete mRMR criterion using dependency measures like mutual information, applying LASSO, SCAD, or MCP penalties to encourage sparsity.
  • By leveraging nonconvex penalties and a knockoff filter for FDR control, the method robustly identifies informative features in high-dimensional settings.

Penalized Minimum Redundancy Maximum Relevance (mRMR) refers to a family of feature selection methodologies for high-dimensional data that aim to extract subsets of features that are simultaneously maximally relevant to the target variable and minimally redundant with respect to each other. The penalized mRMR principle extends classical mRMR by incorporating explicit penalization—typically via continuous optimization with regularization or via explicit penalty parameters—so as to provide sharper control over feature sparsity, redundancy, and stability, including guarantees such as false discovery rate (FDR) control.

1. Mathematical Formulation

At the core of penalized mRMR methods is the reinterpretation of the discrete mRMR objective as a continuous penalized optimization. The standard mRMR criterion selects a subset $S$ of features maximizing the trade-off between relevance and redundancy:

$$\mathrm{mRMR}(S) = \frac{1}{|S|} \sum_{j\in S} D(X_j, Y) - \frac{1}{|S|^2} \sum_{j,k\in S} D(X_j, X_k)$$

where $D(\cdot,\cdot)$ is a dependency measure (e.g., mutual information, HSIC).
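As a concrete illustration, the following minimal sketch evaluates this score for a candidate subset using mutual information as $D$. The helper name is hypothetical and not part of the SmRMR package; scikit-learn's MI estimator stands in for $D$:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mrmr_score(X, y, S):
    """Discrete mRMR score of a feature subset S (list of column indices),
    with mutual information playing the role of the dependency measure D."""
    # Relevance: average MI between each selected feature and the target.
    relevance = np.mean([mutual_info_regression(X[:, [j]], y)[0] for j in S])
    # Redundancy: average pairwise MI among the selected features.
    redundancy = np.mean([mutual_info_regression(X[:, [j]], X[:, k])[0]
                          for j in S for k in S])
    return relevance - redundancy
```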

The penalized mRMR procedure introduces a vector of relaxation parameters $\theta \in \mathbb{R}_+^p$. The loss to be minimized is

$$L_{v,n}(\theta) = \frac{1}{2}\,\tau D_v(Y, Y) - \sum_k \theta_k\, \tau D_v(X_k, Y) + \frac{1}{2} \sum_{k,l} \theta_k \theta_l\, \tau D_v(X_k, X_l)$$

with $\tau D_v(\cdot,\cdot)$ a V-statistic estimator of an association measure and $\theta_k$ (continuous, non-negative) representing the importance of feature $k$. Sparsity is induced by a penalty $\sum_k p_\lambda(\theta_k)$, where $p_\lambda(\cdot)$ is, e.g., the LASSO ($\lambda\theta$), SCAD, or MCP regularizer:

$$\min_{\theta\in\mathbb{R}_+^p}\; n \cdot L_{v,n}(\theta) + n \sum_k p_\lambda(\theta_k)$$

This framework, convex under the LASSO penalty and nonconvex under SCAD or MCP, ensures that features with insufficient marginal utility or redundant information are assigned zero coefficients.
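In matrix form, with precomputed association estimates $d_k = \tau D_v(X_k, Y)$ and $Q_{kl} = \tau D_v(X_k, X_l)$, the LASSO-penalized objective reads as follows (a minimal NumPy sketch; the names d and Q are illustrative):

```python
import numpy as np

def penalized_mrmr_objective(theta, d, Q, n, lam):
    """n * L_{v,n}(theta) + n * lam * sum_k theta_k (LASSO case),
    dropping the theta-independent constant (1/2) * tau D_v(Y, Y).

    d[k]    : precomputed tau D_v(X_k, Y)   (relevance)
    Q[k, l] : precomputed tau D_v(X_k, X_l) (redundancy)
    """
    quadratic = 0.5 * theta @ Q @ theta - d @ theta
    # theta >= 0 by construction, so |theta_k| = theta_k under the LASSO.
    return n * (quadratic + lam * np.sum(theta))
```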

The table below summarizes typical choices:

| Term | Description | Typical choices |
|------|-------------|-----------------|
| $D(\cdot,\cdot)$ | Dependency measure | MI, HSIC, projection correlation |
| $p_\lambda(\cdot)$ | Penalty | LASSO / SCAD / MCP |
| $\theta_k$ | Feature coefficient (relaxation) | Continuous, $\geq 0$ |

2. Feature Selection Mechanism and Sparsity

This penalized framework achieves feature selection by driving many $\theta_k$ to zero. Nonconvex penalties (such as SCAD or MCP) are explicitly designed to ensure:

  • Small coefficients are shrunk towards zero (eliminating inactive features).
  • Large, informative coefficients face negligible penalty (avoiding estimation bias).
  • Sparsistency: Under appropriate regularity conditions, the method consistently identifies the true set of non-informative features (i.e., those with $\theta_k = 0$).

Features with high relevance $D(X_k, Y)$ and low redundancy $D(X_k, X_l)$ with respect to the selected features $l$ will be retained. Inactive or highly redundant features with low added value relative to the penalty threshold are systematically eliminated.
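To make these properties concrete, the two nonconvex penalties can be written in their standard closed forms (a minimal NumPy sketch; the defaults $a = 3.7$ and $\gamma = 3$ are conventional choices from the penalization literature, not values prescribed by the paper):

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty of Fan & Li (2001) for t >= 0; linear near zero,
    flat (constant) beyond a*lam, so large coefficients are barely biased."""
    t = np.asarray(t, dtype=float)
    return np.where(
        t <= lam, lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                 lam**2 * (a + 1) / 2))

def mcp(t, lam, gamma=3.0):
    """MCP penalty of Zhang (2010) for t >= 0; shrinkage tapers off and
    vanishes entirely beyond gamma*lam."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= gamma * lam,
                    lam * t - t**2 / (2 * gamma),
                    gamma * lam**2 / 2)
```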

3. FDR Control via Knockoff Multi-Stage Selection

To control false discoveries, the penalized mRMR pipeline incorporates a multi-stage procedure using the knockoff filter. The procedure is:

(a) Knockoff construction: Generate, for each feature $X_k$, an auxiliary knockoff $\tilde{X}_k$ with coordinated statistical properties (same means and covariances).

(b) Statistic computation: For each feature, compute the knockoff statistic $W_k = \hat{\theta}_k - \tilde{\theta}_k$, where $\hat{\theta}_k$ is the estimated coefficient for $X_k$ and $\tilde{\theta}_k$ the one for $\tilde{X}_k$.

(c) Thresholding: Set a level $\alpha$ for FDR control, and select the features $k$ with $W_k \geq T(\alpha)$, where $T(\alpha)$ is the minimal threshold ensuring

$$\mathrm{FDP}(T) = \frac{1 + \#\{k : W_k \leq -T\}}{\#\{k : W_k \geq T\} \vee 1} \leq \alpha$$

Conditional on the screening step, this adaptive procedure ensures that the expected FDR among the selected features does not exceed the user-specified level $\alpha$.
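A minimal sketch of the thresholding step, assuming the statistics $W_k$ from step (b) are already computed (function names are illustrative, not from the SmRMR package):

```python
import numpy as np

def knockoff_threshold(W, alpha):
    """Smallest T with (1 + #{W_k <= -T}) / max(#{W_k >= T}, 1) <= alpha."""
    for T in np.sort(np.abs(W[W != 0])):  # candidate thresholds, ascending
        fdp = (1 + np.sum(W <= -T)) / max(np.sum(W >= T), 1)
        if fdp <= alpha:
            return T
    return np.inf  # no threshold achieves the target level

def knockoff_select(W, alpha):
    """Indices of features passing the knockoff filter at level alpha."""
    return np.where(W >= knockoff_threshold(W, alpha))[0]
```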

In high-dimensional settings ($p \gg n$), a data splitting step is used: pre-screening is run on a subset to ensure $2p < n$ before knockoff construction, and selected features are merged with the main data for final selection.

4. Comparison with HSIC-LASSO

Penalized mRMR shares a conceptual structure with other dependency-based sparse methods such as HSIC-LASSO. Both:

  • Use a kernel-based measure of feature–target dependence.
  • Penalize redundancy via pairwise similarity in the quadratic term.
  • Rely on $\ell_1$ or nonconvex penalties for sparsity.

Distinctive aspects of penalized mRMR:

  • Tends to be more conservative in the number of retained features, yielding sparser models.
  • Feature selection depends only on specifying an FDR threshold, not the explicit number of features.
  • Use of nonconvex penalties ensures better support recovery compared to HSIC-LASSO's LASSO-only regime, especially under strong feature correlations.

Simulation studies and real high-dimensional biological data show that penalized mRMR usually selects a smaller (or comparable) set of active features with similar classification accuracy and improved FDR control compared to HSIC-LASSO.

5. Practical Implementation and Usage

Penalty/solver: LASSO-penalized variants are convex and can be implemented using standard solvers (e.g., CVXPY). Nonconvex penalties (SCAD, MCP) require custom routines and are commonly handled with the local linear approximation (LLA) algorithm, which can be initialized from the LASSO path.
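For the convex LASSO variant, the whole program can be stated directly in CVXPY. This is a sketch under the assumption that the estimated redundancy matrix Q is (at least numerically) positive semidefinite; psd_wrap tells CVXPY to accept it without re-verifying convexity:

```python
import cvxpy as cp
import numpy as np

def lasso_penalized_mrmr(d, Q, n, lam):
    """Minimize n * (-d'theta + 0.5 * theta'Q theta) + n * lam * sum(theta)
    over theta >= 0, where d and Q are precomputed association estimates."""
    theta = cp.Variable(len(d), nonneg=True)
    loss = -d @ theta + 0.5 * cp.quad_form(theta, cp.psd_wrap(Q))
    cp.Problem(cp.Minimize(n * loss + n * lam * cp.sum(theta))).solve()
    return theta.value  # sparse, non-negative feature importances
```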

Parameter selection:

  • $\lambda$ (regularization strength) and $\alpha$ (FDR threshold) are chosen via cross-validation or a hold-out validation set.
  • If the knockoff filter finds no active features at a candidate FDR level, the threshold is relaxed and the procedure is repeated (see the retry sketch below).
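The relaxation step can be expressed as a simple retry loop. This is illustrative: the grid of candidate levels is an assumption, and knockoff_select refers to the sketch in Section 3:

```python
import numpy as np

def select_with_relaxed_fdr(W, alphas=(0.05, 0.10, 0.20, 0.30)):
    """Retry the knockoff filter at progressively looser FDR levels
    until at least one feature survives."""
    for alpha in alphas:
        selected = knockoff_select(W, alpha)  # sketch from Section 3
        if selected.size > 0:
            return selected, alpha
    return np.array([], dtype=int), None  # nothing selected at any level
```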

Association measure: The method is agnostic to the association measure $D(\cdot,\cdot)$; projection correlation and normalized HSIC are both shown to work well.

High-dimensional adaptation: For $p \gg n$, data splitting and screening are essential to satisfy the model-X knockoff's requirement that $2p < n$ (see Section 3).
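One way to realize this adaptation is sketched below; the 50/50 split and marginal mutual-information screening are illustrative assumptions, not the paper's prescribed pipeline:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def split_and_screen(X, y, seed=0):
    """Split samples, rank features marginally on the screening half, and
    keep few enough features that 2p' < n holds on the remaining half."""
    n, _ = X.shape
    idx = np.random.default_rng(seed).permutation(n)
    screen, main = idx[: n // 2], idx[n // 2:]
    scores = mutual_info_regression(X[screen], y[screen])
    n_keep = max(1, len(main) // 2 - 1)  # guarantees 2 * n_keep < len(main)
    keep = np.argsort(scores)[::-1][:n_keep]
    return keep, main  # screened feature indices, samples for knockoffs
```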

Software: Reference implementation is provided at https://github.com/PeterJackNaylor/SmRMR (Naylor et al., 26 Aug 2025).

6. Empirical Performance and Applications

Empirical evaluation on synthetic processes (linear, nonlinear, discrete response) and real-world datasets (gene expression, GWAS, high-dimensional images) demonstrates:

  • FDR-controlled selection: The method attains targeted levels of FDR, with true positive rate (TPR), false positive rate (FPR), and predictive accuracy comparable to or better than existing sparse methods.
  • Parsimony: Fewer, more interpretable features are chosen.
  • Generalization: Comparable predictive performance to HSIC-LASSO with a smaller, more conservative feature set.
  • Robustness: Maintains selection validity across data modalities and sample sizes.

This suggests penalized mRMR is especially suitable in scientific settings where controlling the number of discoveries is critical and model parsimony is valued.

7. Limitations and Considerations

  • The necessity of data splitting to accommodate the model-X knockoff constraint can reduce statistical power, though recycling strategies may ameliorate this.
  • Greedy approximations may be needed when extremely high dimensionality impedes convergence of the continuous optimization.
  • The method's effectiveness depends on the association measure, penalty parameterization, and FDR threshold.

A plausible implication is that advances in knockoff generation for $p > n$ and scalable nonconvex optimization will further enhance the applicability of penalized mRMR to ultra-high-dimensional biological and sensor data.


In summary, the penalized mRMR approach recasts feature selection as a continuous, sparsity-inducing optimization problem in which feature importance is determined by jointly maximizing target relevance and minimizing inter-feature redundancy, augmented with explicit FDR control in the presence of correlation structure. The result is a robust platform for discovery-oriented variable selection in modern high-dimensional data regimes (Naylor et al., 26 Aug 2025).

References (1)