Importance-Guided Feature Reduction
- Importance-guided feature reduction is a set of algorithms that select or weight features based on quantitative importance metrics, reducing dimensionality while preserving predictive power.
- These methods span filter, wrapper, deep learning, and game theoretic frameworks, with applications from gene expression analysis to neural network pruning.
- The approach enhances model interpretability and efficiency by rigorously assessing feature contributions and addressing redundancy in high-dimensional data.
Importance-guided feature reduction comprises a family of algorithmic and theoretical approaches that select or weight features according to quantitative importance criteria in order to reduce input dimensionality while preserving maximal information, predictive performance, or interpretability. These methods underpin a wide range of practices in machine learning, from pre-processing in high-dimensional bioinformatics to interpretability-constrained dimensionality reduction and large-scale distributed learning. The technical foundation spans linear statistics, deep learning, game theory, and information theory; the unifying principle is the rigorous use of feature “importance”—as measured by univariate statistics, gradients, coalitional indices, or model perturbations—to prioritize which subset of covariates to retain or suppress.
1. Classical Importance Criteria and Filter/Wrapper Methods
Central to early importance-guided reduction is the computation and ranking of univariate or marginal importance measures. In the prototypical filter approach, each feature is assessed independently, such as via a two-sample t-statistic for class separation in binary classification:

$$t_j = \frac{\bar{x}_{j,1} - \bar{x}_{j,2}}{\sqrt{s_{j,1}^2/n_1 + s_{j,2}^2/n_2}},$$

with corresponding p-values $p_j$, and importance defined as $|t_j|$ or $-\log_{10} p_j$. Features are ranked in descending importance and a top-$k$ subset is selected. Empirical results demonstrate that for high-dimensional clinical microarray data (far more features than samples), roughly 35% of gene features can show statistically significant class differences, and optimal misclassification error rates (MCE) are achieved at a relatively small number of retained features (e.g., a test MCE of 0.02 is reported for LDA) (Singh et al., 2014).
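A minimal sketch of this filter pipeline, assuming a NumPy feature matrix and binary labels; the synthetic data and the choice of $k$ are illustrative, not taken from the cited study.

```python
import numpy as np
from scipy.stats import ttest_ind

def t_filter_rank(X, y, k=10):
    """Rank features by two-sample t-statistics and keep the top-k.

    X : (n_samples, n_features) array, y : binary labels {0, 1}.
    """
    t_stats, p_vals = ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=False)
    importance = np.abs(t_stats)             # or -np.log10(p_vals)
    order = np.argsort(importance)[::-1]     # descending importance
    return order[:k], importance, p_vals

# toy usage on synthetic "microarray-like" data with p >> n
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2000))
y = rng.integers(0, 2, size=40)
X[y == 1, :5] += 1.5                         # 5 truly informative features
top, imp, p = t_filter_rank(X, y, k=20)
print(top[:10])
```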
Wrappers layer more complex model-based evaluation atop such rankings, using sequential forward selection (SFS): iteratively add the feature which, when appended to the current subset and evaluated (e.g., via cross-validated MCE), yields the best improvement. However, in high-dimensional regimes, direct filter strategies frequently outperform wrappers due to reduced overfitting risk and lower computational cost (Singh et al., 2014).
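A hedged sketch of the wrapper step, applied to a pre-filtered candidate pool to limit cost as recommended above; the classifier (LDA), cross-validation scheme, and budget are illustrative choices.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def sfs(X, y, candidate_idx, budget=10, cv=5):
    """Greedy sequential forward selection over a pre-filtered candidate pool."""
    selected, mce_trace = [], []
    remaining = list(candidate_idx)
    for _ in range(budget):
        scores = []
        for j in remaining:
            cols = selected + [j]
            acc = cross_val_score(LinearDiscriminantAnalysis(),
                                  X[:, cols], y, cv=cv).mean()
            scores.append((acc, j))
        acc, j = max(scores)                 # feature giving best CV accuracy
        selected.append(j)
        remaining.remove(j)
        mce_trace.append(1.0 - acc)          # cross-validated MCE after each addition
    return selected, mce_trace

# e.g. applied to the filter output above: sfs(X, y, top, budget=5)
```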
Typical Filter/Wrapper Comparison Table
| Criterion | Filter | Wrapper (SFS) |
|---|---|---|
| Computation | Low (one univariate test per feature) | High (repeated model fits per candidate subset) |
| Overfitting Risk | Low | Moderate to high (in $p \gg n$ regimes) |
| Generalization | Robust in high dimensions | Model-dependent |
2. Model-based and Deep Learning–driven Importance Mechanisms
Deep learning introduces powerful data-adaptive importance scoring. In neural models, featurewise importance can be encoded as a mask $m \in [0,1]^d$ learned jointly with the model weights. For example, the Complementary Feature Mask (CFM) framework computes $m$ from a trainable network applied to the inputs, with softmax normalization to ensure interpretable scores. Critically, CFM augments the standard loss with a “complementary” branch by constructing a reversed mask that emphasizes exactly the features down-weighted by $m$, enforcing that masked-out features (those with low $m_j$) produce maximally uncertain outputs. This penalizes overconfident predictions on neglected features, stabilizing selection and reducing variance especially at low feature budgets (small $k$), as evidenced in large-scale benchmarks (MNIST, fMNIST, madelon, etc.) (Liao et al., 2022).
Any differentiable deep-feature-selection method can be equipped with such a complementary branch with minimal code changes and without alternating optimization. The method demonstrably outperforms standard neural mask approaches across multiple datasets and is architecturally agnostic (Liao et al., 2022).
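A minimal PyTorch sketch of the mechanism, not the CFM implementation: here the mask is a plain trainable parameter vector rather than the output of a mask network applied to the inputs, the reversed mask is taken as the complement of the learned mask, and the complementary branch is pushed toward uniform (maximally uncertain) predictions with a KL penalty. All class and function names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelector(nn.Module):
    """Learned feature mask with a complementary branch (CFM-style sketch)."""
    def __init__(self, d_in, d_hidden, n_classes):
        super().__init__()
        self.mask_logits = nn.Parameter(torch.zeros(d_in))   # trainable importance scores
        self.net = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, n_classes))

    def forward(self, x):
        m = torch.softmax(self.mask_logits, dim=0)     # normalized importance mask
        logits_kept = self.net(x * m)                  # main branch: kept features
        logits_comp = self.net(x * (m.max() - m))      # complementary branch: one way to reverse the mask
        return logits_kept, logits_comp, m

def cfm_style_loss(logits_kept, logits_comp, y, alpha=0.1):
    ce = F.cross_entropy(logits_kept, y)
    # push the complementary branch toward maximal uncertainty (uniform predictions)
    log_p_comp = F.log_softmax(logits_comp, dim=1)
    uniform = torch.full_like(log_p_comp, 1.0 / log_p_comp.size(1))
    comp_penalty = F.kl_div(log_p_comp, uniform, reduction="batchmean")
    return ce + alpha * comp_penalty
```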
3. Game Theoretic and Coalitional Indices
Coalitional game theory provides formal, interaction-aware metrics for feature influence. The Banzhaf power index quantifies the average marginal impact of a feature $i$ on a model’s output across all coalitions $S \subseteq N \setminus \{i\}$:

$$\beta_i = \frac{1}{2^{|N|-1}} \sum_{S \subseteq N \setminus \{i\}} \big[v(S \cup \{i\}) - v(S)\big],$$

where $v$ is the payoff obtained from a coalition of features. Features with $\beta_i = 0$ are “dummy” and can be pruned with zero loss in predictive accuracy; this gives a lossless, model-agnostic pruning guarantee (Kulynych et al., 2017). Empirical evaluation shows that Banzhaf indices align well with gradient-based and regularization-based (e.g., $\ell_1$) importance scores but capture additional cases that linear weights may miss (e.g., features that control the output nonlinearly but decisively).
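As a concrete illustration of the index, the sketch below enumerates all coalitions exactly for a small toy game; in practice the payoff function would wrap a trained model (e.g., accuracy with the features outside the coalition ablated) and sampling would replace full enumeration. The payoff used here is a made-up example.

```python
from itertools import combinations

def banzhaf_indices(n_features, value):
    """Exact Banzhaf power indices by enumerating all coalitions.

    `value(S)` maps a frozenset of feature indices to a payoff.
    Exponential in n_features; use sampling for realistic dimensions.
    """
    features = range(n_features)
    beta = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                S = frozenset(S)
                total += value(S | {i}) - value(S)   # marginal contribution of i
        beta.append(total / 2 ** (n_features - 1))
    return beta

# toy game: only features 0 and 1 matter jointly; feature 2 is a dummy
v = lambda S: float(0 in S and 1 in S)
print(banzhaf_indices(3, v))   # feature 2 gets index 0 -> prunable without loss
```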
In the unsupervised setting, coalitional Shapley-value–based importance quantifies each feature’s average contribution to the total correlation of the full feature set. Redundancy is explicitly penalized by excluding features whose marginal correlation to the current selection exceeds a threshold (SVFS), or by subtracting it off at each ranking step (SVFR). These methods achieve state-of-the-art redundancy rates compared to spectral and graph-based feature selectors, while maintaining scalable approximations (bounded-size coalitions or sampled orderings) (Balestra et al., 2022).
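A hedged sketch of the sampled-orderings approximation mentioned above: Shapley values are estimated as average marginal contributions over random feature orderings. The set function used here (off-diagonal absolute correlation mass) is only a crude stand-in for the total-correlation measure of the cited work, and all names are illustrative.

```python
import numpy as np

def sampled_shapley(X, set_value, n_perm=200, seed=0):
    """Monte-Carlo Shapley values of features under a set function `set_value`."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    phi = np.zeros(d)
    for _ in range(n_perm):
        order = rng.permutation(d)
        prev, chosen = 0.0, []
        for j in order:
            chosen.append(j)
            cur = set_value(X, chosen)
            phi[j] += cur - prev          # marginal contribution in this ordering
            prev = cur
    return phi / n_perm

def abs_corr_mass(X, idx):
    """Crude dependence proxy: sum of absolute off-diagonal correlations."""
    if len(idx) < 2:
        return 0.0
    C = np.corrcoef(X[:, idx], rowvar=False)
    return (np.abs(C).sum() - len(idx)) / 2.0

# usage: phi = sampled_shapley(X, abs_corr_mass, n_perm=100)
```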
4. Advanced Importance Decomposition and Interaction-aware Reduction
Recent advances refine the distinction between unique, redundant, and synergistic contributions to feature importance. Using the Leave-One-Covariate-Out (LOCO) increment $\Delta_i(S) = \varepsilon(S) - \varepsilon(S \cup \{i\})$, where $\varepsilon(\cdot)$ is the mean squared prediction error of a model trained on the given covariate set, high-order interaction effects are decomposed as:
- Unique: the LOCO increment minimized over all conditioning sets $S$, i.e., the importance that no other feature combination can substitute,
- Redundant: the part of the marginal LOCO increment that other features can also supply,
- Synergistic: the additional increment that emerges only when the feature is evaluated jointly with a suitable partner set,
with the synergy-maximizing and redundancy-minimizing conditioning sets discovered by greedy, permutation-test–controlled search (Ontivero-Ortega et al., 2024). Features with high redundancy are routinely dropped, features with high unique contribution are always kept, and strongly synergistic features are considered for grouping or joint preservation, furnishing a principled, interaction-aware reduction pipeline.
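A hedged sketch of the basic ingredients only (the LOCO increment of a feature given a conditioning set, and a greedy search for a partner set that maximizes it); the full decomposition and permutation-test control of the cited work are more involved, and the model, scorer, and set sizes here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def loco_increment(X, y, i, S, model=None, cv=5):
    """Delta_i(S): drop in cross-validated MSE when feature i joins conditioning set S."""
    model = model or LinearRegression()
    def mse(cols):
        if not cols:
            return np.mean((y - y.mean()) ** 2)        # intercept-only baseline
        return -cross_val_score(model, X[:, cols], y, cv=cv,
                                scoring="neg_mean_squared_error").mean()
    S = list(S)
    return mse(S) - mse(S + [i])

def greedy_synergy_set(X, y, i, max_size=3):
    """Greedily grow a partner set that maximizes Delta_i(S)."""
    S = []
    rest = [j for j in range(X.shape[1]) if j != i]
    best = loco_increment(X, y, i, S)
    for _ in range(max_size):
        gains = [(loco_increment(X, y, i, S + [j]), j) for j in rest]
        val, j = max(gains)
        if val <= best:
            break
        best, S = val, S + [j]
        rest.remove(j)
    return S, best
```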
5. Nonparametric, Model-agnostic Impact Assessment
Data-driven, model-agnostic methods such as stratified partial dependence provide robust, nonparametric feature impact scores—unlike model-weight–oriented importances that may be unstable across models. The “cmr” (cumulative mean response) impact is estimated by integrating the nonparametric partial dependence function of each feature, normalized so all importances sum to 1. The procedure is competitive with permutation and SHAP importance on standard regression benchmarks (Parr et al., 2020).
Unlike classical filter methods, nonparametric impact is not dependent on model class and is highly robust to feature codependence, but it is currently limited to regression. Categorical features are handled by mean-centering their partial dependence and applying the same aggregation scheme.
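The stratified estimator of the cited work is not reproduced here; the sketch below illustrates the general recipe with a plain model-based partial dependence: estimate each feature's partial dependence curve on a grid, summarize how far the curve deviates from its mean (a simplified variant of integrating the curve), and normalize the impacts to sum to 1. Function names and grid choices are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def pd_impact(model, X, grid_size=20):
    """Per-feature impact: mean absolute deviation of its partial dependence
    curve from the curve's mean, normalized so impacts sum to 1."""
    n, d = X.shape
    impacts = np.zeros(d)
    for j in range(d):
        grid = np.quantile(X[:, j], np.linspace(0.05, 0.95, grid_size))
        pd_curve = []
        for v in grid:
            Xv = X.copy()
            Xv[:, j] = v                       # marginalize other features empirically
            pd_curve.append(model.predict(Xv).mean())
        pd_curve = np.asarray(pd_curve)
        impacts[j] = np.abs(pd_curve - pd_curve.mean()).mean()
    return impacts / impacts.sum()

# usage: fit any regressor, then score features
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] + np.sin(2 * X[:, 1]) + 0.1 * rng.normal(size=300)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(pd_impact(model, X).round(3))
```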
6. Large-scale, Redundancy-aware Distributed Feature Reduction
Scalability is addressed by algorithms such as BELIEF, which aggregates instance-class pairwise distances to form global feature weights and incorporates collision-based redundancy metrics (mCR) at linear cost in the number of selected features (Ramírez-Gallego et al., 2018). By using MapReduce/Spark-style broadcast, map, and reduce primitives, BELIEF scales to tens of millions of features, achieving comparable or improved classification accuracy relative to distributed mRMR and DiReliefF with 5–10× lower runtime. The ordinal per-feature importance ranking is augmented at each round with redundancy penalties, using sequential forward selection to maximize relevance minus redundancy over the selected set $S$.
This approach enables redundancy-robust feature selection on massive datasets, with empirical evidence of 60–80% lower redundancy rates in the selected subset (Ramírez-Gallego et al., 2018).
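BELIEF itself is a distributed Spark algorithm; the following single-machine sketch only illustrates the underlying recipe it combines: RELIEF-style relevance weights plus a redundancy penalty applied during greedy forward selection. The correlation-based redundancy term stands in for the collision-based mCR metric, and all names are illustrative.

```python
import numpy as np

def relief_weights(X, y, n_neighbors=10):
    """RELIEF-style relevance: features that separate classes across near
    neighbors get larger weights (simplified, brute-force version)."""
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf
        nn = np.argsort(dist)[:n_neighbors]
        same = y[nn] == y[i]
        diff = np.abs(X[nn] - X[i])
        w += diff[~same].sum(axis=0) - diff[same].sum(axis=0)
    return w / n

def redundancy_aware_select(X, y, k=10, lam=0.5):
    """Greedy forward selection maximizing relevance minus correlation redundancy."""
    rel = relief_weights(X, y)
    C = np.abs(np.corrcoef(X, rowvar=False))
    selected, rest = [], list(range(X.shape[1]))
    for _ in range(k):
        def score(j):
            red = C[j, selected].mean() if selected else 0.0
            return rel[j] - lam * red
        j = max(rest, key=score)
        selected.append(j)
        rest.remove(j)
    return selected
```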
7. Specialized and Hybrid Approaches in Deep and Structured Models
Structured data or deep architectures require customized reduction. In convolutional networks, channel pruning guided by classification loss and feature survival likelihood (CPLI) optimizes a joint loss that penalizes reconstruction error weighted by the feature’s final-loss gradient and suppresses channels likely to be pruned in subsequent layers. This bi-level LASSO and least-squares alternating optimization yields state-of-the-art parameter and computation reductions with minimal accuracy loss, and in some cases accuracy gains, on CIFAR-10, ImageNet, and UCF-101 (Guo et al., 2020).
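The full CPLI procedure is not reproduced here; as a much simpler illustration of the ingredient it builds on (weighting channels by their effect on the final classification loss), the hedged PyTorch sketch below scores the output channels of a convolutional layer by the mean magnitude of activation times loss gradient and marks the lowest-scoring channels as pruning candidates. All module names and shapes are illustrative.

```python
import torch
import torch.nn as nn

def channel_scores(model, layer, X, y, loss_fn=nn.CrossEntropyLoss()):
    """Score output channels of `layer` by mean |activation * d(loss)/d(activation)|."""
    acts = {}

    def hook(_, __, out):
        out.retain_grad()          # keep the gradient of this intermediate tensor
        acts["out"] = out

    h = layer.register_forward_hook(hook)
    loss = loss_fn(model(X), y)
    loss.backward()
    h.remove()
    a = acts["out"]                                    # shape (N, C, H, W)
    return (a * a.grad).abs().mean(dim=(0, 2, 3))      # one score per channel

# usage on a toy conv net (shapes are illustrative)
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
X = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
scores = channel_scores(model, model[0], X, y)
prune_candidates = torch.argsort(scores)[:4]           # 4 least important channels
print(prune_candidates)
```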
In hybrid collaborative-content recommenders, hybrid matrix factorization enriches feature vectors with collaborative behavior before extracting a low-rank SVD embedding; features are then selected by maximizing the rectangular volume of their k-dimensional embeddings, yielding both efficient compression and superior cold-start recommendation quality (Sukhorukov et al., 8 Aug 2025).
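A hedged sketch of the volume-maximizing selection step, assuming the (possibly collaboratively enriched) feature matrix is already available: features are embedded via a truncated SVD, and a column-pivoted QR factorization serves as a standard greedy heuristic for picking embedding rows of near-maximal volume. The hybrid matrix-factorization enrichment from the cited work is not reproduced here, and all names are illustrative.

```python
import numpy as np
from scipy.linalg import qr

def maxvol_feature_select(F, k):
    """Pick k features whose k-dimensional SVD embeddings span a large volume.

    F : (n_features, n_attributes) feature matrix. Returns selected feature indices.
    """
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    E = U[:, :k] * s[:k]                 # k-dimensional feature embeddings
    # column-pivoted QR on E^T greedily picks rows of E with large volume
    _, _, piv = qr(E.T, pivoting=True)
    return piv[:k]

# toy usage
rng = np.random.default_rng(2)
F = rng.normal(size=(100, 30))
print(maxvol_feature_select(F, k=5))
```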
In deep feature selection, four spline-based Kolmogorov–Arnold Network (KAN) importance measures (KAN-L1, KAN-L2, KAN-SI, KAN-KO) provide direct access to nonlinear, functional feature relevance and deliver competitive or superior performance to classical selectors, especially in high-dimensional, noisy, or heterogeneous data (Akazan et al., 27 Sep 2025).
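The cited KAN measures operate on trained spline-based networks; as a self-contained stand-in, the sketch below fits a minimal additive model in which each feature acts through its own learned radial-basis expansion (a rough analogue of a KAN edge) and reads off a KAN-L1-style importance as the L1 mass of each feature's coefficients. The architecture and basis choice are assumptions for illustration, not the cited implementation.

```python
import torch
import torch.nn as nn

class AdditiveRBFNet(nn.Module):
    """y = sum_j phi_j(x_j), with each phi_j a learned combination of RBF basis functions."""
    def __init__(self, d_in, n_basis=8):
        super().__init__()
        self.centers = torch.linspace(-3, 3, n_basis)
        self.coef = nn.Parameter(0.01 * torch.randn(d_in, n_basis))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):                                            # x: (batch, d_in)
        basis = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)   # (batch, d, n_basis)
        return (basis * self.coef).sum(dim=(1, 2)) + self.bias

    def importance(self):
        return self.coef.abs().sum(dim=1)     # L1 mass of each feature's learned function

# fit on a toy regression problem and read off importances
torch.manual_seed(0)
X = torch.randn(512, 6)
y = torch.sin(2 * X[:, 0]) + 0.5 * X[:, 1] ** 2
model = AdditiveRBFNet(6)
opt = torch.optim.Adam(model.parameters(), lr=0.05)
for _ in range(300):
    opt.zero_grad()
    loss = ((model(X) - y) ** 2).mean()
    loss.backward()
    opt.step()
print(model.importance().detach().numpy().round(2))   # first two features dominate
```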
8. Practical Guidelines and Empirical Best Practices
Practical workflows universally begin with a fast marginal filter or mask (univariate statistics, “cmr,” or a neural mask), plot downstream error versus feature count to find inflection points, and, when beneficial, perform limited wrapper or redundancy-aware selection atop an already reduced pool (Singh et al., 2014, Ramírez-Gallego et al., 2018); a sketch of this budget-sweeping workflow follows the list below. Empirically, the following principles recur:
- Keep the selected feature count well below the smallest class sample size in discriminant/classifier models to ensure accurate covariance estimation,
- In large-scale settings, prefer redundancy-aware distributed methods for stability,
- Validate reduced feature sets by cross-validation or hold-out, as well as rerunning importance scoring to avoid overfitting,
- When interpretability or visualization is required, meta-methods such as DimenFix can fix or softly constrain axes in a nonlinear low-dimensional embedding to reflect the most important raw feature, with negligible run-time overhead and often improved downstream task performance (Luo et al., 2022).
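A minimal sketch of the budget-sweeping workflow referenced above: rank features with a fast univariate filter, then cross-validate downstream error at several feature budgets and look for the point where the curve flattens. The filter, classifier, and budgets are illustrative choices.

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def error_vs_budget(X, y, budgets, cv=5):
    """Cross-validated misclassification error as a function of feature budget k."""
    scores, _ = f_classif(X, y)                    # fast univariate filter
    order = np.argsort(scores)[::-1]
    errors = []
    for k in budgets:
        cols = order[:k]
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              X[:, cols], y, cv=cv).mean()
        errors.append(1.0 - acc)
    return errors

# sweep k and inspect where the error curve flattens
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 500))
y = (X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=200) > 0).astype(int)
budgets = [1, 2, 5, 10, 20, 50, 100]
print(list(zip(budgets, np.round(error_vs_budget(X, y, budgets), 3))))
```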
Continued innovation in importance-guided reduction is marked by closer coupling of global interaction awareness with computational scalability and principled handling of redundancy, nonlinearity, and high-order effects.