
Loss-guided Feature Selection

Updated 13 February 2026
  • Loss-guided feature selection is a method that directly minimizes task loss to choose relevant features, ensuring model accuracy and sparsity.
  • It employs differentiable gate-based, wrapper, and game-theoretic techniques to quantify feature contributions and control redundancy.
  • Empirical studies and theoretical analyses show enhanced support recovery, test accuracy, and interpretability across high-dimensional and structured data.

Loss-guided feature selection refers to a broad family of algorithms and theoretical frameworks in which the selection of relevant features from high-dimensional data is governed directly by the minimization (or, in unsupervised cases, maximization) of a task-specific loss function. These methods differ substantially from heuristic or prefiltering criteria by coupling the feature selection process to risk minimization or utility maximization under a model class, thus allowing the interaction between features, redundancy, and model fit to be quantitatively resolved within the training dynamics. The category encompasses a spectrum of methodologies, from differentiable $\ell_0$-penalized objectives (using stochastic gates or continuous relaxations), through wrapper-style and stability-driven algorithms utilizing loss evaluations on subsets, to game-theoretic or Shapley value-based allocation of loss reductions among features. Both supervised and unsupervised versions are present in the literature, with empirical evidence demonstrating advantages in support recovery, test accuracy, and interpretability across diverse scenarios.

1. Formalization: Loss-guided Objectives and Relaxations

Canonical loss-guided feature selection frameworks posit an explicit dependence between the feature subset indicator $z \in \{0,1\}^D$ (or a continuous relaxation thereof) and the empirical risk:

$$\min_{\theta, z}\; \frac{1}{N}\sum_{n=1}^N L(f_\theta(x_n \odot z), y_n) + \lambda \|z\|_0.$$

Here, the loss $L(\cdot, \cdot)$ expresses task-specific error (e.g., squared loss, logistic loss), while the $\ell_0$ penalty enforces sparsity on the selection vector $z$ (Yamada et al., 2018). Exact combinatorial optimization is intractable for moderate $D$, motivating continuous relaxations such as Bernoulli-parameterized gates $\pi \in [0,1]^D$ or differentiable surrogates for $z$ (e.g., clipped Gaussian, Hard-Concrete) (Yamada et al., 2018), which permit gradient-based optimization of both parameters and selection probabilities in a unified objective.
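As a concrete illustration of the Bernoulli relaxation (stated generically, not as the construction of any single cited paper): for independent gates $z_d \sim \mathrm{Bern}(\pi_d)$, the expected support size is $\mathbb{E}_z[\|z\|_0] = \sum_{d=1}^D \pi_d$, so the combinatorial penalty can be replaced by a differentiable surrogate,

$$\min_{\theta, \pi}\; \mathbb{E}_{z \sim \prod_d \mathrm{Bern}(\pi_d)} \Big[ \frac{1}{N} \sum_{n=1}^N L(f_\theta(x_n \odot z), y_n) \Big] + \lambda \sum_{d=1}^D \pi_d.$$

The gate-based methods of Section 2 make the remaining expectation amenable to gradient descent by substituting reparameterizable continuous surrogates for the Bernoulli gates.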

Unsupervised variants employ analogous relaxations, but with a loss function reflecting data geometry, e.g., gated Laplacian score:

$$\mathcal{L}(\mu;\lambda) = -\frac{1}{m}\, \mathrm{Tr}\big[\widetilde{X}^T L_{\widetilde{X}} \widetilde{X}\big] + \lambda \sum_{i=1}^d \mathbb{P}(Z_i > 0)$$

with $\widetilde{X} = X \odot Z$ and $L_{\widetilde{X}}$ the Laplacian built from the selected features (Lindenbaum et al., 2020).
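A minimal PyTorch sketch of this gated-Laplacian objective is given below; the RBF affinity, bandwidth, and unnormalized Laplacian are simplifying assumptions for illustration, not the exact graph construction of DUFS.

```python
# Minimal sketch of the gated Laplacian objective above, with stochastic
# Gaussian gates and an RBF affinity built on the gated data.
import torch

def gated_laplacian_loss(X, mu, sigma=0.5, lam=1.0, bandwidth=1.0):
    # X: (m, d) data matrix; mu: (d,) gate means
    z = torch.clamp(mu + sigma * torch.randn_like(mu), 0.0, 1.0)   # stochastic gates
    Xg = X * z                                                     # gated features X~
    dists = torch.cdist(Xg, Xg) ** 2
    W = torch.exp(-dists / (2 * bandwidth ** 2))                   # RBF affinity on gated data
    L = torch.diag(W.sum(dim=1)) - W                               # unnormalized graph Laplacian
    m = X.shape[0]
    trace_term = -torch.trace(Xg.T @ L @ Xg) / m                   # -(1/m) Tr[X~^T L X~]
    std_normal = torch.distributions.Normal(0.0, 1.0)
    sparsity = std_normal.cdf(mu / sigma).sum()                    # sum_i P(Z_i > 0)
    return trace_term + lam * sparsity

# gate means can then be optimized by gradient descent on this loss
X, mu = torch.randn(64, 20), torch.full((20,), 0.5, requires_grad=True)
gated_laplacian_loss(X, mu).backward()
```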

2. Differentiable Gate-based Methods

Stochastic gate-based feature selection, originally introduced in (Yamada et al., 2018), parameterizes per-feature selection via $\mu_d$ and fixed-variance Gaussian noise $\epsilon_d$, yielding gates $z_d = \mathrm{clip}_{[0,1]}(\mu_d + \epsilon_d)$. The expected sparsity is enforced by penalizing $\sum_d \Phi(\mu_d/\sigma)$, where $\Phi(\cdot)$ is the standard Gaussian CDF. This enables end-to-end minimization of

$$\mathbb{E}_{z \sim \mathrm{STG}(\mu, \sigma)} \big[ \mathcal{L}(\theta; X \odot z, y) \big] + \lambda \sum_{d=1}^D \Phi(\mu_d/\sigma)$$

by backpropagation with Monte Carlo samples and the reparameterization trick. Compared to Hard-Concrete relaxations, STG's Gaussian noise yields lower gradient variance and more stable feature support identification (Yamada et al., 2018). This strategy achieves exact support recovery in synthetic $D \gg N$ settings, outperforms LASSO in nonlinear tasks, and generalizes across various domains (bioinformatics, NLP, survival analysis).
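A minimal PyTorch sketch of this gating mechanism follows; the noise scale, penalty weight, initialization, and downstream network are illustrative choices, not the reference implementation.

```python
# STG-style stochastic gates: z_d = clip_[0,1](mu_d + eps_d), with the
# expected-sparsity regularizer sum_d Phi(mu_d / sigma) added to the task loss.
import torch
import torch.nn as nn

class StochasticGates(nn.Module):
    def __init__(self, num_features: int, sigma: float = 0.5, lam: float = 0.1):
        super().__init__()
        self.mu = nn.Parameter(0.5 * torch.ones(num_features))  # gate means mu_d
        self.sigma, self.lam = sigma, lam

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # reparameterized Gaussian noise; deterministic gates at evaluation time
        eps = self.sigma * torch.randn_like(self.mu) if self.training else 0.0
        z = torch.clamp(self.mu + eps, 0.0, 1.0)
        return x * z

    def regularizer(self) -> torch.Tensor:
        # expected number of open gates: sum_d Phi(mu_d / sigma)
        std_normal = torch.distributions.Normal(0.0, 1.0)
        return self.lam * std_normal.cdf(self.mu / self.sigma).sum()

# usage: add the gate regularizer to the task loss before backpropagation
gates = StochasticGates(num_features=100)
net = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(8, 100), torch.randn(8, 1)
loss = nn.functional.mse_loss(net(gates(x)), y) + gates.regularizer()
loss.backward()
```

After training, features whose gate means remain clearly positive (open gates) are retained as the selected support.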

Similar gating and relaxation principles are extended to unsupervised settings in Differentiable Unsupervised Feature Selection (DUFS), which recomputes affinity graphs exclusively on gated inputs to robustly identify discriminative variables in high-noise data (Lindenbaum et al., 2020).

3. Loss-guided Subset Evaluation: Wrapper, Greedy, and Stability Variants

A large class of loss-guided methods evaluates candidate feature sets directly by their impact on empirical loss (possibly after model retraining):

  • Greedy wrapper methods: Neural Greedy Pursuit (NGP) sequentially constructs a feature set by greedily including at each round the feature whose addition yields the minimum validation loss for a retrained neural network predictor. This approach, agnostic to architecture, empirically achieves lower false positive rates than top-down eliminators (e.g., drop-one-out), and displays a sharp phase transition in sample complexity at $O(N \log(P/N))$ for perfect support recovery (Das et al., 2022); a minimal sketch of this greedy wrapper pattern appears after this list.
  • Stability Selection: Loss-guided Stability Selection (LGS) applies repeated subsampling and model refitting, but crucially selects among candidate stable models (sets with high selection frequencies) by minimizing out-of-sample validation loss, rather than purely selection frequency or error bounds (Werner, 2022). This variant avoids the empty-model pathology of classical Stability Selection under high noise, trades strict control for predictive performance, and can incorporate an exhaustive search over small meta-stable sets for final selection.
  • Deep generative subset evaluators: The VTFS model frames feature subset selection as a sequence generation problem, embedding decision sequences in a variational latent space and learning a utility-evaluator MLP that predicts downstream ML accuracy. Gradient ascent in the embedding space is used to optimize predicted utility, after which the decoder autoregressively generates the candidate feature set (Ying et al., 2024). This paradigm enables end-to-end, differentiable, loss-guided search over the combinatorial subset space, outperforming both classical and RL-based wrapper baselines.
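The following sketch illustrates the greedy, loss-guided wrapper pattern referenced in the first item above; the ridge regressor and synthetic data are illustrative stand-ins for the retrained neural predictor used by NGP.

```python
# Greedy forward selection guided by held-out loss: at each round, add the
# feature whose inclusion most reduces validation error of a retrained model.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def greedy_loss_guided_selection(X_tr, y_tr, X_val, y_val, k):
    selected, remaining = [], list(range(X_tr.shape[1]))
    for _ in range(k):
        best_feat, best_loss = None, np.inf
        for f in remaining:
            cols = selected + [f]
            model = Ridge().fit(X_tr[:, cols], y_tr)      # retrain on candidate subset
            loss = mean_squared_error(y_val, model.predict(X_val[:, cols]))
            if loss < best_loss:
                best_feat, best_loss = f, loss
        selected.append(best_feat)
        remaining.remove(best_feat)
    return selected

# usage on synthetic data where only the first three features carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=200)
print(greedy_loss_guided_selection(X[:150], y[:150], X[150:], y[150:], k=3))
```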

4. Game-Theoretic and Shapley Value Approaches

Loss-based Shapley-value feature selection defines the value of a feature in terms of its marginal improvement (or reduction in risk) when added to subsets, with payoffs derived from classification loss or training error:

  • SVEA (Shapley Value Error Apportioning): Features are scored by the amount of empirical hinge loss each absorbs in the cooperative game, yielding a universal selection rule: select all features with negative apportioned error (i.e., those decreasing total loss when included) (Tripathi et al., 2020). Monte Carlo approximation and statistical interval estimation are employed for computational efficiency and robustness; a generic Monte Carlo sketch of loss-based Shapley scoring follows this list.
  • Loss-based SHAP selections: In LLpowershap, Shapley values are computed not with respect to model predictions but with respect to logistic loss reduction, and evaluated on held-out data to avoid overfitting (Madakkatel et al., 2024). Statistically significant features are distinguished from noise using Mann–Whitney U-testing against multiple initialized noise features, prioritizing features that robustly reduce test loss across bootstrap resamplings. This approach demonstrates superior noise suppression and consistent signal recovery in both simulations and large-scale biobank data.
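Below is a generic Monte Carlo sketch of loss-based Shapley scoring in the spirit of these methods; the logistic-regression value function, the class-prior baseline for the empty coalition, and the permutation budget are illustrative assumptions, not the exact SVEA or LLpowershap procedures.

```python
# A feature's value is its average marginal reduction in held-out logistic
# loss over random feature orderings (Monte Carlo estimate of Shapley values
# in a loss-based cooperative game).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def coalition_loss(cols, X_tr, y_tr, X_te, y_te):
    if not cols:  # empty coalition: predict the training-set class prior
        p = np.full(len(y_te), y_tr.mean())
        return log_loss(y_te, p, labels=[0, 1])
    model = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    return log_loss(y_te, model.predict_proba(X_te[:, cols])[:, 1], labels=[0, 1])

def shapley_loss_reduction(X_tr, y_tr, X_te, y_te, n_perm=20, seed=0):
    rng = np.random.default_rng(seed)
    d = X_tr.shape[1]
    phi = np.zeros(d)
    for _ in range(n_perm):
        order, cols = rng.permutation(d), []
        prev = coalition_loss(cols, X_tr, y_tr, X_te, y_te)
        for f in order:
            cols = cols + [int(f)]
            cur = coalition_loss(cols, X_tr, y_tr, X_te, y_te)
            phi[f] += prev - cur      # marginal held-out loss reduction of feature f
            prev = cur
    return phi / n_perm               # positive phi: including the feature lowers loss
```

Under this sign convention, keeping only features with positive estimated value mirrors SVEA's rule of selecting features with negative apportioned error.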

These game-theoretic formulations provide principled justifications for selection thresholds and offer interpretability guarantees grounded in cooperative allocation of error reduction.

5. Loss-guided Feature Selection in Structured Settings

Loss-guided selection frameworks generalize to more complex settings, including:

  • Multivariate performance optimization: The method of (Mao et al., 2011) incorporates multivariate (non-decomposable) loss functions such as F1-score, Precision@k, and PRBEP directly into the feature selection optimization. A budgeted sparsity vector $d$ and a generalized sparse regularizer are employed, and a two-layer cutting-plane strategy solves the resulting non-convex minimax problem. The framework supports both conventional and multiple-instance settings, achieving higher task-specific scores (e.g., F1) than $\ell_1$-SVM and SVM-RFE using orders of magnitude fewer features.
  • Graph- and redundancy-aware selection: In the InFusedLasso approach, sample-wise feature graphs are constructed, and a joint relevancy minus redundancy score (via multi-graph Jensen–Shannon similarity) is maximized together with squared error under a fused-lasso penalty (Bai et al., 2019). This enables selection of structurally coherent and non-redundant feature sets even in the presence of complex sample correlations.
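For reference, the classical fused-lasso regularizer that the second approach builds on (written here in its generic elementwise form, not the exact multi-graph penalty of InFusedLasso) couples sparsity with smoothness across an ordering of the coefficients:

$$\Omega(\beta) = \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=2}^{p} |\beta_j - \beta_{j-1}|.$$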

6. Gradient Memory and Entropy-weighted Selection in Deep Models

Methods employing internal gradients and entropy as loss-driven indicators have been developed for domain-specific modalities:

  • Gradient Memory Bank (IEFS-GMB): For time-frequency and spatial features in EEG analysis, IEFS-GMB constructs a rolling memory of historical gradients, averages them with momentum and decay, and infers feature-channel importance via entropy-based weighting. The selection coefficient for each local feature is derived from the (softmax-)entropy of GradCAM-style activations, penalizing ambiguous or noisy contributions (Zhang et al., 18 Sep 2025). Empirical improvements in both predictive accuracy and interpretability are observed across multiple deep encoders and clinical datasets.
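A heavily simplified sketch of the two ingredients described above follows; the decay rate, normalization, and combination rule are assumptions chosen for illustration, not the IEFS-GMB specification.

```python
# (i) a momentum/decay-averaged memory of per-channel gradient magnitudes, and
# (ii) an entropy-based weight that penalizes channels with diffuse activations.
import torch

def update_gradient_memory(memory: torch.Tensor, grad: torch.Tensor,
                           decay: float = 0.9) -> torch.Tensor:
    # memory, grad: (channels,) mean absolute gradient per feature channel
    return decay * memory + (1.0 - decay) * grad.abs()

def entropy_weights(activations: torch.Tensor) -> torch.Tensor:
    # activations: (channels, locations) per-channel activation maps
    p = torch.softmax(activations, dim=1)                              # distribution over locations
    ent = -(p * torch.log(p + 1e-12)).sum(dim=1)                       # high entropy = ambiguous channel
    ent = ent / torch.log(torch.tensor(float(activations.shape[1])))   # normalize to [0, 1]
    return 1.0 - ent                                                   # down-weight diffuse channels

def channel_importance(grad_memory: torch.Tensor, activations: torch.Tensor) -> torch.Tensor:
    # combine remembered gradient magnitude with the entropy-based weight
    return grad_memory * entropy_weights(activations)
```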

7. Theoretical Guarantees, Empirical Evidence, and Limitations

Across frameworks, loss-guided feature selection yields theoretical guarantees under mild conditions. Differentiable gate-based schemes provide support recovery and sample complexity bounds; greedy and wrapper-based loss-guided approaches show phase transition behavior; game-theoretic methods justify selection rules via cooperative game properties and interpretability. Empirically, such methods consistently outperform purely filter-based selectors and standard regularized models in signal recovery, sparsity, test accuracy, and robustness to noise. Limitations include computational cost due to repetitive model retraining (especially in combinatorial or wrapper-style approaches), the need for optimization hyperparameter tuning, and in some settings, difficulty scaling to ultra-high-dimensional domains without additional approximations or relaxations.

The trend towards explicitly incorporating task loss into all stages of subset evaluation continues to drive advances in scalable, robust, and interpretable feature selection for high-dimensional and structured data analysis.


References: (Yamada et al., 2018, Lindenbaum et al., 2020, Mao et al., 2011, Madakkatel et al., 2024, Tripathi et al., 2020, Ying et al., 2024, Das et al., 2022, Zhang et al., 18 Sep 2025, Bai et al., 2019, Werner, 2022, Genzel et al., 2016).
