
Gradient Boosted Feature Selection (GBFS)

Updated 10 February 2026
  • GBFS is a nonlinear, scalable method that embeds feature selection within gradient-boosted tree ensembles by penalizing the introduction of new features.
  • It modifies the tree-building impurity criterion with a fixed penalty for new features, encouraging reuse and enabling the capture of non-linear interactions.
  • GBFS demonstrates competitive performance on high-dimensional data by achieving sparse, interpretable models with improved computational efficiency over traditional methods.

Gradient Boosted Feature Selection (GBFS) is a nonlinear, scalable framework for embedded feature selection in gradient-boosted tree ensembles. By regularizing the introduction of new features during boosting, GBFS isolates a compact, interpretable, and high-performing subset of features while naturally capturing non-linear interactions and supporting structured sparsity regimes. It directly modifies the tree-building impurity criterion to penalize splits on previously unseen features, yielding models that are both statistically sparse and practically efficient, with performance competitive with or surpassing state-of-the-art filter and wrapper approaches on high-dimensional, real-world data (Xu et al., 2019).

1. Mathematical and Algorithmic Principles

Let $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$ denote the training data. The model is an additive ensemble
$$H(x) = \sum_{t=1}^{T} \beta_t h_t(x),$$
where each $h_t$ is a regression tree from a hypothesis class $\mathcal{H}$, $\beta_t \geq 0$, and $T$ is the number of boosting rounds. GBFS introduces a binary feature-usage matrix $F \in \{0,1\}^{d \times T}$, with $F_{ft}$ indicating whether feature $f$ is used in tree $h_t$.

To induce sparsity, GBFS uses a capped-$\ell_1$ surrogate for the 0/1 "feature used" indicator,

$$q_\varepsilon(a) = \frac{\min(|a|, \varepsilon)}{\varepsilon} \;\approx\; \mathbb{1}[a > 0],$$

and solves the non-convex optimization problem
$$\min_{\beta \geq 0} \; \sum_{i=1}^{n} \ell\big(H(x_i), y_i\big) \;+\; \lambda \sum_{f=1}^{d} q_\varepsilon\!\Big(\sum_{t=1}^{T} F_{ft}\,\beta_t\Big),$$
where $\ell$ is a loss function (e.g., logistic or squared loss), each $\beta_t$ is a nonnegative tree weight (often set to the boosting step size $\varepsilon$), and $\lambda \geq 0$ controls the per-new-feature penalty. This construction ensures that each new feature incurs a fixed cost, while repeated splits on already-introduced features are unpenalized.
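As a concrete illustration, the regularizer can be computed directly from the usage matrix and tree weights. The sketch below assumes the capped-$\ell_1$ form $q_\varepsilon(a) = \min(|a|, \varepsilon)/\varepsilon$; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def capped_l1_penalty(F, beta, lam, eps):
    """Capped-ell_1 surrogate for the number of distinct features used.

    F    : (d, T) 0/1 matrix, F[f, t] = 1 if feature f is split on in tree t.
    beta : (T,) nonnegative tree weights.
    lam  : per-new-feature penalty strength.
    eps  : cap; q(a) = min(a, eps) / eps approximates the 0/1 usage indicator.
    """
    usage = F @ beta                      # total weight carried by each feature
    q = np.minimum(usage, eps) / eps      # ~1 if the feature is used, 0 otherwise
    return lam * q.sum()

# Three features, two trees: feature 0 is used in both trees, feature 1 only in
# tree 1, feature 2 never -> the effective distinct-feature count is 2.
F = np.array([[1, 1],
              [0, 1],
              [0, 0]], dtype=float)
beta = np.array([0.1, 0.1])
print(capped_l1_penalty(F, beta, lam=1.0, eps=0.1))  # -> 2.0
```

Note that feature 0, although used twice, is penalized only once: repeated use does not increase the surrogate once the cap is reached.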

Model optimization is performed via a variant of coordinate descent in function space, i.e., a gradient boosting procedure. At each boosting iteration:

  • Negative gradients $g_i = -\partial \ell / \partial H(x_i)$ are computed.
  • Unused features are tracked via the usage matrix $F$.
  • The current tree is trained to minimize

$$h_t = \arg\min_{h \in \mathcal{H}} \; \sum_{i=1}^{n} \big(g_i - h(x_i)\big)^2 \;+\; \lambda \cdot \#\{\text{features used in } h \text{ but in no earlier tree}\},$$

effectively adding $\lambda$ to the split impurity score whenever a new feature is introduced.

After $T$ rounds, the model $H(x)$ is sparse in the feature space, minimizing loss subject to penalized feature complexity (Xu et al., 2019).
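This boosting loop can be sketched with depth-1 regression stumps and squared loss. The code below is a toy illustration under those simplifying assumptions, not the paper's implementation; `fit_penalized_stump`, `gbfs_fit`, and the fixed step size are all illustrative choices:

```python
import numpy as np

def fit_penalized_stump(X, r, used, lam):
    """Best depth-1 split on residuals r; unseen features pay an extra cost lam."""
    best = (np.inf, None, None, 0.0, 0.0)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f])[:-1]:      # candidate split thresholds
            left = X[:, f] <= thr
            lm, rm = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lm) ** 2).sum() + ((r[~left] - rm) ** 2).sum()
            cost = sse + (0.0 if f in used else lam)   # new-feature penalty
            if cost < best[0]:
                best = (cost, f, thr, lm, rm)
    return best

def gbfs_fit(X, y, T=20, eta=0.5, lam=1.0):
    """Toy GBFS: gradient boosting (squared loss) with a new-feature penalty."""
    pred, used, stumps = np.zeros(len(y)), set(), []
    for _ in range(T):
        r = y - pred                             # negative gradient of squared loss
        _, f, thr, lm, rm = fit_penalized_stump(X, r, used, lam)
        if f is None:
            break
        used.add(f)
        stumps.append((f, thr, lm, rm))
        pred += eta * np.where(X[:, f] <= thr, lm, rm)
    return stumps, used, pred

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * (X[:, 3] > 0) - 1.0                    # only feature 3 is informative
stumps, used, pred = gbfs_fit(X, y)
print(sorted(used))                              # -> [3]
```

Because the penalty applies only on first use, the booster keeps splitting on feature 3 across all rounds instead of wandering onto irrelevant columns.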

2. Sparsity Mechanisms and Structured Priors

GBFS induces sparsity by imposing a cost $\lambda$ only when a feature is introduced for the first time; subsequent uses are "free." This design encourages each tree to reuse features whenever possible, while weak or redundant features are filtered out unless they offer substantial objective reduction. Beyond simple sparsity, GBFS accommodates arbitrary structured sparsity patterns:

  • Feature-group structures ("bags") can be encoded by letting the per-feature penalty $\lambda_f$ vanish for all features in a group once any member is selected.
  • Side-information functions allow the formulation of custom cost functions for incorporating domain knowledge.

The resulting model is capable of both group sparsity (group Lasso-like behavior) and standard sparsity, without sacrificing nonlinear expressive power.
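A minimal sketch of such a group-aware cost, assuming a simple mapping from features to named groups (all names here are illustrative):

```python
def group_cost(f, used, groups, lam):
    """Structured new-feature cost: a group is 'paid for' once, so the penalty
    vanishes for every member after any one of them has been selected."""
    group_opened = any(groups[u] == groups[f] for u in used)
    return 0.0 if (f in used or group_opened) else lam

groups = {0: "A", 1: "A", 2: "B"}     # features 0 and 1 form one bag
assert group_cost(1, used=set(), groups=groups, lam=2.0) == 2.0   # first member: pay
assert group_cost(0, used={1}, groups=groups, lam=2.0) == 0.0     # group already open
assert group_cost(2, used={1}, groups=groups, lam=2.0) == 2.0     # different group
```

Dropping this in place of the flat per-feature penalty inside the impurity scoring yields group-Lasso-like selection behavior.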

3. Empirical Performance and Use Cases

Empirical results demonstrate comparable or improved accuracy relative to L1-regularized logistic regression, Random Forest feature selection, mRMR, and HSIC-Lasso in strong feature-sparsity regimes. Illustrative examples include:

  • Synthetic XOR problem: GBFS selects exactly the relevant interacting pair, achieving 0% test error, unlike L1-LR, which fails in this non-linear setting.
  • Gene-expression data (colon dataset): when features are pre-clustered, GBFS focused on a single biologically meaningful cluster yields approximately 15% test error with far fewer features, competing with group-Lasso and Random Forest FS.
  • Large-scale settings: on kddcup99 (millions of training examples), GBFS attains lower error with fewer features than Random Forest FS and L1-LR.
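The XOR observation can be reproduced in miniature. The sketch below (a self-contained illustration, not the paper's experiment) fits a least-squares linear model and a hand-built depth-2 tree to synthetic XOR data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(400, 2))
y = np.where((X[:, 0] > 0) ^ (X[:, 1] > 0), 1.0, -1.0)   # XOR labels

# Linear fit (least squares with intercept): no linear direction separates XOR.
A = np.c_[X, np.ones(len(X))]
w, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_acc = np.mean(np.sign(A @ w) == y)

# Depth-2 "tree": split on x0, then on x1 in each branch -> captures the interaction.
tree_pred = np.where(X[:, 0] > 0,
                     np.where(X[:, 1] > 0, -1.0, 1.0),
                     np.where(X[:, 1] > 0, 1.0, -1.0))
tree_acc = np.mean(tree_pred == y)
print(linear_acc, tree_acc)   # linear hovers near chance; the tree is exact
```

This is the structural reason tree-based selection can succeed where L1-LR fails: the informative signal exists only as an interaction, which a depth-2 split sequence represents directly.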

GBFS is well-suited to high-dimensional, low-sample settings (e.g., $d \gg n$, as in certain genomics datasets), with competitive gains over kernel and information-theoretic approaches (Xu et al., 2019).

4. Computational Complexity and Scalability

For a fixed maximum tree depth $p$, a single tree fit using standard CART on $n$ samples and $d$ features costs $O(d\,n \log n)$, but with pre-sorting and shallow trees (as is typical under strong regularization), the practical cost is often $O(d\,n)$. $T$ such iterations yield overall complexity $O(T\,d\,n)$. The memory requirements mainly stem from the data matrix ($O(n d)$) and gradient storage ($O(n)$).

To address scalability with very large $d$, one may employ subsampling, "feature-bagging" heuristics, or the scalable group-testing and binary-search methods developed in (Han et al., 2021), which reduce per-node complexity from $O(d)$ to $O(k \log d)$, with $k$ the feature budget. This makes GBFS directly applicable to ultrahigh-dimensional regimes with minimal overhead.
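One simple instance of the subsampling idea (a generic heuristic sketch, not the group-testing scheme of Han et al., 2021) is to evaluate only a random subset of columns at each node:

```python
import numpy as np

def best_split_subsampled(X, r, k, rng):
    """Score only k randomly chosen features at this node, reducing the
    per-node column scan from O(d) to O(k)."""
    d = X.shape[1]
    candidates = rng.choice(d, size=min(k, d), replace=False)
    best = (np.inf, None, None)
    for f in candidates:
        for thr in np.unique(X[:, f])[:-1]:
            left = X[:, f] <= thr
            sse = ((r[left] - r[left].mean()) ** 2).sum() + \
                  ((r[~left] - r[~left].mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, f, thr)
    return best

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 1000))      # d = 1000 features
r = X[:, 7]                           # residuals driven by one column
sse, f, thr = best_split_subsampled(X, r, k=50, rng=rng)
print(f, sse)                         # some valid split; never worse than no split
```

The tradeoff is the usual one: each node sees only a $k/d$ fraction of the features, so informative columns may be missed at any single node but are recovered across many nodes and rounds.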

5. Extensions and Variants

GBFS’s core methodology admits several notable extensions:

  • Group Testing–GBM (GT-GBM): For very large $d$, group-testing and randomized binary search efficiently identify promising new features without full $O(d)$ scans (Han et al., 2021).
  • Multitask Selection: Group and task-specific sparsity penalties allow simultaneous discovery of shared and task-specific features.
  • Differentiable Skinny Trees: End-to-end differentiable tree ensembles with group $\ell_0$ penalties enforce global feature budget constraints, enabling proximal gradient training in tensorized architectures with convergence guarantees (Ibrahim et al., 2023).
  • Unbiased Feature Importance: Cross-validation and out-of-bag approaches (e.g., unbiased gain (Zhang et al., 2023), cross-validated splits (Adler et al., 2021)) mitigate high-cardinality feature bias and support more interpretable selection in the presence of categorical variables.
  • False Discovery Control: Integration with stability selection (e.g., IPSSGB (Melikechi et al., 2024)) yields finite-sample bounds on false positive selection, providing feature-wise q-values and supporting high-dimensional inference scenarios.

6. Practical Considerations and Hyperparameter Tuning

Key hyperparameters include:

  • Learning rate $\varepsilon$: typically 0.01–0.1.
  • Boosting iterations $T$: a proxy for model capacity; tuned via cross-validation.
  • Tree depth $p$: controls the complexity of feature interactions.
  • Feature penalty $\lambda$: higher values induce stricter sparsity; critical for the tradeoff between performance and model compression.

Typical practice involves joint cross-validation over $(\varepsilon, T, p, \lambda)$, with row subsampling or parallelization for large $n$. Feature-subsampling or group-testing schemes are recommended for very large $d$. For structured feature selection (e.g., with feature groups or bags), custom per-feature cost functions $\lambda_f$ may be programmed directly into the impurity scoring.
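The effect of $\lambda$ on sparsity can be seen in a tiny self-contained booster (a stump-based, squared-loss toy; all names illustrative): raising the new-feature penalty shrinks the set of distinct features selected.

```python
import numpy as np

def count_selected(X, y, lam, T=15, eta=0.5):
    """Tiny stump booster with a new-feature penalty; returns how many
    distinct features the penalty lam lets through."""
    pred, used = np.zeros(len(y)), set()
    for _ in range(T):
        r = y - pred
        best = (np.inf, None, None, 0.0, 0.0)
        for f in range(X.shape[1]):
            for thr in np.unique(X[:, f])[:-1]:
                left = X[:, f] <= thr
                lm, rm = r[left].mean(), r[~left].mean()
                sse = ((r[left] - lm) ** 2).sum() + ((r[~left] - rm) ** 2).sum()
                cost = sse + (0.0 if f in used else lam)
                if cost < best[0]:
                    best = (cost, f, thr, lm, rm)
        _, f, thr, lm, rm = best
        used.add(f)
        pred += eta * np.where(X[:, f] <= thr, lm, rm)
    return len(used)

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 20))
y = X[:, 0] + 0.1 * rng.normal(size=150)     # one dominant feature plus noise
c_loose, c_strict = count_selected(X, y, lam=0.0), count_selected(X, y, lam=100.0)
print(c_loose, c_strict)                     # a large lam admits no more features
```

In practice one would sweep $\lambda$ over a grid inside the cross-validation loop and pick the value balancing held-out error against the selected-feature count.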

7. Applications and Limitations

GBFS is particularly effective where interpretability, test-time efficiency, and robustness to irrelevant features are critical: bioinformatics (gene selection), neuroscience, imaging, computer vision (e.g., object detection with HOG features), and large-scale click-through modeling.

Principal strengths:

  • Single-stage, joint feature selection and learning.
  • Nonlinear interaction discovery (trees as base learners).
  • Linear-in-$n$ scalability (barring extremely large $d$).
  • Support for domain-specific sparsity via custom priors.

Limitations include:

  • Computational overhead in extremely high-dimensional settings (very large $d$) without subsampling or group testing.
  • Nonconvexity—solutions are only locally optimal.
  • Feature importance interpretability may be affected in highly collinear or correlated regimes unless care is taken with attribution methods and regularization parameters (Xu et al., 2019, Ibrahim et al., 2023, Adler et al., 2021).

In summary, Gradient Boosted Feature Selection offers a mathematically principled, empirically validated, and extensible toolkit for embedded feature selection in tree ensembles. It unites sparse modeling, nonlinear interaction modeling, and domain-specific priors within a single cost-minimizing framework suitable for modern high-dimensional data analysis.
