Gradient Boosted Feature Selection (GBFS)

Updated 10 February 2026
  • GBFS is a nonlinear, scalable method that embeds feature selection within gradient-boosted tree ensembles by penalizing the introduction of new features.
  • It modifies the tree-building impurity criterion with a fixed penalty for new features, encouraging reuse and enabling the capture of non-linear interactions.
  • GBFS demonstrates competitive performance on high-dimensional data by achieving sparse, interpretable models with improved computational efficiency over traditional methods.

Gradient Boosted Feature Selection (GBFS) is a nonlinear, scalable framework for embedded feature selection in gradient-boosted tree ensembles. By regularizing the introduction of new features in the boosting process, GBFS isolates a compact, interpretable, and high-performing subset of features, naturally capturing non-linear interactions and supporting complex structured sparsity regimes. It directly modifies the tree-building impurity criterion in gradient boosting to penalize the use of previously unseen features, resulting in models that are both statistically sparse and practically efficient, with performance that is competitive with or surpasses state-of-the-art filter and wrapper approaches on high-dimensional, real-world data (Xu et al., 2019).

1. Mathematical and Algorithmic Principles

Let $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$ denote the training data. The model is an additive ensemble $H(x) = \sum_{t=1}^T \beta_t h_t(x)$, where each $h_t$ is a regression tree from a hypothesis class $\mathcal{H}$, $\beta_t \geq 0$, and $T$ is the number of boosting rounds. GBFS introduces a binary feature-usage matrix $F \in \{0,1\}^{d \times T}$, with $F_{ft}$ indicating whether feature $f$ is used in tree $t$.

To induce sparsity, GBFS uses a capped-$\ell_1$ surrogate

$q_\varepsilon(u) = \min(|u|, \varepsilon)$

and solves the non-convex optimization problem
$\min_{\{\beta_t, h_t\}} \sum_{i=1}^n \ell(H(x_i), y_i) + \mu \sum_{f=1}^d q_\varepsilon\Bigl(\sum_{t=1}^T F_{ft} \beta_t\Bigr),$
where $\ell$ is a loss function (e.g., logistic or squared loss), $\varepsilon$ caps the per-feature penalty (and is often set to the boosting step size $\alpha$), and $\mu$ controls the cost of each new feature. This construction ensures that each new feature incurs a fixed cost, while repeated splits on already-introduced features are unpenalized.
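
The penalty term can be sketched directly. The following minimal NumPy snippet (all names are illustrative, not from the paper's code) evaluates the capped-$\ell_1$ regularizer for a given feature-usage matrix $F$ and weight vector $\beta$:

```python
import numpy as np

def capped_l1(u, eps):
    """Capped-L1 surrogate: q_eps(u) = min(|u|, eps)."""
    return np.minimum(np.abs(u), eps)

def feature_penalty(F, beta, mu, eps):
    """Penalty term: mu * sum_f q_eps(sum_t F[f,t] * beta[t]).

    F    : (d, T) binary feature-usage matrix
    beta : (T,) nonnegative tree weights
    """
    usage = F @ beta  # total weight accumulated by each feature across trees
    return mu * capped_l1(usage, eps).sum()

# Toy example: 3 features, 2 trees with weight 0.1 each.
F = np.array([[1, 1],   # feature 0 used in both trees
              [0, 1],   # feature 1 used in tree 1 only
              [0, 0]])  # feature 2 never used
beta = np.array([0.1, 0.1])
# With eps equal to the step size, every used feature pays mu exactly once,
# no matter how many trees reuse it.
print(feature_penalty(F, beta, mu=0.5, eps=0.1))
```

Because the cap $\varepsilon$ equals the step size, feature 0 (used twice) and feature 1 (used once) contribute the same penalty, which is exactly the "fixed cost per new feature" behavior described above.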

Model optimization is performed via a variant of coordinate descent in function space, i.e., a gradient boosting procedure. At each boosting iteration:

  • Negative gradients $g_i = -\partial \ell(H(x_i), y_i)/\partial H(x_i)$ are computed.
  • Unused features are tracked via indicators $\phi_f \in \{0,1\}$, with $\phi_f = 1$ while feature $f$ remains unused.
  • The current tree is trained to minimize

$\frac{1}{2} \sum_{i=1}^n (g_i - h(x_i))^2 + \mu \sum_{f=1}^d \phi_f \, \mathbf{1}\{h \text{ uses } f\},$

effectively adding $\mu$ to the split impurity score whenever a new feature is introduced.

After $T$ rounds, the model $H(x)$ is sparse in the feature space, minimizing the loss subject to penalized feature complexity (Xu et al., 2019).
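
To make the procedure concrete, here is a hedged sketch of one possible GBFS-style boosting loop under simplifying assumptions: squared loss, depth-1 trees (stumps), and exhaustive threshold search. The names (`best_penalized_stump`, `gbfs_fit`) are illustrative, not the authors' implementation:

```python
import numpy as np

def best_penalized_stump(X, g, used, mu):
    """Fit the stump minimizing squared error to negative gradients g,
    adding mu to the score when the split feature has not been used yet."""
    n, d = X.shape
    best = (np.inf, None, None, None, None)  # score, feature, threshold, left, right
    for f in range(d):
        for t in np.unique(X[:, f])[:-1]:    # candidate thresholds
            mask = X[:, f] <= t
            left, right = g[mask].mean(), g[~mask].mean()
            score = ((g[mask] - left) ** 2).sum() + ((g[~mask] - right) ** 2).sum()
            score += 0.0 if used[f] else mu  # one-time penalty for a new feature
            if score < best[0]:
                best = (score, f, t, left, right)
    return best

def gbfs_fit(X, y, T=20, alpha=0.1, mu=1.0):
    """GBFS-style boosting with squared loss; returns predictions and used features."""
    n, d = X.shape
    H = np.zeros(n)
    used = np.zeros(d, dtype=bool)
    for _ in range(T):
        g = y - H                            # negative gradient of squared loss
        _, f, t, left, right = best_penalized_stump(X, g, used, mu)
        used[f] = True
        H += alpha * np.where(X[:, f] <= t, left, right)
    return H, used
```

On a toy dataset where only the first feature carries signal, the penalty keeps irrelevant features unused while the ensemble fits the target, since splitting on a fresh feature must buy more impurity reduction than $\mu$ to be selected.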

2. Sparsity Mechanisms and Structured Priors

GBFS induces sparsity by imposing a cost $\mu$ only when a feature is introduced for the first time; subsequent uses are “free.” This design encourages each tree to reuse features whenever possible, while weak or redundant features are filtered out unless they offer substantial objective reduction. Beyond simple sparsity, GBFS accommodates arbitrary structured sparsity patterns:

  • Feature-group structures (“bags”) can be encoded by letting $\phi_f(\Omega)$ vanish for all features in a group once any member is selected.
  • Side-information functions allow the formulation of custom cost functions for incorporating domain knowledge.

The resulting model is capable of both group sparsity (group Lasso-like behavior) and standard sparsity, without sacrificing nonlinear expressive power.
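
One way such a group prior might be wired into the impurity scoring is sketched below; the `group_cost` helper is a hypothetical illustration of the “once any member is selected, the whole group is free” rule, not code from the paper:

```python
import numpy as np

def group_cost(used, groups, mu):
    """Per-feature penalty phi_f under group structure:
    a feature is free if any feature in its group is already used.

    used   : (d,) bool array of already-selected features
    groups : list of index arrays, one per feature group ("bag")
    """
    cost = np.full(len(used), mu)
    for g in groups:
        if used[g].any():        # group already "opened": all members are free
            cost[g] = 0.0
    return cost

# Example: features {0, 1} form one group, {2, 3} another; feature 0 is selected.
used = np.array([True, False, False, False])
print(group_cost(used, [np.array([0, 1]), np.array([2, 3])], mu=0.5))
```

Feature 1 becomes free because its group-mate (feature 0) is already in the model, while features 2 and 3 still pay the full penalty, yielding group-Lasso-like behavior.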

3. Empirical Performance and Use Cases

Empirical results demonstrate accuracy comparable to or better than L1-regularized logistic regression, Random Forest feature selection, mRMR, and HSIC-Lasso in strong feature-sparsity regimes. Illustrative examples include:

  • Synthetic XOR problem: GBFS selects exactly the relevant interacting pair, achieving 0% test error, unlike L1-LR, which fails in this non-linear setting.
  • Gene-expression data (colon dataset): When features are pre-clustered, GBFS, focusing on a single biologically meaningful cluster, yields roughly 15% test error with far fewer features, competitive with group-Lasso and Random Forest FS.
  • Large-scale settings: On kddcup99 ($n \approx 5$M, $d = 122$), GBFS attains lower error with fewer features than Random Forest FS and L1-LR.
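
The XOR observation is easy to reproduce in spirit with off-the-shelf stand-ins: a small gradient-boosted tree ensemble (scikit-learn's `GradientBoostingClassifier`, standing in for GBFS) separates the XOR pattern, while an L1-penalized logistic regression cannot. This is an illustrative sketch, not the paper's experiment:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400
# Two relevant features forming an XOR pattern, plus 8 irrelevant noise features.
X = rng.uniform(-1, 1, size=(n, 10))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

gbt = GradientBoostingClassifier(max_depth=2, n_estimators=100).fit(X, y)
l1lr = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)

print("GBT train acc  :", gbt.score(X, y))    # high: depth-2 trees capture the interaction
print("L1-LR train acc:", l1lr.score(X, y))   # near chance: no linear separator exists
```

Depth-2 trees can express the two-feature interaction directly (split on one feature, then the other), whereas any linear model is limited to roughly chance accuracy on XOR regardless of regularization.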

GBFS is well-suited to high-dimensional, low-sample settings (e.g., $d \gg n$, as in certain genomics datasets), with competitive gains over kernel and information-theoretic approaches (Xu et al., 2019).

4. Computational Complexity and Scalability

For a fixed maximum tree depth $D$, a single tree fit using standard CART on $n$ samples and $d$ features costs $O(dn \log n)$, but with pre-sorting and shallow trees (as is typical under strong regularization), the practical cost is often $O(dn)$. $T$ such iterations yield overall complexity $O(Tdn)$. The memory requirements mainly stem from the data matrix ($O(nd)$) and gradient storage ($O(n)$).

To address scalability with very large $d$, one may employ subsampling, “feature-bagging” heuristics, or the scalable group-testing and binary-search methods developed in (Han et al., 2021), which reduce per-node complexity from $O(dn)$ to $O(s \log s \log(d/s)\, n \log n)$, where $s$ is the feature budget. This makes GBFS directly applicable to ultrahigh-dimensional regimes with minimal overhead.

5. Extensions and Variants

GBFS’s core methodology admits several notable extensions:

  • Group Testing–GBM (GT-GBM): For $d \gg s$, group-testing and randomized binary search efficiently identify promising new features without full $O(d)$ scans (Han et al., 2021).
  • Multitask Selection: Group and task-specific sparsity penalties allow simultaneous discovery of shared and task-specific features.
  • Differentiable Skinny Trees: End-to-end differentiable tree ensembles with group $\ell_0$ penalties enforce global feature-budget constraints, enabling proximal gradient training in tensorized architectures with convergence guarantees (Ibrahim et al., 2023).
  • Unbiased Feature Importance: Cross-validation and out-of-bag approaches (e.g., unbiased gain (Zhang et al., 2023), cross-validated splits (Adler et al., 2021)) mitigate high-cardinality feature bias and support more interpretable selection in the presence of categorical variables.
  • False Discovery Control: Integration with stability selection (e.g., IPSSGB (Melikechi et al., 2024)) yields finite-sample bounds on false positive selection, providing feature-wise q-values and supporting high-dimensional inference scenarios.

6. Practical Considerations and Hyperparameter Tuning

Key hyperparameters include:

  • Learning rate $\alpha$: typically 0.01–0.1.
  • Boosting iterations $T$: a proxy for model capacity; tuned via cross-validation.
  • Tree depth $D$: controls the complexity of feature interactions.
  • Feature penalty $\mu$: higher values induce stricter sparsity; critical for the tradeoff between performance and model compression.

Typical practice involves joint cross-validation over $(T, D, \mu)$, with row subsampling or parallelization for large $n$. Feature subsampling or group-testing schemes are recommended for very large $d$. For structured feature selection (e.g., with feature groups or bags), custom cost functions via $\phi_f(\Omega)$ may be programmed directly into the impurity scoring.
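
A joint grid search over $(T, D, \mu)$ can be sketched as follows; the `evaluate` function here is a placeholder standing in for cross-validated GBFS error, since the real scoring depends on the data and implementation at hand:

```python
from itertools import product

def evaluate(T, D, mu):
    """Placeholder for cross-validated GBFS error at config (T, D, mu).
    In practice this would fit and score the model via k-fold CV."""
    return abs(T - 100) / 1000 + abs(D - 3) / 10 + abs(mu - 0.5)

grid = {
    "T":  [50, 100, 200],    # boosting rounds
    "D":  [2, 3, 4],         # tree depth
    "mu": [0.1, 0.5, 1.0],   # per-new-feature penalty
}
# Exhaustive search over the Cartesian product of the three axes.
best = min(product(grid["T"], grid["D"], grid["mu"]),
           key=lambda cfg: evaluate(*cfg))
print(best)
```

With a real scoring function, the same loop trades off accuracy (via $T$ and $D$) against sparsity (via $\mu$) in a single pass; row subsampling inside `evaluate` keeps this affordable for large $n$.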

7. Applications and Limitations

GBFS is particularly effective where interpretability, test-time efficiency, and robustness to irrelevant features are critical: bioinformatics (gene selection), neuroscience, imaging, computer vision (e.g., object detection with HOG features), and large-scale click-through modeling.

Principal strengths:

  • Single-stage, joint feature selection and learning.
  • Nonlinear interaction discovery (trees as base learners).
  • Linear-in-$(n, d)$ scalability (barring extremely large $d$).
  • Support for domain-specific sparsity via custom priors.

Limitations include:

  • Computational overhead in extremely high-dimensional settings (large $d$) without subsampling or group testing.
  • Nonconvexity—solutions are only locally optimal.
  • Feature importance interpretability may be affected in highly collinear or correlated regimes unless care is taken with attribution methods and regularization parameters (Xu et al., 2019, Ibrahim et al., 2023, Adler et al., 2021).

In summary, Gradient Boosted Feature Selection offers a mathematically principled, empirically validated, and extensible toolkit for embedded feature selection in tree ensembles. It unites sparse modeling, nonlinear interaction modeling, and domain-specific priors within a single cost-minimizing framework suitable for modern high-dimensional data analysis.
