Gradient Boosted Feature Selection (GBFS)
- GBFS is a nonlinear, scalable method that embeds feature selection within gradient-boosted tree ensembles by penalizing the introduction of new features.
- It modifies the tree-building impurity criterion with a fixed penalty for new features, encouraging reuse and enabling the capture of non-linear interactions.
- GBFS demonstrates competitive performance on high-dimensional data by achieving sparse, interpretable models with improved computational efficiency over traditional methods.
Gradient Boosted Feature Selection (GBFS) is a nonlinear, scalable framework for embedded feature selection in gradient-boosted tree ensembles. By regularizing the introduction of new features in the boosting process, GBFS isolates a compact, interpretable, and high-performing subset of features, naturally capturing non-linear interactions and supporting complex structured sparsity regimes. It directly modifies the tree-building impurity criterion in gradient boosting to penalize the use of previously unseen features, resulting in models that are both statistically sparse and practically efficient, with performance that is competitive with or surpasses state-of-the-art filter and wrapper approaches on high-dimensional, real-world data (Xu et al., 2019).
1. Mathematical and Algorithmic Principles
Let $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$ denote the training data. The model is an additive ensemble $H(x) = \sum_{t=1}^{T} \epsilon\, h_t(x)$, where each $h_t$ is a regression tree from a hypothesis class $\mathcal{H}$, $\epsilon > 0$ is the boosting step size, and $T$ is the number of boosting rounds. GBFS introduces a binary feature usage matrix $\Phi \in \{0,1\}^{d \times T}$, with $\Phi_{f,t} = 1$ indicating whether feature $f$ is used in tree $h_t$.
To induce sparsity, GBFS uses a capped-$\ell_1$ surrogate for the $\ell_0$ penalty, $\|w\|_0 \approx \sum_{f=1}^{d} \min(|w_f|, \varepsilon)/\varepsilon$, and solves the non-convex optimization problem
$\min_{H} \sum_{i=1}^{n} \ell\big(y_i, H(x_i)\big) + \mu \sum_{f=1}^{d} \mathbb{1}\big[\text{feature } f \text{ is used by } H\big],$
where $\ell$ is a loss function (e.g., logistic or squared loss), $\varepsilon$ is the cap parameter (often set to the boosting step size), and $\mu$ controls the per-new-feature penalty. This construction ensures that each new feature incurs a fixed cost $\mu$, while repeated splits on already-introduced features are unpenalized.
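As a quick numeric illustration (not from the paper), the capped-$\ell_1$ surrogate matches the $\ell_0$ count exactly once every nonzero weight's magnitude exceeds the cap:

```python
# Illustrative check of the capped-l1 surrogate min(|w_f|, eps)/eps:
# it equals the l0 count when every nonzero weight exceeds the cap eps.
eps = 0.1

def capped_l1(w, eps):
    return sum(min(abs(wf), eps) / eps for wf in w)

w = [0.0, 0.5, -2.0, 0.0]             # two coordinates are nonzero
l0 = sum(1 for wf in w if wf != 0)    # l0 "norm" = 2
surrogate = capped_l1(w, eps)         # = 2.0 here, since |0.5|, |-2.0| >= eps
```

Weights smaller than the cap contribute fractionally, which is what makes the surrogate continuous and amenable to optimization.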
Model optimization is performed via a variant of coordinate descent in function space, i.e., a gradient boosting procedure. At each boosting iteration:
- Negative gradients $g_i = -\partial \ell\big(y_i, H(x_i)\big) / \partial H(x_i)$ are computed.
- Unused features are tracked via the indicator vector $\phi \in \{0,1\}^d$, with $\phi_f = 1$ if feature $f$ has not yet been used.
- The current tree $h_t$ is trained to minimize
$\frac{1}{2} \sum_{i=1}^n \big(g_i - h(x_i)\big)^2 + \mu \sum_{f=1}^d \phi_f\, \mathbb{1}\big[h \text{ uses } f\big],$
effectively adding $\mu$ to the split impurity score whenever a new feature is introduced.
After $T$ rounds, the model is sparse in the feature space, minimizing loss subject to penalized feature complexity (Xu et al., 2019).
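The loop above can be sketched in a few lines of pure Python. This is a minimal illustration with squared loss and depth-1 trees (stumps), not the paper's reference implementation; all names (`fit_stump`, `gbfs`, `mu`) are chosen for this sketch:

```python
def fit_stump(X, g, used, mu):
    """Find the split minimizing 1/2 * sum (g_i - h(x_i))^2, adding the
    fixed penalty mu when the split feature has not been used before."""
    n, d = len(X), len(X[0])
    base = sum(gi * gi for gi in g) / 2              # score of the empty stump
    best = (base, None)                              # (penalized score, split)
    for f in range(d):
        for thr in sorted({row[f] for row in X}):
            left  = [g[i] for i in range(n) if X[i][f] <= thr]
            right = [g[i] for i in range(n) if X[i][f] >  thr]
            if not left or not right:
                continue
            score = 0.0
            for part in (left, right):
                m = sum(part) / len(part)            # leaf value = mean gradient
                score += sum((gi - m) ** 2 for gi in part) / 2
            if f not in used:                        # new feature pays mu once
                score += mu
            if score < best[0]:
                best = (score, (f, thr, sum(left) / len(left),
                                        sum(right) / len(right)))
    return best[1]

def gbfs(X, y, T=20, lr=0.1, mu=0.5):
    """GBFS-style boosting: the penalty steers trees toward reused features."""
    used, pred, trees = set(), [0.0] * len(y), []
    for _ in range(T):
        g = [yi - pi for yi, pi in zip(y, pred)]     # neg. gradient, squared loss
        split = fit_stump(X, g, used, mu)
        if split is None:
            break
        f, thr, vl, vr = split
        used.add(f)
        trees.append(split)
        for i, row in enumerate(X):
            pred[i] += lr * (vl if row[f] <= thr else vr)
    return trees, used
```

On data with a redundant duplicate feature, the penalty makes every round after the first prefer the already-introduced feature, so `used` stays a singleton; that is exactly the reuse incentive described above.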
2. Sparsity Mechanisms and Structured Priors
GBFS induces sparsity by imposing a cost only when a feature is introduced for the first time—subsequent uses are “free.” This design encourages each tree to reuse features whenever possible, while weak or redundant features are filtered out unless they offer substantial objective reduction. Beyond simple sparsity, GBFS accommodates arbitrary structured sparsity patterns:
- Feature-group structures (“bags”) can be encoded by setting $\phi_f = 0$ for all features in a group once any member is selected, so the group's remaining features become free to use.
- Custom per-feature cost functions allow side information and domain knowledge to be incorporated directly into the penalty.
The resulting model is capable of both group sparsity (group Lasso-like behavior) and standard sparsity, without sacrificing nonlinear expressive power.
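A group-aware cost can be sketched as follows; `make_group_cost` and its interface are illustrative constructions for this document, assuming the penalty function receives the candidate feature and the set of already-used features:

```python
# Sketch: a group-aware new-feature cost over feature "bags".
# Once any member of a group is selected, the cost for the rest of
# that group drops to zero, giving group-Lasso-like selection.

def make_group_cost(groups, mu=1.0):
    """groups: list of feature-index sets; returns cost(f, used)."""
    group_of = {f: g for g in groups for f in g}
    def cost(f, used):
        if f in used:
            return 0.0                       # reuse is always free
        g = group_of.get(f)
        if g is not None and any(m in used for m in g):
            return 0.0                       # group already "opened"
        return mu                            # first feature of a new group
    return cost
```

Plugging such a function in place of the uniform penalty $\mu$ in the impurity score yields the group-sparsity behavior described above.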
3. Empirical Performance and Use Cases
Empirical results demonstrate accuracy comparable to or better than L1-regularized logistic regression, Random Forest feature selection, mRMR, and HSIC-Lasso in strong feature-sparsity regimes. Illustrative examples include:
- Synthetic XOR problem: GBFS selects exactly the relevant interacting pair, achieving 0% test error, unlike L1-LR, which fails in this non-linear setting.
- Gene-expression data (colon dataset): when features are pre-clustered, GBFS concentrates on a single biologically meaningful cluster and yields 15% test error with far fewer features, competitive with group-Lasso and Random Forest FS.
- Large-scale settings: On the kddcup99 dataset (millions of training examples), GBFS attains lower error with fewer features than Random Forest FS and L1-LR.
GBFS is well-suited to high-dimensional, low-sample settings (e.g., $d \gg n$, as in certain genomics datasets), with competitive gains over kernel and information-theoretic approaches (Xu et al., 2019).
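The XOR claim is easy to verify directly. The toy check below (an illustration, not the paper's experiment) shows that no single-feature threshold beats chance on XOR, while a two-level tree over the interacting pair is perfect, which is why axis-aligned trees succeed where linear models fail:

```python
# XOR: any stump (one feature, one threshold) is at chance level,
# but a depth-2 tree over both features classifies perfectly.

X = [(a, b) for a in (0, 1) for b in (0, 1)]
y = [a ^ b for a, b in X]

def stump_acc(f, thr, flip):
    """Accuracy of the rule (x[f] > thr), optionally inverted."""
    preds = [(x[f] > thr) ^ flip for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

best_stump = max(stump_acc(f, thr, flip)
                 for f in (0, 1) for thr in (-0.5, 0.5)
                 for flip in (False, True))

def depth2(x):
    """Split on x[0], then on x[1] in each branch: exactly XOR."""
    return x[1] if x[0] == 0 else 1 - x[1]

tree_acc = sum(depth2(x) == t for x, t in zip(X, y)) / len(y)
```

An L1-regularized linear model faces the same obstacle as the stump: the optimal linear decision on XOR is no better than chance, so it assigns no informative weight to either feature.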
4. Computational Complexity and Scalability
For a fixed maximum tree depth $p$, a single tree fit using standard CART on $n$ samples and $d$ features costs $O(p\, d\, n \log n)$ with per-node sorting, but with pre-sorting and shallow trees (as is typical under strong regularization), the practical cost is often $O(d\, n)$ per tree. $T$ such iterations yield overall complexity $O(T\, d\, n)$. The memory requirements mainly stem from the data matrix ($O(n d)$) and gradient storage ($O(n)$).
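A back-of-envelope instantiation makes these orders concrete (the numbers below are illustrative, not from the paper):

```python
# Instantiating the O(T * n * d) cost and O(n * d) memory estimates
# with illustrative values for a large tabular problem.
T, n, d = 100, 1_000_000, 500

split_evaluations = T * n * d     # ~5e10 (feature, sample) scans overall
data_bytes = n * d * 4            # float32 data matrix: ~2 GB
gradient_bytes = n * 8            # float64 gradient vector: ~8 MB
```

At this scale the data matrix, not the gradients, dominates memory, and the $T n d$ scan count is what motivates the subsampling and group-testing accelerations discussed next.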
To address scalability with very large $d$, one may employ subsampling, “feature-bagging” heuristics, or the scalable group-testing and binary-search methods developed in (Han et al., 2021), which reduce the per-node cost of scanning for new features from linear in $d$ to logarithmic in $d$ under a fixed feature budget. This makes GBFS directly applicable to ultrahigh-dimensional regimes with minimal overhead.
5. Extensions and Related Frameworks
GBFS’s core methodology admits several notable extensions:
- Group Testing–GBM (GT-GBM): For very large $d$, group-testing and randomized binary search efficiently identify promising new features without full feature scans (Han et al., 2021).
- Multitask Selection: Group and task-specific sparsity penalties allow simultaneous discovery of shared and task-specific features.
- Differentiable Skinny Trees: End-to-end differentiable tree ensembles with group penalties enforce global feature budget constraints, enabling proximal gradient training in tensorized architectures with convergence guarantees (Ibrahim et al., 2023).
- Unbiased Feature Importance: Cross-validation and out-of-bag approaches (e.g., unbiased gain (Zhang et al., 2023), cross-validated splits (Adler et al., 2021)) mitigate high-cardinality feature bias and support more interpretable selection in the presence of categorical variables.
- False Discovery Control: Integration with stability selection (e.g., IPSSGB (Melikechi et al., 2024)) yields finite-sample bounds on false positive selection, providing feature-wise q-values and supporting high-dimensional inference scenarios.
6. Practical Considerations and Hyperparameter Tuning
Key hyperparameters include:
- Learning rate $\epsilon$: typically 0.01–0.1.
- Boosting iterations $T$: a proxy for model capacity; tuned via cross-validation.
- Tree depth $p$: controls the complexity of feature interactions.
- Feature penalty $\mu$: higher values induce stricter sparsity; critical for the tradeoff between performance and model compression.
Typical practice involves joint cross-validation over $(\epsilon, T, p, \mu)$, with row subsampling or parallelization for large $n$. Feature-subsampling or group-testing schemes are recommended for very large $d$. For structured feature selection (e.g., with feature groups or bags), custom cost functions over $\phi_f$ may be programmed directly into the impurity scoring.
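The joint search can be sketched as a plain grid loop; `evaluate` is a placeholder for a cross-validated fit-and-score routine, not a real API, and the grid values merely echo the ranges above:

```python
# Hedged sketch of joint tuning over (learning rate, rounds, depth,
# feature penalty). `evaluate` is a stand-in: a real implementation
# would run k-fold CV of the GBFS fit and return validation accuracy.
import itertools

grid = {
    "lr":    [0.01, 0.05, 0.1],
    "T":     [100, 500],
    "depth": [2, 4],
    "mu":    [0.1, 1.0, 10.0],
}

def evaluate(params):
    # Dummy scoring so the example is runnable; it simply prefers a
    # small penalty and large learning rate. Replace with real CV.
    return 1.0 / (1.0 + params["mu"]) + params["lr"]

best = max(
    (dict(zip(grid, vals)) for vals in itertools.product(*grid.values())),
    key=evaluate,
)
```

In practice the grid over $\mu$ deserves the widest (log-spaced) range, since it directly sets the sparsity-accuracy tradeoff.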
7. Applications and Limitations
GBFS is particularly effective where interpretability, test-time efficiency, and robustness to irrelevant features are critical: bioinformatics (gene selection), neuroscience, imaging, computer vision (e.g., object detection with HOG features), and large-scale click-through modeling.
Principal strengths:
- Single-stage, joint feature selection and learning.
- Nonlinear interaction discovery (trees as base learners).
- Linear-in-$n$ scalability (barring extremely large $d$).
- Support for domain-specific sparsity via custom priors.
Limitations include:
- Computational overhead in extremely high-dimensional settings (very large $d$) without subsampling or group testing.
- Nonconvexity—solutions are only locally optimal.
- Feature importance interpretability may be affected in highly collinear or correlated regimes unless care is taken with attribution methods and regularization parameters (Xu et al., 2019, Ibrahim et al., 2023, Adler et al., 2021).
In summary, Gradient Boosted Feature Selection offers a mathematically principled, empirically validated, and extensible toolkit for embedded feature selection in tree ensembles. It unites sparse modeling, nonlinear interaction modeling, and domain-specific priors within a single cost-minimizing framework suitable for modern high-dimensional data analysis.