Gradient Boosted Feature Selection (GBFS)
- GBFS is a nonlinear, scalable method that embeds feature selection within gradient-boosted tree ensembles by penalizing the introduction of new features.
- It modifies the tree-building impurity criterion with a fixed penalty for new features, encouraging reuse and enabling the capture of non-linear interactions.
- GBFS demonstrates competitive performance on high-dimensional data by achieving sparse, interpretable models with improved computational efficiency over traditional methods.
Gradient Boosted Feature Selection (GBFS) is a nonlinear, scalable framework for embedded feature selection in gradient-boosted tree ensembles. By regularizing the introduction of new features in the boosting process, GBFS isolates a compact, interpretable, and high-performing subset of features, naturally capturing non-linear interactions and supporting complex structured sparsity regimes. It directly modifies the tree-building impurity criterion in gradient boosting to penalize the use of previously unseen features, resulting in models that are both statistically sparse and practically efficient, with performance that is competitive with or surpasses state-of-the-art filter and wrapper approaches on high-dimensional, real-world data (Xu et al., 2019).
1. Mathematical and Algorithmic Principles
Let $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$ denote the training data. The model is an additive ensemble $H(x) = \sum_{t=1}^{T} \epsilon\, h_t(x)$, where each $h_t$ is a regression tree from a hypothesis class $\mathcal{H}$, $\epsilon > 0$ is the boosting step size, and $T$ is the number of boosting rounds. GBFS introduces a binary feature usage matrix $\Phi \in \{0,1\}^{d \times T}$, with $\Phi_{f,t} = 1$ indicating whether feature $f$ is used in tree $h_t$.
To induce sparsity, GBFS uses a capped-$\ell_1$ surrogate for the $\ell_0$ penalty, $\|w\|_0 \approx \sum_{f=1}^{d} \min(|w_f|, \varepsilon)/\varepsilon$, and solves the non-convex optimization problem
$\min_{H} \sum_{i=1}^{n} \ell\big(y_i, H(x_i)\big) + \mu \sum_{f=1}^{d} \mathbb{1}\big[\text{feature } f \text{ is used by } H\big],$
where $\ell$ is a loss function (e.g., logistic or squared loss), $\varepsilon$ is the cap parameter (often set to the boosting step size), and $\mu$ controls the per-new-feature penalty. This construction ensures that each new feature incurs a fixed cost $\mu$, while repeated splits on already-introduced features are unpenalized.
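As a quick numeric illustration (not from the paper), the capped-$\ell_1$ surrogate matches the $\ell_0$ count exactly once every nonzero weight's magnitude exceeds the cap:

```python
# Illustrative check of the capped-l1 surrogate min(|w_f|, eps)/eps:
# it equals the l0 count when every nonzero weight exceeds the cap eps.
eps = 0.1

def capped_l1(w, eps):
    return sum(min(abs(wf), eps) / eps for wf in w)

w = [0.0, 0.5, -2.0, 0.0]             # two coordinates are nonzero
l0 = sum(1 for wf in w if wf != 0)    # l0 "norm" = 2
surrogate = capped_l1(w, eps)         # = 2.0 here, since |0.5|, |-2.0| >= eps
```

Weights smaller than the cap contribute fractionally, which is what makes the surrogate continuous and amenable to optimization.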
Model optimization is performed via a variant of coordinate descent in function space, i.e., a gradient boosting procedure. At each boosting iteration:
- Negative gradients $g_i = -\partial \ell\big(y_i, H(x_i)\big) / \partial H(x_i)$ are computed.
- Unused features are tracked via the indicator vector $\phi \in \{0,1\}^d$, with $\phi_f = 1$ if feature $f$ has not yet been used.
- The current tree $h_t$ is trained to minimize
$\frac{1}{2} \sum_{i=1}^n \big(g_i - h(x_i)\big)^2 + \mu \sum_{f=1}^d \phi_f\, \mathbb{1}\big[h \text{ uses } f\big],$
effectively adding $\mu$ to the split impurity score whenever a new feature is introduced.
After $T$ rounds, the model is sparse in the feature space, minimizing loss subject to penalized feature complexity (Xu et al., 2019).
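The loop above can be sketched in a few lines of pure Python. This is a minimal illustration with squared loss and depth-1 trees (stumps), not the paper's reference implementation; all names (`fit_stump`, `gbfs`, `mu`) are chosen for this sketch:

```python
def fit_stump(X, g, used, mu):
    """Find the split minimizing 1/2 * sum (g_i - h(x_i))^2, adding the
    fixed penalty mu when the split feature has not been used before."""
    n, d = len(X), len(X[0])
    base = sum(gi * gi for gi in g) / 2              # score of the empty stump
    best = (base, None)                              # (penalized score, split)
    for f in range(d):
        for thr in sorted({row[f] for row in X}):
            left  = [g[i] for i in range(n) if X[i][f] <= thr]
            right = [g[i] for i in range(n) if X[i][f] >  thr]
            if not left or not right:
                continue
            score = 0.0
            for part in (left, right):
                m = sum(part) / len(part)            # leaf value = mean gradient
                score += sum((gi - m) ** 2 for gi in part) / 2
            if f not in used:                        # new feature pays mu once
                score += mu
            if score < best[0]:
                best = (score, (f, thr, sum(left) / len(left),
                                        sum(right) / len(right)))
    return best[1]

def gbfs(X, y, T=20, lr=0.1, mu=0.5):
    """GBFS-style boosting: the penalty steers trees toward reused features."""
    used, pred, trees = set(), [0.0] * len(y), []
    for _ in range(T):
        g = [yi - pi for yi, pi in zip(y, pred)]     # neg. gradient, squared loss
        split = fit_stump(X, g, used, mu)
        if split is None:
            break
        f, thr, vl, vr = split
        used.add(f)
        trees.append(split)
        for i, row in enumerate(X):
            pred[i] += lr * (vl if row[f] <= thr else vr)
    return trees, used
```

On data with a redundant duplicate feature, the penalty makes every round after the first prefer the already-introduced feature, so `used` stays a singleton; that is exactly the reuse incentive described above.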
2. Sparsity Mechanisms and Structured Priors
GBFS induces sparsity by imposing a cost only when a feature is introduced for the first time—subsequent uses are “free.” This design encourages each tree to reuse features whenever possible, while weak or redundant features are filtered out unless they offer substantial objective reduction. Beyond simple sparsity, GBFS accommodates arbitrary structured sparsity patterns:
- Feature-group structures (“bags”) can be encoded by setting $\phi_f = 0$ for all features in a group once any member is selected, so the group's remaining features become free to use.
- Custom per-feature cost functions allow side information and domain knowledge to be incorporated directly into the penalty.
The resulting model is capable of both group sparsity (group Lasso-like behavior) and standard sparsity, without sacrificing nonlinear expressive power.
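A group-aware cost can be sketched as follows; `make_group_cost` and its interface are illustrative constructions for this document, assuming the penalty function receives the candidate feature and the set of already-used features:

```python
# Sketch: a group-aware new-feature cost over feature "bags".
# Once any member of a group is selected, the cost for the rest of
# that group drops to zero, giving group-Lasso-like selection.

def make_group_cost(groups, mu=1.0):
    """groups: list of feature-index sets; returns cost(f, used)."""
    group_of = {f: g for g in groups for f in g}
    def cost(f, used):
        if f in used:
            return 0.0                       # reuse is always free
        g = group_of.get(f)
        if g is not None and any(m in used for m in g):
            return 0.0                       # group already "opened"
        return mu                            # first feature of a new group
    return cost
```

Plugging such a function in place of the uniform penalty $\mu$ in the impurity score yields the group-sparsity behavior described above.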
3. Empirical Performance and Use Cases
Empirical results demonstrate accuracy comparable to or better than L1-regularized logistic regression, Random Forest feature selection, mRMR, and HSIC-Lasso in strong feature-sparsity regimes. Illustrative examples include:
- Synthetic XOR problem: GBFS selects exactly the relevant interacting pair, achieving 0% test error, unlike L1-LR, which fails in this non-linear setting.
- Gene-expression data (colon dataset): when features are pre-clustered, GBFS concentrates on a single biologically meaningful cluster and yields 15% test error with far fewer features, competitive with group-Lasso and Random Forest FS.
- Large-scale settings: On the kddcup99 dataset (millions of training examples), GBFS attains lower error with fewer features than Random Forest FS and L1-LR.
GBFS is well-suited to high-dimensional, low-sample settings (e.g., $d \gg n$, as in certain genomics datasets), with competitive gains over kernel and information-theoretic approaches (Xu et al., 2019).
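The XOR claim is easy to verify directly. The toy check below (an illustration, not the paper's experiment) shows that no single-feature threshold beats chance on XOR, while a two-level tree over the interacting pair is perfect, which is why axis-aligned trees succeed where linear models fail:

```python
# XOR: any stump (one feature, one threshold) is at chance level,
# but a depth-2 tree over both features classifies perfectly.

X = [(a, b) for a in (0, 1) for b in (0, 1)]
y = [a ^ b for a, b in X]

def stump_acc(f, thr, flip):
    """Accuracy of the rule (x[f] > thr), optionally inverted."""
    preds = [(x[f] > thr) ^ flip for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

best_stump = max(stump_acc(f, thr, flip)
                 for f in (0, 1) for thr in (-0.5, 0.5)
                 for flip in (False, True))

def depth2(x):
    """Split on x[0], then on x[1] in each branch: exactly XOR."""
    return x[1] if x[0] == 0 else 1 - x[1]

tree_acc = sum(depth2(x) == t for x, t in zip(X, y)) / len(y)
```

An L1-regularized linear model faces the same obstacle as the stump: the optimal linear decision on XOR is no better than chance, so it assigns no informative weight to either feature.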
4. Computational Complexity and Scalability
For a fixed maximum tree depth $p$, a single tree fit using standard CART on $n$ samples and $d$ features costs $O(p\, d\, n \log n)$ with per-node sorting, but with pre-sorting and shallow trees (as is typical under strong regularization), the practical cost is often $O(d\, n)$ per tree. $T$ such iterations yield overall complexity $O(T\, d\, n)$. The memory requirements mainly stem from the data matrix ($O(n d)$) and gradient storage ($O(n)$).
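A back-of-envelope instantiation makes these orders concrete (the numbers below are illustrative, not from the paper):

```python
# Instantiating the O(T * n * d) cost and O(n * d) memory estimates
# with illustrative values for a large tabular problem.
T, n, d = 100, 1_000_000, 500

split_evaluations = T * n * d     # ~5e10 (feature, sample) scans overall
data_bytes = n * d * 4            # float32 data matrix: ~2 GB
gradient_bytes = n * 8            # float64 gradient vector: ~8 MB
```

At this scale the data matrix, not the gradients, dominates memory, and the $T n d$ scan count is what motivates the subsampling and group-testing accelerations discussed next.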
To address scalability with very large $d$, one may employ subsampling, “feature-bagging” heuristics, or the scalable group-testing and binary-search methods developed in (Han et al., 2021), which reduce the per-node cost of scanning for new features from linear in $d$ to logarithmic in $d$ under a fixed feature budget. This makes GBFS directly applicable to ultrahigh-dimensional regimes with minimal overhead.
5. Extensions and Related Frameworks
GBFS’s core methodology admits several notable extensions:
- Group Testing–GBM (GT-GBM): For very large $d$, group-testing and randomized binary search efficiently identify promising new features without full feature scans (Han et al., 2021).
- Multitask Selection: Group and task-specific sparsity penalties allow simultaneous discovery of shared and task-specific features.
- Differentiable Skinny Trees: End-to-end differentiable tree ensembles with group penalties enforce global feature budget constraints, enabling proximal gradient training in tensorized architectures with convergence guarantees (Ibrahim et al., 2023).
- Unbiased Feature Importance: Cross-validation and out-of-bag approaches (e.g., unbiased gain (Zhang et al., 2023), cross-validated splits (Adler et al., 2021)) mitigate high-cardinality feature bias and support more interpretable selection in the presence of categorical variables.
- False Discovery Control: Integration with stability selection (e.g., IPSSGB (Melikechi et al., 2024)) yields finite-sample bounds on false positive selection, providing feature-wise q-values and supporting high-dimensional inference scenarios.
6. Practical Considerations and Hyperparameter Tuning
Key hyperparameters include:
- Learning rate $\epsilon$: typically 0.01–0.1.
- Boosting iterations $T$: a proxy for model capacity; tuned via cross-validation.
- Tree depth $p$: controls the complexity of feature interactions.
- Feature penalty $\mu$: higher values induce stricter sparsity; critical for the tradeoff between performance and model compression.
Typical practice involves joint cross-validation over $(\epsilon, T, p, \mu)$, with row subsampling or parallelization for large $n$. Feature-subsampling or group-testing schemes are recommended for very large $d$. For structured feature selection (e.g., with feature groups or bags), custom cost functions over $\phi_f$ may be programmed directly into the impurity scoring.
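The joint search can be sketched as a plain grid loop; `evaluate` is a placeholder for a cross-validated fit-and-score routine, not a real API, and the grid values merely echo the ranges above:

```python
# Hedged sketch of joint tuning over (learning rate, rounds, depth,
# feature penalty). `evaluate` is a stand-in: a real implementation
# would run k-fold CV of the GBFS fit and return validation accuracy.
import itertools

grid = {
    "lr":    [0.01, 0.05, 0.1],
    "T":     [100, 500],
    "depth": [2, 4],
    "mu":    [0.1, 1.0, 10.0],
}

def evaluate(params):
    # Dummy scoring so the example is runnable; it simply prefers a
    # small penalty and large learning rate. Replace with real CV.
    return 1.0 / (1.0 + params["mu"]) + params["lr"]

best = max(
    (dict(zip(grid, vals)) for vals in itertools.product(*grid.values())),
    key=evaluate,
)
```

In practice the grid over $\mu$ deserves the widest (log-spaced) range, since it directly sets the sparsity-accuracy tradeoff.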
7. Applications and Limitations
GBFS is particularly effective where interpretability, test-time efficiency, and robustness to irrelevant features are critical: bioinformatics (gene selection), neuroscience, imaging, computer vision (e.g., object detection with HOG features), and large-scale click-through modeling.
Principal strengths:
- Single-stage, joint feature selection and learning.
- Nonlinear interaction discovery (trees as base learners).
- Linear-in-$n$ scalability (barring extremely large $d$).
- Support for domain-specific sparsity via custom priors.
Limitations include:
- Computational overhead in extremely high-dimensional settings (very large $d$) without subsampling or group testing.
- Nonconvexity—solutions are only locally optimal.
- Feature importance interpretability may be affected in highly collinear or correlated regimes unless care is taken with attribution methods and regularization parameters (Xu et al., 2019, Ibrahim et al., 2023, Adler et al., 2021).
In summary, Gradient Boosted Feature Selection offers a mathematically principled, empirically validated, and extensible toolkit for embedded feature selection in tree ensembles. It unites sparse modeling, nonlinear interaction modeling, and domain-specific priors within a single cost-minimizing framework suitable for modern high-dimensional data analysis.