Lasso-Weighted Random Forests

Updated 12 November 2025
  • Lasso-weighted random forests are an ensemble method that combines equal-weight random forest averaging with adaptive Lasso penalties to balance bias and variance.
  • The technique adapts weights to interpolate between aggressive sparsification and uniform aggregation, excelling particularly under moderate signal-to-noise conditions.
  • Empirical benchmarks demonstrate its versatility across domains, achieving up to 30% improvement in mean squared error and sharper feature selection.

Lasso-weighted random forests—also termed "Lassoed Forests"—are an ensemble learning methodology combining the variance-reducing power of random forests with bias-reduction via sparse convex post-selection, controlled by an adaptive weighted Lasso penalty. This approach seeks to interpolate between the traditional random forest, which uniformly averages an ensemble of high-variance but low-bias regression trees, and post-selection Lasso reweighting, which can aggressively discount weak trees to reduce bias but risks increased variance, especially in low signal-to-noise regimes. By introducing adaptivity in the regularization penalty, Lassoed Forests provide a principled, unified framework that strictly outperforms both standard random forest and fixed-weight Lasso post-selection under moderate signal-to-noise conditions, as established both theoretically and empirically (Shang et al., 10 Nov 2025).

1. Problem Formulation and Estimators

The standard random forest regression model for a feature vector $x \in \mathbb{R}^p$ with $T$ trees $h_j : \mathbb{R}^p \to \mathbb{R}$ predicts via simple averaging:

$$\hat f_{\mathrm{RF}}(x) = \frac{1}{T} \sum_{j=1}^T h_j(x),$$

which is equivalently $\hat f_{\mathrm{RF}}(x) = \sum_{j=1}^{T} w^0_j h_j(x)$ with $w^0_j = 1/T$. While this aggregation reduces variance among trees, it cannot eliminate the bias shared by all $h_j$.

To reduce estimator bias, a fixed-weight Lasso post-selection forms the prediction as a sparse linear combination of trees, selecting weights by solving

$$\hat w = \arg\min_{w \in \mathbb{R}^T} \frac{1}{2n} \left\| y - H w \right\|_2^2 + \lambda \|w\|_1,$$

yielding $\hat f_{\mathrm{Lasso}}(x) = \sum_{j=1}^{T} \hat w_j h_j(x)$, where $H \in \mathbb{R}^{n \times T}$ contains out-of-bag predictions $H_{ij} = h_j(x_i)$. However, the $\ell_1$ penalty risks discarding too many trees or over-shrinking weights, which can degrade accuracy by raising variance when the signal-to-noise ratio (SNR) is low.
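
As a concrete illustration, the following is a minimal sketch of fixed-weight Lasso post-selection with scikit-learn. For simplicity it builds $H$ from per-tree predictions on a held-out split rather than true out-of-bag predictions, and the synthetic data-generating function and all parameter values are illustrative assumptions rather than the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Illustrative synthetic regression problem (not the paper's setup).
n, p = 1000, 10
X = rng.normal(size=(n, p))
f_star = X[:, 0] + 0.5 * X[:, 1] ** 2          # assumed "true" signal
y = f_star + rng.normal(scale=1.0, size=n)

X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)

# Grow T trees on the first half.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X1, y1)

# H[i, j] = prediction of tree j on held-out point i (stand-in for OOB predictions).
H = np.column_stack([tree.predict(X2) for tree in rf.estimators_])

# Fixed-weight Lasso post-selection: sparse re-weighting of the trees.
lasso = LassoCV(cv=5, fit_intercept=False).fit(H, y2)
w_hat = lasso.coef_
print("trees kept:", np.sum(w_hat != 0), "of", len(w_hat))

# Re-weighted prediction f_hat(x) = sum_j w_hat_j h_j(x) for new x.
def f_hat(X_new):
    H_new = np.column_stack([tree.predict(X_new) for tree in rf.estimators_])
    return H_new @ w_hat
```

Note that scikit-learn's Lasso objective uses the same $\frac{1}{2n}$ scaling of the squared loss as the formula above, and `fit_intercept=False` keeps the prediction a pure weighted combination of trees.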

The Lassoed Forest introduces an adaptive Lasso penalty to interpolate between equal weighting ($\lambda = 0$) and aggressive sparsification. Adaptive weights $\alpha_j$ are defined from an initial estimate $w^{(\mathrm{init})}$ as $\alpha_j = |w^{(\mathrm{init})}_j|^{-\gamma}$ for $\gamma > 0$, leading to the objective:

$$\hat w = \arg\min_{w \in \mathbb{R}^T} \frac{1}{2n} \|y - H w\|_2^2 + \lambda \sum_{j=1}^T \alpha_j |w_j|.$$

2. Optimization, Algorithms, and KKT Analysis

At optimality, the sub-differential Karush-Kuhn-Tucker (KKT) conditions for the adaptive Lasso are:

$$\frac{1}{n} h_j^T (H \hat w - y) + \lambda \alpha_j s_j = 0, \qquad s_j \in \begin{cases} \{\mathrm{sgn}(\hat w_j)\}, & \hat w_j \neq 0, \\ [-1,1], & \hat w_j = 0. \end{cases}$$

In particular, $\hat w_j = 0$ implies $\bigl|\tfrac{1}{n} h_j^T (H \hat w - y)\bigr| \leq \lambda \alpha_j$, and strict inequality in this bound forces $\hat w_j = 0$.

Efficient optimization proceeds via cyclic coordinate descent. For each $j$, all $w_k$ ($k \neq j$) are held fixed and

$$z_j = \frac{1}{n} h_j^T \Bigl( y - \sum_{k \neq j} h_k w_k \Bigr), \qquad w_j \leftarrow \frac{1}{\tfrac{1}{n} \|h_j\|_2^2}\, S(z_j, \lambda \alpha_j),$$

where $S(z, \tau) = \mathrm{sgn}(z) \max\{|z| - \tau, 0\}$ is the soft-thresholding operator. Strong screening rules (e.g., discarding trees $j$ for which $|h_j^T y|$ is too small) can dramatically reduce computation.
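
A minimal NumPy sketch of this cyclic coordinate-descent update for the adaptive Lasso objective, together with a check of the KKT conditions above, might look as follows; variable names, the convergence tolerance, and the iteration budget are illustrative choices, not from the paper.

```python
import numpy as np

def soft_threshold(z, tau):
    """S(z, tau) = sgn(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def adaptive_lasso_cd(H, y, lam, alpha, n_iter=200, tol=1e-8):
    """Minimize (1/2n)||y - Hw||^2 + lam * sum_j alpha_j |w_j| by cyclic coordinate descent."""
    n, T = H.shape
    w = np.zeros(T)
    col_norms = (H ** 2).sum(axis=0) / n           # (1/n) ||h_j||^2
    r = y.copy()                                   # residual y - Hw
    for _ in range(n_iter):
        max_delta = 0.0
        for j in range(T):
            if col_norms[j] == 0:
                continue                           # skip degenerate all-zero columns
            # Partial residual adds back tree j's current contribution.
            z_j = H[:, j] @ (r + H[:, j] * w[j]) / n
            w_new = soft_threshold(z_j, lam * alpha[j]) / col_norms[j]
            if w_new != w[j]:
                r -= H[:, j] * (w_new - w[j])      # O(n) residual update
                max_delta = max(max_delta, abs(w_new - w[j]))
                w[j] = w_new
        if max_delta < tol:
            break
    return w

def kkt_violation(H, y, w, lam, alpha):
    """Largest violation of the KKT conditions; approximately 0 at an optimum."""
    n = H.shape[0]
    g = H.T @ (H @ w - y) / n                      # (1/n) h_j^T (Hw - y)
    active = w != 0
    viol_active = np.abs(g[active] + lam * alpha[active] * np.sign(w[active]))
    viol_zero = np.maximum(np.abs(g[~active]) - lam * alpha[~active], 0.0)
    return max(np.max(viol_active, initial=0.0), np.max(viol_zero, initial=0.0))
```

Calling `kkt_violation` after `adaptive_lasso_cd` is a cheap sanity check that the solver has converged to a stationary point of the penalized objective.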

3. Theoretical Properties: Bias-Variance, Oracle Bounds, and SNR

Assuming the true model $y_i = f^*(x_i) + \varepsilon_i$ with sub-Gaussian $\varepsilon$ and that $H$ satisfies a restricted eigenvalue (RE) condition, the analysis focuses on mean squared prediction error. The SNR is defined as

$$\mathrm{SNR} = \frac{\mathrm{Var}(f^*(x))}{\sigma^2}.$$
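
For a simulated problem where $f^*$ is known, the SNR can be estimated directly; a minimal sketch, with an illustrative generative model that is an assumption rather than the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))
f_star = X[:, 0] + 0.5 * X[:, 1] ** 2       # assumed true regression function
sigma = 1.0                                 # noise standard deviation

snr = f_star.var() / sigma ** 2             # SNR = Var(f*(x)) / sigma^2
print(f"estimated SNR: {snr:.2f}")
```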

For the adaptive Lasso, oracle inequalities show that, with high probability,

$$\frac{1}{n} \|H(\hat w - w^*)\|_2^2 \leq C \lambda^2 |S|, \qquad \|\hat w - w^*\|_1 \leq C' \lambda |S|,$$

for $S = \mathrm{supp}(w^*)$, $\lambda = O(\sigma \sqrt{\log T / n})$, and constants $C, C'$ depending on the RE constant $\kappa$.

The bias-variance decomposition is

$$\mathbb{E}[(\hat f(x) - f^*(x))^2] = (\mathbb{E}\hat f(x) - f^*(x))^2 + \mathrm{Var}(\hat f(x)).$$

Vanilla RF typically exhibits large bias but variance that decays as $O(1/T)$. Fixed Lasso reduces bias but risks high variance at low SNR. The adaptive Lasso, by tuning $\alpha_j$, yields an estimator whose upper bound on MSE strictly interpolates between—and can be smaller than—those of both alternatives for moderate SNR, as formalized:

$$\mathbb{E}[(\hat f_{\mathrm{ada}}(x) - f^*(x))^2] < \min\bigl\{\mathbb{E}[(\hat f_{\mathrm{RF}}(x) - f^*(x))^2],\ \mathbb{E}[(\hat f_{\mathrm{Lasso}}(x) - f^*(x))^2]\bigr\}.$$

4. Empirical Evaluation and Benchmarking

Simulation experiments considered polynomial and tree-ensemble generative models. Test metrics included MSE, bias/variance decomposition over replicates, and OOB/CV error estimation. Hyperparameters $\lambda$ and $\gamma$ were selected via cross-validation with grid search over $\gamma \in \{0.5, 1, 2\}$.
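
A minimal sketch of estimating the bias/variance decomposition over replicates for an arbitrary estimator; the generative model, number of replicates, and the plain random forest used here are illustrative assumptions, not the paper's exact experimental setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p, n_rep = 300, 10, 50
X_test = rng.normal(size=(200, p))

def f_star(X):
    return X[:, 0] + 0.5 * X[:, 1] ** 2             # assumed true function

preds = np.empty((n_rep, len(X_test)))
for r in range(n_rep):
    X = rng.normal(size=(n, p))
    y = f_star(X) + rng.normal(size=n)              # fresh noisy training set per replicate
    model = RandomForestRegressor(n_estimators=100, random_state=r).fit(X, y)
    preds[r] = model.predict(X_test)

# Average squared bias and variance over the test points.
bias2 = ((preds.mean(axis=0) - f_star(X_test)) ** 2).mean()
variance = preds.var(axis=0).mean()
print(f"bias^2 = {bias2:.3f}, variance = {variance:.3f}, MSE = {bias2 + variance:.3f}")
```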

Empirical findings included:

  • Low SNR regime: vanilla RF outperforms fixed Lasso post-selection.
  • High SNR regime: Lasso post-selection outperforms vanilla RF by up to 20%.
  • The adaptive Lassoed Forest tracks the better of the two uniformly, providing up to 30% improvement over the less suitable method at moderate SNR.

Variable importance was quantified using weighted split counts:

$$\kappa_s = \sum_{j=1}^T |\hat w_j| \times \frac{\#\{\text{splits on feature } s \text{ in tree } j\}}{\sum_{r=1}^p \#\{\text{splits on } r\}},$$

with adaptively weighted forests producing sharper separation of true signal features.
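
A minimal sketch of this weighted split-count importance using scikit-learn tree internals, interpreting the denominator as the total number of splits within tree $j$ (one plausible reading of the formula); `trees` and `w_hat` are assumed to come from a previously fitted forest and its Lasso weights.

```python
import numpy as np

def weighted_split_importance(trees, w_hat, p):
    """kappa_s = sum_j |w_j| * (#splits on feature s in tree j) / (#splits in tree j)."""
    kappa = np.zeros(p)
    for tree, w in zip(trees, w_hat):
        feat = tree.tree_.feature            # split feature per node; negative values mark leaves
        split_feats = feat[feat >= 0]        # keep internal (splitting) nodes only
        if split_feats.size == 0 or w == 0:
            continue                         # skip split-free trees and zero-weight trees
        counts = np.bincount(split_feats, minlength=p)
        kappa += np.abs(w) * counts / split_feats.size
    return kappa

# Example usage with a fitted RandomForestRegressor `rf` and Lasso weights `w_hat`:
# kappa = weighted_split_importance(rf.estimators_, w_hat, p=X.shape[1])
```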

In real-world case studies:

| Domain | Adaptive Forest Performance Notes |
| --- | --- |
| California Housing | Recovers most of the post-selection gain (5–10%) without loss at low SNR |
| Spam classification | Maximum loss vs. best baseline is ≤1% error |
| Drug response prediction | Lower MSE than RF and Lasso on 5/6 drugs |
| Survival / binary clinical | Higher CDI c-index / lower misclassification |

5. Practical Implementation Procedures

Lassoed Forests are trained using a cross-fitted workflow to prevent over-optimistic bias in out-of-bag or cross-validation error estimates. The high-level procedure is as follows:

  1. Split the dataset $(X, y)$ into disjoint halves $D_1, D_2$.
  2. On $D_1$, grow $T$ trees via bootstrap; obtain OOB predictions on $D_2$ to form the matrix $H$.
  3. Fit a fixed-weight Lasso on $(H, y_{D_2})$ to generate $w^{(\mathrm{init})}$, selecting the regularization $\lambda_0$ by cross-validation.
  4. For a grid of candidate $\lambda$, set adaptive weights $\alpha_j = |w^{(\mathrm{init})}_j|^{-\gamma}$, fit the adaptive Lasso, and estimate the cross-validated error.
  5. Select $\hat\lambda$ minimizing the CV error, with weights $\hat w$ at this solution.
  6. The final prediction is $\hat f(x) = \sum_{j=1}^T \hat w_j h_j(x)$.
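
An end-to-end sketch of this cross-fitted workflow, assuming scikit-learn: the adaptive Lasso step uses the standard column-rescaling trick (divide column $j$ of $H$ by $\alpha_j$, fit a plain Lasso, rescale the coefficients), per-tree predictions on $D_2$ stand in for OOB predictions, and the synthetic data, grids, and tree counts are illustrative choices rather than the paper's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)

# Illustrative data (not the paper's setup).
n, p = 2000, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)

# 1. Split into disjoint halves D1, D2.
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)

# 2. Grow T trees on D1; form H from per-tree predictions on D2.
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X1, y1)
H = np.column_stack([t.predict(X2) for t in rf.estimators_])

# 3. Fixed-weight Lasso for the initial estimate w_init (lambda_0 chosen by CV).
w_init = LassoCV(cv=5, fit_intercept=False).fit(H, y2).coef_

def fit_adaptive_lasso(H, y, lam, alpha):
    """Adaptive Lasso via column rescaling: solve a plain Lasso, then undo the scaling."""
    scale = 1.0 / alpha                      # columns of H scaled by 1/alpha_j
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=10_000).fit(H * scale, y)
    return model.coef_ * scale

# 4.-5. Grid over (gamma, lambda); keep the pair with the smallest CV error on D2.
eps = 1e-6                                   # keeps alpha_j finite when w_init_j = 0 (practical fix)
best = None
for gamma in (0.5, 1.0, 2.0):
    alpha = (np.abs(w_init) + eps) ** (-gamma)
    for lam in np.logspace(-4, 0, 20):
        cv_err = 0.0
        for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(H):
            w = fit_adaptive_lasso(H[tr], y2[tr], lam, alpha)
            cv_err += np.mean((y2[va] - H[va] @ w) ** 2)
        if best is None or cv_err < best[0]:
            best = (cv_err, gamma, lam)

_, gamma_hat, lam_hat = best
alpha_hat = (np.abs(w_init) + eps) ** (-gamma_hat)
w_hat = fit_adaptive_lasso(H, y2, lam_hat, alpha_hat)

# 6. Final prediction f_hat(x) = sum_j w_hat_j h_j(x).
def f_hat(X_new):
    return np.column_stack([t.predict(X_new) for t in rf.estimators_]) @ w_hat
```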

Computational considerations:

  • Tree fitting scales as $O(T n \log n)$.
  • Lasso regression via coordinate descent is $O(L T n)$ per cross-validation fold for $L$ $\lambda$-values.
  • Feature screening and warm starting can reduce the effective complexity to $O(T n)$ even for large $T$ (up to $10^4$).
  • In R, glmnet supports per-variable penalties (the penalty.factor argument) and hence fits the adaptive Lasso directly; in Python, the same effect can be obtained by rescaling the columns of $H$ by $1/\alpha_j$ before fitting sklearn.linear_model.Lasso and rescaling the resulting coefficients afterwards.
  • For very large $T$, sparsity can be exploited by representing $H$ by only its nonzero OOB entries. Parallel coordinate descent and warm starts accelerate the adaptive Lasso solution path.
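
As one way to exploit warm starts along the $\lambda$ path, the whole adaptive-Lasso path can be computed at once with scikit-learn's lasso_path on the rescaled design; this is a sketch under the same column-rescaling assumption as above, with `alpha` and the $\lambda$ grid supplied by the caller.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def adaptive_lasso_path(H, y, alpha, lambdas):
    """Adaptive-Lasso coefficients for every lambda in `lambdas` (warm-started internally)."""
    scale = 1.0 / alpha                                  # column rescaling encodes alpha_j
    lambdas = np.sort(np.asarray(lambdas))[::-1]         # solve from largest to smallest lambda
    lams, coefs, _ = lasso_path(H * scale, y, alphas=lambdas)
    return lams, coefs * scale[:, None]                  # undo the rescaling per coefficient

# Example usage with H, y2, and alpha_hat from the workflow sketch above:
# lams, W = adaptive_lasso_path(H, y2, alpha_hat, np.logspace(-4, 0, 50))
```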

6. Interpretive Perspectives and Methodological Significance

The Lassoed Forest framework synthesizes the strengths and mitigates the core weaknesses of both bagging and convex post-selection in tree ensembles. Explicit dependence on SNR determines which regime—low-variance, high-bias averaging or higher-variance, lower-bias selection—dominates performance. The adaptive penalty yields a smooth transition, and mathematical guarantees under standard RE and sub-Gaussian assumptions establish strict improvement in predictive risk for moderate SNR. This suggests Lassoed Forests are most beneficial when the true function's variance and noise levels are comparable, and in settings requiring both feature-selection interpretability and robust out-of-sample prediction.

The use of weighted split counts for variable importance, conditional on forest weights, provides a tool for causal and feature attribution analyses with enhanced separation of signal features. The modular nature of the post-selection procedure also allows for deployment with alternative ensemble architectures and further regularization frameworks.

A plausible implication is that the Lassoed Forest methodology motivates further exploration of adaptive penalties and staged model selection in nonparametric ensemble learning, especially as model sizes and data scales continue to increase.
