Lasso-Weighted Random Forests

Updated 12 November 2025
  • Lasso-weighted random forests are an ensemble method that combines equal-weight random forest averaging with adaptive Lasso penalties to balance bias and variance.
  • The technique adapts weights to interpolate between aggressive sparsification and uniform aggregation, excelling particularly under moderate signal-to-noise conditions.
  • Empirical benchmarks demonstrate its versatility across domains, achieving up to 30% improvement in mean squared error and sharper feature selection.

Lasso-weighted random forests—also termed "Lassoed Forests"—are an ensemble learning methodology combining the variance-reducing power of random forests with bias-reduction via sparse convex post-selection, controlled by an adaptive weighted Lasso penalty. This approach seeks to interpolate between the traditional random forest, which uniformly averages an ensemble of high-variance but low-bias regression trees, and post-selection Lasso reweighting, which can aggressively discount weak trees to reduce bias but risks increased variance, especially in low signal-to-noise regimes. By introducing adaptivity in the regularization penalty, Lassoed Forests provide a principled, unified framework that strictly outperforms both standard random forest and fixed-weight Lasso post-selection under moderate signal-to-noise conditions, as established both theoretically and empirically (Shang et al., 10 Nov 2025).

1. Problem Formulation and Estimators

The standard random forest regression model for a feature vector $x \in \mathbb{R}^p$ with $T$ trees $h_j : \mathbb{R}^p \to \mathbb{R}$ predicts via simple averaging:

$$\hat f_{\mathrm{RF}}(x) = \frac{1}{T} \sum_{j=1}^T h_j(x),$$

which is equivalently $\hat f_{\mathrm{RF}}(x) = \sum_{j=1}^{T} w^0_j h_j(x)$ with $w^0_j = 1/T$. While this aggregation reduces variance among trees, it cannot eliminate the bias shared by all $h_j$.

To reduce estimator bias, a fixed-weight Lasso post-selection forms the prediction as a sparse linear combination of trees, selecting weights by solving

$$\hat w = \arg\min_{w \in \mathbb{R}^T} \frac{1}{2n} \left\| y - H w \right\|_2^2 + \lambda \|w\|_1,$$

yielding $\hat f_{\mathrm{Lasso}}(x) = \sum_{j=1}^{T} \hat w_j h_j(x)$, where $H \in \mathbb{R}^{n \times T}$ contains out-of-bag predictions $H_{ij} = h_j(x_i)$. However, the $\ell_1$ penalty risks discarding too many trees or over-shrinking weights, which can degrade accuracy by raising variance when the signal-to-noise ratio (SNR) is low.
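
As a concrete illustration, the following is a minimal sketch of fixed-weight Lasso post-selection with scikit-learn. For simplicity it builds $H$ from per-tree predictions on a held-out split rather than true out-of-bag predictions, and the synthetic data-generating function and all parameter values are illustrative assumptions rather than the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Illustrative synthetic regression problem (not the paper's setup).
n, p = 1000, 10
X = rng.normal(size=(n, p))
f_star = X[:, 0] + 0.5 * X[:, 1] ** 2          # assumed "true" signal
y = f_star + rng.normal(scale=1.0, size=n)

X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)

# Grow T trees on the first half.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X1, y1)

# H[i, j] = prediction of tree j on held-out point i (stand-in for OOB predictions).
H = np.column_stack([tree.predict(X2) for tree in rf.estimators_])

# Fixed-weight Lasso post-selection: sparse re-weighting of the trees.
lasso = LassoCV(cv=5, fit_intercept=False).fit(H, y2)
w_hat = lasso.coef_
print("trees kept:", np.sum(w_hat != 0), "of", len(w_hat))

# Re-weighted prediction f_hat(x) = sum_j w_hat_j h_j(x) for new x.
def f_hat(X_new):
    H_new = np.column_stack([tree.predict(X_new) for tree in rf.estimators_])
    return H_new @ w_hat
```

Note that scikit-learn's Lasso objective uses the same $\frac{1}{2n}$ scaling of the squared loss as the formula above, and `fit_intercept=False` keeps the prediction a pure weighted combination of trees.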

The Lassoed Forest introduces an adaptive Lasso penalty to interpolate between equal weighting ($\lambda = 0$) and aggressive sparsification. Adaptive weights $\alpha_j$ are defined from an initial estimate $w^{(\mathrm{init})}$ as $\alpha_j = |w^{(\mathrm{init})}_j|^{-\gamma}$ for $\gamma > 0$, leading to the objective:

$$\hat w = \arg\min_{w \in \mathbb{R}^T} \frac{1}{2n} \|y - H w\|_2^2 + \lambda \sum_{j=1}^T \alpha_j |w_j|.$$

2. Optimization, Algorithms, and KKT Analysis

At optimality, the sub-differential Karush-Kuhn-Tucker (KKT) conditions for the adaptive Lasso are:

$$\frac{1}{n} h_j^T (H \hat w - y) + \lambda \alpha_j s_j = 0, \qquad s_j \in \begin{cases} \{\mathrm{sgn}(\hat w_j)\}, & \hat w_j \neq 0, \\ [-1,1], & \hat w_j = 0. \end{cases}$$

In particular, $\hat w_j = 0$ implies $\bigl|\tfrac{1}{n} h_j^T (H \hat w - y)\bigr| \leq \lambda \alpha_j$, and strict inequality in this bound forces $\hat w_j = 0$.

Efficient optimization proceeds via cyclic coordinate descent. For each $j$, all $w_k$ ($k \neq j$) are held fixed and

$$z_j = \frac{1}{n} h_j^T \Bigl( y - \sum_{k \neq j} h_k w_k \Bigr), \qquad w_j \leftarrow \frac{1}{\tfrac{1}{n} \|h_j\|_2^2}\, S(z_j, \lambda \alpha_j),$$

where $S(z, \tau) = \mathrm{sgn}(z) \max\{|z| - \tau, 0\}$ is the soft-thresholding operator. Strong screening rules (e.g., discarding trees $j$ for which $|h_j^T y|$ is too small) can dramatically reduce computation.
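
A minimal NumPy sketch of this cyclic coordinate-descent update for the adaptive Lasso objective, together with a check of the KKT conditions above, might look as follows; variable names, the convergence tolerance, and the iteration budget are illustrative choices, not from the paper.

```python
import numpy as np

def soft_threshold(z, tau):
    """S(z, tau) = sgn(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def adaptive_lasso_cd(H, y, lam, alpha, n_iter=200, tol=1e-8):
    """Minimize (1/2n)||y - Hw||^2 + lam * sum_j alpha_j |w_j| by cyclic coordinate descent."""
    n, T = H.shape
    w = np.zeros(T)
    col_norms = (H ** 2).sum(axis=0) / n           # (1/n) ||h_j||^2
    r = y.copy()                                   # residual y - Hw
    for _ in range(n_iter):
        max_delta = 0.0
        for j in range(T):
            if col_norms[j] == 0:
                continue                           # skip degenerate all-zero columns
            # Partial residual adds back tree j's current contribution.
            z_j = H[:, j] @ (r + H[:, j] * w[j]) / n
            w_new = soft_threshold(z_j, lam * alpha[j]) / col_norms[j]
            if w_new != w[j]:
                r -= H[:, j] * (w_new - w[j])      # O(n) residual update
                max_delta = max(max_delta, abs(w_new - w[j]))
                w[j] = w_new
        if max_delta < tol:
            break
    return w

def kkt_violation(H, y, w, lam, alpha):
    """Largest violation of the KKT conditions; approximately 0 at an optimum."""
    n = H.shape[0]
    g = H.T @ (H @ w - y) / n                      # (1/n) h_j^T (Hw - y)
    active = w != 0
    viol_active = np.abs(g[active] + lam * alpha[active] * np.sign(w[active]))
    viol_zero = np.maximum(np.abs(g[~active]) - lam * alpha[~active], 0.0)
    return max(np.max(viol_active, initial=0.0), np.max(viol_zero, initial=0.0))
```

Calling `kkt_violation` after `adaptive_lasso_cd` is a cheap sanity check that the solver has converged to a stationary point of the penalized objective.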

3. Theoretical Properties: Bias-Variance, Oracle Bounds, and SNR

Assuming the true model $y_i = f^*(x_i) + \varepsilon_i$ with sub-Gaussian $\varepsilon$ and that $H$ satisfies a restricted eigenvalue (RE) condition, the analysis focuses on mean squared prediction error. The SNR is defined as

$$\mathrm{SNR} = \frac{\mathrm{Var}(f^*(x))}{\sigma^2}.$$
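
For a simulated problem where $f^*$ is known, the SNR can be estimated directly; a minimal sketch, with an illustrative generative model that is an assumption rather than the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))
f_star = X[:, 0] + 0.5 * X[:, 1] ** 2       # assumed true regression function
sigma = 1.0                                 # noise standard deviation

snr = f_star.var() / sigma ** 2             # SNR = Var(f*(x)) / sigma^2
print(f"estimated SNR: {snr:.2f}")
```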

For the adaptive Lasso, oracle inequalities show that, with high probability,

$$\frac{1}{n} \|H(\hat w - w^*)\|_2^2 \leq C \lambda^2 |S|, \qquad \|\hat w - w^*\|_1 \leq C' \lambda |S|,$$

for $S = \mathrm{supp}(w^*)$, $\lambda = O(\sigma \sqrt{\log T / n})$, and constants $C, C'$ depending on the RE constant $\kappa$.

The bias-variance decomposition is

$$\mathbb{E}[(\hat f(x) - f^*(x))^2] = (\mathbb{E}\hat f(x) - f^*(x))^2 + \mathrm{Var}(\hat f(x)).$$

Vanilla RF typically exhibits large bias but variance that decays as $O(1/T)$. Fixed Lasso reduces bias but risks high variance at low SNR. The adaptive Lasso, by tuning $\alpha_j$, yields an estimator whose upper bound on MSE strictly interpolates between—and can be smaller than—those of both alternatives for moderate SNR, as formalized:

$$\mathbb{E}[(\hat f_{\mathrm{ada}}(x) - f^*(x))^2] < \min\bigl\{\mathbb{E}[(\hat f_{\mathrm{RF}}(x) - f^*(x))^2],\ \mathbb{E}[(\hat f_{\mathrm{Lasso}}(x) - f^*(x))^2]\bigr\}.$$

4. Empirical Evaluation and Benchmarking

Simulation experiments considered polynomial and tree-ensemble generative models. Test metrics included MSE, bias/variance decomposition over replicates, and OOB/CV error estimation. Hyperparameters $\lambda$ and $\gamma$ were selected via cross-validation with grid search over $\gamma \in \{0.5, 1, 2\}$.
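
A minimal sketch of estimating the bias/variance decomposition over replicates for an arbitrary estimator; the generative model, number of replicates, and the plain random forest used here are illustrative assumptions, not the paper's exact experimental setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p, n_rep = 300, 10, 50
X_test = rng.normal(size=(200, p))

def f_star(X):
    return X[:, 0] + 0.5 * X[:, 1] ** 2             # assumed true function

preds = np.empty((n_rep, len(X_test)))
for r in range(n_rep):
    X = rng.normal(size=(n, p))
    y = f_star(X) + rng.normal(size=n)              # fresh noisy training set per replicate
    model = RandomForestRegressor(n_estimators=100, random_state=r).fit(X, y)
    preds[r] = model.predict(X_test)

# Average squared bias and variance over the test points.
bias2 = ((preds.mean(axis=0) - f_star(X_test)) ** 2).mean()
variance = preds.var(axis=0).mean()
print(f"bias^2 = {bias2:.3f}, variance = {variance:.3f}, MSE = {bias2 + variance:.3f}")
```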

Empirical findings included:

  • Low SNR regime: vanilla RF outperforms fixed Lasso post-selection.
  • High SNR regime: Lasso post-selection outperforms vanilla RF by up to 20%.
  • The adaptive Lassoed Forest tracks the better of the two uniformly, providing up to 30% improvement over the less suitable method at moderate SNR.

Variable importance was quantified using weighted split counts:

$$\kappa_s = \sum_{j=1}^T |\hat w_j| \times \frac{\#\{\text{splits on feature } s \text{ in tree } j\}}{\sum_{r=1}^p \#\{\text{splits on } r\}},$$

with adaptively weighted forests producing sharper separation of true signal features.
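
A minimal sketch of this weighted split-count importance using scikit-learn tree internals, interpreting the denominator as the total number of splits within tree $j$ (one plausible reading of the formula); `trees` and `w_hat` are assumed to come from a previously fitted forest and its Lasso weights.

```python
import numpy as np

def weighted_split_importance(trees, w_hat, p):
    """kappa_s = sum_j |w_j| * (#splits on feature s in tree j) / (#splits in tree j)."""
    kappa = np.zeros(p)
    for tree, w in zip(trees, w_hat):
        feat = tree.tree_.feature            # split feature per node; negative values mark leaves
        split_feats = feat[feat >= 0]        # keep internal (splitting) nodes only
        if split_feats.size == 0 or w == 0:
            continue                         # skip split-free trees and zero-weight trees
        counts = np.bincount(split_feats, minlength=p)
        kappa += np.abs(w) * counts / split_feats.size
    return kappa

# Example usage with a fitted RandomForestRegressor `rf` and Lasso weights `w_hat`:
# kappa = weighted_split_importance(rf.estimators_, w_hat, p=X.shape[1])
```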

In real-world case studies:

| Domain | Adaptive Forest Performance Notes |
| --- | --- |
| California Housing | Recovers most of the post-selection gain (5–10%) without loss at low SNR |
| Spam classification | Maximum loss vs. best baseline is ≤1% error |
| Drug response prediction | Lower MSE than RF and Lasso on 5/6 drugs |
| Survival / binary clinical | Higher CDI c-index / lower misclassification |

5. Practical Implementation Procedures

Lassoed Forests are trained using a cross-fitted workflow to prevent over-optimistic bias in out-of-bag or cross-validation error estimates. The high-level procedure is as follows:

  1. Split the dataset $(X, y)$ into disjoint halves $D_1, D_2$.
  2. On $D_1$, grow $T$ trees via bootstrap; obtain OOB predictions on $D_2$ to form the matrix $H$.
  3. Fit a fixed-weight Lasso on $(H, y_{D_2})$ to generate $w^{(\mathrm{init})}$, selecting the regularization $\lambda_0$ by cross-validation.
  4. For a grid of candidate $\lambda$, set adaptive weights $\alpha_j = |w^{(\mathrm{init})}_j|^{-\gamma}$, fit the adaptive Lasso, and estimate the cross-validated error.
  5. Select $\hat\lambda$ minimizing the CV error, with weights $\hat w$ at this solution.
  6. The final prediction is $\hat f(x) = \sum_{j=1}^T \hat w_j h_j(x)$.
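
An end-to-end sketch of this cross-fitted workflow, assuming scikit-learn: the adaptive Lasso step uses the standard column-rescaling trick (divide column $j$ of $H$ by $\alpha_j$, fit a plain Lasso, rescale the coefficients), per-tree predictions on $D_2$ stand in for OOB predictions, and the synthetic data, grids, and tree counts are illustrative choices rather than the paper's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)

# Illustrative data (not the paper's setup).
n, p = 2000, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)

# 1. Split into disjoint halves D1, D2.
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)

# 2. Grow T trees on D1; form H from per-tree predictions on D2.
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X1, y1)
H = np.column_stack([t.predict(X2) for t in rf.estimators_])

# 3. Fixed-weight Lasso for the initial estimate w_init (lambda_0 chosen by CV).
w_init = LassoCV(cv=5, fit_intercept=False).fit(H, y2).coef_

def fit_adaptive_lasso(H, y, lam, alpha):
    """Adaptive Lasso via column rescaling: solve a plain Lasso, then undo the scaling."""
    scale = 1.0 / alpha                      # columns of H scaled by 1/alpha_j
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=10_000).fit(H * scale, y)
    return model.coef_ * scale

# 4.-5. Grid over (gamma, lambda); keep the pair with the smallest CV error on D2.
eps = 1e-6                                   # keeps alpha_j finite when w_init_j = 0 (practical fix)
best = None
for gamma in (0.5, 1.0, 2.0):
    alpha = (np.abs(w_init) + eps) ** (-gamma)
    for lam in np.logspace(-4, 0, 20):
        cv_err = 0.0
        for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(H):
            w = fit_adaptive_lasso(H[tr], y2[tr], lam, alpha)
            cv_err += np.mean((y2[va] - H[va] @ w) ** 2)
        if best is None or cv_err < best[0]:
            best = (cv_err, gamma, lam)

_, gamma_hat, lam_hat = best
alpha_hat = (np.abs(w_init) + eps) ** (-gamma_hat)
w_hat = fit_adaptive_lasso(H, y2, lam_hat, alpha_hat)

# 6. Final prediction f_hat(x) = sum_j w_hat_j h_j(x).
def f_hat(X_new):
    return np.column_stack([t.predict(X_new) for t in rf.estimators_]) @ w_hat
```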

Computational considerations:

  • Tree fitting scales as $O(T n \log n)$.
  • Lasso regression via coordinate descent is $O(L T n)$ per cross-validation fold for $L$ $\lambda$-values.
  • Feature screening and warm starting can reduce the effective complexity to $O(T n)$ even for large $T$ (up to $10^4$).
  • In R, glmnet supports per-variable penalties (the penalty.factor argument) and hence fits the adaptive Lasso directly; in Python, the same effect can be obtained by rescaling the columns of $H$ by $1/\alpha_j$ before fitting sklearn.linear_model.Lasso and rescaling the resulting coefficients afterwards.
  • For very large $T$, sparsity can be exploited by representing $H$ by only its nonzero OOB entries. Parallel coordinate descent and warm starts accelerate the adaptive Lasso solution path.
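
As one way to exploit warm starts along the $\lambda$ path, the whole adaptive-Lasso path can be computed at once with scikit-learn's lasso_path on the rescaled design; this is a sketch under the same column-rescaling assumption as above, with `alpha` and the $\lambda$ grid supplied by the caller.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def adaptive_lasso_path(H, y, alpha, lambdas):
    """Adaptive-Lasso coefficients for every lambda in `lambdas` (warm-started internally)."""
    scale = 1.0 / alpha                                  # column rescaling encodes alpha_j
    lambdas = np.sort(np.asarray(lambdas))[::-1]         # solve from largest to smallest lambda
    lams, coefs, _ = lasso_path(H * scale, y, alphas=lambdas)
    return lams, coefs * scale[:, None]                  # undo the rescaling per coefficient

# Example usage with H, y2, and alpha_hat from the workflow sketch above:
# lams, W = adaptive_lasso_path(H, y2, alpha_hat, np.logspace(-4, 0, 50))
```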

6. Interpretive Perspectives and Methodological Significance

The Lassoed Forest framework synthesizes the strengths and mitigates the core weaknesses of both bagging and convex post-selection in tree ensembles. Explicit dependence on SNR determines which regime—low-variance, high-bias averaging or higher-variance, lower-bias selection—dominates performance. The adaptive penalty yields a smooth transition, and mathematical guarantees under standard RE and sub-Gaussian assumptions establish strict improvement in predictive risk for moderate SNR. This suggests Lassoed Forests are most beneficial when the true function's variance and noise levels are comparable, and in settings requiring both feature-selection interpretability and robust out-of-sample prediction.

The use of weighted split counts for variable importance, conditional on forest weights, provides a tool for causal and feature attribution analyses with enhanced separation of signal features. The modular nature of the post-selection procedure also allows for deployment with alternative ensemble architectures and further regularization frameworks.

A plausible implication is that the Lassoed Forest methodology motivates further exploration of adaptive penalties and staged model selection in nonparametric ensemble learning, especially as model sizes and data scales continue to increase.
