- The paper shows that randomization reduces variance while increasing bias, with effects quantified using out-of-sample MSE in regression tasks.
- It finds that Random Forests outperform Bagging in low SNR conditions and with correlated covariates, and explains these patterns through the bias-variance tradeoff.
- It demonstrates that heavy-tailed predictors and irrelevant features can undermine Random Forest performance, favoring less randomized methods.
This paper, "When do Random Forests work?" (2504.12860), investigates the effectiveness of split randomization, the key feature distinguishing Random Forests (RF) from Bagging. It aims to understand when and why adding this randomization improves upon Bagging, which only uses data sampling (bootstrapping). The analysis focuses on regression tasks and uses the out-of-sample Mean Squared Error (MSE) as the primary performance metric.
Framework and Methodology
- Estimators: The paper defines Bagging and Random Forest estimators as averages of B trees:
- Bagging: $\bar{T}_{n}^{*}(x) = \frac{1}{B}\sum_{b=1}^{B} T_{n,b}^{*}(x)$, where the randomness $*$ comes from the bootstrap samples.
- Random Forest: $\bar{T}_{n,m}^{*\dagger}(x) = \frac{1}{B}\sum_{b=1}^{B} T_{n,m,b}^{*\dagger}(x)$, where $*$ again comes from the bootstrap and $\dagger$ from the random selection of a feature subset of size $m$ at each split.
- Performance Metric: Unconditional (out-of-sample) MSE:
$\mathrm{MSE}(\hat f) := \mathbb{E}\big[(Y - \hat f(X))^2\big]$, where the expectation is over both the training data $D_n$ (which determines $\hat f$) and the test pair $(X, Y)$.
- Bias-Variance Decomposition: The paper explicitly uses the decomposition obtained by conditioning on the test point X first (Proposition 1):
$\mathrm{MSE} = \mathbb{E}\big[(f(X) - \mathbb{E}[\hat f(X) \mid X])^2\big] + \mathbb{E}\big[\mathrm{Var}[\hat f(X) \mid X]\big] + \sigma_{\varepsilon}^{2}$
This decomposes the MSE into the average squared bias (conditional on $X$), the average variance (conditional on $X$), and the irreducible error $\sigma_{\varepsilon}^{2}$. The paper notes this decomposition isn't unique; conditioning on the estimator $\hat f$ first yields different bias/variance terms. (A Monte Carlo estimate of this decomposition is sketched after this list.)
- Decorrelation Mechanism: It revisits the standard explanation for RF's variance reduction (Proposition 2). For large B, the variance of the ensemble at point x is approximately:
$\mathrm{Var}[\hat f(x)] \approx \mathrm{Corr}[\hat f_1(x), \hat f_2(x)] \, \mathrm{Var}[\hat f_1(x)]$
Randomization ($\dagger$) aims to reduce the correlation term $\mathrm{Corr}[\hat f_1(x), \hat f_2(x)]$ relative to Bagging. However, it can also increase the bias $\mathbb{E}[\hat f(x)] = \mathbb{E}[\hat f_1(x)]$ and affect the individual tree variance $\mathrm{Var}[\hat f_1(x)]$.
- Normalization and SNR: To compare performance across different data generating processes (DGPs), the regression function $f(X)$ is normalized to have unit variance, $\sigma_f^2 = \mathrm{Var}(f(X)) = 1$. The Signal-to-Noise Ratio is defined as $\mathrm{SNR} = \sigma_f^2 / \sigma_{\varepsilon}^{2}$.
- Relative Performance: Comparisons use the relative difference in MSE:
$\Delta_r = \big(\mathrm{MSE}(\text{bagging}) - \mathrm{MSE}(\text{forest})\big) / \mathrm{MSE}(\text{forest})$. This metric is shown to be invariant to the normalization of $f(X)$ for a fixed SNR (Proposition 3).
- Simulation Setup: The analysis relies on simulations using three DGPs from prior literature (LINEAR, MARS, HIDDEN) with varying characteristics. It uses the R `randomForest` package with default settings (500 trees, minimum node size 5) unless specified otherwise. Statistical significance of MSE differences is assessed using a t-statistic. (A minimal version of this comparison pipeline is sketched below.)
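To make the setup concrete, here is a minimal R sketch of the comparison pipeline, treating Bagging as `randomForest` with `mtry = p`. The MARS-style (Friedman #1) regression function, the sample sizes, and the seed are illustrative assumptions, not the paper's exact configuration:

```r
## Minimal sketch: Bagging vs. Random Forest on a simulated DGP.
library(randomForest)
set.seed(1)

p <- 5; snr <- 1
f_raw <- function(X) {                        # Friedman #1 (MARS-style), an assumption
  10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
    10 * X[, 4] + 5 * X[, 5]
}
mk_X <- function(n) {                         # uniform covariates on [0, 1]
  X <- matrix(runif(n * p), ncol = p)
  colnames(X) <- paste0("x", 1:p)
  X
}

## Normalize f to unit variance (sigma_f^2 = 1), as in the paper
X_big <- mk_X(1e5)
f_mu <- mean(f_raw(X_big)); f_sd <- sd(f_raw(X_big))
f <- function(X) (f_raw(X) - f_mu) / f_sd
sigma_e <- sqrt(1 / snr)                      # SNR = sigma_f^2 / sigma_e^2

X_tr <- mk_X(1000); y_tr <- f(X_tr) + rnorm(1000, sd = sigma_e)
X_te <- mk_X(5000); y_te <- f(X_te) + rnorm(5000, sd = sigma_e)

## Bagging = randomForest with mtry = p (no split randomization);
## the forest keeps the regression default mtry = floor(p / 3).
fit_bag <- randomForest(X_tr, y_tr, ntree = 500, mtry = p, nodesize = 5)
fit_rf  <- randomForest(X_tr, y_tr, ntree = 500, nodesize = 5)

mse_bag <- mean((y_te - predict(fit_bag, X_te))^2)
mse_rf  <- mean((y_te - predict(fit_rf,  X_te))^2)
delta_r <- (mse_bag - mse_rf) / mse_rf        # Delta_r > 0 favors the forest
cat(sprintf("MSE(bagging) = %.3f  MSE(forest) = %.3f  Delta_r = %.3f\n",
            mse_bag, mse_rf, delta_r))
```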
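The Proposition 1 decomposition can be estimated by holding a test sample fixed and retraining the forest on many independent training sets. A sketch reusing the definitions above; the replication count and training size are assumptions:

```r
## Monte Carlo estimate of the Proposition-1 decomposition:
##   MSE = E[(f(X) - E[fhat(X)|X])^2] + E[Var(fhat(X)|X)] + sigma_e^2.
R_rep <- 100; n <- 500
X_fix <- mk_X(2000)                            # fixed test points

preds <- replicate(R_rep, {
  X <- mk_X(n)
  y <- f(X) + rnorm(n, sd = sigma_e)
  predict(randomForest(X, y, ntree = 500, nodesize = 5), X_fix)
})                                             # 2000 x R_rep matrix

cond_mean <- rowMeans(preds)                   # estimates E[fhat(X) | X]
bias2 <- mean((f(X_fix) - cond_mean)^2)        # average squared bias
vari  <- mean(apply(preds, 1, var))            # average conditional variance
cat(sprintf("bias^2 = %.4f  variance = %.4f  sigma_e^2 = %.4f\n",
            bias2, vari, sigma_e^2))
```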
Review of Prior Explanations (SNR Focus)
The paper first revisits and refines existing explanations related to SNR:
- Decorrelation vs. SNR: Randomization consistently reduces tree correlation across low, moderate, and high SNR. The decorrelation effect (the reduction in correlation) is often stronger in high SNR, not weaker. Therefore, decorrelation alone doesn't explain why RF performance relative to Bagging varies with SNR (the sketch after this list estimates this correlation directly).
- Bias-Variance Tradeoff vs. SNR:
- Randomization generally increases the squared bias term.
- Randomization generally reduces the variance term (due to decorrelation outweighing potential changes in single-tree variance).
- Crucially, for both Bagging and RF, variance dominates the MSE in low SNR scenarios, while squared bias becomes relatively more important in high SNR scenarios.
- Conclusion on SNR: Random Forests tend to outperform Bagging in low SNR because the variance reduction achieved through randomization is more impactful than the bias increase, given that variance is the dominant component of the error for both methods. In high SNR, the increased bias from randomization can dominate the variance reduction, potentially making Bagging perform better. The "hidden pattern" model (2401.16129) is noted as an exception where RF can reduce bias even in high SNR.
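To probe the decorrelation claim directly, the between-tree correlation of Proposition 2 can be estimated by growing pairs of trees on replicated training sets and correlating their predictions at a fixed point. A rough sketch reusing `f()`, `mk_X()`, and `p` from the first sketch; the replication count and SNR grid are assumptions:

```r
## Between-tree correlation Corr[fhat_1(x), fhat_2(x)] at one test point,
## taken over training-set replications, for bagged vs. randomized trees.
R_rep <- 200; n <- 500
x0 <- mk_X(1)                                  # one fixed test point

tree_corr <- function(snr, mtry) {
  s_e <- sqrt(1 / snr)
  pairs <- t(replicate(R_rep, {
    X <- mk_X(n)
    y <- f(X) + rnorm(n, sd = s_e)
    c(predict(randomForest(X, y, ntree = 1, mtry = mtry, nodesize = 5), x0),
      predict(randomForest(X, y, ntree = 1, mtry = mtry, nodesize = 5), x0))
  }))                                          # R_rep x 2 matrix of tree pairs
  cor(pairs[, 1], pairs[, 2])
}

for (snr in c(0.1, 1, 10)) {
  cat(sprintf("SNR = %4.1f  corr(bagged) = %.2f  corr(randomized) = %.2f\n",
              snr, tree_corr(snr, mtry = p), tree_corr(snr, mtry = floor(p / 3))))
}
```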
New Findings Beyond SNR (Fixed Moderate SNR)
The paper then explores how other data characteristics affect the RF vs. Bagging comparison, holding SNR fixed and moderate (SNR=1):
- Covariate Tails:
- Changing covariate distributions from Uniform (bounded support) to Normal (unbounded tails) can significantly impact relative performance.
- Finding: Randomization can substantially increase bias in the tail regions of covariates (e.g., for the N-MARS model). This happens because randomization might prevent splitting sufficiently in directions corresponding to tail regions, leading to large prediction cells with few observations, thus limiting bias reduction.
- Implication: In datasets with heavy-tailed predictors, RF can perform worse than Bagging due to increased bias, especially for predictions in those tail regions (probed in the first sketch after this list).
- Irrelevant Covariates:
- Adding irrelevant covariates (features unrelated to Y) increases the MSE for both Bagging and RF, primarily by increasing the bias term. Both methods can mistakenly split on irrelevant features due to noise.
- Finding: Randomization further increases bias in the presence of irrelevant covariates because it increases the chance of selecting an irrelevant feature at splits where the relevant ones are excluded. Since bias becomes a larger component of the error, the negative impact of randomization on bias often outweighs its positive impact on variance.
- Implication: Bagging tends to outperform RF when many irrelevant covariates are present. The optimal `mtry` (the number of features tried at each split) might need to be higher (closer to Bagging) in such scenarios; the first sketch after this list includes this case.
- Correlated Covariates:
- Introducing positive correlation between covariates (using a multivariate normal with constant pairwise correlation ρ) has a significant effect.
- Finding: Correlation reduces the MSE for both Bagging and RF compared to the independent case (ρ=0). This reduction is mainly driven by a decrease in the bias term.
- Finding: While randomization still tends to increase bias relative to Bagging, the overall bias levels are lower in the correlated setting. Since variance reduction from randomization remains effective, and bias is less problematic overall, RF tends to outperform Bagging when covariates are correlated. The relative advantage of RF often increases with the correlation strength ρ.
- Finding: With perfect correlation (ρ=1), Bagging and RF perform identically, as randomization becomes irrelevant (all features provide the same split information).
- Implication: In many real-world datasets where covariates are correlated, RF is likely to be effective. The bias reduction observed for both methods due to correlation is a key finding, suggesting averaging helps reduce bias in correlated settings, not just variance.
- Correlated and Irrelevant Covariates: Correlation can mitigate the negative impact of irrelevant covariates on RF's relative performance. Even with many irrelevant features, if they are also correlated, RF can still outperform Bagging if the correlation is strong enough.
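The covariate-design effects above can be explored with a small helper that swaps the covariate distribution and appends noise features while holding SNR = 1. A sketch reusing `f_raw()` from the first sketch; the generators, sample sizes, and number of noise features are illustrative assumptions modeled on the paper's N-MARS and irrelevant-feature experiments:

```r
## Effect of covariate tails and irrelevant features on Delta_r at SNR = 1.
delta_r_for <- function(gen_X, p_noise = 0) {
  p_tot <- 5 + p_noise
  gen <- function(n) {
    X <- gen_X(n, p_tot); colnames(X) <- paste0("x", 1:p_tot); X
  }
  ## Renormalize f to unit variance under this covariate design
  Xb <- gen(1e5)
  mu <- mean(f_raw(Xb)); sg <- sd(f_raw(Xb))   # f_raw uses only cols 1..5
  fn <- function(X) (f_raw(X) - mu) / sg
  X_tr <- gen(1000); y_tr <- fn(X_tr) + rnorm(1000)  # sigma_e = 1 at SNR = 1
  X_te <- gen(5000); y_te <- fn(X_te) + rnorm(5000)
  fit_bag <- randomForest(X_tr, y_tr, ntree = 500, mtry = p_tot, nodesize = 5)
  fit_rf  <- randomForest(X_tr, y_tr, ntree = 500, nodesize = 5)
  m_bag <- mean((y_te - predict(fit_bag, X_te))^2)
  m_rf  <- mean((y_te - predict(fit_rf,  X_te))^2)
  (m_bag - m_rf) / m_rf                        # > 0 favors the forest
}

unif_X <- function(n, p) matrix(runif(n * p), ncol = p)   # bounded support
norm_X <- function(n, p) matrix(rnorm(n * p), ncol = p)   # unbounded tails

cat("bounded tails (U-MARS style):      ", delta_r_for(unif_X), "\n")
cat("unbounded tails (N-MARS style):    ", delta_r_for(norm_X), "\n")
cat("uniform + 20 irrelevant covariates:", delta_r_for(unif_X, p_noise = 20), "\n")
```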
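For the correlated case, equicorrelated Gaussian covariates can be generated from a single shared factor, which is one way to realize the paper's constant-pairwise-correlation multivariate normal. Sweeping ρ with the helper above (the ρ grid is an assumption):

```r
## Equicorrelated Gaussian covariates: Var(X_j) = 1, Cov(X_j, X_k) = rho.
equi_X <- function(rho) {
  function(n, p) {
    z0 <- rnorm(n)                             # shared factor
    sqrt(rho) * matrix(z0, n, p) +
      sqrt(1 - rho) * matrix(rnorm(n * p), n, p)
  }
}

for (rho in c(0, 0.3, 0.6, 0.9)) {
  cat(sprintf("rho = %.1f  Delta_r = %.3f\n", rho, delta_r_for(equi_X(rho))))
}
```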
Practical Implications and Implementation Considerations
- Choice between RF and Bagging: The decision depends on data characteristics beyond just SNR.
- Favor Random Forest in low SNR settings, or when covariates are known to be substantially correlated.
- Consider Bagging (or RF with high `mtry`) in high SNR settings, especially if bias is a concern, if covariates have heavy tails, or if there are many known irrelevant features.
- Hyperparameter `mtry`: The optimal level of randomization (`mtry`) is context-dependent, and the default of p/3 for regression might not be optimal. Lower `mtry` (more randomization) helps variance reduction (good in low SNR and with correlated data) but can increase bias (bad with heavy tails or irrelevant features). A tuning sketch with `tuneRF` follows this list.
- Evaluation: Comparing methods using relative MSE on normalized data provides a clearer picture. Be explicit about the bias-variance decomposition used. Small relative differences (e.g., <5%) might not be practically significant, depending on the application. Test statistical significance.
- Bias Reduction via Correlation: The finding that correlation reduces bias for both Bagging and RF suggests ensemble methods inherently benefit from feature dependencies in ways beyond simple variance reduction, particularly impacting bias.
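On the `mtry` point, the `randomForest` package ships an out-of-bag-error search, `tuneRF`, which can replace guesswork about the default. A minimal usage sketch reusing the training data from the first sketch:

```r
## Tune mtry by out-of-bag error with randomForest::tuneRF.
tuned <- tuneRF(X_tr, y_tr,
                mtryStart  = max(1, floor(ncol(X_tr) / 3)),  # regression default
                ntreeTry   = 500,
                stepFactor = 2,       # double/halve mtry at each step
                improve    = 0.01,    # keep stepping while OOB error drops >= 1%
                doBest     = TRUE)    # refit and return the best forest
print(tuned$mtry)
```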
Conclusion
The paper provides a nuanced answer to "When do Random Forests work?". Split randomization is effective when its variance reduction outweighs its potential bias increase. This occurs primarily in low SNR environments (where variance dominates) and, significantly, in the presence of correlated covariates (which reduces overall bias for both methods, making the variance reduction from RF more advantageous). Randomization can be detrimental (relative to Bagging) in high SNR settings, especially when data has heavy-tailed covariate distributions or many irrelevant features, as these conditions exacerbate the bias increase caused by randomization. The paper highlights that covariate structure (tails, irrelevance, correlation) plays a crucial role alongside SNR in determining the effectiveness of Random Forests. The discovery that correlation reduces bias in ensemble methods is presented as a potentially important avenue for future research.