
Existence of a moderate-SNR DGP with substantial random forest advantage over bagging

Determine whether there exists a regression data-generating process (DGP) with moderate signal-to-noise ratio, SNR = 1 in the normalized framework where Var(f(X)) = 1, for which a random forest with split randomization (mtry < p) achieves a relative out-of-sample mean-squared error (MSE) improvement substantially greater than 5% over bagging (mtry = p). If such a DGP exists, precisely characterize its properties.


Background

In the paper’s replication and extension of prior studies, the authors consistently observe that, at moderate signal-to-noise ratio (SNR = 1), the relative MSE differences between random forests and bagging are either small (around 1–6%) or statistically insignificant across commonly used data-generating processes (linear, MARS, hidden-pattern).

They explicitly pose the question of whether a data-generating process can be found where, at SNR = 1, random forests decisively outperform bagging by a margin much larger than 5%, but report that they could not find such an example. Establishing the existence (or non-existence) of such a DGP would clarify the regimes in which split randomization yields practically substantial gains over bagging.
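To make the quantities in the problem statement concrete, the following is a minimal sketch, not the authors' code. It assumes that SNR means Var(f(X)) / Var(noise), so that Var(f(X)) = 1 and SNR = 1 give unit noise variance, and that "relative improvement" means (MSE_bagging - MSE_forest) / MSE_bagging. Both definitions, the linear DGP with equal coefficients, and the names `make_linear_dgp` and `relative_improvement` are illustrative assumptions rather than details taken from the paper.

```python
import random
import statistics


def make_linear_dgp(n, p, snr=1.0, seed=0):
    """Draw (X, y) from y = f(X) + eps with Var(f(X)) = 1 and Var(eps) = 1 / snr.

    Assumption: f is linear with equal coefficients on independent N(0, 1)
    features, scaled so that Var(f(X)) = sum(beta_j ** 2) = 1.
    """
    rng = random.Random(seed)
    beta = [1.0 / p ** 0.5] * p
    noise_sd = (1.0 / snr) ** 0.5  # SNR = Var(f(X)) / Var(eps), assumed definition
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        signal = sum(b * v for b, v in zip(beta, x))
        X.append(x)
        y.append(signal + rng.gauss(0.0, noise_sd))
    return X, y


def relative_improvement(mse_bagging, mse_forest):
    """Fraction by which the forest's test MSE falls below bagging's (assumed definition)."""
    return (mse_bagging - mse_forest) / mse_bagging
```

Under these assumptions, a forest test MSE of 0.95 against a bagging test MSE of 1.0 corresponds to a relative improvement of about 0.05, which is the threshold the question asks to exceed by a wide margin. In an actual experiment the two MSE values would come from fitting both ensembles (mtry < p versus mtry = p) on data drawn this way and scoring them on a held-out sample.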

References

When making this observation, a question we asked was the following: for moderate SNR, can we find a DGP for which forest outperforms bagging by much more than 5%? This would be useful to better understand the advantages of randomization. However, we did not succeed in finding such an example.

When do Random Forests work? (2504.12860 - Revelas et al., 17 Apr 2025) in Section 3 (Literature Replications: Review of Previous Findings), final paragraph