Random Forests: Ensemble Learning Power and Practice

Updated 24 June 2026

Random Forests are ensemble learning methods that aggregate multiple decision trees to enhance predictive accuracy and reduce variance.
They excel with high-dimensional data and maintain strong performance even with noise, making them ideal for diverse real-world applications.
Key mechanisms involve data subsampling and random feature selection to ensure diverse tree structures and lower model correlation.

Random forests are ensemble learning algorithms that construct a collection of randomized decision trees and combine their predictions by averaging (for regression) or majority vote (for classification). Originating with Breiman (2001), random forests have become a canonical nonparametric machine learning method due to their adaptability to high-dimensional data, robustness to noise, and strong empirical performance across tabular and structured data domains. The core mechanism is to inject randomization into both input-data subsampling and the feature-selection process at each tree split, thereby generating an ensemble of diverse base learners whose aggregation yields a predictive model with substantially reduced variance and theoretical guarantees in a variety of settings (Duroux et al., 2016, O'Connell, 13 Feb 2026, Scornet et al., 2014).

1. Random Forest Construction and Fundamental Principles

A random forest aggregates a set of $T$ decision trees, each grown independently on a resample of the training data $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n$ (usually a bootstrap sample or a subsample drawn without replacement). At each split node within a tree, a random subset of $m_\mathrm{try}$ features is selected, and among these candidates, the optimal split is chosen according to a node-impurity criterion (e.g., Gini impurity for classification, variance reduction for regression).

Let $h_t(x; \alpha, d)$ denote the prediction of the $t$-th tree at $x$, built with subsampling rate $\alpha \in (0,1]$ (so that each tree is trained on $a_n = \lfloor \alpha n \rfloor$ observations) and maximum depth $d$. The forest predictor is

$f_{T, \alpha, d}(x) = \frac{1}{T} \sum_{t=1}^T h_t(x; \alpha, d).$

Randomness enters at two key points: (1) subsample selection for each tree (bagging) and (2) feature subset sampling at each split. The result is a Monte Carlo average over explicit randomization seeds, yielding an estimator that can be written as a data-adaptive weighted average: $f_{T, \alpha, d}(x) = \sum_{i=1}^n w_i(x)\, Y_i$ with weights $w_i(x)$ determined by tree structure and ensemble composition (O'Connell, 13 Feb 2026, Scornet, 2015).
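The construction above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it assumes scikit-learn's `DecisionTreeRegressor` as the base learner, subsampling without replacement, and the helper names `fit_forest` and `predict_forest` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_forest(X, y, n_trees=50, alpha=0.5, max_depth=None, m_try=None, seed=0):
    """Grow n_trees trees, each on floor(alpha * n) points drawn without replacement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    a_n = int(np.floor(alpha * n))
    trees = []
    for _ in range(n_trees):
        # Randomness (1): the subsample used by this tree.
        idx = rng.choice(n, size=a_n, replace=False)
        tree = DecisionTreeRegressor(
            max_depth=max_depth,
            max_features=m_try,  # Randomness (2): features tried at each split.
            random_state=int(rng.integers(2**31 - 1)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_forest(trees, X):
    """Forest prediction: the plain average of the individual tree predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)


# Synthetic data, purely for illustration.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(400, 5))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=400)

trees = fit_forest(X, y, n_trees=50, alpha=0.5, max_depth=6, m_try=2)
y_hat = predict_forest(trees, X[:10])
```

Here `alpha`, `max_depth`, and `m_try` play the roles of $\alpha$, $d$, and $m_\mathrm{try}$ in the text; the specific values are arbitrary.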

2. Bias–Variance Decomposition and Design-Based Variance Structure

Random forests' predictive error decomposes into bias, variance, and irreducible noise terms. At a point $x$, the mean squared error is

$\mathbb{E}\big[(Y - f_{T, \alpha, d}(x))^2 \mid X = x\big] = \mathrm{Bias}^2\big(f_{T, \alpha, d}(x)\big) + \mathrm{Var}\big(f_{T, \alpha, d}(x)\big) + \sigma^2(x),$

where $\sigma^2(x)$ is the irreducible noise variance and the variance term is due both to the finite sample and to randomization over tree structure.

Critically, in the limit as $T \to \infty$, the predictive variance does not vanish but converges to an intrinsic "structural covariance" floor induced by the design of the forest. This arises from two mechanisms (O'Connell, 13 Feb 2026):

Observation reuse: Nonzero covariance occurs whenever two trees within the ensemble rely on overlapping training examples for predictions at $x$.
Partition alignment: Even if trees are trained on disjoint subsamples, the data-adaptive partitioning can lead to similar local structure, increasing correlation between tree predictions.

Writing $V(x)$ for the variance of a single tree's prediction at $x$ and $C(x)$ for the covariance between the predictions of two distinct trees, the total variance for $T$ trees satisfies

$\mathrm{Var}\big(f_{T, \alpha, d}(x)\big) = C(x) + \frac{V(x) - C(x)}{T},$

where only the Monte Carlo component $(V(x) - C(x))/T$ decays as $T$ increases; the structural covariance $C(x)$ is irreducible (O'Connell, 13 Feb 2026). Fine-grained decomposition using the law of total variance reveals further contributions from partition instability and propagation of response variance within leaves.
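The variance floor can be illustrated with a small simulation. The sketch below does not fit any trees; it assumes, for illustration only, that tree predictions at a fixed point are exchangeable Gaussian variables with a common variance `tree_var` and pairwise covariance `tree_cov` (both hypothetical values), and compares the empirical variance of their average with the standard identity for averages of exchangeable variables.

```python
import numpy as np


def forest_variance(T, tree_var, tree_cov):
    """Variance of the average of T exchangeable predictions.

    tree_cov is the floor that remains as T grows; (tree_var - tree_cov) / T
    is the Monte Carlo part that shrinks with T.
    """
    return tree_cov + (tree_var - tree_cov) / T


rng = np.random.default_rng(0)
tree_var, tree_cov = 1.0, 0.3  # assumed values, chosen only for the demo
n_rep = 20_000

results = {}
for T in (1, 10, 100):
    # A component shared by all trees induces the pairwise covariance.
    shared = rng.normal(scale=np.sqrt(tree_cov), size=(n_rep, 1))
    # Tree-specific components are independent across trees.
    own = rng.normal(scale=np.sqrt(tree_var - tree_cov), size=(n_rep, T))
    forest = (shared + own).mean(axis=1)
    results[T] = (forest.var(), forest_variance(T, tree_var, tree_cov))
    print(T, results[T])
```

Under these assumptions the empirical and theoretical variances should agree up to simulation noise, and both approach `tree_cov` rather than zero as `T` grows.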

3. Statistical Consistency, Minimax Rates, and Adaptivity

Theoretical analysis of random forests demonstrates $L^2$-consistency under a wide range of model classes, provided tree complexity and subsample size are tuned to satisfy certain balance conditions. For Breiman's forests, consistency is guaranteed if the number of leaves per tree grows, but not too quickly relative to the subsample size $a_n$, and if the randomization mechanisms maintain sufficient diversity between tree partitions (Scornet et al., 2014, Biau, 2010, Duroux et al., 2016). In particular:

For additive regression models, fully-grown trees built on subsamples of size $a_n$ with $a_n \to \infty$ and $a_n \log n / n \to 0$ yield $L^2$-consistency (Scornet et al., 2014).
Random forests adapt to sparsity: when only $S$ features are truly informative, the rate of convergence depends only on $S$, not the ambient dimension $p$ (Biau, 2010).
In one-dimensional settings with Lipschitz regression functions, simplified (purely uniformly random) forests achieve the minimax rate $n^{-2/3}$, with the variance reduced by a factor $3/4$ compared to a single tree (Genuer, 2010).
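To make the last item concrete, here is a small sketch of a purely uniformly random forest in one dimension: each tree partitions $[0,1]$ with `k` split points drawn uniformly at random, independently of the data, and predicts the mean response in the cell containing the query point. The regression function, the noise level, the convention of predicting 0 in empty cells, and the helper name `purf_tree` are all assumptions made for illustration. The example only compares the forest with the average single tree on one sample and does not attempt to verify the variance-reduction factor.

```python
import numpy as np


def purf_tree(x_train, y_train, x_query, k, rng):
    """One purely random tree: k uniform cut points, cell means as predictions."""
    cuts = np.sort(rng.uniform(0.0, 1.0, size=k))
    train_cell = np.searchsorted(cuts, x_train)  # cell index in 0..k
    query_cell = np.searchsorted(cuts, x_query)
    sums = np.bincount(train_cell, weights=y_train, minlength=k + 1)
    counts = np.bincount(train_cell, minlength=k + 1)
    # Empty cells predict 0 (an arbitrary convention for this sketch).
    means = np.divide(sums, counts, out=np.zeros(k + 1), where=counts > 0)
    return means[query_cell]


def m(x):
    """A Lipschitz regression function, chosen only for the demo."""
    return np.abs(x - 0.5)


rng = np.random.default_rng(0)
n, k, n_trees = 2000, 12, 200  # k is roughly n ** (1/3)
x_train = rng.uniform(0.0, 1.0, size=n)
y_train = m(x_train) + 0.3 * rng.normal(size=n)
x_query = np.linspace(0.01, 0.99, 99)

tree_preds = np.array(
    [purf_tree(x_train, y_train, x_query, k, rng) for _ in range(n_trees)]
)
forest_pred = tree_preds.mean(axis=0)

mean_tree_mse = np.mean((tree_preds - m(x_query)) ** 2)
forest_mse = np.mean((forest_pred - m(x_query)) ** 2)
print(mean_tree_mse, forest_mse)
```

Because squared error is convex, the error of the averaged prediction can never exceed the average error of the individual trees; how much smaller it is depends on how decorrelated the random partitions are.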

The dominant error terms arise from the trade-off between bias (controlled by tree depth) and variance (inflated by small sample sizes per leaf). For median forests, the mean squared error of the forest estimate $f(x)$ at $x$ obeys an upper bound of the form

$\mathbb{E}\big[(f(x) - m(x))^2\big] \le C_1 \frac{2^k}{n} + C_2\, \rho^{k},$

where $m$ is the regression function, $k$ is the number of splits (depth), $C_1, C_2 > 0$ are constants, and $\rho \in (0,1)$ governs exponential bias decay (Duroux et al., 2016).

4. Design Mechanisms: Subsampling, Pruning, Randomization, and Tuning

Random forests' statistical and computational properties are governed by the interplay of several design parameters:

Subsampling rate $\alpha$: Determines the fraction of data used for each tree. Subsampling at a suitably chosen rate achieves error comparable to the default bootstrap, with lower computational cost, and can occasionally outperform it by reducing variance (Duroux et al., 2016).
Tree depth $d$ or pruning: Shallow trees (small $d$) are high-bias, low-variance; fully grown trees are low-bias, high-variance. Pruning to limit the number of terminal nodes frequently minimizes generalization error.
Feature subsampling (mtry $m_\mathrm{try}$): Governs partition diversity and alignment; moderate $m_\mathrm{try}$ introduces sufficient split randomness to reduce tree-to-tree correlation without sacrificing local resolution (O'Connell, 13 Feb 2026).
Minimum terminal node size: Directly constrains leaf granularity and mediates the bias-variance trade-off.

A design-based analysis explicitly links these parameters to resolution, single-tree variance, and ensemble covariance. Effective tuning requires joint adjustment of these parameters, often via small grid searches, with practical guidance indicating that both under-smoothing and excessive randomness are detrimental (Duroux et al., 2016, O'Connell, 13 Feb 2026).

A table summarizing key design mechanisms and their primary effects:

Parameter	Effect on Bias/Variance	Comments
Subsample rate $\alpha$	Higher $\alpha$: lower bias, higher tree-to-tree covariance	Tuned $\alpha < 1$ can match the bootstrap
Tree depth $d$	Higher $d$: lower bias, higher variance	Intermediate depth balances the two
mtry $m_\mathrm{try}$	Lower $m_\mathrm{try}$: more random splits, lower tree-to-tree correlation	Common default $p/3$ for regression ($p$ features)
Min node size	Larger size: higher bias, lower variance	Stabilizes partitions
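A small joint grid search over these parameters might look as follows. This is a sketch assuming scikit-learn's `RandomForestRegressor` and `GridSearchCV`; the data and grid values are arbitrary. Note that scikit-learn's `max_samples` draws its samples with replacement, so it only approximates the without-replacement subsampling rate $\alpha$ discussed above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 6))
y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] + 0.2 * rng.normal(size=300)

param_grid = {
    "max_samples": [0.5, 0.8],   # fraction of the data drawn for each tree
    "max_depth": [4, None],      # None grows each tree fully
    "max_features": [2, 6],      # features tried at each split (mtry)
    "min_samples_leaf": [1, 5],  # minimum terminal node size
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, bootstrap=True, random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Tuning the four parameters jointly, rather than one at a time, reflects the point that their effects on bias, single-tree variance, and ensemble covariance interact.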

5. Extensions and Variants

Continued research has extended classical random forests along several axes:

Oblique splitting and structured features: Methods such as Sparse Projection Oblique Randomer Forests (SPORF) and Manifold Oblique Random Forests (MORF) replace axis-aligned splits with (possibly sparse or structured) linear projections, improving classification accuracy in settings with strong feature interactions or manifold structure, while maintaining robustness and interpretability (Tomita et al., 2015, Li et al., 2019).
Design-based formalism and kernel view: Random forests admit interpretation as kernel estimators via the so-called KeRF framework, with each tree's partition inducing a data-adaptive kernel. This connection enables theoretical analysis of consistency rates and provides geometric insight into the adaptive weighting of neighbor points (Scornet, 2015).
Conditional density and distributional forests: Contemporary approaches generalize random forests to estimate full conditional distributions (not just means or quantiles), enabling plug-in estimation of moments, quantiles, copulas, or counterfactual effects. Distributional Random Forests leverage nonparametric splitting criteria such as the Maximum Mean Discrepancy (MMD), supporting functional and multivariate responses as well as causal inference tasks (Pospisil et al., 2019, Ćevid et al., 2020).
Game-theoretic and diversity-aware forests: Algorithms such as Banzhaf Random Forests (BRF) leverage the Banzhaf power index for higher-order feature selection, ensuring theoretical consistency, while Diversity Conscious Refined Random Forests prune trees and features to enforce ensemble diversity and reduce computational redundancy without sacrificing accuracy (Sun et al., 2015, Bhattarai et al., 1 Jul 2025).
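The kernel view in the list above can be made concrete by recovering the adaptive weights from a fitted forest. The sketch below assumes scikit-learn's `RandomForestRegressor` with `bootstrap=False`, so that every tree sees the full training set and each leaf value is the mean response of the training points in that leaf; the helper name `forest_weights` is hypothetical. It also computes a KeRF-style estimate from the fraction of trees in which a training point shares a leaf with the query, which is meant as an illustration of the idea rather than a faithful implementation of the cited framework.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 4))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=200)

rf = RandomForestRegressor(
    n_estimators=30,
    bootstrap=False,      # every tree is trained on all observations
    max_features=2,       # randomness comes from feature subsampling only
    min_samples_leaf=5,
    random_state=0,
).fit(X, y)


def forest_weights(rf, X_train, x):
    """Return (weights, connection) for a single query point x."""
    train_leaves = rf.apply(X_train)             # shape (n, T): leaf ids
    query_leaves = rf.apply(x.reshape(1, -1))    # shape (1, T)
    same_leaf = train_leaves == query_leaves     # shape (n, T), boolean
    # Within each tree, weight is uniform over the query's leaf.
    per_tree = same_leaf / same_leaf.sum(axis=0)
    weights = per_tree.mean(axis=1)              # average over trees
    connection = same_leaf.mean(axis=1)          # co-occurrence frequency
    return weights, connection


x0 = np.array([0.2, -0.4, 0.0, 0.5])
w, conn = forest_weights(rf, X, x0)

weighted_prediction = w @ y
kerf_style_prediction = conn @ y / conn.sum()
print(weighted_prediction, rf.predict(x0.reshape(1, -1))[0], kerf_style_prediction)
```

In this setting the weighted average `w @ y` should reproduce the forest's own prediction, while the KeRF-style estimate normalizes once across all trees and can therefore differ slightly.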

6. Random Forests under Dependent Data and Noise Structure

The vanilla random forest assumes independent and identically distributed (i.i.d.) observations and errors. Recent theoretical advances extend random forests to settings with dependent data, such as spatial or time series contexts. The introduction of generalized least squares (RF-GLS) establishes consistency under $\beta$-mixing errors and demonstrates that properly accounting for dependence (e.g., via pre-whitening and global covariance-weighted splits) results in substantial improvements in both theoretical and empirical performance (Saha et al., 2020).

Moreover, exogenous randomness—feature subsampling and random tie-breaking—plays a central role in decorrelating trees, reducing both bias and variance relative to single-tree methods, and can exploit the presence of noise features to enhance ensemble diversity ("blessing of noise features") (Mei et al., 2024).

7. Empirical Performance, Limitations, and Practical Guidance

Extensive empirical evaluations confirm the robustness and adaptability of random forests:

In high-dimensional, sparse settings, forests maintain accuracy by effectively focusing on informative features and avoiding overfitting to noise (Scornet et al., 2014, Biau, 2010).
With appropriate tuning of subsample size and tree depth, forests routinely outperform unpruned or non-subsampled baselines, with practical gains in test error of 5–10% (Duroux et al., 2016).
Enhanced variants such as regression-enhanced forests, conditional density forests, and diversity-conscious ensembles provide further improvements in specialized settings (e.g., extrapolation, distributional estimation, resource-constrained inference) (Zhang et al., 2019, Ćevid et al., 2020, Bhattarai et al., 1 Jul 2025).

However, random forests may suffer in structured data domains where feature-space geometry is not captured by axis-aligned splits, motivating the development of manifold-aware and oblique projection forests (Li et al., 2019, Tomita et al., 2015).

Tuning guidelines emphasize balancing bias and variance through joint optimization of tree depth, subsample size, feature subsampling rate, and stopping rules, with formal design-based analysis providing quantitative pathways for understanding parameter impacts (O'Connell, 13 Feb 2026, Duroux et al., 2016). In all cases, increasing the number of trees $T$ beyond a moderate value yields diminishing returns once the Monte Carlo variance falls beneath the intrinsic structure-induced variance floor.

References:

(Duroux et al., 2016) Impact of subsampling and pruning on random forests
(O'Connell, 13 Feb 2026) Random Forests as Statistical Procedures: Design, Variance, and Dependence
(Genuer, 2010) Risk bounds for purely uniformly random forests
(Scornet et al., 2014) Consistency of random forests
(Biau, 2010) Analysis of a Random Forests Model
(Scornet, 2015) Random forests and kernel methods
(Tomita et al., 2015) Sparse Projection Oblique Randomer Forests
(Zhang et al., 2019) Regression-Enhanced Random Forests
(Pospisil et al., 2019) (f)RFCDE: Random Forests for Conditional Density Estimation and Functional Data
(Ćevid et al., 2020) Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression
(Sun et al., 2015) Banzhaf Random Forests
(Mei et al., 2024) Exogenous Randomness Empowering Random Forests
(Saha et al., 2020) Random Forests for dependent data
(Bhattarai et al., 1 Jul 2025) Diversity Conscious Refined Random Forest
(Li et al., 2019) Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks