Random Forests: Ensemble Learning Power and Practice

Updated 24 June 2026
  • Random Forests are ensemble learning methods that aggregate multiple decision trees to enhance predictive accuracy and reduce variance.
  • They excel with high-dimensional data and maintain strong performance even with noise, making them ideal for diverse real-world applications.
  • Key mechanisms involve data subsampling and random feature selection to ensure diverse tree structures and lower model correlation.

Random forests are ensemble learning algorithms that construct a collection of randomized decision trees and combine their predictions by averaging (for regression) or majority vote (for classification). Introduced by Breiman (2001), random forests have become a canonical nonparametric machine learning method due to their adaptability to high-dimensional data, robustness to noise, and strong empirical performance across tabular and structured data domains. The core mechanism is to inject randomization into both the subsampling of the input data and the feature selection at each tree split, generating an ensemble of diverse weak learners whose aggregation yields a predictive model with substantially reduced variance and theoretical guarantees in a variety of settings (Duroux et al., 2016, O'Connell, 13 Feb 2026, Scornet et al., 2014).

1. Random Forest Construction and Fundamental Principles

A random forest aggregates a set of $T$ decision trees, each grown independently on a subsample (usually via bootstrap or without replacement) of the training data $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n$. At each split node within a tree, a random subset of $m_\mathrm{try}$ features is selected, and among these candidates, the optimal split is chosen according to a node-impurity criterion (e.g., Gini impurity for classification, variance reduction for regression).

Let $h_t(x; \alpha, d)$ denote the prediction of the $t$th tree at $x$, built with subsampling rate $\alpha \in (0,1]$ (so that each tree is trained on $a_n = \lfloor \alpha n \rfloor$ observations) and maximum depth $d$. The forest predictor is

$$f_{T, \alpha, d}(x) = \frac{1}{T} \sum_{t=1}^T h_t(x; \alpha, d).$$

Randomness enters at two key points: (1) subsample selection for each tree (bagging) and (2) feature subset sampling at each split. The result is a Monte Carlo average over explicit randomization seeds, yielding an estimator that can be written as a data-adaptive weighted average of the responses, $f_{T, \alpha, d}(x) = \sum_{i=1}^n w_i(x)\, Y_i$, with weights $w_i(x)$ determined by tree structure and ensemble composition (O'Connell, 13 Feb 2026, Scornet, 2015).
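The construction described above can be sketched in plain Python. This is a minimal illustration under simplifying assumptions, not Breiman's reference implementation: it handles regression only, subsamples without replacement, and tries every observed value as a split threshold. The function names (`grow_tree`, `fit_forest`, `predict_forest`) and the toy data are hypothetical.

```python
import random
from statistics import mean


def sum_sq_dev(ys):
    """Sum of squared deviations from the mean (regression node impurity)."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def grow_tree(rows, depth, m_try, rng):
    """Grow a regression tree on rows of (features, target) pairs."""
    ys = [y for _, y in rows]
    if depth == 0 or len(rows) < 2:
        return mean(ys)  # leaf: predict the mean response
    best = None
    n_features = len(rows[0][0])
    # Feature subsampling: only m_try randomly drawn features compete at this node.
    for j in rng.sample(range(n_features), m_try):
        for thr in sorted({x[j] for x, _ in rows})[:-1]:
            left = [r for r in rows if r[0][j] <= thr]
            right = [r for r in rows if r[0][j] > thr]
            score = sum_sq_dev([y for _, y in left]) + sum_sq_dev([y for _, y in right])
            if best is None or score < best[0]:
                best = (score, j, thr, left, right)
    if best is None:  # every candidate feature is constant in this node
        return mean(ys)
    _, j, thr, left, right = best
    return (j, thr,
            grow_tree(left, depth - 1, m_try, rng),
            grow_tree(right, depth - 1, m_try, rng))


def predict_tree(node, x):
    """Route x down the tree until a leaf value is reached."""
    while isinstance(node, tuple):
        j, thr, left, right = node
        node = left if x[j] <= thr else right
    return node


def fit_forest(rows, n_trees=30, alpha=0.7, depth=3, m_try=1, seed=0):
    """Grow n_trees trees, each on floor(alpha * n) rows drawn without replacement."""
    rng = random.Random(seed)
    a_n = max(2, int(alpha * len(rows)))
    return [grow_tree(rng.sample(rows, a_n), depth, m_try, rng) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Forest predictor: plain average of the tree predictions."""
    return mean(predict_tree(tree, x) for tree in forest)


# Toy data (illustrative only): feature 0 drives the response, feature 1 is noise.
data_rng = random.Random(1)
rows = []
for _ in range(150):
    x = (data_rng.random(), data_rng.random())
    rows.append((x, 3.0 * x[0] + data_rng.gauss(0.0, 0.1)))

forest = fit_forest(rows, n_trees=30, alpha=0.7, depth=3, m_try=1)
low = predict_forest(forest, (0.1, 0.5))
high = predict_forest(forest, (0.9, 0.5))
```

With `m_try=1`, some nodes are forced to split on the noise feature, so individual trees are crude, yet the averaged predictions `low` and `high` should still track the increasing trend in feature 0.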

2. Bias–Variance Decomposition and Design-Based Variance Structure

Random forests' predictive error decomposes into bias, variance, and irreducible noise terms. At a point $x$, the mean squared error is

$$\mathbb{E}\big[(Y - f_{T, \alpha, d}(x))^2 \mid X = x\big] = \mathrm{Bias}^2(x) + \mathrm{Var}(x) + \sigma^2,$$

where $\sigma^2$ is the irreducible noise variance and the variance term is due both to the finite sample and to randomization over tree structure.

Critically, in the limit as $T \to \infty$, the predictive variance does not vanish but converges to an intrinsic "structural covariance" floor induced by the design of the forest. This arises from two mechanisms (O'Connell, 13 Feb 2026):

  • Observation reuse: Nonzero covariance occurs whenever two trees within the ensemble rely on overlapping training examples for predictions at $x$.
  • Partition alignment: Even if trees are trained on disjoint subsamples, the data-adaptive partitioning can lead to similar local structure, increasing correlation between tree predictions.

The total variance for $T$ trees satisfies

$$\mathrm{Var}\big[f_{T, \alpha, d}(x)\big] = C(x) + \frac{V(x) - C(x)}{T},$$

where $V(x)$ denotes the variance of a single tree's prediction at $x$ and $C(x)$ the covariance between the predictions of two distinct trees (the structural covariance). Only the Monte Carlo component $(V(x) - C(x))/T$ decays as $T$ increases; the structural covariance is irreducible (O'Connell, 13 Feb 2026). Fine-grained decomposition using the law of total variance reveals further contributions from partition instability and propagation of response variance within leaves.
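The split into a decaying Monte Carlo term and an irreducible covariance term follows from expanding the variance of an average. The derivation below is a sketch that assumes the trees are identically distributed; $V(x)$ (single-tree variance) and $C(x)$ (covariance between two distinct trees) are notation chosen for this sketch and need not match the cited source.

```latex
\begin{aligned}
\operatorname{Var}\!\left[\frac{1}{T}\sum_{t=1}^{T} h_t(x)\right]
&= \frac{1}{T^2}\sum_{t=1}^{T}\operatorname{Var}\!\left[h_t(x)\right]
 + \frac{1}{T^2}\sum_{t \neq s}\operatorname{Cov}\!\left[h_t(x), h_s(x)\right] \\
&= \frac{V(x)}{T} + \frac{T-1}{T}\,C(x) \\
&= C(x) + \frac{V(x) - C(x)}{T}.
\end{aligned}
```

As $T \to \infty$ the second term vanishes and the variance tends to $C(x)$, which is why adding trees cannot push the variance below the covariance floor.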

3. Statistical Consistency, Minimax Rates, and Adaptivity

Theoretical analysis of random forests demonstrates $L^2$-consistency under a wide range of model classes, provided tree complexity and subsample size are tuned to satisfy certain balance conditions. For Breiman's forests, consistency is guaranteed if the number of leaves per tree grows, but not too quickly relative to the subsample size $a_n$, and if the randomization mechanisms maintain sufficient diversity between tree partitions (Scornet et al., 2014, Biau, 2010, Duroux et al., 2016). In particular:

  • For additive regression models, fully-grown trees built on subsamples of size $a_n$ with $a_n \to \infty$ and $a_n \log n / n \to 0$ yield $L^2$-consistency (Scornet et al., 2014).
  • Random forests adapt to sparsity: when only $S$ features are truly informative, the rate of convergence depends only on $S$, not the ambient dimension $p$ (Biau, 2010).
  • In one-dimensional settings with Lipschitz regression functions, simplified (purely uniformly random) forests achieve the minimax rate $n^{-2/3}$, with the variance reduced by a factor $3/4$ compared to a single tree (Genuer, 2010).

The dominant error terms arise from the trade-off between bias (controlled by tree depth) and variance (inflated by small sample sizes per leaf). For median forests, the mean squared error at $x$ obeys an upper bound of the schematic form

$$\mathbb{E}\big[(f(x) - m(x))^2\big] \le C_1 \frac{2^k}{n} + C_2\, \rho^k,$$

where $m$ is the regression function, $k$ is the number of splits (depth), $C_1$ and $C_2$ are constants, and $\rho \in (0,1)$ governs exponential bias decay (Duroux et al., 2016).
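The depth trade-off can be made concrete with a toy calculation. The sketch below assumes a schematic bound with a variance term proportional to $2^k/n$ and a bias term proportional to $\rho^k$. The constants `c_var`, `c_bias`, and `rho` are hypothetical placeholders rather than values from Duroux et al. (2016), so only the qualitative behaviour is meaningful.

```python
def schematic_bound(k, n, c_var=1.0, c_bias=1.0, rho=0.8):
    """Hypothetical bound: variance grows like 2**k / n, bias decays like rho**k."""
    return c_var * 2 ** k / n + c_bias * rho ** k


def best_depth(n, max_k=30, **constants):
    """Depth k in 0..max_k that minimizes the schematic bound for sample size n."""
    return min(range(max_k + 1), key=lambda k: schematic_bound(k, n, **constants))


for n in (100, 10_000, 1_000_000):
    k_star = best_depth(n)
    print(n, k_star, round(schematic_bound(k_star, n), 4))
```

Under these assumed constants the minimizing depth grows slowly (roughly logarithmically) with $n$: deeper trees only pay off once enough data is available to keep the per-leaf variance in check.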

4. Design Mechanisms: Subsampling, Pruning, Randomization, and Tuning

Random forests' statistical and computational properties are governed by the interplay of several design parameters:

  • Subsampling rate $\alpha$: Determines the fraction of data used for each tree. Subsampling at rates below 1 can achieve error comparable to the default bootstrap at lower computational cost, and can occasionally outperform it by reducing variance (Duroux et al., 2016).
  • Tree depth $d$ or pruning: Shallow trees (small $d$) are high-bias, low-variance; fully grown trees are low-bias, high-variance. Pruning to an intermediate number of terminal nodes frequently minimizes generalization error.
  • Feature subsampling ($m_\mathrm{try}$): Governs partition diversity and alignment; moderate $m_\mathrm{try}$ introduces sufficient split randomness to reduce tree-to-tree correlation without sacrificing local resolution (O'Connell, 13 Feb 2026).
  • Minimum terminal node size: Directly constrains leaf granularity and mediates the bias-variance trade-off.

A design-based analysis explicitly links these parameters to resolution, single-tree variance, and ensemble covariance. Effective tuning requires joint adjustment of these parameters, often via small grid searches, with practical guidance indicating that both under-smoothing and excessive randomness are detrimental (Duroux et al., 2016, O'Connell, 13 Feb 2026).

A table summarizing key design mechanisms and their primary effects:

| Parameter | Effect on Bias/Variance | Comments |
|---|---|---|
| Subsample rate $\alpha$ | Higher $\alpha$: lower bias, higher between-tree covariance | $\alpha < 1$ often optimal |
| Tree depth $d$ | Higher $d$: lower bias, higher variance | Intermediate depth balances the two |
| mtry $m_\mathrm{try}$ | Lower $m_\mathrm{try}$: more random splits, lower between-tree correlation | Default $p/3$ for regression ($p$ features) |
| Min node size | Larger size: higher bias, lower variance | Stabilizes partitions |
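A small joint grid search over these parameters might look as follows in scikit-learn. This is a sketch under stated assumptions: the synthetic data and grid values are illustrative, and the mapping is approximate, since `max_samples` with `bootstrap=True` draws with replacement rather than taking a true subsample and `max_features` is given here as a fraction of the features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (illustrative only): 5 features, 2 of them informative.
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 5))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

param_grid = {
    "max_samples": [0.5, 0.8],    # fraction of rows drawn per tree (rate alpha)
    "max_depth": [4, None],       # tree depth d; None grows trees fully
    "max_features": [1 / 3, 1.0], # m_try as a fraction of the features
    "min_samples_leaf": [1, 5],   # minimum terminal node size
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, bootstrap=True, random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_mse = -search.best_score_  # cross-validated mean squared error
print(best_params, round(best_mse, 4))
```

Searching the four parameters jointly rather than one at a time matters because their effects interact: for example, a larger minimum leaf size can make an explicit depth limit redundant.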

5. Extensions and Variants: Oblique Forests, Conditional Distributions, and Structural Refinements

Continued research has extended classical random forests along several axes:

  • Oblique splitting and structured features: Methods such as Sparse Projection Oblique Randomer Forests (SPORF) and Manifold Oblique Random Forests (MORF) replace axis-aligned splits with (possibly sparse or structured) linear projections, improving classification accuracy in settings with strong feature interactions or manifold structure, while maintaining robustness and interpretability (Tomita et al., 2015, Li et al., 2019).
  • Design-based formalism and kernel view: Random forests admit interpretation as kernel estimators via the so-called KeRF framework, with each tree's partition inducing a data-adaptive kernel. This connection enables theoretical analysis of consistency rates and provides geometric insight into the adaptive weighting of neighbor points (Scornet, 2015).
  • Conditional density and distributional forests: Contemporary approaches generalize random forests to estimate full conditional distributions (not just means or quantiles), enabling plug-in estimation of moments, quantiles, copulas, or counterfactual effects. Distributional Random Forests leverage nonparametric splitting criteria such as the Maximum Mean Discrepancy (MMD), supporting functional and multivariate responses as well as causal inference tasks (Pospisil et al., 2019, Ćevid et al., 2020).
  • Game-theoretic and diversity-aware forests: Algorithms such as Banzhaf Random Forests (BRF) leverage the Banzhaf power index for higher-order feature selection, ensuring theoretical consistency, while Diversity Conscious Refined Random Forests prune trees and features to enforce ensemble diversity and reduce computational redundancy without sacrificing accuracy (Sun et al., 2015, Bhattarai et al., 1 Jul 2025).
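The kernel view in the list above can be illustrated with scikit-learn's `apply` method, which returns the leaf index of each sample in each tree. The sketch assumes the connection-function form of the kernel, $K(x, X_i)$ = fraction of trees in which $x$ and $X_i$ share a leaf. The helper name `forest_kernel` and the toy data are hypothetical, and the resulting kernel estimate is related to, but not identical to, the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data (illustrative only): the response depends on feature 0 alone.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(
    n_estimators=100, min_samples_leaf=5, random_state=0
).fit(X, y)


def forest_kernel(forest, X_train, x_query):
    """K(x, X_i): fraction of trees in which x and X_i fall in the same leaf."""
    train_leaves = forest.apply(X_train)                 # shape (n_train, n_trees)
    query_leaves = forest.apply(x_query.reshape(1, -1))  # shape (1, n_trees)
    return (train_leaves == query_leaves).mean(axis=1)


x0 = np.array([0.8, 0.5, 0.5])
k = forest_kernel(forest, X, x0)

# Kernel estimate: responses weighted by how often they share a leaf with x0.
kerf_prediction = float(k @ y / k.sum())
print(round(kerf_prediction, 3), round(float(forest.predict(x0.reshape(1, -1))[0]), 3))
```

Training points that rarely share a leaf with `x0` receive weights near zero, which is the sense in which the forest behaves like an adaptive nearest-neighbour smoother.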

6. Random Forests under Dependent Data and Noise Structure

The vanilla random forest assumes independent and identically distributed (i.i.d.) observations and errors. Recent theoretical advances extend random forests to settings with dependent data, such as spatial or time series contexts. The generalized-least-squares extension (RF-GLS) is shown to be consistent under $\beta$-mixing errors, and properly accounting for dependence (e.g., via pre-whitening and global covariance-weighted splits) yields substantial improvements in both theoretical and empirical performance (Saha et al., 2020).

Moreover, exogenous randomness—feature subsampling and random tie-breaking—plays a central role in decorrelating trees, reducing both bias and variance relative to single-tree methods, and can exploit the presence of noise features to enhance ensemble diversity ("blessing of noise features") (Mei et al., 2024).

7. Empirical Performance, Limitations, and Practical Guidance

Extensive empirical evaluations confirm the robustness and adaptability of random forests:

  • In high-dimensional, sparse settings, forests maintain accuracy by effectively focusing on informative features and avoiding overfitting to noise (Scornet et al., 2014, Biau, 2010).
  • With appropriate tuning of subsample size and tree depth, forests routinely outperform unpruned or non-subsampled baselines, with practical gains in test error of 5–10% (Duroux et al., 2016).
  • Enhanced variants such as regression-enhanced forests, conditional density forests, and diversity-conscious ensembles provide further improvements in specialized settings (e.g., extrapolation, distributional estimation, resource-constrained inference) (Zhang et al., 2019, Ćevid et al., 2020, Bhattarai et al., 1 Jul 2025).

However, random forests may suffer in structured data domains where feature-space geometry is not captured by axis-aligned splits, motivating the development of manifold-aware and oblique projection forests (Li et al., 2019, Tomita et al., 2015).

Tuning guidelines emphasize balancing bias and variance through joint optimization of tree depth, subsample size, feature subsampling rate, and stopping rules, with formal design-based analysis providing quantitative pathways for understanding parameter impacts (O'Connell, 13 Feb 2026, Duroux et al., 2016). In all cases, increasing the number of trees $T$ beyond a moderate value yields diminishing returns once the Monte Carlo variance falls beneath the intrinsic structure-induced variance floor.
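The diminishing returns from adding trees can be seen in a toy simulation. The sketch below does not fit real trees: it assumes each "tree prediction" is a shared component (common to all trees in a forest, with variance `shared_var`) plus an independent component (variance `own_var`). The function name and variance values are hypothetical.

```python
import random
from statistics import pvariance


def simulate_forest_variance(n_trees, shared_var=0.2, own_var=0.8, n_reps=2000, seed=0):
    """Empirical variance of the average of n_trees correlated toy predictions."""
    rng = random.Random(seed)
    averages = []
    for _ in range(n_reps):
        # Shared component: identical for every tree in this replicate forest.
        shared = rng.gauss(0.0, shared_var ** 0.5)
        # Independent components: averaged over trees, so their variance shrinks as 1/T.
        own = sum(rng.gauss(0.0, own_var ** 0.5) for _ in range(n_trees)) / n_trees
        averages.append(shared + own)
    return pvariance(averages)


for n_trees in (1, 10, 100):
    print(n_trees, round(simulate_forest_variance(n_trees), 3))
```

In this toy model the variance of the averaged prediction is `shared_var + own_var / n_trees`, so going from 1 to 10 trees removes most of the reducible variance, while going from 10 to 100 changes little because the shared component dominates.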

