
Extremely Randomized Trees Explained

Updated 3 January 2026
  • Extremely Randomized Trees are an ensemble method that builds decision trees with randomized feature and threshold selections, reducing variance compared to traditional forests.
  • They enhance computational efficiency by evaluating only a subset of random splits per node, making them effective in sparsely sampled regions.
  • Their theoretical consistency under probabilistic impurity decrease conditions and superior performance in boosting applications make them valuable for practical regression and classification tasks.

Extremely Randomized Trees (Extra-Trees) are a variant of ensemble methods for regression and classification that introduce increased randomization into the induction of decision trees, both in feature selection and threshold determination. The principal motivation for Extra-Trees is to further reduce variance beyond that achievable by Random Forests, enabling more robust aggregation of piecewise constant predictors, improved computational efficiency, and, in certain regimes, better handling of sparsely sampled regions in the feature space (Konstantinov et al., 2020, Blum et al., 2023).

1. Formal Definition and Algorithmic Structure

Extra-Trees, introduced by Geurts, Ernst, and Wehenkel (2006), construct a binary tree by, at each internal node:

  • Drawing $K$ features (with $K \leq m$, the number of available features) uniformly at random.
  • For each selected feature $j_k$, choosing a split threshold $t_k$ uniformly at random within the range of observed values of feature $j_k$ for the current node's subset.
  • Computing the standard impurity reduction (e.g., variance reduction for regression).
  • Selecting the split $(j_{k^*}, t_{k^*})$ that maximizes this reduction.

Unlike standard decision trees or classical Random Forests, which search for the optimal split among all candidate thresholds (typically midpoints in sorted feature values), Extra-Trees randomize not just feature selection but also threshold, expanding the diversity among trees in the ensemble.

For regression, the impurity at a node with data $D = \{(x_i, y_i)\}_{i=1}^{n}$ is
$$I(D) = \mathrm{Var}(D) = \frac{1}{|D|} \sum_{(x, y) \in D} (y - \bar{y})^2,$$
where $\bar{y}$ denotes the sample mean of the responses in $D$. At each node, Extra-Trees propose $K$ splits and branch on the split yielding the maximal decrease in variance.
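This split criterion is straightforward to compute directly; the following is a minimal sketch in NumPy (the helper names `variance` and `variance_reduction` are illustrative, not from the cited papers):

```python
import numpy as np

def variance(y):
    """Impurity of a node: mean squared deviation of its targets."""
    return float(np.mean((y - np.mean(y)) ** 2)) if len(y) else 0.0

def variance_reduction(y, y_left, y_right):
    """Weighted decrease in variance achieved by a candidate split."""
    n = len(y)
    weighted_child = (len(y_left) * variance(y_left)
                      + len(y_right) * variance(y_right)) / n
    return variance(y) - weighted_child

# A split that perfectly separates two response levels removes all variance.
y = np.array([0.0, 0.0, 1.0, 1.0])
gain = variance_reduction(y, y[:2], y[2:])  # parent variance is 0.25, children 0
```

By the law of total variance, this quantity is always non-negative for any valid split, which is why maximizing it over the $K$ random proposals is well defined.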

The process continues recursively until a stopping criterion is reached. At the leaf level, prediction is usually the mean (for regression) or class-distribution estimate (for classification).

The following outlines the core Extra-Trees algorithm for a single regression tree:

def BuildExtraTree(D, depth):
    if depth == 0 or len(D) < n_min:
        return Leaf(mean_y(D))
    candidates = []
    for k in range(K):
        j_k = random_feature()                                # pick a feature uniformly at random
        t_k = random_uniform(min_xj(D, j_k), max_xj(D, j_k))  # random threshold in its observed range
        D_left, D_right = split_data(D, j_k, t_k)
        candidates.append((variance_reduction(D, D_left, D_right),
                           j_k, t_k, D_left, D_right))
    score, j_star, t_star, D_left, D_right = max(candidates)  # best of the K random proposals
    left_subtree = BuildExtraTree(D_left, depth - 1)
    right_subtree = BuildExtraTree(D_right, depth - 1)
    return Node(j_star, t_star, left_subtree, right_subtree)
Here, $K$ is a tunable parameter controlling the degree of randomization: $K = 1$ maximizes randomization (each node splits on a single random feature at a random threshold), while $K = m$ (all features, each with a random threshold) is the standard Extra-Trees regime (Konstantinov et al., 2020).
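In practice one rarely implements this recursion by hand; scikit-learn ships an implementation in which the `max_features` parameter plays the role of $K$. A minimal usage sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic regression problem: 200 samples, 10 features.
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# max_features corresponds to K: the fraction (or count) of random
# feature/threshold candidates scored per node; 1.0 means all m features.
model = ExtraTreesRegressor(n_estimators=100, max_features=1.0, random_state=0)
model.fit(X, y)
preds = model.predict(X[:3])
```

Unlike Random Forests, `ExtraTreesRegressor` defaults to `bootstrap=False`, so each tree sees the full training sample and diversity comes entirely from the randomized splits.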

2. Theoretical Properties and Consistency

A key theoretical advance is the establishment of $L_2$-consistency for the Extra-Trees ensemble under probabilistic impurity-decrease conditions. Specifically (Blum et al., 2023):

  • The Extra-Trees estimator constructs trees up to depth $k_n \to \infty$, with $k_n < c \log_2 n$, $c < 1/4$, and aggregates responses at each leaf.
  • Consistency is established under the "probabilistic sufficient impurity decrease" (SID) condition (C1*), which requires that, for each cell, there exists a probability $\delta > 0$ such that, among $n_{\mathrm{split}}$ random uniform splits, a candidate split realizing a sufficient fractional impurity decrease is found with probability at least $\delta$.
  • Any regression function $m$ for which sufficiently many random splits yield nontrivial impurity reduction can be consistently estimated, even if no single axis-aligned split is optimal (e.g., for highly interactive targets).

The main consistency result states that
$$\lim_{n \to \infty} \mathbb{E}\left[(\hat{m}_T(X) - m(X))^2\right] = 0$$
for tree ensembles with $n_{\mathrm{split}} \geq -2\log 2 / \log(1 - \delta)$ and tree-depth settings as above.
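The lower bound on $n_{\mathrm{split}}$ is easy to evaluate numerically; the values of $\delta$ below are illustrative, not taken from the paper:

```python
import math

def min_n_split(delta):
    """Smallest integer n_split with n_split >= -2*ln(2) / ln(1 - delta)."""
    return math.ceil(-2 * math.log(2) / math.log(1 - delta))

# Larger delta (random splits succeed more often) permits fewer proposals.
bounds = {delta: min_n_split(delta) for delta in (0.5, 0.2, 0.1)}
```

For $\delta = 0.1$ the bound is 14 proposals per node, which is consistent with the practical recommendation of $n_{\mathrm{split}} \approx 10$-$20$ given in Section 7.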

This covers target functions beyond those covered by classical SID conditions, notably functions requiring multiple coordinate splits to locally reduce error ("pure interaction" forms). This suggests Extra-Trees achieve $L_2$-consistency across a strictly broader function class than purely deterministic-split forests (Blum et al., 2023).

3. Comparison with CART and Random Forests

Decision forests such as Random Forests (Breiman, 2001) and Extra-Trees differ in their sources of randomization and split-search strategy:

| Method        | Feature Subsampling | Threshold Selection  | Split Score          |
|---------------|---------------------|----------------------|----------------------|
| CART          | No                  | All midpoints        | Best overall         |
| Random Forest | Random (subset)     | All midpoints        | Best per subset      |
| Extra-Trees   | Random (subset)     | Random within range  | Best random proposal |

Standard CART-based methods perform an exhaustive threshold search per feature, which is computationally heavy ($O(Nm)$ per node) and, because split points are deterministically aligned with the data, can introduce discontinuities when training data are sparse. Extra-Trees, by randomizing the split position within the observed range, break large inter-point gaps differently across trees, improving ensemble smoothness and reducing persistent discontinuities (Konstantinov et al., 2020).

A partially randomized variant (termed "Partially Randomized Trees") considers all $m$ features per node but chooses a single uniform random threshold per feature and splits on the best of the $m$ random candidates. This hybrid maintains computational efficiency ($O(m)$ per node), further accelerates tree induction (yielding roughly $N$-fold speed-ups within boosting), and injects sufficient variability in split locations to fill gaps in sparse regions (Konstantinov et al., 2020).
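The per-node selection rule of this variant can be sketched as follows (the function name and structure are illustrative; the criterion is the variance reduction defined in Section 1):

```python
import numpy as np

def partially_randomized_split(X, y, rng):
    """Score one uniform random threshold per feature; return the best split."""
    n, m = X.shape
    parent_var = np.var(y)
    best = (-np.inf, None, None)
    for j in range(m):                      # all m features are considered
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:                        # constant feature: no valid split
            continue
        t = rng.uniform(lo, hi)             # single random threshold for feature j
        mask = X[:, j] <= t
        if mask.all() or not mask.any():
            continue
        child_var = (mask.sum() * np.var(y[mask])
                     + (~mask).sum() * np.var(y[~mask])) / n
        gain = parent_var - child_var
        if gain > best[0]:
            best = (gain, j, t)
    return best  # (variance reduction, feature index, threshold)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 2] > 0).astype(float)            # target depends only on feature 2
gain, j, t = partially_randomized_split(X, y, rng)
```

Only $m$ candidate splits are scored, so the per-node cost stays $O(m)$ regardless of the node's sample count (aside from the linear pass to evaluate each candidate).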

4. Computational Aspects and Scalability

The primary computational advantage of Extra-Trees arises from the elimination of the exhaustive axis-aligned midpoint search. At each node, only $K$ split evaluations are performed:

  • Standard CART: $O(Nm)$ per node (all midpoints).
  • Extra-Trees: $O(K)$ per node (typically $K \ll N$).
  • Partially Randomized Trees: $O(m)$ per node.

For a full binary tree of depth $d$, building a single Extra-Tree requires $O(K \cdot 2^d)$ operations, compared to $O(Nm \cdot 2^d)$ for classical trees. In gradient boosting or forest settings with $M$ trees, this enables a considerable global speed-up. Empirically, training times are reduced by $4$-$5\times$ relative to standard Gradient Boosting Machines (GBM) with deterministic trees on large-scale datasets (Konstantinov et al., 2020).

5. Implications for Function Approximation and Smoothness

Deterministic trees (CART) create discontinuities at leaf boundaries that are restricted to observed midpoints, leading to persistent stepwise errors in sparsely sampled regions. Because tree partitions vary randomly across an Extra-Trees ensemble, large inter-point gaps are divided at different locations in different trees, so the averaged ensemble achieves denser and more evenly distributed piecewise-constant approximations.

This property is especially beneficial in regression: the expected discontinuity of the ensemble’s prediction is smaller, and the approximation adapts more smoothly in regions where data is sparse — boosting algorithms with partially randomized trees demonstrate mitigated step artifacts compared to their fully deterministic counterparts (Konstantinov et al., 2020).

6. Empirical Performance and Application in Boosting

Experiments on nine regression benchmarks (California housing, Boston, diabetes, HouseART, Friedman 1–3, various synthetic datasets) indicate that Gradient Boosting Machines (GBM) using partially randomized trees achieve lower mean squared error (MSE) on 8 out of 9 datasets, outperforming Random Forests, Extra-Trees regression forests, XGBoost, and CatBoost under comparable cross-validation regimes and hyperparameter search.

For boosting, the integration of partially randomized trees involves computing pseudo-residuals at each step, fitting a regression tree using randomized threshold splits for each feature, and updating the additive model. The ensemble of such partially randomized base learners yields improved accuracy and computational tractability (Konstantinov et al., 2020).
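The boosting loop described above can be sketched with scikit-learn's single-tree `ExtraTreeRegressor` standing in for the paper's partially randomized base learner (an approximation, not the exact algorithm of Konstantinov et al.); for squared-error loss, the pseudo-residuals are simply $y - F(x)$:

```python
import numpy as np
from sklearn.tree import ExtraTreeRegressor  # single randomized-threshold tree

def boost_randomized_trees(X, y, n_rounds=50, lr=0.1, max_depth=3, seed=0):
    """Squared-error gradient boosting with randomized-split base trees."""
    F = np.full(len(y), y.mean())            # initial constant model
    trees = []
    for t in range(n_rounds):
        residuals = y - F                    # pseudo-residuals for L2 loss
        tree = ExtraTreeRegressor(max_depth=max_depth, random_state=seed + t)
        tree.fit(X, residuals)               # fit base learner to residuals
        F += lr * tree.predict(X)            # additive model update
        trees.append(tree)
    return y.mean(), trees

def predict(init, trees, X, lr=0.1):
    return init + lr * sum(tree.predict(X) for tree in trees)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2
init, trees = boost_randomized_trees(X, y)
train_mse = np.mean((predict(init, trees, X) - y) ** 2)
```

Because each base tree fits residuals with random thresholds, fitting is cheap per round, and the additive combination recovers the accuracy lost to per-tree randomization.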

7. Parameter Constraints and Practical Considerations

Successful application of Extra-Trees requires setting:

  • Number of random candidate splits per node ($K$ or $n_{\mathrm{split}}$): $n_{\mathrm{split}} \gtrsim 10$-$20$ is typical to ensure a high probability of sufficient impurity decrease.
  • Tree depth: $k \to \infty$ with $k < c \log_2 n$, $c < 1/4$; equivalently, splitting until each leaf has fewer than $n^\eta$ points, $\eta < 1/8$.
  • No further regularization (e.g., a minimum leaf size) is required beyond bounding the maximum tree depth (Blum et al., 2023).
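These constraints translate directly into concrete hyperparameters; a small helper with illustrative constants $c = 0.2$ and $\eta = 0.1$ (both within the theorem's admissible ranges, but chosen here for demonstration):

```python
import math

def depth_and_leaf_bounds(n, c=0.2, eta=0.1):
    """Max tree depth k < c*log2(n) and leaf-size threshold n**eta."""
    max_depth = math.floor(c * math.log2(n))   # deepest admissible tree
    min_leaf = math.ceil(n ** eta)             # stop splitting below this size
    return max_depth, min_leaf

small = depth_and_leaf_bounds(1_000)           # modest sample
large = depth_and_leaf_bounds(1_000_000)       # large sample
```

The sublogarithmic depth bound grows very slowly with $n$, which is what keeps the variance of each randomized tree under control as the sample grows.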

A plausible implication is that, absent post-pruning or classical regularization, Extra-Trees ensembles can achieve consistency and low variance provided random-split parameters are properly tuned.


References

  • Konstantinov, A. V., Utkin, L. V. (2020). "Gradient Boosting Machine with Partially Randomized Decision Trees."
  • Blum et al. (2023). "Consistency of Random Forest Type Algorithms under a Probabilistic Impurity Decrease Condition."