Best-Scored Random Forest (BRF)
- Best-Scored Random Forest (BRF) is a framework that selects the optimal tree from a set of random candidates using a penalized empirical risk criterion.
- It integrates purely random partitioning with an empirical scoring mechanism, ensuring consistency and competitive performance across various learning tasks.
- BRF’s design supports applications in binary classification, regression, density estimation, and clustering, offering scalability and adaptability through parallelizable algorithms.
The Best-Scored Random Forest (BRF) is a general framework for statistical learning, characterized by the selection of the empirically best-performing tree among a collection of purely random candidates for each tree of an ensemble. BRF has been systematically developed for binary classification, regression, nonparametric density estimation, and single-level density-based clustering. The canonical architecture features axis-aligned or oblique partitions, a regularized empirical risk or held-out score-based selection criterion, and ensemble averaging or voting. The approach yields strong theoretical guarantees on consistency and convergence rates, demonstrates scalability through parallelizable algorithms, and achieves competitive empirical performance across supervised and unsupervised tasks (Hang et al., 2019).
1. Conceptual Overview and Key Algorithmic Principles
BRF generalizes the Random Forest paradigm by replacing deterministic or data-dependent splits with purely random partition policies, and, crucially, applies a post hoc empirical scoring procedure to select the most performant tree from a batch of randomly generated candidates. This principle applies to tree-based classification, regression, and density structures, making BRF a unifying methodological paradigm.
The distinguishing feature is the two-level source of randomness: partition proposals are generated independently per tree by randomizing split coordinates, locations, and affected leaves, while the final tree at each ensemble position is chosen as the empirical risk minimizer (with or without explicit regularization). This best-scoring mechanism favors trees that improve generalization by penalizing overfitting (via penalties on tree complexity, such as the squared number of splits).
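In generic form, the selection rule can be written as follows (a sketch; the exact penalty form and normalization in the cited papers may differ):

```latex
% Best-scored selection among k random candidate trees f_1, ..., f_k,
% where p_t is the number of splits of candidate t and lambda > 0 is a
% regularization parameter:
\hat{t} \;=\; \operatorname*{arg\,min}_{t \in \{1,\dots,k\}}
  \Big[ \mathcal{R}_n(f_t) \;+\; \lambda \, p_t^{2} \Big],
\qquad
\mathcal{R}_n(f_t) \;=\; \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f_t(x_i)\big),
```

where $L$ is the task loss (0-1 loss for classification, squared loss for regression).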
Each BRF tree is constructed as follows (classification canonical version (Hang et al., 2019)):
- Generate a fixed number of candidate trees, each a purely random binary tree grown to a prescribed number of splits:
- At each split, select a current leaf uniformly at random.
- Select a split coordinate uniformly at random.
- Select a cut location uniformly at random within the leaf's extent along that coordinate.
- For each candidate tree, compute the penalized empirical risk (empirical risk plus a complexity penalty on the number of splits).
- Select the candidate tree with minimum penalized risk.
The forest comprises independent, best-scored trees, aggregated by majority vote (classification) or mean (regression/density estimation).
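The construction above can be sketched in a few dozen lines of Python. This is a minimal illustration, not the authors' implementation: leaves are represented as axis-aligned boxes, the helper names are invented, and the penalty form `lam * p**2` is one plausible instantiation of the split-count penalty.

```python
import numpy as np

def grow_random_partition(X, n_splits, rng):
    """Purely random axis-aligned partition, kept as a list of (lo, hi) boxes."""
    cells = [(X.min(axis=0).astype(float), X.max(axis=0).astype(float))]
    for _ in range(n_splits):
        i = rng.integers(len(cells))        # leaf chosen uniformly at random
        lo, hi = cells.pop(i)
        j = rng.integers(X.shape[1])        # split coordinate chosen uniformly
        c = rng.uniform(lo[j], hi[j])       # cut location chosen uniformly
        hi_left, lo_right = hi.copy(), lo.copy()
        hi_left[j], lo_right[j] = c, c
        cells += [(lo, hi_left), (lo_right, hi)]
    return cells

def cell_of(cells, x):
    for i, (lo, hi) in enumerate(cells):
        if np.all(lo <= x) and np.all(x <= hi):
            return i
    return 0  # boundary fallback

def best_scored_tree(X, y, max_splits, n_candidates, lam, rng):
    """Pick the candidate minimizing empirical risk + lam * (#splits)^2."""
    best = None
    for _ in range(n_candidates):
        p = int(rng.integers(1, max_splits + 1))
        cells = grow_random_partition(X, p, rng)
        idx = np.array([cell_of(cells, x) for x in X])
        votes = [int(y[idx == i].mean() >= 0.5) if (idx == i).any() else 0
                 for i in range(len(cells))]
        risk = np.mean([votes[i] != yi for i, yi in zip(idx, y)]) + lam * p ** 2
        if best is None or risk < best[0]:
            best = (risk, cells, votes)
    return best[1], best[2]

def forest_predict(trees, x):
    """Majority vote over independently built best-scored trees."""
    preds = [votes[cell_of(cells, x)] for cells, votes in trees]
    return int(np.mean(preds) >= 0.5)
```

Each tree is built independently, so the loop over trees parallelizes trivially.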
2. Theoretical Guarantees and Convergence Rates
BRF provides strong consistency and convergence results grounded in regularized empirical risk minimization and random partition theory.
Classification
- Under Tsybakov’s margin-noise and geometric-noise conditions, BRF with an optimally chosen number of splits attains explicit polynomial convergence rates in the sample size, governed by the noise and margin exponents and the geometric regularity of the class boundary.
- In favorable noise regimes, the rate approaches the minimax-optimal rate up to logarithmic factors for the full forest (Hang et al., 2019).
- Oracle inequalities quantify deviation from population-optimal selection; precise rates hinge on split regularization and partition dimension.
Regression
- Under Hölder smoothness of the regression function, the two-stage BRF (TBRF) achieves a polynomial convergence rate determined by the smoothness exponent and the ambient dimension, with analogous rates inherited by the BRF ensemble (Hang et al., 2019).
Density Estimation
- For Hölder-continuous densities, the convergence rate of the BRF estimator depends on the smoothness exponent, the dimension, and the tail behavior of the density: polynomial tails, exponential tails, and compact support each yield different rate exponents (Hang et al., 2019).
- The ensemble estimator inherits these rates.
- Rates measured in stronger norms show a similar dependency but are typically slower (larger exponents in the denominator).
Clustering
- Consistency is established for cluster recovery when target clusters are separated and the level set geometry meets thickness and separation requirements. Given optimal cell sizing and error thresholds, the single-level BRF clustering algorithm converges in estimated level and recovered clusters (Hang et al., 2019).
3. Algorithmic Variants and Implementation Details
BRF variants adapt the core methodology to different statistical tasks:
Classification
- Penalized empirical risk is minimized over random candidates, with regularization controlling splits.
- Adaptive splitting: Instead of always choosing the leaf to split uniformly at random, select a random data point and split the leaf containing it, thereby biasing partitioning toward data-dense regions. This tends to reduce the number of wasted or small cells, enhancing efficiency (Hang et al., 2019).
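The adaptive leaf choice amounts to a one-line change in the split loop. A sketch, under the same illustrative representation of leaves as axis-aligned (lo, hi) boxes:

```python
import numpy as np

def choose_leaf_adaptively(cells, X, rng):
    """Pick the leaf that contains a uniformly drawn training point,
    biasing future splits toward data-dense regions."""
    x = X[rng.integers(len(X))]
    for i, (lo, hi) in enumerate(cells):
        if np.all(lo <= x) and np.all(x <= hi):
            return i
    return int(rng.integers(len(cells)))  # fallback: uniform choice
```

A leaf is then chosen with probability proportional to the fraction of training mass it contains, so empty cells are never split.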
Regression (Two-Stage BRF)
- Stage 1: The feature space is adaptively partitioned into cells using random splits; the cell to split is selected with probability proportional to sample density (majority vote among random draws), promoting finer partitions in dense regions.
- Stage 2: Within each cell, a local best-scored random tree is trained as in the main protocol; leaf values can be constant means, linear predictors, or LS-SVMs, accommodating complex response structures (Hang et al., 2019).
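Both stages can be sketched with simple helpers (illustrative names; the sketch uses the simplest constant-mean leaf model, whereas the cited work also allows linear predictors or LS-SVMs):

```python
import numpy as np

def density_voted_cell(cells, X, n_votes, rng):
    """Stage 1 helper: among n_votes sampled training points, return the
    cell hit most often -- i.e., the next cell to split."""
    votes = np.zeros(len(cells), dtype=int)
    for _ in range(n_votes):
        x = X[rng.integers(len(X))]
        for i, (lo, hi) in enumerate(cells):
            if np.all(lo <= x) and np.all(x <= hi):
                votes[i] += 1
                break
    return int(votes.argmax())

def fit_cell_means(cells, X, y):
    """Stage 2, simplest leaf model: a constant mean response per cell."""
    means = []
    for lo, hi in cells:
        mask = np.all((X >= lo) & (X <= hi), axis=1)
        means.append(float(y[mask].mean()) if mask.any() else float(y.mean()))
    return means
```

Because each cell's local model is fit independently in Stage 2, this step is embarrassingly parallel, which underlies the wall-clock advantages reported below.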
Density Estimation
- Each purely random density tree is scored by average negative log-likelihood (ANLL) on a validation set; the tree with lowest ANLL is retained.
- Ensembles are averages of such best-scored trees, with partition depth, forest size, and candidate count as key hyperparameters (Hang et al., 2019).
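The ANLL score is straightforward to compute for a histogram-style density on an axis-aligned partition. A sketch with hypothetical helper names:

```python
import numpy as np

def partition_density(cells, X_train):
    """Piecewise-constant density estimate: count / (n * cell volume)."""
    n = len(X_train)
    dens = []
    for lo, hi in cells:
        vol = float(np.prod(hi - lo))
        cnt = int(np.all((X_train >= lo) & (X_train <= hi), axis=1).sum())
        dens.append(cnt / (n * vol) if vol > 0 else 0.0)
    return dens

def anll(cells, dens, X_val, eps=1e-12):
    """Average negative log-likelihood on held-out points; the candidate
    tree with the lowest ANLL is the one retained."""
    logs = []
    for x in X_val:
        f = eps
        for (lo, hi), d in zip(cells, dens):
            if np.all(lo <= x) and np.all(x <= hi):
                f = max(d, eps)
                break
        logs.append(np.log(f))
    return -float(np.mean(logs))
```

The `eps` floor guards against validation points falling in empty cells, where the raw log-likelihood would be minus infinity.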
Clustering
- BRF clustering forms a nonparametric density estimator via a BRF, then applies a level-set approach to detect connected components as clusters.
- For each density level, a neighborhood graph of the data points is constructed to find connected components; clustering is declared when two persistent connected components arise. Outliers are assigned by nearest neighbor (Hang et al., 2019).
4. Performance Evaluation and Empirical Results
BRF has been evaluated on a range of UCI classification, regression, and density estimation benchmarks.
- Classification: On datasets such as MONK’s problems, Breast-Cancer Wisconsin, and Statlog Credit, BRF matches or slightly outperforms Breiman’s RF and Extremely Randomized Trees (Extra-Trees) in test error. Adaptive splitting significantly improves runtime and split efficiency, with statistical significance under the Wilcoxon signed-rank test (Hang et al., 2019).
- Regression: TBRF achieves state-of-the-art MSE on large-scale datasets (TCO, SARCOS, MSD), often with lower wall-clock time compared to patchwork kriging and Voronoi-partition SVMs, owing to inherent parallelism (Hang et al., 2019).
- Density Estimation: BRF outperforms kernel density estimation (KDE), DHT, and NADE on both synthetic mixtures and UCI tabular data, particularly as dimensionality increases. Training complexity remains tractable and parallelizable (Hang et al., 2019).
- Clustering: On synthetic shapes (“noisy circles,” “moons,” “blobs”) and diverse real UCI datasets, BRF clustering attains or exceeds the adjusted Rand index (ARI) performance of DBSCAN, k-means, and PDF-cluster, naturally adapting to arbitrarily shaped clusters and noise (Hang et al., 2019).
5. Conditions for Consistency and Illustrative Counterexamples
The mathematical analysis demonstrates that, for consistency, the random splitting process must give every feature dimension a nonzero probability of being split. If the process restricts splitting to a strict subset of coordinates, certain labeling problems become intractable. For example, for the parity labeling of the vertices of the binary hypercube, randomizing splits along only a single coordinate leads to an error rate of 50%, even for forests, since the classifier depends solely on a single projection and cannot reconstruct the full labeling (Hang et al., 2019).
This suggests that empirical tuning of randomization and regularization parameters must not compromise the irreducible axis randomness requirement.
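The parity counterexample is easy to verify numerically. A sketch for the 3-cube, computing the error of the best possible classifier that observes only one coordinate:

```python
import numpy as np
from itertools import product

d = 3
verts = np.array(list(product([0, 1], repeat=d)))
parity = verts.sum(axis=1) % 2          # parity labels on the hypercube vertices

# Best possible classifier that sees only coordinate 0: predict the
# majority parity among vertices sharing that coordinate value.
errors = 0
for v in (0, 1):
    labels = parity[verts[:, 0] == v]
    pred = int(2 * labels.sum() >= len(labels))   # majority vote (ties -> 1)
    errors += int((labels != pred).sum())

error_rate = errors / len(verts)
print(error_rate)  # 0.5: a single projection carries no information about parity
```

For each fixed value of the observed coordinate, the remaining coordinates split evenly between even and odd parity, so no single-projection rule can beat chance.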
6. Extensions: Balanced Random Forest and Generalization Directions
While the pure BRF regime focuses on random splits, the BRF acronym has also appeared in extended forms such as Balanced Random Forest (BRF), developed for imbalanced classification (Monsalves et al., 29 Aug 2025). Here, each tree’s bootstrap sample is under-sampled to balance classes before growing a conventional Random Forest tree. This methodology, not to be conflated with purely random BRFs, has been successfully applied to Gaia DR3 astrophysical data for massive-star candidate selection, achieving high completeness but only modest purity, owing to the intrinsic difficulty of separating the classes.
More generally, the best-scored construction integrates naturally with adaptive or oblique splits, alternative leaf assignments (e.g., LS-SVMs), and higher-level learning tasks. The parallelism of the two-stage and best-scoring routines is directly compatible with distributed or large-scale machine learning.
7. Practical Considerations, Limitations, and Open Problems
BRF introduces additional hyperparameters: the number of candidate trees per ensemble position, the number of trees in the forest, the number of splits per tree, and (for adaptive partitioning) the sample size or voting scheme. Theoretical rates inform scaling relations, but empirical cross-validation remains important for practical performance tuning.
Advantages include local adaptivity, strong resistance to dimensionality, ensemble smoothing to control boundary discontinuities, consistency under mild assumptions, and straightforward parallelization.
Limitations and open directions include:
- Sensitivity to hyperparameter selection.
- Need for approximate volume calculation with oblique splits.
- Conservativeness of theoretical constants in rate bounds.
- Potential benefits of further integrating data-dependent splits, as opposed to purely random or uniformly random coordinate selection.
The BRF framework, supported by a sequence of works (Hang et al., 2019), provides a unified, theoretically robust, and computationally scalable ensemble methodology for both supervised and unsupervised machine learning.