
Centered Random Forests (CRF)

Updated 23 June 2025

Centered Random Forests (CRF) are a class of ensemble learning algorithms in which each decision tree is constructed using splits performed at the center of feature-space cells, with the splitting coordinate chosen uniformly at random at every node. Originally formulated to achieve rigorous theoretical properties and to facilitate connections to kernel methods, CRFs and their kernel-based analogs (termed KeRF) have gained prominence in theoretical and applied machine learning for their interpretability, explicit analytic structure, and suitability for high-dimensional and imbalanced data tasks.

1. Definition and Construction of Centered Random Forests

A centered random forest is constructed as an ensemble of decision trees in which, at each node, a coordinate (feature) is chosen uniformly at random and the split is performed at the center (midpoint) of the cell along that coordinate. Trees are typically grown to a fixed depth $k$ so that each tree contains exactly $2^k$ leaf nodes. The resulting partition of the unit hypercube $[0,1]^d$ is independent of the data, depending only on the random splitting sequence. Each tree predicts by averaging the outcomes of the training samples within the terminal cell containing the test point.

The prediction of the entire CRF ensemble at a point $\mathbf{x}$ aggregates over all trees by computing the mean of the individual tree predictions:

$m_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^M m_n(\mathbf{x}, \Theta_j)$

where $\Theta_j$ denotes the tree-specific randomization and $m_n(\mathbf{x}, \Theta_j)$ is the average outcome in the terminal cell containing $\mathbf{x}$ in tree $j$.
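
As a concrete illustration of this construction, the following minimal Python sketch (our own, with hypothetical names; it is not a reference implementation) draws the random centered splits along the path of a query point, averages the responses of training points in the resulting terminal cell, and then averages over trees exactly as in the formula above.

```python
import numpy as np

def centered_tree_predict(x, X, y, depth, rng):
    """Prediction of one random centered tree of the given depth at x.

    Only the chain of cells containing x is materialised: at each level a
    coordinate is drawn uniformly at random and the current cell is split at
    its midpoint along that coordinate (the centered rule); this is all that
    is needed to predict at a single query point.
    """
    d = X.shape[1]
    lower, upper = np.zeros(d), np.ones(d)      # current cell, starts as [0,1]^d
    in_cell = np.ones(len(X), dtype=bool)       # training points inside the cell
    for _ in range(depth):
        j = rng.integers(d)                     # splitting coordinate, uniform at random
        mid = 0.5 * (lower[j] + upper[j])       # centered (midpoint) split
        if x[j] <= mid:
            upper[j] = mid
            in_cell &= (X[:, j] <= mid)
        else:
            lower[j] = mid
            in_cell &= (X[:, j] > mid)
    # average response in the terminal cell (0 by convention if the cell is empty)
    return y[in_cell].mean() if in_cell.any() else 0.0

def crf_predict(x, X, y, depth=4, n_trees=200, seed=0):
    """CRF ensemble prediction: mean of the individual tree predictions."""
    rng = np.random.default_rng(seed)
    return np.mean([centered_tree_predict(x, X, y, depth, rng) for _ in range(n_trees)])

# toy usage on a 2-D regression problem
rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 2))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=500)
print(crf_predict(np.array([0.3, 0.7]), X, y))
```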

2. Kernel Representation (KeRF) and Explicit Kernel Structure

A significant theoretical advance is the representation of CRF predictions as a kernel estimator, where the “kernel” quantifies the probability that two points are assigned to the same terminal cell under the random partition process. In the infinite ensemble limit, this approach yields the so-called Kernel based on Random Forests (KeRF).

The kernelized CRF (CRF-KeRF) prediction at $\mathbf{x}$ is

$\widetilde{m}_{\infty, n}^{cc}(\mathbf{x}) = \frac{ \sum_{i=1}^n Y_i K_k^{cc}(\mathbf{x}, \mathbf{X}_i) }{ \sum_{\ell=1}^n K_k^{cc}(\mathbf{x}, \mathbf{X}_\ell) }$

where the kernel $K_k^{cc}$ is given by

$K_{k}^{cc}(\mathbf{x}, \mathbf{z}) = \sum_{k_{1}+\cdots+k_{d}=k} \frac{k!}{k_{1}!\cdots k_{d}!}\left(\frac{1}{d}\right)^k \prod_{j=1}^d \mathds{1}_{\lceil 2^{k_j} x_j \rceil = \lceil 2^{k_j} z_j \rceil}$

This explicit multinomial formula computes the probability, over the randomness in tree construction, that $\mathbf{x}$ and $\mathbf{z}$ share the same splitting history across all coordinates for a total of $k$ splits.

The kernel is positive semi-definite, interpretable as a “connection probability,” and enables the application of kernel methods theory to CRFs (Scornet, 2015).
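
A direct transcription of the kernel formula helps make it tangible. The sketch below is illustrative only (the function names are ours): it enumerates the compositions $k_1+\cdots+k_d=k$, evaluates $K_k^{cc}$ by the multinomial formula, and uses it to form the infinite-ensemble KeRF prediction as a kernel-weighted average of the responses. The enumeration is exponential in $d$, so it is meant for small examples, not efficiency.

```python
import numpy as np
from math import factorial, prod
from itertools import product

def centered_kernel(x, z, k):
    """K_k^cc(x, z): probability that x and z end up in the same cell of a
    random centered tree built with k splits (explicit multinomial formula)."""
    d = len(x)
    total = 0.0
    # enumerate all compositions (k_1, ..., k_d) with k_1 + ... + k_d = k
    for ks in product(range(k + 1), repeat=d):
        if sum(ks) != k:
            continue
        same_cell = all(np.ceil(2 ** kj * xj) == np.ceil(2 ** kj * zj)
                        for kj, xj, zj in zip(ks, x, z))
        if same_cell:
            multinom = factorial(k) // prod(factorial(kj) for kj in ks)
            total += multinom * (1.0 / d) ** k
    return total

def kerf_predict(x, X, y, k):
    """Infinite-ensemble centered KeRF prediction at x (kernel-weighted mean)."""
    w = np.array([centered_kernel(x, xi, k) for xi in X])
    return np.dot(w, y) / w.sum() if w.sum() > 0 else 0.0

# two nearby points share most splitting histories, so the kernel value is large
print(centered_kernel([0.30, 0.70], [0.32, 0.69], k=4))
```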

3. Theoretical Properties: Consistency and Convergence Rates

Centered KeRF estimators are shown to be consistent under standard regularity conditions (e.g., Lipschitz-continuous regression function, uniform feature distribution), and explicit upper bounds for the mean squared risk are available. Notably, for a CRF of depth $k$ and sample size $n$, if $k \rightarrow \infty$ and $n/2^k \rightarrow \infty$, the estimator satisfies

$\mathbb{E}\left[\left(\widetilde{m}_{\infty, n}^{cc}(\mathbf{X}) - m(\mathbf{X})\right)^2\right] \leq C_1\, n^{-1/(3 + d\log 2)}(\log n)^2$

where $C_1 > 0$ is a constant and $d$ is the feature dimension (Scornet, 2015). This rate, while generally slower than the minimax optimal rate $n^{-2/(d+2)}$ for kernel regression, improves upon previous rates for nonadaptive random forests in high dimensions and reflects the interplay between cell diameter (bias) and cell occupancy (variance).

Recent work (Iakovidis et al., 2023) further sharpens this rate to

$n^{-1/(1 + d\log 2)} \log n$

and provides a detailed analysis of the associated reproducing kernel Hilbert space (RKHS), including exact kernel eigenstructure and effective RKHS dimension. The RKHS has much lower effective dimension than the space of all possible tree paths, implicitly regularizing function estimation.
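
To get a numerical feel for these exponents, the short snippet below (purely illustrative) prints the rate exponents of the original CRF-KeRF bound, the sharpened bound, and the minimax benchmark for a few dimensions; larger exponents mean faster convergence, and the logarithmic factors are ignored.

```python
import math

# rate exponents alpha in n^{-alpha}, logarithmic factors ignored
for d in (2, 5, 10, 20):
    scornet = 1 / (3 + d * math.log(2))     # n^{-1/(3 + d log 2)} (log n)^2
    sharpened = 1 / (1 + d * math.log(2))   # n^{-1/(1 + d log 2)} log n
    minimax = 2 / (d + 2)                   # n^{-2/(d + 2)}
    print(f"d={d:2d}  original {scornet:.3f}  sharpened {sharpened:.3f}  minimax {minimax:.3f}")
```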

4. Empirical Properties and Comparisons

Empirical evaluations have demonstrated that:

  • Centered KeRF performs comparably to, or better than, standard centered random forests, especially in nonadaptive settings where cell sizes are fixed and may be irregular.
  • Performance is assessed primarily via empirical $L^2$ prediction risk.
  • For Breiman forests (whose feature and split selection are adaptive to the data), kernelization (KeRF) yields essentially equivalent prediction risk.
  • The performance of finite KeRF converges to its infinite (theoretical) limit as the number of trees increases.
  • Computational overhead is similar to standard random forests for finite ensembles; explicit kernel computation can be intensive for large $n$ or $d$.
  • Bootstrapping has little impact on empirical performance for either ensemble or kernelized estimators (Scornet, 2015).

A summary comparison is provided below:

| Aspect | Centered RF | Centered KeRF (CRF-KeRF) |
|---|---|---|
| Prediction aggregation | Average per tree | Average over all data points via kernel |
| Kernel function | Implicit | Explicit, analytic $K_k^{cc}$ |
| Interpretability | Partition-dependent | Connection probability |
| Consistency rate (upper bound) | $n^{-3/(4d\log 2+3)}$ | $n^{-1/(3+d\log 2)}(\log n)^2$ |
| Empirical performance | Strong | Comparable/superior for nonadaptive forests |
| Theoretical analysis | Challenging | Tractable via kernel methods |

5. Application to Context-Dependent Feature Analysis

The centered random forest framework has been extended to context-dependent feature relevance analysis (Sutera et al., 2016). In this setting, the relevance of input variables to the prediction target is allowed to depend on a context variable (e.g., gender, environment, disease subtype). CRF-based mutual information scores are computed at each node, enabling detection and quantification of variables whose predictive power is modulated by context. The key identification criterion is

$Imp^{|x_c|}(X_m) = \frac{1}{N_T} \sum_T \sum_{t:v(s_t)=X_m} p(t) \left| I(Y; X_m \mid t) - I(Y; X_m \mid t, X_c=x_c) \right|$

A variable is context-independent if and only if $Imp^{|x_c|}(X_m) = 0$ for all context values. This methodology is robust to nonlinear and multivariate interactions and can reveal context-complementary or context-redundant structures in real and synthetic data.
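
The sketch below is our own simplified rendering of how such a score might be computed, not the authors' implementation: it assumes discretized features (so plug-in mutual information estimates apply) and that each node of the forest is summarized by the training indices reaching it and the feature it splits on, with $p(t)$ taken as the fraction of training points reaching the node.

```python
import numpy as np

def mutual_information(a, b):
    """Plug-in mutual information between two discrete-valued arrays."""
    mi, n = 0.0, len(a)
    for av in np.unique(a):
        for bv in np.unique(b):
            p_ab = np.count_nonzero((a == av) & (b == bv)) / n
            p_a = np.count_nonzero(a == av) / n
            p_b = np.count_nonzero(b == bv) / n
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def context_importance(nodes, X, y, m, c, x_c, n_trees):
    """Context-dependent importance of feature m given context X_c = x_c.

    `nodes` is a flat list over all trees; each entry is a dict holding
    'samples' (integer indices of training points reaching the node) and
    'feature' (index of the variable the node splits on).
    """
    n = len(y)
    score = 0.0
    for node in nodes:
        if node['feature'] != m:
            continue                          # inner sum: nodes that split on X_m
        idx = np.asarray(node['samples'])
        p_t = len(idx) / n                    # p(t): fraction of samples in the node
        mi_all = mutual_information(y[idx], X[idx, m])
        ctx = idx[X[idx, c] == x_c]           # restrict to context value X_c = x_c
        mi_ctx = mutual_information(y[ctx], X[ctx, m]) if len(ctx) else 0.0
        score += p_t * abs(mi_all - mi_ctx)
    return score / n_trees                    # 1/N_T: average over trees
```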

Applications include:

  • Biomedical science: context-specific biomarkers or gene regulatory interactions
  • Social science and economics: demographic- or time-specific covariate relevance
  • Artificial intelligence: conditional informativeness in multi-source data

6. Infinite CRF, Asymptotic Normality, and Imbalanced Classification

For the infinite ensemble limit (ICRF), recent results establish a Central Limit Theorem (CLT) with explicit rate and variance constants, allowing for statistical inference on predictions made by centered random forests (Mayala et al., 10 Jun 2025). If the ensemble is trained on a rebalanced dataset (e.g., to counter class imbalance), the estimator is biased but can be corrected via an importance sampling (IS) adjustment, yielding an IS-ICRF estimator:

$\widehat\mu^{\mathrm{IS-ICRF}}(\mathbf{x}) = \frac{n_1(1-p')\,\widehat\mu_{\mathrm{RB}}^{\mathrm{ICRF}}(\mathbf{x})}{p'\, n_0 \left(1-\widehat\mu_{\mathrm{RB}}^{\mathrm{ICRF}}(\mathbf{x})\right) + n_1 (1-p')\, \widehat\mu_{\mathrm{RB}}^{\mathrm{ICRF}}(\mathbf{x})}$

This debiased estimator enjoys a CLT centered at the true regression function and displays significant variance reduction relative to ICRF trained on the original, highly imbalanced dataset, particularly as the minority class proportion decreases. Experimental results confirm the theoretical rates and variances in both the idealized ICRF and Breiman’s practical random forests.
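
A minimal sketch of applying this correction is given below; it is our own illustration, assuming that $n_1$ and $n_0$ denote the minority and majority class counts in the original sample and that $p'$ is the minority proportion used for rebalancing.

```python
import numpy as np

def is_icrf_correct(mu_rb, n1, n0, p_prime):
    """Importance-sampling correction of a rebalanced ICRF probability estimate.

    mu_rb   : estimate(s) of P(Y = 1 | x) from a forest trained on rebalanced data
    n1, n0  : minority / majority class counts in the original sample (assumed meaning)
    p_prime : minority class proportion after rebalancing (assumed meaning)
    """
    mu_rb = np.asarray(mu_rb, dtype=float)
    num = n1 * (1 - p_prime) * mu_rb
    den = p_prime * n0 * (1 - mu_rb) + n1 * (1 - p_prime) * mu_rb
    return num / den

# example: a 1% minority class that was rebalanced to 50% during training
print(is_icrf_correct(mu_rb=0.6, n1=100, n0=9900, p_prime=0.5))  # ~0.015
```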

7. Discussion, Impact, and Future Directions

Centered Random Forests provide a rigorous, analyzable framework for ensemble learning by marrying the mechanics of random partitioning with the interpretability and analytic tractability of kernel methods. The explicit nature of their connection kernel enables precise bias-variance analysis, supports theoretical advances such as sharp convergence rates, and facilitates new applications in feature analysis and imbalanced learning. Their data-independent splitting strategy simplifies statistical analysis while maintaining competitive empirical performance, particularly in nonadaptive regimes or high-dimensional problems. Extensions to generalized feature analysis and bias-corrected inference in imbalanced classification underline their versatility and foundational role in modern nonparametric statistical learning.