
Centered Random Forests (CRF)

Updated 23 June 2025

Centered Random Forests (CRF) are a class of ensemble learning algorithms in which each decision tree is constructed using splits performed at the center of feature-space cells, with the splitting coordinate chosen uniformly at random at every node. Originally formulated to achieve rigorous theoretical properties and to facilitate connections to kernel methods, CRFs and their kernel-based analogs (termed KeRF) have gained prominence in theoretical and applied machine learning for their interpretability, explicit analytic structure, and suitability for high-dimensional and imbalanced data tasks.

1. Definition and Construction of Centered Random Forests

A centered random forest is constructed as an ensemble of decision trees in which, at each node, a coordinate (feature) is chosen uniformly at random and the split is performed at the center (midpoint) of the cell along that coordinate. Trees are typically grown to a fixed depth $k$ so that each tree contains exactly $2^k$ leaf nodes. The resulting partition of the unit hypercube $[0,1]^d$ is independent of the data, depending only on the random splitting sequence. Each tree predicts by averaging the outcomes of the training samples within the terminal cell containing the test point.

The prediction of the entire CRF ensemble at a point $\mathbf{x}$ aggregates over all trees by computing the mean of the individual tree predictions:

$m_{M,n}(\mathbf{x}, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^M m_n(\mathbf{x}, \Theta_j)$

where $\Theta_j$ denotes the tree-specific randomization and $m_n(\mathbf{x}, \Theta_j)$ is the average outcome in the terminal cell containing $\mathbf{x}$ in tree $j$.
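
As a concrete illustration of this construction, the following minimal Python sketch (our own, with hypothetical names; it is not a reference implementation) draws the random centered splits along the path of a query point, averages the responses of training points in the resulting terminal cell, and then averages over trees exactly as in the formula above.

```python
import numpy as np

def centered_tree_predict(x, X, y, depth, rng):
    """Prediction of one random centered tree of the given depth at x.

    Only the chain of cells containing x is materialised: at each level a
    coordinate is drawn uniformly at random and the current cell is split at
    its midpoint along that coordinate (the centered rule); this is all that
    is needed to predict at a single query point.
    """
    d = X.shape[1]
    lower, upper = np.zeros(d), np.ones(d)      # current cell, starts as [0,1]^d
    in_cell = np.ones(len(X), dtype=bool)       # training points inside the cell
    for _ in range(depth):
        j = rng.integers(d)                     # splitting coordinate, uniform at random
        mid = 0.5 * (lower[j] + upper[j])       # centered (midpoint) split
        if x[j] <= mid:
            upper[j] = mid
            in_cell &= (X[:, j] <= mid)
        else:
            lower[j] = mid
            in_cell &= (X[:, j] > mid)
    # average response in the terminal cell (0 by convention if the cell is empty)
    return y[in_cell].mean() if in_cell.any() else 0.0

def crf_predict(x, X, y, depth=4, n_trees=200, seed=0):
    """CRF ensemble prediction: mean of the individual tree predictions."""
    rng = np.random.default_rng(seed)
    return np.mean([centered_tree_predict(x, X, y, depth, rng) for _ in range(n_trees)])

# toy usage on a 2-D regression problem
rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 2))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=500)
print(crf_predict(np.array([0.3, 0.7]), X, y))
```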

2. Kernel Representation (KeRF) and Explicit Kernel Structure

A significant theoretical advance is the representation of CRF predictions as a kernel estimator, where the “kernel” quantifies the probability that two points are assigned to the same terminal cell under the random partition process. In the infinite ensemble limit, this approach yields the so-called Kernel based on Random Forests (KeRF).

The kernelized CRF (CRF-KeRF) prediction at $\mathbf{x}$ is

$\widetilde{m}_{\infty, n}^{cc}(\mathbf{x}) = \frac{ \sum_{i=1}^n Y_i K_k^{cc}(\mathbf{x}, \mathbf{X}_i) }{ \sum_{\ell=1}^n K_k^{cc}(\mathbf{x}, \mathbf{X}_\ell) }$

where the kernel $K_k^{cc}$ is given by

$K_{k}^{cc}(\mathbf{x}, \mathbf{z}) = \sum_{k_{1}+\cdots+k_{d}=k} \frac{k!}{k_{1}!\cdots k_{d}!}\left(\frac{1}{d}\right)^k \prod_{j=1}^d \mathds{1}_{\lceil 2^{k_j} x_j \rceil = \lceil 2^{k_j} z_j \rceil}$

This explicit multinomial formula computes the probability, over the randomness in tree construction, that $\mathbf{x}$ and $\mathbf{z}$ share the same splitting history across all coordinates for a total of $k$ splits.

The kernel is positive semi-definite, interpretable as a “connection probability,” and enables the application of kernel methods theory to CRFs (Scornet, 2015).
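
A direct transcription of the kernel formula helps make it tangible. The sketch below is illustrative only (the function names are ours): it enumerates the compositions $k_1+\cdots+k_d=k$, evaluates $K_k^{cc}$ by the multinomial formula, and uses it to form the infinite-ensemble KeRF prediction as a kernel-weighted average of the responses. The enumeration is exponential in $d$, so it is meant for small examples, not efficiency.

```python
import numpy as np
from math import factorial, prod
from itertools import product

def centered_kernel(x, z, k):
    """K_k^cc(x, z): probability that x and z end up in the same cell of a
    random centered tree built with k splits (explicit multinomial formula)."""
    d = len(x)
    total = 0.0
    # enumerate all compositions (k_1, ..., k_d) with k_1 + ... + k_d = k
    for ks in product(range(k + 1), repeat=d):
        if sum(ks) != k:
            continue
        same_cell = all(np.ceil(2 ** kj * xj) == np.ceil(2 ** kj * zj)
                        for kj, xj, zj in zip(ks, x, z))
        if same_cell:
            multinom = factorial(k) // prod(factorial(kj) for kj in ks)
            total += multinom * (1.0 / d) ** k
    return total

def kerf_predict(x, X, y, k):
    """Infinite-ensemble centered KeRF prediction at x (kernel-weighted mean)."""
    w = np.array([centered_kernel(x, xi, k) for xi in X])
    return np.dot(w, y) / w.sum() if w.sum() > 0 else 0.0

# two nearby points share most splitting histories, so the kernel value is large
print(centered_kernel([0.30, 0.70], [0.32, 0.69], k=4))
```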

3. Theoretical Properties: Consistency and Convergence Rates

Centered KeRF estimators are shown to be consistent under standard regularity conditions (e.g., Lipschitz-continuous regression function, uniform feature distribution), and explicit upper bounds for the mean squared risk are available. Notably, for a CRF of depth $k$ and sample size $n$, if $k \rightarrow \infty$ and $n/2^k \rightarrow \infty$, the estimator satisfies

$\mathbb{E}\left[\left(\widetilde{m}_{\infty, n}^{cc}(\mathbf{X}) - m(\mathbf{X})\right)^2\right] \leq C_1\, n^{-1/(3 + d\log 2)}(\log n)^2$

where $C_1 > 0$ is a constant and $d$ is the feature dimension (Scornet, 2015). This rate, while generally slower than the minimax optimal rate $n^{-2/(d+2)}$ for kernel regression, improves upon previous rates for nonadaptive random forests in high dimensions and reflects the interplay between cell diameter (bias) and cell occupancy (variance).

Recent work (Iakovidis et al., 2023) further sharpens this rate to

$n^{-1/(1 + d\log 2)} \log n$

and provides a detailed analysis of the associated reproducing kernel Hilbert space (RKHS), including exact kernel eigenstructure and effective RKHS dimension. The RKHS has much lower effective dimension than the space of all possible tree paths, implicitly regularizing function estimation.
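
To get a numerical feel for these exponents, the short snippet below (purely illustrative) prints the rate exponents of the original CRF-KeRF bound, the sharpened bound, and the minimax benchmark for a few dimensions; larger exponents mean faster convergence, and the logarithmic factors are ignored.

```python
import math

# rate exponents alpha in n^{-alpha}, logarithmic factors ignored
for d in (2, 5, 10, 20):
    scornet = 1 / (3 + d * math.log(2))     # n^{-1/(3 + d log 2)} (log n)^2
    sharpened = 1 / (1 + d * math.log(2))   # n^{-1/(1 + d log 2)} log n
    minimax = 2 / (d + 2)                   # n^{-2/(d + 2)}
    print(f"d={d:2d}  original {scornet:.3f}  sharpened {sharpened:.3f}  minimax {minimax:.3f}")
```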

4. Empirical Properties and Comparisons

Empirical evaluations have demonstrated that:

  • Centered KeRF performs comparably to, or better than, standard centered random forests, especially in nonadaptive settings where cell sizes are fixed and may be irregular.
  • Performance is assessed primarily via empirical $L^2$ prediction risk.
  • For Breiman forests (whose feature and split selection are adaptive to the data), kernelization (KeRF) yields essentially equivalent prediction risk.
  • The performance of finite KeRF converges to its infinite (theoretical) limit as the number of trees increases.
  • Computational overhead is similar to standard random forests for finite ensembles; explicit kernel computation can be intensive for large $n$ or $d$.
  • Bootstrapping has little impact on empirical performance for either ensemble or kernelized estimators (Scornet, 2015).

A summary comparison is provided below:

| Aspect | Centered RF | Centered KeRF (CRF-KeRF) |
|---|---|---|
| Prediction aggregation | Average per tree | Average over all data points via kernel |
| Kernel function | Implicit | Explicit, analytic $K_k^{cc}$ |
| Interpretability | Partition-dependent | Connection probability |
| Consistency rate (upper bound) | $n^{-3/(4d\log 2+3)}$ | $n^{-1/(3+d\log 2)}(\log n)^2$ |
| Empirical performance | Strong | Comparable/superior for nonadaptive forests |
| Theoretical analysis | Challenging | Tractable via kernel methods |

5. Application to Context-Dependent Feature Analysis

The centered random forest framework has been extended to context-dependent feature relevance analysis (Sutera et al., 2016). In this setting, the relevance of input variables to the prediction target is allowed to depend on a context variable (e.g., gender, environment, disease subtype). CRF-based mutual information scores are computed at each node, enabling detection and quantification of variables whose predictive power is modulated by context. The key identification criterion is

$Imp^{|x_c|}(X_m) = \frac{1}{N_T} \sum_T \sum_{t:v(s_t)=X_m} p(t) \left| I(Y; X_m \mid t) - I(Y; X_m \mid t, X_c=x_c) \right|$

A variable is context-independent if and only if $Imp^{|x_c|}(X_m) = 0$ for all context values. This methodology is robust to nonlinear and multivariate interactions and can reveal context-complementary or context-redundant structures in real and synthetic data.
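
The sketch below is our own simplified rendering of how such a score might be computed, not the authors' implementation: it assumes discretized features (so plug-in mutual information estimates apply) and that each node of the forest is summarized by the training indices reaching it and the feature it splits on, with $p(t)$ taken as the fraction of training points reaching the node.

```python
import numpy as np

def mutual_information(a, b):
    """Plug-in mutual information between two discrete-valued arrays."""
    mi, n = 0.0, len(a)
    for av in np.unique(a):
        for bv in np.unique(b):
            p_ab = np.count_nonzero((a == av) & (b == bv)) / n
            p_a = np.count_nonzero(a == av) / n
            p_b = np.count_nonzero(b == bv) / n
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def context_importance(nodes, X, y, m, c, x_c, n_trees):
    """Context-dependent importance of feature m given context X_c = x_c.

    `nodes` is a flat list over all trees; each entry is a dict holding
    'samples' (integer indices of training points reaching the node) and
    'feature' (index of the variable the node splits on).
    """
    n = len(y)
    score = 0.0
    for node in nodes:
        if node['feature'] != m:
            continue                          # inner sum: nodes that split on X_m
        idx = np.asarray(node['samples'])
        p_t = len(idx) / n                    # p(t): fraction of samples in the node
        mi_all = mutual_information(y[idx], X[idx, m])
        ctx = idx[X[idx, c] == x_c]           # restrict to context value X_c = x_c
        mi_ctx = mutual_information(y[ctx], X[ctx, m]) if len(ctx) else 0.0
        score += p_t * abs(mi_all - mi_ctx)
    return score / n_trees                    # 1/N_T: average over trees
```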

Applications include:

  • Biomedical science: context-specific biomarkers or gene regulatory interactions
  • Social science and economics: demographic- or time-specific covariate relevance
  • Artificial intelligence: conditional informativeness in multi-source data

6. Infinite CRF, Asymptotic Normality, and Imbalanced Classification

For the infinite ensemble limit (ICRF), recent results establish a Central Limit Theorem (CLT) with explicit rate and variance constants, allowing for statistical inference on predictions made by centered random forests (Mayala et al., 10 Jun 2025). If the ensemble is trained on a rebalanced dataset (e.g., to counter class imbalance), the estimator is biased but can be corrected via an importance sampling (IS) adjustment, yielding an IS-ICRF estimator:

$\widehat\mu^{\mathrm{IS-ICRF}}(\mathbf{x}) = \frac{n_1(1-p')\,\widehat\mu_{\mathrm{RB}}^{\mathrm{ICRF}}(\mathbf{x})}{p'\, n_0 \left(1-\widehat\mu_{\mathrm{RB}}^{\mathrm{ICRF}}(\mathbf{x})\right) + n_1 (1-p')\, \widehat\mu_{\mathrm{RB}}^{\mathrm{ICRF}}(\mathbf{x})}$

This debiased estimator enjoys a CLT centered at the true regression function and displays significant variance reduction relative to ICRF trained on the original, highly imbalanced dataset, particularly as the minority class proportion decreases. Experimental results confirm the theoretical rates and variances in both the idealized ICRF and Breiman’s practical random forests.
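
A minimal sketch of applying this correction is given below; it is our own illustration, assuming that $n_1$ and $n_0$ denote the minority and majority class counts in the original sample and that $p'$ is the minority proportion used for rebalancing.

```python
import numpy as np

def is_icrf_correct(mu_rb, n1, n0, p_prime):
    """Importance-sampling correction of a rebalanced ICRF probability estimate.

    mu_rb   : estimate(s) of P(Y = 1 | x) from a forest trained on rebalanced data
    n1, n0  : minority / majority class counts in the original sample (assumed meaning)
    p_prime : minority class proportion after rebalancing (assumed meaning)
    """
    mu_rb = np.asarray(mu_rb, dtype=float)
    num = n1 * (1 - p_prime) * mu_rb
    den = p_prime * n0 * (1 - mu_rb) + n1 * (1 - p_prime) * mu_rb
    return num / den

# example: a 1% minority class that was rebalanced to 50% during training
print(is_icrf_correct(mu_rb=0.6, n1=100, n0=9900, p_prime=0.5))  # ~0.015
```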

7. Discussion, Impact, and Future Directions

Centered Random Forests provide a rigorous, analyzable framework for ensemble learning by marrying the mechanics of random partitioning with the interpretability and analytic tractability of kernel methods. The explicit nature of their connection kernel enables precise bias-variance analysis, supports theoretical advances such as sharp convergence rates, and facilitates new applications in feature analysis and imbalanced learning. Their data-independent splitting strategy simplifies statistical analysis while maintaining competitive empirical performance, particularly in nonadaptive regimes or high-dimensional problems. Extensions to generalized feature analysis and bias-corrected inference in imbalanced classification underline their versatility and foundational role in modern nonparametric statistical learning.