Centered Random Forests (CRF)
Centered Random Forests (CRF) are a class of ensemble learning algorithms in which each decision tree is constructed using splits performed at the center of feature-space cells, with the splitting coordinate chosen uniformly at random at every node. Originally formulated to achieve rigorous theoretical properties and to facilitate connections to kernel methods, CRFs and their kernel-based analogs (termed KeRF) have gained prominence in theoretical and applied machine learning for their interpretability, explicit analytic structure, and suitability for high-dimensional and imbalanced data tasks.
1. Definition and Construction of Centered Random Forests
A centered random forest is constructed as an ensemble of decision trees in which, at each node, a coordinate (feature) is chosen uniformly at random and the split is performed at the center (midpoint) of the cell along that coordinate. Trees are typically grown to a fixed depth $k$, so that each tree contains exactly $2^k$ leaf nodes. The resulting partition of the unit hypercube is independent of the data, depending only on the random splitting sequence. Each tree predicts by averaging the outcomes of the training samples within the terminal cell containing the test point.
The prediction of the entire CRF ensemble at a point $\mathbf{x}$ aggregates over all trees by computing the mean of the individual tree predictions,

$m_{M,n}(\mathbf{x}; \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{m=1}^{M} m_n(\mathbf{x}; \Theta_m),$

where $\Theta_m$ denotes the tree-specific randomization and $m_n(\mathbf{x}; \Theta_m)$ is the average outcome of the training samples in the terminal cell containing $\mathbf{x}$ in tree $m$.
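The construction admits a short, self-contained sketch. The names below (`CenteredTree`, `centered_forest_predict`) are illustrative rather than taken from the cited papers, and features are assumed to lie in the unit hypercube $[0,1]^d$; empty cells predict 0 by convention.

```python
import numpy as np

class CenteredTree:
    """Full binary tree of depth k; every internal node splits a uniformly
    random coordinate at the midpoint of the current cell."""

    def __init__(self, d, k, rng):
        self.d, self.k = d, k
        # One random split coordinate per internal node (2^k - 1 of them,
        # breadth-first order: children of node i are 2i+1 and 2i+2).
        self.split_dims = rng.integers(0, d, size=2 ** k - 1)

    def leaf_index(self, x):
        """Descend from the root, halving the current cell at every level."""
        low, high = np.zeros(self.d), np.ones(self.d)
        node = 0
        for _ in range(self.k):
            j = self.split_dims[node]
            mid = (low[j] + high[j]) / 2.0
            if x[j] >= mid:
                low[j] = mid
                node = 2 * node + 2
            else:
                high[j] = mid
                node = 2 * node + 1
        return node  # unique identifier of the terminal cell

    def fit_predict(self, X_train, y_train, X_test):
        """Average the training responses falling in each test point's cell."""
        train_leaves = np.array([self.leaf_index(x) for x in X_train])
        preds = np.empty(len(X_test))
        for i, x in enumerate(X_test):
            mask = train_leaves == self.leaf_index(x)
            preds[i] = y_train[mask].mean() if mask.any() else 0.0
        return preds

def centered_forest_predict(X_train, y_train, X_test, k=4, n_trees=100, seed=0):
    """CRF prediction: the mean of the individual tree predictions."""
    rng = np.random.default_rng(seed)
    trees = [CenteredTree(X_train.shape[1], k, rng) for _ in range(n_trees)]
    return np.mean([t.fit_predict(X_train, y_train, X_test) for t in trees], axis=0)
```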
2. Kernel Representation (KeRF) and Explicit Kernel Structure
A significant theoretical advance is the representation of CRF predictions as a kernel estimator, where the “kernel” quantifies the probability that two points are assigned to the same terminal cell under the random partition process. In the infinite ensemble limit, this approach yields the so-called Kernel based on Random Forests (KeRF).
The kernelized CRF (CRF-KeRF) prediction at $\mathbf{x}$ is

$\tilde{m}^{cc}_{\infty,n}(\mathbf{x}) = \dfrac{\sum_{i=1}^{n} Y_i\, K_{k}^{cc}(\mathbf{x}, \mathbf{X}_i)}{\sum_{i=1}^{n} K_{k}^{cc}(\mathbf{x}, \mathbf{X}_i)},$

where the kernel $K_{k}^{cc}$ is given by
$K_{k}^{cc}(\mathbf{x}, \mathbf{z}) = \sum_{k_{1}+\cdots+k_{d}=k} \frac{k!}{k_{1}!\cdots k_{d}!}\left(\frac{1}{d}\right)^k \prod_{j=1}^d \mathds{1}_{\lceil 2^{k_j} x_j \rceil = \lceil 2^{k_j} z_j \rceil}$
This explicit multinomial formula computes the probability, over the randomness in tree construction, that $\mathbf{x}$ and $\mathbf{z}$ share the same splitting history across all coordinates after a total of $k$ splits, i.e., that they fall into the same terminal cell.
The kernel is positive semi-definite, interpretable as a “connection probability,” and enables the application of kernel methods theory to CRFs (Scornet, 2015).
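The multinomial formula can be evaluated directly by enumerating the compositions $k_1+\cdots+k_d=k$, which is practical only for small $d$ and $k$. The sketch below (with illustrative function names `kernel_cc` and `kerf_predict`) computes the kernel and the corresponding KeRF prediction.

```python
import numpy as np
from math import factorial, ceil

def compositions(total, parts):
    """Yield all tuples (k_1, ..., k_parts) of nonnegative ints summing to total."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def kernel_cc(x, z, k):
    """Probability that x and z share a terminal cell of a depth-k centered tree."""
    d = len(x)
    total = 0.0
    for ks in compositions(k, d):
        same_cell = all(ceil(2 ** kj * xj) == ceil(2 ** kj * zj)
                        for kj, xj, zj in zip(ks, x, z))
        if same_cell:
            coeff = factorial(k)
            for kj in ks:
                coeff //= factorial(kj)        # multinomial coefficient k!/(k_1!...k_d!)
            total += coeff * (1.0 / d) ** k
    return total

def kerf_predict(X_train, y_train, X_test, k):
    """KeRF estimate: kernel-weighted average of all training responses."""
    preds = np.empty(len(X_test))
    for i, x in enumerate(X_test):
        w = np.array([kernel_cc(x, xi, k) for xi in X_train])
        preds[i] = np.dot(w, y_train) / w.sum() if w.sum() > 0 else 0.0
    return preds
```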
3. Theoretical Properties: Consistency and Convergence Rates
Centered KeRF estimators are shown to be consistent under standard regularity conditions (e.g., a Lipschitz-continuous regression function and uniformly distributed features on $[0,1]^d$), and explicit upper bounds on the mean squared risk are available. Notably, for a centered KeRF of depth $k$ and sample size $n$, provided $k \to \infty$ and $n/2^{k} \to \infty$ (with the depth chosen of order $\log_2 n$), the estimator satisfies

$\mathbb{E}\big[\tilde{m}^{cc}_{\infty,n}(\mathbf{X}) - m(\mathbf{X})\big]^2 \le C\, n^{-1/(3+d\log 2)} (\log n)^2,$

where $C$ is a constant and $d$ is the feature dimension (Scornet, 2015). This rate, while generally slower than the minimax-optimal nonparametric rate, improves upon previous rates for nonadaptive random forests in high dimensions and reflects the interplay between cell diameter (bias) and cell occupancy (variance).
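Taking the bound above at face value, the guaranteed exponent $1/(3+d\log 2)$ can be tabulated to see how quickly it degrades with dimension; the snippet below is purely illustrative and compares it with the classical minimax exponent $2/(d+2)$ for Lipschitz regression.

```python
import math

for d in (1, 2, 5, 10, 20):
    kerf_exp = 1.0 / (3.0 + d * math.log(2))   # exponent in the KeRF upper bound
    minimax_exp = 2.0 / (d + 2.0)              # minimax exponent, Lipschitz regression
    print(f"d = {d:2d}: KeRF bound n^-{kerf_exp:.3f} vs minimax n^-{minimax_exp:.3f}")
```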
Recent work (Iakovidis et al., 2023) further sharpens this upper bound and provides a detailed analysis of the associated reproducing kernel Hilbert space (RKHS), including the exact kernel eigenstructure and the effective RKHS dimension. The RKHS has a much lower effective dimension than the space of all possible tree paths, implicitly regularizing function estimation.
4. Empirical Properties and Comparisons
Empirical evaluations have demonstrated that:
- Centered KeRF performs comparably to, or better than, standard centered random forests, especially in nonadaptive settings, where the partition is fixed independently of the data and individual cells may be irregularly populated.
- Comparisons are reported in terms of empirical prediction risk (mean squared error on held-out data).
- For Breiman forests (where feature and split selection are adaptive to the data), the kernelized version (Breiman KeRF) attains prediction risk comparable to that of the original forest.
- The performance of finite KeRF converges to its infinite (theoretical) limit as the number of trees increases (see the Monte Carlo sketch after this list).
- Computational overhead is similar to standard random forests for finite ensembles; explicit kernel computation can be intensive for large sample sizes $n$ or deep trees (large $k$).
- Bootstrapping has little impact on empirical performance for either ensemble or kernelized estimators (Scornet, 2015).
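The finite-to-infinite convergence can be checked numerically: the empirical frequency with which two fixed points share a terminal cell, averaged over independently drawn centered trees, approaches the analytic kernel $K_k^{cc}$ as the number of trees grows. A minimal Monte Carlo sketch, assuming points in $[0,1]^d$ (`kernel_cc` refers to the kernel sketch above):

```python
import numpy as np

def same_cell_once(x, z, k, rng):
    """Simulate one depth-k centered tree and report whether x and z share a leaf."""
    low, high = np.zeros(len(x)), np.ones(len(x))
    for _ in range(k):
        j = rng.integers(0, len(x))              # random split coordinate
        mid = (low[j] + high[j]) / 2.0
        if (x[j] >= mid) != (z[j] >= mid):       # this split separates the two points
            return 0.0
        if x[j] >= mid:
            low[j] = mid
        else:
            high[j] = mid
    return 1.0

rng = np.random.default_rng(0)
x, z, k = np.array([0.31, 0.62]), np.array([0.33, 0.59]), 5
for n_trees in (10, 100, 1000, 10000):
    freq = np.mean([same_cell_once(x, z, k, rng) for _ in range(n_trees)])
    print(n_trees, freq)   # should approach kernel_cc(x, z, k) as n_trees grows
```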
A summary comparison is provided below:
| Aspect | Centered RF | Centered KeRF (CRF-KeRF) |
|---|---|---|
| Prediction aggregation | Average of per-tree cell means | Kernel-weighted average over all training points |
| Kernel function | Implicit | Explicit, analytic |
| Interpretability | Partition-dependent | Connection-probability kernel |
| Consistency rate (upper bound) | Consistent; rates known for nonadaptive forests | $n^{-1/(3+d\log 2)}(\log n)^2$ (Scornet, 2015) |
| Empirical performance | Strong | Comparable or superior for nonadaptive forests |
| Theoretical analysis | Challenging | Tractable via kernel methods |
5. Application to Context-Dependent Feature Analysis
The centered random forest framework has been extended to context-dependent feature relevance analysis (Sutera et al., 2016). In this setting, the relevance of input variables to the prediction target is allowed to depend on a context variable (e.g., gender, environment, disease subtype). CRF-based mutual information scores are computed at each node, enabling detection and quantification of variables whose predictive power is modulated by context. The key identification criterion is that a variable is context-independent if and only if its relevance score is unchanged across all context values. This methodology is robust to nonlinear and multivariate interactions and can reveal context-complementary or context-redundant structures in real and synthetic data.
Applications include:
- Biomedical science: context-specific biomarkers or gene regulatory interactions
- Social science and economics: demographic- or time-specific covariate relevance
- Artificial intelligence: conditional informativeness in multi-source data
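The identification criterion above can be illustrated with a simple stand-in for the CRF-based scores: estimating a per-context importance on synthetic data and flagging features whose scores vary across contexts. The sketch below uses scikit-learn's mutual information estimator rather than the node-level CRF scores of Sutera et al.; the synthetic data, threshold, and scoring choice are illustrative assumptions, not the original methodology.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 5000
context = rng.integers(0, 2, size=n)             # binary context variable
X = rng.uniform(size=(n, 3))
# X[:, 0] matters only when context == 1, X[:, 2] only when context == 0,
# and X[:, 1] matters equally in both contexts.
y = context * X[:, 0] + (1 - context) * X[:, 2] + X[:, 1] + 0.1 * rng.normal(size=n)

# Importance of each feature, estimated separately within each context value.
scores = {c: mutual_info_regression(X[context == c], y[context == c], random_state=0)
          for c in (0, 1)}
for j in range(X.shape[1]):
    spread = abs(scores[0][j] - scores[1][j])    # rough threshold, illustration only
    label = "context-dependent" if spread > 0.05 else "context-independent"
    print(f"feature {j}: MI|c=0 = {scores[0][j]:.2f}, MI|c=1 = {scores[1][j]:.2f} -> {label}")
```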
6. Infinite CRF, Asymptotic Normality, and Imbalanced Classification
For the infinite ensemble limit (ICRF), recent results establish a Central Limit Theorem (CLT) with explicit rate and variance constants, allowing for statistical inference on predictions made by centered random forests (Mayala et al., 10 Jun 2025). If the ensemble is trained on a rebalanced dataset (e.g., to counter class imbalance), the estimator is biased, but the bias can be removed via an importance sampling (IS) adjustment, yielding the IS-ICRF estimator. This debiased estimator enjoys a CLT centered at the true regression function and displays significant variance reduction relative to an ICRF trained on the original, highly imbalanced dataset, particularly as the minority class proportion decreases. Experimental results confirm the theoretical rates and variances both for the idealized ICRF and for Breiman's practical random forests.
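The importance-sampling idea behind the correction can be sketched generically: when a terminal-cell average is computed from a rebalanced sample, weighting each response by the ratio of the original to the rebalanced class probability removes the rebalancing bias. The following is a hedged illustration of that reweighting only, not the exact IS-ICRF estimator of Mayala et al.

```python
import numpy as np

def is_weighted_cell_average(y_cell, pi_original, pi_rebalanced):
    """Self-normalized importance-sampling mean of the 0/1 responses in one cell.

    pi_original   -- minority-class proportion in the original data
    pi_rebalanced -- minority-class proportion after rebalancing
    """
    y_cell = np.asarray(y_cell, dtype=float)
    # w(y) = p(y) / q(y): original over rebalanced class probability.
    w = np.where(y_cell == 1,
                 pi_original / pi_rebalanced,
                 (1 - pi_original) / (1 - pi_rebalanced))
    return np.sum(w * y_cell) / np.sum(w)

# Example: a cell with 6 minority (y=1) and 4 majority (y=0) points drawn from a
# 50/50 rebalanced sample, while the original minority proportion was 5%.
print(is_weighted_cell_average([1] * 6 + [0] * 4, pi_original=0.05, pi_rebalanced=0.5))
```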
7. Discussion, Impact, and Future Directions
Centered Random Forests provide a rigorous, analyzable framework for ensemble learning by marrying the mechanics of random partitioning with the interpretability and analytic tractability of kernel methods. The explicit nature of their connection kernel enables precise bias-variance analysis, supports theoretical advances such as sharp convergence rates, and facilitates new applications in feature analysis and imbalanced learning. Their data-independent splitting strategy simplifies statistical analysis while maintaining competitive empirical performance, particularly in nonadaptive regimes or high-dimensional problems. Extensions to generalized feature analysis and bias-corrected inference in imbalanced classification underline their versatility and foundational role in modern nonparametric statistical learning.