- The paper introduces a theoretical framework linking random forests and gradient boosting to reproducing kernel Hilbert spaces (RKHS) by constructing data-dependent kernels from tree partitions.
- The paper demonstrates that the associated RKHS yields a variational characterization of the RF predictor and an implicit regularization, enhancing model interpretability.
- Empirical benchmarks show that RF kernel PCA improves classification and regression performance, while geometric variable importance (GVI) offers robust feature analysis.
An RKHS Perspective on Tree Ensembles: Technical Summary
Introduction and Motivation
This paper develops a unifying theoretical framework that rigorously connects tree-based ensemble methods, specifically Random Forests (RF) and Gradient Boosting (GB), to Reproducing Kernel Hilbert Space (RKHS) theory. While these algorithms, built on recursive partitioning and the aggregation of randomized regression trees, dominate supervised learning on tabular data, mathematical analysis has traditionally lagged behind their empirical success. The key innovation is the introduction and characterization of kernels and RKHS structures intrinsic to tree ensembles, in particular those arising from the random partitions generated by randomized decision trees.
Construction of Random Forest Kernels and RKHS
The formalism begins with a general definition of random partitions of $[0,1]^p$ whose blocks are hyperrectangles (tree leaves). Each RF induces data-dependent weights and, equivalently, a symmetric, positive semi-definite kernel that encodes the co-occurrence of data points within leaves across the ensemble. The canonical RF predictor is thereby recast as a kernel-weighted average: $\hat{F}_n(x) = \sum_{i=1}^{n} W_{ni}(x)\, Y_i,$
with corresponding kernel (for $M$ trees): $k_n^M(z, z') = \frac{n}{M} \sum_{m=1}^{M} \sum_{l=1}^{N_m} \frac{\mathbf{1}\{z, z' \in A_{m,l}\}}{|A_{m,l}|},$
where $A_{m,l}$ is the $l$-th leaf of tree $m$ and $N_m$ is the number of leaves of tree $m$.
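As a concrete illustration, the following sketch builds this leaf co-occurrence kernel from a fitted scikit-learn forest and checks the weighted-average identity numerically. It assumes that $|A_{m,l}|$ counts the training points in a leaf and uses `bootstrap=False` so that $\hat{F}_n(x) = \frac{1}{n}\sum_i k_n^M(x, X_i)\, Y_i$ holds exactly; this is a minimal sketch, not the paper's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))                      # inputs on [0,1]^p
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=200)

rf = RandomForestRegressor(n_estimators=50, max_features=0.5,
                           bootstrap=False, random_state=0)
rf.fit(X, y)

def rf_kernel(rf, X_train, Z1, Z2):
    """k(z, z') = (n/M) * sum over trees of 1{same leaf} / (training points in that leaf)."""
    n, M = X_train.shape[0], len(rf.estimators_)
    leaves_train = rf.apply(X_train)                # (n, M) leaf index per tree
    L1, L2 = rf.apply(Z1), rf.apply(Z2)             # leaf indices of the query points
    K = np.zeros((Z1.shape[0], Z2.shape[0]))
    for m in range(M):
        counts = np.bincount(leaves_train[:, m])    # training occupancy of each leaf
        same_leaf = L1[:, m][:, None] == L2[:, m][None, :]
        K += same_leaf / counts[L1[:, m]][:, None]
    return (n / M) * K

K = rf_kernel(rf, X, X, X)
pred_from_kernel = K @ y / len(y)                   # (1/n) * sum_i k(x, X_i) * Y_i
print(np.max(np.abs(pred_from_kernel - rf.predict(X))))   # ~0 with bootstrap=False
```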
Comprehensive analysis establishes that the induced RF kernel is measurable, bounded (under natural assumptions), and typically continuous, especially when tree splits are randomized over atomless distributions.
Analytical Properties of Associated RKHSs
For both RF and its limit as the number of trees grows, the paper constructs an explicit RKHS structure:
- Functions in the RF RKHS are shown to be bounded, measurable, and, under suitable conditions, continuous and uniformly equicontinuous.
- The norm of a function F in the RKHS is related to the strength of its alignment with the partition-induced geometry, and is tightly connected to variance decomposition in regression contexts.
- The kernel is characteristic whenever the block intensity measure over partitions is determining, leading to RKHS density in spaces of continuous functions under mild path-wise regularity conditions.
A notable result is the explicit characterization: $\bar{T}(\cdot\,; P, \mu) = \int_{[0,1]^p \times \mathbb{R}} y\, k(x, \cdot)\, P(dx, dy),$
which interprets the RF predictor as a kernel mean embedding of Y against the partition geometry of the input space.
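To make this concrete, the short derivation below (a sketch, evaluating the formula at the empirical measure $P_n = \frac{1}{n}\sum_i \delta_{(X_i, Y_i)}$) shows how the abstract embedding reduces to the weighted-average form of the RF predictor given earlier.

```latex
% Sketch: the mean-embedding formula at the empirical measure P_n recovers
% the weighted-average form of the RF predictor.
\begin{align*}
  \bar{T}(x; P_n, \mu)
    &= \int_{[0,1]^p \times \mathbb{R}} y \, k(z, x) \, P_n(dz, dy)
     = \frac{1}{n} \sum_{i=1}^{n} Y_i \, k_n^M(X_i, x) \\
    &= \sum_{i=1}^{n} W_{ni}(x) \, Y_i
     = \hat{F}_n(x),
     \qquad \text{with } W_{ni}(x) = \tfrac{1}{n} \, k_n^M(X_i, x).
\end{align*}
```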
Variational Characterization and Gradient Structures
The RF predictor emerges as the unique minimizer of a penalized least-squares objective in its RKHS: $\min_{F \in \mathcal{H}} \; -2\, P[\, y F(x)\,] + \|F\|_{\mathcal{H}}^2.$
This variational perspective reveals that ensemble partitioning induces implicit regularization, controlling function complexity in the data-adaptive RKHS.
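A one-line completion of the square (a sketch, using only the reproducing property of the kernel) makes the variational claim transparent:

```latex
% Sketch: by the reproducing property, P[y F(x)] = <F, T-bar>_H, so the
% objective is a squared RKHS distance to the mean embedding up to a constant.
\begin{align*}
  -2\, P[\, y F(x) \,] + \|F\|_{\mathcal{H}}^2
    &= -2\, \big\langle F, \bar{T}(\cdot\,; P, \mu) \big\rangle_{\mathcal{H}}
       + \|F\|_{\mathcal{H}}^2 \\
    &= \big\| F - \bar{T}(\cdot\,; P, \mu) \big\|_{\mathcal{H}}^2
       - \big\| \bar{T}(\cdot\,; P, \mu) \big\|_{\mathcal{H}}^2,
\end{align*}
```

so the unique minimizer over $\mathcal{H}$ is $F = \bar{T}(\cdot\,; P, \mu)$, i.e. the RF predictor itself.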
Further, the paper extends the framework to continuous-time boosting: infinitesimal gradient boosting (IGB), the limit as the learning rate tends to zero, is rigorously interpreted as a gradient flow on an infinite-dimensional Hilbert manifold (the RF-induced RKHS), driven by a population risk or empirical loss functional. The RKHS geometry here is data- and predictor-dependent, varying smoothly with the underlying function estimate.
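The gradient-flow picture can be illustrated with a schematic Euler discretization of squared-error boosting: shrink the learning rate and take proportionally more steps. This is an illustrative sketch only, not the paper's infinitesimal-gradient-boosting construction; the tree depth, step sizes, and data are arbitrary choices.

```python
import numpy as np
from sklearn.tree import ExtraTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 3))
y = X[:, 0] ** 2 - X[:, 1] + 0.1 * rng.normal(size=300)

def boosting_path(X, y, eta=0.01, T=2.0, seed=0):
    """Take floor(T / eta) boosting steps of size eta; smaller eta approximates the flow."""
    F = np.zeros_like(y)                         # current predictor evaluated at the data
    step_rng = np.random.default_rng(seed)
    for _ in range(int(T / eta)):
        residual = y - F                         # negative gradient of the squared error
        tree = ExtraTreeRegressor(max_depth=3,
                                  random_state=int(step_rng.integers(1 << 31)))
        tree.fit(X, residual)                    # randomized tree = direction chosen this step
        F = F + eta * tree.predict(X)            # Euler step of size eta
    return F

F_coarse = boosting_path(X, y, eta=0.1)          # 20 steps
F_fine = boosting_path(X, y, eta=0.01)           # 200 steps, closer to the continuous limit
print(np.mean((y - F_coarse) ** 2), np.mean((y - F_fine) ** 2))
```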
Kernel Methods, Geometric Variable Importance, and Empirical Results
The RF kernel unlocks practical applications in kernel methods, notably kernel PCA for both classification and regression. Empirical benchmarks on UCI datasets demonstrate that representations yielded by RF kernels enhance class separation and downstream linear model performance, outperforming standard kernels (e.g., Gaussian RBF) under default hyperparameter regimes.
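A minimal reproduction sketch of this pipeline (not the paper's code; it uses the simple leaf-proximity kernel, the fraction of trees in which two points share a leaf, as a stand-in for $k_n^M$, and a dataset chosen only for illustration) looks as follows:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

def proximity_kernel(rf, Z1, Z2):
    """K[i, j] = fraction of trees in which Z1[i] and Z2[j] fall in the same leaf."""
    L1, L2 = rf.apply(Z1), rf.apply(Z2)           # (n, M) leaf indices
    return (L1[:, None, :] == L2[None, :, :]).mean(axis=2)

K_tr = proximity_kernel(rf, X_tr, X_tr)
K_te = proximity_kernel(rf, X_te, X_tr)           # cross-kernel for held-out points

kpca = KernelPCA(n_components=10, kernel="precomputed").fit(K_tr)
clf = LogisticRegression(max_iter=1000).fit(kpca.transform(K_tr), y_tr)
print("accuracy on RF-kernel PCA features:", clf.score(kpca.transform(K_te), y_te))
```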
The geometric variable importance (GVI) measure is introduced, quantifying the proportion of variance in each feature that is captured by the RF kernel geometry. Unlike classical mean decrease in impurity (MDI) and mean decrease in accuracy (MDA), GVI leverages the entire ensemble's local averaging structure and is extremely fast to compute at scale. Simulation studies show that GVI aligns closely with permutation-based importance while remaining robust to redundant features, categorical distractors, and high dimensionality.
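One plausible instantiation of this idea is sketched below: smooth each feature with the RF weight matrix and report the fraction of its variance that survives. This is a hedged reading of "proportion of variance captured by the kernel geometry", not necessarily the authors' exact definition of GVI.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.uniform(size=(400, 6))
y = 2 * X[:, 0] + np.sin(4 * X[:, 1]) + 0.1 * rng.normal(size=400)   # features 2-5 are noise

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

def rf_weights(rf, X_train):
    """W[i, j]: average over trees of 1{same leaf} / (leaf occupancy); rows sum to 1."""
    leaves = rf.apply(X_train)                              # (n, M) leaf indices
    W = np.zeros((X_train.shape[0], X_train.shape[0]))
    for m in range(leaves.shape[1]):
        same = leaves[:, m][:, None] == leaves[:, m][None, :]
        W += same / same.sum(axis=1, keepdims=True)
    return W / leaves.shape[1]

W = rf_weights(rf, X)
X_smoothed = W @ X                                          # RF-local averages of each feature
gvi = X_smoothed.var(axis=0) / X.var(axis=0)                # variance retained per feature
print(np.round(gvi, 3))                                     # informative features score higher
```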
Empirical Highlights:
- RF kernel PCA produces superior linear separability for classification and regression as measured by accuracy, silhouette scores, and relative error improvements.
- Effective sample sizes induced by the RF weights vary widely with the forest construction, impacting geometric regularization and generalization (see the sketch after this list).
- GVI demonstrates competitive precision in signal selection and separation in a suite of synthetic scenarios, including additive, collinear, XOR, and context-dependent signals.
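As an example of the effective-sample-size quantity referenced above, the following sketch computes the Kish effective sample size $1 / \sum_i W_{ni}(x)^2$ of the RF weights at a few query points; treating this as the paper's exact definition is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.uniform(size=(500, 4))
y = X[:, 0] + 0.1 * rng.normal(size=500)
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0).fit(X, y)

def rf_query_weights(rf, X_train, X_query):
    """W[q, i] = average over trees of 1{x_q, X_i share a leaf} / (leaf occupancy)."""
    L_tr, L_q = rf.apply(X_train), rf.apply(X_query)        # leaf indices per tree
    W = np.zeros((X_query.shape[0], X_train.shape[0]))
    for m in range(L_tr.shape[1]):
        same = L_q[:, m][:, None] == L_tr[:, m][None, :]
        W += same / same.sum(axis=1, keepdims=True)
    return W / L_tr.shape[1]

X_query = rng.uniform(size=(10, 4))
W = rf_query_weights(rf, X, X_query)
ess = 1.0 / np.square(W).sum(axis=1)                        # Kish effective sample size
print(np.round(ess, 1))                                     # larger => more averaging at x
```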
Theoretical Implications
This kernel-centric viewpoint for ensemble tree methods bridges traditional nonparametric regression and modern kernel methods. It supplies:
- A principled answer to why RFs perform well: data-adaptive RKHS capacity control, coupled to the partitioning mechanism, balances fit and regularity.
- Means to analyze limiting behavior, consistency, and generalization by referencing properties of the associated RKHS and the induced kernel operator.
- Pathways to extend RF/GB beyond tabular contexts via geometric constructions in manifold settings.
Future Directions
Potential extensions include:
- Kernel-based confidence intervals and uncertainty quantification via effective sample size and kernel operator spectra.
- Application of RF kernels to kernel ridge regression, support vector machines, and manifold learning with adaptive metrics.
- Analysis of the fluctuations and finite-sample limits of gradient flows (e.g., functional CLT regimes for IGB).
- Investigations into algorithmic biases induced by partition geometry in high dimensions.
Conclusion
By embedding tree ensembles into a rigorous RKHS framework, the paper provides new analytical tools to resolve the geometry, regularization, and optimization landscape underpinning Random Forests and Gradient Boosting, with direct implications for interpretability, variable importance, and kernelized algorithm design. This work sharply refines the theoretical foundations for ensemble learning, facilitating both deeper mathematical inquiry and robust practical improvements in model analysis and deployment.