Forest of Model Trees

Updated 23 November 2025
  • Forest of model trees is an ensemble learning method where each tree partitions the input by data-adaptive hyperplanes and fits local linear models at the leaves.
  • It employs convolutional regularization and smooth C¹ blending to produce continuously differentiable regressors, enhancing robustness against input perturbations.
  • The training algorithm recursively fits least-squares models under a tilt constraint and provably converges, with each leaf's linear fit achieving RMSE at most a user-specified threshold.

A forest of model trees is an ensemble learning approach in which each base learner is a model tree—specifically, a tree that partitions the input space by means of data-adaptive hyperplanes at internal nodes and fits local linear models at the leaves. Recent developments focus on application domains such as function approximation over high-dimensional images, where the method leverages down-sampling, convolutional regularization, and smooth C¹ blending to produce continuously differentiable regressors with provable convergence guarantees (Armstrong, 16 Nov 2025). These model tree forests stand in contrast to classical piecewise constant decision tree ensembles, and are conceptually related to, but distinct from, transformation forests that aggregate local conditional distribution models in a parametric framework (Hothorn et al., 2017).

1. Formal Definition and Structure

Let $d$ denote the (optionally down-sampled) dimensionality of the input, typically a vectorized image, and let $H = \prod_{i=1}^d [0, w_i]$ denote an axis-aligned hyper-rectangle (HR). A model tree $T$ is a full binary tree with the following elements:

  • Internal nodes $n$ store hyperplane (HP) split functions $S_n : H \to \mathbb{R}$,
  • Leaves $\ell$ store linear functions $F_\ell : H \to \mathbb{R}$.

Inputs $x \in H$ are routed down the tree according to the sign of $S_n(x)$. For a node $n$ with center $m \in \mathbb{R}^d$ and least-squares fit coefficients $\alpha^{(n)}$, the split function is

$$S_n(x) = \sum_{i=1}^d \alpha_i^{(n)} (x_i - m_i).$$

At a leaf $\ell$, the local regression is

$$F_\ell(x) = \sum_{i=1}^d \beta_i^{(\ell)} (x_i - c_i^{(\ell)}) + Y^{(\ell)},$$

where $\beta^{(\ell)}$ are least-squares coefficients, $c^{(\ell)}$ is the centroid of samples in the leaf, and $Y^{(\ell)}$ is the average label.

A forest of such model trees comprises $M$ independently constructed trees, each providing both a prediction $T_j(x)$ and a leaf-specific weight $w_j(x) \in [0, 1]$, combined into a weighted average output:

$$f_F(x) = \frac{\sum_{j=1}^M w_j(x)\, T_j(x)}{\sum_{j=1}^M w_j(x)}.$$
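
As a concrete illustration, the following Python sketch routes an input through a single model tree and forms the weighted forest average above. All class and function names are hypothetical; for clarity the routing uses hard sign-based splits, whereas the paper's trees blend smoothly near split boundaries (see Section 5).

```python
import numpy as np

class Node:
    """Internal node: hyperplane split S_n(x) = alpha . (x - m)."""
    def __init__(self, alpha, m, left, right):
        self.alpha, self.m = alpha, m
        self.left, self.right = left, right

class Leaf:
    """Leaf: local linear model F(x) = beta . (x - c) + Y."""
    def __init__(self, beta, c, Y):
        self.beta, self.c, self.Y = beta, c, Y

def tree_predict(node, x):
    # Route x by the sign of the split function until a leaf is reached.
    while isinstance(node, Node):
        s = np.dot(node.alpha, x - node.m)
        node = node.left if s < 0 else node.right
    return np.dot(node.beta, x - node.c) + node.Y

def forest_predict(trees, weight_fns, x):
    # Weighted average f_F(x) = sum_j w_j(x) T_j(x) / sum_j w_j(x).
    preds = np.array([tree_predict(t, x) for t in trees])
    w = np.array([w_fn(x) for w_fn in weight_fns])
    return float(np.sum(w * preds) / np.sum(w))
```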

2. Down-Sampling and Input Preprocessing

Prior to constructing model trees for image data, the images are down-sampled by partitioning the original grid into non-overlapping $k \times k$ blocks and representing each super-pixel by the average intensity in that block. This yields a lower-dimensional input vector $x$ of length $d = (D/k)^2$ for images of dimension $D \times D$. This dimensionality reduction not only accelerates least-squares fitting but also modifies the meaning of hyperplanes and regression coefficients, which now operate over super-pixels (Armstrong, 16 Nov 2025).
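
A minimal NumPy sketch of this block-averaging step, assuming $D$ is divisible by $k$:

```python
import numpy as np

def downsample(image, k):
    """Average-pool a D x D image over non-overlapping k x k blocks,
    then flatten to a vector of length d = (D // k) ** 2."""
    D = image.shape[0]
    assert image.shape == (D, D) and D % k == 0
    # Reshape so each k x k block occupies axes 1 and 3, then average.
    blocks = image.reshape(D // k, k, D // k, k)
    return blocks.mean(axis=(1, 3)).ravel()
```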

3. Convolutional Regularization of Hyperplanes

To impart robustness against localized distortions (e.g., minor translations or small deformations in images), the hyperplane coefficient grid $\alpha^{(n)}$ at each node is convolved with a spatial kernel $G$. For a position $p$ on the down-sampled grid, the convolved coefficients are:

$$\tilde{\alpha}_p^{(n)} = (\alpha^{(n)} * G)(p) = \sum_{q \in P} G(p - q)\, \alpha_q^{(n)}.$$

After reshaping $\alpha^{(n)}$ to match the spatial grid, this convolution is applied once before inference, so it adds no runtime overhead at prediction. The resulting split tests $S_n(x)$ are computed using the convolved weights, enhancing generalization under slight input perturbations (Armstrong, 16 Nov 2025).
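
A sketch of this preprocessing step using SciPy's 2-D convolution; the paper does not specify the kernel $G$, so the small averaging kernel below is purely illustrative:

```python
import numpy as np
from scipy.signal import convolve2d

def smooth_hyperplane(alpha, grid_shape, G):
    """Convolve a node's hyperplane coefficients with a spatial kernel G.
    alpha: flat coefficient vector of length d; grid_shape: (D//k, D//k)."""
    alpha_grid = alpha.reshape(grid_shape)
    smoothed = convolve2d(alpha_grid, G, mode="same", boundary="fill")
    return smoothed.ravel()

# Example: a normalized 3 x 3 averaging kernel as a stand-in for G.
G = np.ones((3, 3)) / 9.0
```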

4. Forest Construction and Ensemble Prediction

A forest of model trees may be constructed by two principal methods:

  • Training trees on independent bootstrap samples of the training data.
  • Training a base tree, then perturbing the centers $m$ of each node's HR by a small vector $\delta$ and re-fitting the splits.

Each tree produces a weight $w_j(x)$ that is inherently smooth due to the application of smoothing kernels at all split nodes. The overall forest prediction $f_F(x)$ is their weighted average as specified above. The denominator is safeguarded against degenerate cases (i.e., $\sum_j w_j(x) = 0$) by including a "helper tree" with near-zero output and minimal weight (Armstrong, 16 Nov 2025).

Ensembling diverse trees in this manner ensures that the sharp discontinuities of individual trees are averaged out, reducing prediction variance.
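
A sketch of the bootstrap construction and the helper-tree safeguard, under the same hypothetical naming as the earlier snippet (`fit_model_tree` stands in for the recursive training routine of Section 6):

```python
import numpy as np

def fit_forest(X, y, M, fit_model_tree, rng=np.random.default_rng(0)):
    """Train M model trees on independent bootstrap resamples."""
    n = len(X)
    trees = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)  # bootstrap sample with replacement
        trees.append(fit_model_tree(X[idx], y[idx]))
    return trees

def safe_weighted_average(preds, weights, eps=1e-12):
    """Weighted average with a helper-tree-style safeguard: a near-zero
    prediction with minimal weight keeps the denominator positive."""
    preds = np.append(preds, 0.0)      # helper tree output ~ 0
    weights = np.append(weights, eps)  # helper tree weight ~ 0
    return float(np.sum(weights * preds) / np.sum(weights))
```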

5. Smooth Blending and Output Regularity

To overcome the inherent discontinuity of tree-based methods at split boundaries, model trees equip each node split with a $C^1$-continuous smoothing kernel:

$$W(t) = \begin{cases} 3t^2 - 2t^3 & 0 \leq t \leq 1 \\ 1 & t > 1 \end{cases}$$

with $t = |S_n(x)| / h_n$, where $h_n$ is the margin width at node $n$. The node-wise weight is $w_n(x) = W(|S_n(x)| / h_n)$, and the per-tree leaf weight is the product over the path to $\ell$, $w_\ell(x) = \prod_{k=1}^L w_{n_k}(x)$. This smoothing produces a forest-level output $f_F(x)$ that is globally $C^1$ (continuously differentiable), provided the base function $f$ is itself continuously differentiable and other technical criteria are met (Armstrong, 16 Nov 2025).
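
A sketch of the smoothstep kernel and the per-tree path-product weight, with hypothetical helper names:

```python
import numpy as np

def W(t):
    """C^1 smoothstep: 3t^2 - 2t^3 on [0, 1], and 1 for t > 1."""
    t = np.clip(t, 0.0, 1.0)  # clipping makes W(t) = 1 for all t > 1
    return 3.0 * t**2 - 2.0 * t**3

def leaf_weight(split_values, margins):
    """Per-tree leaf weight: product of node weights W(|S_n(x)| / h_n)
    along the root-to-leaf path."""
    return float(np.prod([W(abs(s) / h) for s, h in zip(split_values, margins)]))
```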

6. Training Algorithm and Convergence Guarantee

The training procedure recursively fits least-squares regressions at each node. If the RMSE of the fit is below a threshold $\epsilon$, the block is made a leaf; otherwise, the most influential split axis is selected based on importances $\iota_i = |\alpha_i| \cdot h_i$. A "tilt constraint" with factor $\tau \in (0, 1)$ enforces that only nearly axis-aligned splits are permitted:

$$\sum_{i \neq k} \iota_i \leq \tau\, \iota_k,$$

where $k$ is the axis of highest importance. If this constraint fails, low-importance coefficients are suppressed to ensure geometric shrinkage of the partition blocks. The algorithm guarantees convergence: for any $C^1$ function $f$ defined over $H$, recursion halts in finite time, and in each leaf, the linear fit achieves RMSE at most $\epsilon$ (Armstrong, 16 Nov 2025). No explicit regularization is required under idealized assumptions and with sufficiently dense sampling.
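
A sketch of the importance computation and tilt-constraint check; the suppression rule below (zeroing the lowest-importance coefficients until the constraint holds) is one plausible reading of the described procedure, not a verbatim reproduction of the paper's algorithm:

```python
import numpy as np

def enforce_tilt_constraint(alpha, h, tau):
    """Keep the split nearly axis-aligned: require
    sum_{i != k} iota_i <= tau * iota_k, with iota_i = |alpha_i| * h_i."""
    alpha = alpha.copy()
    iota = np.abs(alpha) * h
    k = int(np.argmax(iota))
    # Suppress coefficients in ascending order of importance until
    # the constraint is satisfied.
    for i in np.argsort(iota):
        if iota.sum() - iota[k] <= tau * iota[k]:
            break
        if i == k:
            continue
        alpha[i] = 0.0
        iota[i] = 0.0
    return alpha, k
```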

Model tree forests as described above are structurally different from transformation forests (Hothorn et al., 2017), although both aggregate tree-based predictors with locally adaptive models at the leaves. Whereas model trees partition feature space and fit piecewise linear regressors, transformation trees fit parametric transformation models at leaves that capture the entire conditional distribution. Transformation forests then aggregate these models via forest weights, yielding local maximum-likelihood estimates of the conditional law and facilitating prediction intervals and quantile regression. This suggests that the "forest of model trees" approach is particularly targeted at regression over high-dimensional structured domains (e.g., images), leveraging local linearity, convolutional structure, and explicit $C^1$ smoothing, whereas transformation forests focus on local distribution estimation via adaptive likelihood aggregation.

