
Prediction-Augmented Residual Tree (PART)

Updated 26 October 2025
  • The paper introduces PART, a tree-based estimator that augments ML predictions with partitioned residual corrections to achieve improved statistical efficiency and tighter confidence intervals.
  • It details an adaptive tree algorithm that minimizes variance through local residual adjustments, ensuring asymptotic normality and enhanced error calibration.
  • Empirical evaluations across ecology, astronomy, census, and bioinformatics highlight PART’s robust performance compared to traditional global debiasing methods.

The Prediction-Augmented Residual Tree (PART) is an adaptive, tree-based estimator that combines ML predictions with classical residual-based corrections to produce statistically efficient and robust inference across heterogeneous domains. PART leverages a small set of gold-standard labeled samples, a large set of unlabeled data, and a machine learning predictor to construct an augmented decision tree estimator with asymptotic guarantees and improved confidence intervals. The methodology formalizes a partitioned residual correction strategy over the feature space and stands as a significant advancement over global debiasing estimators, exhibiting strong empirical and theoretical results across ecology, astronomy, census, and bioinformatics datasets.

1. Theoretical Foundations and Motivation

Prediction augmentation via residual trees is rooted in the challenge of combining high-throughput ML predictors with limited high-fidelity labeled data for reliable scientific inference. The canonical setting assumes access to $n$ labeled samples $\mathcal{L} = \{(x_i, y_i)\}_{i=1}^n$, $N \gg n$ unlabeled samples $\mathcal{U} = \{\widetilde{x}_j\}_{j=1}^N$, and an ML model $f: \mathcal{X} \to \mathbb{R}$ capable of imputing labels $f(\widetilde{x}_j)$ for the unlabeled instances. Earlier approaches, such as Prediction-Powered Inference (PPI) [Angelopoulos et al.], apply a global residual correction to debias $f(x)$ using observed residuals $r_i = y_i - f(x_i)$:

$$\widehat{\mu}_{\mathrm{PPI}} = \frac{1}{N}\sum_{j=1}^N f(\widetilde{x}_j) + \frac{1}{n}\sum_{i=1}^n r_i$$

However, PPI and its variants fail to exploit heterogeneity in $r(x)$, leading to suboptimal error bounds. PART generalizes this by partitioning $\mathcal{X}$ based on the residual structure, refining corrections for regions where ML predictions systematically deviate.
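
As a point of reference, the following is a minimal sketch of the global PPI correction, assuming NumPy arrays of labels and predictions; the synthetic data and function name are illustrative, not from the paper:

```python
import numpy as np

def ppi_mean_estimate(y_labeled, f_labeled, f_unlabeled):
    """Prediction-Powered Inference (PPI) mean estimate: the average ML
    prediction on the unlabeled pool plus one global residual correction."""
    residuals = y_labeled - f_labeled              # r_i = y_i - f(x_i)
    return f_unlabeled.mean() + residuals.mean()

# Toy usage with synthetic, deliberately biased predictions.
rng = np.random.default_rng(0)
y = rng.normal(5.0, 1.0, size=100)                 # small gold-standard label set
f_lab = y + 0.3 + rng.normal(0, 0.5, size=100)     # ML predictions on labeled x_i
f_unlab = 5.3 + rng.normal(0, 1.1, size=10_000)    # ML predictions on unlabeled x_j
print(ppi_mean_estimate(y, f_lab, f_unlab))        # approximately recovers E[Y] = 5
```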

2. Construction and Statistical Properties

PART constructs a decision tree over $\mathcal{X}$ by recursively partitioning labeled and unlabeled data, using a variance-minimizing criterion. For each candidate split (feature coordinate $k$, threshold $s$), partitions $\mathcal{R}_{\mathrm{left}}$ and $\mathcal{R}_{\mathrm{right}}$ are evaluated with respect to the "Variance of Mixture of Splits" (VMS):

$$\mathrm{VMS}(k, s, \mathcal{R}) = p_{\mathrm{left}}^2\, \frac{\widehat{\sigma}_{\mathrm{left}}^2}{n_{\mathrm{left}}} + p_{\mathrm{right}}^2\, \frac{\widehat{\sigma}_{\mathrm{right}}^2}{n_{\mathrm{right}}}$$

Here,

  • $p_{\mathrm{left}}$, $p_{\mathrm{right}}$: proportional weights estimated from the unlabeled data
  • $\widehat{\sigma}_{\mathrm{left}}^2$, $\widehat{\sigma}_{\mathrm{right}}^2$: empirical residual variances in each subtree
  • $n_{\mathrm{left}}$, $n_{\mathrm{right}}$: number of labeled instances in each subtree

The optimal split minimizes VMS at each node. Recursive splitting stops once a fixed depth $D$ is reached or a region contains too few labeled samples.
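
A hedged sketch of how the VMS criterion might be evaluated over candidate thresholds for one feature is shown below; the quantile threshold grid, the minimum-leaf-size guard, and the function names are illustrative assumptions rather than details fixed by the paper:

```python
import numpy as np

def vms(resid_left, resid_right, p_left, p_right):
    """Variance of Mixture of Splits for one candidate split: weighted sum of
    the per-child variances of the mean residual, with weights taken from
    the unlabeled data."""
    var_l = resid_left.var(ddof=1) / len(resid_left)
    var_r = resid_right.var(ddof=1) / len(resid_right)
    return p_left**2 * var_l + p_right**2 * var_r

def best_split_on_feature(x_lab_k, residuals, x_unlab_k, n_thresholds=9, min_leaf=5):
    """Scan quantile thresholds of a single feature and return (threshold, VMS)
    for the VMS-minimizing split, or None if no valid split exists."""
    best = None
    for s in np.quantile(x_lab_k, np.linspace(0.1, 0.9, n_thresholds)):
        left = x_lab_k <= s
        if left.sum() < min_leaf or (~left).sum() < min_leaf:
            continue                                # too few labeled points for variance estimates
        p_left = np.mean(x_unlab_k <= s)            # region weight from the unlabeled pool
        score = vms(residuals[left], residuals[~left], p_left, 1.0 - p_left)
        if best is None or score < best[1]:
            best = (s, score)
    return best
```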

The estimator aggregates predictions as:

$$\hat{\mu}_T = \frac{1}{N}\sum_{j=1}^{N} f(\widetilde{x}_j) + \sum_{\ell=1}^{L} p_\ell\,\overline{r}_\ell,$$

where $\overline{r}_\ell$ is the mean residual in leaf $\mathcal{R}_\ell$, and $p_\ell$ is the estimated mass from $\mathcal{U}$.

The asymptotic distribution is normal:

$$\sqrt{n}\,(\hat{\mu}_T - \mu) \xrightarrow{d} \mathcal{N}\left(0,\ \sum_{\ell=1}^L p_\ell\, \sigma_\ell^2\right)$$

allowing construction of Wald-type confidence intervals:

$$\Bigl[\hat{\mu}_T - z_{1-\alpha/2}\,\widehat{\sigma},\ \hat{\mu}_T + z_{1-\alpha/2}\,\widehat{\sigma}\Bigr]$$

where $\widehat{\sigma}^2 = \sum_{\ell=1}^L p_\ell^2\,\frac{\widehat{\sigma}_\ell^2}{n_\ell}$ and $z_{1-\alpha/2}$ is the standard normal quantile.
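
For concreteness, a minimal sketch of the aggregation step and the Wald-type interval is given below, assuming the tree has already been grown and each leaf exposes its labeled residuals and its unlabeled-data weight; the data layout and function name are illustrative:

```python
import numpy as np
from scipy.stats import norm

def part_estimate(f_unlabeled, leaf_residuals, leaf_weights, alpha=0.05):
    """PART point estimate with a Wald-type confidence interval.
    leaf_residuals: list of arrays of labeled residuals, one array per leaf.
    leaf_weights:   list of p_ell values, the unlabeled-data mass of each leaf."""
    correction = sum(p * r.mean() for p, r in zip(leaf_weights, leaf_residuals))
    mu_hat = np.mean(f_unlabeled) + correction
    var_hat = sum(p**2 * r.var(ddof=1) / len(r)
                  for p, r in zip(leaf_weights, leaf_residuals))
    z = norm.ppf(1.0 - alpha / 2.0)
    half_width = z * np.sqrt(var_hat)
    return mu_hat, (mu_hat - half_width, mu_hat + half_width)
```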

3. Algorithmic Details

The greedy tree construction for PART is inspired by CART, but with a loss function tailored to estimator variance (rather than predictive accuracy). At each node, candidate splits are searched across quantile thresholds of the feature axes, and selection is based on minimizing VMS.

Leaf residuals and weights are computed from available data:

  • For labeled samples: mean residual $\overline{r}_\ell = \frac{1}{n_\ell} \sum_{(x, y) \in \mathcal{L} \cap \mathcal{R}_\ell} (y - f(x))$
  • For unlabeled samples: region weight $p_\ell = \bigl|\{\widetilde{x}_j \in \mathcal{U} \cap \mathcal{R}_\ell\}\bigr| / N$

This process yields a partition-adaptive augmentation, sensitive to local model bias.
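
The full greedy construction can be summarized in a short recursive sketch. The depth and minimum-leaf-size defaults, the quantile threshold grid, and the leaf record format below are illustrative assumptions, not values prescribed by the paper:

```python
import numpy as np

def grow_part_tree(X_lab, resid, X_unlab, depth=0, max_depth=3, min_leaf=10):
    """Greedy, CART-style partitioning that minimizes VMS rather than prediction
    error. Returns a flat list of leaves; each leaf keeps its labeled residuals
    and its unlabeled-sample count (divide the count by the total N for p_ell)."""
    def make_leaf():
        return [{"resid": resid, "n_unlab": len(X_unlab)}]

    if depth >= max_depth or len(resid) < 2 * min_leaf or len(X_unlab) == 0:
        return make_leaf()

    best = None
    for k in range(X_lab.shape[1]):
        for s in np.quantile(X_lab[:, k], np.linspace(0.1, 0.9, 9)):
            left = X_lab[:, k] <= s
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue
            p_left = np.mean(X_unlab[:, k] <= s)        # region weight from unlabeled pool
            score = (p_left**2 * resid[left].var(ddof=1) / left.sum()
                     + (1 - p_left)**2 * resid[~left].var(ddof=1) / (~left).sum())
            if best is None or score < best[0]:
                best = (score, k, s)

    if best is None:
        return make_leaf()
    _, k, s = best
    lab_left, unlab_left = X_lab[:, k] <= s, X_unlab[:, k] <= s
    return (grow_part_tree(X_lab[lab_left], resid[lab_left], X_unlab[unlab_left],
                           depth + 1, max_depth, min_leaf)
            + grow_part_tree(X_lab[~lab_left], resid[~lab_left], X_unlab[~unlab_left],
                             depth + 1, max_depth, min_leaf))
```

The resulting leaves can then be passed to an aggregation routine such as the part_estimate sketch above, with each $p_\ell$ obtained by dividing the leaf's unlabeled count by $N$.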

4. Performance and Empirical Evaluation

PART's empirical superiority over global correction methods (PPI, PPI++) is demonstrated on real-world datasets:

  • In ecology (e.g., satellite-based estimates of deforestation rates), PART yields tighter confidence intervals and higher coverage.
  • In astronomy (fraction of spiral galaxies), and census (demographic ratio estimation), the method robustly combines gold-standard samples and ML predictions for increased reliability.
  • In protein property prediction, PART gives higher-confidence odds-ratio estimates.

By correcting residuals locally and leveraging large unlabeled pools for robust regional weighting, PART achieves a marked reduction in confidence interval length while maintaining reliable coverage.

5. Asymptotic Theory and the PAQ Limit

The Prediction-Augmented Quadrature (PAQ) estimator arises as a limiting case of PART when the tree depth is sent to infinity and each region contains a minimal number of labeled samples. The bias and variance of PAQ satisfy:

$$\bigl|\,\mathbb{E}[\hat{\mu}_{\mathrm{PAQ}}] - \mathbb{E}[Y]\,\bigr| = O(n^{-2}), \qquad \mathrm{Var}(\hat{\mu}_{\mathrm{PAQ}}) = O(N^{-1} + n^{-4})$$

By contrast, global methods achieve only $O(N^{-1} + n^{-1})$ variance. The $n^{-4}$ term is enabled by high-order error cancellation in smooth residual regimes, reflecting the utility of partitioned quadrature for efficient debiasing.

A plausible implication is that in domains where $r(x)$ is smooth, deep PART (or PAQ) offers large polynomial-rate gains in statistical efficiency over prior estimators.

6. Connections to Related Tree-Based Methods

PART integrates concepts from broader tree-based prediction augmentation:

  • The adaptive partitioning and local residual correction strategy shares foundational ideas with Sparse Residual Trees and Forests (Xu et al., 2019), which optimize hierarchical residual refinement for scattered data.
  • In high-dimensional and deep-tree scenarios, using complete tree proposals as in Particle Gibbs (Lakshminarayanan et al., 2015) is advantageous for posterior exploration and uncertainty-aware prediction augmentation.
  • Connection to probabilistic trees (Quentin et al., 7 Feb 2025) is evident when targeting distributional outputs and calibrated intervals, suggesting PART could be extended for distributional inference.

The estimator’s design is general, making it applicable for reliable inference pipelines in scientific discovery contexts where ML predictors and small ground-truth sets coexist and estimator confidence is paramount.

7. Comparative Summary

| Estimator | Correction strategy | Variance rate |
|---|---|---|
| PPI | Global (mean residual) | $O(N^{-1} + n^{-1})$ |
| PART | Partitioned (per-leaf tree residuals) | $O(N^{-1} + n^{-1})$ |
| Deep PART / PAQ | Infinitesimal partitions | $O(N^{-1} + n^{-4})$ |

PART delivers improved error calibration via adaptive residual partitioning, making it well-suited for settings with structured model bias and limited labeled data.

8. Concluding Remarks

PART represents a synthesis of learning-augmented estimation, adaptive bias correction, and decision tree methodology. By enabling localized estimator corrections and leveraging the abundance of unlabeled data for robust regional weighting, PART advances the state of the art in statistical inference with ML integration. Its asymptotic normality and variance reduction results, especially the $O(n^{-4})$ rate of PAQ under smoothness, highlight its utility in modern scientific and analytical pipelines (Kher et al., 19 Oct 2025).
