Calibrated Prediction-Powered Inference

Updated 4 July 2026

Calibrated prediction-powered inference is a framework that combines limited high-quality labels with extensive pseudo-labeled data, using rectification to produce unbiased estimators.
It employs explicit calibration conditions—such as score calibration and probability calibration—to ensure valid uncertainty quantification and efficiency across various inferential and decision-making tasks.
The approach extends classical survey-sampling methods with inverse-probability weighting, empirical likelihood, and Bayesian corrections to maintain nominal coverage even when pseudo-data dominate.

Calibrated prediction-powered inference denotes a class of inferential and decision procedures that combine a small gold-standard labeled sample with abundant prediction-only or pseudo-labeled data, while imposing explicit calibration conditions so that uncertainty quantification or downstream performance guarantees remain valid. In the classical statistical formulation, the target parameter is estimated by combining a prediction term with a labeled rectifier; in related algorithmic formulations, calibrated probabilities are inserted directly into online policies, with guarantees stated in terms of calibration error, mean-squared error, or score alignment (Datta et al., 13 Aug 2025, Lee et al., 7 Jun 2026, Shen et al., 5 Feb 2025). Across this literature, calibration appears in several distinct technical forms—design-based validity under unequal labeling, score calibration for semiparametric efficiency, calibrated empirical distributions in empirical likelihood, and probability calibration for online decisions—but the common principle is that predictions are not treated as truth and must be corrected, weighted, or constrained.

1. Foundations: rectification, estimands, and survey-sampling roots

The canonical prediction-powered setup observes covariates $X_i$ for a large population, gold-standard outcomes $Y_i$ for only a labeled subset, and predictions $\hat Y_i = f(X_i)$ for all units. In the finite-population mean problem, the estimand is

$\theta^* = \theta_N = \frac{1}{N}\sum_{i=1}^N Y_i.$

The basic PPI estimator is

$\hat{\theta}_{\mathrm{PPI}} = \underbrace{\frac{1}{N}\sum_{i=1}^N \hat Y_i}_{\text{prediction term}} - \underbrace{\frac{1}{n_{\mathrm{lab}}}\sum_{i:R_i=1}(\hat Y_i-Y_i)}_{\text{rectifier term}},$

where $R_i\in\{0,1\}$ indicates whether $Y_i$ is observed (Datta et al., 13 Aug 2025). Under simple random labeling, the rectifier is an unbiased estimate of the population mean prediction error, so the resulting estimator is unbiased for $\theta^*$ regardless of the quality of $f$ (Datta et al., 13 Aug 2025).
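As a minimal sketch of the estimator above, the following NumPy code computes the prediction term and the rectifier, then checks by simulation that the estimate stays centered on the population mean even when the predictor is badly biased. The function name `ppi_mean`, the synthetic population, and the +3 offset in the predictor are illustrative assumptions, not taken from the cited papers. The interval uses the textbook design-based variance of a difference estimator under simple random sampling without replacement, which is an added assumption rather than a formula stated in the text.

```python
import numpy as np

def ppi_mean(y_hat, y, labeled, z=1.96):
    """Rectified estimate of a finite-population mean with a normal-approximation interval.

    y_hat   : predictions for all N units
    y       : outcomes; only entries with labeled == True are used
    labeled : boolean mask playing the role of R_i
    """
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    labeled = np.asarray(labeled, dtype=bool)
    n, N = labeled.sum(), labeled.size
    resid = y_hat[labeled] - y[labeled]          # labeled prediction errors
    estimate = y_hat.mean() - resid.mean()       # prediction term minus rectifier
    # Assumed variance: simple random sampling without replacement,
    # where only the rectifier is random.
    se = np.sqrt((1 - n / N) * resid.var(ddof=1) / n)
    return estimate, (estimate - z * se, estimate + z * se)

rng = np.random.default_rng(0)
N, n_lab = 5000, 200
y_pop = rng.normal(10.0, 2.0, size=N)
y_hat_pop = y_pop + 3.0 + rng.normal(0.0, 1.0, size=N)  # predictor off by about +3
true_mean = y_pop.mean()

estimates, covered = [], []
for _ in range(2000):
    mask = np.zeros(N, dtype=bool)
    mask[rng.choice(N, size=n_lab, replace=False)] = True  # simple random labeling
    est, (lo, hi) = ppi_mean(y_hat_pop, y_pop, mask)
    estimates.append(est)
    covered.append(lo <= true_mean <= hi)

avg_estimate = float(np.mean(estimates))
coverage = float(np.mean(covered))
naive = float(y_hat_pop.mean())
print(f"true {true_mean:.3f}  PPI avg {avg_estimate:.3f}  naive {naive:.3f}  coverage {coverage:.3f}")
```

The naive prediction mean inherits the predictor's offset, while the rectified estimate does not, which is the sense in which validity does not depend on the quality of $f$.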

This rectified structure is not restricted to means. In general convex-loss formulations, the parameter of interest is defined by

$\theta^\star = \arg\min_{\theta\in\mathbb{R}^d}\mathbb{E}[\mathcal{L}_\theta(X,Y)],$

with first-order condition

$\mathbb{E}\big[\nabla_\theta \mathcal{L}_{\theta^\star}(X,Y)\big] = 0.$

PPI decomposes the population gradient $\mathbb{E}[\nabla_\theta \mathcal{L}_\theta(X,Y)]$ into a prediction-based term and a rectifier term, then estimates each separately (Cortinovis et al., 4 Feb 2025). This extends the framework beyond simple means to general $M$-estimation targets.
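For a concrete instance beyond the mean, squared-error loss with a linear model gives a closed-form rectified estimator: a least-squares fit of the predictions on all units, minus a least-squares fit of the prediction errors on the labeled units. The sketch below assumes this squared-loss special case and the convention used above, in which the prediction term runs over all $N$ units. `ppi_ols` and the toy data are illustrative names, not part of the cited work.

```python
import numpy as np

def ppi_ols(X, y_hat, y, labeled):
    """Rectified least-squares coefficients (squared-loss special case).

    X       : (N, d) design matrix for all units
    y_hat   : predictions for all N units
    y       : outcomes; entries for unlabeled units are never read
    labeled : boolean mask of length N
    """
    X = np.asarray(X, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    labeled = np.asarray(labeled, dtype=bool)
    # Prediction term: regress predictions on covariates using every unit.
    coef_pred, *_ = np.linalg.lstsq(X, y_hat, rcond=None)
    # Rectifier: regress labeled prediction errors on the labeled covariates.
    coef_rect, *_ = np.linalg.lstsq(X[labeled], y_hat[labeled] - y[labeled], rcond=None)
    return coef_pred - coef_rect

rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y_true = X @ np.array([1.0, 2.0])            # noiseless outcomes for a clean check
y_hat = X @ np.array([0.5, 3.0])             # systematically wrong predictor
labeled = np.zeros(N, dtype=bool)
labeled[:20] = True
y_obs = np.where(labeled, y_true, np.nan)    # unlabeled outcomes are missing

theta = ppi_ols(X, y_hat, y_obs, labeled)
print(theta)
```

In this noiseless toy example the rectifier removes the predictor's coefficient error exactly; with noisy outcomes the correction holds only on average.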

A central historical observation is that the standard PPI estimator for a mean is algebraically identical to the difference estimator of Cassel et al. (1976), while PPI++ is algebraically identical to the generalized regression estimator (GREG), with the tuning parameter $\lambda$ identified with the sample regression coefficient (Mozer, 19 Mar 2026). In the survey-sampling view, the labeled sample is a probability sample $S$, predictions are auxiliary variables known for all $N$ units, and the rectifier is a model-assisted calibration device rather than a model-based imputation step. This equivalence clarifies that PPI validity is fundamentally design-based whenever the labeling mechanism is controlled.
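The correspondence can be illustrated with a small sketch. Writing the tuning parameter as `lam`, the power-tuned mean estimator is the labeled mean of $Y$ plus `lam` times the gap between the all-unit and labeled means of the predictions; `lam = 1` gives the difference estimator, and setting `lam` to the labeled-sample regression slope of $Y$ on $\hat Y$ gives a GREG-style estimator. The function name and the toy predictor are assumptions for illustration, and the finite-sample correction factors used in some PPI++ variants are omitted.

```python
import numpy as np

def power_tuned_mean(y_hat, y, labeled, lam=None):
    """Mean estimate of the form  mean_lab(Y) + lam * (mean_all(Yhat) - mean_lab(Yhat)).

    lam=None uses the labeled-sample regression slope of Y on Yhat (GREG-style).
    lam=1 reproduces the plain difference (PPI) estimator.
    """
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    labeled = np.asarray(labeled, dtype=bool)
    y_lab, f_lab = y[labeled], y_hat[labeled]
    if lam is None:
        lam = np.cov(y_lab, f_lab, ddof=1)[0, 1] / f_lab.var(ddof=1)
    return y_lab.mean() + lam * (y_hat.mean() - f_lab.mean()), lam

rng = np.random.default_rng(0)
N = 1000
y_pop = rng.normal(5.0, 1.0, size=N)
y_hat_pop = 2.0 * y_pop + 1.0                 # informative but badly scaled predictor
mask = np.zeros(N, dtype=bool)
mask[rng.choice(N, size=50, replace=False)] = True

tuned, lam_hat = power_tuned_mean(y_hat_pop, y_pop, mask)
plain, _ = power_tuned_mean(y_hat_pop, y_pop, mask, lam=1.0)
print(f"population mean {y_pop.mean():.4f}  tuned {tuned:.4f} (lam={lam_hat:.2f})  plain {plain:.4f}")
```

Because the toy predictor is an exact linear function of the outcome, the regression-tuned estimate recovers the population mean exactly, while the untuned difference estimator still carries sampling error.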

The same literature also distinguishes finite-population and superpopulation interpretations. In the design-based formulation, $Y_i$ and $\hat Y_i$ are fixed and randomness comes from the sampling design; in the superpopulation formulation, $(X_i, Y_i)$ are i.i.d. draws. The two views yield asymptotically compatible variance formulas when $n_{\mathrm{lab}}/N \to 0$, but they imply different targets and different cautionary points for subgroup estimands and causal contrasts (Mozer, 19 Mar 2026).

2. Calibration as a family of conditions

One recurring source of confusion is that “calibration” is not a single condition. The recent literature uses the term for several non-equivalent properties, each tied to a particular inferential or algorithmic objective.

In online algorithm design, the calibration condition is a probability-calibration identity. For a target $T(I)$ and predictor $f(X)$, the predictor is calibrated over a distribution $\mathcal{D}$ if

$\mathbb{E}_{\mathcal{D}}[\,T(I)\mid f(X)\,] = f(X).$

For binary targets this becomes

$\Pr_{\mathcal{D}}[\,T(I)=1\mid f(X)=v\,] = v \quad \text{for every } v \text{ in the range of } f,$

and approximate calibration is quantified by the max calibration error

$\alpha = \max_{v\in\mathrm{range}(f)} \big|\,\mathbb{E}_{\mathcal{D}}[\,T(I)\mid f(X)=v\,] - v\,\big|.$

This condition allows conditional event probabilities to be controlled by $f(X)$ itself, which is the key step in the ski-rental and job-scheduling analyses (Shen et al., 5 Feb 2025).
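A minimal sketch of these quantities for a predictor that takes finitely many values: the max calibration error is the largest gap, over predicted values, between the predicted value and the empirical frequency of the binary target among units receiving that prediction. The helper names, the simple histogram-binning recalibrator, and the synthetic data are illustrative assumptions; scores are assumed to lie in [0, 1].

```python
import numpy as np

def max_calibration_error(pred, target):
    """Largest |mean(target | pred == v) - v| over the distinct predicted values v."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return max(abs(target[pred == v].mean() - v) for v in np.unique(pred))

def histogram_calibrate(scores, target, n_bins=10):
    """Replace each score by the empirical target frequency in its equal-width bin."""
    scores = np.asarray(scores, dtype=float)
    target = np.asarray(target, dtype=float)
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    out = np.empty_like(scores)
    for b in np.unique(bins):
        out[bins == b] = target[bins == b].mean()
    return out

rng = np.random.default_rng(0)
p_true = rng.uniform(size=20000)
target = (rng.uniform(size=p_true.size) < p_true).astype(float)
raw = np.round(p_true ** 2, 1)                # miscalibrated scores on a 0.1 grid

mce_raw = max_calibration_error(raw, target)
calibrated = histogram_calibrate(raw, target)
mce_cal = max_calibration_error(calibrated, target)
print(f"max calibration error: raw {mce_raw:.3f}, after binning {mce_cal:.3f}")
```

The recalibrated error is zero here only because it is evaluated on the same data used to fit the bins; in practice the error that matters is the one measured on held-out data.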

In semiparametric PPI theory, calibration is formulated at the level of the estimating equation rather than the outcome. Let $\psi(X,Y;\theta)$ be the full-data estimating function and define

$m^\star(X) = \mathbb{E}[\,\psi(X,Y;\theta^\star)\mid X\,].$

A predictor $f$ is score-calibrated at the truth if

$\psi(X,f(X);\theta^\star) = m^\star(X) \quad \text{almost surely}.$

When this holds, the PPI influence function matches the efficient influence function, and PPI attains the semiparametric efficiency bound (Lee et al., 7 Jun 2026). In the mean case, score calibration reduces to the usual regression target $f(X) = \mathbb{E}[Y\mid X]$.

A different use of calibration appears in post-prediction inference. There the labeled sample is used to fit a relationship model

$\theta^* = \theta_N = \frac{1}{N}\sum_{i=1}^N Y_i.$ 2

Original PostPI effectively assumes $\theta^* = \theta_N = \frac{1}{N}\sum_{i=1}^N Y_i.$ 3; the moment-based generalization replaces that assumption with an explicit estimate of $\theta^* = \theta_N = \frac{1}{N}\sum_{i=1}^N Y_i.$ 4, yielding

$\theta^* = \theta_N = \frac{1}{N}\sum_{i=1}^N Y_i.$ 5

and a variance formula containing the factor $\theta^* = \theta_N = \frac{1}{N}\sum_{i=1}^N Y_i.$ 6 so that calibration-stage uncertainty does not spuriously vanish when $\theta^* = \theta_N = \frac{1}{N}\sum_{i=1}^N Y_i.$ 7 (Salerno et al., 12 Jul 2025). The authors show that naive and original PostPI can fail to control Type I error when unlabeled data dominate or when the prediction model does not capture $\theta^* = \theta_N = \frac{1}{N}\sum_{i=1}^N Y_i.$ 8 (Salerno et al., 12 Jul 2025).

This suggests that calibrated prediction-powered inference is best understood as a layered concept: probability calibration governs local decision quality, score calibration governs efficiency of rectified estimating equations, and calibration-stage variance accounting governs whether nominal coverage is preserved when prediction-only data are much more abundant than labels.

3. Informative labeling, inverse-probability correction, and subgroup calibration

Standard PPI assumes simple random labeling. When the probability of observing $Y_i$ varies across units,

$\pi_i = \Pr(R_i = 1 \mid X_i),$

the unweighted residual average is biased for the population-average residual, and naive PPI no longer guarantees design-unbiasedness or calibrated coverage (Datta et al., 13 Aug 2025). The inverse-probability-weighted extension replaces the original rectifier by Horvitz–Thompson or Hájek forms: $\hat\Delta_{\mathrm{HT}} = \frac{1}{N}\sum_{i=1}^N w_i R_i(\hat Y_i - Y_i)$ and $\hat\Delta_{\text{Hájek}} = \frac{\sum_{i=1}^N w_i R_i(\hat Y_i - Y_i)}{\sum_{i=1}^N w_i R_i}$, where $w_i = 1/\pi_i$. The resulting estimators

$\hat\theta_{\mathrm{HT}} = \frac{1}{N}\sum_{i=1}^N \hat Y_i - \hat\Delta_{\mathrm{HT}}, \qquad \hat\theta_{\text{Hájek}} = \frac{1}{N}\sum_{i=1}^N \hat Y_i - \hat\Delta_{\text{Hájek}}$

are design-unbiased (HT) or approximately unbiased (Hájek) under MAR with correct propensities (Datta et al., 13 Aug 2025). When $\pi_i$ is constant, the method reduces to standard PPI.
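The effect of the weighting can be sketched as follows, assuming known labeling probabilities and independent (Poisson) labeling of units. The function name `ipw_ppi_mean`, the synthetic population, and the particular propensity model are illustrative assumptions. The predictor's error grows with a covariate that also raises the labeling probability, so the unweighted rectifier over-represents large errors.

```python
import numpy as np

def ipw_ppi_mean(y_hat, y, labeled, pi, hajek=True):
    """Prediction mean minus an inverse-probability-weighted rectifier.

    pi    : labeling probabilities, assumed known and strictly positive
    hajek : True normalizes by the sum of weights, False by N (Horvitz-Thompson)
    """
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    labeled = np.asarray(labeled, dtype=bool)
    w = labeled / np.asarray(pi, dtype=float)       # R_i / pi_i, zero if unlabeled
    resid = np.where(labeled, y_hat - y, 0.0)       # unlabeled outcomes never used
    denom = w.sum() if hajek else labeled.size
    return y_hat.mean() - (w * resid).sum() / denom

rng = np.random.default_rng(1)
N = 5000
x = rng.uniform(size=N)
y_pop = x + rng.normal(0.0, 0.5, size=N)
y_hat = y_pop + 2.0 * x                             # prediction error grows with x
pi = 0.05 + 0.4 * x                                 # labeling also more likely for large x
true_mean = y_pop.mean()

hajek_est, ht_est, unweighted_est = [], [], []
for _ in range(500):
    labeled = rng.uniform(size=N) < pi              # independent labeling per unit
    hajek_est.append(ipw_ppi_mean(y_hat, y_pop, labeled, pi, hajek=True))
    ht_est.append(ipw_ppi_mean(y_hat, y_pop, labeled, pi, hajek=False))
    unweighted_est.append(y_hat.mean() - (y_hat[labeled] - y_pop[labeled]).mean())

avg_hajek, avg_ht = float(np.mean(hajek_est)), float(np.mean(ht_est))
avg_unweighted = float(np.mean(unweighted_est))
print(f"true {true_mean:.3f}  Hajek {avg_hajek:.3f}  HT {avg_ht:.3f}  unweighted {avg_unweighted:.3f}")
```

With a constant propensity the Hájek form collapses to the unweighted labeled average, which matches the reduction to standard PPI noted above.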

The survey-sampling interpretation makes this correction natural. In the missing-data notation, with $\pi_i$ denoting the probability that $Y_i$ is observed, the Horvitz–Thompson and Hájek estimators of the mean are

$\hat{\bar Y}_{\mathrm{HT}} = \frac{1}{N}\sum_{i=1}^N \frac{R_i Y_i}{\pi_i}, \qquad \hat{\bar Y}_{\text{Hájek}} = \frac{\sum_{i=1}^N R_i Y_i/\pi_i}{\sum_{i=1}^N R_i/\pi_i},$

and IPW-PPI simply applies the same logic to the residual term rather than directly to $Y_i$ (Datta et al., 13 Aug 2025). The survey-sampling literature also supplies calibration estimators and domain-specific diagnostics, which the recent PPI literature identifies as part of PPI’s statistical ancestry (Mozer, 19 Mar 2026).

A further complication is subgroup-specific error. If one uses a pooled rectifier for an average treatment effect, the bias is

$\mathrm{Bias} = \delta_1 - \delta_0,$

where $\delta_a = \mathbb{E}[\hat Y_i - Y_i \mid A_i = a]$ is the arm-specific prediction error for treatment arm $A_i = a \in \{0,1\}$ (Mozer, 19 Mar 2026). This bias does not vanish merely by increasing the number of labels if differential prediction error persists across arms. The remedy is to use arm-specific rectifiers,

$\hat\Delta_a = \frac{1}{n_a}\sum_{i:\,R_i=1,\,A_i=a}(\hat Y_i - Y_i), \qquad a\in\{0,1\},$

with $n_a$ the number of labeled units in arm $a$, so that treatment-specific residuals are calibrated separately (Mozer, 19 Mar 2026). A plausible implication is that “calibrated PPI” for causal or subgroup estimands should generally mean calibration within relevant strata, not only overall calibration.
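The pooled-versus-arm-specific contrast can be checked on a toy example. In the sketch below the predictor overshoots by a constant 0.5 in the treated arm and is exact in the control arm, so the pooled rectifier cancels in the contrast and leaves the full differential error in place. The function name, the arm coding, and the constant-error setup are illustrative assumptions chosen to make the arithmetic exact.

```python
import numpy as np

def rectified_ate(y_hat, y, labeled, arm, arm_specific=True):
    """Difference of rectified arm means, with a pooled or per-arm rectifier."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    labeled = np.asarray(labeled, dtype=bool)
    arm = np.asarray(arm)
    pooled = (y_hat[labeled] - y[labeled]).mean()
    means = []
    for a in (0, 1):
        in_arm = arm == a
        lab = labeled & in_arm
        rect = (y_hat[lab] - y[lab]).mean() if arm_specific else pooled
        means.append(y_hat[in_arm].mean() - rect)
    return means[1] - means[0]

rng = np.random.default_rng(0)
N = 2000
arm = rng.integers(0, 2, size=N)
y = 1.0 * arm + rng.normal(size=N)
y_hat = y + np.where(arm == 1, 0.5, 0.0)      # error 0.5 in arm 1, none in arm 0
labeled = np.zeros(N, dtype=bool)
labeled[rng.choice(N, size=200, replace=False)] = True

target = y[arm == 1].mean() - y[arm == 0].mean()
ate_arm = rectified_ate(y_hat, y, labeled, arm, arm_specific=True)
ate_pooled = rectified_ate(y_hat, y, labeled, arm, arm_specific=False)
print(f"target {target:.3f}  arm-specific {ate_arm:.3f}  pooled {ate_pooled:.3f}")
```

Adding more labels would not shrink the pooled estimate's gap in this example, because the gap equals the difference in arm-level prediction error rather than a sampling fluctuation.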

4. Generalized constructions: moment correction, empirical likelihood, and Bayes-assisted rectification

Several recent extensions move beyond the baseline difference-estimator form while preserving the core principle that predictions must be bias-corrected rather than trusted.

The moment-based generalization to PostPI rewrites the target coefficient as

$R_i\in\{0,1\}$ 0

then estimates both terms instead of dropping $R_i\in\{0,1\}$ 1. Its asymptotic covariance is

$R_i\in\{0,1\}$ 2

where $R_i\in\{0,1\}$ 3 and $R_i\in\{0,1\}$ 4. The $R_i\in\{0,1\}$ 5 factor is the paper’s explicit device for preserving calibration variability; without it, uncertainty from the labeled calibration stage is underestimated when $R_i\in\{0,1\}$ 6 (Salerno et al., 12 Jul 2025).

Empirical-likelihood-based PPI replaces explicit rectifier algebra by a calibrated empirical distribution over labeled units. The empirical likelihood maximizes $\prod_{i:R_i=1} p_i$ subject to

$p_i \ge 0, \qquad \sum_{i:R_i=1} p_i = 1, \qquad \sum_{i:R_i=1} p_i\, g(X_i,Y_i;\theta) = 0, \qquad \sum_{i:R_i=1} p_i\, h(X_i) = 0,$

where $g$ is the supervised estimating equation and $h$ is a centered auxiliary moment built from predictions (Wang et al., 18 Dec 2025). The resulting weights define a calibrated empirical distribution

$\hat P = \sum_{i:R_i=1} \hat p_i\, \delta_{(X_i,Y_i)}$

that simultaneously satisfies supervised and prediction-based moment constraints. Under the paper’s regularity conditions, the estimator is asymptotically normal, has asymptotic variance no larger than the fully supervised estimator, and attains the semiparametric efficiency bound when the auxiliary span contains the predictable component of the supervised score (Wang et al., 18 Dec 2025). Empirical-likelihood-ratio statistics have chi-squared-type limits, which yields calibrated confidence sets without relying solely on Wald approximations.

A distinct route is Bayes-assisted rectifier inference. FAB-PPI places a prior on the rectifier

$\Delta_\theta = \mathbb{E}\big[\nabla_\theta\mathcal{L}_\theta(X,Y) - \nabla_\theta\mathcal{L}_\theta(X,f(X))\big],$

then uses the FAB construction to obtain a confidence region for $\Delta_\theta$ with pointwise frequentist coverage and shorter expected length near high-prior regions (Cortinovis et al., 4 Feb 2025). With a horseshoe prior, the method shrinks strongly toward $0$ when the rectifier is small but asymptotically reverts to standard PPI in the tails; the paper further shows that the horseshoe version is consistent whenever PPI is consistent, whereas the Gaussian-prior version is not (Cortinovis et al., 4 Feb 2025). This directly addresses a common misconception: introducing prior information does not necessarily sacrifice frequentist calibration, but prior choice matters sharply for robustness.

Bayesian PPI takes a broader Monte Carlo approach. For basic difference estimators, Bayesian credible intervals are reported to be “virtually identical” to the classical PPI intervals; the same framework supports task-specific proxy estimands such as chain-rule decompositions for discrete autoraters and stratified estimators for nonlinear autorater–human relationships (Hofer et al., 2024). In the reported synthetic coverage studies, chain-rule PPI achieved empirical coverage near 95% at nominal 95%, including stress tests that condition on problematic parameter values (Hofer et al., 2024).

5. Sequential, risk-controlling, and online decision extensions

The fixed-time PPI literature has been extended in three directions: anytime-valid sequential inference, semi-supervised risk control for prediction sets, and algorithmic decision rules driven by calibrated probabilities.

Anytime-valid, Bayes-assisted PPI considers streaming labeled and unlabeled data. It retains the decomposition

$\mathbb{E}\big[\nabla_\theta\mathcal{L}_\theta(X,Y)\big] = m_\theta + \Delta_\theta,$

with

$m_\theta = \mathbb{E}\big[\nabla_\theta\mathcal{L}_\theta(X,f(X))\big], \qquad \Delta_\theta = \mathbb{E}\big[\nabla_\theta\mathcal{L}_\theta(X,Y) - \nabla_\theta\mathcal{L}_\theta(X,f(X))\big],$

and constructs confidence sequences using Ville’s inequality and the method of mixtures (Kilian et al., 23 May 2025). The rectifier confidence sequence is Bayes-assisted through a prior on the standardized $\Delta_\theta$, while the fit term $m_\theta$ receives a standard asymptotic confidence sequence. The resulting intervals are valid uniformly over time, so optional stopping and continuous monitoring do not invalidate coverage (Kilian et al., 23 May 2025).

Semi-supervised risk control adapts PPI to prediction sets and other rule-tuning problems. In the general RCPS construction, a hyper-parameter $\lambda$ is chosen so that the risk $R(\lambda)$ stays below $\alpha$ with probability at least $1-\delta$ over calibration data. The semi-supervised extension replaces the labeled-only risk estimator by a PPI risk estimator with unlabeled pseudo-losses plus a labeled rectifier, and proves that the selected rule still satisfies

$\Pr\big(R(\hat\lambda) \le \alpha\big) \ge 1-\delta$

for both general bounded losses and binary losses (Einbinder et al., 2024). Data-efficient prediction-powered calibration via cross-validation then removes the labeled-data split between fine-tuning and bias correction. With $K$-fold cross-prediction, the CPPI estimator

$\theta^*$ 4

uses all labeled examples for both tasks while preserving the $(\alpha,\delta)$-reliability guarantee for the selected prediction set (Yoo et al., 27 Jul 2025).
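The cross-prediction idea can be sketched in its simplest mean-estimation form. This is not the CPPI risk estimator of the cited paper; it is an analogue under stated assumptions: a separate unlabeled sample, a one-dimensional covariate, and a least-squares line standing in for the fine-tuned predictor. Each fold's model is trained on the other folds, its errors are measured only on the held-out fold, and the unlabeled predictions are averaged over the fold models. All names are hypothetical.

```python
import numpy as np

def cross_ppi_mean(x_lab, y_lab, x_unlab, k=5, seed=0):
    """Mean estimate in which every labeled point is used for fitting and for rectification."""
    x_lab = np.asarray(x_lab, dtype=float)
    y_lab = np.asarray(y_lab, dtype=float)
    x_unlab = np.asarray(x_unlab, dtype=float)
    n = y_lab.size
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)

    def design(x):
        # Intercept plus slope; a stand-in for any trainable predictor.
        return np.column_stack([np.ones(x.size), x])

    unlab_pred = np.zeros(x_unlab.size)
    resid = np.empty(n)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        coef, *_ = np.linalg.lstsq(design(x_lab[train]), y_lab[train], rcond=None)
        # Out-of-fold prediction errors feed the rectifier.
        resid[held_out] = design(x_lab[held_out]) @ coef - y_lab[held_out]
        # Unlabeled predictions are averaged across the k fold models.
        unlab_pred += design(x_unlab) @ coef / k
    return unlab_pred.mean() - resid.mean()

rng = np.random.default_rng(0)
x_lab = rng.normal(size=60)
y_lab = 3.0 + 2.0 * x_lab                     # noiseless line for a clean check
x_unlab = rng.normal(size=1000)
estimate = cross_ppi_mean(x_lab, y_lab, x_unlab)
print(estimate, (3.0 + 2.0 * x_unlab).mean())
```

The design choice being illustrated is that no labeled point is ever scored by a model that saw it during training, so the rectifier is not contaminated by overfitting even though no labels are set aside.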

In online algorithms with predictions, calibration acts directly on decision rules rather than confidence sets. The formal setting specifies instances $\theta^*$ 6, features $\theta^*$ 7, target $\theta^*$ 8, predictor $\theta^*$ 9, and expected competitive ratio as the performance metric. For ski rental, calibration is used with target $f$ 0, max calibration error $f$ 1, and mean-squared error $f$ 2; the calibrated ski-rental algorithm chooses

$f$ 3

and the paper proves

$f$ 4

For job scheduling, the calibrated probabilities $f$ 5 enter the $f$ 6-threshold rule, and the resulting regret bounds improve as within-bucket variances $f$ 7 increase, showing that finer-grained calibrated scores reduce expected inversions and excess cost (Shen et al., 5 Feb 2025). The comparison with conformal prediction in the same paper further shows that interval coverage and calibration need not be interchangeable: in high-variance ski-rental instances, large conformal intervals can force fallback to the break-even policy, whereas calibrated probabilities remain informative (Shen et al., 5 Feb 2025).

6. Multi-task and multi-source calibration, empirical findings, and recurring limitations

Recent work has generalized calibrated PPI to many related tasks and to multiple pseudo-labeled sources. In the multi-task setting, each task $f$ 8 has its own target mean $f$ 9, labeled subset $\theta^\star = \arg\min_{\theta\in\mathbb{R}^d}\mathbb{E}[\mathcal{L}_\theta(X,Y)],$ 0, and proxy $\theta^\star = \arg\min_{\theta\in\mathbb{R}^d}\mathbb{E}[\mathcal{L}_\theta(X,Y)],$ 1. Cross-task recalibration learns a shared map $\theta^\star = \arg\min_{\theta\in\mathbb{R}^d}\mathbb{E}[\mathcal{L}_\theta(X,Y)],$ 2 on labeled data from other tasks and then plugs $\theta^\star = \arg\min_{\theta\in\mathbb{R}^d}\mathbb{E}[\mathcal{L}_\theta(X,Y)],$ 3 into task-specific PPI estimators. The main theoretical result is negative as well as positive: if $\theta^\star = \arg\min_{\theta\in\mathbb{R}^d}\mathbb{E}[\mathcal{L}_\theta(X,Y)],$ 4 is affine, then oracle power-tuned variance is unchanged, so affine recalibration is asymptotically equivalent to using the original proxy; efficiency gains beyond power-tuned PPI are possible if and only if the regression $\theta^\star = \arg\min_{\theta\in\mathbb{R}^d}\mathbb{E}[\mathcal{L}_\theta(X,Y)],$ 5 is non-affine on the observed support (Emmenegger et al., 28 May 2026). The proposed GRePPI and ARePPI procedures exploit this nonlinear structure; in a 2024 U.S. election auditing study with 72 tasks and $\theta^\star = \arg\min_{\theta\in\mathbb{R}^d}\mathbb{E}[\mathcal{L}_\theta(X,Y)],$ 6 labels per task, cross-task isotonic recalibration substantially reduced confidence interval widths when labels were scarce (Emmenegger et al., 28 May 2026).

The multi-source extension instead aggregates several pseudo-labeled datasets. In the homogeneous case, each source $\theta^\star = \arg\min_{\theta\in\mathbb{R}^d}\mathbb{E}[\mathcal{L}_\theta(X,Y)],$ 7 contributes a calibrated modified risk

$\theta^\star = \arg\min_{\theta\in\mathbb{R}^d}\mathbb{E}[\mathcal{L}_\theta(X,Y)],$ 8

and MPPI chooses weights $\theta^\star = \arg\min_{\theta\in\mathbb{R}^d}\mathbb{E}[\mathcal{L}_\theta(X,Y)],$ 9 on the simplex by minimizing

$Y_i$ 00

the asymptotic log-volume of the resulting confidence region (Li et al., 19 Jun 2026). The paper proves asymptotic normality and oracle-volume equivalence under homogeneous sampling, covariate shift, and domain shift. In the DEXA application, MPPI reduced normalized confidence-region volume from $Y_i$ 01 to $Y_i$ 02 for men and from $Y_i$ 03 to $Y_i$ 04 for women while keeping estimates close to target-only inference (Li et al., 19 Jun 2026).

The empirical record across the literature is broadly consistent. In the informative-labeling simulation for IPW-PPI with $Y_i$ 05, true mean $Y_i$ 06, and estimated propensities, weighted PPI had mean $Y_i$ 07, bias $Y_i$ 08, mean 95% width $Y_i$ 09, and coverage $Y_i$ 10, while the unweighted classic estimator was biased and undercovered (Datta et al., 13 Aug 2025). On NHANES BMI, weighted PPI had empirical bias $Y_i$ 11 and mean 95% width $Y_i$ 12, compared with HT width $Y_i$ 13 (Datta et al., 13 Aug 2025). In the online-algorithm case studies, the calibrated ski-rental policy achieved the best average competitive ratio on Citi Bike data, and histogram-calibrated sepsis probabilities yielded consistently lower excess scheduling cost than binary calibration surrogates (Shen et al., 5 Feb 2025).

At the same time, the literature converges on several limitations. Calibration assumptions are usually imposed rather than learned end-to-end. Many guarantees are asymptotic or expectation-based rather than finite-sample and distribution-free. Estimated propensities, learned cross-task recalibrators, and transport maps require correct specification or stability conditions. Over-tuning on the same labeled data can damage coverage, as shown for some regression-tree-based Bayesian PPI variants (Hofer et al., 2024). And several papers emphasize that performance gains depend on predictive structure that can actually be exploited: score calibration is required for semiparametric optimality, affine recalibration cannot improve beyond power-tuned PPI, and subgroup-specific residual error requires subgroup-specific rectification (Lee et al., 7 Jun 2026, Emmenegger et al., 28 May 2026, Mozer, 19 Mar 2026).

Taken together, these results support a precise interpretation of calibrated prediction-powered inference: predictions can improve inference or downstream decisions only when they enter through a calibrated interface—rectified residuals, score-aligned estimating equations, calibrated empirical weights, or probability-calibrated decision rules—and that interface must be engineered so that nominal validity survives the use of abundant but imperfect pseudo-information.