Recalibrated Prediction Powered Inference (RePPI)
- RePPI is a family of statistical inference methods that integrates limited gold-standard labels with abundant ML predictions to correct bias and reduce variance.
- It employs a learned recalibration mapping to adjust surrogate predictions, achieving both unbiased estimation and efficiency across frequentist and Bayesian approaches.
- RePPI extends to M-estimation, risk-controlled prediction sets, and performative prediction, offering robust performance across diverse applications.
Recalibrated Prediction Powered Inference (RePPI) is a family of statistical inference methodologies that optimally combine a small labeled/gold-standard dataset with a large bank of ML predictions to produce estimators and confidence intervals with guaranteed validity and minimal variance. RePPI generalizes earlier prediction-powered inference (PPI) frameworks by introducing a learned recalibration component—typically, a mapping or model that projects ML predictions onto the true outcome space—thereby controlling both bias and variance, even when the surrogate predictions are systematically imperfect. This framework extends from population means to M-estimation, risk-controlled prediction sets, sub-instance evaluation metrics, and performative (feedback-loop) settings, and can be instantiated with either frequentist or fully Bayesian recalibration procedures.
1. Conceptual Foundations
RePPI is grounded in the prediction-powered inference paradigm, where predictions from a (potentially biased) automatic system are supplemented with a small set of labeled or gold-standard responses to debias statistical estimates and achieve greater efficiency.
- Standard PPI: Constructs unbiased estimators and valid confidence sets by rectifying the bias induced by replacing the true outcome Y with the ML prediction f(X), using the average residual Y − f(X) computed from the labeled data (Angelopoulos et al., 2023).
- Limitation: If f(X) is miscalibrated or exhibits systematic bias, naive plug-in PPI may fail to reduce variance over classical estimators and may even perform worse.
- Recalibration principle: RePPI seeks a mapping g, learned from the labeled data, that minimizes the mean squared error between the recalibrated predictions g(f(X)) and the observed outcomes Y. Plugging this recalibrated surrogate into estimation guarantees both unbiasedness and minimal asymptotic variance (Ji et al., 16 Jan 2025, Chen et al., 8 Jan 2026, Hofer et al., 2024).
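A minimal sketch of the recalibration principle for a population mean, on synthetic data. The linear recalibration map g below is an illustrative choice; isotonic or spline fits are also common in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: true outcomes Y and a systematically biased surrogate f.
n_lab, n_unl = 200, 10_000
y_lab = rng.normal(5.0, 2.0, n_lab)
f_lab = 0.5 * y_lab + 1.0 + rng.normal(0, 0.3, n_lab)
y_unl = rng.normal(5.0, 2.0, n_unl)
f_unl = 0.5 * y_unl + 1.0 + rng.normal(0, 0.3, n_unl)

# Learn g by least squares on the labeled set, minimizing MSE between
# the recalibrated predictions g(f) and the observed outcomes Y.
slope, intercept = np.polyfit(f_lab, y_lab, 1)

def g(f):
    return slope * f + intercept

# Recalibrated estimator of E[Y]: imputed mean plus labeled residual
# correction. (In-sample least-squares residuals average to zero here;
# sample-splitting keeps the correction honest for flexible g.)
theta_hat = g(f_unl).mean() + (y_lab - g(f_lab)).mean()
```

Note that the naive surrogate mean `f_unl.mean()` would be badly biased here, while the recalibrated estimator recovers the true mean of 5.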
2. Methodological Frameworks
2.1 M-Estimation and Imputed Loss
Given labeled data {(X_i, Y_i)}, i = 1, …, n, and unlabeled data {X_j}, j = 1, …, N, with ML predictions f(X_j), the estimation target is typically expressed as the M-estimation problem θ* = argmin_θ E[ℓ(θ; X, Y)] for a convex loss ℓ.
RePPI proceeds via the following steps (Ji et al., 16 Jan 2025, Song et al., 28 Jan 2026):
- Imputation Learning: Fit a regression ĝ on the labeled set to approximate the conditional expectation E[Y | f(X), X] (or, for general losses, the conditional expected loss given f(X) and X).
- Estimator Construction: Use the recalibrated imputed loss ℓ(θ; X, ĝ(X)) in place of the naive surrogate ℓ(θ; X, f(X)) to define θ̂ = argmin_θ { (1/N) Σ_j ℓ(θ; X_j, ĝ(X_j)) + (1/n) Σ_i [ℓ(θ; X_i, Y_i) − ℓ(θ; X_i, ĝ(X_i))] }.
- Bias Correction: Optionally, augment with bias corrections on the labeled set (see influence-function approaches and efficient augmentation (Song et al., 28 Jan 2026, Zhang et al., 3 Feb 2026)).
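The steps above can be sketched for a least-squares M-estimation target θ* = argmin_θ E[(Y − θX)²], where setting the derivative of the recalibrated objective to zero yields a closed form. All data are synthetic, and the linear imputation model ĝ is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 2.0

def draw(n):
    x = rng.uniform(1.0, 3.0, n)
    y = theta_true * x + rng.normal(0, 0.5, n)
    f = 0.7 * y + 0.8 + rng.normal(0, 0.3, n)   # biased, noisy ML surrogate
    return x, y, f

x_lab, y_lab, f_lab = draw(300)
x_unl, _, f_unl = draw(20_000)

# Step 1 (imputation learning): regress Y on (f, X) over the labeled set.
A = np.column_stack([f_lab, x_lab, np.ones_like(x_lab)])
coef, *_ = np.linalg.lstsq(A, y_lab, rcond=None)

def ghat(f, x):
    return coef[0] * f + coef[1] * x + coef[2]

# Steps 2-3 (recalibrated imputed loss + labeled bias correction): the
# first-order condition of the combined objective gives this closed form.
g_unl, g_lab = ghat(f_unl, x_unl), ghat(f_lab, x_lab)
num = np.mean(x_unl * g_unl) + np.mean(x_lab * (y_lab - g_lab))
den = np.mean(x_unl ** 2)
theta_hat = num / den
```

The second term of `num` is the labeled bias correction; it vanishes when ĝ is an in-sample least-squares fit but protects the estimator when ĝ is misspecified or fit out of fold.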
2.2 Bayesian Recalibration
A fully Bayesian RePPI formalism posits a latent calibration parameter in a generative model relating ML scores s_i and human labels y_i, e.g., y_i ~ Bernoulli(σ(a·s_i + b)) with a logistic calibration σ and parameters (a, b). Posterior inference yields a distribution over the proxy population mean θ = (1/N) Σ_j σ(a·s_j + b) (Hofer et al., 2024); Monte Carlo sampling over posterior draws of the calibration parameters then produces credible intervals.
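A sketch of the Bayesian recalibration idea on synthetic data, assuming the logistic calibration form y ~ Bernoulli(σ(a·s + b)). A grid approximation with a flat prior stands in for MCMC here for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic: ML scores s; human labels follow a logistic calibration curve.
a_true, b_true = 2.0, -1.0
s_lab = rng.uniform(0, 1, 150)
y_lab = rng.binomial(1, sigmoid(a_true * s_lab + b_true))
s_unl = rng.uniform(0, 1, 2_000)

# Grid posterior over calibration parameters (a, b), flat prior.
a_grid = np.linspace(0.0, 4.0, 60)
b_grid = np.linspace(-3.0, 1.0, 60)
A, B = np.meshgrid(a_grid, b_grid, indexing="ij")
p = sigmoid(A[..., None] * s_lab + B[..., None])          # (60, 60, 150)
loglik = (y_lab * np.log(p) + (1 - y_lab) * np.log(1 - p)).sum(axis=-1)
post = np.exp(loglik - loglik.max())
post /= post.sum()

# Monte Carlo over posterior draws of (a, b): each draw induces a proxy
# population mean over the unlabeled scores; quantiles give credible bounds.
idx = rng.choice(post.size, size=1_000, p=post.ravel())
a_draw, b_draw = A.ravel()[idx], B.ravel()[idx]
draws = sigmoid(a_draw[:, None] * s_unl + b_draw[:, None]).mean(axis=1)
lo, hi = np.quantile(draws, [0.025, 0.975])
```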
2.3 Informative Labeling and Inverse Probability Weighting
RePPI admits valid inference under informative (non-MCAR) labeling by replacing the standard residual correction with a Horvitz–Thompson (HT) or Hájek adjustment using estimated propensities π̂(X) (Datta et al., 13 Aug 2025): labeled residuals Y_i − f(X_i) are weighted by 1/π̂(X_i), with the Hájek form normalizing by the sum of the weights.
Unbiasedness and √n-consistency are retained under standard regularity conditions (correct propensity model, overlap).
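A Hájek-adjusted sketch for a population mean under informative labeling. The data are synthetic, and for brevity the true propensities are used directly where in practice they would be estimated:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
x = rng.uniform(0, 1, n)
y = 3.0 * x + rng.normal(0, 0.5, n)
f = 0.8 * y + 0.5                  # biased surrogate, available for all units

# Informative (MAR) labeling: larger x makes a unit more likely to be labeled.
pi = 0.05 + 0.4 * x                # labeling propensity (estimated, in practice)
lab = rng.binomial(1, pi).astype(bool)

# Hajek adjustment: weight labeled residuals by 1/pi and normalize by the
# sum of the weights, so the correction targets the full population.
w = 1.0 / pi[lab]
theta_hat = f.mean() + np.sum(w * (y[lab] - f[lab])) / np.sum(w)

naive = y[lab].mean()              # unweighted labeled mean, biased upward
```

The unweighted labeled mean over-represents high-x (hence high-y) units, while the propensity-weighted correction recovers the true mean of 1.5.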
3. Theoretical Guarantees
- Unbiasedness: The HT-form estimator is unbiased under a correctly specified model with MCAR labeling, and under MAR labeling with a correctly specified IPW propensity model (Datta et al., 13 Aug 2025, Song et al., 28 Jan 2026).
- Variance Minimization: The optimal recalibration mapping achieves the smallest possible asymptotic variance among all PPI-type estimators (Ji et al., 16 Jan 2025, Chen et al., 8 Jan 2026).
- Efficient Influence Function (EIF): RePPI estimators can be cast as one-step EIF corrections; in many settings (e.g., scalar means), this recovers semiparametric efficiency (Chen et al., 8 Jan 2026, Zhang et al., 3 Feb 2026).
- Confidence Interval Construction: Valid CIs can be built using the empirical influence function, sandwich variance estimators, or Bayesian credible intervals. In practice, RePPI intervals are typically 10–20% narrower than classical IPW or PPI (Ji et al., 16 Jan 2025, Datta et al., 13 Aug 2025, Hofer et al., 2024).
- Plug-in for Decision-dependent Distributions: In performative prediction, a two-step plug-in procedure with RePPI on distributional parameters attains the semiparametric efficiency bound for the performative optimum (Zhang et al., 3 Feb 2026).
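A sketch of a confidence interval built from the empirical influence function, for the recalibrated mean estimator on synthetic data. The variance formula below ignores the variability of the fitted recalibration map itself, which cross-fitting addresses:

```python
import numpy as np

rng = np.random.default_rng(4)
n_lab, n_unl = 400, 30_000
y_lab = rng.normal(2.0, 1.0, n_lab)
f_lab = y_lab + 0.7 + rng.normal(0, 0.4, n_lab)        # biased, noisy surrogate
f_unl = rng.normal(2.0, 1.0, n_unl) + 0.7 + rng.normal(0, 0.4, n_unl)

# Recalibrate, then estimate E[Y]; the influence-function variance is
#   Var(theta_hat) ~= Var(g(f)) / N_unl + Var(Y - g(f)) / n_lab.
slope, intercept = np.polyfit(f_lab, y_lab, 1)
g_lab, g_unl = slope * f_lab + intercept, slope * f_unl + intercept
theta_hat = g_unl.mean() + (y_lab - g_lab).mean()
se = np.sqrt(g_unl.var(ddof=1) / n_unl + (y_lab - g_lab).var(ddof=1) / n_lab)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
```

Because the residual variance Var(Y − g(f)) shrinks as the recalibrated surrogate improves, the interval narrows accordingly relative to a classical labeled-only CI.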
4. Applications and Empirical Results
RePPI has been applied in diverse domains:
- Biomedical and Social Science Prediction: Estimating regression coefficients with large-scale ML surrogates for costly or missing outcomes (Ji et al., 16 Jan 2025, Song et al., 28 Jan 2026).
- LLM-as-a-Judge and Ranking Metrics: Estimation of Precision@K and other sub-instance metrics in retrieval and RAG systems with LLM-annotated relevance, incorporating isotonic regression recalibration of LLM probabilities (Divekar et al., 26 Jan 2026).
- Risk-controlling Prediction Sets: Semi-supervised calibration of risk-controlling set size or coverage parameters, dramatically shrinking prediction sets while preserving formal error guarantees (Einbinder et al., 2024).
- Performative Prediction: Estimation of optimal parameters in feedback-loop systems with unknown but recalibrated outcome distributions (Zhang et al., 3 Feb 2026).
Empirical benchmarks consistently show that RePPI-based estimators retain nominal coverage while significantly shrinking confidence set widths and reducing the labeling burden—for instance, 24%–36% reduction in labeled data requirement for equivalent precision in several real-world studies (Ji et al., 16 Jan 2025, Hofer et al., 2024).
5. Algorithmic Implementations
RePPI implementations commonly employ sample-splitting or cross-fitting to avoid bias from overfitting the recalibration model. Three-fold splits or K-fold cross-fitting are standard:
- Initial fit: Estimate or calibration parameters on part of the data.
- Recalibration: Fit the imputation function (which could be nonparametric, e.g., random forest, splines, isotonic, or quantile mapping).
- Aggregation: Pool predictions and corrections across folds.
- Estimation and Inference: Optimize the recalibrated objective (convexity typically preserved), estimate variance or derive credible intervals.
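The cross-fitting loop above can be sketched as follows for a population mean, again with a linear recalibration map for brevity and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
n_lab, n_unl, K = 600, 5_000, 5
y = rng.normal(0.0, 1.0, n_lab)
f = 0.6 * y + 0.2 + rng.normal(0, 0.3, n_lab)           # labeled surrogates
f_unl = 0.6 * rng.normal(0.0, 1.0, n_unl) + 0.2 + rng.normal(0, 0.3, n_unl)

folds = np.array_split(rng.permutation(n_lab), K)
resid, imput = [], []
for k in range(K):
    held_out = folds[k]
    train = np.setdiff1d(np.arange(n_lab), held_out)
    # Initial fit + recalibration on K-1 folds only.
    slope, intercept = np.polyfit(f[train], y[train], 1)
    # Residuals are evaluated strictly out-of-fold to avoid overfitting bias.
    resid.append(y[held_out] - (slope * f[held_out] + intercept))
    # Fold-specific imputation on the unlabeled pool, pooled below.
    imput.append((slope * f_unl + intercept).mean())

# Aggregation: pool imputations and out-of-fold corrections across folds.
theta_hat = float(np.mean(imput) + np.concatenate(resid).mean())
```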
Select practical workflows are summarized below.
| Domain | Recalibration Step | Correction Mechanism |
|---|---|---|
| Regression/Mean | Nonparametric | Plug-in, bias-correct |
| Binary Metrics | Isotonic/Platt | Residual/EIF |
| Ranking/LLM-Judge | Isotonic on LLM probabilities | PPI++ w/ calibrated LLM |
| Informative Labeling | Propensity-weighted residual | HT/Hájek estimator |
| Risk Control | Calibrated predictive loss | Finite-sample UCB |
| Performative Opt. | Cross-fit EIF for distributional parameters | Plug-in/IS |
6. Diagnostic Tools and Assumptions
Key requirements and diagnostics:
- Labeling Mechanism: MCAR for standard RePPI, MAR with correct IPW for informative labeling (Datta et al., 13 Aug 2025).
- Prediction Independence: ML predictor must be trained on disjoint data; double-dipping causes anti-conservative inference (Song et al., 28 Jan 2026).
- Overlap/Positivity: Propensity scores must be bounded away from zero in IPW-based RePPI (Datta et al., 13 Aug 2025).
- Calibration Model Fit and Diagnostics: Residual and coverage diagnostics, sensitivity analyses for recalibration function misspecification.
- Sample Size: Adequate support in the calibration set for nonparametric estimation; regularization as needed (Chen et al., 8 Jan 2026).
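A minimal overlap/positivity check, assuming propensities π̂ have already been estimated (the linear form below is a synthetic stand-in):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 5_000)
pi_hat = 0.05 + 0.4 * x            # hypothetical estimated labeling propensities

# Positivity diagnostic: propensities near zero produce huge inverse weights
# that dominate the IPW correction and inflate its variance.
eps = 0.02
n_violations = int((pi_hat < eps).sum())
max_weight = float(1.0 / pi_hat.min())

# Effective-sample-size proxy for the inverse-probability weights; a value
# far below the labeled count signals a few weights doing all the work.
w = 1.0 / pi_hat
ess = float(w.sum() ** 2 / (w ** 2).sum())
```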
7. Extensions and Related Work
RePPI operates in close relation to classical surrogate outcome, double-sampling, and survey sampling strategies. It admits generalizations to:
- General loss and M-estimation settings (Ji et al., 16 Jan 2025).
- Sub-instance level evaluation and ranking metrics via PRECISE framework (Divekar et al., 26 Jan 2026).
- Risk-controlled prediction sets and semi-supervised coverage guarantees (Einbinder et al., 2024).
- Bayesian and chain-rule/stratified recalibration for structured or abstaining ML annotators (Hofer et al., 2024).
- Performative environments with parameter-dependent data distributions (Zhang et al., 3 Feb 2026).
Across settings, the recalibration-driven efficiency gains and unbiasedness hold under model-robust conditions and careful algorithmic design. As the paradigm evolves, open questions include optimizing calibration for multi-dimensional or instance-varying surrogates, robustness to distribution shift, and scalable cross-fitting implementations.
References
- Angelopoulos et al., 2023
- Hofer et al., 2024
- Einbinder et al., 2024
- Ji et al., 16 Jan 2025
- Datta et al., 13 Aug 2025
- Chen et al., 8 Jan 2026
- Divekar et al., 26 Jan 2026
- Song et al., 28 Jan 2026
- Zhang et al., 3 Feb 2026