- The paper introduces RePPI, a novel method that recalibrates AI-based surrogate outcomes to achieve lower asymptotic variance.
- The methodology uses cross-fitting together with machine learning learners such as random forests and gradient boosting to correct systematic biases arising from modality mismatch and distribution shift.
- Empirical studies show that integrating AI-driven surrogates enlarges effective sample sizes and boosts statistical inference precision in cost-constrained settings.
Recalibration of Surrogate Outcomes in the Context of AI-Powered Predictions
The paper "Predictions as Surrogates: Revisiting Surrogate Outcomes in the Age of AI" establishes a connection between surrogate outcome models and prediction-powered inference (PPI), introducing recalibrated prediction-powered inference (RePPI) as an advancement in statistical efficiency. The authors position their work within the historical context of surrogate outcomes used in biostatistics and economics and extend the paradigm to encompass modern AI-derived predictions, offering a novel methodological approach poised to enhance both inference precision and cost-effectiveness in data-laden environments.
Background: Surrogate Outcomes and Prediction-Powered Inference
Surrogate outcomes are traditionally implemented when primary data acquisition is expensive or infeasible. The canonical literature, as outlined by seminal works such as Pepe (1992) and Robins et al. (1994), has explored the use of surrogate outcomes in clinical trials and other domains where primary endpoints are costly or difficult to measure. Such surrogates are auxiliary variables, correlated with the endpoint of interest, offering an indirect path to statistical inference under certain assumptions.
Prediction-powered inference, introduced by Angelopoulos et al. (2023), leverages predictions from high-capacity models as surrogates in settings where collecting true outcomes is impractical. These model-derived predictions, while potentially biased or uncalibrated, are cheap and ubiquitously available, but they require careful recalibration to align more closely with the true outcome distribution and achieve optimal inference.
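To ground the idea, the mean-estimation special case of PPI combines predictions on a large unlabeled sample of size N with a bias correction computed on the small labeled sample of size n (this display is the standard form of that estimator, written in notation chosen here rather than taken from the paper under review):

$$
\hat{\theta}_{\mathrm{PPI}} \;=\; \frac{1}{N}\sum_{i=1}^{N} f(X_i) \;+\; \frac{1}{n}\sum_{j=1}^{n}\bigl(Y_j - f(X_j)\bigr).
$$

The first term uses cheap predictions f(X) everywhere; the second removes their systematic bias using the few observations for which the true outcome Y is available.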
Methodology: Recalibrated Prediction-Powered Inference
The authors develop RePPI by importing insights from the surrogate outcomes literature into PPI, proposing a recalibration step that uses machine learning to estimate the "imputed loss": in essence, the conditional expectation of the loss gradient given the model's prediction and the covariates, which replaces the raw prediction-based loss inside the PPI estimating equations.
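In schematic form (notation adapted here rather than copied from the paper), the recalibration learns a function g approximating that conditional expectation from the labeled data and plugs it into the usual PPI-style corrected score equation:

$$
g(x,\hat y;\theta)\;\approx\;\mathbb{E}\bigl[\nabla\ell(X,Y;\theta)\mid X=x,\ \hat Y=\hat y\bigr],
$$
$$
\frac{1}{N}\sum_{i=1}^{N} g(X_i,\hat Y_i;\theta)\;+\;\frac{1}{n}\sum_{j=1}^{n}\Bigl(\nabla\ell(X_j,Y_j;\theta)-g(X_j,\hat Y_j;\theta)\Bigr)\;=\;0.
$$

The closer the learned g is to the true conditional expectation, the more of the gradient's variability is shifted from the small labeled sample onto the large unlabeled one.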
Theoretical Contributions
Their theoretical framework shows that RePPI is never less efficient than the estimator that ignores predictions, and that it attains the lowest asymptotic variance among a broad class of PPI-type estimators when the recalibration step approximates the optimal imputed loss sufficiently well. The theory rests on assumptions paralleling those of semiparametric efficiency frameworks, with quantities such as Cov(E[∇ℓ(X,Y) ∣ X, Ŷ]) emerging as the pivotal efficiency parameters.
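One way to see why that covariance matrix governs the efficiency gain is the law of total variance applied to the loss gradient (a standard identity, not a display quoted from the paper):

$$
\mathrm{Var}\bigl(\nabla\ell(X,Y)\bigr)\;=\;\mathbb{E}\bigl[\mathrm{Var}\bigl(\nabla\ell(X,Y)\mid X,\hat Y\bigr)\bigr]\;+\;\mathrm{Cov}\bigl(\mathbb{E}\bigl[\nabla\ell(X,Y)\mid X,\hat Y\bigr]\bigr).
$$

The second, "explained" component is the share of the gradient's uncertainty that informative predictions allow the estimator to offload onto cheap unlabeled data.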
Empirical Illustrations and Numerical Studies
The paper dissects several use-cases exemplifying systematic prediction biases:
- Modality Mismatch: Occurs when the prediction model omits covariates that are pertinent to the inference target, so recalibration is needed to exploit the prediction surrogates fully (see the sketch after this list).
- Distribution Shift: Arises as model training distributions deviate from target data distributions, a common scenario with off-the-shelf AI models applied to specialized subgroups.
- Discrete Predictions: In classification tasks the surrogate takes only a few discrete values, so recalibration effectively smooths the predicted labels against the covariates to recover the numerical information needed for precise inference.
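The modality-mismatch case can be illustrated with a short simulation. This is a minimal sketch with synthetic data and hypothetical variable names, not code from the paper: the prediction model never sees the covariate Z, so the raw predictions are systematically biased for the population mean, and a simple recalibration fit on the labeled subset removes that bias.

```python
# Minimal sketch (synthetic data, hypothetical names): a prediction model that
# never saw the covariate Z ("modality mismatch") is recalibrated by regressing
# the true outcome on (prediction, Z) using only the small labeled sample.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
N, n = 20_000, 500                        # unlabeled and labeled sample sizes

# Data-generating process: Y depends on both X and Z, but the AI model uses X only.
X = rng.normal(size=N)
Z = rng.binomial(1, 0.5, size=N)          # covariate missing from the model's inputs
Y = 2.0 * X + 1.5 * Z + rng.normal(size=N)
Y_hat = 2.0 * X                           # off-the-shelf prediction, blind to Z

labeled = rng.choice(N, size=n, replace=False)

# Recalibration: learn E[Y | Y_hat, Z] on the labeled subset.
features = np.column_stack([Y_hat, Z])
recal = LinearRegression().fit(features[labeled], Y[labeled])
Y_tilde = recal.predict(features)         # recalibrated surrogate for every unit

print("oracle mean of Y:    ", round(Y.mean(), 3))
print("raw prediction mean: ", round(Y_hat.mean(), 3))    # biased: misses the Z effect
print("recalibrated mean:   ", round(Y_tilde.mean(), 3))  # bias largely removed
print("labeled-only mean:   ", round(Y[labeled].mean(), 3))
```

The full RePPI procedure would additionally add a labeled-data correction term and use cross-fitting, as sketched in the next section.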
Computational and Practical Implications
In practice, RePPI uses cross-fitting together with flexible machine learning algorithms such as random forests and gradient boosting to learn the recalibration function robustly; a sketch of the pattern follows. The effective-sample-size gains delivered by RePPI allow statisticians and econometricians to exploit vast unlabeled datasets, improving the reliability and precision of estimates that would otherwise rest on small labeled samples.
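The following sketch shows the cross-fitting pattern for the simple case of estimating a mean. Function and variable names are hypothetical, and the paper's actual procedure covers general M-estimation problems with further refinements; this is only an illustration of the recipe, not the authors' implementation.

```python
# Sketch of K-fold cross-fitting for the recalibration step (mean-estimation case):
# the function m(x, yhat) ~ E[Y | X, Y_hat] is fit on K-1 folds and evaluated on
# the held-out fold, so no labeled point is imputed by a model trained on itself.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def reppi_style_mean(X_lab, yhat_lab, y_lab, X_unlab, yhat_unlab,
                     n_splits=5, seed=0):
    """PPI-style mean estimate with a cross-fitted recalibration function."""
    feats_lab = np.column_stack([X_lab, yhat_lab])
    feats_unlab = np.column_stack([X_unlab, yhat_unlab])

    imputed_lab = np.zeros(len(y_lab))
    imputed_unlab = np.zeros((n_splits, len(yhat_unlab)))

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for k, (train_idx, hold_idx) in enumerate(kf.split(feats_lab)):
        model = GradientBoostingRegressor(random_state=seed)
        model.fit(feats_lab[train_idx], y_lab[train_idx])
        imputed_lab[hold_idx] = model.predict(feats_lab[hold_idx])  # out-of-fold
        imputed_unlab[k] = model.predict(feats_unlab)               # averaged below

    # Imputed mean on the large unlabeled set, plus a bias correction from labels.
    return imputed_unlab.mean() + (y_lab - imputed_lab).mean()
```

Cross-fitting is what lets flexible learners be used without their in-sample overfitting contaminating the bias-correction term.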
Speculations on Future AI Developments
Given the surge in the availability of pre-trained AI models, the role of AI-derived surrogates will likely continue to grow in research domains that emphasize scalability and efficiency. Future iterations of AI-powered surrogate models might involve zero-shot or few-shot learning methodologies, further broadening the applicability and adaptability of recalibrated inference frameworks across domains. This paper provides a structured path towards integrating modern computational methods with time-honored statistical rigor, shaping the future of inference in the data-rich AI era.
In essence, the authors provide a robust, theoretically justified pathway to incorporate potentially biased yet rich prediction surrogates into inferential statistics, contributing a substantive methodological innovation that is poised to influence future studies across various scientific fields that require efficient data utilization.