
Surrogacy Falsification Test

Updated 23 June 2026
  • The surrogacy falsification test is a statistical framework that uses calibration functions and cross-fitting to assess whether LLM-generated surrogate outcomes can reliably replace human outcomes.
  • It applies moment-based diagnostics, testing whether calibration residuals have zero mean within each treatment arm, to evaluate the surrogacy assumption; comparability is assessed through overlap diagnostics.
  • The methodology also bounds the bias of the transported ATE via total-variation distances, highlighting that support overlap is needed to keep the worst-case estimation error small.

A surrogacy falsification test is a statistical methodology designed to empirically test the validity of surrogate outcomes—such as those generated by LLMs—in place of human outcomes, specifically in the context of A/B testing and causal inference. The framework provides rigorous diagnostics for determining whether causal effects estimated from surrogate outcomes can be transported to the human population of interest, and, crucially, whether such surrogacy can be empirically falsified with historical experimental data. The methodology adapts surrogate endpoint theory from biostatistics to the causal inference setting, with extensions to the practicalities of LLM-based evaluation (Persson et al., 15 Jun 2026).

1. Conceptual Foundations

The surrogacy falsification framework formalizes the conditions under which surrogate endpoints, especially LLM-generated outcomes, can be reliably substituted for human outcomes in the identification and estimation of the average treatment effect (ATE). It operates in settings with two populations: a calibration population ($P=0$), where both human and surrogate outcomes are observed in randomized experiments, and an “artificial” LLM population ($P=1$), where only surrogate outcomes are available.

The central constructs are:

  • Calibration function $\mu(x, y^*) = \mathbb{E}[Y \mid X=x, Y^*=y^*, P=0]$, mapping covariates $X$ and surrogate outcome $Y^*$ to the expected human outcome $Y$.
  • Surrogacy (Assumption S): Human outcome $Y$ is independent of treatment $W$ given covariates $X$ and surrogate $Y^*$ in the calibration population, i.e., $Y \perp W \mid X, Y^*, P=0$.
  • Comparability (Assumption C): Conditional law of $Y$ given $(X, Y^*)$ is the same in calibration and LLM populations: $Y \perp P \mid X, Y^*$, with requisite support-overlap in $(X, Y^*)$.

If both S and C hold, the ATE can be transported via the calibration mapping: $\tau = \mathbb{E}[\mu(X, Y^*) \mid W=1, P=1] - \mathbb{E}[\mu(X, Y^*) \mid W=0, P=1]$.
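
One way to read the transport step is as a plug-in contrast: apply a fitted calibration function to the LLM-population units and compare arm-wise means. The sketch below assumes a callable `mu_hat(x, y_star)` has already been fitted on calibration data; the function name `transported_ate` and its argument layout are illustrative, not taken from the source.

```python
import statistics

def transported_ate(mu_hat, x, y_star, w):
    """Plug-in ATE in the LLM population (P = 1).

    mu_hat : callable (x_i, y_star_i) -> predicted human outcome,
             assumed to be fitted beforehand on calibration data (P = 0).
    x, y_star, w : equal-length sequences of covariates, surrogate
             outcomes, and binary treatment indicators.
    """
    treated = [mu_hat(xi, si) for xi, si, wi in zip(x, y_star, w) if wi == 1]
    control = [mu_hat(xi, si) for xi, si, wi in zip(x, y_star, w) if wi == 0]
    # Difference of arm-wise means of the calibrated surrogate.
    return statistics.fmean(treated) - statistics.fmean(control)
```

The estimate is only as trustworthy as Assumptions S and C; the function itself performs no check of either.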

2. Diagnostic Test for Surrogacy

The surrogacy falsification test operationalizes the Prentice criterion by statistically evaluating, in the calibration sample ($P=0$), whether the residualized human outcome $Y - \mu(X, Y^*)$ is uncorrelated with treatment $W$, that is, whether it has mean zero in each treatment arm. This forms the basis of an arm-wise moment test:

  • For $w \in \{0, 1\}$, test the null hypothesis $H_0: \mathbb{E}[Y - \mu(X, Y^*) \mid W=w, P=0] = 0$.
  • $\mu$ is estimated by cross-fitting: split the data into $K$ folds; for each fold, fit $\hat{\mu}$ on the other $K-1$ folds, compute residuals on the held-out fold, and aggregate by arm.

A large deviation of the sample mean residual $\bar{r}_w$ from zero, relative to its standard error, leads to rejection of surrogacy in arm $w$. Rejection in either arm is a falsification of surrogacy for those historical treatments.

Pseudocode Implementation

If surrogacy is rejected in any arm, the application of surrogate-based inference for those past interventions is empirically falsified (Persson et al., 15 Jun 2026).
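
The cross-fitted, arm-wise moment test can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the calibration model is a one-variable least-squares line in the surrogate (covariates are omitted for brevity), and the names `fit_linear` and `surrogacy_test` are hypothetical. A flexible learner could be substituted for `fit_linear`.

```python
import math
import random
import statistics

def fit_linear(s_train, y_train):
    """Least-squares line y ~ a + b * s (stand-in for a flexible calibration model)."""
    ms, my = statistics.fmean(s_train), statistics.fmean(y_train)
    sxx = sum((s - ms) ** 2 for s in s_train)
    sxy = sum((s - ms) * (y - my) for s, y in zip(s_train, y_train))
    b = sxy / sxx if sxx > 0 else 0.0
    a = my - b * ms
    return lambda s: a + b * s

def surrogacy_test(y, y_star, w, n_folds=5, seed=0):
    """Arm-wise moment test on calibration data (P = 0).

    Returns {arm: (mean_residual, std_error, z_stat, two_sided_p)}.
    """
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    resid = [0.0] * n
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        # Fit on the other folds, pooled across arms (W is not a regressor).
        mu_hat = fit_linear([y_star[i] for i in train], [y[i] for i in train])
        for i in fold:
            resid[i] = y[i] - mu_hat(y_star[i])  # out-of-fold residual
    results = {}
    for arm in (0, 1):
        r = [resid[i] for i in range(n) if w[i] == arm]
        mean = statistics.fmean(r)
        se = statistics.stdev(r) / math.sqrt(len(r))
        z = mean / se
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        results[arm] = (mean, se, z, p)
    return results
```

A small p-value in either arm would count as evidence against surrogacy for that historical experiment; a large p-value is only a failure to reject.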

3. Bias Bound under Limited Overlap

When the comparability (support-overlap in $(X, Y^*)$) condition is violated, a worst-case bound on the bias in the estimated ATE transported from the LLM surrogate to the human target is available. Define $Z = (X, Y^*)$, and for $p, w \in \{0, 1\}$, let $f_{p,w}$ be the density of $Z$ in cell $(P=p, W=w)$. The total-variation distance in arm $w$ is $\mathrm{TV}_w = \tfrac{1}{2} \int |f_{1,w}(z) - f_{0,w}(z)| \, dz$.

Given bounded outcomes, the difference between the human ATE and the surrogate-based ATE is bounded, in the worst case, by a term governed by the outcome range and the arm-wise total-variation distances. In practical terms, a high degree of support overlap is necessary for the calibration to have a small worst-case bias. The total-variation distances can be estimated via density-ratio techniques or two-sample classifiers.
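
As a rough illustration of an overlap diagnostic, the sketch below estimates a total-variation distance between two one-dimensional samples with a shared histogram. It is a simplified stand-in for the density-ratio or classifier-based estimators mentioned above, not a replacement for them: it handles a scalar variable only, and the plug-in estimate tends to be biased upward in small samples. The name `tv_histogram` and the default of 20 bins are assumptions made for this example.

```python
def tv_histogram(sample_p, sample_q, n_bins=20):
    """Plug-in total-variation distance between two 1-D samples.

    Both samples are binned on a common grid; the result is
    0.5 * sum_k |p_k - q_k|, which lies in [0, 1].
    """
    lo = min(min(sample_p), min(sample_q))
    hi = max(max(sample_p), max(sample_q))
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values coincide

    def hist(sample):
        counts = [0] * n_bins
        for v in sample:
            k = min(int((v - lo) / width), n_bins - 1)  # clamp the right edge
            counts[k] += 1
        return [c / len(sample) for c in counts]

    p, q = hist(sample_p), hist(sample_q)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

A value near 0 indicates similar distributions in the two cells, while a value near 1 indicates almost disjoint support, the regime in which a worst-case bias bound becomes uninformative.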

4. Regularity and Validity Conditions

Validity of the moment-based falsification test and the bias bound requires standard regularity conditions:

  • Cross-fitting: Residuals used in the test are approximately independent of the estimated calibration function $\hat{\mu}$, avoiding overfitting-induced correlation.
  • Sample size: Each treatment arm should have at least 200 samples to reliably invoke the central limit theorem; for smaller samples, t-tests with bootstrapped standard errors are recommended.
  • Bounded outcomes: For the bias bound, $Y$ must be bounded, which is often achieved by construction (e.g., click rates in $[0, 1]$).
  • Support overlap: Nonzero support for $(X, Y^*)$ in each $(p, w)$ cell is required; otherwise the bias bound defaults to the trivial maximum.
  • Covariate comparability: Distributions of $X$ and the noise properties of $Y^*$ must be similar in calibration and LLM samples to avoid inflation of the bias bound; this is checked empirically via diagnostic plots (e.g., propensity scores, residual distributions, and normal Q-Q plots).
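
For arms too small to lean on the central limit theorem, the bootstrapped standard error mentioned in the sample-size condition can be sketched as follows. The function name `bootstrap_t`, the default of 2000 resamples, and the choice of a nonparametric bootstrap of the arm mean are assumptions for illustration; the input is the list of out-of-fold residuals for one arm.

```python
import random
import statistics

def bootstrap_t(residuals, n_boot=2000, seed=0):
    """t-type statistic for the mean residual with a bootstrap standard error.

    Resamples the residuals with replacement n_boot times and uses the
    standard deviation of the resampled means as the standard error.
    Returns (t_stat, bootstrap_se).
    """
    rng = random.Random(seed)
    n = len(residuals)
    boot_means = [
        statistics.fmean(rng.choices(residuals, k=n)) for _ in range(n_boot)
    ]
    se = statistics.stdev(boot_means)
    return statistics.fmean(residuals) / se, se
```

The statistic would then be compared against a t reference distribution with an appropriate number of degrees of freedom, a step left out of this sketch.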

5. Practical Recommendations for Implementation

Several methodological choices strongly influence the power and robustness of the surrogacy falsification test:

  • Calibration Estimator: Employ flexible machine learning models (random forests, gradient-boosted trees) for the calibration function $\mu$ when the calibration data are rich enough to support them; otherwise, use regularized linear or spline regression.
  • Residual Replication: For stochastic surrogates, average multiple LLM outputs per unit to reduce surrogate noise and increase test power; a limited number of draws per unit often suffices.
  • Design Variables: Tune LLM sampling temperature to achieve a balance between surrogate diversity and signal-to-noise ratio.
  • Sample Sizing: Target at least 200 samples per arm for reliable asymptotic inference; more historical calibration experiments yield increased sensitivity to surrogacy violations.
  • Overlap Enhancement: If support overlap is inadequate (high total-variation distance), either expand the calibration sample or restrict inference to subpopulations with acceptable overlap.
  • Pilot Experimentation: Even after a passed falsification test, a small-scale human experiment is recommended for any new intervention due to the inherent untestability of surrogacy for treatments outside observed historical support.
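
The benefit of averaging replicated LLM outputs can be made explicit with a short calculation. It assumes, as a simplification not stated in the source, that the $M$ draws for unit $i$ are conditionally independent with a common variance $\sigma_i^2$; the symbols $M$, $Y^*_{i,m}$, and $\sigma_i^2$ are introduced here for illustration.

```latex
\bar{Y}^*_i = \frac{1}{M} \sum_{m=1}^{M} Y^*_{i,m},
\qquad
\operatorname{Var}\!\left(\bar{Y}^*_i \mid X_i, W_i\right) = \frac{\sigma_i^2}{M}.
```

Under that assumption the surrogate noise variance falls in proportion to $1/M$, so the gain from each additional draw diminishes as $M$ grows.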

6. Limitations and Scope

The surrogacy falsification test provides necessary—but not sufficient—conditions for the use of surrogate outcomes for causal inference. While the methodology can conclusively falsify surrogacy for past interventions using historical experiments, it cannot confirm surrogacy validity for novel treatments. The validity of LLM-based surrogates is thus always at most empirically conditionally supported, never guaranteed for policy-relevant future interventions. Consequently, small-scale human pilots remain indispensable for validation of entirely new interventions, even in cases where the test does not reject surrogacy in historical data (Persson et al., 15 Jun 2026).

Assumption | Definition | Empirical Testability
Surrogacy (S) | $Y \perp W \mid X, Y^*, P=0$ | Diagnostic moment test
Comparability (C) | $Y \perp P \mid X, Y^*$ (with support overlap in $(X, Y^*)$) | Overlap diagnostics, TV bound

A plausible implication is that stringent ongoing calibration and overlap monitoring are required in any workflow seeking to exploit surrogate-based estimation of human treatment effects.
