Surrogacy Falsification Test
- The surrogacy falsification test is a statistical framework that uses calibration functions and cross-fitting to assess whether LLM-generated surrogate outcomes can reliably replace human outcomes.
- It employs moment-based diagnostics, testing for uncorrelated residuals across treatment arms, to evaluate the key surrogacy and comparability assumptions.
- The methodology also establishes bias bounds via total-variation distances, highlighting the importance of support overlap for limiting estimation error in the ATE.
A surrogacy falsification test is a statistical methodology designed to empirically test the validity of surrogate outcomes—such as those generated by LLMs—in place of human outcomes, specifically in the context of A/B testing and causal inference. The framework provides rigorous diagnostics for determining whether causal effects estimated from surrogate outcomes can be transported to the human population of interest, and, crucially, whether such surrogacy can be empirically falsified with historical experimental data. The methodology adapts surrogate endpoint theory from biostatistics to the causal inference setting, with extensions to the practicalities of LLM-based evaluation (Persson et al., 15 Jun 2026).
1. Conceptual Foundations
The surrogacy falsification framework formalizes the conditions under which surrogate endpoints, especially LLM-generated outcomes, can be reliably substituted for human outcomes in the identification and estimation of average treatment effects (ATE). It operates in settings with two populations, indexed here by $P$: a calibration population ($P = \mathrm{cal}$), where both human and surrogate outcomes are observed in randomized experiments, and an “artificial” LLM population ($P = \mathrm{LLM}$), where only surrogate outcomes are available. Throughout, $X$ denotes covariates, $W \in \{0, 1\}$ the treatment, $S$ the surrogate outcome, and $Y$ the human outcome.
The central constructs are:
- Calibration function $h(x, s) = \mathbb{E}[Y \mid X = x, S = s, P = \mathrm{cal}]$, mapping covariates and surrogate outcome to the expected human outcome.
- Surrogacy (Assumption S): Human outcome is independent of treatment given covariates and surrogate in the calibration population, i.e., $Y \perp W \mid (X, S, P = \mathrm{cal})$.
- Comparability (Assumption C): Conditional law of $Y$ given $(X, S)$ is the same in calibration and LLM populations: $\mathcal{L}(Y \mid X, S, P = \mathrm{cal}) = \mathcal{L}(Y \mid X, S, P = \mathrm{LLM})$, with requisite support overlap in $(X, S)$.
If both S and C hold, the ATE can be transported via the calibration mapping: $\tau = \mathbb{E}[h(X, S) \mid W = 1, P = \mathrm{LLM}] - \mathbb{E}[h(X, S) \mid W = 0, P = \mathrm{LLM}]$.
2. Diagnostic Test for Surrogacy
The surrogacy falsification test operationalizes the Prentice criterion by statistically evaluating, in the calibration sample, whether the residualized human outcome $R = Y - \hat{h}(X, S)$, where $\hat{h}$ is the estimated calibration function of covariates $X$ and surrogate $S$, is uncorrelated with treatment $W$. This is checked arm by arm and forms the basis of an arm-wise moment test:
- For $w \in \{0, 1\}$, test the null hypothesis $H_0^{(w)}: \mathbb{E}[Y - h(X, S) \mid W = w] = 0$ in the calibration population.
- $h$ is estimated by cross-fitting: split the data into $K$ folds; for each fold, fit $\hat{h}$ on the other $K - 1$ folds, predict residuals on the held-out fold, and aggregate by arm.
A large deviation of the sample mean residual $\bar{R}_w$ from zero, relative to its standard error, leads to rejection of surrogacy in arm $w$. Rejection in either arm is a falsification of surrogacy for those historical treatments.
Pseudocode Implementation
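A minimal Python sketch of the cross-fitted, arm-wise moment test described above. Several choices here are assumptions made for illustration and are not taken from the source: the function name `surrogacy_falsification_test` is hypothetical, the calibration function is fit by ordinary least squares on an intercept, the covariates, and the surrogate, the treatment is binary, and the p-value uses a normal approximation.

```python
import math
import numpy as np

def surrogacy_falsification_test(X, S, Y, W, n_folds=5, seed=0):
    """Arm-wise moment test on cross-fitted calibration residuals (sketch).

    X: (n, d) covariates, S: (n,) surrogate, Y: (n,) human outcome,
    W: (n,) binary treatment. Calibration model: OLS (an assumption).
    """
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    X = np.asarray(X, dtype=float).reshape(n, -1)
    S = np.asarray(S, dtype=float)
    W = np.asarray(W).astype(int)
    design = np.column_stack([np.ones(n), X, S])

    # Randomly assign each unit to one of n_folds folds of near-equal size.
    rng = np.random.default_rng(seed)
    fold_id = rng.permutation(n) % n_folds

    # Cross-fitting: fit on the other folds, compute residuals on the holdout.
    residuals = np.empty(n)
    for k in range(n_folds):
        holdout = fold_id == k
        beta, *_ = np.linalg.lstsq(design[~holdout], Y[~holdout], rcond=None)
        residuals[holdout] = Y[holdout] - design[holdout] @ beta

    # Arm-wise test of H0: mean residual equals zero.
    results = {}
    for w in (0, 1):
        r = residuals[W == w]
        mean = float(r.mean())
        se = float(r.std(ddof=1) / math.sqrt(len(r)))
        z = mean / se
        p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approx.
        results[w] = {"mean_residual": mean, "std_error": se,
                      "z": z, "p_value": p_value}
    return results
```

Calling the function on a historical experiment returns one entry per arm; a small `p_value` in either arm would count as a rejection. In practice the OLS step could be swapped for any regressor, provided it is still fit out of fold.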
If surrogacy is rejected in any arm, the application of surrogate-based inference for those past interventions is empirically falsified (Persson et al., 15 Jun 2026).
3. Bias Bound under Limited Overlap
When the comparability (support-overlap in $(X, S)$) condition is violated, a worst-case bound on the bias in the estimated ATE transported from the LLM surrogate to the human target is available. Define $Z = (X, S)$, and for population $p \in \{\mathrm{cal}, \mathrm{LLM}\}$ and arm $w \in \{0, 1\}$, let $f_{p,w}$ be the density of $Z$ in cell $(p, w)$. The total-variation distance in arm $w$ is $\mathrm{TV}_w = \tfrac{1}{2} \int |f_{\mathrm{cal},w}(z) - f_{\mathrm{LLM},w}(z)| \, dz$.
Given bounded outcomes $|Y| \le M$, the difference between the human ATE ($\tau_{H}$) and the surrogate-based ATE ($\tau_{\mathrm{LLM}}$) is tightly bounded: $|\tau_{H} - \tau_{\mathrm{LLM}}| \le 2M\,(\mathrm{TV}_0 + \mathrm{TV}_1)$. In practical terms, a high degree of support overlap is necessary for the calibration to have a small worst-case bias. The total-variation distances can be estimated via density-ratio techniques or two-sample classifiers.
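To make the arithmetic of the bound concrete, the sketch below computes a total-variation distance between two discretized distributions of the covariate-surrogate pair and plugs the arm-wise distances into a worst-case bias bound. It assumes the convention $\mathrm{TV} = \tfrac{1}{2}\sum |p - q|$, outcomes satisfying $|Y| \le M$, and a bound of the form $2M(\mathrm{TV}_0 + \mathrm{TV}_1)$; the function names and the binned inputs are hypothetical, and a binned plug-in estimate is a simpler stand-in for the density-ratio or classifier-based estimators mentioned above.

```python
def tv_distance(p, q):
    """Total-variation distance between two discrete distributions.

    p and q are probability vectors over the same bins, e.g. binned
    (covariate, surrogate) frequencies in the calibration and LLM samples.
    """
    if len(p) != len(q):
        raise ValueError("p and q must be defined on the same bins")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def ate_bias_bound(outcome_bound, tv_control, tv_treated):
    """Worst-case ATE bias, assuming |Y| <= outcome_bound (sketch)."""
    return 2.0 * outcome_bound * (tv_control + tv_treated)


# Hypothetical binned frequencies for one arm in each population.
calibration_bins = [0.50, 0.30, 0.20]
llm_bins = [0.40, 0.30, 0.30]
tv_arm = tv_distance(calibration_bins, llm_bins)
bound = ate_bias_bound(outcome_bound=1.0, tv_control=tv_arm, tv_treated=tv_arm)
```

With identical distributions the distance is zero and the bound vanishes; with disjoint support the distance is one and the bound reaches its trivial maximum.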
4. Regularity and Validity Conditions
Validity of the moment-based falsification test and of the bias bound requires standard regularity conditions:
- Cross-fitting: Residuals used in the test are approximately independent of the estimated calibration function $\hat{h}$, avoiding overfitting-induced correlation.
- Sample size: Each treatment arm should have at least 200 samples to reliably invoke the central limit theorem; for smaller samples, t-tests with bootstrapped standard errors are recommended.
- Bounded outcomes: For the bias bound, the human outcome $Y$ must be bounded, which is often achieved by construction (e.g., click rates in $[0, 1]$).
- Support overlap: Nonzero support for $(X, S)$ in each $(p, w)$ cell (population $p$, treatment arm $w$) is required; otherwise the bias bound defaults to the trivial maximum.
- Covariate comparability: Distributions of the covariates $X$ and the noise properties of the surrogate $S$ must be similar in calibration and LLM samples to avoid inflation of the bias bound; this is checked empirically via diagnostic plots (e.g., propensity scores, residual distributions, and normal Q-Q plots).
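For the small-sample case in the list above, one possible way to obtain a bootstrapped standard error for an arm's mean residual is sketched below. The function names, the number of bootstrap replicates, and the nonparametric resampling scheme are assumptions for illustration; the resulting statistic would then be compared with a t reference distribution.

```python
import numpy as np

def bootstrap_mean_se(residuals, n_boot=2000, seed=0):
    """Nonparametric bootstrap standard error of the mean residual."""
    r = np.asarray(residuals, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row of idx is one resample (with replacement) of the arm.
    idx = rng.integers(0, len(r), size=(n_boot, len(r)))
    boot_means = r[idx].mean(axis=1)
    return float(boot_means.std(ddof=1))


def arm_t_statistic(residuals, n_boot=2000, seed=0):
    """Mean residual divided by its bootstrapped standard error."""
    r = np.asarray(residuals, dtype=float)
    return float(r.mean() / bootstrap_mean_se(r, n_boot=n_boot, seed=seed))
```

The residuals passed in should be the cross-fitted, out-of-fold residuals for a single arm, so that the independence condition in the first bullet is respected.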
5. Practical Recommendations for Implementation
Several methodological choices strongly influence the power and robustness of the surrogacy falsification test:
- Calibration Estimator: Employ flexible machine learning models (random forests, gradient-boosted trees) for the calibration function when the calibration sample is large enough to support them; otherwise, use regularized linear or spline regression.
- Residual Replication: For stochastic surrogates, average multiple LLM outputs per unit to reduce surrogate noise and increase test power; a modest number of draws often suffices.
- Design Variables: Tune LLM sampling temperature to achieve a balance between surrogate diversity and signal-to-noise ratio.
- Sample Sizing: Target at least 200 samples per arm for reliable asymptotic inference; more historical calibration experiments yield increased sensitivity to surrogacy violations.
- Overlap Enhancement: If support overlap is inadequate (a high total-variation distance between the calibration and LLM samples in some arm), either expand the calibration sample or restrict inference to subpopulations with acceptable overlap.
- Pilot Experimentation: Even after a passed falsification test, a small-scale human experiment is recommended for any new intervention due to the inherent untestability of surrogacy for treatments outside observed historical support.
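The replication advice in the list above can be illustrated with a short sketch: averaging several independent LLM draws per unit shrinks the surrogate noise roughly in proportion to the square root of the number of draws. The function name and the array layout (one row per unit, one column per draw) are assumptions made for this example.

```python
import numpy as np

def average_surrogate_draws(draws):
    """Average repeated LLM outputs per unit.

    draws: array of shape (n_units, n_draws); returns shape (n_units,).
    """
    d = np.asarray(draws, dtype=float)
    if d.ndim != 2:
        raise ValueError("draws must have shape (n_units, n_draws)")
    return d.mean(axis=1)


# Simulated example: pure-noise surrogate with unit variance, 4 draws per unit.
rng = np.random.default_rng(0)
noisy_draws = rng.normal(loc=0.0, scale=1.0, size=(5000, 4))
averaged = average_surrogate_draws(noisy_draws)
single_draw_sd = noisy_draws[:, 0].std()   # close to 1.0
averaged_sd = averaged.std()               # close to 1.0 / sqrt(4) = 0.5
```

The averaged surrogate would then replace the single-draw surrogate as the input to the calibration function.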
6. Limitations and Scope
The surrogacy falsification test checks conditions that are necessary, but not sufficient, for the use of surrogate outcomes in causal inference. While the methodology can falsify surrogacy for past interventions using historical experiments, it cannot confirm that surrogacy holds for novel treatments. Support for LLM-based surrogates is therefore always empirical and conditional on the historical interventions tested, never guaranteed for policy-relevant future interventions. Consequently, small-scale human pilots remain indispensable for validating entirely new interventions, even in cases where the test does not reject surrogacy in historical data (Persson et al., 15 Jun 2026).
| Assumption | Definition | Empirical Testability |
|---|---|---|
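| Surrogacy (S) | $Y \perp W \mid (X, S)$ in the calibration population | Diagnostic moment test |
| Comparability (C) | Conditional law of $Y$ given $(X, S)$ is the same in the calibration and LLM populations (with support overlap in $(X, S)$) | Overlap diagnostics, TV bound |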
A plausible implication is that stringent ongoing calibration and overlap monitoring are required in any workflow seeking to exploit surrogate-based estimation of human treatment effects.