Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference

Published 15 Jun 2026 in stat.ME, cs.AI, econ.EM, and math.ST | (2606.17165v1)

Abstract: Organizations and researchers show increasing interest in using LLMs in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes recovers the effect that would have been measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs. The framework shows that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. When these conditions fail, the effect of interest is only partially identified, and we provide diagnostics that can falsify surrogacy on historical experiments together with a bound on the worst-case bias from limited overlap. We further show that the stochasticity inherent to LLMs introduces both bias and variance, but using an average of multiple draws as the surrogate mitigates both. We illustrate the methods and theory in simulations and an application to A/B tests on Upworthy headlines. A central takeaway from our work is that the validity of LLM outcomes as surrogates can only be falsified for past treatments and never verified for new ones, so human experiments remain indispensable for novel interventions. We discuss the role of LLM choice, prompting, and temperature as design variables, and how to size human experiments for validation.

Summary

  • The paper introduces a principled surrogacy framework grounded in surrogate endpoint theory to enable valid causal inference on human ATEs using LLM-generated outcomes.
  • The methodology leverages multi-draw surrogacy to mitigate LLM stochasticity, demonstrating that increasing replication reduces bias and mean squared error.
  • Practical diagnostics and empirical validations, including nonparametric calibration on headline A/B tests, underscore the necessity of human pilot studies.

Statistical Surrogacy Foundations for LLM-Based A/B Testing

Introduction

The use of LLMs as proxies for human participants in A/B testing has rapidly become both technically feasible and economically attractive. However, leveraging LLM outcomes for causal inference concerning human populations introduces substantial identification challenges. "Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference" (2606.17165) develops a principled framework, grounded in surrogate endpoint theory, that rigorously specifies the assumptions, estimation strategy, and diagnostic procedures necessary for LLM-generated outcomes to support valid causal inference on human average treatment effects (ATEs).

Formal Problem Setup

The paper models experiments with randomized assignment W ∈ {0,1} (control/treatment), covariates X, and human outcome Y, with Y* denoting an outcome stochastically generated by an LLM, conditioned on (W, X). The LLM surrogate framework considers two samples:

  • The experimental (human) sample, with (W, X, Y*, Y) all observed;
  • The artificial (LLM) sample, with only (W, X, Y*) observed.

The fundamental question is: under what conditions can inference on the human ATE, τ = E[Y(1) − Y(0)], be carried out using only LLM data?

Identification Theory: Surrogacy and Comparability

Perfect distributional equivalence between LLM and human outcomes is too strong and rarely plausible. The paper hence introduces a surrogacy perspective, mirroring classical surrogate endpoint literature, and rigorously formalizes the conditions for point identification:

  1. Surrogacy (Prentice Criterion): Y ⊥ W | X, Y* (i.e., conditional on covariates and the surrogate, treatment has no further effect on the human outcome);
  2. Comparability: The conditional distribution of Y given the covariates and the surrogate, (X, Y*), is invariant across the human and LLM samples, i.e., Y is conditionally independent of the indicator of which sample a unit belongs to.
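
A sketch of why these two conditions suffice may help. The symbol μ for the calibration function is an assumption of this sketch, not notation confirmed by the paper, and the argument is the standard surrogate-index reasoning rather than a reproduction of the paper's proof:

```latex
\begin{align*}
\tau &= \mathbb{E}[Y \mid W = 1] - \mathbb{E}[Y \mid W = 0]
  && \text{(randomization of } W\text{)} \\
\mathbb{E}[Y \mid W = w]
  &= \mathbb{E}\big[\, \mathbb{E}[Y \mid X, Y^*, W = w] \,\big|\, W = w \big]
  && \text{(iterated expectations)} \\
  &= \mathbb{E}\big[\, \mu(X, Y^*) \,\big|\, W = w \big],
  \qquad \mu(x, y^*) := \mathbb{E}[Y \mid X = x, Y^* = y^*]
  && \text{(surrogacy)}
\end{align*}
```

Comparability is what then permits μ, which can only be learned where Y is observed, to be evaluated on units for which only (W, X, Y*) is available.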

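Under these assumptions, the calibration function, the conditional mean of Y given (X, Y*) learned on the human sample, enables unbiased estimation of the human ATE from the artificial sample: apply the fitted function to each artificial unit and take the difference between the treatment-arm and control-arm averages of the calibrated predictions.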

Figure 1: Sampling distribution of the calibrated ATE (blue) versus the raw LLM ATE (orange), illustrating removal of bias via calibration.

The calibration step absorbs systematic differences between the LLM surrogate and the human outcome, so the LLM-based estimate targets the human estimand of interest rather than the raw LLM effect.
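
The calibrate-then-difference idea can be sketched on synthetic data. Everything below is an illustrative assumption rather than the paper's implementation: the data-generating process, the quadratic least-squares basis standing in for a flexible learner, and the function names `simulate`, `features`, `fit_calibration`, and `calibrated_ate`. Surrogacy holds by construction, since Y depends on W only through Y*.

```python
import numpy as np

def simulate(n, rng):
    """Toy data in which surrogacy holds: Y depends on W only through Y*."""
    x = rng.normal(size=n)
    w = rng.integers(0, 2, size=n)                               # randomized arm
    y_star = 0.5 * x + 1.0 * w + rng.normal(scale=0.5, size=n)   # LLM outcome, raw effect 1.0
    y = 0.3 * y_star + 0.2 * x + rng.normal(scale=0.5, size=n)   # human outcome, true ATE 0.3
    return w, x, y_star, y

def features(x, y_star):
    """Quadratic basis in (x, y_star); a stand-in for any flexible learner."""
    return np.column_stack([np.ones_like(x), x, y_star, x**2, y_star**2, x * y_star])

def fit_calibration(x, y_star, y):
    """Least-squares fit of the calibration function on the human sample."""
    coef, *_ = np.linalg.lstsq(features(x, y_star), y, rcond=None)
    return coef

def calibrated_ate(coef, w, x, y_star):
    """Difference in arm means of calibrated predictions on the LLM sample."""
    pred = features(x, y_star) @ coef
    return pred[w == 1].mean() - pred[w == 0].mean()

rng = np.random.default_rng(0)
w_h, x_h, ys_h, y_h = simulate(5_000, rng)    # human sample: Y observed
w_a, x_a, ys_a, _ = simulate(50_000, rng)     # artificial sample: Y discarded

coef = fit_calibration(x_h, ys_h, y_h)
raw_ate = ys_a[w_a == 1].mean() - ys_a[w_a == 0].mean()
cal_ate = calibrated_ate(coef, w_a, x_a, ys_a)
print(f"raw LLM ATE: {raw_ate:.3f}  calibrated ATE: {cal_ate:.3f}  true human ATE: 0.300")
```

In this toy setup the raw LLM contrast sits near 1.0 while the calibrated contrast sits near the human effect of 0.3, which is the kind of correction the calibration step is meant to provide.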

Stochasticity and Multi-Draw Surrogacy

A key technical contribution is the explicit analysis of LLM stochasticity. Because LLM outputs are inherently noisy, a single-draw surrogate inflates variance and, critically, introduces attenuation bias in the calibrated estimator, a direct analog of classical measurement error.

To address this, the paper develops and proves multi-draw surrogacy theorems. When several independent LLM draws are averaged per unit, the surrogate approaches the latent conditional mean, which restores identification and shrinks estimator bias as the number of draws grows. This yields an operational guideline: increasing the replication count both improves efficiency (reducing MSE) and debiases ATE estimation. These findings are supported by simulations.

Figure 3: Averaging multiple LLM draws per unit asymptotically restores surrogacy when only the latent mean contains treatment signal.

Figure 5: RMSE of the estimator drops and the mean estimate approaches the true ATE as the number of draws increases, confirming variance reduction and debiasing.
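
The measurement-error analogy can be made concrete with a toy linear model. This is a sketch under assumed parameters, not the paper's theorem: the latent mean has variance 1, each LLM draw adds independent noise of variance 1, and the human outcome has slope 2 on the latent mean. Classical measurement-error theory then predicts that a regression on the average of k draws recovers the slope shrunk by the reliability ratio var(signal) / (var(signal) + var(noise) / k). The names `reliability` and `slope_on_averaged_draws` are hypothetical.

```python
import numpy as np

def reliability(var_signal, var_noise, k):
    """Classical attenuation factor when k noisy draws are averaged."""
    return var_signal / (var_signal + var_noise / k)

def slope_on_averaged_draws(k, n=100_000, seed=0):
    """Regress the human outcome on the mean of k LLM draws (toy model)."""
    rng = np.random.default_rng(seed)
    m = rng.normal(size=n)                                    # latent mean, variance 1
    y = 2.0 * m + rng.normal(scale=0.5, size=n)               # human outcome, true slope 2
    draws = m[:, None] + rng.normal(scale=1.0, size=(n, k))   # k noisy draws per unit
    s = draws.mean(axis=1)                                    # multi-draw surrogate
    return np.cov(s, y)[0, 1] / np.var(s, ddof=1)             # OLS slope of y on s

for k in (1, 4, 16):
    est = slope_on_averaged_draws(k)
    print(f"k={k:2d}  estimated slope={est:.3f}  predicted={2.0 * reliability(1.0, 1.0, k):.3f}")
```

As k grows the reliability ratio tends to 1, so the attenuation vanishes; the noise-to-signal ratio of a single draw determines how many draws are needed for a given tolerance.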

Diagnostics: Falsification and Sensitivity Analysis

The framework distinguishes between empirical diagnostics that can falsify necessary assumptions and those that merely assess plausibility:

  • Surrogacy Falsification Test: On held-out historical data, the fitted calibration function must recover per-arm means; significant residuals indicate failure.
  • Comparability Sensitivity Bound: The worst-case impact of limited distributional overlap between the experimental and artificial samples is quantified by a tight bound (a function of total variation distance and outcome range), providing a robust assessment when identification fails.

    Figure 6: Bias of the calibrated ATE scales linearly with the magnitude of the violation of surrogacy (direct effect of the treatment W) and of comparability (shift in calibration slope).

    Figure 2: Theoretical worst-case bound holds strictly above observed ATE discrepancies under deliberate overlap violations.
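
Both diagnostics can be sketched in a few lines. The exact test statistic and bound used in the paper are not given in this summary, so the code below substitutes generic stand-ins and labels them as such: a per-arm z-statistic on calibration residuals, and the standard inequality that, for an outcome bounded in [y_min, y_max], two distributions at total variation distance TV can differ in mean by at most (y_max − y_min) · TV. The names `arm_residual_z`, `tv_distance`, and `worst_case_mean_shift` are hypothetical.

```python
import numpy as np

def arm_residual_z(w, y, pred):
    """z-statistic of the mean calibration residual within each arm.

    Large |z| in either arm suggests the calibration fails to recover
    that arm's mean, which is evidence against surrogacy.
    """
    resid = y - pred
    out = {}
    for arm in (0, 1):
        r = resid[w == arm]
        out[arm] = r.mean() / (r.std(ddof=1) / np.sqrt(r.size))
    return out

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

def worst_case_mean_shift(p, q, y_min, y_max):
    """Generic bound on |E_p[f] - E_q[f]| for any f with values in [y_min, y_max]."""
    return (y_max - y_min) * tv_distance(p, q)

# Synthetic check (assumed setup): `pred` plays the role of calibrated predictions.
rng = np.random.default_rng(1)
n = 4_000
w = rng.integers(0, 2, size=n)
pred = rng.normal(size=n)
noise = rng.normal(size=n)

z_ok = arm_residual_z(w, pred + noise, pred)             # surrogacy holds
z_bad = arm_residual_z(w, pred + 0.3 * w + noise, pred)  # direct effect of W on Y
print("no direct effect:", z_ok)
print("direct effect 0.3:", z_bad)
print("mean-shift bound:", worst_case_mean_shift([0.5, 0.5], [0.8, 0.2], 0.0, 1.0))
```

The bound here applies to a single expectation; how the paper combines it across the two arms of the ATE is not reproduced in this sketch.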

Empirical Study: Upworthy Headline Experiments

The framework is validated on the Upworthy Research Archive [matias2021upworthy], consisting of thousands of live headline A/B tests.

  • Treatment is whether a headline is phrased as a question; outcome is click-through rate (CTR).
  • GPT-4o-mini is prompted to predict CTRs for held-out headlines, with a varying number of draws per headline.
  • Nonparametric calibration (random forests, gradient-boosted trees) yields LLM-calibrated ATE estimates within statistical error of the true human ATE once enough draws are averaged; the uncalibrated surrogate and linear calibration both fail, producing significant attenuation.

    Figure 4: Estimated ATEs for raw and calibrated LLM surrogates as a function of the number of draws; nonparametric calibration rapidly eliminates attenuation.

Additional diagnostics indicate no LLM memorization, show robustness to shifting test populations, and validate recovery of synthetic treatment effects.

Figure 7: Token-level F1 assessment rules out LLM memorization of Upworthy headlines.
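
For readers unfamiliar with the metric, token-level F1 is the harmonic mean of token precision and recall between a generated string and a reference. The sketch below uses the common bag-of-tokens definition with lowercasing and whitespace tokenization; the paper's exact tokenization and prompting protocol are not described in this summary, so those choices and the name `token_f1` are assumptions.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between a model completion and the true text."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a perfect match; one empty counts as no match.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat", "the cat ran fast"))
```

In a memorization check of this kind, scores close to 1 when the model is asked to reproduce a headline would suggest the headline was seen in training, while low scores are consistent with no verbatim recall.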

Practical Implications for Experimentation

The calibration-plus-surrogacy method treats LLM design choices (training regimen, prompt template, temperature) and the replication count as part of the experimental workflow. The calibration function can and should be fit flexibly, but LLM stochasticity sets a lower bound on the replication needed to control bias and variance. Diagnostics on historical data can only falsify surrogacy for past treatments, not verify it for novel ones, which implies that human experimentation remains essential for new interventions. The work further discusses how to size human pilot studies relative to purely LLM-based simulation, depending on risk and impact.

Conclusion

This work rigorously specifies when and how LLM-generated outcomes can be used for human causal inference in A/B testing. The statistical surrogacy formalism exposes the precise role of LLM calibration, the essential nature of surrogacy and comparability, and quantifies the implications of LLM stochasticity and finite draws. Importantly, the framework demonstrates that LLM-based causal inference is not a turnkey replacement for human experimentation: diagnostics can only falsify, not verify, key identification assumptions, making human experiments indispensable for genuinely novel treatment effects. Further directions include relaxing SUTVA, handling multiple outcomes/agents, and extending estimation to long-term surrogate settings.
