Influence-Based Adaptive Sample Borrowing
- The paper introduces an influence function-based adaptive borrowing method that quantifies the impact of each external sample on the RCT outcome model.
- It employs a two-stage data-driven procedure that ranks external controls by influence scores and selects the subset minimizing the estimated mean squared error.
- Empirical evaluations demonstrate that the approach reduces both variance and bias in treatment effect estimates compared to traditional borrowing methods.
Influence-based adaptive sample borrowing is a class of methodologies that utilize influence function theory to quantify the impact and comparability of individual external control samples when augmenting randomized controlled trials (RCTs). These techniques adaptively select which external control samples to borrow based on their influence on model parameters or loss functions, with the dual objective of maximizing the efficiency of treatment effect estimation and minimizing bias due to lack of exchangeability between external and randomized controls (Yang et al., 5 Aug 2025).
1. Influence Function Quantification of Sample Comparability
Central to this approach is the use of influence functions to measure how much the inclusion of an external sample would perturb the estimator or the loss over the RCT controls. Specifically, after fitting an outcome model to the RCT controls (parameterized by θ̂), the influence of external sample z = (x, y) is estimated as follows:
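- The parameter influence is defined, in the standard first-order form, by
$$\mathcal{I}_{\text{param}}(z) = -H_{\hat\theta}^{-1}\,\nabla_\theta \ell(z, \hat\theta),$$
where $\ell$ is the loss and $H_{\hat\theta}$ is the Hessian of the loss function over the RCT controls, evaluated at $\hat\theta$.
- The effect of z on the loss for any RCT control sample $z_j$ is
$$\mathcal{I}_{\text{loss}}(z, z_j) = -\nabla_\theta \ell(z_j, \hat\theta)^\top H_{\hat\theta}^{-1}\,\nabla_\theta \ell(z, \hat\theta).$$
- The aggregate influence score used for selection combines these loss effects across the RCT controls into a single score per external sample.
A lower score indicates greater comparability, as the external sample minimally perturbs the control fit.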
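As a rough illustration of the scoring step, the sketch below assumes a linear outcome model with squared-error loss and uses the standard first-order influence approximation. The aggregation rule (mean absolute loss influence over the RCT controls), the small `ridge` term, and the function name `influence_scores` are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def influence_scores(X_rct, y_rct, X_ext, y_ext, ridge=1e-8):
    """Score external controls by their influence on an RCT-control linear fit.

    Hypothetical helper: the aggregation (mean absolute loss influence)
    is an assumption of this sketch.
    """
    n, d = X_rct.shape
    # Fit the outcome model on the RCT controls by ordinary least squares.
    theta, *_ = np.linalg.lstsq(X_rct, y_rct, rcond=None)
    # Hessian of the average loss 0.5 * (y - x^T theta)^2 over RCT controls,
    # with a tiny ridge term (assumed) to keep it invertible.
    H = X_rct.T @ X_rct / n + ridge * np.eye(d)
    # Per-sample loss gradients at theta_hat: -(y - x^T theta) * x.
    g_rct = -(y_rct - X_rct @ theta)[:, None] * X_rct  # shape (n, d)
    g_ext = -(y_ext - X_ext @ theta)[:, None] * X_ext  # shape (m, d)
    # Parameter influence of each external sample: -H^{-1} grad.
    param_inf = -np.linalg.solve(H, g_ext.T).T         # shape (m, d)
    # Loss influence of external sample i on RCT control j:
    # grad_j^T param_inf_i = -grad_j^T H^{-1} grad_i.
    loss_inf = g_rct @ param_inf.T                     # shape (n, m)
    # Assumed aggregate: mean absolute loss influence per external sample.
    return np.abs(loss_inf).mean(axis=0)
```

Sorting the returned scores in increasing order (for example with `np.argsort`) gives a candidate ranking in which the most comparable external controls come first.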
2. Semiparametric Efficient Estimation under Exchangeability
When the subset of external controls selected by the above measure is exchangeable with the RCT controls, the paper constructs a semiparametric efficient estimator for the average treatment effect (ATE) in the RCT target population. The estimator is built from the efficient influence function for this setting, which can be estimated by plugging in appropriate nuisance estimators. Under correct model specification and the exchangeability assumption, the resulting estimator is semiparametrically efficient.
3. Bias–Variance Trade-off under Non-Exchangeability
If the exchangeability assumption fails for the selected subset, the estimator is biased. In this regime, the estimator's asymptotic distribution is shifted by that bias, with the remaining quantities defined as in the efficient influence function. This formalizes the bias–variance trade-off intrinsic to adaptive sample borrowing: borrowing more can reduce variance, but including poorly matched samples increases bias.
4. Data-driven Subset Selection via Estimated Mean Squared Error
To operationalize influence-based adaptive borrowing, the paper proposes a two-stage data-driven procedure:
- Ranking External Controls by Influence: Compute influence scores for all external control samples and sort them in increasing order.
- MSE-based Subset Selection: For each candidate Top-K subset (the K most comparable external controls), estimate the mean squared error (MSE) of the ATE estimator, defined as the sum of the estimated squared bias and the estimated variance. The subset that minimizes the estimated MSE is chosen for borrowing. This procedure does not require exchangeability of all external controls; instead, it balances the bias–variance trade-off empirically.
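The two stages above can be sketched in a deliberately simplified setting. The example targets only the control-arm mean rather than the paper's full ATE estimator, and it uses crude plug-in quantities: the bias of a pooled mean is estimated by its difference from the RCT-only mean, and the variance by the pooled sample variance divided by the pooled sample size. These choices and the name `select_top_k` are assumptions for illustration only.

```python
import numpy as np

def select_top_k(y_rct, y_ext, scores):
    """Pick the Top-K external controls minimizing a plug-in MSE estimate.

    Hypothetical helper: simplified stand-in for the MSE-based selection,
    applied to the control-arm mean.
    """
    order = np.argsort(scores)        # most comparable (lowest score) first
    y_sorted = y_ext[order]
    mu_rct = y_rct.mean()             # RCT-only reference estimate
    best_k, best_mse = 0, np.inf
    mse_path = []
    for k in range(len(y_sorted) + 1):
        pooled = np.concatenate([y_rct, y_sorted[:k]])
        bias_hat = pooled.mean() - mu_rct           # plug-in bias estimate
        var_hat = pooled.var(ddof=1) / len(pooled)  # variance of pooled mean
        mse_hat = bias_hat**2 + var_hat             # squared bias + variance
        mse_path.append(mse_hat)
        if mse_hat < best_mse:
            best_k, best_mse = k, mse_hat
    # Indices (into y_ext) of the borrowed controls, plus the MSE curve.
    return order[:best_k], np.array(mse_path)
```

Plotting the returned MSE curve against K is a quick way to see where additional borrowing stops helping in a given dataset.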
5. Empirical Evaluation and Practical Performance
Simulations with both linear and nonlinear outcome models, as well as real-world datasets (e.g., the National Supported Work program and PSID), show the following:
- The influence-based method consistently achieves lower MSE than the benchmarks: standard AIPW (RCT only), full borrowing (all external controls), and lasso-based bias penalization methods.
- As more external controls are borrowed past the optimal K, empirical MSE starts to increase due to accumulating bias from non-comparable samples, highlighting the tunable trade-off.
- In real data, the approach simultaneously reduced the variance and bias of treatment effect estimates.
These results support the claim of improved efficiency (variance reduction) without compromising bias control.
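For reference, the RCT-only AIPW benchmark mentioned above can be sketched as follows. This is the textbook augmented inverse probability weighting estimator with a known randomization probability and linear outcome models fit separately per arm. The function name `aipw_rct`, the linear working models, and the default `pi=0.5` are assumptions of this sketch and are not drawn from the paper's experimental code.

```python
import numpy as np

def aipw_rct(X, a, y, pi=0.5):
    """RCT-only AIPW estimate of the ATE with a known treatment probability.

    Hypothetical helper: linear outcome models per arm are an assumption.
    """
    Xd = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    # Fit arm-specific outcome regressions by least squares.
    beta1, *_ = np.linalg.lstsq(Xd[a == 1], y[a == 1], rcond=None)
    beta0, *_ = np.linalg.lstsq(Xd[a == 0], y[a == 0], rcond=None)
    m1, m0 = Xd @ beta1, Xd @ beta0
    # AIPW score: outcome-model contrast plus weighted residual corrections.
    phi = m1 - m0 + a * (y - m1) / pi - (1 - a) * (y - m0) / (1 - pi)
    tau_hat = phi.mean()
    se_hat = phi.std(ddof=1) / np.sqrt(len(y))  # plug-in standard error
    return tau_hat, se_hat
```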
6. Relationship to Broader Influence-based Borrowing Literature
This framework is distinguished from earlier approaches by its use of first-order influence function calculations for granular, per-sample borrowing decisions. Unlike simple calibration weighting or global penalization, it directly quantifies the impact of each external observation on the fitted RCT outcome model and carries this through to the final estimator via an explicit MSE minimization over the influence-ranked Top-K borrowing sets. Although it shares the core bias–variance balancing theme with related works (e.g., adaptive lasso bias-selection (Gao et al., 2023)), its use of influence functions for ranking and subset selection is novel and enables more precise, data-adaptive borrowing schemes.
7. Technical Convergence and Implementation Considerations
- The influence function framework relies on the loss function of the fitted model being twice differentiable and on its Hessian being invertible.
- In high-dimensional or non-smooth models, efficient computation of Hessian–vector products and robust regularization are essential.
- The mean squared error (MSE) computation for subset selection, while empirical, presupposes reasonably accurate variance and bias estimation; thus, the method may require calibration or resampling-based error assessment in small-sample or highly non-smooth settings.
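One common way to handle the Hessian-related points above, shown here only as a sketch, is to avoid forming or inverting the Hessian and instead solve a damped linear system with conjugate gradients, using nothing but Hessian–vector products. The function name `solve_damped_cg`, the `damping` value, and the stopping rule are assumptions for illustration. The source does not state which solver or regularizer the paper uses.

```python
import numpy as np

def solve_damped_cg(hvp, g, damping=1e-3, tol=1e-10, max_iter=200):
    """Approximately solve (H + damping * I) v = g by conjugate gradients.

    hvp: callable returning H @ p for a vector p (H symmetric PSD assumed).
    Hypothetical helper; damping acts as a simple regularizer.
    """
    v = np.zeros_like(g, dtype=float)
    r = np.array(g, dtype=float)  # residual g - A v, with v = 0 initially
    p = r.copy()                  # first search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:     # stop once the residual norm is tiny
            break
        Ap = hvp(p) + damping * p
        alpha = rs / (p @ Ap)     # step length along p
        v += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p  # next conjugate direction
        rs = rs_new
    return v
```

For a least-squares loss with design matrix `X` of `n` rows, a matching product would be `lambda p: X.T @ (X @ p) / n`, which never materializes the full Hessian.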
In summary, influence-based adaptive sample borrowing as formalized in (Yang et al., 5 Aug 2025) is a principled mechanism for integrating external controls into RCT analysis. It combines sample-level influence quantification with bias–variance optimization to improve statistical efficiency while limiting bias from lack of comparability. Its influence function–driven subset selection provides a basis for adaptive borrowing in experimental designs where external data are available.