Surrogate Confounders in Causal Inference

Updated 2 March 2026

Surrogate confounders are latent representations constructed from observed proxies to account for unmeasured confounding in causal inference.
Under suitable assumptions, they support unbiased estimation of causal effects through negative controls, bridge functions, and latent representation methods such as VAEs, ICA, and RNNs.
Applications include mediation analysis, policy evaluation, recommender systems, and longitudinal studies, where they are used to reduce confounding bias in effect estimates.

Surrogate confounders, also known as proxy confounders or latent confounders learned via surrogates, are representations or variables constructed from observed data that encapsulate the effect of unobserved confounders. These surrogates are leveraged to enable identification and unbiased estimation of causal effects or mediation pathways in the presence of latent common causes that cannot be measured directly. Surrogate confounders are foundational in modern causal inference, particularly with the advent of high-dimensional data and the realization that the ignorability (no unmeasured confounders) assumption is frequently untenable. Methods for constructing surrogate confounders range from negative-control and proxy-variable strategies to deep generative latent models and high-dimensional independent component analysis.

1. Conceptual Foundations and Definitions

A surrogate confounder is any constructed variable or learned representation that serves as a stand-in for an unobserved common cause (confounder) of variables in a causal model. The term encompasses:

Proxies derived from observed variables correlated with the latent confounder (e.g., noisy measurements of the confounder).
Learned low-dimensional latent variables that aggregate information from high-dimensional proxy measurements via statistical or machine learning models (e.g., PCA, ICA, VAEs, RNNs).
Negative control exposures and outcomes used in the proximal inference framework, which, under specific graphical separations, serve as statistical surrogates for the true confounders.
Surrogate confounders constructed by explicit factor models under the assumption of multi-cause confounding, where one or more unknown variables jointly induce dependencies among multiple treatment or exposure variables.

Unlike a directly measured confounder, a surrogate confounder is valid only under specific conditional independence and completeness (richness) assumptions. When these hold, surrogate confounders make it possible to relax assumptions such as sequential ignorability, which is rarely attainable in mediation analysis or policy evaluation with latent confounding (Cheng et al., 2021, Bennett et al., 2019).

2. Graphical Models, Identification, and Theoretical Guarantees

Surrogate confounders are formally defined within structural causal models (SCMs). In the corresponding directed acyclic graphs (DAGs), the latent confounders appear as unobserved nodes that induce dependence among observed variables. Surrogates are constructed from variables that are descendants of the confounder but are not parents of the variables of causal interest, and that satisfy specific d-separation criteria and completeness conditions.

The identifiability of causal effects (direct, indirect, total, policy value) via surrogate confounders requires:

Existence of a latent variable U such that, conditional on U, the observed variables satisfy the required independences (e.g., no unblocked backdoor paths remain between treatment and outcome or mediator) (Cheng et al., 2021).
Proxies rich enough that (U,X,…) is faithful to the graph (usually with U as a common parent and the proxies X as its noisy children) and that the joint law p(U,X,…) can be identified from data, often via explicit generative modeling or completeness-type conditions (Cheng et al., 2021, Huang et al., 2024, Mankovich et al., 2024).
For high-dimensional proxies, additional structure: identifiability typically exploits non-Gaussianity (ICA), a low-dimensional confounding space, and full-rank mixing between latent confounders and proxies (Mankovich et al., 2024).

Most frameworks provide conditional guarantees: if the model assumptions hold, surrogate confounders constructed as latent representations or as functions of proxies suffice for unbiased estimation via backdoor adjustment or bridge function approaches. For example, under multi-cause confounding, the blessing of dimensionality means that a surrogate confounder inferred from many causes renders those causes conditionally independent (Huang et al., 2024). For negative-control and proximal methods, completeness of the proxy kernel ensures that the bridge equations have a unique solution (Park et al., 2023, Hung et al., 25 Jan 2026).
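
The two identification routes named above can be written compactly. The block below is a sketch in notation common to the proximal inference literature rather than the notation of any one cited paper: A is the treatment, Y the outcome, U the latent confounder, Z a treatment-side proxy, W an outcome-side proxy, and h an outcome bridge function. These symbols are assumptions introduced here for illustration.

```latex
% Backdoor adjustment with a (recovered) latent confounder U
p\bigl(y \mid \mathrm{do}(a)\bigr) \;=\; \int p(y \mid a, u)\, p(u)\, \mathrm{d}u

% Outcome bridge function h: solve, for all (a, z),
\mathbb{E}\bigl[\,Y \mid A = a,\, Z = z\,\bigr]
  \;=\; \mathbb{E}\bigl[\,h(W, a) \mid A = a,\, Z = z\,\bigr]

% and then, under completeness, the mean potential outcome is
\mathbb{E}\bigl[\,Y(a)\,\bigr] \;=\; \mathbb{E}\bigl[\,h(W, a)\,\bigr]
```

The first line requires U (or a valid surrogate for it) to be available, while the bridge route never reconstructs U and instead relies on the proxies Z and W directly.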

3. Methodologies for Surrogate Confounder Construction

3.1 Proxy Variable and Negative Control Approaches

Negative control outcomes (NCOs) and exposures (NCEs) are proxy variables introduced to detect and control for unmeasured confounding. An NCO is associated with the latent confounder but unaffected by the treatment, enabling identification of the causal effect even without the direct measurement of the confounder (Park et al., 2023). In the single-proxy setting of the control outcome calibration approach (COCA), identification of the effect of treatment on the treated (ETT) proceeds via Fredholm equations involving the extended propensity score or via outcome-bridge functions relating the NCO distribution and the observed outcomes (Park et al., 2023).
Proximal surrogate indices leverage both outcome-aligned and surrogate-aligned proxies, employing bridge function equations to transport surrogate index methods to settings with unobserved confounding. Multiply robust estimators and corresponding inference procedures can be derived (Hung et al., 25 Jan 2026).
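
To make the bridge-function idea concrete, here is a minimal sketch on a linear-Gaussian toy model, where the outcome bridge reduces to a two-stage least-squares fit. It is not the estimator of any cited paper. The data-generating process, the function names (`simulate`, `ols`, `proximal_two_stage`), and the use of two proxies (`z` on the treatment side, `w` on the outcome side) are all assumptions made for illustration.

```python
import numpy as np

def simulate(n=100_000, tau=1.0, seed=0):
    """Toy model (assumed): U confounds A and Y; Z and W are noisy proxies of U."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                  # unmeasured confounder
    z = u + rng.normal(size=n)              # treatment-side proxy, no direct effect on Y
    w = u + rng.normal(size=n)              # outcome-side proxy, not affected by A
    a = u + rng.normal(size=n)              # treatment, confounded by U
    y = tau * a + u + rng.normal(size=n)    # outcome; tau is the causal effect
    return a, y, z, w

def ols(features, target):
    """Least-squares coefficients, with an intercept as the first entry."""
    design = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def proximal_two_stage(a, y, z, w):
    """Stage 1: project W on (A, Z). Stage 2: regress Y on (A, projected W)."""
    b = ols([a, z], w)
    w_hat = b[0] + b[1] * a + b[2] * z
    return ols([a, w_hat], y)[1]            # coefficient on A

tau = 1.0
a, y, z, w = simulate(tau=tau)
naive = ols([a], y)[1]                      # regression that ignores U
proximal = proximal_two_stage(a, y, z, w)
print(f"naive={naive:.3f}  proximal={proximal:.3f}  truth={tau:.3f}")
```

In this particular design the naive slope converges to tau + 0.5, because Cov(A, U) / Var(A) = 1/2, whereas the two-stage estimate targets tau itself. The linear bridge is what makes two regressions sufficient here. Nonlinear settings require solving the integral equations instead.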

3.2 Latent Variable Models

Variational autoencoders (VAEs) are widely used to encode proxy measurements into latent surrogate confounders. For example, in causal mediation analysis with hidden confounders, a deep latent variable model learns a multivariate U such that proxies, treatments, mediators, and outcomes are all generated conditionally on U; the model is trained to maximize the ELBO plus auxiliary supervised losses, and the learned U is used to estimate mediation effects by backdoor adjustment (Cheng et al., 2021).
In sequential and multi-cause contexts, surrogates are learned via recurrent neural networks (RNNs), such as GRUs, that summarize treatment and outcome histories into a surrogate confounder embedding at each timestep (Cheng et al., 2020).
For high-dimensional proxies, surrogate confounders can be constructed via independent component analysis (ICA). Under the assumption of non-Gaussian, mutually independent latent factors and full-rank mixing, ICA recovers the latent space up to scaling and permutation, and relevant components are selected by association tests with exposures and outcomes (Mankovich et al., 2024).
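
The ICA route in the last item can be sketched end to end on synthetic data. The code below is a hand-rolled symmetric fixed-point ICA with a tanh nonlinearity, written only to keep the example dependency-free. It is not the implementation of Mankovich et al., and it replaces formal association tests with a simple correlation score. The mixing matrix, sample size, and all function names are assumptions for illustration.

```python
import numpy as np

def whiten(x):
    """Center the data and rescale it to identity covariance."""
    xc = x - x.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(xc, rowvar=False))
    return xc @ eigvec / np.sqrt(eigval)

def fast_ica(xw, n_iter=500, tol=1e-8, seed=0):
    """Symmetric fixed-point ICA (tanh nonlinearity) on whitened data."""
    n, k = xw.shape
    rng = np.random.default_rng(seed)
    w, _ = np.linalg.qr(rng.normal(size=(k, k)))      # random orthogonal start
    for _ in range(n_iter):
        s = xw @ w.T                                  # current source estimates
        g = np.tanh(s)
        g_prime = 1.0 - g ** 2
        w_new = g.T @ xw / n - g_prime.mean(axis=0)[:, None] * w
        left, _, right = np.linalg.svd(w_new)         # symmetric decorrelation
        w_new = left @ right
        done = np.max(np.abs(np.abs(np.sum(w_new * w, axis=1)) - 1.0)) < tol
        w = w_new
        if done:
            break
    return xw @ w.T

def abs_corr(v, t):
    return abs(np.corrcoef(v, t)[0, 1])

# Synthetic data (assumed design): three independent Laplace sources, one of
# which (u) confounds treatment a and outcome y; proxies are a full-rank mix.
rng = np.random.default_rng(1)
n, tau = 20_000, 1.0
sources = rng.laplace(scale=1 / np.sqrt(2), size=(n, 3))
u = sources[:, 0]
mixing = np.array([[1.0, 0.5, 0.3], [0.4, 1.0, 0.6], [0.2, 0.7, 1.0]])
proxies = sources @ mixing.T
a = u + rng.normal(size=n)
y = tau * a + u + rng.normal(size=n)

# Recover components, keep the one most associated with both a and y.
components = fast_ica(whiten(proxies))
scores = [abs_corr(c, a) * abs_corr(c, y) for c in components.T]
u_hat = components[:, int(np.argmax(scores))]

# Adjust for the selected component in a linear outcome regression.
design = np.column_stack([np.ones(n), a, u_hat])
tau_hat = np.linalg.lstsq(design, y, rcond=None)[0][1]
tau_naive = np.linalg.lstsq(np.column_stack([np.ones(n), a]), y, rcond=None)[0][1]
print(f"naive={tau_naive:.3f}  ica_adjusted={tau_hat:.3f}  truth={tau:.3f}")
```

Because ICA recovers sources only up to sign and scale, the selected component is used as a regressor rather than compared to `u` directly. The noiseless, square mixing here is the most favorable case, and noisy or higher-dimensional proxies would call for a dimension-reduction step first.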

3.3 Algorithmic Table

Method/Class	Proxy Construction	Identification Key
COCA/Single Proxy	Negative control outcome	Fredholm/bridge equations
Proximal Surrogate Index	Outcome- and surrogate-aligned proxies	Bridge equations (completeness)
Latent variable VAE	Proxy→latent encoding	ELBO + auxiliary losses; d-separation
Sequential RNN/GRU	History→state embedding	Balancing (IPM / 1-Wasserstein)
ICA-based PCF	ICA on high-dimensional proxies	Non-Gaussianity, low-dimensional confounder
Dual VAE (multi-cause)	Multi-cause sets→surrogate	Multi-cause d-sep.

4. Applications and Empirical Validation

Surrogate confounders have been developed and validated across diverse domains:

Mediation analysis: Learning a latent U from proxies in observational mediation models yields lower bias and variance for direct, indirect, and total effect estimates than classic and IPW-based approaches, and the gains are reported to be robust to the proxy noise level (Cheng et al., 2021).
Recommendation systems: Multi-cause surrogate confounders (on both user and item sides) constructed via dual VAEs substantially reduce bias and improve recommendation accuracy over standard methods, with empirical improvements in Recall, Hit Rate, and NDCG across real datasets (Huang et al., 2024). Surrogates learned via prior recommender logs remove exposure and popularity biases (Xu et al., 2023).
Long-term effect estimation: RNN-encoded surrogate confounders summarize time-varying unobserved confounding in longitudinal studies. Conditioning on these surrogates supports unbiased estimation of primary (long-term) treatment effects, with an improved bias-variance trade-off reported on synthetic and real data (Cheng et al., 2020).
High-dimensional continuous treatment: ICA-PCF applied to climate data recovers up to 75.9% of variance of the known confounder (North Atlantic Oscillation), outperforming PCA- and PLS-based alternatives in bias and causal error metrics (Mankovich et al., 2024).
Policy evaluation: Surrogate confounders constructed from proxies enable adversarial balance minimization to achieve root-n consistency in off-policy evaluation despite latent confounders (Bennett et al., 2019).
Survey research: Simulations confirm that adding sufficiently informative, independent proxies steadily reduces confounding bias, but practical challenges such as noise and redundancy require careful diagnostic assessment (Press, 30 Nov 2025).

5. Assumptions, Practicalities, and Limitations

The effectiveness of surrogate confounders is predicated on several stringent model and data assumptions:

Proxies must carry sufficient information (completeness, informativeness) about the latent confounders; otherwise, residual confounding persists.
Identifiability of the surrogate depends crucially on the design: e.g., non-Gaussianity for ICA, faithfulness of the graphical model, or presence of multiple causes.
For negative control/proximal approaches, bridge completeness is essential and must be checked or argued via substantive knowledge.
Latent variable (VAE, RNN) approaches only recover an approximation to the true latent, and performance depends on the adequacy of the generative model and the ability to fit/regularize the latent bottleneck.
In high-dimensional contexts, redundancy, noise, or collinearity among proxies limits practical effectiveness; a large number of proxies k may be required to reduce bias substantially (Press, 30 Nov 2025).
Multi-cause deconfounding does not recover single-cause confounders; sparsity in data (e.g., users/items with only one history) impedes identification (Huang et al., 2024).
All methods are limited if proxies and true confounders do not satisfy the required independence or conditional-exclusion constraints, and are sensitive to violation of faithfulness or non-Gaussianity (for ICA approaches).
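
The first and fifth points above (residual confounding under weak proxies, and the need for many proxies) can be illustrated with a small simulation. This is a sketch on an assumed linear-Gaussian model, not a reproduction of any cited study. The function names and the closed-form bias 1 / (2 + k / noise_sd²) are derived here for this toy model only.

```python
import numpy as np

def adjusted_effect(n_proxies, n=50_000, tau=1.0, noise_sd=1.0, seed=0):
    """OLS estimate of tau after adjusting for n_proxies noisy copies of U."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                                    # latent confounder
    proxies = u[:, None] + noise_sd * rng.normal(size=(n, n_proxies))
    a = u + rng.normal(size=n)                                # treatment
    y = tau * a + u + rng.normal(size=n)                      # outcome
    design = np.column_stack([np.ones(n), a, proxies])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]                                            # coefficient on A

def theoretical_bias(n_proxies, noise_sd=1.0):
    """Large-sample bias in this toy model: v / (v + 1), v = Var(U | proxies)."""
    v = 1.0 / (1.0 + n_proxies / noise_sd ** 2)
    return v / (v + 1.0)

tau = 1.0
biases = {k: adjusted_effect(k, tau=tau) - tau for k in (0, 1, 5, 20)}
for k, bias in biases.items():
    print(f"k={k:2d}  empirical bias={bias:.3f}  theory={theoretical_bias(k):.3f}")
```

With independent, equally noisy proxies the bias shrinks steadily but never reaches zero for finite k. Correlated proxy noise, which this sketch does not model, would slow the decline further.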

6. Future Directions and Open Challenges

Future research directions include:

Extending surrogate confounder frameworks to nonlinear, non-Gaussian, or time-varying causal structures.
Developing diagnostic and sensitivity analysis tools for practical assessment of surrogate quality (e.g., empirical R² diagnostics, proxy informativeness checks).
Automating the discovery and selection of valid proxy sets via data-driven algorithms using second-order rank constraints or higher-order independence conditions, as in Proxy-Rank and Proxy-GIN methods (Xie et al., 2024).
Addressing the estimation of population ATE as opposed to ETT in single-proxy settings, which currently relies on additional nested proxies or restrictive rank-preservation (Park et al., 2023).
Investigating extensions to causal inference with multiple outcomes, categorical treatments, or weak faithfulness.

7. Summary Table of Key Papers

Paper	Main Surrogate Construction	Domain/Setting
"Causal Mediation Analysis with Hidden Confounders" (Cheng et al., 2021)	Proxy-based VAE latent U	Causal mediation/fairness
"Long-Term Effect Estimation with Surrogate Representation" (Cheng et al., 2020)	RNN φ_t encoding of histories	Longitudinal effects
"Policy Evaluation with Latent Confounders via Optimal Balance" (Bennett et al., 2019)	Proxy-based posterior φ(u|x,a)	Off-policy evaluation
"Recovering Latent Confounders from High-dimensional Proxy Variables" (Mankovich et al., 2024)	ICA/GD decomposition of proxies	High-d continuous cases
"Multi-Cause Deconfounding for Recommender Systems..." (Huang et al., 2024)	Dual VAE on multi-cause sets	Recommender systems
"Single Proxy Control" (Park et al., 2023)	Negative control outcome (NCO)	Epidemiology/Surveys
"Automating the Selection of Proxy Variables..." (Xie et al., 2024)	Data-driven proxy set selection	Linear multi-treatment

Surrogate confounders have become central to the empirical and methodological foundations of modern causal inference, providing a systematic framework for bias mitigation and effect identification in the ubiquitous presence of unmeasured or latent confounding.