Proximal Causal Learning (PCL)
- Proximal Causal Learning (PCL) is a framework that uses measured proxies to address unmeasured confounding in observational causal inference.
- It leverages bridge functions and the proximal g-formula to generalize traditional methods for more reliable causal effect estimates.
- PCL is applicable to both point treatment and time-varying settings, offering robust identification when standard assumptions fail.
Proximal Causal Learning (PCL) is a formal framework for causal inference in observational data that systematically addresses settings plagued by unmeasured confounding, where measured covariates serve only as proxies—imperfect measurements—of the true confounding mechanisms. This approach extends beyond the conventional exchangeability assumption, providing nonparametric identification and estimation procedures for causal effects even when classic "no unmeasured confounding" conditions fail. PCL generalizes foundational causal methods such as the g-formula and g-computation, enabling practitioners to draw credible causal inferences using appropriately classified and sufficiently informative proxies.
1. Conceptual Foundations and Framework
PCL is motivated by the observation that, in practice, investigators seldom measure all confounders perfectly. Instead, covariate measurements are often noisy, capturing only partial information about the true, unobserved confounders. In the potential outcome framework, let A denote the treatment, Y the outcome, L the measured covariates, and U the unmeasured confounders. The core estimand is the mean potential outcome under treatment a, denoted ψ(a) = E[Y(a)].
Traditional approaches assume exchangeability, Y(a) ⫫ A | L, but PCL recognizes that this is rarely satisfied. Therefore, it partitions the measured covariates L into three groups:
- X: common causes of A and Y (possibly well-measured confounders);
- Z: treatment-inducing proxies, correlated with the unmeasured confounders U and with A, but not direct causes of Y;
- W: outcome-inducing proxies, correlated with the unmeasured confounders U and with Y, but not direct causes of A.
A typical linear model illustrating this framework is Y = β₀ + βₐA + βᵤU + ε, together with a measurement model such as W = U + error, where W proxies the latent U.
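As a toy illustration of why proxies matter, the following sketch simulates a minimal linear data-generating process in the standard PCL notation (A treatment, Y outcome, U latent confounder, Z a treatment-inducing proxy, W an outcome-inducing proxy). All coefficients are illustrative assumptions, not values from any study:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent confounder U: never observed by the analyst.
U = rng.normal(size=n)

# Noisy proxies of U: Z relates to Y only through (U, A),
# and W relates to A only through U.
Z = 0.8 * U + rng.normal(size=n)   # treatment-inducing proxy
W = 0.9 * U + rng.normal(size=n)   # outcome-inducing proxy

# Treatment depends on U (and Z); outcome depends on A and U.
A = U + 0.5 * Z + rng.normal(size=n)
beta_a = 2.0                        # true causal effect of A on Y
Y = beta_a * A + 1.5 * U + rng.normal(size=n)

# Naive regression of Y on A omits U and is therefore biased.
naive_slope = np.cov(A, Y)[0, 1] / np.var(A)
print(naive_slope)                  # noticeably above the true value 2.0
```

Adjusting for Z and W as if they were ordinary confounders would not remove this bias; PCL instead exploits their proxy structure.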
2. Challenges, Assumptions, and Bridge Function Approach
Main challenges in causal learning from proxies:
- Identification of causal effects is an ill-posed inverse problem due to unmeasured confounding.
- Reliance on proxies introduces dependence on how well they “cover” the hidden confounder.
PCL addresses these by introducing:
- A rigorous classification of measured variables into confounders and proxies.
- Conditional independence assumptions, including:
  - Z ⫫ Y | A, U, X (treatment-proxy independence),
  - W ⫫ (A, Z) | U, X (outcome-proxy independence).
Completeness is crucial: proxies must be "rich enough" to allow for the inversion of the mapping from proxies to confounders. Categorically, this requires d_Z ≥ d_U and d_W ≥ d_U, where d_Z, d_W, d_U are the numbers of categories in Z, W, U, respectively.
The core of identification is the outcome bridge function h(w, a, x), defined by
E[Y | Z, A, X] = ∫ h(w, A, X) dF(w | Z, A, X),
which is a Fredholm integral equation of the first kind, solved for h.
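In the discrete case the bridge equation reduces to a finite linear system and can be solved exactly. The toy example below (binary U, Z, W at a fixed treatment level, with made-up probabilities) verifies that averaging the solved bridge function over the marginal of W recovers E[Y] at that treatment level:

```python
import numpy as np

# Toy discrete example with binary U, Z, W at a fixed treatment level a.
# The analyst observes only (Y, Z, W); U itself is latent.
p_u = np.array([0.6, 0.4])                 # P(U=u)
p_z_given_u = np.array([[0.8, 0.3],        # P(Z=z | U=u), rows index z
                        [0.2, 0.7]])
p_w_given_u = np.array([[0.9, 0.2],        # P(W=w | U=u), rows index w
                        [0.1, 0.8]])
ey_given_u = np.array([1.0, 3.0])          # E[Y | U=u, A=a]

# Observable ingredients implied by the latent model:
p_zu = p_z_given_u * p_u                   # joint P(Z=z, U=u)
p_u_given_z = p_zu / p_zu.sum(axis=1, keepdims=True)
p_w_given_z = p_u_given_z @ p_w_given_u.T  # uses W independent of Z given U
ey_given_z = p_u_given_z @ ey_given_u

# Bridge equation as a discrete Fredholm system:
#   E[Y | Z=z] = sum_w h(w) P(W=w | Z=z)   -> solve for h.
h = np.linalg.solve(p_w_given_z, ey_given_z)

# Averaging h over the marginal of W recovers
# E[Y] = 0.6*1.0 + 0.4*3.0 = 1.8 at this treatment level.
p_w = p_w_given_u @ p_u
print(h, h @ p_w)
```

Note that h also satisfies E[h(W) | U=u] = E[Y | U=u, A=a], the defining property of an outcome bridge function.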
3. Proximal Identification, g-Formula, and Algorithms
Given the above structure, the proximal g-formula identifies the mean potential outcome as
ψ(a) = ∫∫ h(w, a, x) dF(w, x) = E[h(W, a, X)],
where h is the solution to the bridge equation. In the classical setting with no unmeasured confounding, this reduces to the standard g-formula.
Algorithm for estimation:
- Specify and fit models for the proxy distribution and the bridge function h.
- Use regression techniques (e.g., penalized least squares, maximum likelihood) to fit h from the observed data (Y, A, X, Z, W).
- Estimate ψ(a) by averaging the fitted bridge function h(W, a, X) over the observed proxies and covariates.
Special case—Proximal 2SLS: If all models are linear:
- Stage 1: Predict W from (Z, A, X) by least squares, yielding Ŵ.
- Stage 2: Regress Y on (Ŵ, A, X); the coefficient on A estimates the causal effect.
This structure generalizes the familiar instrumental variable estimators to the case of proxies for confounding.
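On simulated data the two-stage procedure takes only a few lines. The sketch below (assuming linear bridge functions, no observed X, and illustrative coefficients) recovers the true effect where naive OLS does not; it is a minimal illustration under these assumptions, not a production estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Simulated data with a latent confounder U (illustrative coefficients).
U = rng.normal(size=n)
Z = 0.8 * U + rng.normal(size=n)            # treatment-inducing proxy
W = 0.9 * U + rng.normal(size=n)            # outcome-inducing proxy
A = U + 0.5 * Z + rng.normal(size=n)        # treatment
Y = 2.0 * A + 1.5 * U + rng.normal(size=n)  # true effect of A is 2.0

def ols(X, y):
    """Least-squares coefficients for y ~ X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)

# Stage 1: project the outcome proxy W on (1, Z, A).
design1 = np.column_stack([ones, Z, A])
W_hat = design1 @ ols(design1, W)

# Stage 2: regress Y on (1, W_hat, A); the coefficient on A
# is the proximal 2SLS estimate of the causal effect.
proximal_est = ols(np.column_stack([ones, W_hat, A]), Y)[2]

# Naive OLS of Y on (1, A) for comparison: biased by the omitted U.
naive_est = ols(np.column_stack([ones, A]), Y)[1]
print(naive_est, proximal_est)  # naive is biased upward; proximal is near 2.0
```

The design choice mirrors instrumental-variable 2SLS: the treatment-inducing proxy Z plays the role of the instrument, while the fitted outcome proxy Ŵ absorbs the confounding that the latent U would otherwise induce.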
4. Sufficient Conditions, Robustness, and Generalizations
Sufficient conditions for identification include:
- Conditional independence of proxies given the unobserved confounder and relevant variables.
- Completeness of the proxies with respect to U.
- Existence of a solution to the bridge equation.
When these are satisfied, and the proxies are informative, PCL provides nonparametric identification even when standard adjustment fails.
For time-varying treatments:
- The approach generalizes recursively, with sequential bridge functions that allow for the estimation of longitudinal or dynamic causal effects.
- The "longitudinal proximal g-formula" provides a path to identification in settings where the sequential randomization assumption cannot be justified.
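For concreteness, the backward recursion can be sketched for two treatment periods. The display below is a schematic of the nested-bridge structure only (subscripts index visits, overbars denote histories; the exact conditioning sets depend on the longitudinal identification conditions and the particular formulation):

```latex
% Inner bridge h_2, solved at the final visit:
E[Y \mid \bar{Z}_2, \bar{A}_2 = \bar{a}_2, X]
  = E[\, h_2(W_2, \bar{a}_2, X) \mid \bar{Z}_2, \bar{A}_2 = \bar{a}_2, X \,]

% Outer bridge h_1, obtained by backward recursion from h_2:
E[\, h_2(W_2, \bar{a}_2, X) \mid Z_1, A_1 = a_1, X \,]
  = E[\, h_1(W_1, a_1, X) \mid Z_1, A_1 = a_1, X \,]

% Longitudinal proximal g-formula:
\psi(\bar{a}_2) = E[\, h_1(W_1, a_1, X) \,]
```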
5. Applications and Empirical Illustration
SUPPORT Study (Right Heart Catheterization):
- Treatment: RHC (Yes/No), Outcome: 30-day survival.
- Multiple measured physiological covariates: 10 are candidate proxies for unmeasured severity.
- Proxies Z and W allocated based on observed associations.
- Standard OLS (SE 0.28 days) and Proximal 2SLS (SE 0.43 days) give different point estimates, with the proximal estimate indicating greater harm. Conventional methods understate harm attributable to unmeasured confounding.
Longitudinal Methotrexate Study:
- Methotrexate therapy in rheumatoid arthritis patients.
- Time-varying proxies assigned at each visit.
- Recursive algorithm with linear bridge shows a more protective effect than traditional methods, highlighting PCL's ability to correct for longitudinal latent confounding.
6. Point Treatment vs. Time-Varying Settings and Implementation Considerations
Point Treatment:
- Proximal identification and estimation as above, with possible application of 2SLS-type algorithms when bridge functions are linear.
Time-Varying:
- Repeat estimation of proxies and bridge functions at each time point.
- Backward recursion is used for dynamic treatment regime estimation, with improved robustness to misspecification in early-stage models.
Implementation Considerations:
- Correct proxy variable classification is critical.
- Satisfying completeness and independence assumptions is nontrivial and relies on subject-matter expertise.
- The methodology's robustness to model misspecification depends on the use of recursive estimation and the inclusion of rich, informative proxies.
The Proximal Causal Learning framework generalizes conventional causal inference by enabling nonparametric identification and estimation of causal effects in the presence of unmeasured confounding—provided that proxy variables are available and appropriately leveraged. Through the proximal g-formula and generalized computation algorithms, PCL offers practical tools for both static and dynamic causal questions, quantitatively addressing the pervasive challenge of hidden confounders in observational research. Analyses of real and simulated data demonstrate that PCL often reveals stronger or different effects than conventional methods, directly attributable to its explicit modeling of latent confounding structure.