Causal-EPIG: Active Learning for CATE

Updated 29 September 2025

Causal-EPIG is a prediction-oriented active learning framework designed to optimize sample efficiency for conditional average treatment effect (CATE) estimation.
It employs two acquisition strategies—comprehensive (Causal-EPIG-μ) and focused (Causal-EPIG-τ)—to balance robustness and efficiency in reducing uncertainty of causal estimands.
Empirical findings demonstrate that Causal-EPIG effectively reduces estimation error (PEHE) compared to conventional methods, particularly in cost-constrained, high-stakes settings.

Causal-EPIG is a prediction-oriented active learning framework for conditional average treatment effect (CATE) estimation that operationalizes the principle of causal objective alignment in active learning settings. It is designed to maximize sample efficiency when labeling outcome data is costly, as in high-stakes applications such as healthcare or personalized policy design. In contrast with conventional pool-based active learning approaches that focus on uncertainty in model parameters or observable factual outcomes, Causal-EPIG redefines the acquisition objective so that each sample acquisition specifically targets the reduction of uncertainty in causal estimands—quantities that are inherently unobservable, such as potential outcomes and CATE. The framework adapts a general information-theoretic criterion, Expected Predictive Information Gain (EPIG), and presents both comprehensive and focused acquisition strategies, providing a unified and context-sensitive approach to sample-efficient CATE estimation.

1. Principle of Causal Objective Alignment

Conventional active learning strategies typically minimize uncertainty in quantities directly measurable in the data, such as observed outcomes or model weights. This objective function misalignment can lead to inefficient or even ineffective data acquisition when the true learning target is a fundamentally unobservable, counterfactual causal quantity. Causal-EPIG formalizes a general principle: active learning acquisition functions in causal inference should be aligned with the causal estimand of interest. Specifically, new queries are selected based on their expected utility in reducing uncertainty about the target causal effect, not just observable or associational uncertainty. This principle is operationalized through acquisition functions that target information gain about unobservable quantities: the pair of potential outcomes or their contrast, the CATE.

2. Information-Theoretic Acquisition Functions

The framework leverages an adaptation of the Expected Predictive Information Gain (EPIG) criterion, which quantifies the expected reduction in entropy of the causal quantity after observing the outcome of a new unit. Two acquisition strategies are developed:

Comprehensive (Causal-EPIG-μ) Strategy:

The acquisition function targets the full joint potential outcomes (y*(0), y*(1)) at a target covariate x*. The information gain for a candidate (x, t) is defined as:

$\text{Causal-EPIG-μ}(x, t) \equiv \mathbb{E}_{p_\text{tar}(x^*)} \left[ I(y; (y^*(0), y^*(1)) \mid x^*, D'_t) \right].$

This approach robustly models the entire causal mechanism but may spend acquisition effort on aspects of the outcome surface unrelated to CATE.

Focused (Causal-EPIG-τ) Strategy:

The acquisition function targets the CATE directly:

$\text{Causal-EPIG-τ}(x, t) \equiv \mathbb{E}_{p_\text{tar}(x^*)} \left[ I(y; \tau(x^*) \mid x^*, D'_t) \right],$

or, equivalently, as a KL divergence:

$\text{Causal-EPIG-τ}(x, t) = \mathbb{E}_{p_\text{tar}(x^*)} \left[ \text{KL}( p(y, \tau(x^*) \mid x^*, D'_t) \,\|\, p(y \mid D'_t) p(\tau(x^*) \mid x^*, D'_t)) \right].$

By targeting the treatment effect contrast, this approach is highly sample-efficient when the CATE is readily learnable but may accumulate more error if uncertainty about baseline outcomes is not adequately modeled.
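To make the "expected reduction in entropy" reading concrete, the mutual information inside the focused objective can be expanded with the standard entropy identity, using the same conditioning as the formulas above:

$I(y; \tau(x^*) \mid x^*, D'_t) = H\left[ p(\tau(x^*) \mid x^*, D'_t) \right] - \mathbb{E}_{p(y \mid D'_t)} \left[ H\left[ p(\tau(x^*) \mid y, x^*, D'_t) \right] \right].$

If, purely as an illustrative assumption that the source does not state, $y$ and $\tau(x^*)$ are treated as jointly Gaussian with correlation $\rho$, this quantity has a closed form:

$I(y; \tau(x^*) \mid x^*, D'_t) = -\tfrac{1}{2} \log\left(1 - \rho^2\right), \quad \rho = \text{Corr}(y, \tau(x^*) \mid x^*, D'_t).$

Under that assumption, a candidate is informative exactly to the extent that its outcome is correlated, under the current posterior, with the CATE at the target point.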

Both acquisition functions take an expectation over a target distribution $p_\text{tar}(x^*)$, which typically represents the target population or desired covariate support and may differ from the pool, and both condition on $D'_t$, the current labeled dataset.
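One way to approximate both scores from posterior draws is sketched below. It assumes a Gaussian approximation to the joint distribution of the candidate outcome and the target quantity, which is a choice made here for illustration and is not confirmed by the source. The function names `gaussian_mi` and `epig_score`, the array shapes, and the toy draws are all hypothetical, and noise-free arm means stand in for the potential outcomes.

```python
import numpy as np

def gaussian_mi(y_draws, target_draws):
    """I(y; target) for a Gaussian fitted to joint posterior draws (an assumption).

    y_draws:      (S,)   draws of the candidate's outcome y
    target_draws: (S, d) draws of the target quantity from the same posterior samples
    """
    joint = np.column_stack([y_draws, target_draws])      # (S, 1 + d)
    cov = np.cov(joint, rowvar=False)
    cov = cov + 1e-9 * np.eye(cov.shape[0])               # jitter for numerical stability
    _, logdet_joint = np.linalg.slogdet(cov)
    _, logdet_target = np.linalg.slogdet(cov[1:, 1:])
    # Gaussian identity: I = 0.5 * log(|S_y| * |S_target| / |S_joint|)
    return 0.5 * (np.log(cov[0, 0]) + logdet_target - logdet_joint)

def epig_score(y_draws, target_draws):
    """Average the Gaussian MI over M target points; target_draws is (S, M, d)."""
    n_targets = target_draws.shape[1]
    return float(np.mean([gaussian_mi(y_draws, target_draws[:, j, :])
                          for j in range(n_targets)]))

# Toy posterior draws at a single target point (hypothetical numbers).
rng = np.random.default_rng(0)
S = 50_000
mu0 = rng.normal(size=S)                       # draws of the control-arm surface
mu1 = rng.normal(size=S)                       # draws of the treated-arm surface
y_cand = mu1 + 0.5 * rng.normal(size=S)        # noisy outcome of a treated candidate

targets_mu = np.stack([mu0, mu1], axis=1)[:, None, :]   # (S, 1, 2): comprehensive target
targets_tau = (mu1 - mu0)[:, None, None]                # (S, 1, 1): focused target

score_mu = epig_score(y_cand, targets_mu)
score_tau = epig_score(y_cand, targets_tau)
```

In this toy setup the population values are ½ ln 5 for the comprehensive score and −½ ln 0.6 for the focused score, so the sample estimates should land close to them. Because the contrast is a function of the pair of outcomes, the comprehensive score can never be smaller than the focused one for the same candidate; the two criteria differ in how they rank candidates, not in which is numerically larger.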

3. Comprehensive vs. Focused Strategies: Trade-offs

The two main acquisition strategies—comprehensive (μ) and focused (τ)—embody a statistical trade-off:

Causal-EPIG-μ is robust, especially in settings with a complex relationship between covariates and outcomes or where the prognostic surface influences treatment effect estimation. Because it models (and acquires information about) the joint potential outcomes, it is less likely to miss important structure that affects CATE, but it may use the sampling budget less efficiently if irreducible (prognostic) variance dominates.
Causal-EPIG-τ is highly efficient in scenarios where treatment heterogeneity is readily separable from prognostic patterns, or when the base estimator is specifically parameterized for treatment effect contrast. However, it can underperform in settings where estimation error is driven by lack of information about prognostic factors, or when the underlying model decomposes CATE as a difference between uncertain marginal predictions. For models such as Bayesian Causal Forests (BCF), this nuanced partitioning affects sample efficiency.
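The remark about models that build the CATE from two marginal predictions can be made explicit with a standard variance identity. The notation $\mu_t(x)$ for the expected outcome under treatment $t$ is assumed here for illustration:

$\tau(x) = \mu_1(x) - \mu_0(x), \qquad \text{Var}[\tau(x)] = \text{Var}[\mu_1(x)] + \text{Var}[\mu_0(x)] - 2\,\text{Cov}[\mu_1(x), \mu_0(x)].$

Posterior uncertainty in either arm's outcome surface therefore propagates directly into the CATE. When the two surfaces are modeled with little shared structure, the covariance term is small and CATE uncertainty stays large until both surfaces are learned, which is one reading of why the focused strategy can struggle in such models.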

Empirical results in the paper demonstrate that the optimal strategy is context-dependent, varying with both the base estimator family (e.g., BCF, CMGP, NSGP) and the complexity of the underlying data generating process.

4. Methodological Workflow and Implementation Context

The Causal-EPIG framework is evaluated in a pool-based active learning setting. The practitioner starts with a pool of unlabeled samples and a small, possibly empty, training set. At each iteration:

A Bayesian CATE estimator (e.g., BCF, CMGP, NSGP) is fit on the observed data.
Each pool sample is scored using the chosen Causal-EPIG acquisition function (μ or τ).
The sample(s) with the highest expected information gain are acquired—i.e., outcome labels are observed and added to the training set.
The model is refit, and the process repeats.
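The loop above can be sketched end to end on toy data. This is a minimal illustration under stated assumptions, not the authors' implementation: a conjugate Bayesian linear model per treatment arm stands in for BCF, CMGP, or NSGP, the focused score uses a Gaussian closed form, and every name and constant (`fit_arm`, `epig_tau`, `NOISE_VAR`, `PRIOR_VAR`, the data generator) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_VAR, PRIOR_VAR = 0.25, 10.0   # assumed known noise variance and prior variance

def features(x):
    return np.column_stack([np.ones_like(x), x])

def fit_arm(x, y):
    """Posterior (mean, cov) of Bayesian linear regression weights for one arm."""
    phi = features(x)
    precision = np.eye(2) / PRIOR_VAR + phi.T @ phi / NOISE_VAR
    cov = np.linalg.inv(precision)
    return cov @ phi.T @ y / NOISE_VAR, cov

def epig_tau(x_cand, t_cand, post, x_tar):
    """Gaussian closed form of I(y; tau(x*)), averaged over the target points."""
    phi_tar = features(x_tar)                                        # (M, 2)
    var_tau = np.einsum("mi,ij,mj->m", phi_tar, post[0][1] + post[1][1], phi_tar)
    scores = np.empty(len(x_cand))
    for i, (x, t) in enumerate(zip(x_cand, t_cand)):
        phi = np.array([1.0, x])
        cov_t = post[t][1]
        var_y = phi @ cov_t @ phi + NOISE_VAR
        cov_y_tau = phi_tar @ cov_t @ phi        # (M,); its sign vanishes after squaring
        rho2 = cov_y_tau ** 2 / (var_y * var_tau)
        scores[i] = np.mean(-0.5 * np.log1p(-rho2))
    return scores

# Toy pool with observed treatments; the true CATE is 1 + 2x.
n_pool = 200
x_pool = rng.uniform(-2, 2, n_pool)
t_pool = rng.integers(0, 2, n_pool)
y_pool = (0.5 * x_pool + t_pool * (1 + 2 * x_pool)
          + rng.normal(0, np.sqrt(NOISE_VAR), n_pool))   # hidden until queried
x_tar = rng.uniform(-2, 2, 100)                           # draws from the target population

labeled, unlabeled, pehe_trace = [], list(range(n_pool)), []
for step in range(30):
    idx = np.array(labeled, dtype=int)
    # Step 1: fit the Bayesian estimator on the labeled data (one model per arm).
    post = [fit_arm(x_pool[idx][t_pool[idx] == a], y_pool[idx][t_pool[idx] == a])
            for a in (0, 1)]
    tau_hat = features(x_tar) @ (post[1][0] - post[0][0])
    pehe_trace.append(np.sqrt(np.mean((tau_hat - (1 + 2 * x_tar)) ** 2)))
    # Step 2: score every remaining pool unit.
    scores = epig_tau(x_pool[unlabeled], t_pool[unlabeled], post, x_tar)
    # Step 3: acquire the top-scoring unit; the refit happens at the next iteration.
    labeled.append(unlabeled.pop(int(np.argmax(scores))))
```

The sketch acquires one unit per round for clarity. Scoring a batch would only change the third step, although naive top-k selection ignores redundancy between the selected units.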

Comparisons against baselines—random selection, coreset strategies, and previous causal active learning approaches such as Causal-BALD and Causal-EIG—are presented on both synthetic data generators (including the Causal-BALD and Hahn models) and semi-synthetic datasets (IHDP, ACTG).

Performance is primarily evaluated in terms of Precision in Estimating Heterogeneous Effects (PEHE), capturing accuracy in CATE estimation, as a function of acquired samples.
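The source does not spell out the metric. Under the conventional definition, PEHE is the mean squared difference between the estimated and true CATE over the evaluation points, and it is often reported as its square root. A minimal sketch, with the hypothetical helper name `root_pehe`:

```python
import numpy as np

def root_pehe(tau_true, tau_pred):
    """Square root of the mean squared error between true and estimated CATE."""
    tau_true = np.asarray(tau_true, dtype=float)
    tau_pred = np.asarray(tau_pred, dtype=float)
    return float(np.sqrt(np.mean((tau_pred - tau_true) ** 2)))

perfect = root_pehe([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
biased = root_pehe([0.0, 0.0], [3.0, 4.0])
```

Computing this requires the true CATE at each evaluation point, which is why the evaluation relies on synthetic and semi-synthetic data where both potential outcomes are known.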

5. Experimental Findings and Performance Characteristics

Causal-EPIG (both acquisition variants) consistently outperforms conventional and prior causal active learning baselines in reducing PEHE for a given labeling budget. The performance gap widens as the labeling budget tightens. Specific findings include:

In low signal-to-noise or highly complex settings, Causal-EPIG-μ (comprehensive) is more robust, as it ensures the acquisition of information covering the full outcome surfaces.
In simpler settings or with estimators with strong contrastive parameterizations, Causal-EPIG-τ (focused) can achieve state-of-the-art sample efficiency.
In settings with distribution shift between the pool and the target population, directly targeting information gain over unobservable causal quantities using the specified $p_\text{tar}(x^*)$ further improves sample efficiency and robustness.

These results indicate that causal objective alignment, implemented via the EPIG formalism, provides a practical and theoretically motivated foundation for sample-efficient CATE estimation.

6. Limitations and Prospective Research Directions

Causal-EPIG assumes no unobserved confounding and presumes that the base Bayesian estimators provide well-calibrated uncertainty estimates. The framework does not currently support adaptive experimental design in which the learner assigns treatment; it operates on pool units whose treatment indicators are already observed. Extending Causal-EPIG to settings with hidden confounding, incorporating new uncertainty-calibrated models (such as CausalPFN), and developing algorithms for interventional design are identified as future research areas.

A plausible implication is that active learning for causal estimation will benefit from acquisition functions that dynamically adapt between comprehensive and focused objectives in response to the structure of the base estimator and data complexity. Further, this framework could be integrated into automated experimental platforms or clinical trial design tools where outcome measurement is a rate-limiting step.

7. Summary Table: Acquisition Strategies

Acquisition Strategy	Target Quantities	Strengths	Limitations/Trade-offs
Causal-EPIG-μ (comprehensive)	Joint potential outcomes	Robust to complex structure, covers full causal model	May expend budget on irrelevant outcomes
Causal-EPIG-τ (focused)	Treatment effect (CATE)	High efficiency when CATE is “simple”; direct targeting	May be fragile if baseline modeling is inaccurate

These strategies are mathematically formalized as:

Comprehensive: $\text{Causal-EPIG-μ}(x, t) = \mathbb{E}_{p_\text{tar}(x^*)} \left[ I(y; (y^*(0), y^*(1)) \mid x^*, D'_t) \right]$
Focused: $\text{Causal-EPIG-τ}(x, t) = \mathbb{E}_{p_\text{tar}(x^*)} \left[ I(y; \tau(x^*) \mid x^*, D'_t) \right]$

The Causal-EPIG framework thus defines a mathematically principled, empirically validated methodology for resource-constrained causal effect estimation, overcoming the objective mismatch pervasive in prior pool-based active learning strategies for CATE.
