When Selection Meets Intervention: Additional Complexities in Causal Discovery

Published 10 Mar 2025 in cs.LG | (2503.07302v1)

Abstract: We address the common yet often-overlooked selection bias in interventional studies, where subjects are selectively enrolled into experiments. For instance, participants in a drug trial are usually patients of the relevant disease; A/B tests on mobile applications target existing users only, and gene perturbation studies typically focus on specific cell types, such as cancer cells. Ignoring this bias leads to incorrect causal discovery results. Even when recognized, the existing paradigm for interventional causal discovery still fails to address it. This is because subtle differences in when and where interventions happen can lead to significantly different statistical patterns. We capture this dynamic by introducing a graphical model that explicitly accounts for both the observed world (where interventions are applied) and the counterfactual world (where selection occurs while interventions have not been applied). We characterize the Markov property of the model, and propose a provably sound algorithm to identify causal relations as well as selection mechanisms up to the equivalence class, from data with soft interventions and unknown targets. Through synthetic and real-world experiments, we demonstrate that our algorithm effectively identifies true causal relations despite the presence of selection bias.


Authors (8)

Summary

The paper introduces the interventional twin graph model to explicitly address selection bias when interventions occur after selection.
The paper presents the CDIS algorithm, which employs a three-step orientation process to accurately infer causal relations and adjust for selection effects.
Experimental results on simulated and real-world datasets demonstrate that CDIS reduces spurious correlations and outperforms established baseline methods.

This paper addresses the challenge of selection bias in interventional causal discovery, a common issue where subjects in experiments are not randomly sampled from the general population but are pre-selected based on certain criteria. The authors argue that ignoring this bias leads to incorrect causal conclusions, and existing methods for interventional causal discovery or observational causal discovery with selection bias are insufficient. This insufficiency arises because subtle differences in when and where interventions occur relative to the selection process can significantly alter statistical patterns.

Motivation and Problem Statement:

The paper illustrates that standard augmented DAGs, commonly used in interventional causal discovery, fail to accurately model scenarios where selection precedes intervention. Examples provided, such as a clinical trial where only patients with a disease are enrolled, demonstrate that applying interventions after selection can lead to conditional independencies and dependencies that contradict predictions from conventional augmented DAGs. This discrepancy necessitates a new graphical modeling approach.

Proposed Causal Model: Interventional Twin Graph

To address this, the paper introduces the "interventional twin graph" ( $G^{(I)}$ ). This model explicitly accounts for both the observed world (where interventions are applied) and a counterfactual "basal" world (where selection occurs before interventions).

Key components of the interventional twin graph for an intervention target $I$ :

$\zeta$ : An exogenous binary indicator for intervention status.
$X$ : Variables in the observed reality (observational or interventional).
$X^*_{aff}$ : Variables in the unobserved counterfactual basal world, representing pre-intervention values for variables affected by the intervention ( $de_G(I)$ ). Unaffected variables are represented solely by $X$ .
$E_{aff}$ : Common exogenous noise terms shared by both worlds for affected variables.
$S^*$ : Selection status variables in the counterfactual basal world.

Edges in $G^{(I)}$ represent:

Causal effects in both worlds (e.g., $X_i \to X_j$ and $X^*_i \to X^*_j$).
Selection edges in the counterfactual world (e.g., $X^*_i \to S^*$).
Common exogenous influences (e.g., $X_i \leftarrow E_i \to X^*_i$).
Mechanism changes due to intervention (e.g., $\zeta \to X_i$ for $i \in I$).

Crucially, all observed data (both observational and interventional) is conditioned on $S^* = 1$, meaning selection has already occurred in the basal world. This model helps explain why, for instance, $p(X_j \mid X_i)$ might change even if $X_i$ does not cause $X_j$, due to selection effects interacting with the intervention on $X_i$.
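The effect of selecting in the basal world before intervening can be checked numerically. The sketch below is a toy simulation, not the paper's experiment: two causally unrelated variables, a threshold selection rule, and an affine soft intervention are all assumptions chosen for illustration. It compares the selected cohort with and without the intervention on `x1`.

```python
import numpy as np

def selected_sample(intervene: bool, n: int = 200_000, seed: int = 0):
    """Draw a cohort that was selected in the basal world, then optionally intervene on x1."""
    rng = np.random.default_rng(seed)
    e1 = rng.normal(size=n)          # exogenous noise shared by x1 and its basal copy
    x2 = rng.normal(size=n)          # x2 is unaffected, so it is identical in both worlds
    x1_basal = e1                    # basal (pre-intervention) value of x1
    keep = (x1_basal + x2) > 1.0     # selection depends on basal values only (assumed rule)
    # Assumed soft intervention: a new mechanism for x1 that still uses the same noise e1.
    x1 = 0.5 * e1 + 2.0 if intervene else x1_basal
    return x1[keep], x2[keep]

def slope(x: np.ndarray, y: np.ndarray) -> float:
    """Least-squares slope of y regressed on x."""
    return float(np.polyfit(x, y, 1)[0])

x1_obs, x2_obs = selected_sample(intervene=False, seed=0)
x1_int, x2_int = selected_sample(intervene=True, seed=1)

mean_gap = abs(x2_obs.mean() - x2_int.mean())   # marginal of x2 across regimes
slope_obs = slope(x1_obs, x2_obs)               # dependence of x2 on x1, observational
slope_int = slope(x1_int, x2_int)               # dependence of x2 on x1, interventional

print(f"|mean(x2) gap| = {mean_gap:.3f}")
print(f"slope obs = {slope_obs:.3f}, slope int = {slope_int:.3f}")
```

In this toy setup the marginal of `x2` stays essentially the same across regimes, while the regression of `x2` on `x1` shifts, even though `x1` has no causal effect on `x2`.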

Markov Properties and Equivalence:

The paper characterizes the Markov properties of the interventional twin graph:

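Conditional Independencies (CIs) within an intervention: If $A \perp_d B \mid C \cup S^* \cup \{\zeta\}$ in $G^{(I_k)}$, then $X_A \perp X_B \mid X_C$ in the $k$-th interventional distribution $p^{(k)}$.
Conditional Invariances across interventions: If $\zeta \perp_d B \mid C \cup S^*$ in $G^{(I_k)}$, then $p^{(k)}(X_B \mid X_C) = p^{(0)}(X_B \mid X_C)$, where $p^{(0)}$ is the observational distribution.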

It's shown that interventions can introduce additional dependencies compared to the original DAG, a contrast to scenarios without pre-intervention selection.
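These graphical statements can be tested mechanically with a d-separation check. The sketch below is a generic implementation (moralized ancestral graph plus reachability), not code from the paper. The two small graphs are assumed examples: a twin graph in which `X1` is intervened and selection depends on the basal copy `X1s` and on `X2`, and a conventional augmented DAG in which selection sits after the intervention.

```python
def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(edges, xs, ys, zs):
    """True if xs and ys are d-separated given zs in the DAG with directed edges (u, v)."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    keep = ancestors(parents, set(xs) | set(ys) | set(zs))
    # Moralize the ancestral subgraph: link each node to its parents and marry co-parents.
    nbrs = {v: set() for v in keep}
    for v in keep:
        ps = sorted(parents.get(v, ()))
        for p in ps:
            nbrs[v].add(p)
            nbrs[p].add(v)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)
    # Delete the conditioning set, then test whether any y is reachable from xs.
    blocked = set(zs)
    frontier = [x for x in xs if x not in blocked]
    seen = set(frontier)
    while frontier:
        v = frontier.pop()
        if v in ys:
            return False
        for w in nbrs[v]:
            if w not in seen and w not in blocked:
                seen.add(w)
                frontier.append(w)
    return True

# Assumed twin graph: zeta -> X1 <- E1 -> X1s -> S <- X2 (X1s is the basal copy of X1).
twin = [("zeta", "X1"), ("E1", "X1"), ("E1", "X1s"), ("X1s", "S"), ("X2", "S")]
# Assumed conventional augmented DAG with selection applied after the intervention.
conventional = [("zeta", "X1"), ("X1", "S"), ("X2", "S")]

twin_marginal_invariant = d_separated(twin, {"zeta"}, {"X2"}, {"S"})
twin_conditional_invariant = d_separated(twin, {"zeta"}, {"X2"}, {"X1", "S"})
conventional_marginal_invariant = d_separated(conventional, {"zeta"}, {"X2"}, {"S"})
print(twin_marginal_invariant, twin_conditional_invariant, conventional_marginal_invariant)
```

On these two graphs the models disagree: the twin graph implies that the marginal of `X2` is invariant while its conditional given `X1` is not, whereas the conventional graph implies that even the marginal of `X2` may change.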

To determine model identifiability, the paper defines Markov equivalence: two pairs of a DAG and its intervention targets are equivalent, $(G, \mathcal{I}) \sim (H, \mathcal{J})$, if they imply the same CIs and invariances. Maximal Ancestral Graphs (MAGs) are used to establish graphical criteria for this equivalence. The MAG $\mathcal{M}_G^{(I)}$ is constructed over the observed variables $X \cup \{\zeta\}$ from $G^{(I)}$ (treating $X^*_{aff} \cup E_{aff}$ as latent and $S^*$ as selection variables). Two pairs $(G, \mathcal{I})$ and $(H, \mathcal{J})$ are Markov equivalent if and only if, for each corresponding intervention, the MAGs of their interventional twin graphs ($\mathcal{M}_G^{(I)}$ and $\mathcal{M}_H^{(J)}$) have the same adjacencies and v-structures.
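The graphical criterion can be illustrated with a small comparison routine. The sketch below checks only the two conditions named in the summary (same adjacencies, same v-structures). The edge encoding, the helper names, and the two example graphs are assumptions made for illustration, not the paper's data structures.

```python
from itertools import combinations

def edge(a, mark_a, mark_b, b):
    """Encode one edge; a mark is '>' (arrowhead) or '-' (tail) at that endpoint."""
    return frozenset((a, b)), {a: mark_a, b: mark_b}

def v_structures(graph):
    """Collect unshielded colliders a *-> c <-* b where a and b are non-adjacent."""
    nodes = set().union(*graph.keys()) if graph else set()
    found = set()
    for c in nodes:
        # Neighbours whose edge carries an arrowhead at c.
        into_c = [next(iter(pair - {c})) for pair, marks in graph.items()
                  if c in pair and marks[c] == ">"]
        for a, b in combinations(sorted(into_c), 2):
            if frozenset((a, b)) not in graph:
                found.add((a, c, b))
    return found

def same_adjacencies_and_v_structures(g1, g2):
    """Compare skeletons and unshielded colliders of two mixed graphs."""
    return set(g1) == set(g2) and v_structures(g1) == v_structures(g2)

# Hypothetical graphs over {zeta, X1, X2}.
m1 = dict([edge("zeta", "-", ">", "X1"), edge("X1", "-", "-", "X2")])  # zeta -> X1 --- X2
m2 = dict([edge("zeta", "-", ">", "X1"), edge("X2", "-", ">", "X1")])  # zeta -> X1 <- X2

equivalent = same_adjacencies_and_v_structures(m1, m2)
print(equivalent, v_structures(m2))
```

Here the two graphs share a skeleton, but only the second contains the v-structure at `X1`, so the check reports them as not equivalent.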

Algorithm: Causal Discovery from Interventional data under potential Selection bias (CDIS)

The paper proposes the CDIS algorithm to learn causal relations and selection mechanisms up to the equivalence class from data with soft interventions, unknown targets, and potential selection bias. CDIS assumes causal sufficiency and faithfulness.

The algorithm proceeds in three main steps:

1. Maximal orientation from pure observational data ($p^{(0)}$): Run FCI on $p^{(0)}$ to get an initial PAG $P^{(0)}$.
2. Maximal orientation from interventional data: For each intervention $k$, obtain a PAG $P^{(k)}$ from the pooled data $\{p^{(0)}, p^{(k)}\}$ over $X \cup \{\zeta\}$. Adjacencies in $P^{(0)}$ must appear in $P^{(k)}$.
3. Refinement using interventional twin graph-specific criteria: This is an iterative process:
- 3.1 Orient $P^{(k)}$: Use current knowledge from $P^{(0)}$, together with properties such as the exogeneity of the intervention indicator $\zeta$, to orient edges in $P^{(k)}$, then apply the paper's extended FCI (FCI with one additional orientation rule).
- 3.2 Update $P^{(0)}$: Use information from the oriented $P^{(k)}$ (e.g., specific edge types, changes in marginal distributions) to further orient edges in $P^{(0)}$ based on the construction rules of MAGs of twin graphs.
- 3.3 Further orient $P^{(0)}$: Apply the extended FCI to $P^{(0)}$. The iteration continues until no new orientations are found for $P^{(0)}$.
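The control flow of these steps can be sketched without committing to any particular FCI implementation. The code below is a structural sketch only. It assumes that step 3.1 acts on the per-intervention PAGs and that steps 3.2 and 3.3 act on the observational PAG. The names `pool_with_indicator` and `refine_until_stable`, the mark encoding, and the toy rules are hypothetical, and the actual orientation rules are supplied by the caller.

```python
import numpy as np

def pool_with_indicator(obs: np.ndarray, intv: np.ndarray) -> np.ndarray:
    """Stack observational and interventional samples and append zeta (0 = obs, 1 = intv)."""
    zeta = np.concatenate([np.zeros(len(obs)), np.ones(len(intv))])
    return np.column_stack([np.vstack([obs, intv]), zeta])

def refine_until_stable(obs_marks, int_marks, orient_interventional,
                        update_observational, max_iter=100):
    """Alternate between the two kinds of PAGs until the observational one stops changing.

    obs_marks: set of orientation marks for the observational PAG.
    int_marks: dict mapping intervention id to a set of marks for that PAG.
    Both callables take (obs_marks, marks_k) and return a set of new marks.
    """
    obs_marks = set(obs_marks)
    int_marks = {k: set(v) for k, v in int_marks.items()}
    for _ in range(max_iter):
        before = len(obs_marks)
        for k in int_marks:
            # Step 3.1 (assumed): orient the k-th interventional PAG.
            int_marks[k] |= orient_interventional(obs_marks, int_marks[k])
            # Steps 3.2 and 3.3 (assumed): carry information back to the observational PAG.
            obs_marks |= update_observational(obs_marks, int_marks[k])
        if len(obs_marks) == before:  # no new orientations: stop
            break
    return obs_marks, int_marks

# Toy rules, for demonstration only: an arrowhead out of zeta licenses one downstream mark.
toy_orient = lambda obs, intv: {("zeta", "X1", ">")}
toy_update = lambda obs, intv: {("X1", "X2", ">")} if ("zeta", "X1", ">") in intv else set()

pooled = pool_with_indicator(np.zeros((3, 2)), np.ones((2, 2)))
final_obs, final_int = refine_until_stable(set(), {1: set()}, toy_orient, toy_update)
print(pooled.shape, final_obs, final_int)
```

The pooled array carries the indicator as its last column, which is what a constraint-based learner would need in order to test invariances as independencies involving $\zeta$.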

The CDIS algorithm is proven to be sound: the output PAG $\hat{P}$ is consistent with the true MAG $\mathcal{M}$. Specifically, $X_i \to X_j$ in $\hat{P}$ implies $X_i \in an(X_j)$ in $G$ (and $X_j$ is not ancestrally selected), and $X_i - X_j$ in $\hat{P}$ implies both $X_i$ and $X_j$ are ancestrally selected in $G$. Completeness is conjectured but not proven.

Experiments and Results:

Simulations: CDIS was compared against existing methods (GIES, IGSP, UT-IGSP, CD-NOD, JCI-GSP) on randomly generated DAGs with selection mechanisms and linear SEMs. CDIS generally outperformed baselines in precision, recall, and F1 score for '$\to$' edges, in edgemark accuracy, and in SHD. The gain was largest in precision, suggesting that other methods infer more spurious relations due to unhandled selection bias.
Real-world Applications:
- Gene Regulatory Networks (GRNs): Applied to sciPlex2 single-cell perturbation data of A549 human lung cancer cells. CDIS discovered some validated regulatory relationships (e.g., RELA $\to$ RUNX1, JUNB $\to$ MAFF) and highlighted potential spurious correlations due to selection (conditioning on a cell line).
- Educational Dataset: Analyzed data from a randomized controlled trial on college freshmen's academic achievements. Subgroup analysis by gender suggested heterogeneous treatment effects rather than selection bias based on gender (e.g., SSP improved women's performance, SFP affected men).
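The edge-level metrics named in the simulation results can be computed as follows. This is a minimal sketch over sets of directed edges. The SHD convention used here (one unit per missing, extra, or reversed edge) is an assumption, since definitions vary and the summary does not state which one the paper uses.

```python
def edge_metrics(true_edges, pred_edges):
    """Precision, recall, and F1 over directed edges given as (cause, effect) tuples."""
    tp = len(true_edges & pred_edges)
    precision = tp / len(pred_edges) if pred_edges else 0.0
    recall = tp / len(true_edges) if true_edges else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def shd(true_edges, pred_edges):
    """Structural Hamming distance: missing or extra adjacencies plus reversed edges."""
    true_skel = {frozenset(e) for e in true_edges}
    pred_skel = {frozenset(e) for e in pred_edges}
    missing_or_extra = len(true_skel ^ pred_skel)
    reversed_edges = sum(1 for (a, b) in pred_edges
                         if (b, a) in true_edges and (a, b) not in true_edges)
    return missing_or_extra + reversed_edges

# Hypothetical ground truth and prediction over three variables.
truth = {("A", "B"), ("B", "C")}
pred = {("A", "B"), ("C", "B"), ("A", "C")}

precision, recall, f1 = edge_metrics(truth, pred)
distance = shd(truth, pred)
print(precision, recall, f1, distance)
```

A spurious edge such as `A -> C`, the kind that unhandled selection bias can induce, lowers precision without affecting recall, which matches the pattern reported for the baselines.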

Conclusion and Limitations:

The paper introduces a novel framework for interventional causal discovery in the presence of selection bias where selection occurs before intervention. It proposes the interventional twin graph model, characterizes its Markov properties and equivalence, and develops the sound CDIS algorithm.

Limitations include:

The model does not yet cover post-intervention selection (e.g., loss to follow-up) and could be extended to handle it.
Developing a graphical representation for the full equivalence class is future work.
Completeness of the CDIS algorithm is conjectured but not formally proven.