
When Selection Meets Intervention: Additional Complexities in Causal Discovery

Published 10 Mar 2025 in cs.LG | (2503.07302v1)

Abstract: We address the common yet often-overlooked selection bias in interventional studies, where subjects are selectively enrolled into experiments. For instance, participants in a drug trial are usually patients of the relevant disease; A/B tests on mobile applications target existing users only, and gene perturbation studies typically focus on specific cell types, such as cancer cells. Ignoring this bias leads to incorrect causal discovery results. Even when recognized, the existing paradigm for interventional causal discovery still fails to address it. This is because subtle differences in when and where interventions happen can lead to significantly different statistical patterns. We capture this dynamic by introducing a graphical model that explicitly accounts for both the observed world (where interventions are applied) and the counterfactual world (where selection occurs while interventions have not been applied). We characterize the Markov property of the model, and propose a provably sound algorithm to identify causal relations as well as selection mechanisms up to the equivalence class, from data with soft interventions and unknown targets. Through synthetic and real-world experiments, we demonstrate that our algorithm effectively identifies true causal relations despite the presence of selection bias.

Summary

  • The paper introduces the interventional twin graph model to explicitly address selection bias when interventions occur after selection.
  • The paper presents the CDIS algorithm, which employs a three-step orientation process to accurately infer causal relations and adjust for selection effects.
  • Experimental results on simulated and real-world datasets demonstrate that CDIS reduces spurious correlations and outperforms established baseline methods.

This paper addresses the challenge of selection bias in interventional causal discovery, a common issue where subjects in experiments are not randomly sampled from the general population but are pre-selected based on certain criteria. The authors argue that ignoring this bias leads to incorrect causal conclusions, and existing methods for interventional causal discovery or observational causal discovery with selection bias are insufficient. This insufficiency arises because subtle differences in when and where interventions occur relative to the selection process can significantly alter statistical patterns.

Motivation and Problem Statement:

The paper illustrates that standard augmented DAGs, commonly used in interventional causal discovery, fail to accurately model scenarios where selection precedes intervention. Examples provided, such as a clinical trial where only patients with a disease are enrolled, demonstrate that applying interventions after selection can lead to conditional independencies and dependencies that contradict predictions from conventional augmented DAGs. This discrepancy necessitates a new graphical modeling approach.

Proposed Causal Model: Interventional Twin Graph

To address this, the paper introduces the "interventional twin graph" ($G^{(I)}$). This model explicitly accounts for both the observed world (where interventions are applied) and a counterfactual "basal" world (where selection occurs before interventions).

Key components of the interventional twin graph for an intervention target $I$:

  • $\zeta$: An exogenous binary indicator for intervention status.
  • $X$: Variables in the observed reality (observational or interventional).
  • $X^*_{aff}$: Variables in the unobserved counterfactual basal world, representing pre-intervention values for variables affected by the intervention ($de_G(I)$). Unaffected variables are represented solely by $X$.
  • $E_{aff}$: Common exogenous noise terms shared by both worlds for affected variables.
  • $S^*$: Selection status variables in the counterfactual basal world.

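Edges in $G^{(I)}$ represent:

  1. Causal effects in both worlds (edges among the observed variables $X$ and the corresponding edges among the basal variables $X^*_{aff}$).
  2. Selection edges in the counterfactual world (edges into the selection variables $S^*$).
  3. Common exogenous influences (edges from $E_{aff}$ into both the observed and the basal copy of each affected variable).
  4. Mechanism changes due to intervention (edges from $\zeta$ into the intervention targets $I$).

Crucially, all observed data (both observational and interventional) is conditioned on the selection variables $S^*$, meaning selection has already occurred in the basal world. This model helps explain why, for instance, a distribution involving one variable might change even if the intervened variable does not cause it: selection effects interact with the intervention.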
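The interaction between selection and a later intervention can be checked numerically. The sketch below is a toy simulation and is not taken from the paper: the two variables, the selection rule (`x1_basal + x2 > threshold`), and the use of a hard intervention that redraws `x1` are all assumptions made for illustration. Here `x1` never causes `x2`, yet the dependence between them differs across regimes, while the marginal of `x2` stays the same because selection acted on the basal value of `x1`.

```python
import numpy as np


def simulate(n=200_000, threshold=1.0, seed=0):
    """Toy model (hypothetical): select in the basal world, then intervene."""
    rng = np.random.default_rng(seed)
    # Basal world: x1 and x2 are independent; x1 does not cause x2.
    x1_basal = rng.normal(size=n)
    x2 = rng.normal(size=n)
    # Selection happens on basal values, before any intervention.
    selected = (x1_basal + x2) > threshold
    x1_basal, x2 = x1_basal[selected], x2[selected]
    # Among selected units, zeta marks who receives the intervention.
    zeta = rng.integers(0, 2, size=x2.size)
    # Assumed hard intervention: x1 is redrawn, ignoring its basal value.
    x1 = np.where(zeta == 1, rng.normal(size=x2.size), x1_basal)
    return zeta, x1, x2


def summarize(zeta, x1, x2):
    """Correlation of (x1, x2) and mean of x2 in each regime."""
    out = {}
    for name, z in (("obs", 0), ("int", 1)):
        mask = zeta == z
        out[name] = {
            "corr": float(np.corrcoef(x1[mask], x2[mask])[0, 1]),
            "mean_x2": float(x2[mask].mean()),
        }
    return out


stats = summarize(*simulate())
for regime, values in stats.items():
    print(regime, values)
```

In the observational regime the selected sample shows a clearly negative correlation (collider bias), whereas after the intervention the correlation is close to zero. A model that places selection after the intervention would not predict an unchanged marginal for `x2`.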

Markov Properties and Equivalence:

The paper characterizes the Markov properties of the interventional twin graph:

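  1. Conditional Independencies (CIs) within an intervention: If two sets of observed variables are d-separated given a third set together with the selection variables $S^*$ in the interventional twin graph of the $k$-th intervention, then they are conditionally independent given that third set in the $k$-th interventional distribution.
  2. Conditional Invariances across interventions: If the intervention indicator $\zeta$ is d-separated from a set of observed variables given a conditioning set together with $S^*$ in the interventional twin graph, then the conditional distribution of that set given the conditioning set is the same in the observational and the interventional data.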

It's shown that interventions can introduce additional dependencies compared to the original DAG, a contrast to scenarios without pre-intervention selection.
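These graphical statements can be explored with a small d-separation checker. The sketch below uses the standard moralized-ancestral-graph test; the two toy graphs (a twin graph with nodes `zeta`, `X1`, `X1*`, `E1`, `X2`, `S*`, and a naive augmented DAG with a selection node `S`) are hypothetical examples built from the component list above, not graphs from the paper.

```python
from collections import deque


def ancestors(edges, nodes):
    """Return the given nodes together with all of their ancestors."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(edges, xs, ys, zs):
    """True if xs and ys are d-separated given zs in the DAG `edges`."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(edges, xs | ys | zs)
    nbrs = {v: set() for v in keep}
    parents = {v: set() for v in keep}
    for u, v in edges:
        if u in keep and v in keep:
            nbrs[u].add(v)
            nbrs[v].add(u)
            parents[v].add(u)
    # Moralize: connect every pair of parents sharing a child.
    for ps in parents.values():
        for a in ps:
            nbrs[a] |= ps - {a}
    # Search the moral graph with the conditioning set removed.
    frontier = deque(xs - zs)
    seen = set(frontier)
    while frontier:
        v = frontier.popleft()
        if v in ys:
            return False
        for w in nbrs[v] - zs - seen:
            seen.add(w)
            frontier.append(w)
    return True


# Hypothetical twin graph: intervention on X1, selection on basal X1* and X2.
twin = [("E1", "X1"), ("E1", "X1*"), ("zeta", "X1"),
        ("X1*", "S*"), ("X2", "S*")]
# Naive augmented DAG that places selection on the observed X1.
naive = [("zeta", "X1"), ("X1", "S"), ("X2", "S")]

print(d_separated(twin, {"zeta"}, {"X2"}, {"S*"}))   # invariance of p(X2)
print(d_separated(twin, {"X1"}, {"X2"}, {"S*"}))     # dependence via shared noise
print(d_separated(naive, {"zeta"}, {"X2"}, {"S"}))   # naive model disagrees
```

In this toy case the twin graph separates `zeta` from `X2` given `S*`, while the naive augmented DAG leaves them connected, which is the kind of mismatch the summary attributes to ignoring when selection happens.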

To determine model identifiability, the paper defines Markov equivalence: two pairs, each consisting of a DAG and a collection of intervention targets, are equivalent if they imply the same CIs and invariances. Maximal Ancestral Graphs (MAGs) are used to establish graphical criteria for this equivalence. For each intervention, a MAG is constructed over the observed variables $X$ and $\zeta$ from the interventional twin graph (treating $X^*_{aff}$ and $E_{aff}$ as latent and $S^*$ as selection variables). Two such pairs are Markov equivalent if and only if, for each corresponding intervention, the MAGs of their interventional twin graphs have the same adjacencies and v-structures.
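The two conditions named in this criterion can be checked mechanically. The sketch below is a minimal illustration that compares only adjacencies and v-structures (unshielded colliders). The edge encoding, a dict mapping `(a, b)` to the pair of marks at `a` and at `b`, with `">"` for an arrowhead and `"-"` for a tail, is an assumption made here and not a format used by the paper.

```python
from itertools import combinations


def skeleton(graph):
    """Set of adjacent pairs, ignoring edge marks."""
    return {frozenset(pair) for pair in graph}


def v_structures(graph):
    """Unshielded colliders a *-> m <-* c with a and c non-adjacent."""
    into = {}  # node -> neighbours whose edge has an arrowhead at node
    for (a, b), (mark_a, mark_b) in graph.items():
        if mark_b == ">":
            into.setdefault(b, set()).add(a)
        if mark_a == ">":
            into.setdefault(a, set()).add(b)
    skel = skeleton(graph)
    found = set()
    for mid, sources in into.items():
        for a, c in combinations(sorted(sources), 2):
            if frozenset((a, c)) not in skel:
                found.add((frozenset((a, c)), mid))
    return found


def same_adjacencies_and_v_structures(g1, g2):
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)


# Hypothetical three-node graphs.
chain = {("X1", "X2"): ("-", ">"), ("X2", "X3"): ("-", ">")}      # X1 -> X2 -> X3
reverse = {("X1", "X2"): (">", "-"), ("X2", "X3"): (">", "-")}    # X1 <- X2 <- X3
collider = {("X1", "X2"): ("-", ">"), ("X2", "X3"): (">", "-")}   # X1 -> X2 <- X3

print(same_adjacencies_and_v_structures(chain, reverse))
print(same_adjacencies_and_v_structures(chain, collider))
```

To apply the stated criterion to two candidate models, this comparison would be repeated for each pair of corresponding interventions.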

Algorithm: Causal Discovery from Interventional data under potential Selection bias (CDIS)

The paper proposes the CDIS algorithm to learn causal relations and selection mechanisms up to the equivalence class from data with soft interventions, unknown targets, and potential selection bias. CDIS assumes causal sufficiency and faithfulness.

The algorithm proceeds in three main steps:

  1. Maximal orientation from pure observational data: Run FCI on the observational data to get an initial PAG.
  2. Maximal orientation from interventional data: For each intervention, obtain a PAG from the pooled observational and interventional data over the observed variables together with $\zeta$. Adjacencies in the observational PAG must also appear in each interventional PAG.
  3. Refinement using interventional twin graph-specific criteria: This is an iterative process:
    • 3.1 Orient the interventional PAGs: Use current knowledge from the observational PAG and properties such as the exogeneity of the intervention indicator $\zeta$ to orient edges in each interventional PAG, then apply FCI augmented with an additional orientation rule.
    • 3.2 Update the observational PAG: Use information from the oriented interventional PAGs (e.g., specific edge types, changes in marginal distributions) to further orient edges in the observational PAG, based on the construction rules of MAGs of twin graphs.
    • 3.3 Further orient the observational PAG: Apply the augmented FCI rules to the observational PAG. The iteration continues until no new orientations are found.
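The data preparation behind the second step can be sketched as follows: for each intervention, the observational and interventional samples are stacked and an indicator column for $\zeta$ is appended. The function names and the column layout (indicator last) are assumptions for illustration; the sketch stops before any graph learning, which would be handled by a constraint-based learner such as FCI.

```python
import numpy as np


def pool_with_indicator(obs, interv):
    """Stack observational and interventional rows, appending zeta as last column."""
    obs = np.asarray(obs, dtype=float)
    interv = np.asarray(interv, dtype=float)
    if obs.shape[1] != interv.shape[1]:
        raise ValueError("both datasets must cover the same variables")
    # zeta = 0 for observational rows, 1 for interventional rows.
    zeta = np.concatenate([np.zeros(len(obs)), np.ones(len(interv))])
    return np.column_stack([np.vstack([obs, interv]), zeta])


def pooled_datasets(obs, interventions):
    """One pooled dataset per intervention, each paired with the same observational data."""
    return [pool_with_indicator(obs, data) for data in interventions]


# Hypothetical data: 3 variables, one observational and two interventional samples.
rng = np.random.default_rng(0)
obs = rng.normal(size=(5, 3))
interventions = [rng.normal(size=(4, 3)), rng.normal(size=(2, 3))]
pooled = pooled_datasets(obs, interventions)
print([p.shape for p in pooled])
```

Keeping one pooled dataset per intervention mirrors the description above, where each intervention yields its own PAG.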

The CDIS algorithm is proven to be sound: the output PAG is consistent with the true MAG. Specifically, a directed edge in the output PAG implies the corresponding ancestral relation in the true DAG (and that the variable at the arrowhead is not ancestrally selected), and an undirected edge in the output PAG implies that both of its endpoints are ancestrally selected. Completeness is conjectured but not proven.

Experiments and Results:

  • Simulations: CDIS was compared against existing methods (GIES, IGSP, UT-IGSP, CD-NOD, JCI-GSP) on randomly generated DAGs with selection mechanisms and linear SEMs. CDIS generally outperformed the baselines in edge-level precision, recall, and F1 score, edge-mark accuracy, and structural Hamming distance (SHD). The gain was largest in precision, suggesting that the other methods infer more spurious relations because they do not handle selection bias.
  • Real-world Applications:
    • Gene Regulatory Networks (GRNs): Applied to sciPlex2 single-cell perturbation data of A549 human lung cancer cells. CDIS discovered some validated regulatory relationships (e.g., RELA → RUNX1, JUNB → MAFF) and highlighted potential spurious correlations due to selection (conditioning on a cell line).
    • Educational Dataset: Analyzed data from a randomized controlled trial on college freshmen's academic achievements. Subgroup analysis by gender suggested heterogeneous treatment effects rather than selection bias based on gender (e.g., SSP improved women's performance, SFP affected men).
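For reference, the edge-level metrics named in the simulation results can be computed along the following lines. This is a simplified sketch on DAG adjacency matrices (entry `[i, j]` set when `i -> j`); the paper evaluates PAGs with edge marks, so the exact definitions used there may differ. Counting a reversed edge as a single SHD error is one common convention and is an assumption here.

```python
import numpy as np


def directed_edge_scores(true_adj, pred_adj):
    """Precision, recall and F1 over directed edges."""
    t = np.asarray(true_adj, dtype=bool)
    p = np.asarray(pred_adj, dtype=bool)
    tp = int(np.sum(t & p))
    n_pred, n_true = int(p.sum()), int(t.sum())
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def shd(true_adj, pred_adj):
    """Structural Hamming distance; a reversed edge counts once."""
    t = np.asarray(true_adj, dtype=bool)
    p = np.asarray(pred_adj, dtype=bool)
    n = t.shape[0]
    distance = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Any difference in the edge between i and j is one error.
            if (t[i, j], t[j, i]) != (p[i, j], p[j, i]):
                distance += 1
    return distance


# Hypothetical example: truth 0 -> 1 -> 2, prediction 0 -> 1 <- 2 plus 0 -> 2.
true_adj = np.zeros((3, 3), dtype=int)
true_adj[0, 1] = true_adj[1, 2] = 1
pred_adj = np.zeros((3, 3), dtype=int)
pred_adj[0, 1] = pred_adj[2, 1] = pred_adj[0, 2] = 1

precision, recall, f1 = directed_edge_scores(true_adj, pred_adj)
distance = shd(true_adj, pred_adj)
print(precision, recall, f1, distance)
```

Low precision in this setting corresponds to predicted edges that are absent from the true graph, which is how spurious relations induced by selection would show up.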

Conclusion and Limitations:

The paper introduces a novel framework for interventional causal discovery in the presence of selection bias where selection occurs before intervention. It proposes the interventional twin graph model, characterizes its Markov properties and equivalence, and develops the sound CDIS algorithm.

Limitations include:

  • The model does not yet handle post-intervention selection (e.g., loss to follow-up); extending it to that case is left open.
  • Developing a graphical representation for the full equivalence class is future work.
  • Completeness of the CDIS algorithm is conjectured but not formally proven.
