- The paper introduces the interventional twin graph model to explicitly address selection bias when interventions occur after selection.
- The paper presents the CDIS algorithm, which employs a three-step orientation process to accurately infer causal relations and adjust for selection effects.
- Experimental results on simulated and real-world datasets demonstrate that CDIS reduces spurious correlations and outperforms established baseline methods.
This paper addresses the challenge of selection bias in interventional causal discovery, a common issue where subjects in experiments are not randomly sampled from the general population but are pre-selected based on certain criteria. The authors argue that ignoring this bias leads to incorrect causal conclusions, and existing methods for interventional causal discovery or observational causal discovery with selection bias are insufficient. This insufficiency arises because subtle differences in when and where interventions occur relative to the selection process can significantly alter statistical patterns.
Motivation and Problem Statement:
The paper illustrates that standard augmented DAGs, commonly used in interventional causal discovery, fail to accurately model scenarios where selection precedes intervention. Examples provided, such as a clinical trial where only patients with a disease are enrolled, demonstrate that applying interventions after selection can lead to conditional independencies and dependencies that contradict predictions from conventional augmented DAGs. This discrepancy necessitates a new graphical modeling approach.
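A small simulation can make this discrepancy concrete. The sketch below is an illustration constructed for this summary, not an example taken from the paper: the two independent variables X1 and X2, the selection rule (sum above a threshold), and the shift intervention on X1 are all assumptions. It compares the mean of X2 across intervention arms when selection acts on pre-intervention values versus post-intervention values.

```python
import random
from statistics import mean


def simulate(n=200_000, shift=2.0, threshold=1.0, seed=0):
    """Return the X2 mean gap between arms under two selection timings.

    Assumed toy model (not from the paper): X1 and X2 are independent,
    subjects are selected when X1 + X2 > threshold, and the intervention
    (zeta = 1) adds `shift` to X1.
    """
    rng = random.Random(seed)
    pre = {0: [], 1: []}   # selection applied to basal (pre-intervention) values
    post = {0: [], 1: []}  # selection applied to post-intervention values
    for _ in range(n):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # zeta is drawn independently of everything else, so assigning it
        # before or after selection gives the same selected population.
        zeta = 1 if rng.random() < 0.5 else 0
        x1_basal = e1                  # value in the basal world
        x1_obs = e1 + shift * zeta     # value actually observed
        x2 = e2                        # X2 is not affected by the intervention
        if x1_basal + x2 > threshold:
            pre[zeta].append(x2)
        if x1_obs + x2 > threshold:
            post[zeta].append(x2)
    gap_pre = mean(pre[1]) - mean(pre[0])
    gap_post = mean(post[1]) - mean(post[0])
    return gap_pre, gap_post


gap_pre, gap_post = simulate()
print(f"X2 gap, selection before intervention: {gap_pre:+.3f}")
print(f"X2 gap, selection after intervention:  {gap_post:+.3f}")
```

When selection is applied before the intervention, the X2 gap between arms should be close to zero. A model that treats selection as acting on post-intervention values predicts a clear gap instead.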
Proposed Causal Model: Interventional Twin Graph
To address this, the paper introduces the "interventional twin graph" (G(I)). This model explicitly accounts for both the observed world (where interventions are applied) and a counterfactual "basal" world (where selection occurs before interventions).
Key components of the interventional twin graph for an intervention target I:
- ζ: An exogenous binary indicator for intervention status.
- X: Variables in the observed reality (observational or interventional).
- X*_aff: Variables in the unobserved counterfactual basal world, representing pre-intervention values of the variables affected by the intervention, i.e., those indexed by de_G(I), the descendants of I in G. Unaffected variables are represented solely by X.
- E_aff: Common exogenous noise terms shared by both worlds for affected variables.
- S∗: Selection status variables in the counterfactual basal world.
Edges in G(I) represent:
- Causal effects in both worlds (e.g., X_i → X_j in the observed world and X*_i → X*_j in the basal world).
- Selection edges in the counterfactual world (e.g., X*_i → S*_j).
- Common exogenous influences (e.g., E_i → X_i and E_i → X*_i).
- Mechanism changes due to intervention (e.g., ζ → X_i for i in I).
Crucially, all observed data (both observational and interventional) are conditioned on the selection variables S*, meaning selection has already occurred in the basal world. This model helps explain why, for instance, a distribution might change across interventions even in the absence of a causal relation between the variables involved: selection effects interact with the intervention.
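The construction described above can be sketched as a small function that builds the edge set of an interventional twin graph from a DAG, its selection edges, and an intervention target. This is a reading of the summary's description rather than the authors' implementation. The node naming (`X*` for basal copies, `E_X` for noise terms, `zeta` for the indicator), the convention that the affected set includes the targets themselves, and the choice to let unaffected parents point directly into basal copies are all assumptions.

```python
def descendants(edges, targets):
    """Targets plus everything reachable from them along directed edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    found, stack = set(targets), list(targets)
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def twin_graph(causal_edges, selection_edges, targets):
    """Edge set of a twin graph, following the summary's four edge types."""
    aff = descendants(causal_edges, targets)

    def basal(v):
        # Affected variables get a separate basal copy; others are shared.
        return f"{v}*" if v in aff else v

    edges = set(causal_edges)                                   # observed world
    edges |= {(basal(p), basal(c)) for p, c in causal_edges if c in aff}
    edges |= {(basal(v), f"{s}*") for v, s in selection_edges}  # selection
    for v in aff:                                               # shared noise
        edges |= {(f"E_{v}", v), (f"E_{v}", f"{v}*")}
    edges |= {("zeta", v) for v in targets}                     # mechanism change
    return edges


# Chain X1 -> X2 -> X3 with selection on X3 and an intervention on X2.
example = twin_graph(
    causal_edges={("X1", "X2"), ("X2", "X3")},
    selection_edges={("X3", "S")},
    targets={"X2"},
)
for edge in sorted(example):
    print(edge)
```

In this example X1 is unaffected, so it appears once and points into both the observed X2 and its basal copy.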
Markov Properties and Equivalence:
The paper characterizes the Markov properties of the interventional twin graph:
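- Conditional Independencies (CIs) within an intervention: If a d-separation among observed variables, given the selection variables S*, holds in the interventional twin graph of an intervention, then the corresponding conditional independence holds in the corresponding interventional distribution.
- Conditional Invariances across interventions: If the intervention indicator ζ is d-separated from a set of observed variables given a conditioning set and S* in the interventional twin graph, then the corresponding conditional distribution is invariant between the observational and interventional data.
The paper shows that interventions can introduce additional dependencies compared to the original DAG, in contrast to scenarios without pre-intervention selection.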
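Both properties rest on d-separation queries in the twin graph, which can be checked mechanically. The sketch below uses the standard ancestral moral graph test. The two example graphs are assumptions made for illustration: a twin graph for independent X1 and X2 that are both selected on, with an intervention on X1, and a naive augmented DAG that attaches selection to the post-intervention variables.

```python
from itertools import combinations


def d_separated(edges, xs, ys, zs):
    """True if xs and ys are d-separated given zs in the DAG `edges`."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    # 1. Restrict to the queried nodes and their ancestors.
    keep, stack = set(xs | ys | zs), list(xs | ys | zs)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in keep:
                keep.add(p)
                stack.append(p)
    # 2. Moralize: link each child to its parents and marry co-parents.
    nbrs = {node: set() for node in keep}
    for child in keep:
        ps = sorted(parents.get(child, ()))
        for p in ps:
            nbrs[p].add(child)
            nbrs[child].add(p)
        for a, b in combinations(ps, 2):
            nbrs[a].add(b)
            nbrs[b].add(a)
    # 3. Delete the conditioning set and test reachability.
    seen, stack = set(xs - zs), list(xs - zs)
    while stack:
        node = stack.pop()
        if node in ys:
            return False
        for m in nbrs[node] - zs - seen:
            seen.add(m)
            stack.append(m)
    return True


# Assumed twin graph: selection acts on the basal copy X1* and on X2.
twin = {("E1", "X1"), ("E1", "X1*"), ("X1*", "S*"), ("X2", "S*"), ("zeta", "X1")}
# Assumed naive model: selection acts on the post-intervention X1.
naive = {("zeta", "X1"), ("X1", "S"), ("X2", "S")}

print(d_separated(twin, {"zeta"}, {"X2"}, {"S*"}))
print(d_separated(naive, {"zeta"}, {"X2"}, {"S"}))
print(d_separated(twin, {"X1"}, {"X2"}, {"S*"}))
```

In the twin graph the indicator is separated from X2 given selection, so the distribution of X2 is invariant, while the naive graph predicts a change. X1 and X2 remain dependent under selection in the twin graph through the shared noise term.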
To determine model identifiability, the paper defines Markov equivalence: two pairs, each consisting of a DAG and its intervention targets, are Markov equivalent if they imply the same CIs and invariances. Maximal Ancestral Graphs (MAGs) are used to establish graphical criteria for this equivalence. For each intervention, a MAG is constructed over the observed variables from the interventional twin graph, treating the counterfactual variables X*_aff and the noise terms E_aff as latent and S* as selection variables. Two such pairs are Markov equivalent if and only if, for each corresponding intervention, the MAGs of their interventional twin graphs have the same adjacencies and v-structures.
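The graphical criterion as stated here (same adjacencies and same v-structures, intervention by intervention) is simple to check once the MAGs are available. The sketch below assumes a hypothetical edge encoding `(u, v, mark_at_u, mark_at_v)` with `">"` for an arrowhead and `"-"` for a tail. It does not construct the MAGs from twin graphs, and it implements only the criterion quoted in this summary.

```python
from itertools import combinations


def skeleton(edges):
    """Set of unordered adjacent pairs."""
    return {frozenset((u, v)) for u, v, _, _ in edges}


def v_structures(edges):
    """Unshielded colliders (a, c, b): arrowheads into c, a and b non-adjacent."""
    skel = skeleton(edges)
    into = {}  # node -> neighbours whose edge has an arrowhead at that node
    for u, v, mark_u, mark_v in edges:
        if mark_v == ">":
            into.setdefault(v, set()).add(u)
        if mark_u == ">":
            into.setdefault(u, set()).add(v)
    found = set()
    for c, sources in into.items():
        for a, b in combinations(sorted(sources), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, c, b))
    return found


def same_criterion(mags_1, mags_2):
    """Compare two lists of MAGs, one MAG per corresponding intervention."""
    return len(mags_1) == len(mags_2) and all(
        skeleton(m1) == skeleton(m2) and v_structures(m1) == v_structures(m2)
        for m1, m2 in zip(mags_1, mags_2)
    )


chain = [("a", "b", "-", ">"), ("b", "c", "-", ">")]      # a -> b -> c
reversed_chain = [("a", "b", ">", "-"), ("b", "c", ">", "-")]  # a <- b <- c
collider = [("a", "b", "-", ">"), ("b", "c", ">", "-")]   # a -> b <- c

print(same_criterion([chain], [reversed_chain]))
print(same_criterion([chain], [collider]))
```

The two chains share a skeleton and have no v-structure, so they pass the check; the collider differs from both.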
Algorithm: Causal Discovery from Interventional data under potential Selection bias (CDIS)
The paper proposes the CDIS algorithm to learn causal relations and selection mechanisms up to the equivalence class from data with soft interventions, unknown targets, and potential selection bias. CDIS assumes causal sufficiency and faithfulness.
The algorithm proceeds in three main steps:
- Maximal orientation from pure observational data: Run FCI on the observational data to get an initial PAG.
- Maximal orientation from interventional data: For each intervention, obtain a PAG from the pooled data, in which the intervention indicator ζ is included as a variable. The adjacencies of this PAG and of the observational PAG are required to be consistent with each other.
- Refinement using interventional twin graph-specific criteria: This is an iterative process:
  - 3.1 Orient the interventional PAGs: Use current knowledge from the observational PAG and properties such as the exogeneity of the intervention indicator ζ to orient edges, then apply an extended version of the FCI orientation rules (FCI with one additional orientation rule).
  - 3.2 Update the observational PAG: Use information from the oriented interventional PAGs (e.g., specific edge types, changes in marginal distributions) to further orient edges, based on the construction rules of MAGs of twin graphs.
  - 3.3 Further orient: Apply the extended FCI rules to the updated PAG.
The iteration continues until no new orientations are found.
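The control flow of the refinement step is a fixed-point iteration: apply the orientation rules repeatedly and stop once a full pass adds nothing new. The sketch below shows only that loop. The representation of orientations as tuples and the rule signature are hypothetical, and the toy rule is a stand-in, not one of the CDIS or FCI rules.

```python
from typing import Callable, Iterable, Set, Tuple

Orientation = Tuple[str, str]                         # hypothetical: "u before v"
Rule = Callable[[Set[Orientation]], Set[Orientation]]  # returns implied facts


def refine_until_stable(
    initial: Iterable[Orientation], rules: Iterable[Rule], max_rounds: int = 100
) -> Set[Orientation]:
    """Apply every rule in turn until a full pass yields no new orientation."""
    known = set(initial)
    rules = list(rules)
    for _ in range(max_rounds):
        before = len(known)
        for rule in rules:
            known |= rule(known)
        if len(known) == before:  # nothing new in this pass: fixed point
            break
    return known


def toy_transitivity(known: Set[Orientation]) -> Set[Orientation]:
    """Stand-in rule: from (a, b) and (b, c) conclude (a, c)."""
    return {(a, d) for a, b in known for c, d in known if b == c}


result = refine_until_stable({("A", "B"), ("B", "C"), ("C", "D")}, [toy_transitivity])
print(sorted(result))
```

Because the set of orientations only grows and is bounded by the number of edge marks, a loop of this shape terminates; `max_rounds` is an extra safeguard.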
The CDIS algorithm is proven to be sound: the output PAG is consistent with the true MAG. Specifically, a directed edge in the output implies the corresponding ancestral relation in the true graph (with the variable at the arrowhead not ancestrally selected), and an undirected edge implies that both of its endpoints are ancestrally selected. Completeness is conjectured but not proven.
Experiments and Results:
- Simulations: CDIS was compared against existing methods (GIES, IGSP, UT-IGSP, CD-NOD, JCI-GSP) on randomly generated DAGs with selection mechanisms and linear SEMs. CDIS generally outperformed the baselines in edge-level precision, recall, and F1 score, edge-mark accuracy, and SHD. The gain was largest in precision, suggesting that the other methods infer more spurious relations because they do not handle selection bias.
- Real-world Applications:
  - Gene Regulatory Networks (GRNs): Applied to sciPlex2 single-cell perturbation data of A549 human lung cancer cells. CDIS discovered some validated regulatory relationships (e.g., between RELA and RUNX1, and between JUNB and MAFF) and highlighted potential spurious correlations due to selection (conditioning on a cell line).
  - Educational Dataset: Analyzed data from a randomized controlled trial on college freshmen's academic achievements. Subgroup analysis by gender suggested heterogeneous treatment effects rather than selection bias based on gender (e.g., SSP improved women's performance, SFP affected men).
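For readers unfamiliar with the simulation metrics, the sketch below computes precision, recall, F1, and a structural Hamming distance (SHD) on sets of directed edges. It is a simplified illustration: the paper evaluates PAG outputs with edge marks, whereas this version assumes plain directed edges and uses one common SHD convention (a reversed edge counts once).

```python
def edge_metrics(true_edges, est_edges):
    """Precision, recall and F1 of estimated directed edges."""
    true_edges, est_edges = set(true_edges), set(est_edges)
    tp = len(true_edges & est_edges)
    precision = tp / len(est_edges) if est_edges else 0.0
    recall = tp / len(true_edges) if true_edges else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def shd(true_edges, est_edges):
    """Missing adjacencies + extra adjacencies + wrongly oriented edges."""
    true_edges, est_edges = set(true_edges), set(est_edges)
    true_skel = {frozenset(e) for e in true_edges}
    est_skel = {frozenset(e) for e in est_edges}
    missing_or_extra = len(true_skel ^ est_skel)
    misoriented = sum(
        1 for e in true_edges if frozenset(e) in est_skel and e not in est_edges
    )
    return missing_or_extra + misoriented


truth = {("A", "B"), ("B", "C"), ("C", "D")}
estimate = {("A", "B"), ("C", "B"), ("A", "D")}

precision, recall, f1 = edge_metrics(truth, estimate)
distance = shd(truth, estimate)
print(precision, recall, f1, distance)
```

In this toy comparison one edge is correct, one is reversed, one is missing, and one is spurious. A spurious edge of the kind selection bias produces lowers precision without affecting recall.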
Conclusion and Limitations:
The paper introduces a novel framework for interventional causal discovery in the presence of selection bias where selection occurs before intervention. It proposes the interventional twin graph model, characterizes its Markov properties and equivalence, and develops the sound CDIS algorithm.
Limitations include:
- The model could be extended to handle post-intervention selection (e.g., loss to follow-up), which it does not currently cover.
- Developing a graphical representation for the full equivalence class is future work.
- Completeness of the CDIS algorithm is conjectured but not formally proven.