Relational Causal Discovery (RCD) Algorithm
- Relational causal discovery is a constraint-based algorithm that identifies causal relationships in multi-entity, multi-relational data using conditional independence tests.
- It uses σ-separation and acyclification to handle cyclic feedback, and further extensions to handle latent confounders; under faithfulness assumptions, graph recovery is sound and complete.
- The two-phase approach (skeleton discovery followed by edge orientation) reports high skeleton precision and recall on synthetic data and recovers interpretable dependencies on real-world data.
Relational causal discovery (RCD) is a constraint-based algorithmic framework designed to recover causal graphical structure—including the orientation of directed dependencies—within relational data. Unlike standard i.i.d. causal discovery algorithms, RCD exploits the multi-entity, multi-relational structure of data, enabling causal inference in settings such as multi-table databases, networks, or any domain best described by a relational schema. RCD is both sound and complete under faithfulness assumptions, and generalizes to cyclic (i.e., feedback) models via relational acyclification and σ-separation, as well as to settings with latent confounders through further algorithmic extensions.
1. Relational Causal Models and Conditional Independence
A relational causal model (RCM) is defined over a schema $\mathcal{S} = (\mathcal{E}, \mathcal{R}, \mathcal{A}, \mathrm{card})$, with entity classes $\mathcal{E}$, relationship classes $\mathcal{R}$, attribute classes $\mathcal{A}$, and cardinality constraints $\mathrm{card}$ specifying participation counts. The model specifies a set of canonical relational dependencies,
$$P.X \to Q.Y,$$
where $P$, $Q$ are relational paths (i.e., joins or traversals through the schema) of length up to a user-specified hop threshold $h$, and $X$, $Y$ are attributes.
For a concrete relational skeleton, which instantiates entity and relationship objects, the model is grounded into a ground graph over all instantiated attributes; this graph may contain cycles (feedback loops) when attributes influence each other. Relational d-separation, and its generalization σ-separation for feedback models, define conditional independence relations that lift over all possible ground graphs. These "abstract ground graphs" (AGGs for acyclic models, σ-AGGs for cyclic ones) serve as the substrate for lifted CI reasoning (Maier et al., 2013, Ahsan et al., 2022).
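To make grounding concrete, the following is a minimal Python sketch, not the authors' implementation. The toy skeleton (products developed by employees), the attribute names `Skill` and `Success`, and the helpers `terminal_set` and `ground` are all hypothetical. The traversal is a plain join along the path and ignores the finer path semantics used in the RCM literature.

```python
# Hypothetical skeleton: which employees developed which product.
developed_by = {
    "laptop": {"alice", "bob"},
    "tablet": {"bob"},
}

def terminal_set(base_item, hops):
    """Follow a relational path from a base item, one link table per hop."""
    frontier = {base_item}
    for links in hops:
        frontier = {nbr for item in frontier for nbr in links.get(item, set())}
    return frontier

def ground(hops, cause_attr, effect_attr, base_items):
    """Instantiate one relational dependency as edges of the ground graph."""
    edges = set()
    for item in base_items:
        for source in terminal_set(item, hops):
            # One ground edge per reachable source item.
            edges.add(((cause_attr, source), (effect_attr, item)))
    return edges

# Ground the illustrative dependency "employee Skill influences product Success".
ground_graph = ground([developed_by], "Skill", "Success", ["laptop", "tablet"])
for cause, effect in sorted(ground_graph):
    print(cause, "->", effect)
```

A single dependency yields one ground edge per related pair of objects, which is why the ground graph depends on the skeleton while the dependency itself does not.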
2. The RCD Algorithmic Framework
RCD operates in two primary phases: skeleton discovery and edge orientation, closely mirroring the structure of the PC algorithm but leveraging the relational structure.
- Skeleton Discovery phase:
  - Enumerate all candidate dependencies (relational variables up to the hop threshold $h$).
  - For each pair of relational variables, iteratively test conditional independence given subsets of their neighbors, up to a maximum conditioning-set size, using relational d-separation (or σ-separation for cyclic models); remove the edge when independence is detected.
- Edge Orientation phase:
  - Identify all unshielded triples, i.e., triples $A - B - C$ in which $A$ and $C$ are not adjacent.
  - Apply collider detection with separating sets.
  - Use relational-specific rules such as Relational Bivariate Orientation (RBO), Known Non-Colliders (KNC), Cycle Avoidance (CA), and the standard Meek propagation rules to orient as many edges as possible.
For cyclic RCMs, RCD replaces d-separation with σ-separation, and expands the set of candidate variables and conditioning sets to those necessary for any relational acyclification of the σ-AGG, with hop limit sufficient to "unroll" cycles (Ahsan et al., 2022).
High-level Pseudocode for RCD on cyclic RCMs
Let $h$ be the original hop limit, and suppose a maximum expected cycle length is given. The working hop limit is then extended, as a function of both quantities, to ensure coverage of possible cycles.
(Ahsan et al., 2022, Maier et al., 2013)
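The two phases can be sketched in Python as below. This is a generic PC-style outline under stated assumptions, not the published RCD pseudocode. The function names and the `ci_test` callable are hypothetical stand-ins: for acyclic models `ci_test` would answer relational d-separation queries, and for cyclic models σ-separation queries over the extended hop limit. The relational rules (RBO, KNC, CA) and Meek propagation are omitted, and plain labels stand in for relational variables.

```python
from itertools import combinations

def skeleton_phase(variables, ci_test, max_cond_size):
    """PC-style adjacency search: drop an edge once a separating set is found."""
    adj = {v: set(variables) - {v} for v in variables}
    sepset = {}
    for size in range(max_cond_size + 1):
        for a in variables:
            for b in sorted(adj[a]):
                # Condition only on current neighbours of a, excluding b.
                candidates = sorted(adj[a] - {b})
                for cond in combinations(candidates, size):
                    if ci_test(a, b, set(cond)):
                        adj[a].discard(b)
                        adj[b].discard(a)
                        sepset[frozenset((a, b))] = set(cond)
                        break
    return adj, sepset

def orient_colliders(adj, sepset):
    """Orient a -> c <- b for unshielded triples whose sepset excludes c."""
    arrows = set()
    for c in adj:
        for a, b in combinations(sorted(adj[c]), 2):
            unshielded = b not in adj[a]
            if unshielded and c not in sepset.get(frozenset((a, b)), set()):
                arrows.add((a, c))
                arrows.add((b, c))
    return arrows

# Toy oracle (assumed ground truth X -> Z <- Y): only X and Y are independent,
# and only given the empty set.
known = {frozenset(("X", "Y")): [set()]}

def oracle(a, b, cond):
    return cond in known.get(frozenset((a, b)), [])

adj, sepset = skeleton_phase(["X", "Y", "Z"], oracle, max_cond_size=1)
arrows = orient_colliders(adj, sepset)
print(adj, arrows)
```

Passing the independence test in as a callable keeps the search logic unchanged whether the answers come from a separation oracle or from a statistical test on data.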
3. Relational σ-Separation, Acyclification, and Generalization to Cycles
For RCMs whose lifted ground graphs contain cycles (i.e., mutual causal influence), d-separation does not suffice; instead, σ-separation is needed to correctly read off conditional independence relationships (Ahsan et al., 2022). Each σ-AGG may be "acyclified" by orienting cycle edges so as to produce an acyclic graph with equivalent σ-independence structure. The hop threshold must be chosen large enough that the "unrolled" cycles are covered, so the required threshold grows with the maximum cycle length.
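As an illustration of acyclification on an ordinary directed graph, here is a minimal Python sketch. It assumes one construction commonly described in the σ-separation literature: every parent of a strongly connected component becomes a parent of all its members, and the edges inside the component are replaced by an arbitrary acyclic orientation (alphabetical order here). The function names are hypothetical, and the sketch works on plain graphs rather than σ-AGGs.

```python
def reachable(graph, start):
    """All nodes reachable from start (including start) by directed paths."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def strongly_connected(graph):
    """Map each node to the frozenset of nodes in its strongly connected component."""
    nodes = set(graph) | {c for children in graph.values() for c in children}
    reach = {v: reachable(graph, v) for v in nodes}
    return {v: frozenset(u for u in nodes if u in reach[v] and v in reach[u])
            for v in nodes}

def acyclify(graph):
    """Return the edge set of one acyclification of a directed graph."""
    sc = strongly_connected(graph)
    edges = set()
    for parent, children in graph.items():
        for child in children:
            if sc[parent] != sc[child]:
                # A parent of any cycle member points to every member.
                for member in sc[child]:
                    edges.add((parent, member))
    for component in set(sc.values()):
        ordered = sorted(component)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                # Arbitrary acyclic orientation inside the component.
                edges.add((a, b))
    return edges

# Feedback loop B <-> C with an outside cause A and an outside effect D.
cyclic = {"A": ["B"], "B": ["C"], "C": ["B", "D"]}
print(sorted(acyclify(cyclic)))
```

In this example the loop between B and C collapses to a single oriented edge, and A gains an edge into C because it is a parent of the component.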
A key result is that, under σ-faithfulness and causal sufficiency, running RCD with σ-separation and acyclification yields a sound and complete recovery of the Markov equivalence class of the true relational model, even in the presence of cycles:
- Soundness: Every adjacency and orientation in the output appears in the ground graph.
- Arrowhead-Completeness: All non-ancestor oriented edges are recovered.
- Tail-Completeness: Every ancestor relation is marked as such.
- Markov-Completeness: Two σ-AGGs are σ-Markov equivalent iff RCD produces the same PAG.
Proofs follow by reduction to standard PC/FCI arguments via the acyclification mapping (Ahsan et al., 2022).
4. Handling Latent Confounding and Extensions
Classical RCD requires causal sufficiency; the presence of unobserved (latent) confounders violates this assumption. Extensions such as RelFCI generalize RCD by operating over partial ancestral abstract ground graphs (PAAGGs) and applying PC/FCI rules (CD, RBO, KNC, CA, MR3, FCI R4–R10) to propagate uncertainty marks, with relational variables bounded by a hop threshold $h$. Under d-faithfulness and "no latent descendant" assumptions, this framework is sound and complete for recovering the Markov equivalence class of latent relational causal models (Piras et al., 2 Jul 2025). Empirically, RelFCI outperforms RCD in the presence of latents and matches RCD's performance otherwise.
5. Algorithmic and Statistical Properties
RCD's correctness is rooted in the relational d-separation theorem, ensuring that conditional independence read off the AGG (or σ-AGG) implies CI at the ground instance level (Maier et al., 2013). The algorithm is:
- Sound and Complete (under causal sufficiency, faithfulness, and adequate hop and conditioning bounds):
  - Returns the exact Markov equivalence class of the true relational (possibly cyclic) model.
  - Orients all compelled edges and identifies all ancestral relations.
  - For RCD, acyclicity of the true model is not required if acyclification is performed.
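- Complexity:
  - Skeleton phase: at most one CI test per pair of relational variables and candidate conditioning set, so the number of tests is polynomial in the number of relational variables when the maximal adjacency is bounded, and exponential in the maximal adjacency in the worst case.
  - Orientation phase: rules are applied iteratively, with the total number of iterations bounded by the number of dependencies.
  - In practice, feasible for moderate schemas with small hop and conditioning bounds.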
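A rough worst-case count of skeleton-phase CI tests can be computed as follows. This is a back-of-the-envelope sketch, not a bound taken from the cited papers. It assumes one test per unordered pair of variables and per conditioning set drawn from at most `max_adjacency` neighbors, and the function name is hypothetical.

```python
from math import comb

def max_ci_tests(num_variables, max_adjacency, max_cond_size):
    """Upper bound: pairs of variables times conditioning sets per pair."""
    pairs = comb(num_variables, 2)
    # Conditioning sets of size 0..max_cond_size from max_adjacency neighbours.
    sets_per_pair = sum(comb(max_adjacency, k) for k in range(max_cond_size + 1))
    return pairs * sets_per_pair

print(max_ci_tests(10, 4, 2))   # 45 pairs * (1 + 4 + 6) sets = 495
print(max_ci_tests(10, 4, 4))   # 45 pairs * 2**4 sets = 720
```

With the conditioning size unbounded the per-pair factor reaches 2 to the power of the adjacency, which is where the exponential worst case comes from.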
Robust RCD uses kernel-based CI tests (e.g., HSIC, SDCIT) adapted to multisets and flattenings, mitigating the lack of i.i.d. observations in relational data. For RCMC-related queries, theoretical control of Type I error is attained (Lee et al., 2019).
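To show what a kernel-based independence test involves, the sketch below implements a plain HSIC permutation test in pure Python. It is deliberately simplified: it tests unconditional independence between two scalar samples and does not include the multiset or flattening adaptations used by robust RCD. The Gaussian kernel, the fixed bandwidth of 1.0, the biased estimator, and all function names are assumptions made for illustration.

```python
import math
import random

def gaussian_gram(values, bandwidth):
    """Gram matrix of a Gaussian kernel on scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2)) for b in values]
            for a in values]

def center(gram):
    """Double-center a symmetric Gram matrix (H K H)."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    total_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)] for i in range(n)]

def hsic(kc, lc):
    """Biased HSIC estimate from two centered Gram matrices."""
    n = len(kc)
    return sum(kc[i][j] * lc[i][j] for i in range(n) for j in range(n)) / n ** 2

def hsic_permutation_test(x, y, num_permutations=200, bandwidth=1.0, seed=0):
    """Return (statistic, p-value); small p-values suggest dependence."""
    rng = random.Random(seed)
    kc = center(gaussian_gram(x, bandwidth))
    lc = center(gaussian_gram(y, bandwidth))
    observed = hsic(kc, lc)
    n = len(x)
    exceed = 0
    for _ in range(num_permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        # Permuting y breaks any dependence while keeping both marginals.
        permuted = [[lc[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if hsic(kc, permuted) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + num_permutations)

# Nonlinear dependence that a correlation test could miss.
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(40)]
y = [xi * xi + 0.1 * data_rng.gauss(0, 1) for xi in x]
statistic, p_value = hsic_permutation_test(x, y)
print(statistic, p_value)
```

The permutation null avoids distributional assumptions on the statistic, at the cost of recomputing it once per permutation.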
6. Empirical Validation and Benchmark Results
RCD, σ-RCD, and RelFCI have been evaluated on synthetic and real-world data:
- Synthetic Data:
  - RCD and its robust variants achieve skeleton precision/recall ≈1.0; orientation precision significantly exceeds naive baselines.
  - σ-RCD (with σ-separation and acyclification) attains near-perfect F1 on ancestor relations and correctly flags candidate cycle-edges. d-RCD (using naive d-separation in cyclic models) exhibits orientation errors in feedback-rich domains (Ahsan et al., 2022).
- Latent Confounders:
  - RelFCI outperforms standard RCD (precision ≈ 0.97–0.99, recall ≈ 0.92–0.98 vs RCD's ≈ 0.80–0.85) in the presence of latents (Piras et al., 2 Jul 2025).
- Real-World Data:
  - On MovieLens+ (large-scale user-movie-critic dataset), RCD and σ-RCD recover interpretable dependencies, including plausible feedback (cycles) between critic and user ratings, as well as correct orientation of external-cause edges such as Budget → Gross Income (Ahsan et al., 2022, Maier et al., 2013).
- Statistical Properties:
  - On synthetic data, robust RCD achieves high orientation accuracy, with precision ≈93.5% and recall ≈75.4% for oriented CPRCMs at moderate sample sizes. Relational aggregation boosts recall with minimal impact on precision (Lee et al., 2019).
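Precision and recall figures of this kind compare a learned edge set against a known ground truth. A minimal Python sketch of that computation follows; the function name and the toy edge sets are hypothetical and are not data from the cited experiments.

```python
def precision_recall(true_edges, learned_edges):
    """Precision and recall of learned edges against ground-truth edges."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    hits = len(true_edges & learned_edges)
    precision = hits / len(learned_edges) if learned_edges else 1.0
    recall = hits / len(true_edges) if true_edges else 1.0
    return precision, recall

def undirected(edges):
    """Drop orientation so that only adjacencies (the skeleton) are compared."""
    return {frozenset(edge) for edge in edges}

truth = {("A", "B"), ("B", "C"), ("C", "D")}
learned = {("A", "B"), ("B", "C"), ("D", "C")}   # last edge has the wrong direction

orientation_scores = precision_recall(truth, learned)
skeleton_scores = precision_recall(undirected(truth), undirected(learned))
print(orientation_scores, skeleton_scores)
```

Scoring the skeleton and the orientations separately shows how an algorithm can recover every adjacency while still mis-orienting some edges.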
7. Practical Considerations, Limitations, and Directions
RCD requires sufficiently large samples per entity class for reliable CI testing; kernel-based and aggregation-enhanced methods improve robustness to finite-sample effects. Computational costs grow polynomially with schema complexity for small hop/conditioning bounds, but can become prohibitive for dense, high-hop schemas.
Key open challenges include:
- Automated selection of aggregation functions for multiset-valued relational variables.
- Development of more powerful CI tests specifically tailored to relational multisets.
- Further generalization to handle dynamic relational skeletons, latent relationships, and transportability across distributions (Lee et al., 2019).
RCD, in its classic, cyclic, robust, and latent-augmented forms, remains the leading rigorously justified framework for constraint-based causal discovery in relational domains (Maier et al., 2013, Ahsan et al., 2022, Piras et al., 2 Jul 2025, Lee et al., 2019).