Causal Adjacency Learning (CAL)
- Causal Adjacency Learning (CAL) is a framework that learns a DAG's binary adjacency matrix from data using statistical models and smooth relaxations.
- It employs continuous masking, differentiable acyclicity constraints, and sparsity penalties to enable efficient, gradient-based estimation of causal structures.
- Empirical evaluations demonstrate CAL's superior performance in reducing structural errors and boosting true positive rates compared to methods like NOTEARS.
Causal Adjacency Learning (CAL) is the process of learning the edge structure—that is, the adjacency matrix—of a causal graph, typically a directed acyclic graph (DAG), from observational or interventional data. This problem is foundational for causal discovery, as the adjacency encodes the direct causal influences between variables. Research in CAL combines statistical theory, differentiable optimization, graphical modeling, and advances in computational methodology to recover reliable causal structures from data, even in high-dimensional or challenging regimes.
1. Formal Definitions and Structural Framework
In CAL, the aim is to recover the binary adjacency matrix of a DAG given data generated by a structural equation model (SEM). In the standard additive-noise SEM form,

$$x_j = f_j(x_{\mathrm{pa}(j)}) + \epsilon_j, \qquad j = 1, \dots, d,$$

where $\mathrm{pa}(j)$ indexes the parents of node $j$ in the DAG, the $\epsilon_j$ are independent noise variables, and the $f_j$ are non-constant in each argument to enforce causal minimality. The adjacency matrix has $A_{ij} = 1$ if $x_i$ is a parent of $x_j$, and $A_{ij} = 0$ otherwise. This adjacency may be binary or, in weighted variants, may be real-valued with its support indicating the graph structure (Ng et al., 2019).
Learning the adjacency amounts to inferring, for each ordered pair $(i, j)$, whether there is a direct causal edge $x_i \to x_j$, consistent with observed conditional independencies or the statistical properties of the data distribution.
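As a concrete illustration of the setup above, the following numpy sketch samples data from a hypothetical three-node additive-noise SEM; the chain graph, the `tanh` mechanism, and the noise scale are illustrative choices, not specifics from the cited work:

```python
import numpy as np

# Hypothetical 3-node chain DAG: x0 -> x1 -> x2.
# A[i, j] = 1 means x_i is a parent of x_j.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])

def sample_sem(A, n, rng):
    """Sample n observations from a nonlinear additive-noise SEM on DAG A."""
    d = A.shape[0]
    X = np.zeros((n, d))
    # For an upper-triangular A, plain index order is a topological order.
    for j in range(d):
        parents = np.flatnonzero(A[:, j])
        # Illustrative nonlinear mechanism f_j; roots get noise only.
        f_j = np.tanh(X[:, parents].sum(axis=1)) if parents.size else 0.0
        X[:, j] = f_j + rng.normal(scale=0.1, size=n)  # additive noise eps_j
    return X

X = sample_sem(A, n=1000, rng=np.random.default_rng(0))
print(X.shape)  # (1000, 3)
```

CAL receives only the sample matrix `X` and must infer the support of `A`.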
2. Optimization and Smooth Relaxations
Direct optimization over discrete combinatorial objects (the entries of $A \in \{0,1\}^{d \times d}$) is intractable for even moderately sized $d$. CAL approaches thus rely on continuous relaxations and penalized objectives:
- Continuous Masking: The binary adjacency matrix is parameterized via a continuous surrogate, such as $A = \sigma(U / \tau)$, where $U \in \mathbb{R}^{d \times d}$ is an unconstrained "logit" matrix, $\sigma$ is the elementwise sigmoid, and $\tau > 0$ is a temperature (Ng et al., 2019). Small $\tau$ ensures that $\sigma(U / \tau)$ concentrates near $\{0, 1\}$.
- Acyclicity Constraints: To ensure the learned structure is a DAG, a smooth functional such as $h(A) = \operatorname{tr}\!\left(e^{A \circ A}\right) - d$ is employed, where $\circ$ denotes the elementwise (Hadamard) product. This differentiable constraint is zero if and only if $A$ corresponds to an acyclic graph (Ng et al., 2019). Penalties or augmented Lagrangian terms enforce (or nudge) $h(A) = 0$ during training.
- Sparsity Induction: An $\ell_1$ penalty $\lambda \|A\|_1$ promotes graph sparsity. Tuning $\lambda$ controls the trade-off between data fit and model complexity.
- Augmented Lagrangian: Optimization proceeds by alternately updating the mask logits $U$ and the SEM parameters, the Lagrange multipliers, and the acyclicity penalty weight (Ng et al., 2019). After convergence, the mean mask matrix is thresholded to yield a discrete, guaranteed-acyclic adjacency estimate.
This relaxation–thresholding paradigm enables efficient, gradient-based learning and supports a variety of smooth SEM function classes.
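The three ingredients of this relaxation can be sketched in numpy as follows; the truncated power series stands in for a full matrix exponential, and the logit matrix `U` is an illustrative example rather than output of any fitted model:

```python
import numpy as np

def truncated_expm(M, terms=20):
    """Matrix exponential via truncated power series (adequate for small M)."""
    d = M.shape[0]
    out, term = np.eye(d), np.eye(d)
    for k in range(1, terms):
        term = term @ M / k
        out += term
    return out

def acyclicity(A):
    """NOTEARS-style constraint h(A) = tr(exp(A o A)) - d; zero iff A is a DAG."""
    d = A.shape[0]
    return np.trace(truncated_expm(A * A)) - d

def soft_mask(U, tau=0.2):
    """Continuous surrogate for a binary adjacency: elementwise sigmoid of U/tau."""
    return 1.0 / (1.0 + np.exp(-U / tau))

U = np.array([[-5.0,  4.0],
              [-5.0, -5.0]])      # logits favoring a single edge 0 -> 1
A = soft_mask(U)                  # concentrates near {0, 1} for small tau
h = acyclicity(A)                 # ~0: the soft graph is (nearly) acyclic
l1 = 0.1 * np.abs(A).sum()        # sparsity penalty with lambda = 0.1
```

A full method would minimize the SEM's data-fit loss plus `l1`, with `h` driven to zero by an augmented Lagrangian, and finally threshold `A`.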
3. Identifiability and Statistical Guarantees
The identifiability of the causal adjacency from observed data depends critically on assumptions about the SEM and noise.
- ANM Identifiability: Under the "restricted additive noise model" (ANM) assumptions—independent noises with strictly positive density and nondegenerate (nonlinear) $f_j$—the true DAG is identifiable from the joint distribution (Ng et al., 2019).
- Supergraph Recovery: In the infinite-data, correctly specified limit, the learned adjacency is guaranteed to contain the true DAG's edges, i.e., it is a supergraph of the truth. Extraneous edges can be subsequently pruned using statistical tests or post-processing such as CAM-pruning (Ng et al., 2019).
- Consistency: Provided the optimization attains a global minimum and the model class is expressive enough, CAL procedures are consistent estimators of the underlying graph, assuming identifiability. Mild conditions (non-degeneracy, independence, and causal minimality) suffice.
4. Empirical Evaluation and Benchmarking
State-of-the-art CAL methods are validated via both synthetic and real-world data.
- Synthetic DAGs: Experiments on Erdős–Rényi random DAGs of varying size and expected degree, with various SEM types (Gaussian process, quadratic, post-nonlinear), assess estimator performance via structural Hamming distance (SHD) and true positive rate (TPR) (Ng et al., 2019). CAL using smooth masking and acyclicity constraints (e.g., MCSL) yields SHD reductions of 20–50% relative to baselines such as NOTEARS, GraN-DAG, and DAG-GNN.
- Real Networks: CAL matches or outperforms previous approaches on protein-signaling networks (e.g., the Sachs data, SHD = 12, best known) and telecom fault root-cause graphs, recovering a larger fraction of true causes than the alternatives (Ng et al., 2019).
- Comparisons: Methods employing smooth adjacency masking, differentiable DAG constraints, and auxiliary pruning dominate classical greedy-search and constraint-based algorithms, especially on nonlinear or moderately sized graphs.
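The two metrics used above are straightforward to compute from binary adjacency matrices. A minimal numpy sketch, using a small made-up example (one reversed edge):

```python
import numpy as np

def shd(A_true, A_est):
    """Structural Hamming distance over unordered pairs, so that a reversed
    edge (disagreement at both (i, j) and (j, i)) counts as one error."""
    diff = np.abs(A_true - A_est)
    return int(np.triu((diff + diff.T) > 0).sum())

def tpr(A_true, A_est):
    """True positive rate: fraction of true directed edges recovered."""
    return float(((A_true == 1) & (A_est == 1)).sum() / A_true.sum())

A_true = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [0, 0, 0]])
A_est  = np.array([[0, 1, 0],
                   [0, 0, 0],
                   [0, 1, 0]])   # edge 1 -> 2 estimated in reverse
print(shd(A_true, A_est), tpr(A_true, A_est))  # 1 0.5
```

Note that SHD conventions vary; some implementations charge a reversed edge as two errors, so reported numbers are only comparable under a fixed convention.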
5. Relation to Broader Causal Discovery Literature
CAL approaches build on and extend several key frameworks:
- Note on Orientation vs. Adjacency: CAL focuses exclusively on learning the adjacency/skeleton (presence of edges), not necessarily their orientations when the faithfulness assumptions do not fully hold. Algorithms such as Conservative PC and constraint-based variants clarify what guarantees hold under Adjacency-Faithfulness vs. full Faithfulness (Ramsey et al., 2012).
- Generalization Beyond SEMs: While CAL has been classically developed for additive-noise or linear SEMs, the masked-gradient structure extends naturally to deep or nonlinear SCMs (with the $f_j$ parameterized as neural networks) (Ng et al., 2019).
- Regularization and Auxiliary Losses: Recent work embeds CAL within predictive and autoencoding architectures, using auxiliary reconstruction and acyclicity penalties to steer representation learning toward causally faithful adjacencies (Kyono et al., 2020).
6. Limitations, Extensions, and Practical Considerations
- Model Mis-Specification: When the SEM class does not match the true data-generating process, recovery guarantees may only hold up to the supergraph, and spurious edges may require dedicated pruning stages.
- Thresholding and Discreteness: The final discrete adjacency estimate requires hard thresholding; small values of the Gumbel-Softmax mask temperature keep the masks near binary, but the choice of threshold still affects the sensitivity–specificity tradeoff.
- Scalability: The combination of differentiable masking and acyclicity constraints allows CAL methods to scale to moderately sized graphs, but the matrix-exponential acyclicity term remains computationally costly in very high dimensions.
- Post-Processing: CAM-pruning and similar post hoc strategies are often necessary to control for over-inclusion induced by regularization and to isolate the minimal true causal skeleton.
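The thresholding step discussed above, together with a sanity check that the hardened graph is in fact acyclic, can be sketched in numpy; the soft mask values and the 0.5 cutoff below are illustrative assumptions:

```python
import numpy as np

def threshold_mask(A_soft, t=0.5):
    """Harden a continuous mask into a binary adjacency estimate."""
    return (A_soft > t).astype(int)

def is_dag(A):
    """Kahn-style check: repeatedly delete nodes with no incoming edges;
    if deletion stalls while nodes remain, a cycle exists."""
    alive = np.ones(A.shape[0], dtype=bool)
    while alive.any():
        sub = A[np.ix_(alive, alive)]          # subgraph on surviving nodes
        sources = np.flatnonzero(sub.sum(axis=0) == 0)
        if sources.size == 0:
            return False
        alive_idx = np.flatnonzero(alive)
        alive[alive_idx[sources]] = False
    return True

A_soft = np.array([[0.02, 0.91],
                   [0.12, 0.03]])  # illustrative converged mask values
A_hat = threshold_mask(A_soft)
print(A_hat.tolist(), is_dag(A_hat))  # [[0, 1], [0, 0]] True
```

In practice the threshold interacts with the regularization strength, which is one reason pruning stages such as CAM-pruning are applied after hardening.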
CAL thus constitutes a robust approach to causal structure discovery, central to modern differentiable causal discovery pipelines, with well-understood identifiability conditions, empirical superiority to prior methods, and relevance to both theoretical and real-world applications (Ng et al., 2019).