GraN-DAG: Neural DAG Learning

Updated 25 March 2026

GraN-DAG is a score-based framework utilizing neural networks to learn directed acyclic graphs that capture complex, nonlinear dependencies.
It reformulates causal discovery as a constrained optimization problem with a continuous acyclicity constraint, generalizing linear methods like NOTEARS.
Empirical evaluations on synthetic and real datasets show competitive performance in terms of SHD and SID, highlighting its practical utility.

GraN-DAG (“Gradient-Based Neural DAG Learning”) is a score-based framework for learning directed acyclic graphs (DAGs) from observational data. It handles nonlinear dependencies between variables by parameterizing each conditional distribution with a neural network. By formulating causal discovery as a constrained optimization problem, GraN-DAG introduces a continuous, differentiable acyclicity constraint on the network weights, so that graph structure and network parameters can be learned jointly with gradient-based optimization. This approach generalizes linear methods such as NOTEARS to the nonlinear case (Lachapelle et al., 2019).

1. Problem Formulation and Parameterization

GraN-DAG addresses the problem of recovering an unknown DAG $\mathcal G$ over $d$ real-valued random variables $X = (X_1, \dots, X_d)$ from $n$ i.i.d. samples. Each conditional distribution $p_j(x_j \mid x_{-j})$ is modeled with a neural network (NN) with $L$ hidden layers, parameterized by weights $\phi_{(j)} = \{W_{(j)}^{(1)}, \dots, W_{(j)}^{(L+1)}\}$, whose input is restricted (via masking) to the current set of putative parents of node $j$. The masking ensures that only these candidate parents can influence the conditional as the input passes through the sequence of weight matrices and element-wise nonlinearities.
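The masking idea can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names, the bias terms, the single scalar output, and the toy sizes are not taken from the original implementation.

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def node_network(x, weights, biases, parent_mask):
    """Output of node j's network for one full sample x of shape (d,).

    parent_mask is a 0/1 vector (the diagonal of M_(j)). Masked-out
    variables are zeroed before the first layer, so they cannot
    influence the output.
    """
    a = x * parent_mask
    for W, b in zip(weights[:-1], biases[:-1]):
        a = leaky_relu(W @ a + b)
    return weights[-1] @ a + biases[-1]

rng = np.random.default_rng(0)
d, hidden = 4, 10
shapes = [(hidden, d), (hidden, hidden), (1, hidden)]  # L = 2 hidden layers
weights = [rng.normal(size=s) for s in shapes]
biases = [rng.normal(size=s[0]) for s in shapes]

# Hypothetical node j = 3 whose putative parents are variables 0 and 2.
mask_j = np.array([1.0, 0.0, 1.0, 0.0])

x = rng.normal(size=d)
x_changed = x.copy()
x_changed[1] += 5.0  # perturb a variable that is not a putative parent

out = node_network(x, weights, biases, mask_j)
out_changed = node_network(x_changed, weights, biases, mask_j)
```

Perturbing a masked-out variable leaves the output unchanged, whereas perturbing a putative parent generally does not.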

A core innovation is the introduction of a continuous acyclicity constraint. Whereas in the linear setting (NOTEARS) acyclicity is enforced via $h(U) = \operatorname{Tr}(e^{U \odot U}) - d = 0$, GraN-DAG generalizes this to the nonlinear case by defining for each node a connectivity matrix

$C_{(j)} = |W_{(j)}^{(L+1)}| \cdots |W_{(j)}^{(1)}| M_{(j)},$

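where the absolute values are taken elementwise and the binary mask matrix $M_{(j)}$ zeroes out pruned inputs. Entry $(C_{(j)})_{k,i}$ aggregates the absolute weights along all paths from input $i$ to output $k$ of network $j$, so it is zero when no such path exists. The induced weighted adjacency matrix $A_\phi \in \mathbb{R}^{d \times d}$ is assembled by summing each column of $C_{(j)}$ over the network outputs: $(A_\phi)_{i,j} = \sum_k (C_{(j)})_{k,i}$ for $i \neq j$, and $0$ otherwise. The acyclicity constraint is then $h(\phi) = \operatorname{Tr}(e^{A_\phi}) - d = 0$, which holds if and only if the graph induced by $A_\phi$ is acyclic.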
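As a rough illustration of how the constraint is assembled, the NumPy sketch below builds $C_{(j)}$ and $A_\phi$ for toy networks and evaluates $h$. The function names, toy shapes, and random weights are assumptions. A truncated power series stands in for a library matrix exponential, only to keep the sketch dependency-light.

```python
import numpy as np

def connectivity(weights, mask):
    """C_(j) = |W^(L+1)| ... |W^(1)| M_(j); weights ordered W^(1)..W^(L+1)."""
    C = mask.astype(float)
    for W in weights:          # left-multiply layer by layer
        C = np.abs(W) @ C
    return C                   # shape (outputs, d)

def adjacency(all_weights, all_masks):
    d = len(all_weights)
    A = np.zeros((d, d))
    for j in range(d):
        C = connectivity(all_weights[j], all_masks[j])
        A[:, j] = C.sum(axis=0)   # (A)_{ij} = sum_k (C_(j))_{ki}
        A[j, j] = 0.0             # no self-loops
    return A

def h(A, terms=30):
    """Tr(e^A) - d via a truncated power series (illustrative only)."""
    d = A.shape[0]
    total, power = 0.0, np.eye(d)
    for k in range(terms):
        total += np.trace(power)
        power = power @ A / (k + 1)
    return total - d

rng = np.random.default_rng(0)
d, hidden = 3, 4
nets = [[0.5 * rng.normal(size=(hidden, d)),
         0.5 * rng.normal(size=(1, hidden))] for _ in range(d)]

def diag_mask(parents):
    m = np.zeros((d, d))
    for p in parents:
        m[p, p] = 1.0
    return m

chain_masks = [diag_mask([]), diag_mask([0]), diag_mask([1])]   # 0 -> 1 -> 2
cycle_masks = [diag_mask([2]), diag_mask([0]), diag_mask([1])]  # adds 2 -> 0

h_chain = h(adjacency(nets, chain_masks))
h_cycle = h(adjacency(nets, cycle_masks))
```

For the chain, $A_\phi$ is nilpotent and `h_chain` is zero. Adding the edge $2 \to 0$ closes a cycle and makes `h_cycle` strictly positive.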

The main objective is the maximization of the average log-likelihood across all nodes and samples, subject to acyclicity:

$\max_{\phi}\; \mathcal S(\phi) \quad\text{s.t.}\quad h(\phi) = 0,$

where

$\mathcal S(\phi) = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^d \log p_j(x_j^{(i)} \mid x_{-j}^{(i)}; \phi_{(j)}).$

For Gaussian additive noise models (ANMs) with fixed noise variance, maximizing this score is equivalent to minimizing a squared-error loss.
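To see why, consider a Gaussian ANM in which the network for node $j$ outputs a conditional mean $\hat\mu_j(x_{-j}; \phi_{(j)})$ and the noise variance $\sigma_j^2$ is treated as fixed (both symbols are introduced here for illustration). Each term of $\mathcal S(\phi)$ then becomes

$\log p_j(x_j \mid x_{-j}; \phi_{(j)}) = -\frac{\left(x_j - \hat\mu_j(x_{-j}; \phi_{(j)})\right)^2}{2\sigma_j^2} - \frac{1}{2}\log(2\pi\sigma_j^2),$

so that, up to terms that do not depend on the network weights, maximizing $\mathcal S(\phi)$ amounts to minimizing the variance-weighted squared residuals

$\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^d \frac{\left(x_j^{(i)} - \hat\mu_j(x_{-j}^{(i)}; \phi_{(j)})\right)^2}{2\sigma_j^2}.$

If $\sigma_j^2$ is learned as well, the $\log \sigma_j^2$ term has to be kept in the objective.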

2. Optimization and Gradients

The constrained optimization is addressed via the augmented Lagrangian (AL) method, introducing multipliers $\lambda$ and penalty parameters $\mu$ to construct an unconstrained problem:

$\max_{\phi} \left[ \mathcal S(\phi) - \lambda h(\phi) - \frac{\mu}{2}h(\phi)^2 \right].$

$\lambda$ and $\mu$ are updated as is standard in AL methods: after each subproblem, $\lambda \leftarrow \lambda + \mu h(\phi)$, and $\mu$ is increased if $h(\phi)$ fails to shrink sufficiently relative to the previous subproblem.
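The update scheme can be sketched on a toy scalar problem. This is not the GraN-DAG objective: the score $-(x-2)^2$, the constraint $h(x) = x^2$, the growth factor `eta`, the shrinkage threshold `gamma`, and the bisection subproblem solver are all assumptions chosen so that the loop runs in plain Python.

```python
import math

def solve_subproblem(lam, mu):
    """Maximize -(x - 2)^2 - lam*h(x) - (mu/2)*h(x)^2 with h(x) = x^2 on [0, 2].

    The derivative is strictly decreasing on [0, 2], so bisection on its
    sign locates the maximizer.
    """
    def grad(x):
        return -2.0 * (x - 2.0) - 2.0 * lam * x - 2.0 * mu * x ** 3

    lo, hi = 0.0, 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if grad(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def update_multipliers(lam, mu, h_new, h_prev, eta=10.0, gamma=0.9):
    lam = lam + mu * h_new        # lambda <- lambda + mu * h
    if h_new > gamma * h_prev:    # constraint did not shrink enough
        mu = eta * mu
    return lam, mu

lam, mu, h_prev = 0.0, 1.0, math.inf
history = []
for _ in range(50):
    x = solve_subproblem(lam, mu)
    h_new = x ** 2
    history.append(h_new)
    lam, mu = update_multipliers(lam, mu, h_new, h_prev)
    h_prev = h_new
```

In this toy problem the constraint value recorded in `history` shrinks toward zero as the multiplier and the penalty grow. Like the acyclicity constraint, $h(x) = x^2$ is non-negative and has zero gradient at feasibility, which is why the penalty has to keep increasing.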

The gradients required are as follows:

For the conditional NNs, the likelihood gradient is standard and, in the Gaussian case, reduces to backpropagation of the squared residual.
The acyclicity penalty gradient exploits matrix calculus, specifically,

$\nabla_{A_\phi} h = (e^{A_\phi})^{\top}$

and the chain rule, as the parameterization of $A_\phi$ in terms of NN weights is nonlinear and involves absolute-value path products, typically handled by automatic differentiation.
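The identity $\nabla_{A} h = (e^{A})^{\top}$ can be checked numerically. The sketch below compares it against central finite differences on a small random non-negative matrix. The helper names are assumptions, and a truncated series again replaces a library matrix exponential.

```python
import numpy as np

def expm_series(A, terms=40):
    """Truncated power series for e^A (adequate for small matrix norms)."""
    out = np.zeros_like(A)
    power = np.eye(A.shape[0])
    for k in range(terms):
        out = out + power
        power = power @ A / (k + 1)
    return out

def h(A):
    return np.trace(expm_series(A)) - A.shape[0]

def grad_h(A):
    return expm_series(A).T

rng = np.random.default_rng(0)
A = rng.uniform(0.0, 0.5, size=(4, 4))
np.fill_diagonal(A, 0.0)

eps = 1e-6
numeric = np.zeros_like(A)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(A)
        E[i, j] = eps
        numeric[i, j] = (h(A + E) - h(A - E)) / (2 * eps)

max_err = np.max(np.abs(numeric - grad_h(A)))
```

In a full implementation this gradient is then propagated to the network weights through $A_\phi$ by automatic differentiation, as the text notes.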

3. Neural Architectures and Implementation

In typical scenarios, each node’s conditional is modeled by a neural network with $L=2$ hidden layers of size 10 and leaky-ReLU activations. On data sets with higher risk of overfitting, $L=1$ is used. Weights are initialized via Xavier (Glorot) schemes, RMSprop is employed as the optimizer, and the learning rate is set to $10^{-2}$ for the initial subproblem and $10^{-4}$ thereafter.

Mask matrices $M_{(j)}$ are updated at each stage by hard-thresholding: any edge $i \to j$ for which $(A_\phi)_{i,j} < \epsilon$, with $\epsilon = 10^{-4}$, is permanently removed by zeroing the corresponding input in $M_{(j)}$. This enforces sparsity and leads to more interpretable final graphs.
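A minimal sketch of the thresholding step follows. Storing all masks in one boolean matrix `keep`, whose column $j$ plays the role of the diagonal of $M_{(j)}$, is an assumption made for compactness, and the two toy adjacency matrices are invented.

```python
import numpy as np

def threshold_edges(A, keep, eps=1e-4):
    """Updated edge mask; keep[i, j] stays True only while i -> j survives.

    The logical AND makes removal permanent: an edge dropped at an
    earlier stage can never be re-enabled.
    """
    return keep & (A >= eps)

keep = ~np.eye(3, dtype=bool)   # start with every off-diagonal edge allowed

A_stage1 = np.array([[0.0,  0.8,  5e-5],
                     [2e-5, 0.0,  0.3],
                     [0.1,  3e-6, 0.0]])
keep = threshold_edges(A_stage1, keep)
n_after_stage1 = int(keep.sum())

A_stage2 = np.array([[0.0,  0.9, 0.0],
                     [0.0,  0.0, 0.4],
                     [5e-5, 0.0, 0.0]])
keep = threshold_edges(A_stage2, keep)
n_after_stage2 = int(keep.sum())
```

After the second stage only the edges $0 \to 1$ and $1 \to 2$ remain enabled in this toy example.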

4. Algorithmic Workflow

The high-level GraN-DAG procedure iterates through AL subproblems. For each, it trains the network parameters $\phi$ using RMSprop with minibatches, performs early stopping based on a validation set, updates the multipliers, and thresholds small edge weights. This process is repeated until the acyclicity constraint is met with high numerical precision. A final edge selection then guarantees exact acyclicity: edges are ranked by Jacobian-based scores $J_{i,j} = \mathbb{E}[|\partial \log p_j/\partial x_i|]$, and the lowest-scoring edges are removed until the remaining graph is a DAG.
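The final edge-selection step described above can be sketched in plain Python. The score matrix `J` is invented toy data, the function names are hypothetical, and the strategy shown (drop the globally weakest edge until Kahn's algorithm reports no cycle) is one straightforward reading of the description rather than the authors' exact code.

```python
def is_acyclic(edges, d):
    """Kahn's algorithm: a graph is a DAG iff every node can be peeled off."""
    indegree = [0] * d
    children = [[] for _ in range(d)]
    for i, j in edges:
        children[i].append(j)
        indegree[j] += 1
    stack = [v for v in range(d) if indegree[v] == 0]
    visited = 0
    while stack:
        v = stack.pop()
        visited += 1
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:
                stack.append(c)
    return visited == d

def prune_to_dag(J):
    """Remove lowest-scoring edges until the remaining graph is acyclic."""
    d = len(J)
    ranked = sorted((J[i][j], i, j)
                    for i in range(d) for j in range(d)
                    if i != j and J[i][j] > 0)
    edges = [(i, j) for _, i, j in ranked]   # ascending score
    k = 0
    while not is_acyclic(edges[k:], d):
        k += 1                                # drop the weakest remaining edge
    return set(edges[k:])

# Toy scores: J[i][j] is the score of edge i -> j.
J = [[0.00, 0.90, 0.00],
     [0.05, 0.00, 0.80],
     [0.00, 0.02, 0.00]]

dag = prune_to_dag(J)
```

Here the weak reverse edges $2 \to 1$ and $1 \to 0$ are dropped, leaving the chain $0 \to 1 \to 2$.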

Subsequently, a regression-based pruning step analogous to the one in CAM (using Generalized Additive Models and statistical testing) further trims non-significant edges. The final output is the estimated DAG given by the remaining nonzero entries of $A_\phi$.

5. Empirical Evaluation and Results

GraN-DAG was empirically validated on synthetic and real-world datasets using a suite of metrics:

SHD (Structural Hamming Distance): counts the edge additions, deletions, and reversals needed to turn the estimated graph into the true one.
SID (Structural Intervention Distance): counts the pairs of nodes for which the estimated graph would yield an incorrect interventional distribution.
SHD-C: SHD computed over the CPDAG, accommodating methods that only return equivalence classes.
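As a small illustration of the first metric, SHD between two DAGs given as edge sets can be computed as below. The function name and the toy graphs are assumptions, and a reversed edge is counted once here, whereas some implementations count it twice.

```python
def shd(true_edges, est_edges):
    """Count node pairs whose edge status (absent, i -> j, j -> i) differs."""
    true_by_pair = {frozenset(e): tuple(e) for e in true_edges}
    est_by_pair = {frozenset(e): tuple(e) for e in est_edges}
    pairs = true_by_pair.keys() | est_by_pair.keys()
    return sum(true_by_pair.get(p) != est_by_pair.get(p) for p in pairs)

true_dag = {(0, 1), (1, 2), (2, 3)}
estimated = {(0, 1), (2, 1), (0, 3)}   # one reversed, one missing, one extra

distance = shd(true_dag, estimated)
```

For these toy graphs the distance is 3: the reversed edge between nodes 1 and 2, the missing edge $2 \to 3$, and the extra edge $0 \to 3$.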

Experiments were conducted on synthetic Erdős–Rényi (ER) and scale-free (SF) graphs with $d$ or $4d$ edges on average (ER1, ER4, SF1, SF4), with $n=1000$ samples and $d \in \{10,20,50,100\}$, using Gaussian ANM, linear, additive function, and post-nonlinear (PNL) mechanisms. Realistic cases included the Sachs protein signaling network ($d=11$) and SynTReN gene regulation ($d=20$).

Results demonstrate:

For 10-node ER1 graphs, GraN-DAG yields SHD $= 1.7 \pm 2.5$ and SID $= 1.7 \pm 3.1$, outperforming the continuous methods (NOTEARS: SHD $= 12.2 \pm 2.9$; DAG-GNN: SHD $= 11.4 \pm 3.1$) and remaining competitive with greedy-search baselines (CAM: SHD $= 1.1 \pm 1.1$) (Lachapelle et al., 2019).
On 50-node graphs, GraN-DAG maintains strong performance (SHD $= 102.6 \pm 21.2$, SID $= 1060 \pm 109$), while linear and nonlinear continuous baselines degrade.
On real data (Sachs), GraN-DAG and its heteroskedastic extension (GraN-DAG++) achieve SHD $= 13$, SHD-C $= 11$, and SID $= 47\text{–}48$, comparable to CAM (SHD $= 12$, SHD-C $= 9$) and better than NOTEARS or DAG-GNN.

Comparative results also indicate that GSF (kernel-based scores) performs worse than GraN-DAG, but better than simple random baselines.

6. Strengths, Limitations, and Potential Extensions

GraN-DAG introduces several advances for directed graph learning:

Nonlinear structural equation modeling with neural networks as universal function approximators.
A fully differentiable acyclicity constraint, which allows all edges to be optimized jointly by gradient-based methods rather than by discrete or greedy search procedures.
Compatibility with GPU acceleration and flexible network architectures.

However, some limitations are noted:

The optimization landscape is non-convex; only stationary points are guaranteed.
Algorithm performance depends on careful tuning of model and optimization hyperparameters, and overfitting risks are managed via early stopping and GAM-based pruning.
Computational cost is cubic in the number of nodes $d$ due to the use of the matrix exponential, which may be prohibitive for extremely large graphs (though $d \leq 100$ is tractable).

Potential extensions highlighted include:

Adapting the framework to discrete, mixed, or other conditional exponential families.
Investigating faster or alternative acyclicity constraints, e.g., fast approximations to $\operatorname{Tr}(e^A)$.
Leveraging partial ordering constraints or interventional data.
Enhancing scalability to handle thousands of variables, likely via approximate computation of matrix exponentials or via block-coordinate methods.

GraN-DAG is thus situated as a competitive tool for causality and structure learning, especially for problems involving nonlinear, potentially high-dimensional dependencies, and offers a technically innovative alternative to existing continuous optimization and greedy search methods (Lachapelle et al., 2019).

References

Lachapelle et al. (2019). Gradient-Based Neural DAG Learning.