Directed Acyclic Graph (DAG) Learning

Updated 17 October 2025
  • Directed acyclic graph (DAG) learning is a framework for inferring causal structures and conditional independencies from data, emphasizing parent-child relationship discovery.
  • Score-based and constraint-based methods, including techniques like NOTEARS and PC algorithms, are employed to manage high-dimensional search spaces and uncertainty.
  • Recent advances incorporate continuous acyclicity constraints, neural network frameworks, and ensemble methods to enhance scalability, interpretability, and accuracy in practical applications.

Directed acyclic graph (DAG) learning refers to the family of statistical, algorithmic, and optimization methodologies for estimating the structure of a DAG that encodes conditional independencies or causal relationships among random variables, typically from observational or interventional data. The objective is to recover the edge set and orientation (parent–child relationships) underlying complex systems such as gene regulatory networks, brain networks, and general probabilistic graphical models. This undertaking is fundamentally challenging due to the super-exponential growth of the DAG space with the number of nodes, non-identifiability from purely observational data (when only Markov equivalence is available), and practical challenges such as high dimensionality, low sample sizes, and domain-specific constraints.
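
To make the combinatorial difficulty concrete: the number of labeled DAGs on $d$ nodes satisfies Robinson's recurrence $a_d = \sum_{k=1}^{d} (-1)^{k-1}\binom{d}{k} 2^{k(d-k)} a_{d-k}$ with $a_0 = 1$, already exceeding $10^{18}$ at $d = 10$. The short sketch below computes these counts (standard-library Python only; the function name is illustrative).

```python
from math import comb

def num_dags(d):
    """Count labeled DAGs on d nodes via Robinson's recurrence (OEIS A003024)."""
    a = [1]                                   # a[0] = 1: the empty graph
    for n in range(1, d + 1):
        a.append(sum((-1) ** (k - 1) * comb(n, k) * 2 ** (k * (n - k)) * a[n - k]
                     for k in range(1, n + 1)))
    return a[d]

print([num_dags(d) for d in range(1, 6)])     # [1, 3, 25, 543, 29281]
print(num_dags(10))                           # > 10^18
```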

1. Statistical Principles and Identifiability

Directed acyclic graphs serve as graphical encodings of joint probability distributions under the Markov property, enabling the representation of direct dependency structure. Under the standard Markov condition, a variable is independent of its non-descendants given its parents. Identifiability of the DAG from observational data is often not guaranteed due to the existence of Markov equivalent DAGs that encode the same set of conditional independence relationships. Without additional assumptions (such as faithfulness, strong-faithfulness, or knowledge of variable ordering), the best that purely statistical algorithms can do is recover the equivalence class, typically represented as a completed partially directed acyclic graph (CPDAG) (Shojaie et al., 24 Mar 2024).
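
Concretely, the Markov condition is equivalent to the joint density factorizing according to the graph, $p(x_1, \dots, x_d) = \prod_{j=1}^{d} p\bigl(x_j \mid x_{\mathrm{pa}(j)}\bigr)$, where $\mathrm{pa}(j)$ denotes the parent set of node $j$. Two DAGs that induce exactly the same set of conditional independencies (e.g., $X \to Y$ versus $Y \to X$ for a dependent pair) are Markov equivalent and cannot be told apart from observational data alone.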

Causal interpretability is inherently constrained by the impossibility of distinguishing between certain graph structures based solely on observed conditional independencies, motivating the use of interventions, partial background knowledge (e.g., partial orderings), or model class restrictions for progress in real-world applications.

2. Score-based and Constraint-based Learning Algorithms

The two main strands of DAG learning algorithms are score-based and constraint-based methods.

  • Score-based algorithms assign a numerical score to each possible DAG based on the observed data and search for the graph minimizing that score (such as negative log-likelihood plus penalty). The BIC and variants are prototypical scoring functions, with complexity and sparsity penalties to avoid overfitting (Wang et al., 2014, Manzour et al., 2019). Traditional structure search is computationally infeasible for all but small graphs, and so efficient heuristics (e.g., greedy hill climbing, dynamic programming, or estimator relaxations) are essential. A minimal BIC-style scoring sketch is given below.
  • Constraint-based approaches (e.g., the PC algorithm) leverage conditional independence tests to eliminate inconsistent edges and orient as many as possible via graphical rules. Extensions of these methods (PC with p-values, control of false discovery rate, knockoff-based edge selection) are developed to handle high-dimensional regimes and multiple hypothesis testing (Shojaie et al., 24 Mar 2024).

Hybrid algorithms combine score- and constraint-based principles, incorporating prior knowledge and user-defined constraints.
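
As a concrete illustration of the score-based strand, the sketch below evaluates a BIC-type score for a candidate parent assignment under a linear Gaussian structural equation model; the names (`bic_score`, `parents`) are illustrative rather than taken from any cited package, and likelihood constants are dropped since they do not affect the ranking of graphs.

```python
import numpy as np

def bic_score(X, parents):
    """Penalized negative log-likelihood (lower is better) of a candidate DAG.

    X       : (n, d) data matrix
    parents : dict mapping each node j to the list of its parent indices
    """
    n, d = X.shape
    score = 0.0
    for j in range(d):
        pa = parents[j]
        y = X[:, j]
        if pa:
            beta, *_ = np.linalg.lstsq(X[:, pa], y, rcond=None)  # regress j on its parents
            resid = y - X[:, pa] @ beta
        else:
            resid = y - y.mean()
        sigma2 = max(resid @ resid / n, 1e-12)
        # Gaussian log-likelihood term (constants dropped) + BIC complexity penalty
        score += 0.5 * n * np.log(sigma2) + 0.5 * np.log(n) * (len(pa) + 1)
    return score

# Toy example with ground truth X1 -> X2: the true structure scores lower.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 2.0 * x1 + rng.normal(size=500)
X = np.column_stack([x1, x2])
print(bic_score(X, {0: [], 1: [0]}))   # true parent sets
print(bic_score(X, {0: [], 1: []}))    # empty graph
```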

3. Optimization Frameworks and Acyclicity Constraints

Learning a DAG structure is fundamentally a constrained optimization problem in which the acyclicity constraint is combinatorial and non-trivial to enforce. Recent advances reformulate this constraint in a continuously differentiable or convex form. Notably:

  • The NOTEARS framework [Zheng et al., 2018] and its successors express acyclicity of a (weighted) adjacency matrix $A$ via trace-based conditions such as $\operatorname{tr}(\exp(A \circ A)) - d = 0$, where $\circ$ is the Hadamard product and $d$ the number of nodes. This enables the use of smooth constrained optimization algorithms (a numerical sketch of this and the log-determinant measure follows this list).
  • Extensions to nonlinear (nonparametric) settings model dependencies via neural networks (Lachapelle et al., 2019) or RKHS representations with sparse derivative penalties (Liang et al., 20 Aug 2024), using continuous acyclicity constraints (such as log-determinant formulations $h_{\mathrm{ldet}}^{s}(A) = -\log \det(sI_d - A) + d\log s$ under spectral constraints).
  • Integer programming approaches (Manzour et al., 2019) (including the Layered Network (LN) formulation) introduce compact mixed-integer quadratic programs with layer variables enforcing topological ordering via constraints on node positions, harnessing sparsity of prior super-structures for tractability.
  • For count data and non-Gaussian models, specialized procedures leverage properties such as quadratic variance functions for topological layer recovery (Zhou et al., 2021) or employ feature selection and statistical significance testing in discrete settings (Nguyen et al., 7 Jun 2024).
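
Both continuous acyclicity measures above are straightforward to evaluate numerically. The sketch below (illustrative function names, NumPy/SciPy assumed) applies the trace-exponential form to the Hadamard square $A \circ A$ as in NOTEARS, and the log-determinant form directly to a nonnegative matrix whose spectral radius stays below $s$.

```python
import numpy as np
from scipy.linalg import expm

def h_trace_exp(A):
    """NOTEARS-style measure h(A) = tr(exp(A ∘ A)) - d; zero iff A is acyclic."""
    d = A.shape[0]
    return np.trace(expm(A * A)) - d          # A * A is the Hadamard square

def h_logdet(A, s=1.0):
    """Log-determinant measure h(A) = -log det(s I - A) + d log s.

    Assumes A is entrywise nonnegative with spectral radius below s
    (for signed weights, pass A * A instead of A).
    """
    d = A.shape[0]
    sign, logabsdet = np.linalg.slogdet(s * np.eye(d) - A)
    return -logabsdet + d * np.log(s)

# A 3-node chain (acyclic) versus a graph containing a 2-cycle
A_chain = np.array([[0.0, 0.8, 0.0], [0.0, 0.0, 0.5], [0.0, 0.0, 0.0]])
A_cycle = np.array([[0.0, 0.8, 0.0], [0.5, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(h_trace_exp(A_chain), h_trace_exp(A_cycle))  # ~0.0 vs. > 0
print(h_logdet(A_chain), h_logdet(A_cycle))        # ~0.0 vs. > 0
```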

4. Ensemble Methods and Stability

Model instability and a high rate of false positives are prevalent, especially in high-dimensional, low-sample-size scenarios. The DAGBag procedure (Wang et al., 2014) introduces bootstrap aggregating: multiple DAGs are estimated on bootstrap replicates, and an aggregated consensus DAG is constructed to minimize the ensemble SHD (structural Hamming distance) or its generalizations. Aggregation relies on edge selection frequencies across resamples, promoting only those edges that are stable above an empirical threshold into the final structure, reducing variance and false edge discoveries.

The aggregation procedure can be formally described as $\text{score.SHD}(\mathcal{G} : \mathbb{G}^e) = \sum_{e \in \mathbb{E}(\mathcal{G})} \left(1 - 2p_e\right) + C$, where $p_e$ is the selection frequency of directed edge $e$ and $C$ is a constant.
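
A minimal sketch of this frequency-based aggregation is given below; the function name and interface are illustrative (not the actual dagbag API), and the cycle-repair step that a complete aggregation would need is only indicated in a comment.

```python
import numpy as np

def aggregate_dags(bootstrap_adjs, threshold=0.5):
    """Consensus DAG from bootstrap estimates, in the spirit of DAGBag.

    bootstrap_adjs : list of (d, d) binary adjacency matrices, one per replicate
    threshold      : minimum selection frequency p_e for an edge to be retained
    """
    freq = np.mean(np.stack(bootstrap_adjs), axis=0)   # p_e for each directed edge
    # Including edge e adds (1 - 2 p_e) to the ensemble score, which is
    # negative (i.e., beneficial under minimization) exactly when p_e > 0.5.
    consensus = (freq > threshold).astype(int)
    np.fill_diagonal(consensus, 0)
    # A full implementation would also break any directed cycles that survive
    # thresholding, e.g., by dropping the lowest-frequency edge on each cycle.
    return consensus, freq
```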

5. Nonparametric, Neural, and Convex Approaches

Continuous optimization has opened the field to a broad range of machine learning techniques for DAG learning.

  • Graph neural network parameterizations (e.g., DAG-GNN (Yu et al., 2019)) model the structural equation model (SEM) within a variational autoencoder under an explicit acyclicity constraint, enabling support for nonlinear, discrete, and vector-valued data.
  • Gradient-based neural methods (Lachapelle et al., 2019) use feedforward networks for conditional modeling, extract "connectivity" matrices from network weights, and enforce acyclicity via matrix exponential constraints optimized with an augmented Lagrangian (a schematic version of this loop is sketched after this list).
  • Convex acyclicity formulations (Rey et al., 12 Sep 2024) leverage non-negativity of weighted edges in the DAG, permitting the use of convex log-determinant constraints that guarantee global optima in the infinite-sample regime. This approach provides explicit gradient formulas and ensures structural recovery, addressing limitations of local optima endemic to earlier non-convex methods.
  • Discrete backpropagation frameworks such as DAG-DB (Wren et al., 2022) operate on sampled binary adjacency matrices, updating distributions over the DAG space directly without relaxation to the continuous domain, employing straight-through estimation or implicit maximum likelihood estimation for gradient flow.
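
In heavily simplified form, the augmented Lagrangian strategy used by these gradient-based methods can be sketched for the linear case as below. The function name is illustrative, the inner solver is reduced to plain gradient steps with an L1 subgradient, and published implementations instead use L-BFGS-style solvers, proximal handling of the sparsity penalty, and tuned penalty schedules.

```python
import numpy as np
from scipy.linalg import expm

def fit_linear_dag(X, lam=0.1, max_outer=20, inner_steps=500, lr=1e-3):
    """Schematic augmented-Lagrangian loop for linear continuous DAG learning.

    Minimizes  (1/2n) ||X - X A||_F^2 + lam ||A||_1
    subject to h(A) = tr(exp(A ∘ A)) - d = 0.
    """
    n, d = X.shape
    A = np.zeros((d, d))
    alpha, rho = 0.0, 1.0                        # multiplier and penalty weight
    for _ in range(max_outer):
        for _ in range(inner_steps):
            E = expm(A * A)
            h = np.trace(E) - d
            grad_h = 2.0 * E.T * A               # gradient of the acyclicity term
            grad_fit = -(X.T @ (X - X @ A)) / n  # gradient of the squared loss
            grad = grad_fit + (alpha + rho * h) * grad_h + lam * np.sign(A)
            A -= lr * grad
            np.fill_diagonal(A, 0.0)             # no self-loops
        h = np.trace(expm(A * A)) - d
        alpha += rho * h                         # dual ascent on the multiplier
        if h > 1e-8:
            rho *= 10                            # tighten the penalty
    A[np.abs(A) < 0.3] = 0.0                     # crude final edge thresholding
    return A
```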

6. Extensions: Foundations, Resource Constraints, and New Use Cases

Recent works expand the classical DAG learning paradigm in several directions:

  • Foundation model approaches (e.g., ADAG (Yin et al., 23 Jun 2025)) employ pre-training with linear transformers to learn a shared representation of causal structures across multiple domains, enabling zero-shot inference and overcoming the identifiability limitations in small-sample regimes via multi-task regularization.
  • The concept of grammar-based sequential learning (2505.22949) introduces a context-free graph grammar construction, uniquely encoding DAGs as sequences of production rules. This injective, lossless transformation supports unambiguous generative modeling, latent space property prediction, and sequential Bayesian optimization over the DAG manifold.
  • Resource-constrained prediction frameworks cast DAGs as policies for adaptive sensor (feature) acquisition, where nodes correspond to sensor sets and edges represent acquisition/classification decisions, optimized via cost-sensitive empirical risk minimization (Wang et al., 2015).
  • Dynamic and time-varying graph settings are considered with coupled contemporaneous and lagged relationships, as in GraphNOTEARS (Fan et al., 2022), which leverages both temporal dependencies and network structure in dynamic graphs via smooth optimization and joint acyclicity constraints.
  • Local structure learning in high-dimensional graphs emphasizes estimation of neighborhoods around user-specified targets, promoting inference quality and computational tractability by restricting estimation scopes (Smith et al., 24 May 2024).

7. Empirical Performance, Applications, and Open Directions

Empirical verification spans broad domains, including gene regulatory networks, protein signaling pathways, economics, neuroscience, and automated machine learning. Theoretical advances in computational tractability (e.g., convex relaxation, dynamic programming, boosting, and efficient hill climbing (Wang et al., 2014)) complement practical demonstrations—simulation studies highlight improvements in false discovery rates, structural Hamming distance, scalability, and interpretability.

Notable application-specific adaptations include:

  • High-precision inference in multivariate count data, using methods tailored to Poisson and overdispersed models (Nguyen et al., 7 Jun 2024).
  • Highly efficient hill climbing algorithms, early stopping, and model constraints embedded in software packages (e.g., dagbag (Wang et al., 2014)).
  • Grammar-based compression and autoencoding supporting generative graph modeling (2505.22949).
  • Attention-based architectures providing low runtime, high data efficiency, and cross-task transferability in foundation model settings (Yin et al., 23 Jun 2025).

Despite rapid progress, outstanding challenges persist: scaling to ultra-high-dimensional networks with minimal assumptions, dealing with latent confounding and incomplete observations, and resolving identifiability up to full equivalence classes. The field continues to develop principled regularization, improved statistical guarantees, and tight integration with domain knowledge for enabling accurate, interpretable, and scalable DAG recovery.
