
NOTEARS: Continuous DAG Structure Learning

Updated 22 October 2025
  • The NOTEARS framework is a score-based method that formulates DAG structure learning as a continuous optimization problem using a differentiable acyclicity constraint.
  • It leverages a smooth matrix exponential function to ensure acyclicity, overcoming the combinatorial complexity of traditional structure discovery.
  • Methodological extensions enhance NOTEARS with nonlinear models, scalable algorithms, and adaptive regularization to improve performance in diverse causal inference applications.

The NOTEARS (Non-combinatorial Optimization via Trace Exponential and Augmented lagRangian for Structure learning) framework is a class of score-based methods for learning directed acyclic graph (DAG) structures from observational data. It recasts the combinatorial problem of structure discovery into a continuous optimization problem by encoding acyclicity as a differentiable constraint, enabling the use of efficient gradient-based algorithms. Since its introduction, NOTEARS has served as a foundation for numerous theoretical, methodological, and applied advances in causal and probabilistic graphical models research.

1. Continuous Optimization Formulation and Acyclicity Constraint

The core innovation of NOTEARS is the formulation of DAG structure learning as a constrained continuous optimization problem over a weighted adjacency matrix W \in \mathbb{R}^{d \times d}. This is achieved by introducing a smooth algebraic function that vanishes if and only if W describes an acyclic graph. In the canonical linear structural equation modeling case, the minimization objective is

\min_{W} \quad F(W) = \ell(W; X) + \lambda \|W\|_1 \quad \text{subject to} \quad h(W) = 0,

where \ell(W; X) is typically the least squares loss \ell(W; X) = (1/2n)\|X - XW\|_F^2, \lambda controls sparsity, and h(W) is the acyclicity constraint. The original formulation employs

h(W) = \operatorname{tr}\left(\exp(W \circ W)\right) - d = 0,

with \circ denoting the Hadamard (element-wise) product. This constraint ensures that W represents a DAG: the trace of the matrix exponential equals the number of nodes d exactly when W \circ W is nilpotent, which holds if and only if the graph induced by W contains no cycles (Lachapelle et al., 2019, Wei et al., 2020).

This continuous relaxation enables the use of augmented Lagrangian or projected-gradient methods, overcoming the super-exponential complexity of combinatorial DAG search.
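
For concreteness, the following is a minimal NumPy/SciPy sketch of the acyclicity function h(W) and its gradient in the least-squares setting above; the function and variable names are illustrative and not taken from any released implementation.

```python
# Minimal sketch of the NOTEARS acyclicity constraint h(W) = tr(exp(W∘W)) - d
# and its gradient (exp(W∘W))^T ∘ 2W; illustrative only.
import numpy as np
from scipy.linalg import expm

def h_and_grad(W: np.ndarray):
    E = expm(W * W)                  # matrix exponential of the Hadamard square
    h = np.trace(E) - W.shape[0]     # zero exactly when the graph of W is acyclic
    grad = E.T * (2 * W)             # chain rule through the Hadamard square
    return h, grad

# A 2-cycle violates the constraint (h > 0); a DAG satisfies it (h = 0).
W_cycle = np.array([[0.0, 1.0], [1.0, 0.0]])
W_dag = np.array([[0.0, 1.0], [0.0, 0.0]])
print(h_and_grad(W_cycle)[0])   # ≈ 1.086
print(h_and_grad(W_dag)[0])     # ≈ 0.0
```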

2. Methodological Extensions and Algorithmic Developments

Subsequent works have extended the NOTEARS framework along several axes.

  • Nonlinear and Nonparametric Models:

GraN-DAG generalizes NOTEARS to capture nonlinear relationships by learning, for each variable X_j, a neural-network-based conditional model with parameters \phi_j. A neural connectivity matrix A_\phi is derived from the network weights, and the acyclicity constraint is enforced as h(\phi) = \operatorname{tr}\exp(A_\phi) - d = 0. The optimization proceeds over all network parameters simultaneously via stochastic gradient algorithms and an augmented Lagrangian penalty (Lachapelle et al., 2019).
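
To make the connectivity-matrix idea concrete, here is a small sketch, following the description above, of assembling such a matrix from per-variable MLP weights; the two-layer architecture, the masking of the self-input, and the function names are assumptions for illustration only.

```python
# Sketch: neural connectivity matrix A, where A[i, j] aggregates products of
# absolute layer weights along all paths from input X_i to variable X_j's output.
import numpy as np

def connectivity_column(weight_mats):
    """weight_mats: layer weight matrices of one variable's MLP, input -> output."""
    path = np.abs(weight_mats[0])
    for Wl in weight_mats[1:]:
        path = np.abs(Wl) @ path         # accumulate |W^(L)| ... |W^(1)|
    return path.sum(axis=0)              # sum over the network's output units

d, hidden = 3, 8
rng = np.random.default_rng(0)
A = np.zeros((d, d))
for j in range(d):
    W1 = rng.normal(size=(hidden, d))
    W1[:, j] = 0.0                       # mask the self-input X_j -> X_j
    W2 = rng.normal(size=(1, hidden))
    A[:, j] = connectivity_column([W1, W2])
# The same penalty tr(exp(A)) - d = 0 can then be imposed on A during training.
```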

  • Computational Scalability:

Matrix exponential-based acyclicity constraints incur O(d^3) cost. Alternatives such as NO-BEARS replace the trace-exponential constraint with a spectral radius condition \rho(W \circ W) = 0, estimated via fast power iteration and enabling O(d^2) scaling; GPU implementation further accelerates execution in high-dimensional settings (Lee et al., 2019). LoRAM employs a low-rank factorization and sparsification to achieve O(d^2 r) complexity, facilitating DAG learning with thousands of nodes (Dong et al., 2022).
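
The spectral-radius alternative can be sketched with plain power iteration, as below; this is an illustrative approximation under the assumption that a Rayleigh-quotient estimate of the dominant eigenvalue suffices, not the exact bound used by NO-BEARS.

```python
# Sketch: spectral-radius acyclicity penalty rho(W∘W), estimated by power
# iteration; each step costs O(d^2), versus O(d^3) for the matrix exponential.
import numpy as np

def spectral_radius_penalty(W: np.ndarray, iters: int = 100) -> float:
    A = W * W                            # nonnegative; rho(A) = 0 iff the graph of W is acyclic
    v = np.ones(A.shape[0])
    for _ in range(iters):
        Av = A @ v
        norm = np.linalg.norm(Av)
        if norm < 1e-12:                 # A is nilpotent: the graph is acyclic
            return 0.0
        v = Av / norm
    return float(v @ A @ v)              # Rayleigh quotient ≈ largest eigenvalue
```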

  • Loss and Regularization Variants:

Extensions such as NOTEARS-AL integrate adaptive Lasso penalties to encourage data-driven sparsity, providing oracle properties under specified asymptotics. This approach avoids heuristic post-thresholding, instead applying penalty multipliers c_{ij} based on preliminary estimates (Xu et al., 2022). Polynomial regression and CNN-based models enable the modeling of nonlinearities essential for processes such as gene expression (Lee et al., 2019), time series (Sun et al., 2021), and dynamic systems.
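
A hedged sketch of the adaptive-penalty idea follows: per-edge multipliers c_{ij} are derived from a preliminary estimate so that weak preliminary edges receive a larger penalty. The exponent, epsilon, and function name are illustrative assumptions, not the published parameterization.

```python
# Sketch: adaptive-Lasso-style penalty with per-edge multipliers derived from a
# preliminary estimate W_init (illustrative parameterization).
import numpy as np

def adaptive_l1_penalty(W, W_init, lam=0.1, gamma=1.0, eps=1e-6):
    c = 1.0 / (np.abs(W_init) + eps) ** gamma   # large multiplier on weak preliminary edges
    return lam * np.sum(c * np.abs(W))          # replaces the plain lam * ||W||_1 term
```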

  • Optimization Strategies and Postprocessing:

The nonconvexity of the constraint raises challenges for convergence and KKT stationarity. KKTS postprocessing algorithms exploit necessary and sufficient KKT conditions to iteratively remove (or reverse) edges, achieving substantial improvements in metrics such as the structural Hamming distance (SHD) (Wei et al., 2020). Bi-level algorithms utilizing topological swaps explore the space of topological orders, alternating between inner continuous minimization (given an order) and outer discrete order updates to escape poor local minima (Deng et al., 2023).
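
As an illustration of the inner continuous step in such order-based schemes, the sketch below fits a linear SEM given a fixed topological order: acyclicity holds by construction, and each variable is Lasso-regressed on its predecessors. The helper name and the use of scikit-learn are assumptions for the example, not the published algorithm.

```python
# Sketch: inner step of an order-based (bi-level) scheme. Given a topological
# order, the acyclicity constraint is automatically satisfied.
import numpy as np
from sklearn.linear_model import Lasso

def fit_given_order(X, order, lam=0.1):
    """order: list of column indices, e.g. [2, 0, 1]; returns a d x d weight matrix."""
    d = X.shape[1]
    W = np.zeros((d, d))
    for pos, j in enumerate(order):
        parents = list(order[:pos])
        if parents:
            model = Lasso(alpha=lam).fit(X[:, parents], X[:, j])
            W[parents, j] = model.coef_      # only edges from earlier variables are allowed
    return W
```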

  • Handling Prior Knowledge and Constraints:

The framework is amenable to incorporating domain expertise. Expert-provided edge constraints can be encoded as equality or inequality restrictions on W (e.g., W_{ij} = 0 to forbid, |W_{ij}| \geq \text{thresh} to enforce) and are managed within the augmented Lagrangian optimization (Chowdhury et al., 2023, Chen et al., 2023).
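
A minimal sketch of how such hard constraints might be projected onto W between optimization steps is given below; the data structures and threshold handling are illustrative assumptions, not the encoding used in the cited works.

```python
# Sketch: project expert edge constraints onto the weight matrix W.
import numpy as np

d = 4
forbidden = np.zeros((d, d), dtype=bool)
forbidden[2, 0] = True                       # expert: no edge X_2 -> X_0
required = {(0, 1): 0.3}                     # expert: enforce |W_01| >= 0.3

def project_constraints(W):
    W = W.copy()
    W[forbidden] = 0.0                       # equality constraint W_ij = 0
    for (i, j), thresh in required.items():  # inequality |W_ij| >= thresh
        if abs(W[i, j]) < thresh:
            sign = 1.0 if W[i, j] >= 0 else -1.0
            W[i, j] = sign * thresh
    return W
```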

3. Theoretical Guarantees and Limitations

Guarantees:

  • The acyclicity constraint based on the trace of the matrix exponential or its spectral variants is both necessary and sufficient for DAG structure (Wei et al., 2020, Lee et al., 2019).
  • In the large-sample limit and under correct model specification (e.g., linear or appropriately chosen nonlinear additive noise models), the solution enjoys structure recovery guarantees (Lachapelle et al., 2019, Xu et al., 2022).
  • Adaptive Lasso extensions recover both sparsity and asymptotic normality under specified growth rates for penalty parameters (Xu et al., 2022).

Limitations and Pitfalls:

  • Lack of Scale Invariance:

A central limitation is sensitivity to variable scaling. Both theoretical analysis and empirical demonstrations have shown that rescaling variables (e.g., unit conversion or normalization) can arbitrarily alter the learned DAG, reversing edge orientations or introducing spurious connections. The root cause lies in the least squares loss, which aligns structure selection with the ordering of variable variances (so-called "varsortability") rather than causal mechanisms per se (Kaiser et al., 2021, Seng et al., 2022). Consequently, NOTEARS does not possess the transportability necessary for out-of-domain generalization without preprocessing or a modified loss (e.g., via Mahalanobis distance or noise-variance-aware objectives as in GOLEM).

  • Optimization to Non-Global Minima:

Due to nonconvexity, methods converge only to stationary points. For the quadratic mapping A = W \circ W, gradients vanish at any feasible solution, impeding satisfaction of the KKT optimality conditions except in trivial cases. Alternative formulations (e.g., A = |W| or explicit decomposition into W^+ and W^-) admit meaningful KKT conditions (Wei et al., 2020).

  • Post-hoc Acyclicity and Truncation:

Solutions from gradient-based optimization may not be strictly acyclic. Heuristic truncation of small weights can invalidate or perturb optimality. MILP-based postprocessors (e.g., DAGs with Tears) offer systematic edge removal to achieve acyclicity with minimal loss penalty, and can further encode domain or logical constraints (Chen et al., 2023).

  • Hyperparameter Sensitivity:

Edge recovery is highly sensitive to the regularization parameter \lambda. Small changes can substantially affect estimated graphs. Two-level stability selection (bootstrapping over datasets and sweeping \lambda) retains only edges stable across both levels, improving reliability in downstream causal inference (e.g., in post-transcriptional regulation studies) (Martos et al., 14 Oct 2025).
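
A sketch of this two-level selection is shown below, assuming a generic notears_fit(X, lam) callable as a placeholder estimator (not a real API) that returns a weighted adjacency matrix.

```python
# Sketch: two-level stability selection over bootstrap resamples and a lambda grid.
import numpy as np

def stable_edges(X, notears_fit, lambdas, n_boot=20, freq=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    counts = np.zeros((d, d))
    total = 0
    for _ in range(n_boot):
        Xb = X[rng.integers(0, n, size=n)]      # bootstrap resample of rows
        for lam in lambdas:
            W = notears_fit(Xb, lam)            # placeholder estimator
            counts += (np.abs(W) > 1e-8)
            total += 1
    return counts / total >= freq               # boolean adjacency of stable edges
```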

4. Applications and Empirical Results

NOTEARS has been successfully deployed in diverse settings:

  • Causal Inference in Biology:

In transcriptomics, NO-BEARS and extensions reconstruct gene regulatory networks with improved average precision and scalability (Lee et al., 2019). NOTEARS is used to map post-transcriptional regulatory timelines in Arabidopsis, integrating missing data imputation (via EM), stability selection, and expert input (Martos et al., 14 Oct 2025).

  • Neuroscience and Federated Learning:

NOTEARS-PFL applies federated optimization to learn site-personalized brain networks in major depressive disorder, using group fused Lasso penalties to share information across multiple centers without sharing raw data (Liu et al., 2023).

  • Dynamic and Time Series Models:

NTS-NOTEARS leverages 1D CNNs with acyclicity constraints for structure discovery in dynamic Bayesian networks, achieving performance gains in both synthetic and real-world sequential data (Sun et al., 2021). Online variants employing sequential regression and adapted acyclicity constraints have been proposed for wireless networks, including explicit theoretical bounds on detection delay for structural events (Giwa, 24 May 2025).

  • Urban and Infrastructure Optimization:

Integrating empirical charging data and urban features, NOTEARS informs EV charging station placement strategies, linking high-probability demand to proximity of amenities and traffic, and embedding these findings in optimization-based siting models (Junker et al., 21 Mar 2025).

Empirical comparisons consistently show that NOTEARS and its descendants achieve low SHD and SID on synthetic additive noise models, and outperform or compete with greedy search and other score-based approaches on real data. Notable improvements in computation time (via spectral or low-rank methods) and accuracy (via nonlinear models or postprocessing) have been established.

5. Practical Implementation Considerations

  • Optimization:

The base algorithm employs augmented Lagrangian updates to enforce acyclicity, with gradient-based methods (e.g., RMSprop, Adam) and line search for step-size selection. Convergence monitoring includes the residual norm and the acyclicity violation. For nonlinear models (GraN-DAG), neural network parameters are fitted per variable.
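
As a rough illustration of these updates, the sketch below runs the augmented Lagrangian outer loop for the linear case with a naive fixed-step inner gradient loop; the step size, iteration counts, penalty schedule, and final threshold are illustrative choices, and practical implementations use a proper inner solver (e.g., quasi-Newton) rather than this loop.

```python
# Sketch: augmented Lagrangian outer loop for linear NOTEARS (illustrative only).
import numpy as np
from scipy.linalg import expm

def notears_linear_sketch(X, lam=0.1, max_outer=15, h_tol=1e-8, rho_max=1e16):
    n, d = X.shape
    W = np.zeros((d, d))
    rho, alpha, h_prev = 1.0, 0.0, np.inf
    for _ in range(max_outer):
        for _ in range(300):                       # naive fixed-step inner loop
            E = expm(W * W)
            h = np.trace(E) - d
            grad_h = E.T * (2 * W)                 # gradient of the acyclicity term
            grad_loss = -X.T @ (X - X @ W) / n     # gradient of (1/2n)||X - XW||_F^2
            grad = grad_loss + (rho * h + alpha) * grad_h + lam * np.sign(W)
            W = W - 1e-3 * grad
        h = np.trace(expm(W * W)) - d
        alpha += rho * h                           # dual (multiplier) update
        if h > 0.25 * h_prev and rho < rho_max:    # tighten penalty if progress stalls
            rho *= 10.0
        h_prev = h
        if h <= h_tol:
            break
    W[np.abs(W) < 0.3] = 0.0                       # common post-hoc thresholding
    return W
```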

  • Postprocessing:

KKT-based and MILP methods are adopted for correcting non-acyclic outputs or enforcing prior knowledge. In practice, small weights are handled via data-driven or adaptive thresholding rather than fixed constants.

  • Scalability:

For large d, GPU acceleration and low-rank decomposition are indispensable. Methods such as NO-BEARS and LoRAM reduce both memory footprint and computational complexity (O(d^2) vs. O(d^3)) and have been validated on datasets with thousands of nodes.

  • Hyperparameter Selection:

Stability selection strategies, including bootstrapping and cross-validated \lambda grids, mitigate sensitivity to tuning and extract robust edges in the DAG.

  • Integration of Prior Knowledge:

Human-in-the-loop procedures and logical constraints can be imposed as mixed integer or equality/inequality restrictions, systematically biasing structure estimation toward expert knowledge (Chowdhury et al., 2023, Chen et al., 2023).

6. Controversies and Open Problems

The most prominent controversy is the lack of scale invariance and vulnerability to variance manipulation—demonstrably resulting in arbitrary DAG selection with simple rescaling attacks (Kaiser et al., 2021, Seng et al., 2022). This limitation fundamentally challenges the suitability of NOTEARS for identifying true underlying causal relations without preprocessing or alternative loss formulations. Approaches such as GOLEM, which adopt noise-variance-aware scores, and Mahalanobis-based penalization offer partial remedies but do not fully resolve the issue.

Another open direction is improving theoretical guarantees under model misspecification and high-dimensional, low-sample regimes, as well as scaling to even larger graphical structures with minimal accuracy loss. Integrating transportability and domain-invariant constraints into the differentiable learning pipeline remains an active research area (Berrevoets et al., 2022).

7. Future Directions

  • Broader adoption of scalable approximations (e.g., low-rank, spectral, and randomized algorithms) is expected to further increase the practical range of NOTEARS-based methods in genomics, systems biology, and infrastructure planning.
  • Advancing robust and stable loss formulations, including integrating variance-normalized or non-square-loss functions, is essential to achieve scale-invariant and transportable causal discovery.
  • Enhanced optimization approaches, including meta-learning for hyperparameter tuning and hybrid discrete-continuous optimization (e.g., bi-level topological swaps), are likely to push the frontier of both accuracy and efficiency.
  • Cross-disciplinary applications, increased integration with domain knowledge, and incorporation of nonparametric and non-Gaussian models are poised to extend the impact and flexibility of the NOTEARS paradigm.

In summary, the NOTEARS framework has established the differentiable, continuous optimization paradigm as a central approach for DAG structure learning in high-dimensional, complex systems. Its strengths—computational efficiency, extensibility to nonlinearity, and compatibility with deep learning—are counterbalanced by the need for careful regularization, scale normalization, and robust postprocessing. Ongoing methodological innovations continue to address its limitations and broaden its applicability across domains and data modalities.
