Smoothing the Landscape: Causal Structure Learning via Diffusion Denoising Objectives
Published 2 Apr 2026 in cs.LG and stat.ML | (2604.02250v1)
Abstract: Understanding causal dependencies in observational data is critical for informing decision-making. These relationships are often modeled as Bayesian Networks (BNs) and Directed Acyclic Graphs (DAGs). Existing methods, such as NOTEARS and DAG-GNN, often face issues with scalability and stability in high-dimensional data, especially when there is a feature-sample imbalance. Here, we show that the denoising score matching objective of diffusion models can smooth the gradients for faster, more stable convergence. We also propose an adaptive k-hop acyclicity constraint that improves runtime over existing solutions that require matrix inversion. We name this framework Denoising Diffusion Causal Discovery (DDCD). Unlike generative diffusion models, DDCD uses the reverse denoising process to infer a parameterized causal structure rather than to generate data. We demonstrate the competitive performance of DDCD on synthetic benchmarking data and show its practical utility through qualitative analyses of two real-world examples. Code is available at https://github.com/haozhu233/ddcd.
The paper introduces a diffusion denoising objective that smooths gradients to enhance the stability of causal structure learning.
It employs an adaptive k-hop acyclicity constraint that dramatically reduces computational costs while maintaining acyclicity.
Empirical results show improved structure recovery and scalability on both synthetic benchmarks and real-world clinical datasets.
Diffusion Denoising Objectives for Scalable, Stable Causal Structure Learning
Introduction
The landscape of causal structure learning in observational settings, critical for domains like biology, medicine, and econometrics, has seen substantial advances through continuous optimization approaches applied to Bayesian Networks (BNs) and Directed Acyclic Graphs (DAGs). However, exact acyclicity enforcement (e.g., as in NOTEARS) entails O(d^3) computational complexity that impedes scalability, and practical convergence can be unstable in high-dimensional or low-sample regimes. The paper "Smoothing the Landscape: Causal Structure Learning via Diffusion Denoising Objectives" (2604.02250) introduces a principled approach in which the denoising score-matching objective from diffusion generative models is repurposed for parameterized causal inference. The resulting Denoising Diffusion Causal Discovery (DDCD) framework delivers scalable, stable structure discovery with theoretical guarantees and robust empirical results.
Methodology
Denoising Objective for Causal Structure Learning
Central to the DDCD approach is the adaptation of the denoising score-matching objective from DDPMs. Rather than generating synthetic data, the reverse process is employed to directly parameterize the adjacency matrix W of the underlying SEM, learning to denoise perturbed observations. For a linear SEM, the denoising objective is algebraically equivalent (Theorem 1 in the paper) to the standard SEM reconstruction loss under typical assumptions: causal sufficiency, i.i.d. sampling, acyclicity, an additive noise model, error independence, and faithfulness. Thus, optimizing the denoising loss yields valid causal structure estimates while simultaneously regularizing the loss landscape.
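To make the equivalence concrete, here is a minimal one-noise-level sketch, assuming additive Gaussian perturbations $\tilde{X} = X + \sigma E$ with i.i.d. standard-normal entries in $E$ (an illustrative simplification; the paper's Theorem 1 covers its full multi-level objective):

$$\mathbb{E}_{E}\left[\frac{1}{2n}\lVert \tilde{X} - \tilde{X}W \rVert_F^2\right] = \frac{1}{2n}\lVert X - XW \rVert_F^2 + \frac{\sigma^2}{2}\lVert I - W \rVert_F^2,$$

since the cross term vanishes in expectation and $\mathbb{E}\lVert E(I-W)\rVert_F^2 = n\lVert I-W\rVert_F^2$. The first term is exactly the standard reconstruction loss; the second is a quadratic penalty that vanishes as $\sigma \to 0$ and accounts for the smoothing effect discussed next.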
In practice, perturbing the data at multiple noise levels during training smooths the loss surface, in the spirit of randomized smoothing. This regularization provably bounds the gradient Lipschitz constant, mitigating the sharp local minima that plague direct optimization, especially in data-starved regimes.
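A minimal NumPy sketch of this multi-noise-level perturbation (the noise levels and the specific loss form here are illustrative assumptions, not the paper's exact schedule):

```python
import numpy as np

def denoising_sem_loss(X, W, sigmas, rng):
    """Monte Carlo estimate of the smoothed linear-SEM loss.

    Averaging the reconstruction loss over noisy copies of X at several
    noise levels approximates the randomized-smoothing objective above.
    """
    n, d = X.shape
    total = 0.0
    for sigma in sigmas:
        X_tilde = X + sigma * rng.standard_normal((n, d))  # perturbed data
        residual = X_tilde - X_tilde @ W                   # linear-SEM residual
        total += 0.5 / n * np.linalg.norm(residual) ** 2   # Frobenius norm
    return total / len(sigmas)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
W = np.zeros((10, 10))
print(denoising_sem_loss(X, W, sigmas=[0.1, 0.5, 1.0], rng=rng))
```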
Adaptive k-hop Acyclicity Constraint
A primary computational bottleneck in continuous DAG learning is the enforcement of acyclicity, which classically requires a matrix exponential. DDCD introduces an adaptive k-hop constraint that replaces the acyclicity term with sums of traces of low powers of W. By annealing k from a small value (capturing local cycles) early in optimization toward k = d (full global acyclicity) as training converges, the method substantially reduces runtime without sacrificing theoretical validity. Empirically, this curriculum yields roughly an order-of-magnitude speedup over always enforcing global acyclicity, with identical acyclicity satisfaction in the final graphs.
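A compact sketch of one way to implement the truncated constraint and its schedule (the exact trace-of-powers form and the linear annealing of k are assumptions for illustration):

```python
import numpy as np

def k_hop_acyclicity(W, k):
    """Sum of traces of the first k powers of A = W * W (elementwise).

    tr(A^i) aggregates weighted directed cycles of length i, so this
    penalizes all cycles up to length k; k = d checks global acyclicity.
    """
    A = W * W
    P = A.copy()
    h = np.trace(P)              # length-1 cycles (self-loops)
    for _ in range(k - 1):
        P = P @ A                # next power of A
        h += np.trace(P)
    return h

def annealed_k(step, total_steps, d, k0=2):
    """Grow k linearly from k0 toward d over the course of training."""
    frac = step / max(total_steps - 1, 1)
    return int(round(k0 + frac * (d - k0)))
```

Each evaluation costs at most k - 1 matrix products, so early iterations with small k avoid the full matrix exponential (or matrix inverse) that dominates the cost of classical constraints.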
Permutation-Invariant Batch Sampling
Another computational innovation is the adoption of fixed-size, permutation-invariant mini-batch re-sampling for gradient updates, formalized under the Deep Sets framework. This sampling decouples convergence behavior and batch complexity from the total number of samples, thus enhancing stability and scalability and supporting applications with highly unequal sample and feature dimensions.
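A minimal sketch of the fixed-size re-sampling step (uniform sampling with replacement is an illustrative choice; the paper's Deep Sets formalization is not reproduced here):

```python
import numpy as np

def resample_batch(X, batch_size, rng):
    """Draw a fixed-size batch of rows uniformly with replacement.

    Under i.i.d. sampling the rows are exchangeable, so a loss averaged
    over such a batch is invariant to permutations of the dataset, and
    the per-step cost depends on batch_size rather than the sample count.
    """
    idx = rng.integers(0, X.shape[0], size=batch_size)
    return X[idx]
```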
Model Variants
DDCD Linear: Directly parameterizes W for linear SEMs using the denoising objective and adaptive k-hop constraint.
DDCD Nonlinear: Extends denoising objectives to nonlinear SEMs using a latent autoencoding architecture, incorporating denoising objectives in latent space inference. This is analogous to combining latent diffusion and variational autoencoders under explicit SEM structure priors.
DDCD Smooth: Addresses the "varsortability" artifact by enforcing feature-wise normalization and learning a normalized adjacency matrix (see the sketch after this list). This is crucial for real-world tabular data, where scale heterogeneity can induce spurious directionality in most continuous-optimization approaches (as demonstrated in [reisach2021beware]).
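For the normalization behind DDCD-Smooth, a minimal sketch (plain feature-wise z-scoring is shown as an illustration; the paper's normalized-adjacency parameterization may differ):

```python
import numpy as np

def standardize_features(X, eps=1e-8):
    """Z-score each column so no variable's marginal scale dominates.

    Varsortability artifacts arise when marginal variances leak the
    causal order; removing scale information forces edge orientation
    to rely on dependence structure rather than variance.
    """
    mu = X.mean(axis=0, keepdims=True)
    sd = X.std(axis=0, keepdims=True)
    return (X - mu) / (sd + eps)
```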
Empirical Results
Optimization Stability and Convergence
Empirical analysis shows that DDCD's denoising objective dramatically smooths gradients and accelerates convergence, especially in low-sample or high-dimensional conditions where previous approaches frequently fail to escape poor local optima. The method reduces L-BFGS-B optimizer iterations by an order of magnitude relative to NOTEARS-Linear for identical acyclicity/SEM losses, with stabilized gradient norms throughout optimization.
Structure Recovery and Scalability
On synthetic benchmarks (scale-free and Erdős-Rényi DAGs, up to 2000 nodes and nonlinear mechanisms), DDCD-Linear and DDCD-Nonlinear match or exceed the performance of current state-of-the-art methods (DAGMA, GOLEM, DAG-GNN, etc.) in both structure recovery (TPR, FDR, SHD) and runtime. Whereas traditional methods require minutes to hours for large graphs, DDCD models complete 5000 optimization steps on 2000-node graphs in under 6 minutes on a GPU, while competitors often do not finish within several hours, demonstrating superior computational scaling. Notably, the k-hop acyclicity curriculum accounts for nearly 90% of the runtime reduction compared to full global acyclicity computation.
In structure recovery on nonlinear SEMs, DDCD-Nonlinear achieves the lowest error under smooth (e.g., cosine, sine) and activation-function-driven relationships in most cases, though quadratic and sigmoid nonlinearities remain challenging for all methods. DDCD-Nonlinear recovers not only the structure but also accurate transformation functions, supporting joint structure-function learning.
Real-World Analysis
On clinical data (myocardial infarction, aging cohorts), DDCD-Smooth induces meaningful higher-order clusters in the inferred causal graphs, conforming to domain knowledge (e.g., causal relations among myocardial rupture, cardiogenic shock, and lethal outcome). By contrast, NOTEARS frequently produces implausible hub-and-spoke structures concentrated on a few central nodes. DDCD-Smooth reduces spurious edges due to scale artifacts and delivers sparser, more interpretable models. Inferred edge directions are often correct (though not always), and the normalized formulation exhibits robustness to feature-scale heterogeneity absent in prior approaches.
Theoretical and Practical Implications
The core theoretical implication is that gradient smoothing induced by denoising-style objectives leads to significant improvements in both stability and convergence speed for continuous DAG structure learning, especially in challenging statistical regimes. The equivalence to the standard loss ensures identifiability is preserved. Moreover, the decoupling of batch/sample size from computation and the curriculum approach for acyclicity effectively address limitations of prior large-scale approaches.
Practically, the reduced runtime and improved stability broaden the application of causal discovery to high-dimensional, data-impoverished, or noisy tabular datasets with complex (nonlinear, heterogeneously scaled) relationships. Biomedical and healthcare applicationsโlike EHR data analysis or gene regulatory network inferenceโstand to benefit substantially. The approach is extendable to nonparametric SEMs, mixed data types, and settings with latent variables.
Future Directions
Possible extensions include incorporating prior knowledge in penalty terms (e.g., for fairness constraints), exploring more expressive variational objectives in the denoising process for improved nonlinear identifiability, and integrating interventions. The adaptive acyclicity schedule offers a platform for further algorithmic innovations in structure learning. The framework also invites more rigorous analysis of denoising objectives' impacts on identifiability and generalization, especially in the presence of unfaithfulness or latent confounding.
Conclusion
This work formalizes and validates the use of diffusion denoising objectives for scalable, stable causal structure learning under continuous optimization. The combination of randomized smoothing, adaptive acyclicity constraints, and normalized loss surfaces yields strong empirical and theoretical performance across synthetic and real-world datasets. The DDCD framework significantly advances the frontier for efficient, interpretable, and robust causal discovery in high-dimensional observational regimes (2604.02250).