Posterior DAG Selection Consistency
- The paper demonstrates that Bayesian procedures asymptotically concentrate on the true causal DAG—up to its distribution equivalence class—by employing complexity-penalizing priors and leveraging non-Gaussian error structures.
- Bayesian hierarchical models using Laplace-based likelihoods and scale mixtures of Gaussians underpin the methodology, enabling accurate separation of competing equivalence classes.
- Simulation studies confirm that tailored non-local priors and risk-based Bayes factor analyses yield robust selection consistency even amidst model misspecification.
Posterior DAG selection consistency refers to the property that as sample size increases, the Bayesian posterior probability assigned to the correct data-generating directed acyclic graph (DAG)—or the appropriate equivalence class of DAGs—converges to unity. This concept is foundational for justifying Bayesian causal discovery and structure learning procedures in high-dimensional settings. The fundamental goal is to ensure that, with increasing data, the Bayesian procedure asymptotically selects the true causal structure or the maximal set of structures that can be identified from observational data, subject to inherent identifiability constraints.
1. Bayesian Hierarchical Models for Causal Structure Learning
A central framework throughout the literature is the Bayesian hierarchical model for linear recursive structural equation models (SEMs) with independent errors. Each variable is modeled as a linear function of its parent variables in a DAG plus a stochastic noise term, $X_j = \sum_{k \in \mathrm{pa}_\gamma(j)} \beta_{jk} X_k + \epsilon_j$, where $\mathrm{pa}_\gamma(j)$ denotes the parent set of node $j$ in DAG $\gamma$, and the $\beta_{jk}$ are regression coefficients. The posterior over DAG structures (and, often, associated regression and error parameters) is induced by placing priors—potentially empirical or complexity-penalizing—on both the structure and the parameters. The likelihood is typically formed under a working error model, such as the Laplace distribution for analytical tractability, even though the true errors may be more general (Chaudhuri et al., 1 Aug 2025).
This hierarchical specification underlies the calculation of posterior probabilities for individual DAGs, accommodates marginalization over uncertainty in regression effects and error variances, and supports the essential mechanism for distinguishing among competing DAG structures as $n$ increases.
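A minimal sketch of this setup, assuming a dictionary-of-parents DAG encoding and fixed plug-in coefficients (the helper names `simulate_sem` and `laplace_working_loglik` are illustrative, not the paper's implementation; a full Bayesian treatment would place priors on the coefficients and marginalize them out):

```python
import numpy as np

def simulate_sem(parents, coefs, n, rng, scale=1.0):
    """Draw n i.i.d. samples from a linear recursive SEM with Laplace errors.

    parents: dict node -> tuple of parents; keys 0..p-1 assumed topologically ordered
    coefs:   dict node -> array of regression coefficients, one per parent
    """
    p = len(parents)
    X = np.zeros((n, p))
    for j in sorted(parents):
        eps = rng.laplace(scale=scale, size=n)   # independent stochastic noise
        pa = list(parents[j])
        X[:, j] = (X[:, pa] @ coefs[j] + eps) if pa else eps
    return X

def laplace_working_loglik(X, parents, coefs, scale=1.0):
    """Log-likelihood under the Laplace working error model (plug-in parameters)."""
    n = X.shape[0]
    ll = 0.0
    for j, pa in parents.items():
        resid = X[:, j] - (X[:, list(pa)] @ coefs[j] if pa else 0.0)
        ll += -n * np.log(2 * scale) - np.abs(resid).sum() / scale
    return ll

rng = np.random.default_rng(0)
parents = {0: (), 1: (0,), 2: (0, 1)}                 # DAG: 0 -> 1, {0,1} -> 2
coefs = {0: np.array([]), 1: np.array([0.8]), 2: np.array([0.5, -0.7])}
X = simulate_sem(parents, coefs, n=2000, rng=rng)
print(laplace_working_loglik(X, parents, coefs))
```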
2. Role of Non-Gaussian Errors and Identifiability
Posterior selection consistency is fundamentally limited by identifiability constraints. In classical Gaussian SEMs, DAGs are generally identifiable only up to their Markov equivalence class. However, if some or all error variables are non-Gaussian—more precisely, if the true error law is a scale mixture of Gaussians with a nondegenerate mixing distribution—then additional information becomes available. For each non-Gaussian error node $j$ (collected in an index set $\mathcal{N}$), the set of parent nodes must be preserved by any distributionally equivalent DAG, and the observed joint distribution can be used to refine the equivalence class correspondingly (Chaudhuri et al., 1 Aug 2025).
The combination of linear structure and non-Gaussian errors allows Bayesian procedures to consistently identify the true DAG up to the "distribution equivalence class," which encodes both Markov properties and the locations of non-Gaussian error nodes. This is a strict refinement of the usual Markov equivalence class and is the most one can hope to recover from observational data in this setting.
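As a concrete instance of the scale-mixture construction (parameter choices here are illustrative), the snippet below draws a nondegenerate Exponential mixing variable and checks that the resulting Gaussian scale mixture reproduces the heavy tails of a Laplace law:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200_000

V = rng.exponential(scale=2.0, size=n)     # nondegenerate mixing distribution (variance)
eps_mix = rng.normal(0.0, np.sqrt(V))      # scale mixture of Gaussians
eps_lap = rng.laplace(scale=1.0, size=n)   # Laplace(0, 1): the Exponential(mean 2) mixture

# Excess kurtosis: 0 for a Gaussian, 3 for a Laplace; the mixture matches Laplace.
print(stats.kurtosis(eps_mix), stats.kurtosis(eps_lap))
```

A degenerate mixing distribution (constant $V$) recovers the purely Gaussian case, in which only the Markov equivalence class is identifiable.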
3. Distribution Equivalence Class and Its Characterization
The distribution equivalence class, denoted $\mathcal{D}(\gamma^*, \mathcal{N}^*)$, is the set of all pairs $(\gamma, \mathcal{N})$ of a DAG and a non-Gaussian index set such that $(\gamma, \mathcal{N})$ yields the same joint observational distribution as the true pair $(\gamma^*, \mathcal{N}^*)$. The necessary and sufficient conditions for $(\gamma, \mathcal{N}) \in \mathcal{D}(\gamma^*, \mathcal{N}^*)$ are:
- The parent set of each non-Gaussian error node $j \in \mathcal{N}^*$ must coincide in $\gamma$ and $\gamma^*$.
- The Markov properties (conditional independence structure) must match.
This characterization sharply delineates the target of posterior concentration and justifies why, even with infinite data, there may be multiple modes in the Bayesian posterior corresponding exactly to the elements of $\mathcal{D}(\gamma^*, \mathcal{N}^*)$ (Chaudhuri et al., 1 Aug 2025).
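A minimal sketch of checking these two conditions for a pair of candidate DAGs, assuming a dictionary-of-parents encoding and using the standard skeleton-plus-v-structure test for Markov equivalence (helper names are illustrative):

```python
from itertools import combinations

def skeleton(parents):
    """Undirected adjacencies of the DAG."""
    return {frozenset((j, k)) for j, pa in parents.items() for k in pa}

def v_structures(parents):
    """Unshielded colliders a -> j <- b with a and b non-adjacent."""
    skel = skeleton(parents)
    return {(frozenset((a, b)), j)
            for j, pa in parents.items()
            for a, b in combinations(pa, 2)
            if frozenset((a, b)) not in skel}

def distribution_equivalent(parents1, parents2, nongaussian_nodes):
    markov_eq = (skeleton(parents1) == skeleton(parents2)
                 and v_structures(parents1) == v_structures(parents2))
    parents_match = all(set(parents1[j]) == set(parents2[j])
                        for j in nongaussian_nodes)
    return markov_eq and parents_match

# 0 -> 1 and 1 -> 0 are Markov equivalent, but a non-Gaussian error at node 1
# pins down that node's parent set and so distinguishes the two orientations.
g1 = {0: (), 1: (0,)}
g2 = {0: (1,), 1: ()}
print(distribution_equivalent(g1, g2, nongaussian_nodes=[]))   # True
print(distribution_equivalent(g1, g2, nongaussian_nodes=[1]))  # False
```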
4. Non-standard Complexity Priors and Posterior Concentration
Bayesian model selection in the presence of distribution or risk equivalence classes demands special attention to prior specification. When $\mathcal{D}_{\min}$—the set of minimal-complexity DAGs with optimal fit—contains more than one element, a uniform prior will asymptotically split posterior mass among them. To prevent the posterior from allocating substantial mass to models with spurious extra edges, the introduction of non-local or complexity-penalizing priors is essential.
A typical form is $\pi_n(\gamma) \propto \exp\{-a_n\,|\gamma|\}$, where $|\gamma|$ is the number of edges in $\gamma$, $a_n \to \infty$, and $a_n/n$ is suitably bounded. This prior places an exponentially increasing penalty on the number of edges, thus ensuring that, asymptotically, only the minimal-size DAGs in the equivalence class—the elements of $\mathcal{D}_{\min}$—retain substantive posterior probability (Chaudhuri et al., 1 Aug 2025). In cases where $\mathcal{D}_{\min}$ is a singleton, uniform or mild penalization suffices.
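A minimal numerical sketch of how such a prior resolves ties among risk-equivalent fits, assuming the illustrative choice $a_n = \log n$ (growing, yet $o(n)$) and toy log-likelihood values:

```python
import numpy as np

def log_posterior_over_dags(log_liks, edge_counts, a_n):
    """Posterior over candidate DAGs with log-prior = -a_n * (#edges)."""
    scores = np.asarray(log_liks) - a_n * np.asarray(edge_counts)
    scores -= scores.max()                  # stabilize before exponentiating
    probs = np.exp(scores)
    return probs / probs.sum()

n = 10_000
a_n = np.log(n)
# A minimal DAG and a risk-equivalent supergraph with one spurious extra edge:
log_liks = [-14000.0, -13999.5]             # nearly identical fit (toy numbers)
edges = [3, 4]
print(log_posterior_over_dags(log_liks, edges, a_n))  # mass concentrates on the minimal DAG
```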
5. Asymptotic Theory: Bayesian Selection Consistency
Theoretical guarantees are established by a precise Bayes factor analysis. The log Bayes factor comparing a true minimal DAG $\gamma^*$ and a competing DAG $\gamma$ exhibits distinct behavior depending on the risk difference $\delta_\gamma$ and the edge-number difference $\psi_\gamma$: $\log \mathrm{BF}(\gamma^*, \gamma) = \begin{cases} \frac{\psi_\gamma}{2} \log n + O_p(1) & \text{if } \gamma \text{ is a supergraph of } \gamma^*, \\ n\,\delta_\gamma + \frac{\psi_\gamma}{2} \log n + O_p(\sqrt{n}) & \text{otherwise.} \end{cases}$ Any $\gamma$ with a different risk is exponentially unattractive; among risk-equivalent graphs, the prior penalizes larger edge sets, ensuring that the posterior ratio for non-minimal elements of $\mathcal{D}(\gamma^*, \mathcal{N}^*)$ vanishes (Chaudhuri et al., 1 Aug 2025).
Subject to mild conditions (finite second moments for the error mixing variables, mild faithfulness), the posterior probability assigned to $\mathcal{D}_{\min}$ converges to 1. This posterior DAG selection consistency holds even under severe model misspecification (Laplace working model vs. true scale mixture of Gaussians).
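The growth rates in the two regimes can be made concrete with illustrative constants ($\psi_\gamma = 1$ extra edge, risk gap $\delta_\gamma = 0.05$):

```python
import numpy as np

psi, delta = 1, 0.05    # illustrative edge-count difference and risk gap
for n in [10**2, 10**4, 10**6]:
    log_bf_supergraph = (psi / 2) * np.log(n)             # + O_p(1) term omitted
    log_bf_risk_gap = n * delta + (psi / 2) * np.log(n)   # + O_p(sqrt(n)) term omitted
    print(f"n={n:>7}: supergraph ~ {log_bf_supergraph:6.1f}, risk gap ~ {log_bf_risk_gap:10.1f}")
```

The logarithmic rate against risk-equivalent supergraphs is exactly what the complexity-penalizing prior reinforces, while the linear rate against risk-inferior graphs needs no prior assistance.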
6. Simulation Studies and Practical Consequences
Simulations covering both identifiability regimes verify these theoretical results (a toy version of the singleton-regime experiment is sketched after this list):
- When the distribution equivalence class is a singleton, the posterior probability of the true DAG (under a uniform prior) converges sharply to one.
- When the equivalence class contains multiple minimal DAGs, non-local priors with heavy complexity penalty concentrate the posterior on the minimal-size elements, suppressing larger (overfitted) graphs.
- The empirical distribution of posterior mass among risk-equivalent structures accords with theoretical predictions (e.g., Bernoulli(1/2) splitting when exactly two minimal DAGs are risk-equivalent).
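A toy version of the singleton-regime experiment, assuming per-node L1 (median-regression) plug-in fits under the Laplace working model and a uniform prior over the two orientations of a bivariate DAG; all helper names and fitting choices are illustrative:

```python
import numpy as np

def node_l1_score(y, X_pa):
    """Maximized Laplace working log-likelihood for one node (scale profiled out)."""
    if X_pa is None:
        resid = y - np.median(y)
    else:
        beta = np.linalg.lstsq(X_pa, y, rcond=None)[0]
        for _ in range(50):                 # crude L1 fit via reweighted least squares
            w = 1.0 / np.maximum(np.abs(y - X_pa @ beta), 1e-6)
            WX = X_pa * w[:, None]
            beta = np.linalg.solve(X_pa.T @ WX, WX.T @ y)
        resid = y - X_pa @ beta
    b = np.abs(resid).mean()                # MLE of the Laplace scale
    return -len(y) * (np.log(2 * b) + 1)

rng = np.random.default_rng(2)
for n in [100, 1000, 10000]:
    x0 = rng.laplace(size=n)
    x1 = 0.8 * x0 + rng.laplace(size=n)     # true DAG: 0 -> 1, both errors non-Gaussian
    s_fwd = node_l1_score(x0, None) + node_l1_score(x1, x0[:, None])
    s_rev = node_l1_score(x1, None) + node_l1_score(x0, x1[:, None])
    post_true = 1.0 / (1.0 + np.exp(s_rev - s_fwd))
    print(f"n={n:>6}: posterior(0 -> 1) = {post_true:.3f}")
```

Because both errors are non-Gaussian, the distribution equivalence class is a singleton here, and the posterior for the true orientation should approach one as $n$ grows.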
These findings validate the use of misspecified working models, provided the priors and selection procedure respect the identifiability structure and appropriately penalize unnecessary edge complexity. The approach extends to settings with arbitrarily many variables and complex error distributions, provided the required moment and minimal-signal conditions are satisfied.
7. Implications for Bayesian Causal Discovery
The established posterior DAG selection consistency under general error distributions provides rigorous support for Bayesian causal discovery procedures in high-dimensional and potentially non-Gaussian situations. Even with model misspecification, the combination of Laplace-based working models, flexible scale-mixture error assumptions, and complexity-sensitive priors ensures consistent recovery—up to the sharpest identifiability limit achievable from observational data—of the underlying DAG structure (Chaudhuri et al., 1 Aug 2025).
The analysis elucidates both the strengths (robustness, broad applicability across error distributions, strong consistency, minimal prior requirements in uniquely identifiable regimes) and the need for careful design (non-uniform priors to resolve non-uniqueness) in real-world applications. Future work may extend these ideas to non-linear SEMs, partial observation settings, or fully nonparametric models, building on the demonstrated consistency properties in the linear regime with independent errors.