Non-Standard DAG Priors

Updated 6 August 2025
  • Non-standard DAG priors are distributions over Directed Acyclic Graphs that depart from traditional conjugate families to better capture sparsity and structural constraints.
  • They employ techniques such as DAG-Wishart, P-Dirichlet, and minimal Hoppe–Beta to enable node-specific flexibility and efficient inference in complex models.
  • These priors enhance Bayesian structure learning by boosting model selection accuracy and incorporating expert domain knowledge in high-dimensional settings.

A non-standard DAG prior is any prior distribution over the space of Directed Acyclic Graphs (DAGs), their parameters, or both, that is not a direct instantiation of traditional, conjugate, or modular families such as the normal-Wishart, hyper Dirichlet, or uniform priors over graph structures. The study and development of non-standard DAG priors is driven by the limitations of classical priors in encoding sparsity, modular independence, structural constraints, or adaptation to high-dimensional or non-Gaussian settings. Advances in this area have yielded theoretical guarantees, efficient inference methods, and greater adaptability to domain knowledge across Gaussian and discrete models, and have enabled robust Bayesian structure learning in complex data regimes.

1. Conceptual Foundations and Limitations of Standard DAG Priors

Traditional Bayesian DAG modeling uses priors that are selected for their conjugacy, convenience of marginal likelihood computation, and propagation of independence properties. For Gaussian DAGs, the normal-Wishart prior has been shown to be the unique parameter prior for all complete Gaussian DAG models that satisfies global parameter independence and complete model equivalence (i.e., invariance under relabeling and transformation of parameters of equivalent models) (Geiger et al., 2013, Geiger et al., 2021). Explicitly, for a mean vector $p$ and a precision matrix $W$, the joint prior must be

$$p(p, W \mid m) = N(p \mid \mu, \alpha W) \cdot \text{Wishart}(W \mid a, T).$$

Any deviation from this structure—such as introducing hierarchical dependencies, non-conjugate forms, or priors with support on a restricted or extended model class—constitutes a non-standard DAG prior.
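
For reference, the baseline itself is easy to simulate. The following minimal sketch (Python with NumPy/SciPy; the hyperparameter values are illustrative choices, not from any cited paper) draws $(p, W)$ from a normal-Wishart prior, reading $\alpha W$ as the precision of the conditional normal:

```python
import numpy as np
from scipy.stats import wishart, multivariate_normal

rng = np.random.default_rng(0)

# Illustrative hyperparameters (assumptions for this sketch).
d = 3                      # dimension
mu = np.zeros(d)           # prior mean of the mean vector p
alpha = 2.0                # precision scaling of the conditional normal
a = d + 2                  # Wishart degrees of freedom (must exceed d - 1)
T = np.eye(d)              # Wishart scale matrix

# W ~ Wishart(a, T)
W = wishart.rvs(df=a, scale=T, random_state=rng)

# p | W ~ N(mu, (alpha * W)^{-1}): alpha * W is a precision matrix,
# so the covariance of the conditional normal is its inverse.
cov = np.linalg.inv(alpha * W)
p = multivariate_normal.rvs(mean=mu, cov=cov, random_state=rng)

print("sampled mean vector p:", p)
print("sampled precision matrix W:\n", W)
```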

Standard priors are limited in their ability to:

  • Express heterogeneity (assigning per-node or per-edge hyperparameters for increased granularity)
  • Encode structural constraints (e.g., prior beliefs on edge presence, ordering, or sparsity)
  • Reflect expert knowledge not compatible with modular independence or model equivalence
  • Yield robust inference under misspecification or when the space of admissible DAGs is restricted

These limitations motivate the construction of non-standard priors that relinquish certain theoretical properties for practical expressiveness or computational tractability.

2. Non-Standard DAG Priors for Gaussian Directed Acyclic Graph Models

The DAG-Wishart family introduced by Ben-David et al. (2011) extends classical conjugate priors to arbitrary DAGs using the Cholesky parametrization,

$$\Sigma^{-1} = L D^{-1} L^{T},$$

where $D$ is diagonal and $L$ is unit lower triangular with $L_{ij}=0$ when $i$ is not a parent of $j$. The DAG-Wishart prior is given as

$$\pi_{U,\alpha}^{\Theta_D}(L, D) = \frac{1}{z_D(U, \alpha)} \exp\left\{-\frac{1}{2} \text{tr}(L D^{-1} L^{T} U)\right\} \prod_{i=1}^p D_{ii}^{-\frac{1}{2}\alpha_i}.$$

Here, the shape parameter vector $\alpha$ allows for nodewise flexibility that is not present in the classical Wishart prior, whose shape parameter is a single scalar.
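
As a concrete illustration, here is a minimal sketch (our own, not code from the cited paper) that evaluates this density up to the normalizing constant $z_D(U, \alpha)$, with the DAG encoded through the zero pattern of $L$ and illustrative values for $U$ and $\alpha$:

```python
import numpy as np

def dag_wishart_unnorm_logpdf(L, D_diag, U, alpha):
    """Unnormalized log of the DAG-Wishart density:
    -0.5 * tr(L D^{-1} L^T U) - 0.5 * sum_i alpha_i * log D_ii."""
    D_inv = np.diag(1.0 / D_diag)
    quad = np.trace(L @ D_inv @ L.T @ U)
    return -0.5 * quad - 0.5 * np.sum(alpha * np.log(D_diag))

# Illustrative 3-node example: L is unit lower triangular, and its
# off-diagonal zero pattern encodes the DAG's parent structure.
L = np.eye(3)
L[1, 0], L[2, 1] = 0.5, -0.3
D_diag = np.array([1.0, 0.8, 1.2])
U = np.eye(3)                       # scale hyperparameter
alpha = np.array([5.0, 6.0, 7.0])   # nodewise shape parameters

Omega = L @ np.diag(1.0 / D_diag) @ L.T   # implied precision Sigma^{-1}
print(dag_wishart_unnorm_logpdf(L, D_diag, U, alpha))
```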

DAG-Wishart priors possess strong hyper Markov properties, enabling:

  • Independence between local regression coefficients and conditional variances across nodes
  • Closed-form posteriors upon observing Gaussian data: prior $\alpha_i$ updated to $\alpha_i+n$ and prior scale $U$ to $U+nS$ (with $S$ the empirical covariance)
  • Conjugacy for separate regression problems imposed by the DAG, crucial for scalable inference in high dimensions

The flexibility and nodewise indexing of the prior enable explicit control of regularization across nodes and facilitate nodewise decoupling for computational efficiency. This is especially significant in settings with arbitrary (non-decomposable, non-perfect) DAGs where the space of precision or covariance matrices compatible with the DAG is a curved manifold of measure zero. The construction via Cholesky parametrization and appropriate projections allows for densities to be defined with respect to Lebesgue measure.
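
The closed-form conjugate update noted above is correspondingly direct; a minimal sketch, assuming $n$ zero-mean observations stacked row-wise so that $S = X^\top X / n$:

```python
import numpy as np

def dag_wishart_posterior(U, alpha, X):
    """Conjugate update for the DAG-Wishart prior given Gaussian data:
    each alpha_i becomes alpha_i + n and the scale U becomes U + n*S."""
    n = X.shape[0]
    S = (X.T @ X) / n            # empirical covariance (zero-mean data assumed)
    return U + n * S, alpha + n

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))            # stand-in data
U_post, alpha_post = dag_wishart_posterior(np.eye(3), np.full(3, 5.0), X)
```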

3. Non-Standard Priors for Discrete DAGs and Restricted Structural Classes

The P-Dirichlet family, developed for discrete models Markov with respect to a family $\mathcal{P}$ of DAGs with skeleton (undirected structure) given by a decomposable graph $G$ (Massam et al., 2014), generalizes the hyper Dirichlet by allowing greater hyperparameter flexibility and by encoding edge-direction constraints imposed by practitioners. For a set of allowed DAGs $\mathcal{P}$, the P-Dirichlet is defined by placing independent Dirichlet priors on the conditional cell probability vectors for each node and every parent configuration, but only for those DAGs in $\mathcal{P}$. The parameter space involves collections $\mathcal{Q}$ and $\mathcal{P}$, which extend the usual cliques and separators, giving a higher-dimensional hyperparameterization and thus more flexible prior modeling.

The moments of the P-Dirichlet are derived in closed form in terms of its hyperparameters,

$$\mathbb{E}\left[\prod_{i\in I} p(i)^{r_i}\right] = \frac{\prod_{A \in \mathcal{Q}} \prod_{m \in I_A} (\nu^A_m)^{r^A_m}}{\prod_{B \in \mathcal{P}} \prod_{n \in I_B} (\mu^B_n)^{r^B_n}},$$

allowing characterization and practical elicitation even without the existence of a density. The P-Dirichlet accommodates model spaces where only a subset of DAGs respecting user-imposed directional constraints is plausible, and the hyper Markov property is preserved within this expanded flexibility.
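
The building block of this construction, one independent Dirichlet prior per node and parent configuration, can be sketched as follows (hypothetical three-node binary DAG and a symmetric concentration parameter, both our own choices; the full P-Dirichlet additionally restricts the admissible DAGs to $\mathcal{P}$ and expands the hyperparameterization):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)

# Hypothetical discrete DAG: A -> C <- B, each variable binary.
cardinality = {"A": 2, "B": 2, "C": 2}
parents = {"A": [], "B": [], "C": ["A", "B"]}

def sample_cpts(parents, cardinality, concentration=1.0):
    """One independent Dirichlet draw per node and parent configuration."""
    cpts = {}
    for node, pa in parents.items():
        pa_levels = [range(cardinality[q]) for q in pa]
        for config in product(*pa_levels):   # roots get the empty config ()
            alpha = np.full(cardinality[node], concentration)
            cpts[(node, config)] = rng.dirichlet(alpha)
    return cpts

cpts = sample_cpts(parents, cardinality)
print(cpts[("C", (0, 1))])   # P(C | A=0, B=1) sampled from its Dirichlet
```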

4. Priors Inducing Sparsity, Structured Orderings, and Edge-Specific Control

Non-standard priors that encode explicit structural biases over graphs—distinct from standard uniform priors—include the minimal Hoppe–Beta prior (Rios et al., 2015), which defines a prior over DAGs that favors sparser layers and hierarchical block structures. Nodes are assigned to layers (blocks) using a Hoppe–Ewens urn process, and directed edges are only allowed from lower to higher layers, with Beta-distributed edge probabilities. The joint prior on a DAG and its layer assignment $z$ follows by explicit marginalization over the latent edge probabilities,

$$P_{G, z}(G, z) = \cdots \prod_{1 \leq a < b \leq K} \frac{B(\beta_1 + N_{a,b}, \beta_2 + M_{a,b})}{B(\beta_1, \beta_2)},$$

where $B(\cdot, \cdot)$ is the beta function and $N_{a,b}$, $M_{a,b}$ count observed and missing edges, respectively, from class $a$ to class $b$.
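
A generative sketch of this prior before marginalization follows (the urn concentration $\theta$ and Beta hyperparameters are illustrative assumptions, not values from the cited paper):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_layered_dag(p, theta=1.0, beta1=1.0, beta2=3.0):
    """Assign nodes to layers via a Hoppe-Ewens urn, then draw directed
    edges only from lower to higher layers with Beta edge probabilities."""
    # Hoppe urn: node i starts a new layer with prob theta / (theta + i),
    # otherwise joins an existing layer proportionally to its size.
    layers, counts = [0], [1]
    for i in range(1, p):
        probs = np.array(counts + [theta], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)          # new layer created
        else:
            counts[k] += 1
        layers.append(k)

    K = len(counts)
    # One Beta-distributed edge probability per ordered pair of layers a < b;
    # beta2 > beta1 biases toward sparser connections.
    q = {(a, b): rng.beta(beta1, beta2)
         for a in range(K) for b in range(a + 1, K)}

    adj = np.zeros((p, p), dtype=int)
    for i in range(p):
        for j in range(p):
            a, b = layers[i], layers[j]
            if a < b and rng.random() < q[(a, b)]:
                adj[i, j] = 1         # edge i -> j, lower layer to higher
    return adj, layers

adj, layers = sample_layered_dag(p=8)
```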

Other priors adaptively induce sparsity over the Cholesky factor of the Gaussian model. The beta-mixture and multiplicative priors (Cao et al., 2019) define distributions on the binary sparsity indicators $Z$ of the Cholesky factor entries, as either

$$\pi(Z \mid q) = \prod_{k>j} q^{Z_{kj}} (1-q)^{1-Z_{kj}}, \quad q \sim \text{Beta}(\alpha_1, \alpha_2),$$

or, in the multiplicative case,

$$\pi(Z \mid \omega_1, \ldots, \omega_p) = \prod_{k>j} (\omega_k \omega_j)^{Z_{kj}} (1-\omega_k \omega_j)^{1-Z_{kj}}, \quad \omega_j \sim \text{Beta}(\alpha_1, \alpha_2).$$

By combining these with spike-and-slab priors on the Cholesky entries, model selection consistency is achieved under substantially relaxed conditions on $p$ and sparsity.
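
Both indicator priors are simple to simulate; a minimal sketch with illustrative hyperparameters, where the strictly lower-triangular mask matches the $k > j$ indexing above:

```python
import numpy as np

rng = np.random.default_rng(4)
p, a1, a2 = 5, 1.0, 4.0
lower = np.tri(p, k=-1, dtype=bool)     # strictly lower triangle: k > j

# Beta-mixture prior: one global inclusion probability q.
q = rng.beta(a1, a2)
Z_mix = (rng.random((p, p)) < q) & lower

# Multiplicative prior: entry (k, j) included with probability w_k * w_j.
w = rng.beta(a1, a2, size=p)
Z_mult = (rng.random((p, p)) < np.outer(w, w)) & lower
```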

Nonlocal priors over Cholesky parameters, with density exactly zero at $L_{ji}=0$,

$$\pi_{nl}(D, L \mid \mathcal{D}) \propto \prod_i \frac{1}{D_{ii}} \prod_{j \in pa_i(\mathcal{D})} L_{ji}^{2r},$$

penalize edge presence more aggressively and yield high-dimensional model selection consistency (Cao et al., 2016).
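
In log space the nonlocal penalty is a sum of $2r \log |L_{ji}|$ terms, which tends to $-\infty$ (i.e., the density vanishes) as any active coefficient approaches zero; a minimal sketch, with a hypothetical parent structure supplied as index lists:

```python
import numpy as np

def nonlocal_unnorm_logpdf(L, D_diag, parents, r=1):
    """Unnormalized log-density of the nonlocal Cholesky prior:
    sum_i [ -log D_ii + 2r * sum_{j in pa_i} log |L_ji| ].
    Tends to -inf as any L_ji -> 0, so the density is zero there."""
    logp = 0.0
    for i, pa in parents.items():
        logp -= np.log(D_diag[i])
        for j in pa:
            logp += 2 * r * np.log(abs(L[j, i]))
    return logp

L = np.eye(3)
L[1, 0], L[2, 1] = 0.7, -0.4
# Hypothetical parent structure: node 0 has parent 1, node 1 has parent 2.
parents = {0: [1], 1: [2], 2: []}
print(nonlocal_unnorm_logpdf(L, np.array([1.0, 0.9, 1.1]), parents))
```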

5. Non-Standard Priors, Model Selection, and Identifiability under Non-Gaussianity

The identification of the true DAG (rather than just its Markov equivalence class) from purely observational data is in general impossible under Gaussian errors. By imposing non-Gaussianity for some or all nodes, estimation procedures can exploit higher-order independence properties for identifiability (Chaudhuri et al., 2025).

In such settings, the prior over graph structures penalizes complexity in a sample-size dependent manner,

$$\pi_g(\gamma) \propto \exp\{-n^{\alpha} d_n |\gamma|\},$$

where $|\gamma|$ is the number of directed edges. This “non-standard” complexity prior differs from uniform model priors by dynamically shrinking the posterior mass onto distributionally equivalent DAGs with the fewest edges, consistent with the “parental preservation” property when non-Gaussian errors are present. Posterior consistency is achieved for the true distribution equivalence class as $n\to\infty$, regardless of error distribution details, by employing such complexity priors in conjunction with Laplace likelihood approximations.
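
The sample-size dependence is the essential feature; a minimal sketch with illustrative values of $\alpha$ and $d_n$ (our own assumptions) showing the per-edge log-penalty growing with $n$:

```python
import numpy as np

def log_complexity_prior(num_edges, n, alpha=0.1, d_n=1.0):
    """Unnormalized log-prior log pi_g(gamma) = -n^alpha * d_n * |gamma|,
    so each additional edge costs more as the sample size n grows."""
    return -(n ** alpha) * d_n * num_edges

# The penalty per edge increases with n, concentrating posterior mass
# on the sparsest distributionally equivalent DAGs.
for n in (100, 10_000):
    print(n, log_complexity_prior(num_edges=5, n=n))
```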

6. Practical Impact and Numerical Behavior

Non-standard DAG priors have been shown empirically to improve sensitivity and specificity in model selection compared to methods relying on standard priors or uniform random structure sampling (Ben-David et al., 2011, Rios et al., 2015). With scalable computational techniques leveraging the independence and locality properties of Cholesky-based priors or block-structured models, these methods are applicable in high-dimensional data regimes (e.g., $p$ up to 2000) and have been validated on real datasets (e.g., cell signaling, call center forecasting), yielding improvements in predictive accuracy, risk under squared error or Stein’s loss, and robustness to contamination.

Algorithmic tools developed for inference under non-standard priors include nodewise decoupling, Euclidean projection methods, stochastic shotgun search (SSS) algorithms for model selection, and Monte Carlo schemes adapted to the prior structure. The explicit marginal likelihoods and closed-form normalizing constants provided by non-standard conjugate and hyper Markov priors further enable efficient Bayes model averaging, posterior computation, and parameter estimation.
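
As an illustration of how such a search is organized, here is a stripped-down, greedy variant of stochastic shotgun search (our own sketch; the placeholder `score` stands in for a closed-form log marginal likelihood plus log-prior, and real SSS implementations retain sets of high-scoring models rather than a single state):

```python
import numpy as np

def is_acyclic(adj):
    """True iff adj (adj[i, j] = 1 for edge i -> j) has no directed cycle:
    repeatedly peel off vertices with no incoming edge among the rest."""
    remaining = set(range(len(adj)))
    while remaining:
        sources = [v for v in remaining
                   if not any(adj[u, v] for u in remaining)]
        if not sources:
            return False              # every remaining vertex lies on a cycle
        remaining -= set(sources)
    return True

def neighbors(adj):
    """All DAGs reachable by adding or deleting one directed edge."""
    p = len(adj)
    for i in range(p):
        for j in range(p):
            if i == j:
                continue
            cand = adj.copy()
            cand[i, j] ^= 1           # flip edge i -> j
            if cand[i, j] == 0 or is_acyclic(cand):
                yield cand            # deletions always stay acyclic

def shotgun_search(score, p, n_iter=50, seed=0):
    """Score the full one-edge neighborhood of the current DAG, then jump
    to a neighbor sampled in proportion to exponentiated scores."""
    rng = np.random.default_rng(seed)
    current = np.zeros((p, p), dtype=int)
    best, best_score = current, score(current)
    for _ in range(n_iter):
        cands = list(neighbors(current))
        scores = np.array([score(c) for c in cands])
        probs = np.exp(scores - scores.max())
        current = cands[rng.choice(len(cands), p=probs / probs.sum())]
        if score(current) > best_score:
            best, best_score = current.copy(), score(current)
    return best, best_score

# Placeholder score: pure sparsity penalty; a real run would plug in a
# closed-form marginal likelihood (e.g., under a DAG-Wishart prior).
best, s = shotgun_search(lambda g: -0.5 * g.sum(), p=4, n_iter=10)
```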

7. Implications for Future Research and Model Construction

The body of results on non-standard DAG priors establishes the need to balance structural flexibility, conjugacy, parameter modularity, and computational tractability. Key insights include:

  • Non-standard priors (whether via Cholesky parametrization, hyperparameter-expanded Dirichlet/hyper Dirichlet families, or explicit graph complexity penalties) enable expression of richer prior beliefs without sacrificing key analytic or computational properties.
  • When designing non-standard priors, careful attention must be given to which independence or equivalence assumptions are relaxed, as violations carry implications for marginal likelihood calculation and the propagation of local-to-global model properties (Geiger et al., 2013).
  • The techniques developed have influenced high-dimensional inference, Bayesian variable selection, and structure learning for both parametric (Gaussian) and nonparametric or mixed data types, and provide the foundation for further generalization to dynamic, temporally evolving, or context-dependent DAG models.

This comprehensive framework, encompassing both theoretical constructs and computational strategies, broadens the toolkit for Bayesian inference and structure learning in graphical models, with continuing impact on applied domains such as genomics, neuroscience, and econometrics.