BDeu Scoring for Bayesian Networks
- BDeu scoring is a Bayesian metric that evaluates marginal likelihood for discrete networks by imposing uniform Dirichlet priors, ensuring score equivalence.
- It features decomposability and parameter modularity, allowing efficient localized computations and scalable structure learning.
- However, BDeu is sensitive to the equivalent sample size parameter and may exhibit pathologies in sparse, unobserved, or skewed data scenarios.
The Bayesian–Dirichlet Equivalent Uniform (BDeu) scoring metric is a canonical, closed-form Bayesian marginal likelihood score used for structure learning in discrete Bayesian networks. BDeu arises from imposing both a uniform Dirichlet prior on each local conditional probability table (CPT) and a score-equivalence constraint, ensuring that Markov-equivalent structures receive identical scores. This uniform, parameter-modular prior is motivated by the absence of domain-specific information, providing a principled default for automated model selection. Despite its widespread adoption owing to decomposability, score equivalence, and computational tractability, BDeu exhibits critical sensitivity to the equivalent sample size (ESS) parameter and displays pathologies in scenarios involving sparse data, unobserved configurations, or strongly skewed marginals.
1. Mathematical Derivation and Defining Properties
The BDeu scoring metric evaluates the marginal likelihood of a dataset $D$ given a network structure $G$ as a product over network families:

$$
P(D \mid G) \;=\; \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}
$$

where:
- $n$ is the number of variables,
- $r_i$ is the number of states of $X_i$,
- $q_i$ is the number of parent configurations for $X_i$,
- $N_{ijk}$ is the observed count for $X_i = k$ and parents of $X_i$ in configuration $j$,
- $N_{ij} = \sum_k N_{ijk}$ and $\alpha_{ij} = \sum_k \alpha_{ijk}$,
- $\alpha$ is the ESS.
Imposing $\alpha_{ijk} = \alpha / (r_i q_i)$ for all $i, j, k$ ensures a “uniform” prior per CPT cell and yields likelihood equivalence: all DAGs expressing the same conditional independencies obtain equal scores (Heckerman et al., 2013).
Key formal properties:
- Score equivalence: Structures in the same Markov-equivalence class are scored identically.
- Parameter modularity: Priors decompose into independent Dirichlet distributions at the family level.
- Decomposability: Score computations are localized to affected families, enabling efficient search and update.
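Decomposability means the log-score of a full DAG is simply the sum of independent family terms. A minimal sketch of one such term, assuming categorical data encoded as dicts of integer state indices (function name and data layout are illustrative, not taken from the cited works):

```python
from math import lgamma
from collections import Counter

def bdeu_family_score(rows, child, parents, arities, ess=1.0):
    """Log BDeu score of one family: a sum of log-Gamma terms over the
    parent configurations and CPT cells observed in the data."""
    r = arities[child]
    q = 1
    for p in parents:
        q *= arities[p]
    a_ij = ess / q         # pseudocount per parent configuration
    a_ijk = ess / (q * r)  # pseudocount per CPT cell

    # Tally N_ijk and N_ij from the data.
    n_ijk, n_ij = Counter(), Counter()
    for row in rows:
        j = tuple(row[p] for p in parents)
        n_ijk[(j, row[child])] += 1
        n_ij[j] += 1

    # Unobserved configurations contribute log(1) = 0 and are skipped.
    score = sum(lgamma(a_ij) - lgamma(a_ij + n) for n in n_ij.values())
    score += sum(lgamma(a_ijk + n) - lgamma(a_ijk) for n in n_ijk.values())
    return score
```

By decomposability, scoring a candidate DAG amounts to summing this function over its families, and a local search move (arc addition, deletion, or reversal) only requires recomputing the terms of the affected children.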
2. Rationale for Uniform Dirichlet Priors and Score Equivalence
BDeu is derived by requiring a noninformative, locally uniform Dirichlet prior and score equivalence (Heckerman et al., 2013). For each variable $X_i$ and parent configuration $j$, the CPT row $\theta_{ij}$ receives a symmetric Dirichlet prior with total pseudocount $\alpha / q_i$:

$$
\theta_{ij} \sim \mathrm{Dirichlet}\!\left(\frac{\alpha}{r_i q_i}, \ldots, \frac{\alpha}{r_i q_i}\right)
$$
This allocation is the unique solution preserving score equivalence under the assumptions of parameter modularity and independence (Heckerman et al., 2013). Under this prior, any two graphs encoding the same set of conditional independencies (i.e., Markov equivalent) receive identical BDeu scores, permitting search in equivalence class–space and avoiding arbitrary preference among equivalent structures.
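Score equivalence is easy to verify numerically. The sketch below (helper and data layout are ours, not from the cited references) scores the two Markov-equivalent DAGs $X \to Y$ and $Y \to X$ on the same binary data and confirms they coincide:

```python
from math import lgamma

def family(counts, r, q, ess=1.0):
    """Log BDeu family term from cell counts {(parent_config, state): N_ijk}."""
    a_j, a_jk = ess / q, ess / (q * r)
    n_j = {}
    for (j, _), n in counts.items():
        n_j[j] = n_j.get(j, 0) + n
    s = sum(lgamma(a_j) - lgamma(a_j + m) for m in n_j.values())
    return s + sum(lgamma(a_jk + m) - lgamma(a_jk) for m in counts.values())

# Joint counts over two binary variables X, Y.
n = {(0, 0): 3, (0, 1): 1, (1, 0): 2, (1, 1): 4}

# X -> Y: score(X) + score(Y | X)
x_to_y = (family({((), x): n[x, 0] + n[x, 1] for x in (0, 1)}, 2, 1)
          + family({((x,), y): n[x, y] for x in (0, 1) for y in (0, 1)}, 2, 2))

# Y -> X: score(Y) + score(X | Y)
y_to_x = (family({((), y): n[0, y] + n[1, y] for y in (0, 1)}, 2, 1)
          + family({((y,), x): n[x, y] for x in (0, 1) for y in (0, 1)}, 2, 2))

# Both DAGs encode the same (empty) independence statement, so the
# BDeu allocation gives them identical log marginal likelihoods.
```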
3. Pathological Behaviors and Sensitivity to Hyperparameters
Despite its theoretical appeal, BDeu exhibits significant sensitivity to the ESS hyperparameter and can behave counterintuitively in certain data regimes (Silander et al., 2012, Scutari, 2017, Ueno, 2012, Suzuki, 2016, Kayaalp et al., 2012):
- Sparsity-induced bias: When some parent configurations are unobserved ($N_{ij} = 0$), the effective prior mass for the family of $X_i$ is reduced, as unobserved configurations contribute a constant factor to the likelihood and no longer provide regularization. This can favor overly complex models, leading to the addition of spurious arcs to fill the "prior mass deficit" (Scutari, 2017, Suzuki, 2016, Ueno, 2012).
- Sensitivity to $\alpha$: The optimal structure can be highly unstable with respect to $\alpha$. Empirically, even minor changes in $\alpha$ can induce large changes in the MAP structure, including transitions from empty to fully connected graphs, depending strongly on the underlying data distribution and network arity (Silander et al., 2012, Ueno, 2012).
- Entropy violation under sparsity: BDeu fails the maximum relative entropy (MrE) principle in sparse-data settings. Since unobserved configurations receive no prior or data contribution, the comparison between models is no longer fair and model selection incentives are skewed (Scutari, 2017).
- Spurious dependencies: BDeu may favor dependence models even when the true structure is independence, e.g., when both variables are constant or the marginals are highly skewed (Kayaalp et al., 2012, Suzuki, 2016). The preference for more complex models can even strengthen with increasing sample size under certain pathological conditions.
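The constant-data pathology can be reproduced directly. In the sketch below (helper and setup are ours), ten identical observations of two binary variables suffice for BDeu with $\alpha = 1$ to prefer the arc $X \to Y$ over the independence model:

```python
from math import lgamma

def family(counts, r, q, ess=1.0):
    # counts: {(parent_config, state): N_ijk} over observed cells only.
    a_j, a_jk = ess / q, ess / (q * r)
    n_j = {}
    for (j, _), n in counts.items():
        n_j[j] = n_j.get(j, 0) + n
    s = sum(lgamma(a_j) - lgamma(a_j + m) for m in n_j.values())
    return s + sum(lgamma(a_jk + m) - lgamma(a_jk) for m in counts.values())

N = 10  # every observation is (X = 0, Y = 0)
root = family({((), 0): N}, r=2, q=1)          # score of a parentless binary node
indep = 2 * root                               # X and Y independent
dep = root + family({((0,), 0): N}, r=2, q=2)  # X -> Y

# BDeu assigns the arc a higher score even though both variables are
# constant: the unobserved parent configuration X = 1 contributes
# nothing, so the dependent model pays less of a complexity penalty.
```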
4. Comparative Analysis and Alternatives
Several studies provide both theoretical and empirical comparison between BDeu, its alternatives, and related scoring functions:
| Metric | Prior Mechanism | Score Equivalence | Pathologies / Robustness | Key References |
|---|---|---|---|---|
| BDeu | Local uniform Dirichlet per CPT row | Markov-equivalence class | $\alpha$-sensitivity, sparsity bias | (Heckerman et al., 2013, Scutari, 2017, Silander et al., 2012) |
| Jeffreys BD | Dirichlet with parameter $1/2$ per cell | Markov-equivalence class | Robust to false positives | (Suzuki, 2016) |
| BDs | Uniform prior mass over observed configs | Asymptotic equivalence | More robust to sparsity | (Scutari, 2017, Scutari, 2016) |
| GU (Global Uniform) | Flat prior over joint distributions consistent with $G$ | Markov-equivalence class | Intractable for general $G$ | (Kayaalp et al., 2012) |
Alternatives such as the Bayesian Dirichlet sparse (BDs) and NIP-BIC reallocate pseudocounts only to observed configurations, preserving the total prior mass and stabilizing structure learning under sparsity (Scutari, 2017, Ueno, 2012). The GU metric, which places a uniform prior over all joint distributions consistent with the structure, addresses many pathological behaviors of BDeu but is computationally intractable for most graphs (Kayaalp et al., 2012).
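The BDs reallocation can be sketched by sizing pseudocounts with the number of *observed* parent configurations $\tilde{q}_i$ instead of $q_i$ (illustrative code; see Scutari, 2016 for the formal definition):

```python
from math import lgamma

def family_score(counts, r, q_eff, ess=1.0):
    # counts: {(parent_config, state): N_ijk}; q_eff sizes the pseudocounts.
    a_j, a_jk = ess / q_eff, ess / (q_eff * r)
    n_j = {}
    for (j, _), n in counts.items():
        n_j[j] = n_j.get(j, 0) + n
    s = sum(lgamma(a_j) - lgamma(a_j + m) for m in n_j.values())
    return s + sum(lgamma(a_jk + m) - lgamma(a_jk) for m in counts.values())

def bdeu(counts, r, q, ess=1.0):
    # BDeu spreads the ESS over all q configurations, observed or not.
    return family_score(counts, r, q, ess)

def bds(counts, r, q, ess=1.0):
    # BDs spreads the ESS only over the configurations seen in the data.
    q_tilde = len({j for (j, _) in counts})
    return family_score(counts, r, q_tilde, ess)
```

When all $q_i$ configurations appear in the data, $\tilde{q}_i = q_i$ and BDs coincides with BDeu; the two diverge exactly in the sparse regime where BDeu loses prior mass.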
5. Computational Characteristics
The key advantage of BDeu is its closed-form expression and decomposability:
- Scalability: Structure scoring decomposes into per-family terms for each candidate $G$, and updates only require recomputation for modified families.
- Efficient Search: Score equivalence and modularity allow localized search in the lattice of equivalence classes (CPDAGs) using greedy, tabu, or A* algorithms (Heckerman et al., 2013).
- Numerical Stability: Practical implementations accumulate log-scores using recurrence-based caching of $\ln\Gamma$ values (Heckerman et al., 2013).
- Sensitivity: Parameter updates, arc addition, or removal may trigger significant score changes, particularly near boundaries (small or zero counts).
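One common caching trick, sketched here under our own naming rather than taken from any cited implementation, memoizes the log-Gamma ratios that every family term needs, via the recurrence $\Gamma(z + 1) = z\,\Gamma(z)$:

```python
from functools import lru_cache
from math import log

@lru_cache(maxsize=None)
def lgamma_ratio(a, n):
    """log(Gamma(a + n) / Gamma(a)) for an integer count n >= 0.

    Built by the recurrence Gamma(z + 1) = z * Gamma(z), so repeated
    queries with the same pseudocount a and nearby counts n reuse
    cached prefixes instead of re-evaluating lgamma from scratch.
    """
    if n == 0:
        return 0.0
    return lgamma_ratio(a, n - 1) + log(a + n - 1)

# Each BDeu family term then becomes a difference of cached ratios:
#   sum over cells of lgamma_ratio(a_ijk, N_ijk)
#   minus, per configuration, lgamma_ratio(a_ij, N_ij).
```

For very large counts an iterative loop (or a direct `math.lgamma` difference) avoids Python's recursion limit; the cached form pays off when many families share the same pseudocounts.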
6. Guidance for Practical Use and Selection of Hyperparameters
Empirical studies and theoretical analysis offer the following guidance for application of BDeu (Steck, 2012, Ueno, 2012, Scutari, 2017, Silander et al., 2012):
- Choice of $\alpha$: No universal default. Common practice is $\alpha = 1$, but the predictive-optimal $\alpha$ depends on data sparsity and CPT skewness, ranging from very small values ($\alpha \ll 1$) in highly structured data to much larger values ($\alpha \gg 1$) in weakly dependent or near-uniform data.
- Robustness: Integrating over $\alpha$ (empirical Bayes), using cross-validated predictive scores, or switching to BDs/NIP-BIC may mitigate pathologies.
- Sparse data: Alternatives (e.g., BDs, GU where feasible) are preferable, as BDeu may add spurious arcs or fail to detect independence.
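The $\alpha$-sensitivity described above is easy to reproduce. In the sketch below (helper and synthetic counts are ours), weakly dependent binary data make the preferred structure flip purely as a function of the ESS:

```python
from math import lgamma

def family(counts, r, q, ess):
    # counts: {(parent_config, state): N_ijk} over observed cells.
    a_j, a_jk = ess / q, ess / (q * r)
    n_j = {}
    for (j, _), n in counts.items():
        n_j[j] = n_j.get(j, 0) + n
    s = sum(lgamma(a_j) - lgamma(a_j + m) for m in n_j.values())
    return s + sum(lgamma(a_jk + m) - lgamma(a_jk) for m in counts.values())

# Mildly dependent joint counts over binary X, Y (N = 100).
n = {(0, 0): 30, (0, 1): 20, (1, 0): 20, (1, 1): 30}

def preference(ess):
    """score(X -> Y) minus score(X, Y independent); positive favors the arc.

    The score of the parentless X family is identical in both
    structures, so only the Y terms remain after the subtraction.
    """
    y_given_x = family({((x,), y): n[x, y]
                        for x in (0, 1) for y in (0, 1)}, 2, 2, ess)
    y_marginal = family({((), y): n[0, y] + n[1, y]
                         for y in (0, 1)}, 2, 1, ess)
    return y_given_x - y_marginal

# A small ESS rejects the arc; a large ESS accepts it on the same data.
```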
7. Extensions and Generalizations
BDeu's underlying principles extend to other probabilistic graphical models, notably staged trees. The BDepu ("path-uniform" BDe) score for staged trees satisfies an analogous score-equivalence property by distributing prior mass uniformly over root-to-leaf paths and enforcing a mass conservation constraint (Hughes et al., 2022). The formula mirrors that of BDeu for BNs, but hyperparameters are computed proportional to the number of paths traversing corresponding edges or stages.
References
- "Learning Bayesian Networks: The Combination of Knowledge and Statistical Data" (Heckerman et al., 2013)
- "Learning Bayesian Networks: A Unification for Discrete and Gaussian Domains" (Heckerman et al., 2013)
- "A Bayesian Network Scoring Metric That Is Based On Globally Uniform Parameter Priors" (Kayaalp et al., 2012)
- "On Sensitivity of the MAP Bayesian Network Structure to the Equivalent Sample Size Parameter" (Silander et al., 2012)
- "Dirichlet Bayesian Network Scores and the Maximum Relative Entropy Principle" (Scutari, 2017)
- "A Theoretical Analysis of the BDeu Scores in Bayesian Network Structure Learning" (Suzuki, 2016)
- "Robust learning Bayesian networks for prior belief" (Ueno, 2012)
- "Beyond Uniform Priors in Bayesian Network Structure Learning" (Scutari, 2017)
- "An Empirical-Bayes Score for Discrete Bayesian Networks" (Scutari, 2016)
- "Score Equivalence for Staged Trees" (Hughes et al., 2022)
- "Learning the Bayesian Network Structure: Dirichlet Prior versus Data" (Steck, 2012)