Bayesian Structural EM Algorithm
- Bayesian Structural EM is a framework for jointly learning network structures and parameters by maximizing the Bayesian posterior in the presence of missing or latent data.
- It alternates between an E-step that computes expected sufficient statistics via Soft or Hard EM and an M-step that optimizes both structure and parameters using a Bayesian Dirichlet score.
- Empirical evaluations show that the approach improves predictive accuracy and reduces KL divergence from the true distribution, making it effective across diverse network sizes and missingness patterns.
The Bayesian Structural EM algorithm is a framework for learning both the structure and parameters of Bayesian networks and related factored probabilistic models directly under the Bayesian posterior criterion, even when the data contain missing values or latent variables. Unlike traditional parameter learning schemes that rely exclusively on complete observed data or penalized likelihood criteria, Bayesian Structural EM alternates between EM-style completion of data and structure search steps, efficiently exploiting observed and expected sufficient statistics to optimize both model topology and parameters. The approach is applicable to networks with discrete or continuous variables, mixtures, multinets, and decision trees, leveraging conjugate priors and factorization for scalable inference and model selection (Friedman, 2013, Ruggieri et al., 2020).
1. Model Selection Objective and Problem Setting
In the context of Bayesian networks, let $\mathcal{D}$ denote an incomplete data set over discrete variables $X_1, \dots, X_n$, and let $G$ denote a candidate directed acyclic graph (DAG) structure with parameters $\Theta$. The Bayesian Structural EM objective is to maximize the joint posterior over structures and parameters:

$$
P(G, \Theta \mid \mathcal{D}) \;\propto\; P(G)\, P(\Theta \mid G)\, P(\mathcal{D} \mid G, \Theta),
$$

where $P(G)$ is a prior over graph structures and $P(\Theta \mid G)$, typically a product of Dirichlet distributions, is the prior over conditional probability table (CPT) parameters. The incompleteness of $\mathcal{D}$ precludes closed-form scoring of structures due to latent variables or missing data, necessitating iterative statistical imputation and expected score evaluation (Friedman, 2013, Ruggieri et al., 2020).
2. Algorithmic Structure and Iterative Workflow
Bayesian Structural EM proceeds iteratively, alternating between the following steps:
- E-step (Completion/Expected Sufficient Statistics): For the current model $(G^{(t)}, \Theta^{(t)})$, infer expected sufficient statistics $\bar{N}_{ijk}$ for each node-parent configuration using either:
- Soft EM: Marginalizes over missing data via belief propagation on the junction tree, assigning fractional counts to all plausible completions.
- Hard EM: Imputes missing values by selecting a single most probable completion per case, tallying counts accordingly.
- M-step (Structure and Parameter Update):
- Structure Search: Maximizes the expected Bayesian Dirichlet score, updating $G$ by greedy hill-climbing (arc addition, deletion, reversal) using the expected statistics.
- Parameter Update: Computes posterior mean (or mode) parameters via the Dirichlet formula, e.g. for the posterior mean:

$$
\hat{\theta}_{ijk} \;=\; \frac{\alpha_{ijk} + \bar{N}_{ijk}}{\sum_{k'} \left( \alpha_{ijk'} + \bar{N}_{ijk'} \right)}
$$

Convergence is checked by monitoring changes in $G^{(t)}$ and $\Theta^{(t)}$ (or in the expected score), or upon reaching the iteration threshold $T$.
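As a concrete instance of the parameter update, the following sketch computes posterior-mean CPT entries from expected counts and Dirichlet hyperparameters for a single node-parent configuration (the counts and hyperparameter values are illustrative, not taken from the papers):

```python
def dirichlet_posterior_mean(expected_counts, alpha):
    """theta_k = (alpha_k + N_k) / sum over k' of (alpha_k' + N_k')."""
    totals = [a + n for a, n in zip(alpha, expected_counts)]
    z = sum(totals)
    return [t / z for t in totals]

# One node under one parent configuration with three states; the fractional
# counts mimic the output of a Soft E-step, the hyperparameters are uniform.
theta = dirichlet_posterior_mean([4.2, 1.8, 0.0], [1.0, 1.0, 1.0])
```

Note that even the state with zero expected count receives nonzero posterior mass, courtesy of the Dirichlet pseudo-counts.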
Pseudocode (excerpted from (Ruggieri et al., 2020)):
```text
Require: D, P(G), {α_ijk}, Soft/Hard E-step, ε, T
Initialize G^(0), Θ^(0)
for t = 1 to T:
    E-step:
        if Soft EM:
            compute expected counts via belief propagation
        else if Hard EM:
            impute each record with its modal completion, tally counts
    M-step:
        structure search: maximize expected score using the counts
        parameter update: Dirichlet posterior
    if converged: break
return (G, Θ)
```
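The loop above can be sketched as runnable code for a deliberately tiny problem: two binary variables, Hard EM imputation, and a structure search over just the empty graph versus the single arc X → Y. All names, the toy data, and the uniform Dirichlet hyperparameters are illustrative assumptions, not taken from the papers:

```python
import math

def family_bd(counts, a=1.0):
    """Log BD contribution of one node: for each parent configuration j,
    log G(a_j) - log G(a_j + N_j) + sum_k [log G(a_jk + N_jk) - log G(a_jk)],
    with a uniform hyperparameter a_jk = a (G is the gamma function)."""
    score = 0.0
    for row in counts:                 # one row of counts per parent configuration
        a_j, n_j = a * len(row), sum(row)
        score += math.lgamma(a_j) - math.lgamma(a_j + n_j)
        score += sum(math.lgamma(a + n) - math.lgamma(a) for n in row)
    return score

def hard_structural_em(data, iters=10):
    """data: list of (x, y) pairs, x in {0, 1} observed, y in {0, 1} or None."""
    cpt = [[0.5, 0.5], [0.5, 0.5]]     # initial P(Y | X) for the candidate arc
    structure = "empty"
    for _ in range(iters):
        # Hard E-step: replace each missing y with its modal value under the CPT.
        completed = [(x, y if y is not None else
                      max((0, 1), key=lambda k: cpt[x][k]))
                     for x, y in data]
        n_x, n_y, n_yx = [0, 0], [0, 0], [[0, 0], [0, 0]]
        for x, y in completed:
            n_x[x] += 1
            n_y[y] += 1
            n_yx[x][y] += 1
        # M-step, structure: compare the BD scores of {} and {X -> Y}.
        score_empty = family_bd([n_x]) + family_bd([n_y])
        score_arc = family_bd([n_x]) + family_bd(n_yx)
        structure = "X->Y" if score_arc > score_empty else "empty"
        # M-step, parameters: Dirichlet posterior mean with alpha = 1.
        cpt = [[(1.0 + c) / (2.0 + sum(row)) for c in row] for row in n_yx]
    return structure, cpt

# Toy data with a strong X -> Y dependence and some missing Y values.
data = [(0, 0)] * 20 + [(1, 1)] * 20 + [(0, None)] * 5 + [(1, None)] * 5
structure, cpt = hard_structural_em(data)
```

On this data the search recovers the arc X → Y, and the imputed records sharpen the CPT toward the deterministic dependence. A real implementation would search over many candidate DAGs per M-step rather than two fixed structures.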
3. Statistical Scoring and Expected Sufficient Statistics
For structure scoring, the expected Bayesian Dirichlet (BD) score is employed. For each node $X_i$, summing over parent configurations $j$ and states $k$, the score contribution is:

$$
\mathrm{BD}_i \;=\; \sum_{j} \left[ \log \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + \bar{N}_{ij})} \;+\; \sum_{k} \log \frac{\Gamma(\alpha_{ijk} + \bar{N}_{ijk})}{\Gamma(\alpha_{ijk})} \right],
\qquad \alpha_{ij} = \sum_k \alpha_{ijk}, \quad \bar{N}_{ij} = \sum_k \bar{N}_{ijk}.
$$

Optionally, the log structure prior term $\log P(G)$ is added (Ruggieri et al., 2020).
Soft EM leverages the full posterior over missing entries, typically requiring expensive global inference per record, whereas Hard EM hastens computation by committing to the modal completion. The choice between the two may be informed by network size and by the pattern and severity of missingness.
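The difference between the two E-step variants is easiest to see on a single variable with missing entries. In a real Soft EM the per-record posterior would condition on the other observed variables via belief propagation; this sketch substitutes the current marginal for brevity, and all names and data are illustrative:

```python
def e_step_counts(records, p_y, soft=True):
    """Expected counts for one binary variable Y; None marks a missing entry."""
    counts = [0.0, 0.0]
    for y in records:
        if y is not None:
            counts[y] += 1.0                     # observed value: whole count
        elif soft:
            counts[0] += p_y[0]                  # Soft EM: fractional credit
            counts[1] += p_y[1]                  #   to every plausible state
        else:                                    # Hard EM: all mass to the mode
            counts[max((0, 1), key=lambda k: p_y[k])] += 1.0
    return counts

records = [0, 0, 1, None, None, None]
p_y = [0.7, 0.3]                                 # current estimate of P(Y)
soft_counts = e_step_counts(records, p_y, soft=True)
hard_counts = e_step_counts(records, p_y, soft=False)
```

Soft EM spreads the three missing records as fractional counts (0.7/0.3 each), while Hard EM assigns all three to the modal state, which biases the counts whenever the posterior is not sharply peaked.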
4. Computational Complexity and Trade-offs
- Soft EM: Each E-step entails junction-tree propagation per record (or batch), incurring a per-record cost exponential in the tree-width $w$ of the network. Structure search multiplies scoring costs by the number of candidate DAGs evaluated.
- Hard EM: Each E-step computes the single best completion per record, after which tallying is straightforward. Hard EM becomes increasingly advantageous as missingness or network size grows, since it is notably less computationally intensive.
Early-stopped Soft EM (“soft-forced EM”) reduces iterations to mimic Hard EM runtime but often remains slower and less accurate than Hard EM, particularly as missingness intensifies (Ruggieri et al., 2020).
5. Empirical Decision Criteria and Practical Guidelines
Simulation studies using benchmark BNs (Asia, Sports, Alarm, Property, Hailfinder, Formed, Pathfinder) yield empirical rules governing the preferred EM variant:
- For small or medium networks with unbalanced or medium-to-high-severity missingness: Hard EM is recommended.
- If the missingness distribution is balanced and its severity is low: Soft EM and Hard EM perform similarly.
- For "structured" missingness (systematic at root, leaf, high-degree, or target nodes): Soft EM is preferred.
- For large networks and fair missingness patterns: Hard EM dominates.
- For large networks with balanced but non-fair missingness: Soft EM.
Empirically, Hard EM achieved the lowest Kullback–Leibler divergence in 66% of scenarios, particularly for large networks and fair missingness. Soft EM outperformed Hard EM only under structured missingness at special nodes. Soft-forced EM did not consistently dominate in accuracy or efficiency (Ruggieri et al., 2020).
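One way to keep these guidelines at hand is as a simple categorical lookup. The categorical inputs below stand in for the numeric thresholds reported by Ruggieri et al. (2020), which are not reproduced here, so this is a mnemonic for the rules above rather than a faithful decision procedure:

```python
def choose_em_variant(size, severity, balanced, structured, fair):
    """size: 'small' | 'medium' | 'large'; severity: 'low' | 'medium' | 'high';
    balanced/structured/fair: booleans describing the missingness pattern."""
    if structured:                   # systematic missingness at special nodes
        return "soft"
    if size == "large":
        return "hard" if fair else "soft"
    if balanced and severity == "low":
        return "either"              # the two variants perform similarly
    return "hard"                    # unbalanced or medium/high-severity cases

variant = choose_em_variant("medium", "high", False, False, True)
```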
6. Convergence Guarantees and Applicability
Convergence properties are established under the conditions of a finite model class or a uniform bound on model complexity, ensuring nondecreasing improvement in the log-posterior and eventual attainment of a structural EM fixed point, analogous to standard EM stationary points. The procedure is formally applicable to multinomial, mixture, and continuous (Gaussian) Bayesian networks, and to decision trees/graphs, provided factorization and conjugacy hold (Friedman, 2013).
7. Comparative Performance and Extensions
Experimental evidence supports that Bayesian Structural EM with the exact expected score (or Laplace/quadrature approximations) outperforms BIC-based Structural EM in held-out KL divergence, especially under substantial missingness or hidden variables (improvements observed at up to 40% missing data in the ALARM and Insurance benchmarks). Fully Bayesian scoring yields better predictive models in small-sample regimes and latent-variable recovery scenarios. The algorithm efficiently exploits factorization for rapid scoring, integrates parameter and structure optimization in a unified loop, and is robust across broad factored model classes (Friedman, 2013).
In summary, the Bayesian Structural EM algorithm provides a principled, convergent, and computationally practical solution to joint structure and parameter learning in Bayesian networks with incomplete data, unifying statistical imputation and structure model selection under a true Bayesian model-selection objective. Its empirical selection criteria and demonstrable performance advantages render it a standard approach in modern probabilistic graphical model learning.