
Expectation–Conditional Maximisation Algorithm

Updated 5 December 2025
  • The ECM algorithm is a variant of EM that replaces the difficult M-step with a series of simpler conditional maximization steps, ensuring tractable updates in latent variable models.
  • In sparse Gaussian graphical model estimation, ECM employs adaptive spike-and-slab penalties to efficiently select network structures while mitigating bias in high-dimensional settings.
  • For rigid and articulated point registration, ECMPR integrates robust outlier handling and precise rotation updates via SVD or SDP, improving registration accuracy over traditional methods.

The Expectation–Conditional Maximisation (ECM) algorithm is a deterministic generalisation of Expectation–Maximisation (EM), designed for maximum-likelihood and posterior-mode estimation in problems with latent variables and complex conditional structure. ECM refines the classic EM paradigm by decomposing the maximization (M) step into a sequence of simpler conditional maximization (CM) sub-steps, each of which is often analytically or computationally tractable. The ECM framework is particularly salient in high-dimensional settings and latent graphical models, where a full joint M-step is infeasible, and has produced state-of-the-art methods for both sparse graphical model selection and robust mixture-based rigid/articulated point registration.

1. Algorithmic Principle and Framework

In ECM, each iteration consists of an expectation (E) step, followed by one or more CM steps. The E-step computes the expectation of the complete-data log-likelihood (or log-posterior in Bayesian settings), conditioning on the current parameter estimates and observed data. Each CM step then conditionally maximizes this expected criterion with respect to a subset (or block) of the parameters, given the current values of the other blocks. The process cycles through the parameter blocks until all are updated.

Let $X$ denote observed data, $Z$ latent variables, and $\theta = (\theta_1, \dots, \theta_K)$ the parameter blocks. At iteration $\ell$:

  • E-step: Compute $Q(\theta \mid \theta^{(\ell)}) = \mathbb{E}_{Z \mid X,\theta^{(\ell)}}[\log p(X,Z;\theta)]$
  • CM-steps, for $k = 1,\dots,K$: $\theta_k^{(\ell+1)} = \arg\max_{\theta_k} Q(\theta_1^{(\ell+1)},\ldots,\theta_{k-1}^{(\ell+1)}, \theta_k, \theta_{k+1}^{(\ell)}, \ldots, \theta_K^{(\ell)} \mid \theta^{(\ell)})$

Standard ECM theory guarantees a monotonic increase in the observed-data likelihood (or posterior) and, under mild regularity conditions, convergence to a stationary point.
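
To make the cycle concrete, the following is a minimal, self-contained sketch (illustrative only, not drawn from the cited papers): maximum-likelihood fitting of a univariate Student-t location/scale/degrees-of-freedom model, a standard ECM use case. The E-step computes expected latent precision weights, CM-step 1 updates $(\mu, \sigma^2)$ in closed form given $\nu$, and CM-step 2 updates $\nu$ by a one-dimensional numerical maximization of $Q$ given the other block.

# Minimal ECM sketch (illustrative; not from the cited papers): ML fitting of a
# univariate Student-t location/scale/df model.
import numpy as np
from scipy.special import digamma, gammaln
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.standard_t(df=4.0, size=2000) * 1.5 + 2.0      # synthetic heavy-tailed data

mu, sigma2, nu = np.median(x), np.var(x), 10.0          # crude initialisation
for _ in range(200):
    # E-step: posterior moments of the latent Gamma weights w_i given current parameters
    delta = (x - mu) ** 2 / sigma2
    w = (nu + 1.0) / (nu + delta)                                   # E[w_i | x_i, theta]
    logw = digamma((nu + 1.0) / 2.0) - np.log((nu + delta) / 2.0)   # E[log w_i | x_i]

    # CM-step 1: maximise Q over (mu, sigma^2) with nu held fixed (closed form)
    mu = np.sum(w * x) / np.sum(w)
    sigma2 = np.mean(w * (x - mu) ** 2)

    # CM-step 2: maximise Q over nu with (mu, sigma^2) held at their new values
    c = np.mean(logw - w)
    neg_q = lambda v: -(0.5 * v * np.log(0.5 * v) - gammaln(0.5 * v) + 0.5 * v * c)
    nu = minimize_scalar(neg_q, bounds=(0.1, 200.0), method="bounded").x

print(f"mu={mu:.3f}  sigma^2={sigma2:.3f}  nu={nu:.2f}")

Because every CM step increases the same Q-function, the observed-data log-likelihood is non-decreasing across iterations, mirroring the guarantee above.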

2. ECM for Sparse Gaussian Graphical Model Estimation

The EM Graph Selection (EMGS) algorithm of Li & McCormick applies ECM to Gaussian graphical models with a spike-and-slab prior on the precision matrix $\Omega$ (Li et al., 2017). The observed data $X \in \mathbb{R}^{n \times p}$ are modeled as independent samples from $\mathcal{N}(0, \Omega^{-1})$. The prior is specified as follows:

  • Off-diagonal: $\omega_{jk} \mid \delta_{jk} \sim \mathcal{N}(0, v_{\delta_{jk}}^2)$ with "spike" ($v_0$) and "slab" ($v_1$) variances, $v_0 \ll v_1$
  • Diagonal: $\omega_{jj} \sim \mathrm{Exp}(\lambda/2)$
  • Latent indicators: $\delta_{jk} \sim \mathrm{Bernoulli}(\pi)$, $\pi \sim \mathrm{Beta}(a,b)$

The complete-data log-posterior (up to constants) is:

$$\log p(\Omega, \delta, \pi \mid X) = \frac{n}{2}\log|\Omega| - \frac{1}{2}\mathrm{tr}(S\Omega) - \sum_{j < k} \frac{\omega_{jk}^2}{2v_{\delta_{jk}}^2} - \frac{\lambda}{2} \sum_j \omega_{jj} + \sum_{j < k} \left[\delta_{jk} \log \pi + (1-\delta_{jk})\log(1-\pi)\right] + (a-1)\log \pi + (b-1)\log(1-\pi)$$

where $S = X^\top X$.

Algorithmic Steps

  1. E-step: Compute $p^*_{jk} = \mathbb{E}[\delta_{jk} \mid \Omega^{(\ell)}, \pi^{(\ell)}, X]$ and adaptive penalties $d^*_{jk} = \mathbb{E}[1/v_{\delta_{jk}}^2 \mid \cdot]$:

$$p^*_{jk} = \frac{\pi^{(\ell)}\, \phi(\omega_{jk}^{(\ell)}/v_1)/v_1}{\pi^{(\ell)}\, \phi(\omega_{jk}^{(\ell)}/v_1)/v_1 + (1-\pi^{(\ell)})\, \phi(\omega_{jk}^{(\ell)}/v_0)/v_0}$$

$$d^*_{jk} = \frac{1-p^*_{jk}}{v_0^2} + \frac{p^*_{jk}}{v_1^2}$$

  2. CM-step ($\pi$ update): Closed form,

$$\pi^{(\ell+1)} = \frac{a-1 + \sum_{j<k} p^*_{jk}}{a + b - 2 + p(p-1)/2}$$

  3. CM-step ($\Omega$ update): Update column by column, analogous to block-coordinate descent:

$$\omega_{12}^{\mathrm{new}} = -\left[(s_{22}+\lambda)\,\Omega_{11}^{-1} + \mathrm{Diag}(d^*_{\cdot k})\right]^{-1} s_{12}, \qquad \omega_{22}^{\mathrm{new}} = (\omega_{12}^{\mathrm{new}})^\top \Omega_{11}^{-1}\, \omega_{12}^{\mathrm{new}} + \frac{n}{s_{22}+\lambda}$$

Repeat for each column $k=1,\dots,p$, as in the sketch below.
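
For concreteness, a minimal NumPy sketch of one such iteration follows. It is illustrative only: the function name ecm_iteration is ours, the formulas are transcribed from the steps above, and a naive matrix inverse stands in for the cheaper rank-one updates used in the paper.

# Sketch of one EMGS-style ECM iteration (illustrative; not the reference code).
import numpy as np
from scipy.stats import norm

def ecm_iteration(Omega, pi, S, n, v0, v1, lam, a, b):
    p = Omega.shape[0]

    # E-step: posterior inclusion probabilities p* and adaptive penalties d*
    slab = norm.pdf(Omega / v1) / v1                    # elementwise slab density
    spike = norm.pdf(Omega / v0) / v0                   # elementwise spike density
    p_star = pi * slab / (pi * slab + (1 - pi) * spike)
    d_star = (1 - p_star) / v0**2 + p_star / v1**2

    # CM-step for pi: closed-form Beta-Bernoulli update over the p(p-1)/2 edges
    pi = (a - 1 + p_star[np.triu_indices(p, 1)].sum()) / (a + b - 2 + p * (p - 1) / 2)

    # CM-step for Omega: cycle through columns, block-coordinate style
    for k in range(p):
        idx = np.delete(np.arange(p), k)
        Omega11_inv = np.linalg.inv(Omega[np.ix_(idx, idx)])   # paper uses rank-one updates
        s12, s22 = S[idx, k], S[k, k]
        A = (s22 + lam) * Omega11_inv + np.diag(d_star[idx, k])
        w12 = -np.linalg.solve(A, s12)
        w22 = w12 @ Omega11_inv @ w12 + n / (s22 + lam)
        Omega[idx, k] = Omega[k, idx] = w12
        Omega[k, k] = w22
    return Omega, pi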

Adaptive Penalization

The mixture prior induces elementwise adaptive ridge penalties, derived from the E-step expectation: large $|\omega_{jk}|$ receive weak shrinkage (penalty close to $1/v_1^2$), while small $|\omega_{jk}|$ incur strong shrinkage (penalty close to $1/v_0^2$).
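
A small numeric illustration with hypothetical values ($v_0 = 0.02$, $v_1 = 1$, $\pi = 0.5$) shows the two regimes:

# Hypothetical spike/slab settings; d* interpolates between 1/v1^2 and 1/v0^2.
from scipy.stats import norm
v0, v1, pi = 0.02, 1.0, 0.5
for w in (0.5, 0.01):                       # a large and a small precision entry
    slab = pi * norm.pdf(w / v1) / v1
    spike = (1 - pi) * norm.pdf(w / v0) / v0
    p_star = slab / (slab + spike)
    d_star = (1 - p_star) / v0**2 + p_star / v1**2
    print(f"omega={w}: p*={p_star:.3f}, adaptive penalty d*={d_star:.1f}")
# omega=0.5 gives d* near 1 (weak shrinkage); omega=0.01 gives d* around 2400 (strong shrinkage).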

Pseudocode

initialize Ω = I_p, π = a/(a + b)
while change > tol and iter < maxit:
    # E-step
    for all j < k:
        compute p_star_jk and d_star_jk
    if X has missing:
        compute S <- E[X^T X | Ω, X_obs]
    # CM-step
    update π via Beta formula
    for k in 1...p:
        partition Ω, S; update ω_12 and ω_22
        reassemble Ω
    iter += 1

Empirically, tens of ECM iterations suffice; overall complexity per $v_0$ grid value is comparable to a single graphical lasso solution, but without nested inner optimizations (Li et al., 2017).
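
A schematic warm-started sweep over a $v_0$ grid, reusing the ecm_iteration sketch above, can look as follows (the helper name, hyperparameter defaults, grid ordering, and stopping rule are illustrative assumptions, not the paper's settings):

# Hypothetical warm-started regularisation path over a grid of spike variances v0.
import numpy as np

def emgs_path(X, v0_grid, v1=1.0, lam=2.0, a=1.0, b=1.0, tol=1e-4, maxit=100):
    n, p = X.shape
    S = X.T @ X
    Omega, pi = np.eye(p), a / (a + b)                 # initialisation as in the pseudocode
    path = []
    for v0 in sorted(v0_grid):                         # each fit warm-starts from the last
        for _ in range(maxit):
            Omega_prev = Omega.copy()
            Omega, pi = ecm_iteration(Omega, pi, S, n, v0, v1, lam, a, b)
            if np.max(np.abs(Omega - Omega_prev)) < tol:
                break
        path.append((v0, Omega.copy(), pi))
    return path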

3. ECM for Rigid and Articulated Point Registration

The ECM for Point Registration (ECMPR) algorithm addresses rigid and articulated point-set matching by framing the unknown correspondences as missing data in a mixture-model framework (Horaud et al., 2020). Model points $\mathcal{X} = \{X_i\}_{i=1}^n$ are linked to data points $\mathcal{Y} = \{Y_j\}_{j=1}^m$ via latent assignments $Z_j \in \{1, \dots, n, n+1\}$, with $n+1$ denoting the uniform outlier class.

Mixture Model Formulation

  • $P(Z_j = i) = p_{\text{in}}$ for $i \le n$, and $P(Z_j = n+1) = p_{\text{out}}$
  • $P(Y_j \mid Z_j = i) = \mathcal{N}(Y_j \mid \mu(X_i; \Theta), \Sigma_i)$ for $i \le n$
  • Outlier model: $P(Y_j \mid Z_j = n+1) = 1/V$

Iterative Steps

  1. E-step: Compute soft assignments (posteriors):

$$\alpha_{ji} = \frac{p_i\, \mathcal{N}(Y_j \mid \mu(X_i; \Theta), \Sigma_i)}{\sum_{k=1}^n p_k\, \mathcal{N}(Y_j \mid \mu(X_k; \Theta), \Sigma_k) + p_{n+1}/V}$$

Outlier posterior: $\alpha_{j, n+1} = 1 - \sum_{i=1}^n \alpha_{ji}$

  2. CM-step 1 (Registration Parameters): For the rigid map $\mu(X_i;\Theta) = RX_i + t$, define:

$$\lambda_i = \sum_{j=1}^m \alpha_{ji}, \qquad W_i = \frac{1}{\lambda_i} \sum_{j=1}^m \alpha_{ji}\, Y_j$$

Minimize:

$$\mathcal{E}(R,t) = \frac{1}{2} \sum_{i=1}^n \lambda_i\, \|W_i - R X_i - t\|^2_{\Sigma_i}$$

  • $t$ has a closed-form update
  • $R$ is solved via Procrustes/SVD for isotropic $\Sigma_i$, or via a semidefinite program (SDP) for general $\Sigma_i$
  3. CM-step 2 (Covariance Updates):

$$\Sigma_i = \frac{\sum_{j=1}^m \alpha_{ji}\, (Y_j - \mu(X_i; \Theta))(Y_j - \mu(X_i; \Theta))^\top}{\lambda_i} + \varepsilon I$$

  4. Articulated Registration: The kinematic chain is decomposed into rigid parts, and each rigid group is registered incrementally along the chain. A sketch of the rigid case appears after this list.
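
The rigid, isotropic case can be sketched compactly. The snippet below is illustrative only: it uses a single shared variance sigma^2 in place of the per-point covariances $\Sigma_i$, assumes equal inlier priors $p_{\text{in}} = (1 - p_{\text{out}})/n$, and performs the SVD-based Procrustes update for $R$; the anisotropic SDP branch and the articulated decomposition are omitted.

# One ECMPR-style rigid iteration, isotropic covariance (illustrative sketch).
import numpy as np

def ecmpr_iteration(X, Y, R, t, sigma2, p_out=0.1, V=1.0):
    n, d = X.shape
    p_in = (1.0 - p_out) / n                            # assumed equal inlier priors

    # E-step: soft assignments alpha[j, i] of data point j to model point i
    mu = X @ R.T + t                                    # transformed model points
    sq = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(-1)            # (m, n) squared distances
    gauss = p_in * np.exp(-0.5 * sq / sigma2) / (2 * np.pi * sigma2) ** (d / 2)
    alpha = gauss / (gauss.sum(axis=1, keepdims=True) + p_out / V)  # inlier posteriors
    # leftover mass 1 - alpha.sum(axis=1) is the outlier posterior alpha_{j, n+1}

    # CM-step 1: weighted Procrustes update of (R, t) from the virtual points W_i
    lam = alpha.sum(axis=0)                             # lambda_i
    W = (alpha.T @ Y) / np.maximum(lam, 1e-12)[:, None]
    xbar = (lam @ X) / lam.sum()
    wbar = (lam @ W) / lam.sum()
    C = (W - wbar).T @ (lam[:, None] * (X - xbar))      # weighted cross-covariance
    U, _, Vt = np.linalg.svd(C)
    D = np.diag([1.0] * (d - 1) + [np.linalg.det(U @ Vt)])          # enforce det(R) = +1
    R = U @ D @ Vt
    t = wbar - R @ xbar

    # CM-step 2: isotropic variance update from weighted residuals (plus a small ridge)
    resid = Y[:, None, :] - (X @ R.T + t)[None, :, :]
    sigma2 = (alpha * (resid ** 2).sum(-1)).sum() / (d * lam.sum()) + 1e-9
    return R, t, sigma2, alpha

Iterating this to convergence and classifying each $Y_j$ by its largest posterior responsibility reproduces the loop in the pseudocode below.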

Robustness and Outlier Handling

A uniform component in the mixture ensures that points poorly explained by any Gaussian component are classified as outliers, providing automatic robustification without the need for hand-tuned thresholds.

Pseudocode

Initialize R ← I, t ← 0, Σ_i ← large*I
repeat
    E-step: compute α_{ji} for i = 1...n+1
    form λ_i, W_i
    CM-step 1: estimate (R,t) (Procrustes or SDP)
    CM-step 2: update Σ_i via weighted covariance
until convergence
classify each j by argmax_i α_{ji}

Each ECMPR iteration costs $O(mn)$ for the E-step and, in the anisotropic case, requires solving a small SDP (of dimension 9) for the rotation update (Horaud et al., 2020).

4. Adaptive Penalization and Structured Priors

In EMGS, latent indicators $\delta_{jk}$ select between a "spike" and a "slab" prior variance, translating into strongly or weakly penalized connections in the estimated graph. This confers adaptivity and reduces bias on large interactions, a notable advance relative to uniform penalties in standard glasso.

Structured priors can be incorporated by grouping the $(j,k)$ edges into blocks, with each block sharing a group-specific slab variance $\tau_{gg'}$. These group scales can themselves be given hyperpriors and updated in closed form within the CM step:

$$\tau_{gg'} \leftarrow \frac{a_\tau - 1 + \frac{1}{2} \sum_{j<k} \mathbf{1}\{(g_j, g_k) = (g, g')\}}{b_\tau + \frac{1}{2} \sum_{j<k} \omega_{jk}^2\, d^*_{jk}\, \mathbf{1}\{(g_j, g_k) = (g, g')\}}$$

This facilitates both flexible penalization and the infusion of external prior knowledge (Li et al., 2017).

5. Extensions: Missing Data and Mixed/Discrete Data

Both ECM-based algorithms naturally address missing data:

  • For EMGS, missing entries in $X$ are imputed at each E-step using the conditional expectation of $X_i X_i^\top$ under the current $\Omega$.
  • For ECMPR, the likelihood formulation allows unassigned (outlier) points, and unobserved correspondences are latent.

EMGS extends to mixed/binary data via a Gaussian copula approach: latent variables $Z_i \sim \mathcal{N}(0, \Omega^{-1})$ are truncated to be compatible with the observed data ranks, and the required expectations use stochastic or MCEM approximations, leaving the CM-steps unchanged (Li et al., 2017).

6. Computational Complexity and Convergence Properties

  • EMGS (Graph Selection): Each full cycle of $p$ block updates is $O(p^3)$ thanks to efficient rank-one updates, with empirical convergence in tens of iterations. Total cost over a hyperparameter grid is competitive with glasso, while requiring fewer total system solves owing to deterministic convergence and the absence of nested inner routines.
  • ECMPR (Point Registration): The E-step is $O(mn)$, with the main computational overhead in the rotation update: an SVD for the isotropic case ($O(n)$), or a small SDP (dimension 9) for the fully anisotropic case. Each iteration does not decrease the log-likelihood, and the iterates converge to a stationary point, as inherited from the general ECM framework (Horaud et al., 2020).
  • EMGS offers adaptive elementwise penalization, warm-started regularization paths, and flexibility for structured priors and copula extensions, in contrast with uniform-penalty glasso and with MCMC stochastic search, which becomes computationally intractable in high dimensions.
  • ECMPR generalizes classical EM and ICP (Iterative Closest Point) methods by supporting full covariance modeling (yielding more robust assignments) and by integrating outlier rejection natively via a uniform mixture component. Its rotation update via SDP constitutes an exact relaxation, improving over heuristic/annealing procedures.

These advances highlight ECM as a principled framework for tractable, robust inference in challenging latent-variable models, with demonstrated efficacy in both sparse graphical model selection and robust geometric registration (Li et al., 2017, Horaud et al., 2020).
