Bayesian Gaussian Mixture Models with Wasserstein Repulsion
- Bayesian Gaussian Mixture Models are probabilistic frameworks that blend Gaussian mixtures with Bayesian priors to capture uncertainty and perform robust clustering.
- Incorporating Wasserstein-based repulsion encourages well-separated clusters by penalizing similarity in both means and covariances, leading to improved density estimation.
- The method utilizes a blocked-collapsed Gibbs sampler for efficient posterior inference, achieving competitive results in both simulated and real high-dimensional datasets.
A Bayesian Gaussian Mixture Model (BGMM) defines a mixture distribution over the observed data, with Gaussian components whose parameters and mixture proportions are endowed with prior distributions to capture uncertainty and support inference. In recent developments, priors have been constructed to incorporate repulsion between components, encouraging well-separated clusters by leveraging global geometric information such as the Wasserstein distance. Such approaches enable more robust density estimation, clustering, and model complexity control, particularly in high-dimensional and nonparametric regimes.
1. Model Construction: Wasserstein-Repulsive BGMM
Let $x_1, \dots, x_n \in \mathbb{R}^d$ denote observed data. The BGMM assumes a finite mixture of $K$ Gaussian components with latent indicators $z_i \in \{1, \dots, K\}$ representing cluster assignments. The likelihood is
$$p(x_i \mid \pi, \mu, \Sigma) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k), \qquad i = 1, \dots, n,$$
where $\mathcal{N}(\cdot \mid \mu_k, \Sigma_k)$ is the Gaussian density with mean $\mu_k$ and covariance $\Sigma_k$.
The prior on mixture proportions is Dirichlet: $\pi = (\pi_1, \dots, \pi_K) \sim \mathrm{Dirichlet}(\alpha, \dots, \alpha)$ with concentration $\alpha > 0$.
Standard BGMMs specify independent priors on the component parameters $\theta_k = (\mu_k, \Sigma_k)$. In the Wasserstein-repulsive model, the joint prior is
$$\Pi(\theta_1, \dots, \theta_K) \propto h(\theta_1, \dots, \theta_K) \prod_{k=1}^{K} \pi_0(\theta_k),$$
where $h$ introduces repulsion via
$$h(\theta_1, \dots, \theta_K) = \min_{1 \le j < k \le K} g\!\left(W_2^2\bigl(\mathcal{N}(\mu_j, \Sigma_j), \mathcal{N}(\mu_k, \Sigma_k)\bigr)\right)$$
or its geometric-mean analogue over all pairs, with $g : [0, \infty) \to [0, 1]$ strictly increasing (e.g., $g(t) = t/(t+\tau)$ for a scale parameter $\tau > 0$, one common choice). The term $W_2^2$ is the squared 2-Wasserstein distance between multivariate normals:
$$W_2^2\bigl(\mathcal{N}(\mu_1, \Sigma_1), \mathcal{N}(\mu_2, \Sigma_2)\bigr) = \lVert \mu_1 - \mu_2 \rVert_2^2 + \operatorname{tr}\!\Bigl(\Sigma_1 + \Sigma_2 - 2\bigl(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2}\bigr)^{1/2}\Bigr).$$
The normalization constant of this prior is intractable but is controlled: its logarithm grows at most linearly in $K$ (Huang et al., 30 Apr 2025).
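To make the prior concrete, here is a minimal Python sketch of the repulsive factor, assuming the min-type $h$ and the choice $g(t) = t/(t+\tau)$ above; the helper names `w2_squared` and `log_repulsion` are ours, not from the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_squared(mu1, Sigma1, mu2, Sigma2):
    """Squared 2-Wasserstein distance between N(mu1, Sigma1) and N(mu2, Sigma2)."""
    root1 = sqrtm(Sigma1)                      # Sigma1^{1/2}
    cross = sqrtm(root1 @ Sigma2 @ root1)      # (Sigma1^{1/2} Sigma2 Sigma1^{1/2})^{1/2}
    bures = np.trace(Sigma1 + Sigma2 - 2.0 * np.real(cross))
    return float(np.sum((mu1 - mu2) ** 2) + bures)

def log_repulsion(mus, Sigmas, tau=1.0):
    """log h(theta_1, ..., theta_K) with h = min_{j<k} g(W2^2), g(t) = t/(t+tau)."""
    K = len(mus)
    best = np.inf
    for j in range(K):
        for k in range(j + 1, K):
            t = w2_squared(mus[j], Sigmas[j], mus[k], Sigmas[k])
            best = min(best, np.log(t) - np.log(t + tau))  # log g(t); g is increasing
    return 0.0 if K < 2 else best
```

Because $g$ maps into $[0, 1]$, the factor only downweights configurations with nearby components; the unknown normalization constant never has to be evaluated.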
2. Posterior Structure and Inference
The joint posterior over $z$, $\pi$, and $\theta = (\theta_1, \dots, \theta_K)$ is
$$p(z, \pi, \theta \mid x) \propto \Bigl[\prod_{i=1}^{n} \pi_{z_i} \, \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i})\Bigr] \, p(\pi) \, h(\theta_1, \dots, \theta_K) \prod_{k=1}^{K} \pi_0(\theta_k),$$
or, grouping by component with $n_k = \#\{i : z_i = k\}$,
$$p(z, \pi, \theta \mid x) \propto \Bigl[\prod_{k=1}^{K} \pi_k^{n_k} \prod_{i : z_i = k} \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\Bigr] \, p(\pi) \, h(\theta_1, \dots, \theta_K) \prod_{k=1}^{K} \pi_0(\theta_k).$$
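The displayed posterior can be evaluated up to its normalization constant directly; the sketch below does so, reusing `log_repulsion` from the previous snippet. The function name `log_joint` and the base-prior callback `log_p0` are illustrative, not from the paper, and a symmetric Dirichlet prior on $\pi$ is assumed.

```python
import numpy as np
from scipy.stats import multivariate_normal, dirichlet

def log_joint(x, z, pi, mus, Sigmas, alpha, log_p0, tau=1.0):
    """Unnormalized log posterior log p(z, pi, theta | x) of the WRGM sketch."""
    K = len(mus)
    n_k = np.bincount(z, minlength=K)
    # Likelihood: sum_i log N(x_i | mu_{z_i}, Sigma_{z_i}), grouped by component
    loglik = sum(multivariate_normal.logpdf(x[z == k], mus[k], Sigmas[k]).sum()
                 for k in range(K) if n_k[k] > 0)
    # Mixture weights: the pi_k^{n_k} factor plus the Dirichlet prior p(pi)
    logpi = float(np.dot(n_k, np.log(pi))) + dirichlet.logpdf(pi, alpha * np.ones(K))
    # Component parameters: independent base prior pi_0 times the repulsive factor h
    logprior = sum(log_p0(mus[k], Sigmas[k]) for k in range(K))
    return loglik + logpi + logprior + log_repulsion(mus, Sigmas, tau)
```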
Posterior contraction is established under minimal moment and support conditions on the true density $f_0$ and the repulsion function. Specifically, the posterior contracts around $f_0$ in the Hellinger metric $d_H$ at a rate $\epsilon_n \to 0$:
$$\Pi\bigl(f : d_H(f, f_0) > M \epsilon_n \mid x_1, \dots, x_n\bigr) \longrightarrow 0$$
in $P_{f_0}^n$-probability for a sufficiently large constant $M$, removing the simultaneous-diagonalization requirement on covariances present in mean-repulsive priors (Huang et al., 30 Apr 2025).
3. Blocked-Collapsed Gibbs Sampling
Posterior inference leverages a blocked-collapsed Gibbs sampler adapted from Neal’s augmentation and the exchangeable-partition perspective:
- Cluster Assignments ($z_i$): each indicator is resampled from
$$p(z_i = k \mid z_{-i}, x, \theta) \propto \bigl(n_k^{-i} + \alpha\bigr) \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$
for currently occupied components $k$ (with $n_k^{-i}$ the size of cluster $k$ excluding observation $i$), plus a term for potentially creating a new cluster, weighted by the prior and the repulsive function.
- Mixture Weights ($\pi$): by Dirichlet–multinomial conjugacy,
$$\pi \mid z \sim \mathrm{Dirichlet}(\alpha + n_1, \dots, \alpha + n_K).$$
- Component Parameters ($\theta_k = (\mu_k, \Sigma_k)$): the full conditional is
$$p(\mu_k, \Sigma_k \mid \cdot) \propto \Bigl[\prod_{i : z_i = k} \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\Bigr] \pi_0(\mu_k, \Sigma_k) \, h(\theta_1, \dots, \theta_K).$$
As $h$ couples all $\theta_k$, updates for $(\mu_k, \Sigma_k)$ are performed via Metropolis–Hastings steps, using the conjugate posterior of the non-repulsive model as the proposal and adjusting for the change in $h$ (see the sketch below). Unused components are marginalized out, maintaining label-mixing and efficiency.
This MCMC approach enables practical inference while respecting the complex geometry encoded by the Wasserstein repulsion (Huang et al., 30 Apr 2025).
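A minimal sketch of that Metropolis–Hastings step follows. Here `niw_posterior_sample` is a hypothetical helper that draws from the conjugate normal-inverse-Wishart posterior of the non-repulsive model (assuming a NIW base prior $\pi_0$), and `log_repulsion` comes from the earlier snippet.

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_update_component(k, x, z, mus, Sigmas, niw_posterior_sample, tau=1.0):
    """One MH update of (mu_k, Sigma_k) with the conjugate posterior as proposal.

    Because the proposal is exactly the conjugate posterior of the
    non-repulsive model, the likelihood and base-prior terms cancel in the
    acceptance ratio, leaving only the change in the repulsive factor h.
    """
    mu_new, Sigma_new = niw_posterior_sample(x[z == k])   # independence proposal
    mus_new = [m.copy() for m in mus]
    Sigmas_new = [S.copy() for S in Sigmas]
    mus_new[k], Sigmas_new[k] = mu_new, Sigma_new
    log_accept = log_repulsion(mus_new, Sigmas_new, tau) - log_repulsion(mus, Sigmas, tau)
    if np.log(rng.uniform()) < log_accept:
        return mus_new, Sigmas_new   # accept: proposed parameters improve/retain h
    return mus, Sigmas               # reject: keep current values
```

This design choice keeps the acceptance ratio equal to $h(\theta')/h(\theta)$, so the intractable prior normalization never enters the computation.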
4. Theoretical Properties and Empirical Results
The Wasserstein-repulsive BGMM (WRGM) achieves nonparametric density estimation with rigorous contraction rates. The use of the full Wasserstein metric leads to several empirical and theoretical distinctions:
- Empirical Evaluation:
- In simulations with overlapping or anisotropic clusters, WRGM yields higher log-conditional predictive ordinate (log-CPO) and more accurate MAP clustering compared to mean-repulsive and mixture-of-finite-mixtures models.
- Because repulsion is enforced in terms of the Wasserstein distance, WRGM allows smaller mean separation when covariance matrices already differ significantly, avoiding unnecessary over-separation of cluster means (a small numeric illustration follows this list).
- On real datasets (A1, GvHD), WRGM outperforms mean-repulsive and standard MFM models in predictive metrics and cluster recovery, often identifying more components but with smaller minimum pairwise mean distances—reflecting full-distribution repulsion (Huang et al., 30 Apr 2025).
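To make the mean-separation point concrete, here is a small numeric check using `w2_squared` from the first sketch: two components with identical means but very different covariances are still far apart in $W_2$, so the prior need not push their means apart.

```python
import numpy as np

mu = np.zeros(2)
Sigma_narrow = 0.1 * np.eye(2)   # tight, nearly point-mass component
Sigma_wide = 4.0 * np.eye(2)     # diffuse component

print(w2_squared(mu, Sigma_narrow, mu, Sigma_wide))
# For commuting covariances the Bures term reduces to
# sum_d (sqrt(0.1) - sqrt(4))^2 ~ 2.84 per dimension, so W2^2 ~ 5.67
# despite zero mean separation -- g(W2^2) is already large.
```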
5. Relation to Other Repulsive and Overfitted Mixture Priors
- Mean-based Repulsion: The classical Bayesian Repulsive Gaussian Mixture Model (RGM) penalizes only proximity of component means (Xie et al., 2017), whereas WRGM penalizes proximity of the full Gaussian laws using the 2-Wasserstein metric, incorporating both mean and covariance structure.
- MFM and Overfitting Control: Standard mixtures of finite mixtures and Dirichlet process mixtures can overestimate the number of components $K$ in the presence of only weak separation, especially as the sample size $n$ grows. The repulsive prior in WRGM shrinks redundant components, improving model parsimony and interpretability.
- General BGMM Framework: The WRGM is fully compatible with the BGMM paradigm, where data, allocation variables, mixture weights, and component parameters are all equipped with conjugate or structured priors, and inference proceeds via marginal or joint data augmentation and posterior exploration (Grün et al., 7 Jul 2024, Lu, 2021).
6. Implementation Details and Practical Guidance
- Hyperparameters: The parameter of the repulsive function $g$ (e.g., the scale $\tau$ in the choice above) governs the scale of the repulsion penalty; it should be set in relation to typical inter-component Wasserstein distances.
- Initialization: Efficient posterior sampling is achieved by initializing with solutions from standard EM or k-means (see the sketch after this list), followed by collapsed Gibbs sampling, with post-processing of labels to address potential label switching.
- Scalability and Complexity: The blocked-collapsed Gibbs sampler is efficient in moderate dimension; the main computational cost arises in the Metropolis–Hastings step for component-wise sampling under the coupled prior. The prior's normalization constant need not be computed, since the acceptance ratio depends only on the unnormalized repulsive factor.
- Extension: The WRGM framework handles both diagonal and full covariances, is robust to component overlap, and provides interpretable cluster assignments and credible regions for downstream statistical analysis (Huang et al., 30 Apr 2025).
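A sketch of the k-means initialization strategy, assuming scikit-learn is available; `initialize_from_kmeans` and the sampler driver `run_gibbs` are hypothetical names for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def initialize_from_kmeans(x, K, seed=0):
    """Convert a k-means fit into initial (z, pi, mus, Sigmas) for the Gibbs sampler."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(x)
    z = km.labels_
    pi = np.bincount(z, minlength=K) / len(z)
    mus = [x[z == k].mean(axis=0) for k in range(K)]
    # Regularize empirical covariances so every component starts positive definite
    Sigmas = [np.cov(x[z == k], rowvar=False) + 1e-6 * np.eye(x.shape[1])
              for k in range(K)]
    return z, pi, mus, Sigmas

# z0, pi0, mus0, Sigmas0 = initialize_from_kmeans(x, K=5)
# samples = run_gibbs(x, z0, pi0, mus0, Sigmas0)   # hypothetical sampler driver
```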
7. Impact and Future Perspectives
The introduction of Wasserstein-based repulsive priors in BGMMs extends the capacity of Bayesian nonparametrics to enforce global separation among mixture components. This approach improves clustering, density estimation, and uncertainty quantification, especially in scenarios with overlapping or heteroscedastic clusters. The method unifies flexibility, theoretical guarantees (posterior contraction, control of the prior's normalizing constant), and computational feasibility in a single framework.
Future research directions include scalable adaptations to very high-dimensional data, further refinements of the repulsive function to accommodate mixed modalities or hierarchical structures, and rigorous assessment of model selection uncertainty under more general metric or kernel-based repulsions.
References:
- Bayesian Wasserstein Repulsive Gaussian Mixture Models (Huang et al., 30 Apr 2025)
- Bayesian Repulsive Gaussian Mixture Model (Xie et al., 2017)
- Bayesian Finite Mixture Models (Grün et al., 7 Jul 2024)