Bayesian Gaussian Mixture Models with Wasserstein Repulsion

Updated 17 December 2025
  • Bayesian Gaussian Mixture Models are probabilistic frameworks that blend Gaussian mixtures with Bayesian priors to capture uncertainty and perform robust clustering.
  • Incorporating Wasserstein-based repulsion encourages well-separated clusters by penalizing similarity in both means and covariances, leading to improved density estimation.
  • The method utilizes a blocked-collapsed Gibbs sampler for efficient posterior inference, achieving competitive results in both simulated and real high-dimensional datasets.

A Bayesian Gaussian Mixture Model (BGMM) defines a mixture distribution over the observed data, with Gaussian components whose parameters and mixture proportions are endowed with prior distributions to capture uncertainty and support inference. In recent developments, priors have been constructed to incorporate repulsion between components, encouraging well-separated clusters by leveraging global geometric information such as the Wasserstein distance. Such approaches enable more robust density estimation, clustering, and model complexity control, particularly in high-dimensional and nonparametric regimes.

1. Model Construction: Wasserstein-Repulsive BGMM

Let $y_1,\dots,y_n \in \mathbb{R}^p$ denote observed data. The BGMM assumes a finite mixture of $K$ Gaussian components with latent indicators $z_i \in \{1,\dots,K\}$ representing cluster assignments. The likelihood is

$$p(y_i \mid z_i = k, \{m_\ell, \Sigma_\ell\}_{\ell=1}^K) = \phi(y_i \mid m_k, \Sigma_k)$$

where $\phi(\cdot \mid m, \Sigma)$ is the Gaussian density with mean $m$ and covariance $\Sigma$.

The prior on mixture proportions is Dirichlet: $(w_1,\dots,w_K) \sim \mathrm{Dirichlet}(\beta,\dots,\beta)$, with $P(z_i = k \mid w) = w_k$.
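To make this generative structure concrete, the following minimal NumPy sketch draws weights, allocations, and observations for a fixed $K$; the particular values of $K$, $p$, $n$, $\beta$, and the component parameters are illustrative assumptions, not part of the model specification.

```python
# Minimal sketch of the BGMM generative process for fixed K (illustration only;
# the component parameters below would carry priors in the full model).
import numpy as np

rng = np.random.default_rng(0)
K, p, n, beta = 3, 2, 500, 1.0

means = rng.normal(scale=5.0, size=(K, p))       # example component means
covs = np.stack([np.eye(p) for _ in range(K)])   # example component covariances

# Mixture weights: (w_1, ..., w_K) ~ Dirichlet(beta, ..., beta).
w = rng.dirichlet(np.full(K, beta))

# Allocations z_i ~ Categorical(w) and observations y_i | z_i = k ~ N(m_k, Sigma_k).
z = rng.choice(K, size=n, p=w)
y = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])
```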

Standard BGMMs specify independent priors on the component parameters $\theta_k = (m_k, \Sigma_k)$. In the Wasserstein-repulsive model, the joint prior is

$$p(\theta_1,\dots,\theta_K \mid K) = \frac{1}{Z_K} \prod_{k=1}^K p_m(m_k)\, p_\Sigma(\Sigma_k)\, h_K(\theta_1,\dots,\theta_K)$$

where $h_K$ introduces repulsion via

$$h_K(\theta_1,\ldots,\theta_K) = \min_{1\le j<\ell\le K} g\left(W_2^2\bigl(N(m_j,\Sigma_j),\, N(m_\ell,\Sigma_\ell)\bigr)\right)$$

or its geometric-mean variant, with $g: [0,\infty) \to [0,1]$ strictly increasing (e.g., $g(x) = x/(g_0 + x)$). Here $W_2^2$ is the squared 2-Wasserstein distance between multivariate normals,

$$W_2^2\bigl(N(m_0,\Sigma_0),\, N(m_1,\Sigma_1)\bigr) = \|m_1 - m_0\|^2 + \operatorname{Tr}\!\left(\Sigma_0 + \Sigma_1 - 2\bigl(\Sigma_0^{1/2} \Sigma_1 \Sigma_0^{1/2}\bigr)^{1/2}\right).$$

The normalization constant $Z_K$ is intractable but controlled: its logarithm grows at most linearly in $K$ (Huang et al., 30 Apr 2025).
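Both the closed-form distance and the repulsion factor are inexpensive to evaluate. The sketch below, assuming the rational choice $g(x) = x/(g_0 + x)$ with $g_0$ a user-chosen scale (the function names are illustrative), computes $W_2^2$ between two Gaussians and the pairwise-minimum repulsion term $h_K$.

```python
# Sketch of the squared 2-Wasserstein distance between Gaussians and the
# pairwise-minimum repulsion factor, assuming g(x) = x / (g0 + x).
import numpy as np
from itertools import combinations
from scipy.linalg import sqrtm

def w2_squared(m0, S0, m1, S1):
    """Closed-form squared 2-Wasserstein distance between N(m0,S0) and N(m1,S1)."""
    root0 = sqrtm(S0)
    cross = np.real(sqrtm(root0 @ S1 @ root0))   # discard tiny imaginary noise
    return float(np.sum((m1 - m0) ** 2) + np.trace(S0 + S1 - 2.0 * cross))

def repulsion(means, covs, g0=1.0):
    """h_K: minimum over component pairs of g(W_2^2), with g(x) = x/(g0 + x)."""
    dists = [w2_squared(means[j], covs[j], means[l], covs[l])
             for j, l in combinations(range(len(means)), 2)]
    return min(x / (g0 + x) for x in dists)
```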

2. Posterior Structure and Inference

The joint posterior over $\Theta = \{\theta_1,\dots,\theta_K\}$, $w$, and $z$ is

$$p(\Theta, w, z \mid y_{1:n}) \propto \left[\prod_{i=1}^n w_{z_i}\, \phi(y_i \mid \theta_{z_i})\right] \times \mathrm{Dir}(w \mid \beta) \times \frac{1}{Z_K} \prod_{k=1}^K p_m(m_k)\, p_\Sigma(\Sigma_k)\, h_K(\Theta)$$

or, grouping by component $k$ with $n_k = \#\{i : z_i = k\}$,

$$p(\Theta, w, z \mid y) \propto \prod_{k=1}^K w_k^{n_k+\beta-1} \prod_{k=1}^K \prod_{i:z_i=k} \phi(y_i \mid \theta_k) \prod_{k=1}^K p_m(m_k)\, p_\Sigma(\Sigma_k)\, \frac{h_K(\Theta)}{Z_K}$$
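For intuition, this unnormalized posterior is easiest to work with on the log scale, summing the likelihood, allocation, weight-prior, base-prior, and repulsion contributions; the intractable $Z_K$ is dropped because it is constant in the parameters for fixed $K$. In the sketch below the base priors and the repulsion function are passed as callables; this interface, and the function names, are assumptions made for illustration.

```python
# Sketch of the unnormalized log posterior for fixed K; log_pm, log_pSigma and
# h_fn (e.g. the repulsion sketch above) are user-supplied callables.
import numpy as np
from scipy.stats import dirichlet, multivariate_normal

def log_unnormalized_posterior(y, z, w, means, covs, beta, log_pm, log_pSigma, h_fn):
    K = len(means)
    # Likelihood: sum_i log phi(y_i | m_{z_i}, Sigma_{z_i}).
    loglik = sum(multivariate_normal.logpdf(y[z == k], means[k], covs[k]).sum()
                 for k in range(K) if np.any(z == k))
    # Allocation term prod_i w_{z_i} and Dirichlet prior on the weights.
    logw = np.sum(np.log(w[z])) + dirichlet.logpdf(w, np.full(K, beta))
    # Independent base priors plus the coupled Wasserstein repulsion factor.
    logprior = sum(log_pm(means[k]) + log_pSigma(covs[k]) for k in range(K))
    return loglik + logw + logprior + np.log(h_fn(means, covs))
```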

Posterior contraction is established under minimal moment and support conditions on the true density and the repulsion function. Specifically, the posterior contracts in $L_1$ at rate

$$\epsilon_n = \frac{(\log n)^t}{\sqrt{n}}, \quad t > \frac{p^2}{2} + p + \frac{\alpha + 2}{4},$$

i.e., $\Pi(\|f - f_0\|_1 > M\epsilon_n \mid y) \to 0$ in $P_{f_0}$-probability, removing the simultaneous-diagonalization requirement on covariances present in mean-repulsive priors (Huang et al., 30 Apr 2025).

3. Blocked-Collapsed Gibbs Sampling

Posterior inference leverages a blocked-collapsed Gibbs sampler adapted from Neal’s augmentation and the exchangeable-partition perspective:

  • Cluster Assignments ($z$):

$$p(z_i = k \mid z_{-i}, \Theta, w, y) \propto (n_{-i,k} + \beta)\, \phi(y_i \mid \theta_k)$$

plus a term for potentially creating a new cluster, weighted by the prior and the repulsive function; here $n_{-i,k}$ is the number of observations other than $y_i$ currently assigned to component $k$.

  • Mixture Weights ($w$):

$$w \mid z \sim \mathrm{Dirichlet}(n_1+\beta, \dots, n_K+\beta)$$

  • Component Parameters ($\theta_k$):

The full conditional is

$$p(\theta_k \mid y_{(k)}, \theta_{-k}, z) \propto p_m(m_k)\, p_\Sigma(\Sigma_k) \prod_{i:z_i=k} \phi(y_i \mid m_k, \Sigma_k)\, h_K(\theta_1,\dots,\theta_K)$$

Because $h_K$ couples all $\theta_j$, updates for $\theta_k$ are performed via Metropolis–Hastings steps, using the conjugate posterior as the proposal and adjusting for the change in $h_K$. Unused components are marginalized out, which maintains good label mixing and sampling efficiency.

This MCMC approach enables practical inference while respecting the complex geometry encoded by the Wasserstein repulsion (Huang et al., 30 Apr 2025).
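A minimal sketch of the Metropolis–Hastings correction described above: `draw_conjugate` is a hypothetical helper standing in for a draw of $(m_k, \Sigma_k)$ from the standard conjugate full conditional that ignores $h_K$. Because the proposal equals the non-repulsive conditional, its density cancels in the acceptance ratio, which reduces to a ratio of repulsion terms.

```python
# Schematic Metropolis-Hastings update for component k under the coupled prior.
# draw_conjugate(y_k) is a hypothetical helper returning a proposal (m_k, Sigma_k)
# from the conjugate full conditional without h_K; h_fn evaluates the repulsion.
import numpy as np

def mh_update_component(k, means, covs, y_k, draw_conjugate, h_fn, rng):
    m_prop, S_prop = draw_conjugate(y_k)                        # conjugate proposal
    means_prop = [m_prop if j == k else m for j, m in enumerate(means)]
    covs_prop = [S_prop if j == k else S for j, S in enumerate(covs)]
    # Accept with probability min(1, h_K(proposed) / h_K(current)).
    if rng.uniform() < h_fn(means_prop, covs_prop) / h_fn(means, covs):
        return means_prop, covs_prop
    return means, covs
```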

4. Theoretical Properties and Empirical Results

The Wasserstein-repulsive BGMM (WRGM) achieves nonparametric density estimation with rigorous contraction rates. The use of the full Wasserstein metric leads to several empirical and theoretical distinctions:

  • Empirical Evaluation:
    • In simulations with overlapping or anisotropic clusters, WRGM yields higher log-conditional predictive ordinate (log-CPO) and more accurate MAP clustering compared to mean-repulsive and mixture-of-finite-mixtures models.
    • Because repulsion is enforced in terms of the Wasserstein distance, WRGM allows smaller mean separation when covariance matrices already differ significantly, avoiding unnecessary over-separation of cluster means.
    • On real datasets (A1, GvHD), WRGM outperforms mean-repulsive and standard MFM models in predictive metrics and cluster recovery, often identifying more components but with smaller minimum pairwise mean distances—reflecting full-distribution repulsion (Huang et al., 30 Apr 2025).

5. Relation to Other Repulsive and Overfitted Mixture Priors

  • Mean-based Repulsion: The classical Bayesian Repulsive Gaussian Mixture Model (RGM) penalizes only proximity of component means (Xie et al., 2017), whereas WRGM penalizes proximity of the full Gaussian laws using the 2-Wasserstein metric, incorporating both mean and covariance structure.
  • MFM and Overfitting Control: Standard mixtures of finite mixtures and Dirichlet process mixtures can overestimate $K$ when components are only weakly separated, especially as $n$ grows. The repulsive prior in WRGM shrinks redundant components, improving model parsimony and interpretability.
  • General BGMM Framework: The WRGM is fully compatible with the BGMM paradigm, where data, allocation variables, mixture weights, and component parameters are all equipped with conjugate or structured priors, and inference proceeds via marginal or joint data augmentation and posterior exploration (Grün et al., 7 Jul 2024, Lu, 2021).

6. Implementation Details and Practical Guidance

  • Hyperparameters: The Wasserstein-repulsive function parameter $g_0$ governs the scale of the repulsion penalty; it should be set in relation to typical inter-component distances.
  • Initialization: Efficient posterior sampling is achieved by initializing from a standard EM or k-means solution, followed by collapsed Gibbs updates, with post-processing of labels to address potential label switching (see the sketch after this list).
  • Scalability and Complexity: The blocked-collapsed Gibbs sampler is efficient in moderate dimensions; the main computational cost arises in the Metropolis–Hastings step for component-wise sampling under the coupled prior. The normalization constant $Z_K$ need not be computed.
  • Extension: The WRGM framework handles both diagonal and full covariances, is robust to component overlap, and provides interpretable cluster assignments and credible regions for downstream statistical analysis (Huang et al., 30 Apr 2025).
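As a concrete illustration of the initialization strategy mentioned above, the sketch below builds a starting state from a k-means fit; the use of scikit-learn, the fixed number of initial clusters, and the covariance regularization constant are assumptions made only for this example.

```python
# Illustrative sampler initialization from k-means (assumes each initial cluster
# receives at least two points so the empirical covariance is well defined).
import numpy as np
from sklearn.cluster import KMeans

def initialize_state(y, K_init, rng):
    km = KMeans(n_clusters=K_init, n_init=10, random_state=0).fit(y)
    z = km.labels_
    means = [km.cluster_centers_[k] for k in range(K_init)]
    # Regularized empirical covariances as starting values for Sigma_k.
    covs = [np.cov(y[z == k], rowvar=False) + 1e-6 * np.eye(y.shape[1])
            for k in range(K_init)]
    # Rough draw for the weights based on the initial cluster sizes.
    w = rng.dirichlet(np.bincount(z, minlength=K_init) + 1.0)
    return z, w, means, covs
```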

7. Impact and Future Perspectives

The introduction of Wasserstein-based repulsive priors in BGMMs extends the capacity of Bayesian nonparametrics to enforce global separation among mixture components. This approach improves clustering, density estimation, and uncertainty quantification, especially in scenarios with overlapping or heteroscedastic clusters. The method unifies flexibility, theoretical guarantees (posterior contraction, control of $K$), and computational feasibility in a single framework.

Future research directions include scalable adaptations to very high-dimensional data, further refinements of the repulsive function to accommodate mixed modalities or hierarchical structures, and rigorous assessment of model selection uncertainty under more general metric or kernel-based repulsions.

