VBx Clustering in Speaker Diarization

Updated 24 October 2025
  • VBx clustering is a Bayesian framework that models sequences of speaker embeddings with an HMM whose emission distributions are defined by PLDA, providing a principled treatment of clustering-based diarization.
  • It integrates with end-to-end neural diarization pipelines by filtering unreliable embeddings and employing constrained reassignment for improved accuracy.
  • Extensions like MS-VBx enhance multi-stream and overlap-aware diarization, reducing error rates and refining speaker count estimation in complex scenarios.

VBx clustering refers to a family of methods built upon Variational Bayesian Hidden Markov Models (VB-HMM), widely used in speaker diarization and recently extended to multi-stream and end-to-end neural diarization paradigms. The VBx approach models the generative process of sequences of speaker embeddings (most commonly x-vectors extracted from speech) using Bayesian inference, PLDA modeling of speaker distributions, and probabilistic assignment of embeddings to speaker clusters. While initially developed for classic clustering-based diarization, VBx has been adapted to two-stage EEND-VC pipelines to improve robustness, accuracy, and generalization in scenarios with many speakers and short speaking durations (Pálka et al., 22 Oct 2025, Landini et al., 2020, Delcroix et al., 2023, Serafini et al., 2023).

1. Theoretical Foundation and Probabilistic Model

The central model in VBx clustering is a Bayesian HMM in which each hidden state corresponds to a speaker, and the observed sequence consists of x-vectors or speaker embeddings. Speaker-specific emission distributions are parameterized by a PLDA model, where each speaker's latent vector is distributed according to a prior (usually standard Gaussian), and the generative process of the observed embedding $\mathbf{x}_t$ is modeled by:

$$p(\mathbf{x}_t \mid z_t = s, \mathbf{y}_s) = \mathcal{N}(\mathbf{x}_t;\, V\mathbf{y}_s,\, \Sigma)$$

where $V$ is the PLDA transformation matrix, $\mathbf{y}_s$ is the latent eigenvoice vector, and $\Sigma$ is the within-speaker covariance (Landini et al., 2020).

Transition probabilities in the HMM (with self-loop parameter $P_{\text{loop}}$) model speaker turn durations. The full generative model incorporates priors over speaker latent variables, an ergodic HMM topology, and explicit transition densities.
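
To make this concrete, the following minimal sketch (Python with NumPy/SciPy; function and variable names are illustrative, not taken from any cited implementation) builds the ergodic transition matrix with self-loop probability $P_{\text{loop}}$ and evaluates the PLDA emission log-likelihoods:

```python
import numpy as np
from scipy.stats import multivariate_normal

def build_transition_matrix(pi, p_loop):
    """Ergodic VBx-style transitions: stay on the current speaker with
    probability p_loop; otherwise jump to speaker s' in proportion to
    its prior pi[s']."""
    S = len(pi)
    A = (1.0 - p_loop) * np.tile(pi, (S, 1))  # (S, S) jump probabilities
    A[np.diag_indices(S)] += p_loop           # self-loops model turn durations
    return A                                  # rows sum to 1 when pi does

def emission_log_likelihoods(X, V, y, Sigma):
    """log N(x_t; V y_s, Sigma) for every frame t and speaker s.
    X: (T, D) embeddings, V: (D, R) eigenvoice matrix, y: (S, R) latents."""
    means = y @ V.T                           # (S, D) per-speaker means
    return np.stack(
        [multivariate_normal.logpdf(X, mean=m, cov=Sigma) for m in means],
        axis=1)                               # (T, S)
```

A forward–backward pass over these two quantities yields the frame-level responsibilities used in the variational updates described next.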

Variational Bayesian inference is used, with joint posteriors over speaker assignments $z_t$ and speaker representations $\mathbf{y}_s$. The approximate posteriors are:

  • $q^*(\mathbf{y}_s) = \mathcal{N}(\alpha_s, L_s^{-1})$, where $\alpha_s$ and $L_s$ are iteratively updated via closed-form equations (sketched below).
  • Frame-level responsibilities $\gamma_{ts}$ are computed via a forward–backward algorithm.
  • Speaker priors $\pi_s$ are updated in a maximum-likelihood type-II fashion.

This probabilistic treatment supports automatic relevance determination, allowing the model to prune redundant speaker states and adapt cluster counts (Landini et al., 2020).
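
As a rough illustration of the closed-form update for $q^*(\mathbf{y}_s)$, the sketch below follows the standard VB-HMM update equations (Landini et al., 2020) up to notation; shapes and names are assumptions for readability:

```python
import numpy as np

def update_speaker_posteriors(X, gamma, V, Sigma_inv):
    """One VB update of q*(y_s) = N(alpha_s, L_s^{-1}) for each speaker.
    X: (T, D) embeddings, gamma: (T, S) responsibilities,
    V: (D, R) eigenvoice matrix, Sigma_inv: (D, D) within-speaker precision."""
    S, R = gamma.shape[1], V.shape[1]
    VtSi = V.T @ Sigma_inv                      # (R, D), shared by all speakers
    alphas = np.empty((S, R))
    precisions = np.empty((S, R, R))
    for s in range(S):
        n_s = gamma[:, s].sum()                 # soft frame count for speaker s
        L_s = np.eye(R) + n_s * (VtSi @ V)      # posterior precision
        f_s = VtSi @ (gamma[:, s] @ X)          # responsibility-weighted stats
        alphas[s] = np.linalg.solve(L_s, f_s)   # posterior mean
        precisions[s] = L_s
    return alphas, precisions
```

Alternating this update with the forward–backward recomputation of $\gamma_{ts}$ and the type-II update of $\pi_s$ gives the full inference loop; speakers whose priors collapse toward zero are pruned, which implements the relevance determination just described.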

2. Integration with Two-Stage End-to-End Neural Diarization (EEND-VC)

In recent diarization systems, VBx has been integrated into two-stage EEND-VC pipelines to improve robustness in high-speaker-count and short-duration scenarios (Pálka et al., 22 Oct 2025, Delcroix et al., 2023). The pipeline consists of:

  • Stage 1: Conformer-based EEND model with WavLM features infers local, frame-level speaker activity in overlapping windows (speech regions are chunked).
  • Stage 2: Speaker embeddings are extracted per local speaker track; clustering is applied across windows to resolve global speaker identities.

Traditional systems rely on agglomerative hierarchical clustering (AHC) for global speaker label assignment. VBx replaces or supplements AHC, providing robustness to large numbers of speakers and minimizing spurious cluster formation due to short, possibly low-quality speech segments. When adapting VBx to this framework, two modifications are crucial:

  • The sequential assumption of the HMM is removed by setting $P_{\text{loop}} = 0$, reducing the model to a Gaussian Mixture Model (GMM) suitable for non-contiguous embeddings.
  • Reliability filtering: only sufficiently long, non-overlapping segments are initially used for centroid estimation; unreliable short-segment embeddings are later reassigned via constrained assignment (Hungarian matching; see the sketch below) (Pálka et al., 22 Oct 2025).

This “degraded” VBx variant offers improved cluster stability and speaker count estimation, especially when combined with embedding filtering and constrained reassignment.
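
A minimal sketch of the constrained reassignment step is given below, using SciPy's Hungarian solver; the cosine-distance cost and all names are assumptions for illustration, not the exact formulation of Pálka et al. (22 Oct 2025):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def constrained_reassignment(chunk_embs, centroids):
    """Map the local-speaker embeddings of one chunk onto distinct
    global clusters (cannot-link within a chunk).
    chunk_embs: (K, D) embeddings from a single chunk;
    centroids:  (S, D) centroids from reliable segments, with K <= S."""
    E = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    C = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    cost = 1.0 - E @ C.T                       # (K, S) cosine distances
    rows, cols = linear_sum_assignment(cost)   # one-to-one, minimum total cost
    return dict(zip(rows.tolist(), cols.tolist()))  # local index -> cluster
```

The one-to-one constraint guarantees that two local speakers from the same chunk are never merged into a single global cluster, something an unconstrained nearest-centroid rule cannot ensure.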

3. Multi-Stream and Overlap-Aware VBx Extensions

In overlap-intensive, multi-speaker diarization (e.g., meetings, telephone conversations), EEND-VC systems generate multiple speaker-embedding streams per chunk. MS-VBx extends the classic VBx by:

  • Allowing HMM states to represent sets of speakers (multi-stream modeling).
  • Tying Gaussian parameters across HMM sub-states for the same global speaker.
  • Enforcing "cannot-link" constraints to prevent multiple embeddings from the same chunk being assigned to the same cluster.

Joint inference is performed over multi-stream sequence data with tied latent speaker variables and priors, extending the standard forward–backward and VB update equations (Delcroix et al., 2023). MS-VBx has demonstrated reduced diarization error rates and improved speaker counting compared to constrained AHC (cAHC), leveraging the Bayesian framework's capacity to represent cluster uncertainty in complex, overlapping speech mixtures.
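
The parameter tying can be illustrated by pooling sufficient statistics across streams before the shared speaker update; this is a conceptual sketch under assumed shapes (the full MS-VBx model additionally masks responsibilities to enforce the cannot-link constraints):

```python
import numpy as np

def pool_tied_statistics(gammas, X_streams):
    """Pool zeroth- and first-order statistics over all streams whose
    HMM sub-states are tied to the same global speakers.
    gammas:    list of (T, S) responsibility matrices, one per stream;
    X_streams: list of (T, D) embedding sequences, one per stream."""
    n = sum(g.sum(axis=0) for g in gammas)               # (S,) tied soft counts
    f = sum(g.T @ x for g, x in zip(gammas, X_streams))  # (S, D) tied stats
    # n and f feed the same q*(y_s) update as in single-stream VBx, so each
    # global speaker is estimated jointly from all streams that contain it.
    return n, f
```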

4. Computational Considerations and Practical Performance

VBx clustering is moderately computationally intensive due to iterative variational updates, forward–backward recursions, and EM-style updates. For classic clustering-based diarization in conversational telephone speech (CTS) scenarios:

  • Diarization Error Rate (DER): ~11.6%
  • Real-Time Factor (RTF): ≈0.13
  • RAM: ~627 MB (Serafini et al., 2023)

In EEND-VC pipelines with VBx, comparable or lower DERs are achieved, matching or exceeding recent state-of-the-art systems. For example, DER ≈ 13.5% was reported on compound benchmarks without dataset-specific fine-tuning (Pálka et al., 22 Oct 2025). By filtering unreliable embeddings and employing constrained reassignment, precision in speaker identity mapping is further improved.

End-to-end methods such as EEND-VC (with VBx or MS-VBx clustering) are faster at inference and more robust in high-overlap scenarios, though the consistency and generalization of VBx remain strong advantages.

5. Comparison with Alternative Diarization Approaches

  • AHC + PLDA: Classical cluster assignment is sensitive to initialization and to spurious short-segment embeddings, and it does not benefit from Bayesian relevance determination. VBx improves on AHC by refining its initial clusters with soft VB assignments and prior-based pruning.
  • EEND-VC (DNN-based + cAHC): EEND-VC with cAHC demonstrates strong accuracy, but MS-VBx clustering outperforms cAHC, reducing both DER and speaker-counting error (e.g., DER 11.1% → 10.4%, mean counting error 1.2 → 0.6 on CALLHOME) (Delcroix et al., 2023).
  • Self-Attentive EEND and SSGD: These end-to-end systems offer efficiency and direct overlap modeling, but performance may degrade on long, sparse conversations and they require substantial annotated data. VBx exhibits more consistent accuracy across datasets and domains (Serafini et al., 2023).

6. Domain Adaptation and Generalization

VBx clustering has demonstrated robustness without fine-tuning not only across conversational telephone benchmarks (CALLHOME, Fisher, CallCntrITA, CallCntrPOR), but also in complex meeting, news, and multi-lingual datasets (AMI, AISHELL-4, AliMeeting, DIHARD3, MSDWild, VoxConverse) (Pálka et al., 22 Oct 2025). The system generalizes satisfactorily when filtering, GMM reduction, and constrained assignment are properly configured, suggesting that Bayesian clustering frameworks—though moderately more computationally demanding than pure DNN approaches—are advantageous for diverse, real-world diarization scenarios where speaker durations and speaking styles are highly variable.

7. Limitations and Implementation Challenges

  • VBx (in its HMM form) is not overlap-aware; methods such as MS-VBx or mixture-model extensions are required for overlap-rich diarization.
  • Performance is dependent on the quality and reliability of input embeddings—short or overlapped segments can degrade clustering precision unless filtered or post-processed.
  • Model simplification (e.g., GMM reduction by setting $P_{\text{loop}} = 0$) may reduce some benefits of sequence modeling but is necessary for EEND-VC-style non-contiguous input.
  • Initialization (e.g., over-clustering via AHC) remains a significant factor; if the initial cluster assignments are erroneous, subsequent VBx refinement is limited.
  • Full system performance and resource efficiency hinge on careful implementation of filtering, constrained assignment, and backbone neural embedding extractors (such as ResNet101 or Conformer with WavLM).

VBx clustering represents a probabilistically principled approach for sequence embedding diarization, now adapted for neural, multi-stream, and high-speaker-count settings. Its Bayesian relevance determination, integration into EEND-VC pipelines, and demonstrated consistency in performance mark its significance in the evolution of modern diarization methodologies (Pálka et al., 22 Oct 2025, Landini et al., 2020, Delcroix et al., 2023, Serafini et al., 2023).
