EEND-VC: Neural Diarization with Vector Clustering
- The paper introduces a hybrid system that fuses local frame-wise neural predictions with constrained global clustering to resolve label permutation and scale to long recordings.
- It leverages strong pretrained architectures, such as a finetuned WavLM encoder with a Conformer decoder, to robustly handle overlapping speech and improve accuracy.
- It applies constrained clustering techniques such as constrained agglomerative hierarchical clustering (cAHC) and the infinite Gaussian mixture model (iGMM) to ensure consistent speaker labeling even in noisy, multi-speaker environments.
End-to-End Neural Diarization with Vector Clustering (EEND-VC) describes a family of hybrid diarization systems that combines the direct modeling and overlap handling capabilities of end-to-end neural approaches with the robust global speaker labeling provided by clustering speaker embeddings across relatively short chunks of long audio. The pipeline is characterized by a two-stage design: (1) a neural network operates on fixed-length audio segments to predict local per-frame, per-speaker activity as well as speaker embeddings; (2) a constrained clustering algorithm aligns and groups these embeddings globally, resolving the label permutation problem and enabling scalable diarization over arbitrarily long and multi-speaker recordings.
1. The EEND-VC Framework: Overview and Motivation
Traditional speaker diarization typically follows a modular approach involving voice activity detection, speaker embedding extraction, and offline clustering (e.g., agglomerative hierarchical clustering, spectral clustering, or Bayesian HMMs such as VBx) of embeddings (e.g., i-vectors or x-vectors) to produce global speaker labels. These methods are limited in their ability to handle overlapped speech and are not optimized end-to-end for diarization errors.
End-to-end neural diarization (EEND) reformulates diarization as multi-label frame-wise classification, directly predicting a speaker activity matrix for each frame and each speaker via deep architectures such as BLSTM, Transformer, or Conformer. EEND natively models overlapping speech and enables direct optimization for diarization error. However, EEND faces challenges scaling to long recordings and arbitrary numbers of speakers, in part due to computational constraints and the label ambiguity across processing blocks.
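The multi-label formulation can be made concrete by looking at how an EEND output is decoded: each frame carries an independent activity posterior per speaker, and frames where several posteriors exceed the threshold encode overlap. A minimal sketch (the function name and the 0.5 threshold are illustrative, not taken from the papers):

```python
def frames_to_speakers(posteriors, threshold=0.5):
    """Decode a frame-by-speaker posterior matrix into active-speaker sets.
    Several active speakers in one frame represent overlapped speech;
    an empty set represents silence."""
    return [{s for s, p in enumerate(frame) if p >= threshold}
            for frame in posteriors]

# A 3-frame, 2-speaker example: speaker 0 alone, then overlap, then silence.
decoded = frames_to_speakers([[0.9, 0.1], [0.8, 0.7], [0.2, 0.1]])
```

Because each speaker's activity is an independent binary decision, overlap falls out of the formulation for free, which is what modular VAD-plus-clustering pipelines lack.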
EEND-VC (Kinoshita et al., 2020, Kinoshita et al., 2021) addresses these limitations by:
- Splitting the input audio into chunks/windows for local EEND inference.
- Predicting per-chunk speaker activity and extracting local speaker embeddings.
- Globally clustering these embeddings using constrained clustering (cannot-link), producing consistent speaker labels across all chunks.
- Allowing arbitrary total speaker count even when each chunk is limited to S_local simultaneous speakers.
This approach combines the overlap-aware, end-to-end modeling of EEND with the robustness and scalability of clustering.
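The chunk-then-cluster control flow described above can be sketched end to end. Here `local_eend` and `cluster` stand in for the neural segmenter and the constrained clustering back-end; all names in this sketch are illustrative:

```python
# Hypothetical end-to-end sketch of the EEND-VC two-stage pipeline.
# `local_eend` and `cluster` are placeholders for the neural segmenter
# and the constrained clustering back-end.
def eend_vc_pipeline(audio, chunk_len, local_eend, cluster):
    # Stage 1: split the recording into fixed-length chunks and run
    # local inference on each, collecting activities and embeddings.
    chunks = [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]
    local_activities, embeddings, chunk_ids = [], [], []
    for c_idx, chunk in enumerate(chunks):
        acts, embs = local_eend(chunk)  # frame x S_local activities, embeddings
        local_activities.append(acts)
        for e in embs:
            embeddings.append(e)
            chunk_ids.append(c_idx)  # remember origin for cannot-link constraints
    # Stage 2: cluster embeddings globally; embeddings sharing a chunk id
    # must not be merged (they are different speakers by construction).
    global_labels = cluster(embeddings, cannot_link=chunk_ids)
    return local_activities, global_labels
```

The key design point is that the clustering stage receives, along with the embeddings, the chunk each one came from: that is all the information the cannot-link constraints need.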
2. Neural Segmenter Models and Local Predictions
EEND-VC’s first stage is a neural model that, for each chunk of audio, predicts frame-wise speaker activities and extracts a pool of chunk-level speaker embeddings. Variants have been built using self-attention, Conformer, BLSTM, or newer decoders (e.g., Mamba) (Plaquet et al., 13 Jun 2025, Liu et al., 2021). Recent work consistently reports the best performance with a finetuned WavLM encoder followed by a Conformer decoder (Plaquet et al., 13 Jun 2025, Pálka et al., 22 Oct 2025), especially for long chunks (e.g., ≥10–30 seconds).
The network is trained using a permutation-invariant binary cross-entropy loss for multi-label classification, combined with a loss encouraging embeddings of the same speaker (across chunks) to be close while remaining distant from those of other speakers (Kinoshita et al., 2020, Kinoshita et al., 2021):

L = L_diar + λ · L_emb,

where L_diar is computed via PIT-BCE (permutation invariance over the local outputs) and L_emb enforces embedding discriminability. With more powerful encoders (e.g., finetuned WavLM), the multilabel loss remains optimal; otherwise, a multiclass powerset loss can speed convergence and boost performance in all but the highest-performing configurations (Plaquet et al., 13 Jun 2025).
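The PIT-BCE term can be illustrated with a brute-force sketch that scores every assignment of predicted speaker columns to reference speakers and keeps the cheapest one. Practical systems find the optimal permutation more efficiently; the function names here are illustrative:

```python
import math
from itertools import permutations

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one posterior/label pair."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def pit_bce(preds, refs):
    """Permutation-invariant BCE: evaluate the multi-label BCE under every
    mapping of predicted speaker columns to reference speakers and return
    the minimum. preds: T x S posteriors; refs: T x S binary labels."""
    n_frames, n_spk = len(preds), len(preds[0])
    best = float("inf")
    for perm in permutations(range(n_spk)):
        loss = sum(bce(preds[t][perm[s]], refs[t][s])
                   for t in range(n_frames) for s in range(n_spk))
        best = min(best, loss / (n_frames * n_spk))
    return best
```

Because the loss is minimized over permutations, the network is never penalized for emitting speakers in a different order than the reference, which is exactly the local label ambiguity the later clustering stage must resolve globally.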
3. Global Clustering: From Constrained AHC to Generative Models
After local inference, EEND-VC uses clustering to associate local speaker embeddings into global speaker tracks.
- Constrained clustering: Constrained agglomerative hierarchical clustering (cAHC), COP-Kmeans, or spectral clustering are employed with “cannot-link” constraints to ensure that embeddings from the same chunk are assigned to different clusters (Kinoshita et al., 2021, Pálka et al., 22 Oct 2025).
- Speaker reassignment: Global label reassignment is performed post-clustering using the Hungarian algorithm to guarantee a bijective mapping between local and global labels within each chunk/window (Pálka et al., 22 Oct 2025).
- Robustness and post-filtering: Filtering out embeddings from short, unreliable segments before clustering improves global assignment reliability, especially in overlapped or sparse conditions (Pálka et al., 22 Oct 2025).
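A cannot-link constrained AHC can be sketched as ordinary average-linkage agglomeration with one extra admissibility check: two clusters may merge only if they share no chunk. This naive O(n³) version is for illustration only (real systems use optimized implementations and calibrated similarity measures):

```python
def constrained_ahc(embs, chunk_ids, threshold):
    """Naive cannot-link agglomerative clustering sketch: average-linkage
    squared-Euclidean AHC in which two clusters may merge only if they
    share no chunk (speakers found in the same chunk are distinct)."""
    clusters = [{i} for i in range(len(embs))]

    def avg_dist(a, b):
        total = sum(sum((x - y) ** 2 for x, y in zip(embs[i], embs[j]))
                    for i in a for j in b)
        return total / (len(a) * len(b))

    def violates_cannot_link(a, b):
        return any(chunk_ids[i] == chunk_ids[j] for i in a for j in b)

    while True:
        best, pair = None, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                if violates_cannot_link(clusters[x], clusters[y]):
                    continue
                d = avg_dist(clusters[x], clusters[y])
                if best is None or d < best:
                    best, pair = d, (x, y)
        if pair is None or best > threshold:
            break  # no admissible merge left below the stopping threshold
        x, y = pair
        clusters[x] |= clusters[y]
        del clusters[y]

    labels = [0] * len(embs)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels
```

Note that the constraint also implicitly determines the minimum number of clusters: a chunk containing S_local active speakers forces at least S_local global speakers.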
Recent advances integrate more powerful generative clustering, such as the infinite GMM (iGMM) (Kinoshita et al., 2022), or multi-stream variants of VBx (Delcroix et al., 2023, Pálka et al., 22 Oct 2025). These models allow nonparametric estimation of speaker count and direct end-to-end differentiable clustering (via unfolded EM steps and ARI-based losses), further reducing speaker confusion error.
The table below summarizes common clustering choices:
| Clustering Algorithm | Key Constraint | Typical Strength |
|---|---|---|
| cAHC / COP-Kmeans | Cannot-link (per chunk) | Simplicity, label consistency |
| Spectral clustering | Cannot-link mask | Nonlinear cluster modeling |
| iGMM (deep-unfolded) | Differentiable, DP prior | Trainable, nonparametric speaker counting |
| Multi-stream VBx (MS-VBx) | HMM/GMM, multi-stream | Robust in overlapping/many-spkr cases |
4. System Enhancements and Performance Analysis
EEND-VC’s accuracy and generalization depend critically on its architectural and algorithmic choices:
- Encoder choice: Finetuned WavLM encoders consistently outperform alternatives; SincNet-based systems lag significantly (Plaquet et al., 13 Jun 2025).
- Decoder: Conformer decoders, with self-attention and convolution, better exploit long-range context than LSTM and are more robust on long chunks.
- Chunk/window length: Longer chunks offer more information for speaker representation but increase local segmentation errors unless modern architectures (Conformer, Mamba) are used (Plaquet et al., 13 Jun 2025). The optimal chunk size balances embedding quality against local prediction accuracy.
- Filtering unreliable embeddings: Removing local speakers with insufficient active speech improves global label assignment.
- Clustering loss and tight integration: Optimizing a continuous Adjusted Rand Index (ARI) loss during clustering further lowers speaker confusion (Kinoshita et al., 2022).
- Global label mapping: Constrained assignment via the Hungarian algorithm ensures permutation consistency and prevents local duplications (Pálka et al., 22 Oct 2025).
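The per-chunk label mapping can be sketched as a linear assignment problem: given a cost for matching each local speaker to each global speaker, pick the bijection with minimum total cost. This sketch brute-forces over permutations for clarity; practical systems use the Hungarian algorithm (e.g., `scipy.optimize.linear_sum_assignment`), and the cost matrix here is a hypothetical input:

```python
from itertools import permutations

def map_local_to_global(cost):
    """Find the bijective local-to-global label mapping with minimum total
    cost for one chunk. cost[l][g] is the mismatch between local speaker l
    and global speaker g (e.g., an embedding distance)."""
    n_local, n_global = len(cost), len(cost[0])
    best_cost, best_map = None, None
    for perm in permutations(range(n_global), n_local):
        c = sum(cost[l][perm[l]] for l in range(n_local))
        if best_cost is None or c < best_cost:
            best_cost, best_map = c, list(perm)
    return best_map, best_cost
```

Because the mapping is a bijection within the chunk, two local speakers can never collapse onto the same global identity, which is what prevents local duplications.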
On compound benchmarks across diverse datasets (meeting, conversational, in-the-wild), EEND-VC equipped with these enhancements achieves or exceeds state-of-the-art performance in DER and speaker counting error without the need for dataset- or domain-specific tuning (Pálka et al., 22 Oct 2025). For example, typical DER improvements relative to conventional EEND and earlier clustering-based pipelines are in the range of 10–30% on challenging long-form or highly overlapped data (Kinoshita et al., 2020, Delcroix et al., 2023, Kinoshita et al., 2021).
5. Streaming, Online, and Multi-channel Extensions
EEND-VC is natively suited for offline long-form diarization, but recent adaptations enable low-latency and multi-channel operation:
- Online streaming: Speaker-tracing buffers and block-wise buffer updates allow chunked processing with consistent global labeling and low latency, using novel buffer frame selection strategies (e.g., weighted KLD) and explicit zero-padding (Xue et al., 2021, Horiguchi et al., 2022).
- Multi-channel, multi-domain integration: Per-channel EEND-VC outputs can be integrated using late fusion techniques such as DOVER-LAP, improving robustness to channel-specific errors; channel-adaptive and self-supervised adaptation (SSA) with pseudo-labels further improves generalization (Tawara et al., 2023).
- Robustness to many speakers and short segments: VBx/GMM-based clustering is particularly effective in conditions with many speakers (≥8) or when per-speaker speech is brief, an area where simple constrained clustering struggles (Pálka et al., 22 Oct 2025).
- Simulated conversations and training datasets: Generating synthetic conversations that mimic real conversational turn-taking and overlap statistics significantly improves EEND(-VC) network generalization and reduces reliance on fine-tuning (Landini et al., 2022, Landini, 27 Jun 2024).
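The speaker-tracing-buffer idea for online operation can be sketched as a column-permutation alignment: the new block's speaker columns are permuted so they best agree with the activities remembered in the buffer over the overlap region, which keeps labels consistent across blocks. The simple agreement count used here is illustrative; the published systems use probabilistic selection criteria (e.g., weighted KLD):

```python
from itertools import permutations

def align_block_to_buffer(buf_acts, block_acts):
    """Speaker-tracing-buffer style alignment sketch: choose the speaker
    permutation for the new block that maximizes agreement with the
    buffered activities over the overlap region, then relabel the block."""
    n_spk = len(buf_acts[0])

    def agreement(perm):
        return sum(1 for t in range(len(buf_acts)) for s in range(n_spk)
                   if buf_acts[t][s] == block_acts[t][perm[s]])

    best = max(permutations(range(n_spk)), key=agreement)
    return [[frame[best[s]] for s in range(n_spk)] for frame in block_acts]
```

This is the online analogue of the global clustering step: instead of clustering all embeddings at once, each incoming block is stitched to the running hypothesis held in the buffer.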
6. Limitations, Open Challenges, and Comparisons
- Training data requirements: EEND-VC, like other EEND variants, requires large volumes of annotated diarization data for optimal performance. Its accuracy degrades when training data is scarce (Serafini et al., 2023, Landini, 27 Jun 2024).
- Chunk size vs. system latency: While longer chunks can enhance global consistency and embedding robustness, they also increase memory use, latency, and the risk of exceeding the output speaker capacity per chunk (Plaquet et al., 13 Jun 2025).
- Noisy/short segments: Embeddings extracted from very short (or entirely overlapping) segments are unreliable; effective filtering is essential for robust clustering (Pálka et al., 22 Oct 2025).
- Scalability to very high speaker counts or extreme overlap: In highly challenging conditions (e.g., AMI, AliMeeting, VoxConverse), system performance is sensitive to the design of the clustering and reassignment algorithms and may still lag highly tuned separation-guided systems (Härkönen et al., 23 Jan 2024).
Comparison with alternative neural diarization strategies:
| Approach | Overlap Handling | Scalability to Long Recordings | Explicit Clustering | Key Limitation |
|---|---|---|---|---|
| SA-EEND/EEND (1-pass) | Yes | No (memory) | No | Label permutation, no global context |
| EEND-VC | Yes | Yes | Yes | Data-hungry, sensitive to chunk size |
| EEND-GLA | Yes | Yes (lower latency) | Yes (local attractor clustering) | Relies on short blocks, more frequent cluster merging |
| EEND-M2F, DiaPer | Yes | Limited by GPU/memory | No | No global clustering, less scalable |
7. Future Directions
Active research fronts for EEND-VC and related diarization systems include:
- End-to-end trainable clustering: Deep-unfolded iGMM and ARI-based loss for tight neural-clustering integration (Kinoshita et al., 2022).
- Embedding-free global aggregation: Direct use of frame-wise neural scores for global alignment without explicit embedding computation (Li et al., 26 Jun 2024).
- Novel decoder architectures: Mamba and Perceiver (DiaPer) blocks, yielding faster, more robust attractor generation and improved overlap detection (Plaquet et al., 13 Jun 2025, Landini, 27 Jun 2024).
- Unified and multilingual models: Training with broader domain coverage and simulated conversations for improved transfer and domain adaptation.
- Efficient inference: Batching, frame subset selection, and adaptive windowing for large-scale or streaming applications (Li et al., 26 Jun 2024).
- Real-time and multi-microphone fusion: Further reductions in latency and improved spatial robustness, leveraging late fusion and self-supervised adaptation with pseudo-labels (Tawara et al., 2023).
This synthesis encompasses the critical algorithmic details and recent empirical advances that position EEND-VC as the de facto backbone of modern, scalable neural speaker diarization pipelines—balancing end-to-end optimization, overlap handling, and robust global speaker assignment via constrained clustering strategies (Pálka et al., 22 Oct 2025, Plaquet et al., 13 Jun 2025).