VBx for End-to-End Neural and Clustering-based Diarization (2510.19572v1)
Abstract: We present improvements to speaker diarization in the two-stage end-to-end neural diarization with vector clustering (EEND-VC) framework. The first stage employs a Conformer-based EEND model with WavLM features to infer frame-level speaker activity within short windows. The identities and counts of global speakers are then derived in the second stage by clustering speaker embeddings across windows. The focus of this work is to improve the second stage; we filter unreliable embeddings from short segments and reassign them after clustering. We also integrate the VBx clustering to improve robustness when the number of speakers is large and individual speaking durations are limited. Evaluation on a compound benchmark spanning multiple domains is conducted without fine-tuning the EEND model or tuning clustering parameters per dataset. Despite this, the system generalizes well and matches or exceeds recent state-of-the-art performance.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.