Effect of single-speaker x-vector training on overlapped-speech diarization
Determine whether training speaker-embedding extractors such as x-vectors exclusively on single-speaker utterances degrades embedding quality in overlapped-speech regions and increases speaker confusion in overlap-aware diarization. Quantify how much of this effect is mitigated by training the extractor on overlapped or noise-augmented data.
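One possible way to make "embedding quality in overlapping regions" measurable is sketched below. It is only an illustration, not the procedure used in the cited work. The function names (`cosine`, `speaker_centroids`, `overlap_quality_gap`) and the choice of cosine similarity to a per-speaker centroid as the quality proxy are assumptions introduced here. The sketch assumes that each segment has a reference speaker label and an overlap flag, that every speaker has at least one non-overlapped segment, and that no embedding is all zeros.

```python
import math
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def speaker_centroids(embeddings, labels, is_overlap):
    """Average the non-overlapped embeddings of each speaker (hypothetical helper)."""
    sums, counts = {}, {}
    for emb, spk, ov in zip(embeddings, labels, is_overlap):
        if ov:
            continue  # centroids are built from single-speaker segments only
        acc = sums.setdefault(spk, [0.0] * len(emb))
        for i, x in enumerate(emb):
            acc[i] += x
        counts[spk] = counts.get(spk, 0) + 1
    return {spk: [x / counts[spk] for x in acc] for spk, acc in sums.items()}


def overlap_quality_gap(embeddings, labels, is_overlap):
    """Return (mean similarity in clean segments, mean similarity in
    overlapped segments, their difference). A larger difference suggests
    that embeddings drift further from the reference speaker under overlap."""
    centroids = speaker_centroids(embeddings, labels, is_overlap)
    clean, overlapped = [], []
    for emb, spk, ov in zip(embeddings, labels, is_overlap):
        sim = cosine(emb, centroids[spk])
        (overlapped if ov else clean).append(sim)
    clean_mean, overlap_mean = mean(clean), mean(overlapped)
    return clean_mean, overlap_mean, clean_mean - overlap_mean


# Toy data: two clean segments and one overlapped segment attributed to speaker "A".
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_labels = ["A", "B", "A"]
toy_overlap = [False, False, True]
clean_sim, overlap_sim, gap = overlap_quality_gap(toy_embeddings, toy_labels, toy_overlap)
```

Computing this gap once for an extractor trained only on single-speaker utterances and once for an extractor trained with overlapped or noise-augmented data, on the same evaluation segments, would give one rough indication of whether the augmented training narrows the mismatch. It does not replace measuring the speaker-confusion component of the diarization error rate directly.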
References
"We conjecture that since the x-vector extractor was trained only on single-speaker utterances, a mismatch in the overlap regions of the recording results in noisy samples."
— Listening to Multi-talker Conversations: Modular and End-to-end Perspectives
(2402.08932 - Raj, 14 Feb 2024) in Chapter 2 (Overlap-aware Speaker Diarization), Section “Diarization results for AMI” (Section 2.7)