Effect of single-speaker x-vector training on overlapped-speech diarization

Determine whether training speaker-embedding extractors such as x-vectors exclusively on single-speaker utterances degrades embedding quality in overlapping regions and increases speaker confusion in overlap-aware diarization, and quantify whether training on overlapping or noise-augmented data mitigates this effect.

Background

In overlap-aware spectral clustering experiments on the AMI corpus, missed speech decreased but speaker confusion increased. The authors hypothesize that this stems from a mismatch: the x-vector extractor was trained on single-speaker utterances but is applied to overlapped regions at test time. Noise-augmented x-vectors give a partial improvement, and t-SNE visualizations of overlap embeddings suggest that noisy overlap representations may drive the confusion.

Validating or refuting this conjecture would clarify whether embedding training conditions materially affect overlap diarization errors and guide training data design for robust x-vectors.
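
One way to obtain the overlapping training data this question calls for is to mix pairs of single-speaker utterances. The following is a minimal sketch, assuming mono waveforms held as NumPy arrays at the same sampling rate. The function name `mix_overlap` and the parameters `sir_db` and `offset` are hypothetical choices for illustration and are not taken from the cited work.

```python
import numpy as np

def mix_overlap(target, interferer, sir_db=0.0, offset=0):
    """Add `interferer` to `target` starting at sample `offset`.

    The interferer is scaled so that the target-to-interferer power
    ratio, measured over the overlapped samples, equals `sir_db` dB.
    Returns the mixture and a boolean per-sample overlap mask.
    """
    target = np.asarray(target, dtype=np.float64)
    interferer = np.asarray(interferer, dtype=np.float64)
    if not 0 <= offset < len(target):
        raise ValueError("offset must fall inside the target utterance")

    # Keep only the part of the interferer that fits inside the target.
    n = min(len(interferer), len(target) - offset)
    seg_t = target[offset:offset + n]
    seg_i = interferer[:n]

    p_t = np.mean(seg_t ** 2)
    p_i = np.mean(seg_i ** 2)
    if p_t == 0 or p_i == 0:
        raise ValueError("both segments must have non-zero power")

    # Gain that brings the interferer to the requested ratio.
    gain = np.sqrt(p_t / (p_i * 10 ** (sir_db / 10)))

    mixture = target.copy()
    mixture[offset:offset + n] += gain * seg_i

    # Marks which samples contain two speakers (assumes both are
    # actively speaking throughout the overlapped span).
    overlap = np.zeros(len(target), dtype=bool)
    overlap[offset:offset + n] = True
    return mixture, overlap
```

The returned mask makes it possible to extract embeddings separately from overlapped and single-speaker regions, which is the comparison the conjecture turns on. Whether such simulated mixtures resemble the overlaps in real meeting recordings is itself an assumption that would need checking.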

References

We conjecture that since the x-vector extractor was trained only on single-speaker utterances, a mismatch in the overlap regions of the recording results in noisy samples.

— Listening to Multi-talker Conversations: Modular and End-to-end Perspectives (2402.08932 - Raj, 14 Feb 2024) in Chapter 2 (Overlap-aware Speaker Diarization), Section “Diarization results for AMI” (Section 2.7)
