
Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization (2312.04131v1)

Published 7 Dec 2023 in eess.AS and cs.SD

Abstract: The scarcity of labeled audio-visual data constrains the training of high-performing audio-visual speaker diarization systems. To improve performance, we leverage pre-trained supervised and self-supervised speech models in an end-to-end audio-visual speaker diarization (AVSD) system. Specifically, we adopt supervised (ResNet and ECAPA-TDNN) and self-supervised (WavLM and HuBERT) pre-trained models as the speaker and audio embedding extractors. We then explore the effectiveness of different frameworks, including Transformer, Conformer, and a cross-attention mechanism, in the audio-visual decoder. To mitigate the performance degradation caused by separate training, we jointly train the audio encoder, speaker encoder, and audio-visual decoder in the AVSD system. Experiments on the MISP dataset demonstrate that the proposed method achieves superior performance, taking third place in the MISP Challenge 2022.
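
To make the decoder design concrete, below is a minimal sketch of the cross-attention variant mentioned in the abstract: per-speaker visual embeddings query frame-level audio embeddings, and a linear head predicts per-frame speech activity. All dimensions, module choices, and names (`CrossAttentionAVSD`, `audio_proj`, `visual_proj`) are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class CrossAttentionAVSD(nn.Module):
    """Hypothetical sketch of a cross-attention audio-visual decoder.

    Visual (lip) embeddings for one speaker attend over frame-level
    audio embeddings; a linear head then predicts per-frame speech
    activity for that speaker. Dimensions are illustrative only.
    """

    def __init__(self, audio_dim=768, visual_dim=512, d_model=256, n_heads=4):
        super().__init__()
        # Projections into a shared space; audio_dim=768 assumes
        # WavLM/HuBERT-base features, visual_dim=512 a lip-reading encoder.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)  # per-frame speech/non-speech logit

    def forward(self, audio_emb, visual_emb):
        # audio_emb:  (batch, T_audio, audio_dim)  frame-level audio features
        # visual_emb: (batch, T_video, visual_dim) per-speaker visual features
        q = self.visual_proj(visual_emb)        # queries come from video
        kv = self.audio_proj(audio_emb)         # keys/values come from audio
        fused, _ = self.cross_attn(q, kv, kv)   # video frames attend to audio
        return self.head(fused).squeeze(-1)     # (batch, T_video) activity logits

# Toy usage: one speaker track, 100 video frames, 250 audio frames.
model = CrossAttentionAVSD()
audio = torch.randn(1, 250, 768)
video = torch.randn(1, 100, 512)
logits = model(audio, video)
labels = torch.randint(0, 2, (1, 100)).float()
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()  # under joint training, gradients would also flow into the encoders
print(logits.shape, loss.item())
```

Under the joint-training setup the paper describes, the pre-trained audio and speaker encoders would be unfrozen so this diarization loss updates them together with the decoder, rather than training each component separately.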

Authors (8)
  1. Huan Zhao
  2. Li Zhang
  3. Yue Li
  4. Yannan Wang
  5. Hongji Wang
  6. Wei Rao
  7. Qing Wang
  8. Lei Xie