Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization (2312.04131v1)
Abstract: The scarcity of labeled audio-visual datasets constrains the training of high-performing audio-visual speaker diarization systems. To improve audio-visual speaker diarization performance, we leverage pre-trained supervised and self-supervised speech models. Specifically, we adopt supervised (ResNet and ECAPA-TDNN) and self-supervised (WavLM and HuBERT) pre-trained models as the speaker and audio embedding extractors in an end-to-end audio-visual speaker diarization (AVSD) system. We then explore the effectiveness of different frameworks in the audio-visual decoder, including the Transformer, the Conformer, and a cross-attention mechanism. To mitigate the performance degradation caused by training the modules separately, we jointly train the audio encoder, speaker encoder, and audio-visual decoder in the AVSD system. Experiments on the MISP dataset demonstrate that the proposed method achieves superior performance and placed third in the MISP Challenge 2022.
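To make the abstract's pipeline concrete, below is a minimal sketch of how a pre-trained self-supervised speech model such as WavLM could serve as the audio encoder in an end-to-end AVSD system, with a single cross-attention layer fusing audio frames and per-speaker visual (lip) embeddings. This is not the authors' released implementation: the Hugging Face checkpoint name, embedding dimensions, number of speakers, and the single-layer decoder are illustrative assumptions. Joint training, as described in the abstract, corresponds to leaving the pre-trained encoder unfrozen so gradients from the diarization loss flow back into it.

```python
# Minimal sketch (assumptions noted above), not the paper's official code.
import torch
import torch.nn as nn
from transformers import WavLMModel


class CrossAttentionAVSD(nn.Module):
    def __init__(self, num_speakers=6, d_model=256, n_heads=4, visual_dim=512):
        super().__init__()
        # Pre-trained self-supervised audio encoder; left unfrozen for joint training.
        self.audio_encoder = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
        audio_dim = self.audio_encoder.config.hidden_size  # 768 for the base model
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Visual (lip) embeddings act as queries attending over audio frames.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)  # per-speaker, per-frame speech activity

    def forward(self, waveform, visual_emb):
        # waveform: (B, samples); visual_emb: (B, num_speakers, T_v, visual_dim)
        audio = self.audio_encoder(waveform).last_hidden_state   # (B, T_a, audio_dim)
        audio = self.audio_proj(audio)                           # (B, T_a, d_model)
        B, S, T_v, _ = visual_emb.shape
        logits = []
        for s in range(S):
            query = self.visual_proj(visual_emb[:, s])           # (B, T_v, d_model)
            fused, _ = self.cross_attn(query, audio, audio)      # (B, T_v, d_model)
            logits.append(self.head(fused).squeeze(-1))          # (B, T_v)
        # Frame-level activity logits for each of the S speakers.
        return torch.stack(logits, dim=1)                        # (B, S, T_v)
```

In a full system along the lines described in the abstract, a separate speaker encoder (e.g. ResNet or ECAPA-TDNN) would supply speaker embeddings as additional decoder inputs, and the cross-attention block here could be swapped for a Transformer or Conformer decoder to compare the alternatives the paper explores.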
- Huan Zhao (109 papers)
- Li Zhang (690 papers)
- Yue Li (218 papers)
- Yannan Wang (23 papers)
- Hongji Wang (10 papers)
- Wei Rao (33 papers)
- Qing Wang (341 papers)
- Lei Xie (337 papers)