Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition? (2204.12765v2)
Abstract: Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even though the pre-training objective is designed for speech recognition. In this paper, we study which factors lead to the success of self-supervised learning on speaker-related tasks, e.g., speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the VoxCeleb1 dataset suggest that the benefit of SSL to the SV task comes from a combination of the masked speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition.
- Sanyuan Chen (28 papers)
- Yu Wu (196 papers)
- Chengyi Wang (32 papers)
- Shujie Liu (101 papers)
- Zhuo Chen (319 papers)
- Peidong Wang (33 papers)
- Gang Liu (177 papers)
- Jinyu Li (164 papers)
- Jian Wu (314 papers)
- Xiangzhan Yu (7 papers)
- Furu Wei (291 papers)
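
The masked speech prediction loss the abstract credits is the HuBERT-style objective: spans of input frames are masked, and the model is trained to predict discrete pseudo-labels from a quantizer only at the masked positions. Below is a minimal PyTorch sketch of that objective; the helper names (`span_mask`, `masked_prediction_loss`), tensor shapes, and hyperparameter values are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def span_mask(batch: int, frames: int, mask_prob: float = 0.08, span: int = 10) -> torch.Tensor:
    """Sample random mask spans over the frame axis (wav2vec 2.0 / HuBERT style).

    Returns a (batch, frames) bool tensor, True where a frame is masked.
    Hyperparameters here are illustrative, not the paper's settings.
    """
    mask = torch.zeros(batch, frames, dtype=torch.bool)
    num_starts = max(1, int(mask_prob * frames))
    for b in range(batch):
        starts = torch.randperm(frames - span)[:num_starts]
        for s in starts:
            mask[b, s : s + span] = True
    return mask

def masked_prediction_loss(logits: torch.Tensor, targets: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy computed only over the masked frames.

    logits:  (batch, frames, codebook_size) encoder predictions
    targets: (batch, frames) discrete pseudo-labels from the quantizer
    mask:    (batch, frames) bool mask from span_mask
    """
    return F.cross_entropy(logits[mask], targets[mask])

# Toy usage with random tensors standing in for a real encoder and quantizer.
B, T, C = 4, 200, 500  # batch, frames, illustrative codebook size
mask = span_mask(B, T)
logits = torch.randn(B, T, C)          # would come from the Transformer encoder
targets = torch.randint(0, C, (B, T))  # would come from the k-means quantizer
loss = masked_prediction_loss(logits, targets, mask)
```

Restricting the loss to masked positions is what forces the encoder to infer the missing content from surrounding context; the abstract's finding is that this objective, combined with data scale and model size, drives the gain on SV, while the choice of quantizer matters less.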