An Investigation of Distribution Alignment in Multi-Genre Speaker Recognition (2309.14158v1)
Abstract: Multi-genre speaker recognition is attracting growing interest because it better reflects the complexity of real-world applications. A major challenge, however, is the significant shift in the distribution of speaker vectors across genres. Distribution alignment is a common way to address this shift, but previous studies have mainly focused on aligning one source domain with one target domain, and how these methods perform on multi-genre data is unknown. This paper presents a comprehensive study of mainstream distribution alignment methods on multi-genre data, where multiple distributions need to be aligned. We analyze the methods both qualitatively and quantitatively. Our experiments on the CN-Celeb dataset show that within-between distribution alignment (WBDA) performs relatively better than the other methods investigated. However, none of the investigated methods consistently improved performance in all test cases. This suggests that aligning the distributions of speaker vectors alone may not fully address the challenges of multi-genre speaker recognition, and that further investigation is needed to develop a more comprehensive solution.