Multi-Domain Adaptation by Self-Supervised Learning for Speaker Verification (2309.14149v1)
Abstract: In real-world applications, speaker recognition models often face various domain-mismatch challenges, leading to a significant drop in performance. Although numerous domain adaptation techniques have been developed to address this issue, almost all existing methods focus on a simple configuration where the model is trained in one domain and deployed in another. However, real-world environments are often complex and may contain multiple domains, making methods designed for one-to-one adaptation suboptimal. In this paper, we propose a self-supervised learning method to tackle this multi-domain adaptation problem. Building upon the basic self-supervised adaptation algorithm, we design three strategies to make it suitable for multi-domain adaptation: an in-domain negative sampling strategy, a MoCo-like memory bank scheme, and a CORAL-like distribution alignment. We conducted experiments using VoxCeleb2 as the source-domain dataset and CN-Celeb1 as the multi-domain target dataset. The results show that our method clearly outperforms the basic self-supervised adaptation method, which simply treats the data of CN-Celeb1 as a single domain. Importantly, the improvement is consistent in nearly all in-domain and cross-domain tests, demonstrating the effectiveness of the proposed method.
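The sketch below is a minimal, hedged illustration of the three strategies named in the abstract, not the authors' implementation: an InfoNCE-style contrastive loss whose negatives are drawn only from the query's own domain, a MoCo-like per-domain memory bank (FIFO queue of key embeddings), and a CORAL-like loss that matches second-order statistics between source and target embeddings. All names, shapes, the queue size, and the loss weight are illustrative assumptions.

```python
# Illustrative sketch of in-domain negative sampling, a MoCo-like per-domain
# memory bank, and a CORAL-like alignment loss (assumed PyTorch-style setup;
# not the paper's exact code).
import torch
import torch.nn.functional as F


def in_domain_info_nce(query, key, queue, temperature=0.07):
    """InfoNCE loss where negatives come only from the memory bank of the
    query's own domain (in-domain negative sampling)."""
    query = F.normalize(query, dim=-1)   # (B, D) query embeddings
    key = F.normalize(key, dim=-1)       # (B, D) positive key embeddings
    queue = F.normalize(queue, dim=-1)   # (K, D) in-domain negative embeddings
    pos = (query * key).sum(dim=-1, keepdim=True)  # (B, 1) positive logits
    neg = query @ queue.t()                        # (B, K) negative logits
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(query.size(0), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)


class DomainQueue:
    """MoCo-like memory bank: one fixed-size FIFO queue of key embeddings
    per domain, refreshed with momentum-encoder outputs."""

    def __init__(self, num_domains, queue_size=1024, dim=192):
        self.queues = torch.randn(num_domains, queue_size, dim)
        self.ptr = [0] * num_domains

    @torch.no_grad()
    def enqueue(self, domain, keys):
        k, size = keys.size(0), self.queues.size(1)
        p = self.ptr[domain]
        idx = torch.arange(p, p + k) % size   # wrap around the queue
        self.queues[domain, idx] = keys
        self.ptr[domain] = (p + k) % size


def coral_loss(source_feats, target_feats):
    """CORAL-like alignment: match the covariances of source-domain and
    target-domain embeddings (Frobenius-norm distance)."""
    def cov(x):
        x = x - x.mean(dim=0, keepdim=True)
        return x.t() @ x / (x.size(0) - 1)

    d = source_feats.size(1)
    return ((cov(source_feats) - cov(target_feats)) ** 2).sum() / (4 * d * d)


if __name__ == "__main__":
    B, D, K = 8, 192, 1024
    bank = DomainQueue(num_domains=3, queue_size=K, dim=D)
    q, k = torch.randn(B, D), torch.randn(B, D)
    # Total loss: in-domain contrastive term plus a weighted CORAL term
    # (the 0.1 weight is an illustrative assumption).
    loss = in_domain_info_nce(q, k, bank.queues[0])
    loss = loss + 0.1 * coral_loss(torch.randn(64, D), torch.randn(64, D))
    bank.enqueue(0, F.normalize(k, dim=-1))
    print(float(loss))
```

In this reading, each target domain keeps its own queue so that negatives never cross domain boundaries, while the CORAL-like term pulls the overall target-embedding distribution toward the source one; how these terms are weighted and scheduled in the paper is not specified by the abstract.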