
Cluster-Guided Unsupervised Domain Adaptation for Deep Speaker Embedding (2303.15944v1)

Published 28 Mar 2023 in cs.LG, cs.SD, and eess.AS

Abstract: Recent studies have shown that pseudo labels can contribute to unsupervised domain adaptation (UDA) for speaker verification. Inspired by self-training strategies that use an existing classifier to label the unlabeled data for retraining, we propose a cluster-guided UDA framework that labels the target domain data by clustering and combines the labeled source domain data and pseudo-labeled target domain data to train a speaker embedding network. To improve cluster quality, we train a speaker embedding network dedicated to clustering by minimizing the contrastive center loss. The goal is to reduce the distance between an embedding and its assigned cluster center while enlarging the distance between the embedding and the other cluster centers. Using VoxCeleb2 as the source domain and CN-Celeb1 as the target domain, we demonstrate that the proposed method can achieve an equal error rate (EER) of 8.10% on the CN-Celeb1 evaluation set without using any labels from the target domain. This result outperforms the supervised baseline by 39.6% and represents the state-of-the-art UDA performance on this corpus.
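The abstract describes the contrastive center loss only informally: pull each embedding toward its assigned cluster center while pushing it away from the other centers. As a rough illustration, below is a minimal PyTorch sketch of one common ratio-style formulation of such a loss. The function name, the use of squared Euclidean distance, and the `delta` smoothing term are assumptions made for this sketch, not the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def contrastive_center_loss(embeddings, labels, centers, delta=1e-6):
    """Illustrative contrastive center loss (assumed formulation).

    embeddings: (B, D) speaker embeddings
    labels:     (B,)   pseudo-cluster assignments (long tensor)
    centers:    (K, D) cluster centers
    """
    # Squared Euclidean distance from every embedding to every center: (B, K)
    dists = torch.cdist(embeddings, centers).pow(2)

    # Distance to the assigned center (the term to minimize)
    pos = dists.gather(1, labels.unsqueeze(1)).squeeze(1)

    # Sum of distances to all other centers (the term to maximize)
    neg = dists.sum(dim=1) - pos

    # Minimizing the ratio pulls embeddings toward their own center
    # and pushes them away from the remaining centers
    return (pos / (neg + delta)).mean()

# Example usage with random data: 8 utterances, 192-dim embeddings, 4 clusters
emb = F.normalize(torch.randn(8, 192), dim=1)
centers = F.normalize(torch.randn(4, 192), dim=1)
labels = torch.randint(0, 4, (8,))
loss = contrastive_center_loss(emb, labels, centers)
```

Minimizing the ratio jointly shrinks the numerator (within-cluster distance) and grows the denominator (between-cluster distances), which matches the stated goal of producing tighter, better-separated clusters for pseudo-labeling.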

Authors (3)
  1. Haiquan Mao
  2. Feng Hong
  3. Man-Wai Mak