
Cosine Scoring with Uncertainty for Neural Speaker Embedding (2403.06404v1)

Published 11 Mar 2024 in cs.SD, cs.LG, and eess.AS

Abstract: Uncertainty modeling in speaker representation aims to learn the variability present in speech utterances. While conventional cosine scoring is computationally efficient and prevalent in speaker recognition, it lacks the capability to handle uncertainty. To address this challenge, this paper proposes an approach for estimating uncertainty at the speaker embedding front-end and propagating it to the cosine scoring back-end. Experiments conducted on the VoxCeleb and SITW datasets confirmed the efficacy of the proposed method in handling uncertainty arising from embedding estimation: it achieved 8.5% and 9.8% average reductions in EER and minDCF, respectively, compared to conventional cosine similarity, while remaining computationally efficient in practice.
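The core idea in the abstract, replacing a point-estimate cosine score with one that accounts for per-dimension embedding uncertainty, can be illustrated with a minimal sketch. The snippet below is not the paper's exact propagation formula; it assumes each embedding comes with a diagonal variance estimate and simply down-weights dimensions whose combined variance is high (a precision-weighted cosine), which captures the spirit of propagating front-end uncertainty into the back-end score.

```python
import numpy as np

def cosine_score(e1, e2):
    # Conventional cosine similarity between two speaker embeddings.
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def uncertainty_cosine_score(mu1, var1, mu2, var2, eps=1e-8):
    # Illustrative uncertainty-propagated score (not the paper's exact
    # formula): each dimension is weighted by the inverse of the two
    # utterances' combined variance, so uncertain dimensions contribute
    # less to the verification score.
    w = 1.0 / (var1 + var2 + eps)                      # precision weights
    num = np.sum(w * mu1 * mu2)
    den = np.sqrt(np.sum(w * mu1**2)) * np.sqrt(np.sum(w * mu2**2))
    return float(num / den)
```

Note that when both variance estimates are uniform across dimensions, the weighted score reduces to the conventional cosine similarity, so the sketch degrades gracefully to the baseline the paper compares against.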

