
VoxWatch: An open-set speaker recognition benchmark on VoxCeleb (2307.00169v1)

Published 30 Jun 2023 in eess.AS, cs.AI, and cs.LG

Abstract: Despite its broad practical applications such as in fraud prevention, open-set speaker identification (OSI) has received less attention in the speaker recognition community than speaker verification (SV). OSI deals with determining whether a test speech sample belongs to a speaker from a set of pre-enrolled individuals (in-set) or to an out-of-set speaker. In addition to the typical challenges associated with speech variability, OSI is prone to the "false-alarm problem": as the size of the in-set speaker population (a.k.a. watchlist) grows, the out-of-set scores become larger, leading to increased false-alarm rates. This is particularly challenging for applications in financial institutions and border security, where the watchlist size is typically on the order of several thousand speakers. It is therefore important to systematically quantify the false-alarm problem and develop techniques that alleviate the impact of watchlist size on detection performance. Prior studies on this problem are sparse and lack a common benchmark for systematic evaluations. In this paper, we present the first public benchmark for OSI, developed using the VoxCeleb dataset. We quantify the effect of watchlist size and speech duration on the watchlist-based speaker detection task using three strong neural-network-based systems. In contrast to findings from prior research, we show that the commonly adopted adaptive score normalization is not guaranteed to improve performance for this task. On the other hand, we show that score calibration and score fusion, two other techniques commonly used in SV, result in significant improvements in OSI performance.
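The false-alarm problem described in the abstract can be illustrated with a minimal sketch. The decision rule below is a generic OSI formulation (maximum cosine similarity over the watchlist, thresholded), not the paper's specific systems, and the embeddings are synthetic random vectors, used only to show that the maximum out-of-set score can only grow as more speakers are enrolled:

```python
import numpy as np

rng = np.random.default_rng(0)

def osi_decision(test_emb, watchlist_embs, threshold):
    """Open-set identification: score a test embedding against every
    enrolled speaker; accept the best match only if it clears the
    threshold, otherwise reject the trial as out-of-set."""
    sims = watchlist_embs @ test_emb / (
        np.linalg.norm(watchlist_embs, axis=1) * np.linalg.norm(test_emb))
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else None  # None => out-of-set

# Illustrate the false-alarm problem: for a fixed out-of-set trial, the
# maximum score over the watchlist is non-decreasing in watchlist size,
# so a fixed threshold yields more false alarms as enrollment grows.
dim = 64
test_emb = rng.standard_normal(dim)            # out-of-set test speaker
watchlist = rng.standard_normal((1000, dim))   # synthetic enrolled embeddings
max_scores = [
    (watchlist[:n] @ test_emb / (
        np.linalg.norm(watchlist[:n], axis=1) * np.linalg.norm(test_emb))
     ).max()
    for n in (10, 100, 1000)
]
print(max_scores)  # non-decreasing: larger watchlists raise out-of-set scores
```

Score calibration and fusion, which the paper finds effective, operate on exactly these raw scores before thresholding; the sketch only captures the baseline decision rule they improve upon.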
