VoxCeleb-ESP: preliminary experiments detecting Spanish celebrities from their voices (2401.09441v1)

Published 20 Dec 2023 in cs.SD, cs.LG, and eess.AS

Abstract: This paper presents VoxCeleb-ESP, a collection of pointers and timestamps to YouTube videos facilitating the creation of a novel speaker recognition dataset. VoxCeleb-ESP captures real-world scenarios, incorporating diverse speaking styles, noises, and channel distortions. It includes 160 Spanish celebrities spanning various categories, ensuring a representative distribution across age groups and geographic regions in Spain. We provide two speaker trial lists for speaker identification tasks, one with same-video target trials and the other with different-video target trials, accompanied by a cross-lingual evaluation of ResNet pretrained models. Preliminary speaker identification results suggest that the complexity of the detection task in VoxCeleb-ESP is equivalent to that of the original, much larger English VoxCeleb. VoxCeleb-ESP contributes to the expansion of speaker recognition benchmarks with a comprehensive and diverse dataset for the Spanish language.
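
As a rough, hedged illustration of how the trial-based detection protocol described above is typically scored, the sketch below compares speaker embeddings from a pretrained encoder (for example, a ResNet of the kind mentioned in the abstract) with cosine similarity and reports an equal error rate. The trial-list format (`label enroll_wav test_wav` per line), the `extract_embedding` placeholder, and the EER computation are illustrative assumptions, not the authors' released evaluation code.

```python
# Minimal sketch of VoxCeleb-style trial scoring (assumptions noted above).
import numpy as np
from sklearn.metrics import roc_curve

def extract_embedding(wav_path: str) -> np.ndarray:
    """Placeholder: return a fixed-dimensional speaker embedding for one file.
    Plug in any pretrained speaker encoder (e.g. a ResNet trained on VoxCeleb)."""
    raise NotImplementedError("supply a pretrained speaker encoder here")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_trials(trial_file: str):
    """Read 'label enroll_wav test_wav' lines and score each trial."""
    labels, scores = [], []
    with open(trial_file) as f:
        for line in f:
            label, enroll, test = line.split()
            labels.append(int(label))  # 1 = target trial, 0 = non-target
            scores.append(cosine(extract_embedding(enroll),
                                 extract_embedding(test)))
    return np.array(labels), np.array(scores)

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: operating point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)
```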
