VoxCeleb-ESP: preliminary experiments detecting Spanish celebrities from their voices (2401.09441v1)
Abstract: This paper presents VoxCeleb-ESP, a collection of pointers and timestamps to YouTube videos that facilitates the creation of a novel speaker recognition dataset. VoxCeleb-ESP captures real-world scenarios, incorporating diverse speaking styles, noises, and channel distortions. It includes 160 Spanish celebrities from a variety of categories, ensuring a representative distribution across age groups and geographic regions of Spain. We provide two speaker trial lists for speaker identification tasks, one with same-video target trials and the other with different-video target trials, accompanied by a cross-lingual evaluation of pretrained ResNet models. Preliminary speaker identification results suggest that the difficulty of the detection task in VoxCeleb-ESP is equivalent to that of the original and much larger English-language VoxCeleb. VoxCeleb-ESP thus expands speaker recognition benchmarks with a comprehensive and diverse dataset for the Spanish language.
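To make the trial-based evaluation concrete, the sketch below shows how a trial list of the kind described in the abstract could be scored with precomputed speaker embeddings. It is a minimal, hypothetical example: it assumes a VoxCeleb-style trial file with one trial per line (a 0/1 target label followed by two utterance identifiers) and a dictionary of embeddings produced beforehand by a pretrained speaker model such as a ResNet; trials are scored with cosine similarity and summarized with an approximate equal error rate. Neither the file format nor the scoring back-end is taken from the paper.

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two fixed-length speaker embeddings."""
    emb_a, emb_b = np.asarray(emb_a, float), np.asarray(emb_b, float)
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def compute_eer(labels, scores):
    """Approximate equal error rate (EER) from 0/1 target labels and trial scores."""
    labels = np.asarray(labels, int)
    scores = np.asarray(scores, float)
    order = np.argsort(scores)          # sort trials by ascending score
    labels = labels[order]
    n_tar = labels.sum()
    n_non = len(labels) - n_tar
    # Sweep the decision threshold upward past each sorted score:
    # targets below the threshold become misses, and non-targets below it
    # are no longer false alarms.
    fnr = np.cumsum(labels) / n_tar
    fpr = 1.0 - np.cumsum(1 - labels) / n_non
    idx = int(np.argmin(np.abs(fnr - fpr)))
    return float((fnr[idx] + fpr[idx]) / 2.0)

def score_trial_list(trial_path, embeddings):
    """Score a trial list with lines '<label> <utterance_a> <utterance_b>' (assumed format)."""
    labels, scores = [], []
    with open(trial_path) as f:
        for line in f:
            label, utt_a, utt_b = line.split()
            labels.append(int(label))
            scores.append(cosine_score(embeddings[utt_a], embeddings[utt_b]))
    return labels, scores

# Hypothetical usage: 'embeddings' maps utterance IDs to vectors extracted with a
# pretrained speaker model (e.g. a ResNet trained on English VoxCeleb data, which
# would correspond to the cross-lingual setting mentioned in the abstract).
# labels, scores = score_trial_list("voxceleb_esp_trials.txt", embeddings)
# print(f"EER: {100 * compute_eer(labels, scores):.2f}%")
```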