
Who is Authentic Speaker (2405.00248v1)

Published 30 Apr 2024 in cs.SD, cs.AI, cs.MM, and eess.AS

Abstract: Voice conversion (VC) based on deep learning can now generate high-quality one-to-many voices and is therefore used in practical applications such as entertainment and healthcare. However, voice conversion raises social concerns when manipulated voices are used for deceptive purposes, and identifying the real speakers behind converted voices is challenging because the acoustic characteristics of the source speakers are altered substantially. In this paper, we explore the feasibility of identifying authentic speakers from converted voices. The study rests on the assumption that some information from the source speakers persists even after their voices are converted into different target voices. Our experiments therefore aim to recognise the source speakers from converted voices, which are generated by applying FragmentVC to randomly paired utterances from source and target speakers. To improve robustness against converted voices, the recognition model is built using a hierarchical vector of locally aggregated descriptors (VLAD) in deep neural networks. The authentic speaker recognition system is evaluated in two main respects: the impact of the quality of the converted voices and variations of the VLAD configuration. The dataset used in this work is the VCTK corpus, with source and target speakers paired at random. The results on the converted utterances show promising performance in recognising authentic speakers from converted voices.
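The recognition model described above builds on VLAD-style aggregation of frame-level features. The sketch below is an illustration only, not the authors' implementation: it uses PyTorch, hypothetical layer sizes and speaker count, a plain MLP encoder standing in for the paper's deep network, and a single non-hierarchical VLAD layer rather than the hierarchical variant the paper describes. It shows how soft-assignment VLAD pooling turns variable-length frame features from a converted utterance into a fixed-length embedding for source-speaker classification.

```python
# Hypothetical sketch of VLAD pooling for source-speaker recognition.
# All dimensions, names, and the encoder architecture are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLADPooling(nn.Module):
    """Aggregate frame-level features (B, T, D) into a (B, K*D) utterance vector."""
    def __init__(self, feat_dim: int = 256, num_clusters: int = 8):
        super().__init__()
        # Learnable cluster centres and a soft-assignment projection.
        self.centroids = nn.Parameter(torch.randn(num_clusters, feat_dim) * 0.1)
        self.assign = nn.Linear(feat_dim, num_clusters)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim)
        soft_assign = F.softmax(self.assign(x), dim=-1)             # (B, T, K)
        residuals = x.unsqueeze(2) - self.centroids                 # (B, T, K, D)
        vlad = (soft_assign.unsqueeze(-1) * residuals).sum(dim=1)   # (B, K, D)
        vlad = F.normalize(vlad, p=2, dim=-1)                       # intra-normalisation
        return F.normalize(vlad.flatten(1), p=2, dim=-1)            # (B, K*D)

class SourceSpeakerClassifier(nn.Module):
    """Frame encoder + VLAD pooling + linear classifier over source-speaker IDs."""
    def __init__(self, feat_dim: int = 80, hidden: int = 256,
                 num_clusters: int = 8, num_speakers: int = 100):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.pool = VLADPooling(hidden, num_clusters)
        self.classifier = nn.Linear(hidden * num_clusters, num_speakers)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, frames, n_mels) features of a *converted* utterance
        return self.classifier(self.pool(self.encoder(mels)))

# Toy usage: a batch of 4 converted utterances, 200 frames of 80-dim log-mels.
logits = SourceSpeakerClassifier()(torch.randn(4, 200, 80))
print(logits.shape)  # torch.Size([4, 100])
```

In the paper's setting, the input features would come from utterances converted with FragmentVC on randomly paired VCTK speakers, and the classification targets would be the source (authentic) speaker identities.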

References (30)
  1. D. Childers, K. Wu, D. Hicks, and B. Yegnanarayana, “Voice conversion,” Speech Communication, vol. 8, no. 2, pp. 147–158, 1989.
  2. S. H. Mohammadi and A. Kain, “An overview of voice conversion systems,” Speech Communication, vol. 88, pp. 65–82, 2017.
  3. T. Walczyna and Z. Piotrowski, “Overview of voice conversion methods based on deep learning,” Applied Sciences, 2023. [Online]. Available: https://doi.org/10.3390/app13053100
  4. Z. Wu and H. Li, “Voice conversion versus speaker verification: An overview,” APSIPA Transactions on Signal and Information Processing, vol. 3, 12 2014.
  5. F. Mukhneri, I. Wijayanto, and S. Hadiyoso, “Voice conversion for dubbing using linear predictive coding and hidden markov model,” Journal of Southwest Jiaotong University, vol. 55, 01 2020.
  6. W.-C. Huang, L. Violeta, S. Liu, J. Shi, and T. Toda, “The singing voice conversion challenge 2023,” 12 2023, pp. 1–8.
  7. X. Chen, W. Chu, J. Guo, and N. Xu, “Singing voice conversion with non-parallel data,” 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 292–296, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:73729115
  8. I. Hernáez-Rioja, J. A. Gonzalez-Lopez, and H. Christensen, “Special issue on applications of speech and language technologies in healthcare,” Applied Sciences, pp. 2–13, 2023. [Online]. Available: https://doi.org/10.3390/app1311684
  9. S. Raman, X. Sarasola, E. Navas, and I. Hernaez, “Enrichment of oesophageal speech: Voice conversion with duration–matched synthetic speech as target,” Applied Sciences, pp. 2–13, 2021. [Online]. Available: https://doi.org/10.3390/app11135940
  10. D. Cai, Z. Cai, and M. Li, “Identifying source speakers for voice conversion based spoofing attacks on speaker verification systems,” 2023.
  11. D. Mari, F. Latora, and S. Milani, “The sound of silence: Efficiency of first digit features in synthetic audio detection,” in 2022 IEEE International Workshop on Information Forensics and Security (WIFS), 2022, pp. 1–6.
  12. S. Borzì, O. Giudice, F. Stanco, and D. Allegra, “Is synthetic voice detection research going into the right direction?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 71–80.
  13. Y. Mo and S. Wang, “Multi-task learning improves synthetic speech detection,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 6392–6396.
  14. T. P. Doan, K. Hong, and S. Jung, “GAN discriminator based audio deepfake detection,” in Proceedings of the 2nd Workshop on Security Implications of Deepfakes and Cheapfakes, 2023, pp. 29–32.
  15. F. Li, Y. Chen, H. Liu, Z. Zhao, Y. Yao, and X. Liao, “Vocoder detection of spoofing speech based on GAN fingerprints and domain generalization,” pp. 1–20, 2024.
  16. Y.-H. Chen, D.-Y. Wu, T.-H. Wu, and H.-y. Lee, “AGAIN-VC: A one-shot voice conversion using activation guidance and adaptive instance normalization,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 5954–5958.
  17. T. Kaneko and H. Kameoka, “CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks,” 2018, pp. 2100–2104.
  18. H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks,” 2018, pp. 266–273.
  19. W. Cai, J. Chen, and M. Li, “Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,” ArXiv, vol. abs/1804.05160, 2018.
  20. W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-level aggregation for speaker recognition in the wild,” 05 2019, pp. 5791–5795.
  21. A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “Wav2vec 2.0: a framework for self-supervised learning of speech representations,” 2020, pp. 12 449–12 460.
  22. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2019, pp. 4171–4186.
  23. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation.” Springer International Publishing, 2015, pp. 234–241.
  24. Y. Y. Lin, C.-M. Chien, J.-h. Lin, H.-y. Lee, and L.-S. Lee, “FragmentVC: Any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 5939–5943. [Online]. Available: https://api.semanticscholar.org/CorpusID:225076127
  25. J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, R. Barra-Chicote, A. Moinet, and V. Aggarwal, “Towards achieving robust universal neural vocoding,” 2019, pp. 181–185.
  26. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  27. Y. Zhong, R. Arandjelović, and A. Zisserman, “GhostVLAD for set-based face recognition,” in Computer Vision – ACCV 2018, 2019, pp. 35–50.
  28. H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 3304–3311.
  29. J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” University of Edinburgh, The Centre for Speech Technology Research (CSTR).
  30. Wikipedia contributors, “Mean opinion score,” Wikipedia, The Free Encyclopedia, 2024. [Online]. Available: https://en.wikipedia.org/wiki/Mean-opinion-score
