Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024 (2409.02302v1)

Published 3 Sep 2024 in eess.AS, cs.AI, and cs.SD

Abstract: This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024. The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices, attracting increased research attention. The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task. In this work, we explore ensemble methods, utilizing speech foundation models to develop robust singing voice anti-spoofing systems. We also introduce a novel Squeeze-and-Excitation Aggregation (SEA) method, which efficiently and effectively integrates representation features from the speech foundation models, surpassing the performance of our other individual systems. Evaluation results confirm the efficacy of our approach in detecting deepfake singing voices. The code can be accessed at https://github.com/Anmol2059/SVDD2024.
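
The SEA method summarized in the abstract aggregates hidden-layer representations from a speech foundation model using squeeze-and-excitation style gating. The sketch below is a minimal PyTorch illustration of that general idea, not the authors' released implementation: the module name, the 25-layer / 1024-dimensional shapes (WavLM-Large-sized), and the reduction ratio are illustrative assumptions.

```python
# Minimal sketch of SE-style aggregation over the hidden layers of a speech
# foundation model. Illustrative only; layer count, feature dimension, and
# reduction ratio are assumptions, not values taken from the paper's code.
import torch
import torch.nn as nn


class SEAggregation(nn.Module):
    """Re-weights and sums the L hidden-layer outputs of a foundation model."""

    def __init__(self, num_layers: int = 25, reduction: int = 4):
        super().__init__()
        # Excitation MLP: per-layer descriptor -> per-layer gate in (0, 1).
        self.excite = nn.Sequential(
            nn.Linear(num_layers, num_layers // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(num_layers // reduction, num_layers),
            nn.Sigmoid(),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, num_layers, time, feat_dim)
        # Squeeze: global average over time and feature dims -> (batch, num_layers)
        squeezed = hidden_states.mean(dim=(2, 3))
        # Excitation: per-layer importance weights -> (batch, num_layers, 1, 1)
        weights = self.excite(squeezed).unsqueeze(-1).unsqueeze(-1)
        # Re-weight each layer and sum into a single (batch, time, feat_dim) feature.
        return (hidden_states * weights).sum(dim=1)


if __name__ == "__main__":
    # Example: 25 transformer-layer outputs for a batch of 2 utterances,
    # 200 frames, 1024-dim features (WavLM-Large-sized, assumed).
    feats = torch.randn(2, 25, 200, 1024)
    agg = SEAggregation(num_layers=25)
    print(agg(feats).shape)  # torch.Size([2, 200, 1024])
```

The aggregated (batch, time, feat_dim) feature would then feed a downstream anti-spoofing back-end; the squeeze-and-excitation gating lets the system learn which foundation-model layers matter most for detecting deepfake singing voices.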
