Cross-Domain Audio Deepfake Detection: Dataset and Analysis (2404.04904v2)

Published 7 Apr 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Audio deepfake detection (ADD) is essential for preventing the misuse of synthetic voices that may infringe on personal rights and privacy. Recent zero-shot text-to-speech (TTS) models pose higher risks as they can clone voices with a single utterance. However, the existing ADD datasets are outdated, leading to suboptimal generalization of detection models. In this paper, we construct a new cross-domain ADD dataset comprising over 300 hours of speech data that is generated by five advanced zero-shot TTS models. To simulate real-world scenarios, we employ diverse attack methods and audio prompts from different datasets. Experiments show that, through novel attack-augmented training, the Wav2Vec2-large and Whisper-medium models achieve equal error rates of 4.1% and 6.5% respectively. Additionally, we demonstrate our models' outstanding few-shot ADD ability by fine-tuning with just one minute of target-domain data. Nonetheless, neural codec compressors greatly affect the detection accuracy, necessitating further research.
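
The abstract reports results as equal error rates (EER). As a reminder of what that metric measures, here is a minimal sketch of one common way to estimate it from detector scores. It assumes that a higher score means "more likely bona fide"; the function name `equal_error_rate` and the example scores are hypothetical and are not taken from the paper, which does not describe its evaluation code here.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Estimate the EER and the threshold at which it occurs.

    Assumption: a higher score means the detector considers the
    utterance more likely to be bona fide (real) speech.
    """
    # Candidate thresholds: every score observed in either class.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # An utterance is accepted as bona fide when score >= t.
        # False acceptance rate: spoofed utterances that are accepted.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide utterances that are rejected.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical scores, for illustration only.
bonafide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.6, 0.2, 0.1, 0.4]
eer, threshold = equal_error_rate(bonafide, spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With a finite set of scores the two error rates rarely cross exactly, so this sketch reports their average at the closest point. Evaluation toolkits often interpolate along the error curve instead, which can give slightly different values.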
