Cross-Domain Audio Deepfake Detection: Dataset and Analysis (2404.04904v2)
Abstract: Audio deepfake detection (ADD) is essential for preventing the misuse of synthetic voices that may infringe on personal rights and privacy. Recent zero-shot text-to-speech (TTS) models pose higher risks as they can clone voices with a single utterance. However, the existing ADD datasets are outdated, leading to suboptimal generalization of detection models. In this paper, we construct a new cross-domain ADD dataset comprising over 300 hours of speech data that is generated by five advanced zero-shot TTS models. To simulate real-world scenarios, we employ diverse attack methods and audio prompts from different datasets. Experiments show that, through novel attack-augmented training, the Wav2Vec2-large and Whisper-medium models achieve equal error rates of 4.1% and 6.5% respectively. Additionally, we demonstrate our models' outstanding few-shot ADD ability by fine-tuning with just one minute of target-domain data. Nonetheless, neural codec compressors greatly affect the detection accuracy, necessitating further research.
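The results above are reported as equal error rates (EER), the operating point at which the false-positive rate and the false-negative rate of a detector coincide. The abstract does not say how the EER is computed, so the following is only a minimal generic sketch. The function name `equal_error_rate`, the label convention (1 = fake, 0 = real) and the assumption that a higher score means "more likely fake" are illustrative choices, not taken from the paper.

```python
from typing import Sequence, Tuple


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (eer, threshold) for a binary fake-vs-real detector.

    Assumed convention (not from the paper): labels are 1 for fake and
    0 for real, and a higher score means "more likely fake".
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_fake = sum(1 for y in labels if y == 1)
    n_real = len(labels) - n_fake
    if n_fake == 0 or n_real == 0:
        raise ValueError("need at least one fake and one real sample")

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "flag nothing as fake" operating point is also considered.
    candidates = sorted(set(scores)) + [max(scores) + 1.0]

    best_gap, best_eer, best_thr = float("inf"), 1.0, candidates[0]
    for thr in candidates:
        # Real samples flagged as fake (score at or above the threshold).
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
        # Fake samples that slip through (score below the threshold).
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr)
        fpr, fnr = fp / n_real, fn / n_fake
        gap = abs(fpr - fnr)
        if gap < best_gap:
            # With finitely many samples the two rates rarely match exactly,
            # so report their mean at the closest crossing point.
            best_gap, best_eer, best_thr = gap, (fpr + fnr) / 2, thr
    return best_eer, best_thr
```

This sketch scans every threshold, which is quadratic in the number of samples. That is acceptable for illustration, but a sorted single-pass sweep would be preferable for evaluation sets of realistic size.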