I Can Hear You: Selective Robust Training for Deepfake Audio Detection (2411.00121v1)
Abstract: Recent advances in AI-generated voices have intensified the challenge of detecting deepfake audio, posing risks for scams and the spread of disinformation. To tackle this issue, we establish the largest public voice dataset to date, DeepFakeVox-HQ, comprising 1.3 million samples, including 270,000 high-quality deepfake samples from 14 diverse sources. Despite previously reported high accuracy, existing deepfake voice detectors struggle on our diversely collected dataset, and their detection success rates drop even further under realistic corruptions and adversarial attacks. We conduct a holistic investigation into the factors that improve model robustness and show that incorporating a diversified set of voice augmentations is beneficial. Moreover, we find that the best detection models often rely on high-frequency features, which are imperceptible to humans and can be easily manipulated by an attacker. To address this, we propose F-SAT (Frequency-Selective Adversarial Training), which focuses adversarial training on high-frequency components. Empirically, training on our dataset boosts baseline model performance (without robust training) by 33%, and our robust training further improves accuracy by 7.7% on clean samples and by 29.3% on corrupted and attacked samples, relative to the state-of-the-art RawNet3 model.
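The central idea in the abstract, restricting adversarial training to the high-frequency components that detectors over-rely on, can be illustrated with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' released F-SAT implementation: the PGD-style inner loop, the 4 kHz cutoff, the step sizes, and the `model(waveform) -> logits` detector interface are all placeholders chosen for clarity.

```python
# Sketch of frequency-selective adversarial training: perturbations are
# confined to high-frequency bands of the waveform via an rFFT mask.
# All hyperparameters and the model interface are illustrative assumptions.
import torch
import torch.nn.functional as F


def high_frequency_mask(num_samples, sample_rate, cutoff_hz, device):
    """Boolean mask over rFFT bins, True for frequencies at or above cutoff_hz."""
    freqs = torch.fft.rfftfreq(num_samples, d=1.0 / sample_rate, device=device)
    return freqs >= cutoff_hz


def fsat_perturb(model, waveform, labels, sample_rate=16_000,
                 cutoff_hz=4_000.0, epsilon=1e-3, alpha=2e-4, steps=5):
    """PGD-style perturbation whose energy lives only in high-frequency bands."""
    delta = torch.zeros_like(waveform, requires_grad=True)
    mask = high_frequency_mask(waveform.shape[-1], sample_rate, cutoff_hz, waveform.device)

    for _ in range(steps):
        loss = F.cross_entropy(model(waveform + delta), labels)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            # Take a signed gradient step, then zero out its low-frequency content.
            spec = torch.fft.rfft(alpha * grad.sign(), dim=-1)
            step = torch.fft.irfft(spec * mask.float(), n=waveform.shape[-1], dim=-1)
            delta += step
            delta.clamp_(-epsilon, epsilon)  # keep the perturbation small in amplitude
    return (waveform + delta).detach()


def robust_training_step(model, optimizer, waveform, labels):
    """One robust-training step: fit the detector on frequency-selective adversarial audio."""
    adv = fsat_perturb(model, waveform, labels)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(adv), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Masking the gradient step in the rFFT domain leaves the low-frequency, audible band essentially untouched, which mirrors the abstract's observation that the manipulated cues are imperceptible to humans yet decisive for the detector.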
- Detection of copy-move forgery in audio signal with mel frequency and delta-mel frequency kepstrum coefficients. In 2021 Innovations in Intelligent Systems and Applications Conference (ASYU), pp. 1–6. IEEE, 2021.
- Deep residual neural networks for audio spoofing detection. arXiv preprint arXiv:1907.00501, 2019.
- James Betker. Better speech synthesis through scaling. arXiv preprint arXiv:2305.07243, 2023.
- Is synthetic voice detection research going into the right direction? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 71–80, 2022.
- Guillermo Calahorra-Candao and María José Martín-de Hoyos. The effect of anthropomorphism of virtual voice assistants on perceived safety as an antecedent to voice shopping. Computers in Human Behavior, 153:108124, 2024.
- Xtts: a massively multilingual zero-shot text-to-speech model. arXiv preprint arXiv:2406.04904, 2024.
- Towards understanding and mitigating audio adversarial examples for speaker recognition. IEEE Transactions on Dependable and Secure Computing, 20(5):3970–3987, 2023. doi: 10.1109/TDSC.2022.3220673.
- Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp. 702–703, 2020.
- Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. arXiv preprint arXiv:2005.07143, 2020.
- Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407, 2024.
- Wavefake: A data set to facilitate audio deepfake detection. arXiv preprint arXiv:2111.02813, 2021.
- Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 776–780. IEEE, 2017.
- WaveGuard: Understanding and mitigating audio adversarial examples. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2273–2290, 2021.
- Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. arXiv preprint arXiv:2403.03100, 2024.
- Pushing the limits of raw waveform speaker recognition. arXiv preprint arXiv:2203.08488, 2022.
- Replay and synthetic speech detection with res2net architecture. In ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6354–6358. IEEE, 2021.
- Styletts 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models. Advances in Neural Information Processing Systems, 36, 2024.
- Meta-voice: Fast few-shot style transfer for expressive voice cloning using meta learning. arXiv preprint arXiv:2111.07218, 2021.
- Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2507–2522, 2023.
- Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- Metric learning for adversarial robustness. Advances in neural information processing systems, 32, 2019.
- Does audio deepfake detection generalize? arXiv preprint arXiv:2203.16263, 2022.
- Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612, 2017.
- Improving robustness of llm-based speech synthesis by learning monotonic alignment. arXiv preprint arXiv:2406.17957, 2024.
- Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5206–5210. IEEE, 2015.
- Voicecraft: Zero-shot speech editing and text-to-speech in the wild. arXiv preprint arXiv:2403.16973, 2024.
- Diffusion-based voice conversion with fast maximum likelihood sampling scheme. arXiv preprint arXiv:2109.13821, 2021.
- Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pp. 28492–28518. PMLR, 2023.
- Isolated and ensemble audio preprocessing methods for detecting adversarial examples against automatic speech recognition. arXiv preprint arXiv:1809.04397, 2018.
- Speaker recognition from raw waveform with sincnet. In 2018 IEEE spoken language technology workshop (SLT), pp. 1021–1028. IEEE, 2018.
- Md Sahidullah and Goutam Saha. Design, analysis and experimental evaluation of block based transformation in mfcc computation for speaker recognition. Speech communication, 54(4):543–565, 2012.
- Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. arXiv preprint arXiv:2304.09116, 2023.
- Ai-synthesized voice detection using neural vocoder artifacts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 904–912, 2023.
- Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- End-to-end anti-spoofing with rawnet2. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6369–6373. IEEE, 2021.
- Constant q cepstral coefficients: A spoofing countermeasure for automatic speaker verification. Computer Speech & Language, 45:516–535, 2017.
- Asvspoof 2019: Future horizons in spoofed and fake audio detection. arXiv preprint arXiv:1904.05441, 2019.
- Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.
- Defending against adversarial audio via diffusion model. arXiv preprint arXiv:2303.01507, 2023.
- Listening to sounds of silence for speech denoising. Advances in Neural Information Processing Systems, 33:9633–9648, 2020.
- Junichi Yamagishi. English multi-speaker corpus for CSTR voice cloning toolkit. URL http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html, 2012.
- A robust audio deepfake detection system via multi-view feature. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13131–13135. IEEE, 2024.
- Learning filterbanks from raw speech for phone recognition. In 2018 IEEE international conference on acoustics, speech and signal Processing (ICASSP), pp. 5509–5513. IEEE, 2018.
- Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pp. 7472–7482. PMLR, 2019.
- Fake speech detection using residual network with transformer encoder. In Proceedings of the 2021 ACM workshop on information hiding and multimedia security, pp. 13–22, 2021.