Detecting music deepfakes is easy but actually hard (2405.04181v2)
Abstract: In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. The ability to create credible minute-long music deepfakes in a few seconds on user-friendly platforms poses a real threat of fraud on streaming services and of unfair competition for human artists. This paper demonstrates the possibility (and surprising ease) of training classifiers on datasets comprising real audio and fake reconstructions, achieving a convincing accuracy of 99.8%. To our knowledge, this marks the first publication of a music deepfake detector, a tool that will help in the regulation of music forgery. Nevertheless, informed by decades of literature on forgery detection in other fields, we stress that a good test score is not the end of the story. We step back from the straightforward ML framework and expose many facets that could be problematic with such a deployed detector: calibration, robustness to audio manipulation, generalisation to unseen models, interpretability, and the possibility for recourse. This second part serves as a position statement on future research directions in the field and a caveat to the flourishing market of fake-content checkers.