Analyzing the Generalization Capabilities of Audio Deepfake Detection Models
The paper "Does Audio Deepfake Detection Generalize?" addresses a pertinent challenge in the field of audio spoof detection, specifically focusing on the generalization capabilities of current deep learning models. As the sophistication of text-to-speech (TTS) systems advances, the ability to create highly realistic audio deepfakes becomes more feasible, posing risks of misinformation and potential misuse. This paper investigates whether existing detection systems can effectively identify such deepfakes beyond controlled environments using standardized datasets such as ASVspoof.
Methodology and Findings
The researchers take a comprehensive evaluation approach, re-implementing prominent model architectures from the audio spoof detection literature and assessing them under a uniform training and testing regime, in order to disentangle the factors that contribute to model success. Notably, the paper finds that the choice of input features matters considerably: constant-Q transform spectrograms (cqtspec) and log spectrograms (logspec) reduce error rates by roughly 37% relative to mel-scaled spectrograms (melspec). A sketch of these three feature types follows below.
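To make the feature comparison concrete, the following minimal sketch computes all three feature types with librosa. The parameter values (sample rate, hop length, bin counts) are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
import librosa

def extract_features(path, sr=16000):
    """Compute cqtspec, logspec, and melspec for one utterance.

    Parameter values (hop length, bin counts) are illustrative
    assumptions, not the paper's exact configuration.
    """
    y, sr = librosa.load(path, sr=sr)

    # Constant-Q transform spectrogram (cqtspec): log-spaced frequency bins.
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=256, n_bins=84))
    cqtspec = librosa.amplitude_to_db(cqt, ref=np.max)

    # Log-magnitude STFT spectrogram (logspec): linear frequency bins.
    stft = np.abs(librosa.stft(y, n_fft=512, hop_length=256))
    logspec = librosa.amplitude_to_db(stft, ref=np.max)

    # Mel-scaled spectrogram (melspec): the baseline the paper found weaker.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                         hop_length=256, n_mels=80)
    melspec = librosa.power_to_db(mel, ref=np.max)

    return cqtspec, logspec, melspec
```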
The evaluation covers a suite of architectures: LSTM-based models, LCNNs, MesoNets, ResNet18, Transformer models, and several end-to-end raw-audio models built on graph attention networks and Sinc layers. Performance is assessed on both the ASVspoof 2019 Logical Access dataset and a newly curated in-the-wild dataset of real-world deepfakes featuring public figures. This in-the-wild dataset comprises 37.9 hours of recordings split between authentic and deepfake clips, offering a more realistic gauge of model effectiveness beyond laboratory conditions.
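To give one of these architectures concrete form, here is a minimal sketch of the Max-Feature-Map (MFM) activation at the heart of LCNNs. The layer and channel sizes are assumptions for illustration and do not reproduce the paper's exact model.

```python
import torch
import torch.nn as nn

class MaxFeatureMap(nn.Module):
    """Max-Feature-Map activation used in LCNNs: split the channel
    dimension in half and keep the element-wise maximum, acting as a
    learned feature selector."""
    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)  # split channels into two halves
        return torch.max(a, b)

class LCNNBlock(nn.Module):
    """One illustrative LCNN block: the conv doubles the channels and
    MFM halves them again. Kernel/channel sizes are assumptions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel_size=3, padding=1)
        self.mfm = MaxFeatureMap()
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        return self.pool(self.mfm(self.conv(x)))

# Usage: a (batch, 1, freq, time) spectrogram tensor passes through.
x = torch.randn(4, 1, 84, 400)   # e.g. a batch of CQT spectrograms
block = LCNNBlock(in_ch=1, out_ch=32)
print(block(x).shape)            # torch.Size([4, 32, 42, 200])
```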
Generalization Insights
The empirical results reveal a stark contrast between in-domain performance on ASVspoof data and out-of-domain performance on the real-world deepfake data. Most models degrade markedly, with equal error rates increasing by roughly 200 to 1000 percent. This suggests overfitting to characteristics unique to the ASVspoof dataset, which consists primarily of synthetic utterances created under controlled conditions, and indicates that the models may not generalize to unseen, realistic audio attacks outside these benchmarks.
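The degradation is typically quantified with the equal error rate (EER), the operating point where the false-acceptance and false-rejection rates coincide. Below is a minimal sketch using scikit-learn; the in-domain and out-of-domain numbers at the end are hypothetical placeholders, not the paper's reported values.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the point on the ROC curve where the false-positive rate
    equals the false-negative rate (1 - TPR)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2

# Hypothetical numbers: an in-domain EER of 3% rising to 15%
# out-of-domain is a 400% relative increase, within the 200-1000%
# range the paper reports.
eer_in, eer_out = 0.03, 0.15
print(f"relative increase: {100 * (eer_out - eer_in) / eer_in:.0f}%")
```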
Furthermore, the investigation shows that truncating audio inputs to a fixed length of four seconds compromises performance; the authors therefore advocate using full-length recordings to preserve the context needed for accurate detection. This finding challenges common preprocessing practice in the field and encourages a reassessment of such pipelines to improve robustness. The sketch below contrasts the two policies.
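A hedged sketch of the two preprocessing policies: the four-second window matches the practice the paper questions, while the function names and batching details are my own assumptions.

```python
import torch

def truncate_or_pad(wave: torch.Tensor, sr: int = 16000,
                    seconds: float = 4.0) -> torch.Tensor:
    """Fixed-length policy: crop long clips to `seconds`, zero-pad
    short ones. Discards any context beyond the window."""
    target = int(sr * seconds)
    if wave.numel() >= target:
        return wave[:target]
    return torch.nn.functional.pad(wave, (0, target - wave.numel()))

def collate_full_length(waves: list[torch.Tensor]) -> torch.Tensor:
    """Full-length policy: zero-pad every clip in the batch to the
    longest one, so the whole recording reaches the model."""
    longest = max(w.numel() for w in waves)
    return torch.stack([
        torch.nn.functional.pad(w, (0, longest - w.numel()))
        for w in waves
    ])
```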
Implications and Future Directions
The conclusions underscore the critical need to improve the generalization capabilities of audio deepfake detection systems. As audio deepfake generation advances rapidly, detection mechanisms must remain agile and adaptable to diverse, dynamic real-world scenarios. The findings call for a reassessment of the datasets used to train and test detection systems, urging the community to develop canonical datasets that better represent the range of synthetic audio generation techniques.
For future work, exploring novel architectures designed explicitly to generalize across diverse audio synthesis methods could be beneficial. Incorporating semi-supervised learning or domain adaptation techniques might also improve performance on realistic data; one generic example follows below. This research calls for ongoing effort to ensure that detection capabilities keep pace with the rapidly advancing threat posed by audio deepfakes, so that the integrity and trustworthiness of digital audio content can be maintained in an era of increasingly sophisticated forgeries.
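As one possible direction, a gradient reversal layer (as in DANN-style domain adaptation) could push a detector's feature extractor toward domain-invariant representations. This is a generic sketch of that technique, not something the paper implements; the feature size and domain labels are hypothetical.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer (DANN-style): identity on the forward
    pass, negated and scaled gradient on the backward pass, so an
    adversarial domain classifier drives domain-invariant features."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage sketch: features feed both the spoof classifier and, through
# the reversal, a domain classifier (e.g. ASVspoof vs. in-the-wild).
feats = torch.randn(8, 128, requires_grad=True)   # hypothetical features
domain_logits = torch.nn.Linear(128, 2)(grad_reverse(feats, lam=0.5))
```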