
Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception (2504.06753v1)

Published 9 Apr 2025 in cs.SD and cs.AI

Abstract: The rapid advancement of audio generation technologies has escalated the risks of malicious deepfake audio across speech, sound, singing voice, and music, threatening multimedia security and trust. While existing countermeasures (CMs) perform well in single-type audio deepfake detection (ADD), their performance declines in cross-type scenarios. This paper is dedicated to studying the all-type ADD task. We are the first to comprehensively establish an all-type ADD benchmark to evaluate current CMs, incorporating cross-type deepfake detection across speech, sound, singing voice, and music. Then, we introduce the prompt tuning self-supervised learning (PT-SSL) training paradigm, which optimizes the SSL frontend by learning specialized prompt tokens for ADD, requiring 458x fewer trainable parameters than fine-tuning (FT). Considering the auditory perception of different audio types, we propose the wavelet prompt tuning (WPT)-SSL method to capture type-invariant auditory deepfake information from the frequency domain without requiring additional training parameters, thereby enhancing performance over FT in the all-type ADD task. To achieve a universal CM, we utilize all types of deepfake audio for co-training. Experimental results demonstrate that WPT-XLSR-AASIST achieved the best performance, with an average EER of 3.58% across all evaluation sets. The code is available online.

Authors (8)
  1. Yuankun Xie (19 papers)
  2. Ruibo Fu (54 papers)
  3. Zhiyong Wang (120 papers)
  4. Xiaopeng Wang (53 papers)
  5. Songjun Cao (15 papers)
  6. Long Ma (116 papers)
  7. Haonan Cheng (7 papers)
  8. Long Ye (14 papers)

Summary

Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception

This paper addresses the challenge of detecting all types of deepfake audio, including speech, sound, singing voice, and music, using a novel method termed Wavelet Prompt Tuning (WPT). Current audio deepfake detection methods often excel at identifying deepfakes within a single audio type but struggle to generalize across multiple types. This research establishes a benchmark for evaluating the performance of countermeasures (CMs) across different audio types, an undertaking not previously attempted at this depth.

The authors introduce the Prompt Tuning Self-Supervised Learning (PT-SSL) training paradigm, designed to optimize self-supervised learning (SSL) models for the all-type audio deepfake detection task. The paradigm focuses on learning specialized prompt tokens, which significantly reduce the number of trainable parameters compared to traditional fine-tuning. In particular, the WPT method leverages the discrete wavelet transform to better capture type-invariant auditory patterns in the frequency domain, improving the model's performance without adding trainable parameters.
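The core ideas above can be sketched in a few lines. The following is a minimal, illustrative NumPy sketch, not the authors' implementation: `frozen_ssl_frontend` is a toy stand-in for a frozen SSL model (e.g. XLS-R), the prompt tokens are the only "trainable" parameters, and a single-level Haar transform stands in for the wavelet analysis, which introduces no parameters at all.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_ssl_frontend(waveform, dim=8):
    # Toy stand-in for a frozen SSL frontend: maps raw audio to a
    # (T, D) feature sequence via a fixed (non-trainable) projection.
    frames = waveform.reshape(-1, 4)          # 4 samples per frame (toy)
    proj = rng.standard_normal((4, dim))      # frozen weights
    return frames @ proj                      # (T, D)

def haar_dwt(x):
    # Single-level Haar discrete wavelet transform along the time axis:
    # low-pass (approximation) and high-pass (detail) coefficients.
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)
    detail = (even - odd) / np.sqrt(2)
    return approx, detail

# PT-SSL: the learnable prompt tokens are the ONLY trainable parameters;
# in a real setup these would receive gradients while the SSL model stays frozen.
num_prompts, dim = 2, 8
prompts = rng.standard_normal((num_prompts, dim))

waveform = rng.standard_normal(64)
features = frozen_ssl_frontend(waveform, dim)            # (16, 8), frozen

# Prompt tuning: prepend prompt tokens to the frozen feature sequence.
pt_input = np.concatenate([prompts, features], axis=0)   # (18, 8)

# WPT: additionally expose frequency-domain views of the features via the
# wavelet transform -- note no new trainable parameters are introduced.
approx, detail = haar_dwt(features)                      # (8, 8) each
wpt_input = np.concatenate([prompts, approx, detail], axis=0)

print(pt_input.shape, wpt_input.shape)
```

The sketch shows why prompt tuning is so parameter-efficient: only `prompts` (here 2x8 values) would be updated during training, while the frontend and the wavelet decomposition contribute nothing to the trainable parameter count.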

A key advancement presented is the capability to address cross-type audio deepfake detection efficiently. Using the WPT-SSL-AASIST approach, the authors report that their model achieved an average Equal Error Rate (EER) of 3.58% across all evaluation sets, highlighting its robust performance against diverse audio deepfake challenges. This result underscores the method's potential as a universal CM for audio deepfakes.

The paper's contribution is significant in the context of developing a cohesive framework for detecting deepfake audio across varying domains. The application of wavelet techniques to enhance prompt tuning is a novel concept in detecting universal auditory discrepancies within synthesized sounds. The authors suggest that by focusing on frequency-domain invariance, their method could serve as a foundation for future research in generalizable audio deepfake detection solutions.

Moreover, the implications of this research stretch into practical domains such as media security, where the need to secure communications against unauthorized auditory fabrications is increasingly vital. The work posits future developments that may involve extending this approach to other modalities using similar prompt tuning adaptations.

In summary, the paper makes a compelling case for the wavelet-enhanced approach to audio deepfake detection, presenting a significant stride towards universal solutions for safeguarding auditory media content.
