AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan

Published 9 Apr 2026 in cs.SD and cs.AI | (2604.08184v1)

Abstract: The rapid advancement of Audio LLMs (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content production, they also introduce significant security and trust challenges, as realistic audio deepfakes can now be generated and disseminated at scale. Existing audio deepfake detection (ADD) countermeasures (CMs) and benchmarks, however, remain largely speech-centric, often relying on speech-specific artifacts and exhibiting limited robustness to real-world distortions, as well as restricted generalization to heterogeneous audio types and emerging spoofing techniques. To address these gaps, we propose the All-Type Audio Deepfake Detection (AT-ADD) Grand Challenge for ACM Multimedia 2026, designed to bridge controlled academic evaluation with practical multimedia forensics. AT-ADD comprises two tracks: (1) Robust Speech Deepfake Detection, which evaluates detectors under real-world scenarios and against unseen, state-of-the-art speech generation methods; and (2) All-Type Audio Deepfake Detection, which extends detection beyond speech to diverse, unknown audio types and promotes type-agnostic generalization across speech, sound, singing, and music. By providing standardized datasets, rigorous evaluation protocols, and reproducible baselines, AT-ADD aims to accelerate the development of robust and generalizable audio forensic technologies, supporting secure communication, reliable media verification, and responsible governance in an era of pervasive synthetic audio.


Authors (13)

Summary

The paper introduces a new benchmark that evaluates both speech and all-type audio deepfake detection under realistic distortions.
It details a dual-track evaluation protocol using extensive datasets from over 40 speech models and 70 multi-type generators.
Key results show SSL-based and ALLM methods significantly outperform conventional approaches, advancing universal detection.

AT-ADD: Design and Protocols for All-Type Audio Deepfake Detection Benchmarking

Motivation and Problem Statement

The proliferation of Audio LLMs (ALLMs) has enabled high-fidelity, scalable synthesis of diverse audio content including speech, non-speech sounds, singing voices, and music. This expansion exposes significant challenges in audio forensics: existing audio deepfake detection (ADD) systems and benchmarks are typically speech-centric, over-rely on artifacts specific to speech synthesis, and fail to provide robustness or generalization to a wide spectrum of realistic distortions and out-of-domain audio types. There remains a critical gap in countermeasures (CMs) that can generalize robustly both to varied signal perturbations and to heterogeneous, previously unseen categories of audio manipulated with state-of-the-art methodologies.

AT-ADD Challenge Structure

The AT-ADD Grand Challenge is formalized as a two-track evaluation protocol to systematically advance the field:

Track 1 (Robust Speech Deepfake Detection): Focuses on reliability under real-world channel conditions, strong domain shift (e.g., diverse recording devices, environments, languages), and realistic degradations (e.g., reverberation, noise, replay, compression, pitch/speed perturbation), as well as resilience to fake speech from >40 generators spanning vocoder, neural codec, and diffusion-based paradigms, with an emphasis on ALLM-derived approaches. The evaluation set is composed of unseen generation methods to stress the out-of-distribution (OOD) generalization capacity of submissions.
Track 2 (All-Type Audio Deepfake Detection): Extends the task to encompass all major audio categories—speech, environmental sound, singing, and music—with no type indication provided at inference. Track 2 uses >70 audio generation models, requiring type-agnostic models that generalize to unseen classes and generators. The protocol emphasizes capturing universal synthesis artifacts rather than overfitting to speech-specific patterns.

Across both tracks, a closed data setting is strictly enforced: only challenge-provided data is allowed for training or adaptation, ensuring a fair, controlled comparison of CM generalization and robustness capabilities.
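Two of the degradations named for Track 1, additive noise and reverberation, are simple signal-level operations that can be applied to the provided data. The following is a minimal NumPy sketch of both, not the challenge's official perturbation pipeline: the function names, the noise-looping strategy, and the choice to truncate the reverberant tail are illustrative assumptions.

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Mix `noise` into `signal` at a target signal-to-noise ratio (in dB).

    Hypothetical helper; assumes `noise` is not all zeros.
    """
    # Repeat or trim the noise clip so it matches the signal length.
    noise = np.resize(noise, signal.shape)
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the scale so that signal_power / (scale^2 * noise_power)
    # equals 10^(snr_db / 10).
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise

def apply_rir(signal, rir):
    """Simulate reverberation by convolving with a room impulse response.

    The output is truncated to the input length; no level normalization
    is applied (an assumption made here for simplicity).
    """
    return np.convolve(signal, rir, mode="full")[: len(signal)]
```

For example, `add_noise_at_snr(speech, babble, snr_db=10.0)` returns the waveform with the noise scaled to sit 10 dB below the speech power, and `apply_rir(speech, rir)` returns a reverberant copy of the same length as the input.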

Datasets and Task Specification

The challenge provides two extensive, high-diversity datasets:

Track 1: Samples from >40 modern TTS and VC architectures, using strong domain variability for real data and complex perturbation protocols to simulate realistic deployments. Both training/dev and eval splits are constructed with non-overlapping speaker, text, and generator sets to test generalization and prevent label leakage.
Track 2: Integrates data from major sound, singing, and music corpora together with corresponding generation models (e.g., AudioLDM, MusicGen, Stable Audio Open), with splits carefully constructed for OOD evaluation. Fake samples are deterministically generated from textual or reference conditionings to avoid confounds, while OOD real data evaluates cross-dataset transfer.
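The non-overlap requirement between training/dev and eval splits can be checked mechanically. The sketch below assumes each sample is described by a metadata dictionary with hypothetical `speaker`, `text`, and `generator` fields; the challenge's actual metadata format is not specified here, so the field names are placeholders.

```python
def find_split_overlaps(train_rows, eval_rows, keys=("speaker", "text", "generator")):
    """Return, for each key, the set of values shared by both splits.

    `train_rows` and `eval_rows` are lists of metadata dicts. Missing or
    None values (e.g., `generator` for real audio) are ignored.
    The field names are illustrative assumptions, not an official schema.
    """
    overlaps = {}
    for key in keys:
        train_values = {row[key] for row in train_rows if row.get(key) is not None}
        eval_values = {row[key] for row in eval_rows if row.get(key) is not None}
        overlaps[key] = train_values & eval_values
    return overlaps
```

An empty set for every key indicates that the two splits share no speakers, texts, or generators; any non-empty set lists the values that leak across splits.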

In both tracks, the core binary classification task is to predict real vs. fake for a given audio sample, with Track 2 omitting type information at test time to enforce universal modeling.

Baselines and Evaluation

Official baselines encompass:

Conventional CMs: E.g., Spec-ResNet (spectral features with standard ResNet backbone), AASIST (waveform-based, using sinc convolution and attention), offering reference points for non-SSL approaches.
SSL-based CMs: FT-XLSR-AASIST (full fine-tuning of XLSR as feature frontend and AASIST as backend), and WPT-XLSR-AASIST (adds wavelet prompt tuning for frequency invariance). These models consistently outperform conventional methods on both robustness and generalization.
ALLM-based CMs: E.g., Qwen2.5-Omni (a multimodal LLM with audio input, fine-tuned for binary ADD), which provides competitive results despite its generic design and offers potential advantages for interpretability and unified modeling.

Key results: FT-XLSR-AASIST achieves the highest macro-F1 across both tracks (>76% on speech, >79% on all-type), while ALLM-based approaches show robust, balanced performance across types (~62–69% macro-F1). Conventional baselines score 47–62% macro-F1 across Tracks 1 and 2, consistent with the established gap between conventional and SSL/ALLM-based CMs.

Track 2 demonstrates type-dependent difficulty: singing voice detection approaches parity with speech, while sound and music detection remain more challenging due to their greater acoustic variability and less mature benchmarks. This underscores the need for further research into type-agnostic ADD, as existing methods do not fully unify cross-type generalization.

Evaluation Metrics and Competition Protocol

Evaluation employs per-class and per-type macro-F1, mitigating class/type imbalance and enforcing strict balance in Track 2 via a two-level aggregation. The closed setting prohibits external data, self-training, or augmentation beyond signal-level perturbations (e.g., additive noise, RIR convolution) applied to official data. Submissions must be fully reproducible and limited to ensembles of 5 or fewer systems.
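One plausible reading of the two-level aggregation is: compute a class-level macro-F1 (the mean of the F1 scores for "real" and "fake") separately within each audio type, then average those per-type scores with equal weight. The sketch below implements that reading in plain Python. The exact official formula is not given in this summary, so the aggregation order, the label strings, and the function names are assumptions.

```python
from collections import defaultdict

def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    # Convention used here: F1 is 0 when the class is never seen or predicted.
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred, classes=("real", "fake")):
    """Level 1: unweighted mean of per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

def two_level_macro_f1(y_true, y_pred, types):
    """Level 2: unweighted mean of per-type macro-F1 scores.

    `types` gives the audio type of each sample (e.g., "speech", "music").
    Returns (overall_score, per_type_scores).
    """
    grouped = defaultdict(lambda: ([], []))
    for t, p, audio_type in zip(y_true, y_pred, types):
        grouped[audio_type][0].append(t)
        grouped[audio_type][1].append(p)
    per_type = {k: macro_f1(ts, ps) for k, (ts, ps) in grouped.items()}
    overall = sum(per_type.values()) / len(per_type)
    return overall, per_type
```

Under this scheme, a type with few samples counts as much as a type with many, so a detector cannot reach a high overall score by performing well only on the most heavily represented type.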

Implications and Future Directions

AT-ADD establishes a rigorous, scalable, and unified benchmark for robust and generalizable audio deepfake detection. By enforcing closed evaluation protocols and including strong OOD splits for both generators and content types, AT-ADD sets a new empirical standard for advancing CMs beyond speech-centric, artifact-driven detection.

Implications include:

Practical Forensics: Progress toward reliable deployment in media verification, authentication, and evidence analysis, especially as wide-band and multimodal fakes become pervasive.
Theoretical Generalization: The structure and difficulty of Track 2 in particular reveal open challenges in universal representation learning and generalization across heterogeneous acoustic phenomena, which will require fundamental innovations—potentially in multimodal/ALLM prompt tuning, type-invariant SSL, or novel regularization schemes.
AI Safety and Governance: Robust, type-agnostic ADD is essential for supporting regulatory and authentication frameworks, given the proliferation of synthetic media in critical infrastructure and information flows.

Future work is expected to focus on: synthesizer-agnostic modeling (e.g., improved handling of neural codec/diffusion artifacts), interpretable reasoning (as motivated by ALLM baselines), foundation modeling for type-universal audio understanding, and standardized, scalable evaluation protocols.

Conclusion

The AT-ADD Challenge provides a comprehensive and operationally realistic benchmarking protocol for audio deepfake detection. By decoupling the robustness and generalization dimensions, providing large-scale, heterogeneously sourced datasets, and aligning metrics and rules for strict comparability, the challenge fosters progress toward deployable, type-universal forensic CMs and will likely serve as a reference point for both academic and industrial audio integrity research.
