
In-the-Wild Audio Deepfake Dataset

Updated 13 January 2026
  • In-the-Wild Audio Deepfake Dataset is a corpus combining authentic and manipulated audio from uncontrolled, real-world sources to enhance detection generalization.
  • The dataset reflects a wide range of acoustic variability, including diverse recording conditions, languages, and generative methods from contemporary social platforms.
  • Rigorous curation protocols—featuring speaker-disjoint splits, manual and automated annotations, and cross-domain evaluations—demonstrate its utility for realistic deepfake detection.

An In-the-Wild Audio Deepfake Dataset is a corpus of audio files containing both genuine and deepfake (synthetically generated or manipulated) content, collected from uncontrolled real-world sources (e.g., social media, streaming platforms, broadcast media) rather than from laboratory settings. Such datasets are uniquely valuable for developing and benchmarking generalizable audio deepfake detection systems, as they expose models to diverse recording conditions, noise profiles, speakers, languages, and attack types not encountered in conventional “in-domain” benchmarks.

1. Definitional Scope and Dataset Landscape

An "in-the-wild" audio deepfake dataset comprises bona fide and manipulated audio segments that originate from unconstrained, real-world contexts. Sources include YouTube, TikTok, podcasts, live streams, social media accounts, and crowd-sourced fact-checking platforms. Unlike curated laboratory datasets—which use fixed scripts, controlled microphones, and known spoofing systems—in-the-wild corpora reflect naturally occurring acoustic variability (background noise, channel effects, codec artifacts), spontaneous content, broad speaker demographics, and a mixture of contemporary generative methods, often unknown or evolving. These datasets aim to support rigorous “cross-domain” or “open-world” evaluation, revealing the domain-shift vulnerabilities of existing detectors (Müller et al., 2022, Wang et al., 4 Sep 2025, Ciobanu et al., 31 May 2025, Yi et al., 2023, Shahriar, 10 Jan 2026, Xie et al., 14 Aug 2025).

Released datasets exemplifying this paradigm include the “In-the-Wild” set of Müller et al. (Müller et al., 2022), Fake Speech Wild (Xie et al., 14 Aug 2025), Deepfake-Eval-2024 (Chandra et al., 4 Mar 2025), AI4T (Combei et al., 11 Jun 2025), XMAD-Bench (Ciobanu et al., 31 May 2025), ADD 2023 (Yi et al., 2024), and SpoofCeleb (Jung et al., 2024). The scale, language coverage, and underlying synthesis techniques used by these corpora vary, but they are united by their lack of controlled channel, scripted speech, or fixed TTS-source overlap between train and test splits.

2. Data Sources, Composition, and Curation Protocols

In-the-wild audio deepfake datasets are sourced from platforms with ongoing deepfake audio activity and real-world usage. For instance, the ITW corpus (Müller et al., 2022) contains 37.9 hours of audio (20.7 hours real, 17.2 hours fake) with 19,963 authentic and 11,816 fake utterances from 58 English-speaking public figures. Audio is extracted from public speeches, interviews, social media uploads, and verified deepfake videos, maintaining natural variability in audio quality and speaking style. User-submitted or automatically flagged media (e.g., TrueMedia, X Community Notes, fact-checking bots) broaden the data pool and introduce hard-to-discriminate edge cases (Chandra et al., 4 Mar 2025).

Data curation involves deduplication, silence removal, segmentation (using VAD, e.g., pyannote), manual and automated annotation, and strict speaker-disjoint splitting. Annotation efforts blend forensic inspection, corroboration via commercial detectors, and in some cases, crowd-sourcing or multi-stage review (Chandra et al., 4 Mar 2025, Xie et al., 14 Aug 2025). For speaker-disjoint protocols, no speaker appears in more than one split, preventing models from relying on identity cues (Shahriar, 10 Jan 2026).
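As a concrete illustration of these curation steps, the sketch below deduplicates files by content hash and strips leading/trailing silence. Note that the cited corpora use neural VAD (e.g., pyannote) for segmentation; this sketch substitutes a simple energy threshold, and the directory layout is hypothetical.

```python
import hashlib
from pathlib import Path

import librosa
import soundfile as sf

def dedupe_and_trim(src_dir: str, dst_dir: str, sr: int = 16000) -> None:
    """Minimal curation sketch: drop byte-identical duplicates, then
    strip leading/trailing silence from each clip (energy-based, not
    the neural VAD used by the cited datasets)."""
    seen: set[str] = set()
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(src_dir).glob("*.wav")):
        digest = hashlib.sha256(wav.read_bytes()).hexdigest()
        if digest in seen:  # exact duplicate of a file already kept
            continue
        seen.add(digest)
        y, _ = librosa.load(wav, sr=sr, mono=True)
        trimmed, _ = librosa.effects.trim(y, top_db=30)  # silence trim
        sf.write(out / wav.name, trimmed, sr)
```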

A representative summary of key dataset parameters is shown below:

| Dataset | # Samples | Hours (Real/Fake) | # Speakers | Langs | Source Domains | Gen. Methods |
|---|---|---|---|---|---|---|
| ITW (Müller et al., 2022) | ~31,800 | 20.7 / 17.2 | 58 | En | social media, streaming, news | TTS/VC, unknown in detail |
| FSW (Xie et al., 14 Aug 2025) | 146,097 | 117.69 / 136.89 | 128 | Zh | Bilibili, YouTube, Douyin, Ximalaya | community-published TTS/VC |
| Deepfake-Eval-24 (Chandra et al., 4 Mar 2025) | 1,820 | 56.5 total | N/A | 42 | social media, TrueMedia | manual/forensic flagging |
| SpoofCeleb (Jung et al., 2024) | >2.6M | N/A | 1,251 | En | VoxCeleb1 | 23 TTS pipelines |

3. Deepfake Generation Techniques and Diversity

Unlike "in-lab" datasets that use fixed or known TTS/VC engines, in-the-wild datasets aggregate synthetic content from multiple, often opaque, tools in current social use (Müller et al., 2022, Xie et al., 14 Aug 2025). Deepfake samples may be created with commercial, open-source, or private TTS engines (e.g., WaveNet, Tacotron2, FastSpeech2, VALL-E, Bark, XTTS v2, YourTTS, voice conversion pipelines) and, in certain cases, via user-modified or hybrid models not documented in research literature. Spoofing methods and attack sophistication therefore drift over time, tracking open-source releases or advances in commercial voice cloning.

No additional adversarial attacks are typically synthesized by dataset authors; instead, data are "found" with any manipulations imposed at creation or through social-platform distribution (e.g., lossy codecs, channel effects, re-recording) (Müller et al., 2022, Xie et al., 14 Aug 2025). For non-speech deepfakes (environmental or musical), deepfake samples are generated by contemporary text-to-audio or audio-to-audio models (e.g., AudioLDM, AudioGen, DiffSinger, so-vits-svc) operating on real-world prompts or human recordings (Zang et al., 2023, Yin et al., 25 May 2025).

4. Evaluation Protocols and Detection Benchmarks

Detection performance is typically measured using Equal Error Rate (EER), area under the ROC curve (AUC), precision/recall, and, for some challenges, more granular tasks such as region-level localization (framewise F1) or algorithm attribution (macro F1) (Müller et al., 2022, Yi et al., 2024, Shahriar, 10 Jan 2026, Chandra et al., 4 Mar 2025, Xie et al., 14 Aug 2025). EER is formally defined as the operating threshold $\tau^*$ at which $\mathrm{FAR}(\tau^*) = \mathrm{FRR}(\tau^*)$; AUC is the integral of the TPR over the FPR domain.
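A minimal EER computation from detector scores, assuming the convention that higher scores indicate spoofed audio (score polarity differs across benchmarks):

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """labels: 1 = spoof, 0 = bona fide; scores: higher = more likely spoof.
    Returns the EER, i.e. the error rate where FAR(t*) == FRR(t*)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr                        # false rejection rate
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where the two curves cross
    return float((fpr[idx] + fnr[idx]) / 2.0)
```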

Speaker-disjoint splits are used to suppress identity leakage, ensuring detectors must learn manipulation artifacts rather than speaker-specific features. For instance, in (Shahriar, 10 Jan 2026), utterances are split so that no speaker's voice appears in both training and test sets.
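One common way to obtain such a split, sketched here with scikit-learn's GroupShuffleSplit (the arrays are illustrative placeholders):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# features, labels, speaker_ids are parallel arrays over all utterances.
features = np.random.randn(1000, 40)         # placeholder features
labels = np.random.randint(0, 2, size=1000)  # 1 = spoof, 0 = bona fide
speaker_ids = np.random.randint(0, 58, size=1000)

# Group by speaker so no identity appears in both partitions.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(features, labels, groups=speaker_ids))
assert not set(speaker_ids[train_idx]) & set(speaker_ids[test_idx])
```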

Cross-dataset and cross-domain evaluations reveal severe generalization gaps. Models achieving sub-1% EER on controlled datasets (ASVspoof, ADD) can degrade to 30–60% EER when deployed on in-the-wild corpora (Müller et al., 2022, Yi et al., 2023, Ciobanu et al., 31 May 2025, Ahmadiadli et al., 10 May 2025). Best-in-class results on modern benchmarks are reported for sophisticated architectures: SSL front-ends (wav2vec2/XLS-R, Whisper, BEATs) combined with graph-neural or convolutional back-ends (AASIST, RawNet2, AFN), often augmented by data-centric strategies (culling, targeted augmentation, loss reweighting) (Combei et al., 11 Jun 2025, Ahmadiadli et al., 10 May 2025, Shahriar, 10 Jan 2026). Representative results are summarized below:

| System (Trained In-Domain) | EER (In-the-Wild) | EER (Source Domain) |
|---|---|---|
| RawNet2 | ~34% | 4% (ASVspoof19-LA) |
| XLS-R + ASSERT | ~41% | 21% (ASVspoof21-DF) |
| Best ADM (freq swap) | ~26% | <5% (lab data) |
| XLS-R-AASIST + FSW (Xie et al., 14 Aug 2025) | 11.6–17% | <1% (in-domain sets) |
| AUDETER-trained XLS-R + SLS (Wang et al., 4 Sep 2025) | 4.17% | N/A |

This gap confirms the necessity of in-the-wild benchmarks for realistic assessment.
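A cross-dataset evaluation harness of the kind used to expose these gaps can be sketched as follows; `score_fn` stands in for any trained detector, and the dataset dictionary is a placeholder.

```python
from typing import Callable, Dict, Tuple
import numpy as np
from sklearn.metrics import roc_curve

def _eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER with the convention: label 1 = spoof, higher score = spoof."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[i] + fnr[i]) / 2.0)

def cross_domain_eers(
    score_fn: Callable[[np.ndarray], np.ndarray],
    datasets: Dict[str, Tuple[np.ndarray, np.ndarray]],
) -> Dict[str, float]:
    """Evaluate one detector on several corpora (e.g., its in-domain
    test set plus one or more in-the-wild sets); report per-corpus EER."""
    return {
        name: _eer(labels, score_fn(batch))
        for name, (batch, labels) in datasets.items()
    }
```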

5. Challenges: Domain Shifts, Artifacts, and Robustness

Major technical challenges in in-the-wild audio deepfake detection derive from:

  • Domain and channel mismatch: Unconstrained recording devices, codecs, and background environments substantially increase detectors' false positive/negative rates relative to lab-controlled corpora (Müller et al., 2022, Wang et al., 4 Sep 2025, Chandra et al., 4 Mar 2025); a simulation sketch follows this list.
  • Spoofing method diversity and drift: Out-of-distribution generative models and rapid advances in “zero-shot” and neural-codec TTS yield synthetic speech whose artifacts are nearly indistinguishable from natural speech, leading to catastrophic performance decay for in-domain-tuned models (Li et al., 2024, Ciobanu et al., 31 May 2025).
  • Language and cultural coverage: Non-English content, under-represented accents, and regional dialects increase error rates and are under-sampled in earlier corpora (Ciobanu et al., 31 May 2025, Xie et al., 14 Aug 2025).
  • Inherent uncertainty and labeling ambiguity: Human annotators themselves exhibit non-trivial confusion rates, especially for short, noisy, or heavily compressed clips (Chandra et al., 4 Mar 2025).
  • Background contamination: Music, overlapping voices, synthetic reverberation, and transmission artifacts mask the subtle cues that distinguish bona fide from fake (Xie et al., 14 Aug 2025, Zang et al., 2023).
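The channel and codec effects listed above can be approximated during training with simple round-trip transcoding. The sketch below uses the ffmpeg CLI with Opus, one plausible choice among many, and assumes ffmpeg is installed on PATH.

```python
import os
import subprocess
import tempfile

def simulate_codec_channel(in_wav: str, out_wav: str, bitrate: str = "16k") -> None:
    """Round-trip a clean recording through a lossy codec (Opus here)
    to mimic the channel artifacts of social-platform distribution."""
    with tempfile.TemporaryDirectory() as tmp:
        coded = os.path.join(tmp, "coded.opus")
        # Encode at a low bitrate, as sharing platforms often do.
        subprocess.run(
            ["ffmpeg", "-y", "-i", in_wav, "-c:a", "libopus", "-b:a", bitrate, coded],
            check=True, capture_output=True,
        )
        # Decode back to 16 kHz mono PCM for the detector pipeline.
        subprocess.run(
            ["ffmpeg", "-y", "-i", coded, "-ar", "16000", "-ac", "1", out_wav],
            check=True, capture_output=True,
        )
```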

6. Design Patterns and Recommendations for Future Datasets

To address these challenges and establish robust benchmarks, the literature prescribes:

  • Strict speaker-disjoint (and, where possible, method-disjoint) train/dev/test splits to prevent identity leakage (Shahriar, 10 Jan 2026).
  • Transparent curation pipelines: deduplication, VAD-based segmentation, and multi-stage manual plus automated annotation (Chandra et al., 4 Mar 2025, Xie et al., 14 Aug 2025).
  • Broad coverage of languages, accents, and speaker demographics to close the gaps observed in earlier corpora (Ciobanu et al., 31 May 2025, Xie et al., 14 Aug 2025).
  • Diversity of generative methods, refreshed continually as commercial and open-source tools drift (Müller et al., 2022, Li et al., 2024).
  • Preservation of real-world channel, codec, and noise conditions rather than studio re-recording (Wang et al., 4 Sep 2025).
  • Standardized cross-domain evaluation protocols, including localization and attribution subtasks where applicable (Yi et al., 2024).

7. Impact, Benchmarking Standards, and Outlook

In-the-wild audio deepfake datasets have revealed that published detectors trained solely in laboratory conditions fail to generalize robustly, with observed EER increases of 500–1000% relative to benchmark test sets (Müller et al., 2022, Yi et al., 2023). Modern benchmarks (e.g., AUDETER, XMAD-Bench, ADD 2023, Deepfake-Eval-2024, FSW, SpoofCeleb) now serve as the primary evaluation targets for cross-domain and open-world generalization (Wang et al., 4 Sep 2025, Ciobanu et al., 31 May 2025, Jung et al., 2024, Chandra et al., 4 Mar 2025).

Accurate generalization requires detectors to suppress identity, language, and channel biases, leverage artifact-focused representation learning, and remain resilient under shifts in generative or channel distributions (Ahmadiadli et al., 10 May 2025). Data-centric methodologies (pruning, mixing, targeted augmentation) have enabled up to 55% reductions in EER on in-the-wild sets, offering more tangible progress than “blind scaling” of model complexity (Combei et al., 11 Jun 2025).

The field is trending toward continual-data expansion, explicit handling of multilingual and cross-class attacks, joint detection-localization-attribution protocols, and shared open benchmarks to advance the real-world applicability of audio deepfake detection systems.

Citations:

Müller et al., 2022; Yi et al., 2023; Chandra et al., 4 Mar 2025; Ciobanu et al., 31 May 2025; Combei et al., 11 Jun 2025; Jung et al., 2024; Xie et al., 14 Aug 2025; Wang et al., 4 Sep 2025; Shahriar, 10 Jan 2026; Müller et al., 2024; Ahmadiadli et al., 10 May 2025; Zang et al., 2023; Yi et al., 2024; Yin et al., 25 May 2025; Li et al., 2024.
