In-the-Wild Audio Deepfake Dataset
- In-the-Wild Audio Deepfake Dataset is a corpus combining authentic and manipulated audio from uncontrolled, real-world sources to enhance detection generalization.
- The dataset reflects a wide range of acoustic variability, including diverse recording conditions, languages, and generative methods from contemporary social platforms.
- Rigorous curation protocols—featuring speaker-disjoint splits, manual and automated annotation, and cross-domain evaluation—underpin its utility for realistic deepfake detection benchmarking.
An In-the-Wild Audio Deepfake Dataset is a corpus of audio files containing both genuine and deepfake (synthetically generated or manipulated) content, collected from uncontrolled real-world sources (e.g., social media, streaming platforms, broadcast media) rather than from laboratory settings. Such datasets are uniquely valuable for developing and benchmarking generalizable audio deepfake detection systems, as they expose models to diverse recording conditions, noise profiles, speakers, languages, and attack types not encountered in conventional “in-domain” benchmarks.
1. Definitional Scope and Dataset Landscape
An "in-the-wild" audio deepfake dataset comprises bona fide and manipulated audio segments that originate from unconstrained, real-world contexts. Sources include YouTube, TikTok, podcasts, live streams, social media accounts, and crowd-sourced fact-checking platforms. Unlike curated laboratory datasets—which use fixed scripts, controlled microphones, and known spoofing systems—in-the-wild corpora reflect naturally occurring acoustic variability (background noise, channel effects, codec artifacts), spontaneous content, broad speaker demographics, and a mixture of contemporary generative methods, often unknown or evolving. These datasets aim to support rigorous “cross-domain” or “open-world” evaluation, revealing the domain-shift vulnerabilities of existing detectors (Müller et al., 2022, Wang et al., 4 Sep 2025, Ciobanu et al., 31 May 2025, Yi et al., 2023, Shahriar, 10 Jan 2026, Xie et al., 14 Aug 2025).
Released datasets exemplifying this paradigm include the “In-the-Wild” set of Müller et al. (Müller et al., 2022), Fake Speech Wild (Xie et al., 14 Aug 2025), Deepfake-Eval-2024 (Chandra et al., 4 Mar 2025), AI4T (Combei et al., 11 Jun 2025), XMAD-Bench (Ciobanu et al., 31 May 2025), ADD 2023 (Yi et al., 2024), and SpoofCeleb (Jung et al., 2024). These corpora vary in scale, language coverage, and underlying synthesis techniques, but they are united by the absence of controlled channels, scripted speech, and fixed TTS-source overlap between training and test splits.
2. Data Sources, Composition, and Curation Protocols
In-the-wild audio deepfake datasets are sourced from platforms with ongoing deepfake audio activity and real-world usage. For instance, the ITW corpus (Müller et al., 2022) contains 37.9 hours of audio (20.7 hours real, 17.2 hours fake) with 19,963 authentic and 11,816 fake utterances from 58 English-speaking public figures. Audio is extracted from public speeches, interviews, social media uploads, and verified deepfake videos, maintaining natural variability in audio quality and speaking style. User-submitted or automatically flagged media (e.g., TrueMedia, X Community Notes, fact-checking bots) broaden the data pool and introduce hard-to-discriminate edge cases (Chandra et al., 4 Mar 2025).
Data curation involves deduplication, silence removal, segmentation (using VAD, e.g., pyannote), manual and automated annotation, and strict speaker-disjoint splitting. Annotation efforts blend forensic inspection, corroboration via commercial detectors, and in some cases, crowd-sourcing or multi-stage review (Chandra et al., 4 Mar 2025, Xie et al., 14 Aug 2025). For speaker-disjoint protocols, no speaker appears in more than one split, preventing models from relying on identity cues (Shahriar, 10 Jan 2026).
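As a concrete illustration of such a curation pass, the following is a minimal sketch, assuming pyannote.audio is installed (its pretrained VAD pipeline may require a Hugging Face access token); the file manifest and the use of content hashing for exact-duplicate removal are illustrative assumptions, not any dataset's documented recipe.

```python
# A minimal curation sketch: exact-duplicate removal plus VAD segmentation.
# Assumes pyannote.audio is installed; the pretrained pipeline may require
# a Hugging Face access token. File paths are hypothetical placeholders.
import hashlib
from pyannote.audio import Pipeline

vad = Pipeline.from_pretrained("pyannote/voice-activity-detection")

def audio_fingerprint(path, chunk=1 << 20):
    """Exact-duplicate check via content hashing (not perceptual matching)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def speech_segments(path):
    """Return (start, end) times in seconds for detected speech regions."""
    return [(seg.start, seg.end) for seg in vad(path).get_timeline()]

seen = set()
for wav in ["clip_001.wav", "clip_002.wav"]:  # placeholder manifest
    fp = audio_fingerprint(wav)
    if fp in seen:
        continue  # drop exact duplicates before segmentation
    seen.add(fp)
    print(wav, speech_segments(wav))
```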
A representative summary of key dataset parameters is shown below:
| Dataset | # Samples | Hours (Real/Fake) | # Speakers | Langs | Source Domains | Gen. Methods |
|---|---|---|---|---|---|---|
| ITW (Müller et al., 2022) | ~31,800 | 20.7/17.2 | 58 | En | social media, streaming, news | TTS/VC, unknown in detail |
| FSW (Xie et al., 14 Aug 2025) | 146,097 | 117.69/136.89 | 128 | Zh | Bilibili, YouTube, Douyin, Ximalaya | Community-published TTS/VC |
| Deepfake-Eval-24 (Chandra et al., 4 Mar 2025) | 1,820 | 56.5 total | N/A | 42 | social media, TrueMedia | Unknown (found media) |
| SpoofCeleb (Jung et al., 2024) | >2.6M | N/A | 1,251 | En | VoxCeleb1 | 23 TTS pipelines |
3. Deepfake Generation Techniques and Diversity
Unlike "in-lab" datasets that use fixed or known TTS/VC engines, in-the-wild datasets aggregate synthetic content from multiple, often opaque, tools in current social use (Müller et al., 2022, Xie et al., 14 Aug 2025). Deepfake samples may be created with commercial, open-source, or private TTS engines (e.g., WaveNet, Tacotron2, FastSpeech2, VALL-E, Bark, XTTS v2, YourTTS, voice conversion pipelines) and, in certain cases, via user-modified or hybrid models not documented in research literature. Spoofing methods and attack sophistication therefore drift over time, tracking open-source releases or advances in commercial voice cloning.
Dataset authors typically do not synthesize additional adversarial attacks; rather, data are "found," with any manipulations imposed at creation time or through social-platform distribution (e.g., lossy codecs, channel effects, re-recording) (Müller et al., 2022, Xie et al., 14 Aug 2025). For non-speech deepfakes (environmental sounds or music), fake samples are generated by contemporary text-to-audio or audio-to-audio models (e.g., AudioLDM, AudioGen, DiffSinger, so-vits-svc) operating on real-world prompts or human recordings (Zang et al., 2023, Yin et al., 25 May 2025).
4. Evaluation Protocols and Detection Benchmarks
Detection performance is typically measured using Equal Error Rate (EER), area under the ROC curve (AUC), precision/recall, and, for some challenges, more granular tasks such as region-level localization (framewise F1) or algorithm attribution (macro F1) (Müller et al., 2022, Yi et al., 2024, Shahriar, 10 Jan 2026, Chandra et al., 4 Mar 2025, Xie et al., 14 Aug 2025). EER is formally defined as the error rate at the decision threshold $\tau^{*}$ where the false acceptance and false rejection rates coincide, $\mathrm{FAR}(\tau^{*}) = \mathrm{FRR}(\tau^{*})$; AUC is the integral of the true positive rate over the false positive rate domain, $\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR})\,\mathrm{d}\,\mathrm{FPR}$.
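A minimal sketch of both metrics, assuming the common convention that label 1 denotes fake audio and that higher detector scores indicate fake:

```python
# EER and AUC from detector scores with scikit-learn. Assumed convention:
# label 1 = fake, higher score = "more fake". The toy arrays are placeholders.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def equal_error_rate(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # operating point where FAR ~= FRR
    return (fpr[idx] + fnr[idx]) / 2.0

labels = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.05, 0.40, 0.20, 0.35, 0.80, 0.90])
print(f"EER = {equal_error_rate(labels, scores):.3f}")
print(f"AUC = {roc_auc_score(labels, scores):.3f}")
```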
Speaker-disjoint splits are used to suppress identity leakage, ensuring detectors must learn manipulation artifacts rather than speaker-specific features. For instance, in (Shahriar, 10 Jan 2026), utterances are split so that no speaker's voice appears in both training and test sets.
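One way to implement such a protocol is a grouped split keyed on speaker ID, for example with scikit-learn's GroupShuffleSplit; the utterance and speaker manifests below are hypothetical placeholders.

```python
# Speaker-disjoint partitioning via a grouped split keyed on speaker ID;
# the utterance/speaker manifests are hypothetical placeholders.
from sklearn.model_selection import GroupShuffleSplit

utterances = ["utt_001.wav", "utt_002.wav", "utt_003.wav", "utt_004.wav"]
speakers   = ["spk_A",       "spk_A",       "spk_B",       "spk_C"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(utterances, groups=speakers))

# No speaker may appear on both sides of the split.
assert not ({speakers[i] for i in train_idx} & {speakers[i] for i in test_idx})
```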
Cross-dataset and cross-domain evaluation reveal severe generalization gaps. Models achieving sub-1 % EER on controlled datasets (ASVspoof, ADD) can degrade to 30–60 % EER when deployed on in-the-wild corpora (Müller et al., 2022, Yi et al., 2023, Ciobanu et al., 31 May 2025, Ahmadiadli et al., 10 May 2025). Best-in-class results on modern benchmarks are reported for sophisticated architectures—SSL front-ends (wav2vec2/XLS-R, Whisper, BEATs), graph-neural or convolutional back-ends (AASIST, RawNet2, AFN)—often augmented by data-centric strategies (culling, targeted augmentation, loss reweighting) (Combei et al., 11 Jun 2025, Ahmadiadli et al., 10 May 2025, Shahriar, 10 Jan 2026).
| System (Trained In-Domain) | EER (In-the-Wild) | EER (Source Domain) |
|---|---|---|
| RawNet2 | ~34 % | 4 % (ASVspoof19-LA) |
| XLS-R + ASSERT | ~41 % | 21 % (ASVspoof21-DF) |
| Best ADM (freq swap) | ~26 % | <5 % (lab data) |
| XLS-R-AASIST + FSW (Xie et al., 14 Aug 2025) | 11.6–17 % | <1 % (in-domain sets) |
| AUDETER-trained XLS-R+SLS (Wang et al., 4 Sep 2025) | 4.17 % | N/A |
This gap confirms the necessity of in-the-wild benchmarks for realistic assessment.
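The SSL-front-end-plus-back-end pattern referenced above can be sketched as follows; the checkpoint name, freezing policy, and mean-pooled linear head are illustrative assumptions rather than any cited system's exact configuration.

```python
# A minimal sketch of the SSL-front-end + lightweight-back-end pattern:
# a frozen wav2vec 2.0 encoder feeding a mean-pooled linear head. The
# checkpoint and head design are illustrative, not a cited paper's recipe.
import torch
from transformers import Wav2Vec2Model

class SSLDetector(torch.nn.Module):
    def __init__(self, ssl_name: str = "facebook/wav2vec2-base-960h"):
        super().__init__()
        self.frontend = Wav2Vec2Model.from_pretrained(ssl_name)
        self.frontend.requires_grad_(False)  # keep the SSL encoder frozen
        self.head = torch.nn.Linear(self.frontend.config.hidden_size, 2)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz mono audio
        hidden = self.frontend(waveform).last_hidden_state  # (B, T, H)
        return self.head(hidden.mean(dim=1))  # bona fide vs. fake logits
```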
5. Challenges: Domain Shifts, Artifacts, and Robustness
Major technical challenges in in-the-wild audio deepfake detection derive from:
- Domain and channel mismatch: Unconstrained recording devices, codecs, and background environments substantially increase detectors' false-positive and false-negative rates relative to lab-controlled corpora (Müller et al., 2022, Wang et al., 4 Sep 2025, Chandra et al., 4 Mar 2025); a codec round-trip sketch for simulating such channels appears after this list.
- Spoofing method diversity and drift: Out-of-distribution generative models and rapid advances in “zero-shot” and neural-codec TTS yield synthetic speech with increasingly imperceptible artifacts, leading to catastrophic performance decay for in-domain-tuned models (Li et al., 2024, Ciobanu et al., 31 May 2025).
- Language and cultural coverage: Non-English content, under-represented accents, and regional dialects increase error rates and are under-sampled in earlier corpora (Ciobanu et al., 31 May 2025, Xie et al., 14 Aug 2025).
- Inherent uncertainty and labeling ambiguity: Human annotators themselves exhibit non-trivial confusion rates, especially for short, noisy, or heavily compressed clips (Chandra et al., 4 Mar 2025).
- Background contamination: Music, overlapping voices, synthetic reverberation, and transmission artifacts mask the subtle cues that distinguish bona fide from fake (Xie et al., 14 Aug 2025, Zang et al., 2023).
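As noted in the first bullet, channel effects can be simulated during training by round-tripping clean audio through a lossy codec. A minimal sketch, assuming ffmpeg is available on PATH (bitrate and codec choice are illustrative):

```python
# Codec round-trip channel simulation, assuming ffmpeg is on PATH.
# The low MP3 bitrate mimics aggressive social-platform re-encoding.
import os
import subprocess
import tempfile

def mp3_roundtrip(in_wav: str, bitrate: str = "32k") -> str:
    """Re-encode a WAV through low-bitrate MP3 and decode it back to WAV."""
    mp3_path = tempfile.NamedTemporaryFile(suffix=".mp3", delete=False).name
    out_path = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
    subprocess.run(["ffmpeg", "-y", "-i", in_wav, "-b:a", bitrate, mp3_path],
                   check=True)
    subprocess.run(["ffmpeg", "-y", "-i", mp3_path, out_path], check=True)
    os.remove(mp3_path)
    return out_path
```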
6. Design Patterns and Recommendations for Future Datasets
To address these challenges and establish robust benchmarks, the literature prescribes:
- Speaker-disjoint and cross-model partitioning: Enforce non-overlapping speaker, synthesis method, and media source pools across splits (Shahriar, 10 Jan 2026, Ciobanu et al., 31 May 2025, Wang et al., 4 Sep 2025).
- Large-scale, language- and genre-diverse coverage: Include multiple languages, dialects, and new classes (e.g., environmental sounds, singing voice, cross-lingual attacks) (Müller et al., 2024, Ciobanu et al., 31 May 2025, Yin et al., 25 May 2025).
- Rich metadata and transparent annotation: Provide detailed labels on source, generative pipeline, compression, and signal quality where feasible (Jung et al., 2024, Chandra et al., 4 Mar 2025); a hypothetical record schema is sketched after this list.
- Region-level and multi-label annotation: Annotate manipulated segments, attribution to attack class, and “partially fake” constructs to support beyond-binary detection and forensic tasks (Yi et al., 2024, Zang et al., 2023).
- Continuous update and adaptive augmentation: Dynamic inclusion of emerging attack vectors and data-centric pipeline augmentation (noise, codec, pruning) to mirror evolving threat models (Combei et al., 11 Jun 2025, Li et al., 2024).
- Open, standardized evaluation protocols: Encourage shared metrics (EER, AUC, min t-DCF, framewise F1), public codebases, and multi-site cross-dataset studies for reproducibility (Yi et al., 2023, Shahriar, 10 Jan 2026).
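As referenced in the metadata recommendation above, the following is a hypothetical per-clip record illustrating the kinds of fields such transparency implies; the schema is invented for illustration and does not follow any published dataset specification.

```python
# A hypothetical per-clip metadata record; field names are invented for
# illustration and do not follow any published dataset schema.
import json

record = {
    "clip_id": "itw_000123",
    "label": "fake",                          # "bona_fide" or "fake"
    "source_platform": "YouTube",
    "language": "en",
    "speaker_id": "spk_0042",                 # anonymized, split-disjoint
    "generator": "unknown-commercial-tts",    # attribution where known
    "codec_chain": ["aac_128k", "opus_64k"],  # observed transmission path
    "sample_rate_hz": 16000,
    "manipulated_regions_s": [[2.4, 5.1]],    # for partially fake clips
    "annotation": {"method": "manual+automated", "reviewers": 2},
}
print(json.dumps(record, indent=2))
```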
7. Impact, Benchmarking Standards, and Outlook
In-the-wild audio deepfake datasets have revealed that published detectors trained solely in laboratory conditions fail to generalize robustly, with observed EER increases of 500–1000 % relative to benchmark test sets (Müller et al., 2022, Yi et al., 2023). Modern benchmarks (e.g., AUDETER, XMAD-Bench, ADD 2023, Deepfake-Eval-2024, FSW, SpoofCeleb) now serve as the primary evaluation targets for cross-domain and open-world generalization (Wang et al., 4 Sep 2025, Ciobanu et al., 31 May 2025, Jung et al., 2024, Chandra et al., 4 Mar 2025).
Accurate generalization requires detectors to suppress identity, language, and channel biases, leverage artifact-focused representation learning, and remain resilient under shifts in generative or channel distributions (Ahmadiadli et al., 10 May 2025). Data-centric methodologies—pruning, mixing, targeted augmentation—have enabled up to 55 % reductions in EER on in-the-wild sets, offering more tangible progress than “blind scaling” of model complexity (Combei et al., 11 Jun 2025).
The field is trending toward continual-data expansion, explicit handling of multilingual and cross-class attacks, joint detection-localization-attribution protocols, and shared open benchmarks to advance the real-world applicability of audio deepfake detection systems.
Citations:
(Müller et al., 2022, Yi et al., 2023, Chandra et al., 4 Mar 2025, Ciobanu et al., 31 May 2025, Combei et al., 11 Jun 2025, Jung et al., 2024, Xie et al., 14 Aug 2025, Wang et al., 4 Sep 2025, Shahriar, 10 Jan 2026, Müller et al., 2024, Ahmadiadli et al., 10 May 2025, Zang et al., 2023, Yi et al., 2024, Yin et al., 25 May 2025, Li et al., 2024)