DeepFake-Eval-2024 Benchmark
- DeepFake-Eval-2024 is a benchmark that aggregates real-world audio, video, and image data to evaluate the robustness of deepfake detection systems.
- The evaluation reveals a marked AUC drop (45–50%) in state-of-the-art models when confronted with contemporary forgeries compared to legacy datasets.
- Finetuned commercial and open-source systems show improved performance yet still lag behind human forensic analysts, emphasizing the need for adaptive detection strategies.
DeepFake-Eval-2024 is a large-scale, multi-modal benchmark that provides an in-the-wild evaluation of current deepfake detection systems using real-world media collected in 2024. It addresses critical gaps in the reliability and generalizability of existing models, highlighting the profound performance drop that occurs when academic detectors are confronted with contemporary forgeries generated and circulated across diverse platforms, languages, and manipulation pipelines (Chandra et al., 4 Mar 2025).
1. Dataset Construction and Modalities
DeepFake-Eval-2024 comprises authentic and manipulated data sourced directly from social media and a dedicated detection platform (TrueMedia.org) during 2024, making it one of the most realistic deepfake detection benchmarks to date. The corpus covers:
- Video: 44 hours, encompassing the latest manipulation methodologies that reflect current threat vectors in video-based misinformation.
- Audio: 56.5 hours, including both overt and subtle audio deepfakes spanning a wide linguistic range.
- Images: 1,975 unique instances, targeting image-based facial manipulations and synthetic persona creation.
This dataset draws on content from 88 distinct websites and features media in 52 languages, with English comprising 78.7% of the examples. The explicit inclusion of multi-lingual, multi-platform samples mirrors the heterogeneity and operational context encountered by practical detection systems, distinguishing it from earlier curated academic sets.
2. Performance Degradation of State-of-the-Art Models
Evaluations undertaken using DeepFake-Eval-2024 reveal a marked decline in the effectiveness of standard open-source detection systems when compared to results achieved on legacy academic datasets:
| Modality | Δ AUC (Academic → DeepFake-Eval-2024) | Typical Academic AUC | Typical Eval-2024 AUC |
|---|---|---|---|
| Video | –50% | ≈0.96 | ≈0.63 |
| Audio | –48% | ≈0.95 | ≈0.47 |
| Image | –45% | ≈0.92 | ≈0.51 |
This pronounced AUC decline demonstrates that models trained on controlled datasets (e.g., FaceForensics++, ASVspoof, Celeb-DF) generalize poorly to the diversity, quality, and manipulation spectra seen in the wild (Chandra et al., 4 Mar 2025). Even high-performing academic models, when transferred directly, can suffer up to a 50% reduction in discriminative power.
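To make the reported metric concrete: AUC is the probability that a detector scores a randomly chosen fake above a randomly chosen real example, so a drop toward 0.5 means the detector's scores for real and fake media overlap almost completely. A minimal, pure-Python sketch (with toy scores, not values from the benchmark) illustrates the effect:

```python
def auc(scores_real, scores_fake):
    """Rank-based AUC estimate: the probability that a randomly chosen
    fake scores higher than a randomly chosen real example (ties 0.5)."""
    wins = 0.0
    for f in scores_fake:
        for r in scores_real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(scores_real) * len(scores_fake))

# Toy illustration: a detector that cleanly separates legacy-style fakes
# but barely ranks in-the-wild fakes above real media.
real         = [0.1, 0.2, 0.3, 0.4]
legacy_fakes = [0.7, 0.8, 0.9, 0.95]    # clearly separated
wild_fakes   = [0.15, 0.35, 0.5, 0.25]  # heavy overlap with real scores

print(auc(real, legacy_fakes))  # 1.0
print(auc(real, wild_fakes))    # 0.625
```

The same detector drops from perfect separation to near-chance ranking purely because the score distributions overlap, which is exactly the failure mode the benchmark exposes at scale.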
3. Comparative Analysis: Commercial, Finetuned, and Open-Source Systems
Commercial deepfake detectors and open-source models finetuned on DeepFake-Eval-2024 achieve higher accuracy than standard, off-the-shelf academic models. For example:
- Best commercial video detector: ~78% accuracy, AUC ≈0.79
- Best audio system: ~89% accuracy, AUC ≈0.93
- Best image system: ~82% accuracy, AUC ≈0.90
Despite this improvement, the top automated systems still do not reach the ~90% accuracy estimated for human deepfake forensic analysts. The persistent gap after targeted finetuning signals fundamental limitations in current data-driven approaches and underscores the challenge of operational, real-world deepfake detection (Chandra et al., 4 Mar 2025).
4. Domain Shift, Error Analysis, and Model Failures
The severe performance degradation observed is attributed to substantial domain shift—i.e., the statistical distribution of features (artifacts, noise, synthesis signatures) in DeepFake-Eval-2024 diverges sharply from those in legacy benchmarks. This domain shift is driven by:
- Advanced Generative Methods: Adoption of diffusion models and new face/voice manipulation pipelines not represented in older data.
- Diverse Media Channels: Selective, partial, and non-facial manipulations; background music and multi-language content.
- Real-World Perturbations: Compression, re-encoding, silence, and environmental noise.
Error analysis from (Chandra et al., 4 Mar 2025) shows that detectors particularly struggle with artifacts from recent diffusion-based synthesis, non-facial or selective manipulations in videos, and audio deepfakes that exploit underrepresented languages or signal characteristics (e.g., music, silence padding).
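The domain-shift framing above can be quantified with standard two-sample tests over detector features. As a hedged sketch (the feature and the numbers are illustrative, not drawn from the benchmark), the two-sample Kolmogorov-Smirnov statistic measures how far apart two empirical feature distributions lie:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples. Values near 0 mean similar
    distributions; values near 1 mean a severe shift."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample with values <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a + b)))

# Hypothetical feature values (e.g., a compression-artifact score);
# the feature itself is an assumption for illustration.
academic_2023 = [0.09, 0.10, 0.11, 0.12, 0.13]  # legacy-benchmark media
academic_2024 = [0.10, 0.11, 0.12, 0.13, 0.14]  # same domain, new draw
in_the_wild   = [0.30, 0.35, 0.40, 0.45, 0.50]  # DeepFake-Eval-style media

print(ks_statistic(academic_2023, academic_2024))  # 0.2 (mild)
print(ks_statistic(academic_2023, in_the_wild))    # 1.0 (severe shift)
```

A detector calibrated on the first distribution has no basis for thresholding scores drawn from the third, which is the mechanism behind the AUC collapse reported in Section 2.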
5. Benchmarking Protocol, Access, and Ethical Use
DeepFake-Eval-2024 is released publicly and can be accessed at https://github.com/nuriachandra/Deepfake-Eval-2024. The dataset is provided with detailed documentation covering the data collection, filtering, and annotation process. Usage guidelines address the ethical sensitivities associated with real-world, user-submitted data and recommend cautious deployment and analysis, particularly for studies involving personally identifiable or sensitive content.
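Because the benchmark spans three modalities, a typical evaluation pipeline first buckets samples by modality so each detector only sees its own media type. The sketch below assumes a hypothetical CSV metadata layout (the column names `file`, `modality`, `label`, and `language` are illustrative; consult the repository's documentation for the actual schema):

```python
import csv
import io

# Hypothetical metadata snippet; real column names and files may differ.
SAMPLE = """\
file,modality,label,language
clip_001.mp4,video,fake,en
clip_002.mp4,video,real,es
voice_003.wav,audio,fake,en
img_004.jpg,image,real,fr
"""

def group_by_modality(csv_text):
    """Bucket metadata rows by modality so each detector is evaluated
    only on its own media type."""
    groups = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups.setdefault(row["modality"], []).append(row)
    return groups

groups = group_by_modality(SAMPLE)
print({m: len(rows) for m, rows in groups.items()})
# {'video': 2, 'audio': 1, 'image': 1}
```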
6. Implications for Future Deepfake Detection Research
The introduction of DeepFake-Eval-2024 fundamentally challenges the community’s reliance on legacy datasets for both model design and claims of generalizability. The documented AUC drops underscore:
- The critical need for continually refreshed, representative data to track the evolving landscape of generative models and manipulations.
- The limitations of classical training regimes and the increasing necessity of hybrid, multimodal, and robust cross-domain learning strategies.
- The requirement for error analysis tools and adaptation mechanisms that can rapidly cope with unseen distributions and emerging attack vectors.
This benchmark thus serves not only as an evaluation standard but also as a clear call for innovation in model architectures and continuous-learning protocols to bridge the deployment gap between academic research and effective, operational deepfake detection.
7. Summary Table: Key Features and Results
| Feature | DeepFake-Eval-2024 Value |
|---|---|
| Video | 44 hours |
| Audio | 56.5 hours |
| Images | 1,975 items |
| Coverage | 88 websites, 52 languages |
| English Proportion | 78.7% |
| Commercial Model Acc. | 78–89% (varies by modality) |
| Top Open-Source AUC Drop | 45–50% |
| Public Access | https://github.com/nuriachandra/Deepfake-Eval-2024 |
| Notable Gap | Automated systems lag behind forensic analysts |
In summary, DeepFake-Eval-2024 sets a contemporary standard for in-the-wild deepfake detection evaluation, exposing the inadequacy of previous academic models and benchmarks, and providing an urgent directive for the development of techniques robust to the complex, multimodal, and evolving threat landscape of generative manipulation in real-world media (Chandra et al., 4 Mar 2025).