
Celeb-DF Dataset Overview

Updated 8 December 2025
  • Celeb-DF is a high-fidelity, large-scale video dataset featuring real and synthetically manipulated celebrity videos for benchmarking deepfake detection.
  • The dataset includes thousands of videos of 59 celebrities, generated using state-of-the-art face-swap, reenactment, and talking-face pipelines with flexible evaluation protocols.
  • Celeb-DF++ extends the resource with higher resolution, diverse forgery methods, and rigorous performance benchmarks that challenge current deepfake detection algorithms.

Celeb-DF is a large-scale, high-fidelity video dataset designed for deepfake forensics, providing a comprehensive and realistic benchmark for the development and evaluation of video forgery detection methods. It features thousands of real and synthetically manipulated videos of 59 celebrity identities, generated using state-of-the-art face-swap, face-reenactment, and talking-face pipelines. Through rigorous synthesis methodologies and challenging evaluation protocols, Celeb-DF and its expanded successor Celeb-DF++ set a demanding standard for assessing generalizability in deepfake detection: their visual realism, forgery diversity, and suppression of low-level artifacts defeat detectors that rely on such cues.

1. Dataset Composition and Scope

Celeb-DF comprises 590 real interview videos (∼225.4K frames) and 5,639 deepfake videos (∼2.12M frames) featuring 59 celebrities, with balanced gender and a broad age/ethnicity distribution. Each deepfake is generated by state-of-the-art face manipulation pipelines, including identity-swap and expression-swap techniques. The real videos are sourced from YouTube interviews, standardized to a frame rate of 30 fps, with resolutions of at least 720×480. The average length is approximately 13 seconds per video. Celeb-DF provides no fixed training or evaluation splits, supporting flexible protocols such as cross-subject and cross-dataset evaluation (Li et al., 2019).
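Because no official splits ship with the dataset, a cross-subject protocol must be constructed by the user. A minimal sketch follows; the mapping from video names to identity IDs is an assumption for illustration, not the official Celeb-DF file layout, and videos mixing held-out and training identities are simply dropped to avoid identity leakage.

```python
import random

def cross_subject_split(video_ids, n_identities=59, test_frac=0.2, seed=0):
    """Partition videos by subject so no identity appears in both splits.

    `video_ids` maps a video name to the list of identities it involves
    (two for a face-swap) -- this mapping is assumed, not official.
    """
    rng = random.Random(seed)
    identities = list(range(n_identities))
    rng.shuffle(identities)
    n_test = int(n_identities * test_frac)
    test_ids = set(identities[:n_test])

    train, test = [], []
    for video, ids in video_ids.items():
        ids = set(ids)
        if ids <= test_ids:                # all subjects held out -> test
            test.append(video)
        elif ids.isdisjoint(test_ids):     # no held-out subject -> train
            train.append(video)
        # videos mixing train/test identities are dropped (leakage)
    return train, test
```

Cross-dataset evaluation, by contrast, needs no split at all: the entire dataset serves as a held-out test set for a detector trained elsewhere.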

Celeb-DF++ extends the original dataset by nearly an order of magnitude, encompassing 53,196 fake clips (∼15M frames) at 512×512 resolution. This expansion introduces three forgery scenarios: Face-swap (FS, 8 methods), Face-reenactment (FR, 7 methods), and Talking-face (TF, 7 methods), for a total of 22 contemporary manipulation methods, with TF now representing about 50% of generated forgeries. The source pool and real data curation remain consistent, but the forgery diversity and data scale are substantially increased for evaluations targeting generalizability (Li et al., 24 Jul 2025).

2. Synthesis Pipeline and Technical Advancements

The Celeb-DF autoencoder-based pipeline enhances prior face swap procedures by introducing:

  • Higher-resolution synthesis: Increasing swapped face patches to 256×256 or 512×512, supported by deeper, more expressive encoder-decoder architectures.
  • Color transfer robustness: Synthetic faces undergo random perturbations and then color transfer (per Reinhard et al., 2001) to match the color distribution of the target frame.
  • Advanced mask generation: Mask boundaries are fitted via spline interpolation over key facial landmarks and softened (by Gaussian blur), minimizing visible compositing seams.
  • Temporal landmark smoothing: Kalman smoothing of facial landmarks across frames ensures temporal coherence and suppresses flicker (Li et al., 2019).
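The color-transfer step above reduces the channel-statistics mismatch that first-generation detectors exploited. A minimal sketch of Reinhard-style transfer is below; the published pipeline operates in Lab color space (e.g. via OpenCV's `cvtColor`), whereas this simplification matches per-channel mean and standard deviation directly on the input channels.

```python
import numpy as np

def reinhard_transfer(source, target):
    """Per-channel mean/std matching in the spirit of Reinhard et al. (2001).

    Simplification: statistics are matched on raw channels; the actual
    Celeb-DF pipeline converts to Lab space first for perceptual fidelity.
    """
    src = np.asarray(source, dtype=np.float64)
    tgt = np.asarray(target, dtype=np.float64)
    out = np.empty_like(src)
    for c in range(src.shape[-1]):
        s_mu, s_sd = src[..., c].mean(), src[..., c].std() + 1e-8
        t_mu, t_sd = tgt[..., c].mean(), tgt[..., c].std()
        # shift/scale the synthetic face's statistics onto the target frame's
        out[..., c] = (src[..., c] - s_mu) * (t_sd / s_sd) + t_mu
    return np.clip(out, 0, 255).astype(np.uint8)
```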

Celeb-DF++ incorporates a broad range of generation pipelines. For FS: classical autoencoders, GAN-based identity-preserving encoders (SimSwap, UniFace, BlendFace), 3D-prior methods, and distillation-enhanced networks. For FR: deep 3D reconstruction, thin-plate-spline motion models (TPSMM), hypernetworks, and memory-augmented models. For TF: 3D model-based and disentangled audio-driven architectures. Each forgery tool targets different facial regions and synthesis artifacts, thus modeling a representative selection of methods active in the deepfake landscape as of 2025 (Li et al., 24 Jul 2025).

3. Visual Quality and Forensic Challenges

Celeb-DF videos achieve the highest Mask-SSIM (0.92) among contemporary datasets such as UADFV, FF-DF, DFD, and DFDC—reflecting superior visual fidelity and reduced detection-relevant artifacts. Advanced blending and color alignment, together with high spatial resolution and stabilized temporal landmarks, remove obvious cues exploited by first-generation detectors (e.g., splicing borders, color mismatches, or jitter) (Li et al., 2019).
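Mask-SSIM scores structural similarity only over the head region, so it reflects swap fidelity rather than background quality. A simplified global variant is sketched below; published Mask-SSIM numbers use windowed SSIM (as in scikit-image's `structural_similarity`) restricted to the head mask, so treat this as illustrative.

```python
import numpy as np

def mask_ssim(real, fake, mask, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Global SSIM restricted to a boolean head-region mask.

    Simplified stand-in for Mask-SSIM: one global statistic instead of a
    sliding Gaussian window, computed only where `mask` is True.
    """
    x = np.asarray(real, dtype=np.float64)[mask]
    y = np.asarray(fake, dtype=np.float64)[mask]
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Identical frames score 1.0; degradation inside the mask lowers the score regardless of how clean the background is.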

Failure analyses demonstrate that such realism renders frame-based algorithms ineffective when trained on artifact-prone benchmarks. Subtle interpolation artifacts, if present, are further obscured at high resolution, and behavioral or physiological inconsistencies are often absent in professionally filmed interview scenarios.

4. Evaluation Protocols and Detection Benchmarks

Celeb-DF and Celeb-DF++ introduce rigorous protocols to assess true cross-method and cross-domain generalization:

  • Intra-dataset transfer: Detectors trained on FaceForensics++ (HQ) and tested on Celeb-DF or Celeb-DF++ show a mean frame-AUC drop from 74.8% (CDF) to 69.6% (CDF++), with similar trends at the video level.
  • Generalized Forgery evaluation (GF-eval): Training on Celeb-DF FS, testing on all new methods in CDF++, measuring intra-dataset, cross-method transfer (mean frame-AUC 71.7%).
  • Generalized Forgery across Quality (GFQ-eval): As GF-eval, but testing with H.264 compression at CRF 35/45; performance degrades to 68.2%/63.8% frame-AUC.
  • Generalized Forgery across Datasets (GFD-eval): Training on FaceForensics++ (HQ), testing on CDF++; average frame-AUC 69.4%. Notably, robust results are limited to FS forgeries (AUC >85%), while FR and TF forgeries yield degraded detection rates (AUC 50–70%) (Li et al., 24 Jul 2025).
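The frame-level and video-level AUC numbers above can be related by a simple aggregation: average each video's per-frame fakeness scores, then compute AUC over videos. The sketch below uses a rank-based (Mann-Whitney) AUC; mean-score aggregation is one common convention, and the benchmark papers may aggregate differently.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC: P(fake score > real score)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = scores.argsort()
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # average ranks over ties
        ties = scores == s
        ranks[ties] = ranks[ties].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def video_auc(frame_scores, frame_labels, video_of_frame):
    """Video-level AUC from frame scores, aggregating by per-video mean."""
    vids = sorted(set(video_of_frame))
    v_scores = [np.mean([s for s, v in zip(frame_scores, video_of_frame) if v == vid])
                for vid in vids]
    v_labels = [next(l for l, v in zip(frame_labels, video_of_frame) if v == vid)
                for vid in vids]
    return auc(v_scores, v_labels)
```

Averaging over frames smooths per-frame noise, which is one reason video-AUC typically exceeds frame-AUC in the table below.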

A summary of core protocol results:

Protocol                 Frame-AUC (%)   Video-AUC (%)
Intra (FF++→CDF v2)      74.8            80.9
Intra (FF++→CDF++)       69.6            73.8
GF-eval                  71.7            72.1
GFQ-eval (c35)           68.2            69.9
GFQ-eval (c45)           63.8            69.9
GFD-eval                 69.4            73.7

These results make explicit the brittleness of current detectors to both method variance and domain shift, highlighting the need for new universal forensic representations (see Section 6).

5. Application to Face Recognition and Deepfake Detection Research

Celeb-DF has become the benchmark of choice for evaluating the generalizability of deepfake detection algorithms, notably demonstrating that performance diminishes severely in cross-dataset or unseen forgery scenarios.

  • Face recognition–based detection: Deep face recognition models (ResNet-50 backbone with CosFace loss, pretrained on MS1M-ArcFace or similar) that frame detection as identity verification achieve an AUC of 0.98 and an EER as low as 7.1%, outperforming two-class CNNs and ocular-modal detectors (AUC 0.806, EER 26.7%). These findings show that facial embeddings robustly encode identity-related perturbations from swapping/blending, even without exposure to synthetic data during training (Ramachandran et al., 2021).
  • Challenging cases: Expression-swapping manipulations (Face2Face, NeuralTextures) do not disturb core identity cues, yielding high error rates (EER of 30–48%).
  • Compression and pre/post-processing: Detectors trained or evaluated on recompressed videos (H.264, different QPs) show variable robustness; Xception-c23/40 trained on compressed data are less affected under similar conditions (Li et al., 2019).
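The EER figures above come from the standard biometric-verification framing: given similarity scores for genuine (same-identity) and impostor (manipulated) pairs, EER is the operating point where false-accept and false-reject rates coincide. A minimal sketch, assuming higher scores mean greater similarity:

```python
import numpy as np

def eer(genuine, impostor):
    """Equal Error Rate: threshold sweep to where FAR == FRR.

    `genuine` are similarity scores for same-identity pairs, `impostor`
    for pairs involving a manipulated face (higher = more similar).
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    best_gap, best_eer = float("inf"), None
    for t in np.unique(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)   # fakes accepted as claimed identity
        frr = np.mean(genuine < t)     # genuine videos rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```

Well-separated score distributions (as with FS forgeries against strong face embeddings) drive the EER toward zero; heavily overlapping ones (as with expression-swap forgeries that preserve identity) push it toward chance.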

6. Limitations, Open Challenges, and Future Directions

Despite setting new benchmarks for realism and diversity, the current generation of detectors remains sensitive to forgery-specific artifacts, with limited ability to generalize across manipulation types (e.g., FS→FR/TF) and across datasets. Compression further erases subtle spatial–frequency features, with a 4–8% AUC decrease common under severe recompression (Li et al., 24 Jul 2025). Cross-lab variations in protocols, backgrounds, and resolutions also present persistent obstacles.

Future work will require:

  • Unified feature representations invariant across forgery scenarios, domains, and compression levels (e.g., leveraging physiological clues or temporal coherence).
  • Self-supervised augmentation to simulate greater diversity in fake generation at training.
  • Robust adaptation modules (e.g., lightweight forensics adapters) able to address unseen manipulation methods beyond blending boundaries.
  • Multi-modal detection harnessing consistency between audio, lip motion, and speaker identity, especially for TF and FR forgeries.
  • Dataset enrichment with new modalities (neural rendering, zero-shot speech-driven duplicates), adversarial anti-forensics, and complex, multi-subject scenes.

Celeb-DF and Celeb-DF++ provide a scalable testbed for these ongoing research directions and are fundamental resources for the rigorous assessment of generalizable video deepfake detection (Li et al., 24 Jul 2025, Li et al., 2019).
