Celeb-DF: High-Fidelity DeepFake Dataset
- Celeb-DF is a large-scale, high-fidelity DeepFake video dataset designed for rigorous evaluation of detection models in facial forensics.
- It employs advanced synthesis techniques such as color-consistency correction and temporal smoothing to minimize visible artifacts.
- The dataset facilitates cross-dataset evaluation and generalization research by providing realistic, diverse facial forgeries.
Celeb-DF is a large-scale, high-fidelity DeepFake video dataset purpose-built for evaluating and advancing generalizable DeepFake detection in facial video forensics. By addressing limitations of earlier benchmarks (notably in realism and diversity of manipulations), Celeb-DF has become a central resource for both model development and rigorous cross-dataset validation. It encompasses thousands of authentic and forged videos of celebrities, synthesizing forgeries with improved pipelines that markedly reduce artifact visibility, thereby imposing a substantial challenge for state-of-the-art detectors and enabling research into robustness, detection generalization, and artifact-driven methodologies (Li et al., 2019). This entry reviews Celeb-DF’s dataset design, artifact reduction, evaluation protocols, empirical findings, detection pipelines, and its extension (Celeb-DF++) for cross-manipulation and domain generalization.
1. Dataset Construction and Statistical Profile
Celeb-DF is constructed from 590 real interview videos of 59 celebrities sampled from YouTube, averaging 13 s per clip at 30 fps. These subjects offer diversity across gender (56.8% male, 43.2% female), age (<30 yrs: 6.4%, 30s: 28%, 40s: 26.6%, 50–60: 30.5%, >60: 8.5%), and ethnicity (88.1% Caucasian, 6.8% African-American, 5.1% Asian), ensuring variation in facial characteristics, scene backgrounds, and illumination conditions (Li et al., 2019).
DeepFake videos in Celeb-DF are generated using an enhanced autoencoder-based face-swap pipeline, building on open-source tools (FakeApp/DFaker) and incorporating:
- High-resolution synthesis: Outputs at 256×256, achieved by deepening the shared encoder E and per-identity decoders D_i, trained to minimize the reconstruction error Σ_i ‖D_i(E(x_i)) − x_i‖ over all identities i.
- Color-consistency: Random color perturbations during training, followed by Lab-space color transfer (Reinhard’s method) post-synthesis.
- Context-aware masking: Facial-landmark buffering and smooth alpha interpolation along cheeks/chin for seamless blending.
- Temporal smoothing: Kalman-filtered 2D facial landmark tracks suppress frame-wise flicker before face reintegration.
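The color-consistency step follows Reinhard-style statistics matching in Lab space. Below is a minimal numpy sketch; `reinhard_color_transfer` is a hypothetical helper, and the RGB↔Lab conversion (e.g., via OpenCV) is assumed to happen outside it.

```python
import numpy as np

def reinhard_color_transfer(source_lab, target_lab):
    """Match per-channel mean and std of the source to the target frame
    (Reinhard et al. statistics transfer).

    Both inputs are float arrays of shape (H, W, 3) already in Lab space.
    """
    src = source_lab.astype(np.float64)
    tgt = target_lab.astype(np.float64)
    out = np.empty_like(src)
    for c in range(3):
        s_mean, s_std = src[..., c].mean(), src[..., c].std()
        t_mean, t_std = tgt[..., c].mean(), tgt[..., c].std()
        # Shift and scale each channel so its statistics match the target.
        out[..., c] = (src[..., c] - s_mean) / (s_std + 1e-8) * t_std + t_mean
    return out
```

Applied after synthesis, this suppresses the color mismatch along the swapped-face boundary that earlier datasets exhibit.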
The dataset comprises 5,639 DeepFake videos, with each video reflecting tightly-controlled synthesis parameters to reduce detection bias. Overall, Celeb-DF provides approximately 85 hours of video, packaged in MPEG-4, with a cumulative 2.3 million key frames (Li et al., 2019).
2. Artifact Suppression and Dataset Difficulty
Celeb-DF’s synthesis pipeline emphasizes artifact suppression to ensure forgeries resemble authentic footage distributed online. Quantitative assessments using Mask-SSIM (image similarity over facial regions) reveal Celeb-DF achieves 0.92, outperforming predecessors (UADFV: 0.82, DF-TIMIT: 0.80, FF-DF: 0.81, DFDC: 0.84, DFD: 0.88), indicating markedly fewer low-level artifacts (blurring, color mismatch, boundaries).
Qualitatively, Celeb-DF fakes:
- Eliminate obvious splicing lines and checkerboard patterns.
- Match color distributions seamlessly across the synthetic boundary.
- Exhibit stable temporal coherence, hiding common flicker.
- Remove the “low-hanging fruit” of crude artifacts that shallow detectors exploit.
This escalation in visual fidelity ensures models must detect subtle inconsistencies, e.g., micro-expression anomalies or generative process fingerprints, rather than artifact signals visible in first-generation datasets (Li et al., 2019).
3. Evaluation Protocols and Metrics
Detection models are evaluated both intra-dataset (train/test on Celeb-DF) and cross-dataset (train elsewhere, test on Celeb-DF). Sampling is performed at the frame (or video) level using only key frames to avoid re-compression artifacts.
Metrics:
- True Positive Rate (TPR): TP / (TP + FN)
- False Positive Rate (FPR): FP / (FP + TN)
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
- Area Under the ROC Curve (AUC): the area under the curve traced by (FPR, TPR) pairs as the decision threshold is swept.
Frame-level AUCs are reported at high statistical precision for benchmarking (Li et al., 2019).
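These metrics can be computed directly from frame-level labels and detector scores. A self-contained numpy sketch follows, with AUC obtained via the equivalent Mann-Whitney rank statistic rather than explicit ROC integration:

```python
import numpy as np

def detection_metrics(labels, scores, threshold=0.5):
    """Frame-level TPR, FPR, accuracy, and ROC AUC for binary real/fake labels
    (1 = fake). `scores` are the detector's per-frame fake probabilities."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    preds = scores >= threshold
    tp = np.sum(preds & (labels == 1))
    fn = np.sum(~preds & (labels == 1))
    fp = np.sum(preds & (labels == 0))
    tn = np.sum(~preds & (labels == 0))
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    acc = (tp + tn) / labels.size
    # AUC via the Mann-Whitney U statistic: the probability that a random
    # fake frame scores higher than a random real one (ties count half).
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    auc = (greater + 0.5 * ties) / (len(pos) * len(neg))
    return tpr, fpr, acc, auc
```

The pairwise AUC computation is O(|pos|·|neg|); for large frame counts a sort-based rank implementation (as in scikit-learn's `roc_auc_score`) is preferable.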
4. Empirical Performance and Comparative Results
Celeb-DF significantly raises the difficulty for detectors. Averaged over nine top detectors, mean frame-level AUCs per dataset are:
- UADFV: ~85%
- DF-TIMIT-LQ/HQ: ~88% / ~78%
- FF-DF: ~86%
- DFD: ~64%
- DFDC: ~64%
- Celeb-DF: ~58%
When reporting method-specific frame-level AUCs:
- Xception-c23: 65.3%
- Xception-c40: 65.5%
- DSP-FWA: 64.6%
- FWA: 56.9%
- Meso4: ~54.8%
- Two-stream Inception: 53.8%
Detectors trained/validated on compressed data maintain higher resilience to re-compression (e.g., Xception-c23/c40), while warping-artifact-based detectors degrade under heavy H.264 compression (Li et al., 2019).
Recent work attains state-of-the-art results by leveraging advanced architectures. For instance, EfficientNet-B4 trained with MediaPipe-extracted 224×224 crops from Celeb-DF v2 achieves:
- Accuracy: 95.52%
- Precision: 99.99%
- Recall: 91.61%
- F₁ Score: 95.62%
Frame-wise confusion-matrix analysis reveals that only about 8% of DeepFake frames are missed, while virtually no false alarms occur on genuine frames (Lacerda et al., 2022).
Lightweight methods such as DefakeHop (successive subspace learning with a channel-wise Saab transform) highlight architectural innovation: DefakeHop achieves video-level AUCs of 94.95% on Celeb-DF v1 and 90.56% on Celeb-DF v2 with far smaller model capacity (about 43k parameters) than standard CNNs (Chen et al., 2021).
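At its core, the Saab transform is PCA augmented with a constant DC kernel. The following is a schematic, simplified numpy sketch of one-stage kernel computation (the actual DefakeHop pipeline adds bias terms and applies this channel-wise inside cascaded PixelHop++ units; `saab_kernels` is a hypothetical helper name):

```python
import numpy as np

def saab_kernels(patches, num_ac):
    """Compute one-stage Saab transform kernels: a constant DC kernel plus
    the top principal components of the mean-removed patches (AC kernels)."""
    patches = patches.reshape(len(patches), -1).astype(np.float64)
    d = patches.shape[1]
    dc = np.full(d, 1.0 / np.sqrt(d))                 # constant DC kernel
    # Remove each patch's DC component, then center across the dataset.
    ac_input = patches - (patches @ dc)[:, None] * dc
    ac_input -= ac_input.mean(axis=0)
    cov = ac_input.T @ ac_input / len(ac_input)
    vals, vecs = np.linalg.eigh(cov)
    # Keep the eigenvectors with the largest variance as AC kernels.
    ac = vecs[:, np.argsort(vals)[::-1][:num_ac]].T
    return np.vstack([dc, ac])                        # (num_ac + 1, d)
```

Because the kernels come from a closed-form eigendecomposition rather than backpropagation, training is fast and the parameter count stays tiny, which is the source of DefakeHop's efficiency.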
For cross-dataset generalization, PhaseForensics applies phase-based motion descriptors with separable 3D convolutions and achieves 91.2% video-level AUC on Celeb-DF v2 when trained exclusively on FF++—the highest observed for cross-dataset setups (Prashnani et al., 2022).
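Video-level AUCs such as those above are commonly obtained by pooling frame-level predictions into one score per video before ranking. A minimal sketch using mean pooling (max or top-k pooling are common alternatives; the pairing of scores to video IDs is assumed to come from the preprocessing stage):

```python
from collections import defaultdict

def video_scores(frame_scores):
    """Average per-frame fake probabilities into one score per video.

    `frame_scores` is an iterable of (video_id, score) pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for vid, score in frame_scores:
        sums[vid] += score
        counts[vid] += 1
    return {vid: sums[vid] / counts[vid] for vid in sums}
```

The pooled scores can then be fed to the same AUC computation used at the frame level, which is why frame- and video-level numbers are reported side by side.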
5. Detection Pipelines: Architectures and Training
Detection workflows typically include:
- Preprocessing: Face/landmark detection (MediaPipe, OpenFace2, RetinaFace+FAN), tight cropping, normalization, resizing (128×128, 224×224).
- Feature Extraction: CNN-based architectures (EfficientNet-B4, ResNet/TCN hybrids), SSL/PixelHop++ pipelines (DefakeHop), artifact or frequency sub-band extraction (PhaseForensics).
- Dimensionality Reduction and Classification: Soft classifiers (XGBoost; DefakeHop), ensemble temporality (multi-frame aggregation).
- Training Details:
- Binary cross-entropy, Adam optimizer.
- Moderate batch sizes, 80 epochs, weight decay regularization.
- Evaluation by held-out test splits, 80/20 validation at the frame level.
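In its simplest form, the recipe above amounts to minimizing binary cross-entropy with Adam and weight decay. A toy numpy sketch on a linear classifier standing in for the CNN backbone (all data shapes and hyperparameters here are illustrative, not from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                        # stand-in for CNN features
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)

w = np.zeros(8)
m, v = np.zeros(8), np.zeros(8)
lr, b1, b2, eps, wd = 1e-2, 0.9, 0.999, 1e-8, 1e-4   # Adam + weight decay

def bce(p, y):
    """Binary cross-entropy, clipped for numerical stability."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

losses = []
for t in range(1, 301):
    p = 1.0 / (1.0 + np.exp(-X @ w))                 # sigmoid probabilities
    losses.append(bce(p, y))
    g = X.T @ (p - y) / len(y) + wd * w              # BCE gradient + decay
    m = b1 * m + (1 - b1) * g                        # first-moment estimate
    v = b2 * v + (1 - b2) * g * g                    # second-moment estimate
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
```

In practice the linear map is replaced by the full backbone and the update is delegated to a framework optimizer, but the loss and update rule are the same.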
Distinct approaches focus on discriminative cues ranging from GAN fingerprints and artifact patterns to physiological semantics (eye and lip movement) and temporal-coherence violations. PhaseForensics employs complex steerable pyramid-based phase analysis for robust motion characterization, achieving improved resilience to spatial-altering distortions and adversarial attacks (Prashnani et al., 2022).
6. Extensions: Celeb-DF++ and Generalizability Benchmarks
Celeb-DF++ expands Celeb-DF to over 53,000 DeepFakes and 590 real videos, engineered to systematically cover Face-Swap (FS, 8 methods), Face-Reenactment (FR, 7 methods), and Talking-Face (TF, 7 methods), representing three major forgery scenarios with a total of 22 recent synthesis pipelines (Li et al., 2025).
Data splits and diversity:
| Scenario | # Methods | # Fake Videos |
|---|---|---|
| Face-Swap | 8 | ≈20,000 |
| Face-Reenactment | 7 | ≈14,000 |
| Talking-Face | 7 | ≈20,000 |
| Total | 22 | 53,196 |
Generalizability of detectors is scrutinized via:
- GF-eval: Train on Celeb-DF face-swap, test on all other manipulation types.
- GFQ-eval: As above, with additional compression-induced degradation (H.264 c35/c45).
- GFD-eval: Train on FF++, test on all Celeb-DF++ forgeries.
Observed average AUCs:
| Protocol | Frame-level AUC (%) | Video-level AUC (%) |
|---|---|---|
| GF-eval | 71.7 | 72.1 |
| GFQ c35 | 68.2 | 69.9 |
| GFQ c45 | 63.8 | 69.9 |
| GFD-eval | 69.4 | 73.7 |
Current models, even top single-model entries (Effort, ICML 2025: 83.0%), experience marked performance declines on audio-driven (TF) and full-motion (FR) forgeries, as well as under compression. This reveals the limits of single-modality, artifact-reliant, or compression-sensitive architectures and motivates multi-modal fusion and meta-learning adaptation (Li et al., 2025).
7. Impact, Open Challenges, and Future Research Directions
Celeb-DF has established itself as the definitive “second-generation” DeepFake benchmark, shifting the community towards artifact-agnostic, multi-modal, and temporally-aware detection strategies. It has exposed overfitting in models trained on simpler datasets, especially the inability to generalize under diverse manipulations or real-world post-processing. Recommendations include:
- Always testing cross-dataset, especially on Celeb-DF, to assess robustness.
- Adopting multi-frame or audio-visual fusion paradigms for fine-grained physiological artifact detection.
- Augmenting training with compression, anti-forensic noise, and codec variation.
- Integrating spatio-temporal features with artifact-driven and universal representations (Li et al., 2019; Li et al., 2025).
Celeb-DF’s trajectory points toward more subjects, longer clips, and synthetic manipulations engineered to preempt detector exploitation (e.g., adversarial noise, frequency blending). Evolutionary extensions such as Celeb-DF++ have already redefined generalizability standards, catalyzing research into forensics resilient to unseen manipulation types and domain shifts (Li et al., 2025).