AV-Deepfake1M: Deepfake Benchmark Dataset
- AV-Deepfake1M is a large-scale benchmark dataset for audio-visual deepfake detection featuring over 1 million clips with detailed, frame-level annotations across four manipulation types.
- It employs advanced synthetic pipelines—including TTS, face-generation, and voice conversion—to simulate realistic deepfake forgeries and ensure balanced representation across audio, video, and joint attacks.
- The dataset supports rigorous evaluation of detection and temporal localization with precise metrics (AUC, AP) and advanced augmentation protocols, facilitating state-of-the-art multimodal forensic research.
AV-Deepfake1M is a large-scale benchmark dataset specifically designed for the development, training, and rigorous evaluation of audio-visual deepfake detection and temporal localization systems. The dataset provides synchronized video and audio recordings with granular, frame-level ground truth indicating whether, when, and via which modality (audio, video, or both) manipulations have been performed. Its construction draws on state-of-the-art text-to-speech (TTS), face-generation, and voice-conversion pipelines and establishes balanced categorical splits for robust comparison under challenging multimodal forgeries and partial-segment attacks (Kukanov et al., 10 Aug 2025, Koutlis et al., 15 Nov 2024).
1. Dataset Composition, Classes, and Scale
AV-Deepfake1M consists of over 1 million audiovisual clips, each comprising a talking-head face video aligned with a speech audio track. The design emphasizes four manipulation conditions:
- RVRA: Real Video, Real Audio
- RVFA: Real Video, Fake Audio (audio modified by voice conversion or TTS)
- FVRA: Fake Video, Real Audio (lip/mouth region re-synthesized to match authentic speech)
- FVFA: Fake Video, Fake Audio (simultaneous audio and video manipulation)
Balanced representation is maintained across classes, with training and validation splits containing approximately equal numbers per category: in AV-Deepfake1M, roughly 186k training clips and 14.3k validation clips per class (Koutlis et al., 15 Nov 2024). AV-Deepfake1M++ and the AV-Deepfake1M 2025 Challenge employ over 2 million clips, further increasing coverage for all manipulation types. A typical clip contains a single visible speaker, with mean durations ranging from about 4 s (AV-Deepfake1M) to ≈9.6 s (AV-Deepfake1M++); in AV-Deepfake1M++, clip lengths vary from 1.32 s up to 668.48 s in training and up to 152.40 s in validation (Kukanov et al., 10 Aug 2025).
Summary Statistics (AV-Deepfake1M++)
| Statistic | Train | Validation |
|---|---|---|
| Total clips | 1,099,217 | 77,326 |
| Real | 297,389 (27.05%) | 20,220 (26.15%) |
| Audio Modified | 266,116 (24.21%) | 18,938 (24.49%) |
| Visual Modified | 269,901 (24.55%) | 19,099 (24.70%) |
| Both Modified | 265,812 (24.18%) | 19,069 (24.66%) |
| Clip duration (s) | 1.32–668.48 (mean 9.60) | 2.52–152.40 (mean 9.56) |
| Attacks per clip | 0–5 | 0–5 |
| Fake segment duration (s) | 0.02–11.58 (mean 0.33) | 0.02–8.10 (mean 0.33) |
This composition is a direct response to the inadequacies of unimodal and coarsely manipulated benchmarks, supporting subtle, short, and temporally local forgeries on a massive scale (Kukanov et al., 10 Aug 2025).
2. Manipulation Types and Data Generation Protocols
Audio and video manipulations are performed independently and jointly, leveraging state-of-the-art synthetic techniques:
- Audio Attacks:
- Advanced text-to-speech (TTS), including zero-shot models and VITS-based prosody cloning.
- Simulated codec degradations (HTK, GSM, AMR-NB, SPH, Vorbis, OGG, MP3, WAV; multiple bit depths and encodings).
- Video Attacks:
- Audio-driven lip-sync and talking-head models (LatentSyncTaming, Diff2Lip-style, flow-based talking-head synthesis).
- Re-synthesis is limited to the mouth region, while the remainder of the face remains unmodified.
- Joint Attacks:
- Simultaneous TTS-based audio spoofing and temporally correlated video lip-sync.
These manipulations are applied to scripted utterances spoken by real individuals, with manipulation intervals varying per sample. Empirically, manipulated segments in AV-Deepfake1M are, on average, half the length of those in the earlier LAV-DF benchmark, with a mean of ≈1.3 fake segments per clip and segment durations averaging 0.33 s (Koutlis et al., 15 Nov 2024, Kukanov et al., 10 Aug 2025).
Manipulation Pipeline (Pseudocode excerpt)
```python
# Illustrative sketch of the per-clip manipulation pipeline; model calls are placeholders.
for clip_id, (v, a), cls in original_clips:        # cls in {RVRA, RVFA, FVRA, FVFA}
    v_out, a_out = v, a                             # real clips pass through unmodified
    if cls in ("FVFA", "FVRA"):
        v_out = deepfake_video_model(v)             # re-synthesize the mouth region
    if cls in ("FVFA", "RVFA"):
        a_out = voice_conversion(a)                 # TTS / voice-conversion spoof
    store(clip_id, v_out, a_out, ground_truth_intervals(clip_id, cls))
```
3. Modalities, Feature Extraction, and Augmentation
Video Modality
- Input: Face-centric video (25–30 fps)
- Landmark Extraction: MediaPipe Face Mesh (468 points) to isolate mouth-ROI and non-mouth face.
- Handcrafted Temporal Features (numbered as referenced in the results tables):
  1. Mouth-ROI blurriness (variance of Laplacian)
  2. Non-mouth MSE between consecutive frames
  3. Lab-space mouth-ROI color shift
  4–8. Lip landmark kinematics: aspect ratio, velocity, acceleration, jerk, and linear-fit jitter
- Sequence Modeling:
1D TCN with four residual blocks (dilations 1–16), attention pooling, and an MLP head (≈124k parameters for classification; ≈140k with a BILOU tagging head for localization) (Kukanov et al., 10 Aug 2025).
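Below is a minimal PyTorch sketch of a model of this type: dilated 1D residual blocks, attention pooling over time, and an MLP head. Layer widths, dilation spacing, and class names are illustrative assumptions, not the exact architecture of (Kukanov et al., 10 Aug 2025).

```python
import torch
import torch.nn as nn

class ResidualTCNBlock(nn.Module):
    """Dilated 1D conv block with a residual connection (length-preserving)."""
    def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
        )
        self.act = nn.ReLU()

    def forward(self, x):                              # x: (B, C, T)
        return self.act(x + self.net(x))

class VideoTCNClassifier(nn.Module):
    """Per-frame handcrafted features -> dilated TCN -> attention pooling -> MLP head."""
    def __init__(self, in_features: int = 8, channels: int = 64, num_classes: int = 1):
        super().__init__()
        self.proj = nn.Conv1d(in_features, channels, kernel_size=1)
        self.blocks = nn.Sequential(
            *[ResidualTCNBlock(channels, dilation=d) for d in (1, 4, 8, 16)]
        )
        self.attn = nn.Linear(channels, 1)              # attention pooling over time
        self.head = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(),
                                  nn.Linear(channels, num_classes))

    def forward(self, feats):                           # feats: (B, T, in_features)
        x = self.blocks(self.proj(feats.transpose(1, 2)))   # (B, C, T)
        x = x.transpose(1, 2)                                # (B, T, C)
        w = torch.softmax(self.attn(x), dim=1)               # (B, T, 1) attention weights
        pooled = (w * x).sum(dim=1)                          # (B, C)
        return self.head(pooled)                             # clip-level logit

# Example: a batch of 2 clips, 100 frames each, 8 handcrafted features per frame
logits = VideoTCNClassifier()(torch.randn(2, 100, 8))
```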
Audio Modality
- Input: Raw speech waveform (16 kHz).
- Encoders:
- Wav2Vec 2.0 XLSR-53 (classification, 1024-D output)
- WavLM-Large SSL (temporal localization at 40 ms frame resolution)
- Classifier/Head:
- AASIST (graph-attention network for spectro-temporal cues)
- Boundary-aware Attention Mechanism (BAM) for segment localization
- Input Context: 4 s window for both tasks, with wrapping/truncation for variable-length clips.
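The fixed 4 s context can be prepared as sketched below; the helper name and the choice of tiling for wrap-around padding are assumptions for illustration.

```python
import numpy as np

def fix_length(waveform: np.ndarray, sr: int = 16_000, context_s: float = 4.0) -> np.ndarray:
    """Return exactly `context_s` seconds of audio: wrap short clips, truncate long ones."""
    target = int(sr * context_s)
    if len(waveform) >= target:
        return waveform[:target]                 # truncate long clips
    reps = int(np.ceil(target / len(waveform)))
    return np.tile(waveform, reps)[:target]      # wrap (repeat) short clips

# Example: a 1.5 s clip is tiled up to 4 s, a 10 s clip is cut to its first 4 s
short = fix_length(np.random.randn(int(16_000 * 1.5)))
long_ = fix_length(np.random.randn(int(16_000 * 10.0)))
assert short.shape == long_.shape == (64_000,)
```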
Augmentation Protocols
- Video:
- Features: channel-wise normalization, feature-shifting, temporal dropout (random frame drops), index swap, full-channel dropout, additive Gaussian noise.
- Audio:
- audiomentations and the torchaudio/SoX backend, with room-impulse responses, time-stretch, pitch-shift, additive Gaussian noise, filtering, and codec simulation, each applied with a fixed per-transform probability.
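A hedged example of this kind of waveform augmentation chain using the audiomentations library; the specific transforms, parameter ranges, and per-transform probabilities are illustrative assumptions rather than the exact protocol.

```python
import numpy as np
from audiomentations import (
    Compose, AddGaussianNoise, TimeStretch, PitchShift, LowPassFilter,
)

# Each transform fires independently with probability p (placeholder values).
# ApplyImpulseResponse(ir_path=...) would add room reverberation given a folder of IRs,
# and Mp3Compression could stand in for codec simulation (needs an MP3 backend).
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.3),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.3),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.3),
    LowPassFilter(p=0.3),
])

waveform = np.random.randn(64_000).astype(np.float32)   # 4 s at 16 kHz
augmented = augment(samples=waveform, sample_rate=16_000)
```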
4. Annotation, Labeling, and Evaluation Protocols
Labels and Supervision
- Tasks:
- Deepfake Classification (DFD): Clip-level real/fake labels (weak supervision)
- Temporal Forgery Localization (TFL): Frame-level binary labels and boundary-aware BILOU tags (B=Begin, I=Inside, L=Last, O=Outside of manipulated segment) (Kukanov et al., 10 Aug 2025).
- Distance-to-boundary regression targets for each frame and each modality (Koutlis et al., 15 Nov 2024).
- Ground Truth Format:
- Binary modality labels ($y_t^{(m)} \in \{0,1\}$ per frame, for modality $m \in \{\text{audio}, \text{video}\}$).
- Start/end intervals or per-frame distances to each boundary; no pixel-level masks.
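For concreteness, here is a small sketch (with assumed function and variable names) that converts ground-truth fake intervals into per-frame binary labels and BILOU tags at a given frame rate; single-frame segments are tagged "B" here, since the unit tag is not defined in the description above.

```python
from typing import List, Tuple

def intervals_to_bilou(intervals: List[Tuple[float, float]],
                       num_frames: int, fps: float = 25.0) -> List[str]:
    """Map (start_s, end_s) fake intervals to per-frame BILOU tags."""
    tags = ["O"] * num_frames
    for start_s, end_s in intervals:
        first = max(0, int(round(start_s * fps)))
        last = min(num_frames - 1, int(round(end_s * fps)) - 1)
        if last < first:
            continue
        tags[first] = "B"
        if last > first:
            tags[last] = "L"
            for t in range(first + 1, last):
                tags[t] = "I"
    return tags

# Example: one 0.33 s fake segment starting at 1.0 s, in a 4 s clip at 25 fps
tags = intervals_to_bilou([(1.0, 1.33)], num_frames=100)
binary = [int(t != "O") for t in tags]     # per-frame fake/real labels
```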
Evaluation Metrics
- Classification:
  - Area Under the ROC Curve (AUC):
    $$\mathrm{AUC} = \frac{1}{|\mathcal{F}|\,|\mathcal{R}|}\sum_{x_f \in \mathcal{F}}\sum_{x_r \in \mathcal{R}} \mathbb{1}\big[s(x_f) > s(x_r)\big],$$
    where $\mathcal{F}$ and $\mathcal{R}$ are the sets of fake and real samples, respectively, and $s(\cdot)$ denotes the model's score.
- Localization:
- Average Precision (AP) at IoU thresholds of 0.5, 0.75, 0.9, and 0.95
- Average Recall (AR) at fixed proposal budgets (e.g., AR@50)
- Combined Score: an aggregate of the AP and AR values used for leaderboard ranking (see the metric sketch below)
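A compact sketch of the metric building blocks, using scikit-learn for clip-level AUC and a hand-written 1-D temporal IoU of the kind used to match predicted segments against ground truth for AP/AR; the scores and segment values are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def temporal_iou(pred: tuple, gt: tuple) -> float:
    """IoU between two (start_s, end_s) segments on the time axis."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Clip-level classification: scores in [0, 1], labels 1 = fake, 0 = real
y_true = np.array([1, 0, 1, 0, 1])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.35])
print("AUC:", roc_auc_score(y_true, y_score))

# A predicted segment counts as a true positive at threshold t if IoU >= t
print("IoU:", temporal_iou((1.00, 1.40), (1.00, 1.33)))   # ≈ 0.825
```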
Access and Evaluation
Data Availability:
- Train and validation splits: publicly downloadable from the authors’ repository (Cai et al.).
- Test set: metadata not public; evaluation is managed via Codabench (Koutlis et al., 15 Nov 2024).
- Licensing: Non-commercial research use only.
5. Benchmarking, Baselines, and Key Results
Performance benchmarking on AV-Deepfake1M and AV-Deepfake1M++ tracks advances in both unimodal and multimodal detection settings. The dataset’s benchmark structure enforces balanced, identity-disjoint splits and supports both real-fake and partial-modality detection scenarios.
Classification Task (AUC, %)
| Method | Val AUC | TestA AUC |
|---|---|---|
| Video TCN (features 1,2) | 84.29 | 69.31 |
| Video TCN (+color shift) | 85.48 | 72.00 |
| Video TCN (+landmark kinematics) | 88.41 | 73.11 |
| Wav2Vec-AASIST (ASV19 baseline) | 73.48 | 60.53 |
| Wav2Vec-AASIST-codecs | 99.71 | 82.91 |
| Audio + Video AVG fusion | 97.86 | 91.97 |
| KLASSify (calibrated max-out) | 98.04 | 92.78 |
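The fusion rows above combine per-modality clip scores. Below is a minimal sketch of two simple late-fusion rules (score averaging and max-out); the calibration step used by KLASSify is not reproduced here.

```python
def fuse_scores(audio_score: float, video_score: float, mode: str = "avg") -> float:
    """Late fusion of per-clip modality scores in [0, 1]."""
    if mode == "avg":
        return 0.5 * (audio_score + video_score)
    if mode == "max":
        return max(audio_score, video_score)   # flag a clip if either modality looks fake
    raise ValueError(f"unknown fusion mode: {mode}")

print(fuse_scores(0.92, 0.30, "avg"))  # 0.61
print(fuse_scores(0.92, 0.30, "max"))  # 0.92
```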
Temporal Localization (IoU and AP metrics, TestA)
| Method | IoU | [email protected] | [email protected] | [email protected] | [email protected] | AR@50 |
|---|---|---|---|---|---|---|
| Video TCN (4 features) | 0.1139 | – | – | – | – | – |
| KLASSify-BAM (audio) | 0.3536 | 0.5117 | 0.4017 | 0.1701 | 0.0416 | 0.4259 |
SOTA Performance (AV-Deepfake1M, DiMoDif (Koutlis et al., 15 Nov 2024))
- Classification AUC: Up to 96.3% (DiMoDif, audio-visual, test set)
- Temporal Localization: [email protected] of 86.9%, [email protected] of 75.9% (DiMoDif)
The classification and localization tables above reflect consistent gains from unimodal to multimodal approaches, as well as the impact of feature and network design choices.
6. Impact, Limitations, and Research Directions
AV-Deepfake1M establishes a new evaluation paradigm prioritizing (1) balanced, multi-category deepfake coverage, (2) temporally local and subtle manipulation intervals, and (3) support for weak- and fully supervised detection/localization pipelines. Its scale, manipulation diversity, and complete temporal labeling enable the community to:
- Benchmark interpretable models leveraging handcrafted features and lightweight TCNs (Kukanov et al., 10 Aug 2025)
- Develop advanced multimodal fusion and cross-modal discrepancy analysis—e.g., DiMoDif’s hierarchical cross-modal fusion with adaptive alignment and discrepancy mapping (Koutlis et al., 15 Nov 2024)
- Explore resilience to codec-induced degradations and real-world preprocessing noise
- Advance fine-grained temporal localization of deepfake forgeries, a critical capability for content-tracing and forensic applications
Certain implementation details—including exact per-sample frame rates, resolutions, and underlying source datasets—are not specified in all descriptions and may vary between AV-Deepfake1M and AV-Deepfake1M++. Access to test labels is restricted, requiring server-side evaluation. A plausible implication is that reproducibility is very high for model comparisons but that custom downstream uses may require adaptation or reference to the original license (Koutlis et al., 15 Nov 2024).
The dataset is widely adopted for recent state-of-the-art multimodal detection challenges, shaping the direction of practical research in audio-visual deepfake forensics.