AV-Deepfake1M++ Dataset

Updated 4 December 2025
  • AV-Deepfake1M++ is a comprehensive, multi-type audio–visual dataset featuring over 2 million clips with precise, segment-level annotations.
  • It challenges deepfake detection models by supporting both binary classification and fine-grained temporal localization tasks.
  • Its multimodal design and diverse manipulation techniques offer a robust benchmark for evaluating detector performance against unseen spoofing attacks.

AV-Deepfake1M++ is a large-scale, multi-type, partially manipulated audio–visual video dataset developed to advance research in deepfake detection and temporal localization. Building on the original AV-Deepfake1M dataset, AV-Deepfake1M++ introduces a higher volume of clips, expanded manipulation diversity, and precise segment-level annotations. The benchmark is specifically constructed to challenge next-generation, multimodal deepfake detectors under real-world variability and unseen attack conditions (Kukanov et al., 10 Aug 2025).

1. Dataset Construction and Statistics

AV-Deepfake1M++ comprises over 2 million audio–visual video clips, systematically partitioned into training, validation, and held-out test splits:

| Statistic         | Training Set      | Validation Set   |
|-------------------|-------------------|------------------|
| Total clips       | 1,099,217         | 77,326           |
| Real              | 297,389 (27.05%)  | 20,220 (26.15%)  |
| Audio modified    | 266,116 (24.21%)  | 18,938 (24.49%)  |
| Visual modified   | 269,901 (24.55%)  | 19,099 (24.70%)  |
| Both modified     | 265,812 (24.18%)  | 19,069 (24.66%)  |
| Mean duration (s) | 9.60              | 9.56             |

Clip duration ranges from 1.32 to 668.48 seconds in the training split, with a mean of approximately 9.6 seconds. Each video is labeled as real, audio_modified, visual_modified, or both_modified, with labels balanced across splits. The dataset introduces fine-grained segment-level manipulations, with 0–5 fake segments per clip and a mean segment duration of 0.33 seconds, roughly the length of a single short word. This granularity supports research on detecting brief, context-sensitive tampering; adversarially selected words are common targets for manipulation.

Source data is obtained via large-scale web scraping of speaking-head videos, which are then processed for partial deepfake attacks. Audio extraction leverages torchvision’s read_video utility at 16 kHz sampling, while visual preprocessing employs MediaPipe FaceMesh for mouth region-of-interest (ROI) detection. For model development, clips are subsetted to ≤6 seconds for audio models (training on 4 s context windows) and ≤10.24 seconds (≤256 frames) for video models.
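
A minimal preprocessing sketch in Python of the pipeline described above, assuming torchvision.io.read_video supplies frames and audio, torchaudio handles resampling to 16 kHz, and a hypothetical subset of MediaPipe FaceMesh lip-landmark indices defines the mouth ROI (the dataset's exact crop logic may differ):

```python
import mediapipe as mp
import torchaudio
from torchvision.io import read_video

# Hypothetical subset of MediaPipe FaceMesh landmark indices around the lips.
LIP_IDX = [61, 291, 0, 17, 13, 14, 78, 308]

def load_clip(path, target_sr=16_000):
    """Read frames and audio from a clip and resample the audio to 16 kHz."""
    frames, audio, info = read_video(path, pts_unit="sec")  # frames: (T, H, W, C) RGB uint8
    sr = int(info["audio_fps"])  # assumes the clip has an audio stream
    if sr != target_sr:
        audio = torchaudio.functional.resample(audio, sr, target_sr)
    return frames, audio

def mouth_rois(frames, pad=0.05):
    """Crop a padded mouth bounding box per frame using FaceMesh landmarks."""
    rois = []
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1) as fm:
        for frame in frames.numpy():                     # (H, W, C) RGB uint8
            h, w, _ = frame.shape
            res = fm.process(frame)
            if not res.multi_face_landmarks:
                rois.append(None)                        # no face detected in this frame
                continue
            lm = res.multi_face_landmarks[0].landmark
            xs = [lm[i].x for i in LIP_IDX]
            ys = [lm[i].y for i in LIP_IDX]
            x0, x1 = int((min(xs) - pad) * w), int((max(xs) + pad) * w)
            y0, y1 = int((min(ys) - pad) * h), int((max(ys) + pad) * h)
            rois.append(frame[max(y0, 0):y1, max(x0, 0):x1])
    return rois
```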

2. Labels, Annotation Scheme, and Ground Truth

Annotation follows a dual-level structure:

  • Classification labels (Task 1): Each clip is mapped to a binary real/fake label; a clip is "fake" if any audio or visual modification is present. Only video-level labels are available for training.
  • Temporal localization labels (Task 2): Frame-level ground truth adopts the BILOU scheme (Begin, Inside, Last, Outside, Unit), from which exact start and end timestamps of each manipulated segment can be derived. Segment boundaries are given at video frame-rate precision or 40 ms precision for audio. No per-segment confidence scores are provided.

This annotation strategy ensures both coarse (clip-level) and fine-grained (frame-level) supervision, fundamental for tasks such as detection and temporal boundary estimation in partially deepfaked content.
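
As a concrete illustration of how frame-level BILOU tags can be turned into segment timestamps, the sketch below assumes a hypothetical 25 fps tag sequence and standard BILOU semantics (with U marking a single-frame segment); the dataset's official tooling may encode tags differently:

```python
from typing import List, Tuple

def bilou_to_segments(tags: List[str], fps: float = 25.0) -> List[Tuple[float, float]]:
    """Convert per-frame BILOU tags (B, I, L, O, U) into
    (start_sec, end_sec) intervals of manipulated content."""
    segments, start = [], None
    for i, tag in enumerate(tags):
        if tag == "U":                                   # single-frame fake segment
            segments.append((i / fps, (i + 1) / fps))
        elif tag == "B":                                 # segment opens
            start = i
        elif tag == "L" and start is not None:           # segment closes
            segments.append((start / fps, (i + 1) / fps))
            start = None
    return segments

# Hypothetical tag sequence: frames 2-4 form one fake segment, frame 7 a one-frame segment
print(bilou_to_segments(["O", "O", "B", "I", "L", "O", "O", "U"]))
# -> [(0.08, 0.2), (0.28, 0.32)] at the assumed 25 fps
```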

3. Supported Tasks and Benchmark Metrics

AV-Deepfake1M++ supports two principal evaluation tasks, structured to enable both classification and fine localization:

Task 1 – Deepfake Classification

Single-speaker video clips are classified as either real or fake ($y \in \{0,1\}$), using the metric:

$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\bigl(\mathrm{FPR}^{-1}(t)\bigr)\,dt$$
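
In practice the AUC is computed directly from clip-level scores; a minimal sketch using scikit-learn with hypothetical labels and detector scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical clip-level labels (1 = fake) and detector scores
y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.12, 0.40, 0.88, 0.35, 0.97, 0.30])

print(f"Task 1 AUC: {roc_auc_score(y_true, y_score):.4f}")  # -> 0.8889
```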

Task 2 – Temporal Localization

Temporal localization requires precise detection of manipulated time intervals. Key metrics include:

  • Intersection-over-Union (IoU):

$$\mathrm{IoU} = \frac{|\hat{T} \cap T|}{|\hat{T} \cup T|}$$

where $\hat{T}$ is the predicted manipulated interval and $T$ the ground-truth interval.

  • Average Precision (AP@IoU): Evaluated at thresholds {0.5, 0.75, 0.9, 0.95}.
  • Average Recall (AR@N): Assessed for maximum N predicted segments {50, 30, 20, 10, 5}.
  • Combined Score:

$$\mathrm{Score} = \frac{1}{2}\,\mathrm{mean\,AP} + \frac{1}{2}\,\mathrm{mean\,AR}$$

with the means taken over the specified IoU thresholds and segment-count limits, respectively.
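
A short sketch of temporal IoU and the combined score as defined above; the AP/AR values plugged in are made up purely for illustration:

```python
def segment_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def combined_score(ap_at_iou, ar_at_n):
    """0.5 * mean AP (over IoU thresholds) + 0.5 * mean AR (over segment limits)."""
    mean_ap = sum(ap_at_iou.values()) / len(ap_at_iou)
    mean_ar = sum(ar_at_n.values()) / len(ar_at_n)
    return 0.5 * mean_ap + 0.5 * mean_ar

print(segment_iou((1.0, 1.4), (1.1, 1.5)))          # -> approximately 0.6

# Hypothetical AP/AR values at the thresholds and limits listed above
ap = {0.5: 0.62, 0.75: 0.48, 0.9: 0.31, 0.95: 0.20}
ar = {50: 0.70, 30: 0.66, 20: 0.61, 10: 0.55, 5: 0.47}
print(combined_score(ap, ar))                        # -> ~0.5 for these made-up values
```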

Baseline Results on TestA

| Method                     | Val AUC (%) | TestA AUC (%) | TestA IoU |
|----------------------------|-------------|---------------|-----------|
| Video TCN (visual)         | 88.41       | 73.11         | 0.1139    |
| Audio Wav2Vec-AASIST       | 99.71       | 82.91         | –         |
| Multimodal avg-fusion      | 97.86       | 91.97         | –         |
| KLASSify-BAM (audio-only)  | –           | –             | 0.3536    |
| KLASSify (max-out fusion)  | 98.04       | 92.78         | –         |

KLASSify’s multimodal architecture, employing Platt-calibrated max-out fusion, achieves a TestA classification AUC of 92.78% and an audio-only IoU of 0.3536 (Kukanov et al., 10 Aug 2025).

4. Modalities, Feature Representations, and Modeling Approaches

AV-Deepfake1M++ mandates modeling both audio and visual streams:

  • Audio Modality:
    • Front-end: Wav2Vec 2.0 XLSR-53 self-supervised encoder (1024-D output).
    • Backbone: AASIST spectro-temporal Graph Attention Network for utterance- and frame-level detection.
    • Extensive data augmentation, including codec artifacts, RTP-generated RIRs, band-pass/stop filtering, pitch/time shifts, and additive noise.
    • Challenges include detection of extremely short fake segments (partial utterances) and domain mismatch due to varying recording conditions.
  • Visual Modality:
    • Per-frame handcrafted temporal features, extracted from the mouth ROI and background:
      1. Mouth blurriness (Laplacian variance)
      2. Non-mouth frame-to-frame mean squared error
      3. Mouth ROI color-shift in Lab space
      4. Landmark kinematics (aspect ratio, velocity, acceleration, jerk, jitter)
    • Features are processed with a lightweight 1D Temporal Convolutional Network (TCN, ≈124K parameters).
    • Key visual confounders include lighting changes, head pose, facial occlusions, and backgrounds that remain static under mouth-only manipulations.

This multimodal feature engineering is designed for interpretability, adaptability, and efficient boundary-aware inference.
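
On the audio side, frame-level SSL features of the sort described above can be extracted with HuggingFace Transformers; the sketch below assumes the public facebook/wav2vec2-large-xlsr-53 checkpoint and omits the AASIST back-end and the augmentation pipeline:

```python
import torch
from transformers import Wav2Vec2Model

# XLSR-53 self-supervised encoder; its hidden states are 1024-D per frame.
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53").eval()

@torch.no_grad()
def audio_frame_features(waveform_16k: torch.Tensor) -> torch.Tensor:
    """Map a mono 16 kHz waveform of shape (1, samples) to features of shape (frames, 1024)."""
    # Per-utterance normalisation, as wav2vec 2.0 large models typically expect.
    x = (waveform_16k - waveform_16k.mean()) / (waveform_16k.std() + 1e-7)
    return encoder(x).last_hidden_state.squeeze(0)
```

On the visual side, a minimal sketch of three of the handcrafted per-frame cues (blurriness, non-mouth MSE, Lab color shift), using OpenCV and assuming RGB frames and same-sized mouth crops; landmark kinematics are omitted, and the exact feature definitions in the paper may differ:

```python
import cv2
import numpy as np

def mouth_blurriness(mouth_rgb: np.ndarray) -> float:
    """Laplacian variance of the grayscale mouth crop (lower values = blurrier)."""
    gray = cv2.cvtColor(mouth_rgb, cv2.COLOR_RGB2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def non_mouth_mse(prev_frame: np.ndarray, frame: np.ndarray, mouth_box) -> float:
    """Frame-to-frame mean squared error outside the mouth box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = mouth_box
    mask = np.ones(frame.shape[:2], dtype=bool)
    mask[y0:y1, x0:x1] = False
    diff = frame.astype(np.float32) - prev_frame.astype(np.float32)
    return float((diff[mask] ** 2).mean())

def mouth_color_shift(prev_mouth_rgb: np.ndarray, mouth_rgb: np.ndarray) -> float:
    """Mean absolute per-pixel shift of the mouth crop in CIELAB space."""
    lab_prev = cv2.cvtColor(prev_mouth_rgb, cv2.COLOR_RGB2LAB).astype(np.float32)
    lab_cur = cv2.cvtColor(mouth_rgb, cv2.COLOR_RGB2LAB).astype(np.float32)
    return float(np.abs(lab_cur - lab_prev).mean())
```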

5. Comparison to AV-Deepfake1M and Other Benchmarks

AV-Deepfake1M++ extends the scope and rigor of earlier deepfake benchmarks:

  • Scale: Increases corpus size to over 2 million clips, doubling AV-Deepfake1M’s volume.
  • Manipulation Granularity: Supports partial, segment-level manipulations (mean duration 0.33 s), rather than solely full-clip fakes.
  • Diversity: Augments talking-head lip-sync attacks with advanced TTS (zero-shot, VITS), novel audio codecs, and environmental reverberation effects.
  • Labeling: Four-way class structure (real, audio_modified, visual_modified, both_modified) in contrast to binary labeling prevalent in most deepfake datasets.
  • Generalization Benchmarking: Held-out TestA/TestB splits include previously unseen attack types, explicitly designed for stress-testing detector robustness and generalization over novel spoofing strategies.

A plausible implication is that AV-Deepfake1M++ enables research into cross-dataset generalization (e.g., PartialSpoof → AV-Deepfake1M++) and fosters evaluation of models’ capacity under distributional shift.

6. Methodological Implications and Research Directions

The scale and structure of AV-Deepfake1M++ require methodologies that move beyond static, full-clip classifiers:

  • Temporal Modeling: Frame-level temporal models (such as TCNs and boundary-aware attention mechanisms) become essential for detecting short, localized manipulations.
  • Multimodal Architectures: The best reported performance combines self-supervised (SSL) audio backbones with graph attention networks and interpretable, handcrafted visual features, fused via score calibration and max-out rules.
  • Best Practices: Modality-specific Platt sigmoid calibration, sliding-window inference to capture brief segments, loss weighting for BILOU frame-tag heads, and aggressive data augmentation to improve robustness to unseen manipulations (a calibration-and-fusion sketch follows this list).
  • Open Problems: AV-Deepfake1M++ spotlights challenges such as generalization to out-of-domain attacks, developing lightweight boundary-aware detectors, and constructing interpretable feature representations for video-based lip-sync forgeries.
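
A minimal sketch of modality-specific Platt (sigmoid) calibration followed by max-out score fusion, assuming raw per-clip scores from separate audio and visual detectors and using scikit-learn's logistic regression as the calibrator; the numbers are hypothetical and the paper's exact procedure may differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt_calibrator(dev_scores, dev_labels):
    """Fit a per-modality Platt (sigmoid) calibrator on held-out development scores."""
    calibrator = LogisticRegression()
    calibrator.fit(np.asarray(dev_scores, dtype=float).reshape(-1, 1), dev_labels)
    return calibrator

def maxout_fusion(calibrators, clip_scores):
    """Calibrate each modality's raw score, then take the maximum fake probability."""
    probs = [
        cal.predict_proba(np.array([[score]], dtype=float))[0, 1]
        for cal, score in zip(calibrators, clip_scores)
    ]
    return max(probs)

# Hypothetical development scores and labels (1 = fake) for each branch
audio_cal = fit_platt_calibrator([0.1, 0.9, 0.4, 2.3], [0, 1, 0, 1])
video_cal = fit_platt_calibrator([-1.2, 0.8, 0.2, 1.5], [0, 1, 0, 1])

# Fused clip-level probability that the clip is fake
print(maxout_fusion([audio_cal, video_cal], clip_scores=[1.1, -0.3]))
```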

AV-Deepfake1M++ constitutes a comprehensive, fine-grained, and challenging audio–visual benchmark. By integrating diverse manipulation types, precise temporal annotation, and substantial real-world variability, the dataset provides a highly discriminative platform for advancing state-of-the-art deepfake detection and localization research (Kukanov et al., 10 Aug 2025).
