AV-Deepfake1M++ Dataset

Updated 4 December 2025
  • AV-Deepfake1M++ is a comprehensive, multi-type audio–visual dataset featuring over 2 million clips with precise, segment-level annotations.
  • It challenges deepfake detection models by supporting both binary classification and fine-grained temporal localization tasks.
  • Its multimodal design and diverse manipulation techniques offer a robust benchmark for evaluating detector performance against unseen spoofing attacks.

AV-Deepfake1M++ is a large-scale, multi-type, partially manipulated audio–visual video dataset developed to advance research in deepfake detection and temporal localization. Building on the original AV-Deepfake1M dataset, AV-Deepfake1M++ introduces a higher volume of clips, expanded manipulation diversity, and precise segment-level annotations. The benchmark is specifically constructed to challenge next-generation, multimodal deepfake detectors under real-world variability and unseen attack conditions (Kukanov et al., 10 Aug 2025).

1. Dataset Construction and Statistics

AV-Deepfake1M++ comprises over 2 million audio–visual video clips, systematically partitioned into training, validation, and held-out test splits:

| Statistic         | Training Set      | Validation Set   |
|-------------------|-------------------|------------------|
| Total clips       | 1,099,217         | 77,326           |
| Real              | 297,389 (27.05%)  | 20,220 (26.15%)  |
| Audio modified    | 266,116 (24.21%)  | 18,938 (24.49%)  |
| Visual modified   | 269,901 (24.55%)  | 19,099 (24.70%)  |
| Both modified     | 265,812 (24.18%)  | 19,069 (24.66%)  |
| Mean duration (s) | 9.60              | 9.56             |

Clip duration ranges from 1.32 to 668.48 seconds in the training split, with a mean of approximately 9.6 seconds. Each video is labeled as real, audio_modified, visual_modified, or both_modified, with labels balanced across splits. The dataset introduces fine-grained segment-level manipulations, with 0–5 fake segments per clip and a mean segment duration of 0.33 seconds, roughly the length of a single short word. This granularity supports research on detecting brief, context-sensitive tampering; adversarially selected words are common targets for manipulation.

Source data is obtained via large-scale web scraping of speaking-head videos, which are then processed for partial deepfake attacks. Audio extraction leverages torchvision’s read_video utility at 16 kHz sampling, while visual preprocessing employs MediaPipe FaceMesh for mouth region-of-interest (ROI) detection. For model development, clips are subsetted to ≤6 seconds for audio models (training on 4 s context windows) and ≤10.24 seconds (≤256 frames) for video models.
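
A minimal preprocessing sketch in Python of the pipeline described above, assuming torchvision.io.read_video supplies frames and audio, torchaudio handles resampling to 16 kHz, and a hypothetical subset of MediaPipe FaceMesh lip-landmark indices defines the mouth ROI (the dataset's exact crop logic may differ):

```python
import mediapipe as mp
import torchaudio
from torchvision.io import read_video

# Hypothetical subset of MediaPipe FaceMesh landmark indices around the lips.
LIP_IDX = [61, 291, 0, 17, 13, 14, 78, 308]

def load_clip(path, target_sr=16_000):
    """Read frames and audio from a clip and resample the audio to 16 kHz."""
    frames, audio, info = read_video(path, pts_unit="sec")  # frames: (T, H, W, C) RGB uint8
    sr = int(info["audio_fps"])  # assumes the clip has an audio stream
    if sr != target_sr:
        audio = torchaudio.functional.resample(audio, sr, target_sr)
    return frames, audio

def mouth_rois(frames, pad=0.05):
    """Crop a padded mouth bounding box per frame using FaceMesh landmarks."""
    rois = []
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1) as fm:
        for frame in frames.numpy():                     # (H, W, C) RGB uint8
            h, w, _ = frame.shape
            res = fm.process(frame)
            if not res.multi_face_landmarks:
                rois.append(None)                        # no face detected in this frame
                continue
            lm = res.multi_face_landmarks[0].landmark
            xs = [lm[i].x for i in LIP_IDX]
            ys = [lm[i].y for i in LIP_IDX]
            x0, x1 = int((min(xs) - pad) * w), int((max(xs) + pad) * w)
            y0, y1 = int((min(ys) - pad) * h), int((max(ys) + pad) * h)
            rois.append(frame[max(y0, 0):y1, max(x0, 0):x1])
    return rois
```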

2. Labels, Annotation Scheme, and Ground Truth

Annotation follows a dual-level structure:

  • Classification labels (Task 1): Each clip is mapped to a binary real/fake label; a clip is "fake" if any audio or visual modification is present. Only video-level labels are available for training.
  • Temporal localization labels (Task 2): Frame-level ground truth adopts the BILOU scheme (Begin, Inside, Last, Outside, Unit), from which exact start and end timestamps of each manipulated segment can be derived. Segment boundaries are given at video frame-rate precision or 40 ms precision for audio. No per-segment confidence scores are provided.

This annotation strategy ensures both coarse (clip-level) and fine-grained (frame-level) supervision, fundamental for tasks such as detection and temporal boundary estimation in partially deepfaked content.
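
As a concrete illustration of how frame-level BILOU tags can be turned into segment timestamps, the sketch below assumes a hypothetical 25 fps tag sequence and standard BILOU semantics (with U marking a single-frame segment); the dataset's official tooling may encode tags differently:

```python
from typing import List, Tuple

def bilou_to_segments(tags: List[str], fps: float = 25.0) -> List[Tuple[float, float]]:
    """Convert per-frame BILOU tags (B, I, L, O, U) into
    (start_sec, end_sec) intervals of manipulated content."""
    segments, start = [], None
    for i, tag in enumerate(tags):
        if tag == "U":                                   # single-frame fake segment
            segments.append((i / fps, (i + 1) / fps))
        elif tag == "B":                                 # segment opens
            start = i
        elif tag == "L" and start is not None:           # segment closes
            segments.append((start / fps, (i + 1) / fps))
            start = None
    return segments

# Hypothetical tag sequence: frames 2-4 form one fake segment, frame 7 a one-frame segment
print(bilou_to_segments(["O", "O", "B", "I", "L", "O", "O", "U"]))
# -> [(0.08, 0.2), (0.28, 0.32)] at the assumed 25 fps
```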

3. Supported Tasks and Benchmark Metrics

AV-Deepfake1M++ supports two principal evaluation tasks, structured to enable both classification and fine localization:

Task 1 – Deepfake Classification

Single-speaker video clips are classified as either real or fake ($y \in \{0,1\}$), using the metric:

$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\bigl(\mathrm{FPR}^{-1}(t)\bigr)\,dt$$
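
In practice the AUC is computed directly from clip-level scores; a minimal sketch using scikit-learn with hypothetical labels and detector scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical clip-level labels (1 = fake) and detector scores
y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.12, 0.40, 0.88, 0.35, 0.97, 0.30])

print(f"Task 1 AUC: {roc_auc_score(y_true, y_score):.4f}")  # -> 0.8889
```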

Task 2 – Temporal Localization

Temporal localization requires precise detection of manipulated time intervals. Key metrics include:

  • Intersection-over-Union (IoU):

$$\mathrm{IoU} = \frac{|\hat{T} \cap T|}{|\hat{T} \cup T|}$$

where $\hat{T}$ is the predicted manipulated interval and $T$ the ground-truth interval.

  • Average Precision (AP@IoU): Evaluated at thresholds {0.5, 0.75, 0.9, 0.95}.
  • Average Recall (AR@N): Assessed for maximum N predicted segments {50, 30, 20, 10, 5}.
  • Combined Score:

$$\mathrm{Score} = \frac{1}{2}\,\mathrm{mean\,AP} + \frac{1}{2}\,\mathrm{mean\,AR}$$

with the means taken over the specified IoU thresholds and segment-count limits, respectively.
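
A short sketch of temporal IoU and the combined score as defined above; the AP/AR values plugged in are made up purely for illustration:

```python
def segment_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def combined_score(ap_at_iou, ar_at_n):
    """0.5 * mean AP (over IoU thresholds) + 0.5 * mean AR (over segment limits)."""
    mean_ap = sum(ap_at_iou.values()) / len(ap_at_iou)
    mean_ar = sum(ar_at_n.values()) / len(ar_at_n)
    return 0.5 * mean_ap + 0.5 * mean_ar

print(segment_iou((1.0, 1.4), (1.1, 1.5)))          # -> approximately 0.6

# Hypothetical AP/AR values at the thresholds and limits listed above
ap = {0.5: 0.62, 0.75: 0.48, 0.9: 0.31, 0.95: 0.20}
ar = {50: 0.70, 30: 0.66, 20: 0.61, 10: 0.55, 5: 0.47}
print(combined_score(ap, ar))                        # -> ~0.5 for these made-up values
```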

Baseline Results on TestA

| Method                     | Val AUC (%) | TestA AUC (%) | TestA IoU |
|----------------------------|-------------|---------------|-----------|
| Video TCN (visual)         | 88.41       | 73.11         | 0.1139    |
| Audio Wav2Vec-AASIST       | 99.71       | 82.91         | –         |
| Multimodal avg-fusion      | 97.86       | 91.97         | –         |
| KLASSify-BAM (audio-only)  | –           | –             | 0.3536    |
| KLASSify (max-out fusion)  | 98.04       | 92.78         | –         |

KLASSify’s multimodal architecture, employing Platt-calibrated max-out fusion, achieves a TestA classification AUC of 92.78% and an audio-only IoU of 0.3536 (Kukanov et al., 10 Aug 2025).

4. Modalities, Feature Representations, and Modeling Approaches

AV-Deepfake1M++ mandates modeling both audio and visual streams:

  • Audio Modality:
    • Front-end: Wav2Vec 2.0 XLSR-53 self-supervised encoder (1024-D output).
    • Backbone: AASIST spectro-temporal Graph Attention Network for utterance- and frame-level detection.
    • Extensive data augmentation, including codec artifacts, RTP-generated RIRs, band-pass/stop filtering, pitch/time shifts, and additive noise.
    • Challenges include detection of extremely short fake segments (partial utterances) and domain mismatch due to varying recording conditions.
  • Visual Modality:
    • Per-frame handcrafted temporal features, extracted from the mouth ROI and background:
      1. Mouth blurriness (Laplacian variance)
      2. Non-mouth frame-to-frame mean squared error
      3. Mouth ROI color-shift in Lab space
      4. Landmark kinematics (aspect ratio, velocity, acceleration, jerk, jitter)
    • Features are processed with a lightweight 1D Temporal Convolutional Network (TCN, ≈124K parameters).
    • Key visual confounders include lighting changes, head pose, facial occlusions, and backgrounds that remain static under mouth-only manipulations.

This multimodal feature engineering is designed for interpretability, adaptability, and efficient boundary-aware inference.
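
On the audio side, frame-level SSL features of the sort described above can be extracted with HuggingFace Transformers; the sketch below assumes the public facebook/wav2vec2-large-xlsr-53 checkpoint and omits the AASIST back-end and the augmentation pipeline:

```python
import torch
from transformers import Wav2Vec2Model

# XLSR-53 self-supervised encoder; its hidden states are 1024-D per frame.
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53").eval()

@torch.no_grad()
def audio_frame_features(waveform_16k: torch.Tensor) -> torch.Tensor:
    """Map a mono 16 kHz waveform of shape (1, samples) to features of shape (frames, 1024)."""
    # Per-utterance normalisation, as wav2vec 2.0 large models typically expect.
    x = (waveform_16k - waveform_16k.mean()) / (waveform_16k.std() + 1e-7)
    return encoder(x).last_hidden_state.squeeze(0)
```

On the visual side, a minimal sketch of three of the handcrafted per-frame cues (blurriness, non-mouth MSE, Lab color shift), using OpenCV and assuming RGB frames and same-sized mouth crops; landmark kinematics are omitted, and the exact feature definitions in the paper may differ:

```python
import cv2
import numpy as np

def mouth_blurriness(mouth_rgb: np.ndarray) -> float:
    """Laplacian variance of the grayscale mouth crop (lower values = blurrier)."""
    gray = cv2.cvtColor(mouth_rgb, cv2.COLOR_RGB2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def non_mouth_mse(prev_frame: np.ndarray, frame: np.ndarray, mouth_box) -> float:
    """Frame-to-frame mean squared error outside the mouth box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = mouth_box
    mask = np.ones(frame.shape[:2], dtype=bool)
    mask[y0:y1, x0:x1] = False
    diff = frame.astype(np.float32) - prev_frame.astype(np.float32)
    return float((diff[mask] ** 2).mean())

def mouth_color_shift(prev_mouth_rgb: np.ndarray, mouth_rgb: np.ndarray) -> float:
    """Mean absolute per-pixel shift of the mouth crop in CIELAB space."""
    lab_prev = cv2.cvtColor(prev_mouth_rgb, cv2.COLOR_RGB2LAB).astype(np.float32)
    lab_cur = cv2.cvtColor(mouth_rgb, cv2.COLOR_RGB2LAB).astype(np.float32)
    return float(np.abs(lab_cur - lab_prev).mean())
```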

5. Comparison to AV-Deepfake1M and Other Benchmarks

AV-Deepfake1M++ extends the scope and rigor of earlier deepfake benchmarks:

  • Scale: Increases corpus size to over 2 million clips, doubling AV-Deepfake1M’s volume.
  • Manipulation Granularity: Supports partial, segment-level manipulations (mean duration 0.33 s), rather than solely full-clip fakes.
  • Diversity: Augments talking-head lip-sync attacks with advanced TTS (zero-shot, VITS), novel audio codecs, and environmental reverberation effects.
  • Labeling: Four-way class structure (real, audio_modified, visual_modified, both_modified) in contrast to binary labeling prevalent in most deepfake datasets.
  • Generalization Benchmarking: Held-out TestA/TestB splits include previously unseen attack types, explicitly designed for stress-testing detector robustness and generalization over novel spoofing strategies.

A plausible implication is that AV-Deepfake1M++ enables research into cross-dataset generalization (e.g., PartialSpoof → AV-Deepfake1M++) and fosters evaluation of models’ capacity under distributional shift.

6. Methodological Implications and Research Directions

The scale and structure of AV-Deepfake1M++ require methodologies that move beyond static, full-clip classifiers:

  • Temporal Modeling: Frame-level temporal models (such as TCNs and boundary-aware attention mechanisms) become essential for detecting short, localized manipulations.
  • Multimodal Architectures: The best reported performance combines self-supervised (SSL) audio backbones with graph attention networks and interpretable, handcrafted visual features, fused via score calibration and max-out rules.
  • Best Practices: Modality-specific Platt sigmoid calibration, sliding-window inference to capture brief segments, loss weighting for BILOU frame-tag heads, and aggressive data augmentation to improve robustness to unseen manipulations (a calibration-and-fusion sketch follows this list).
  • Open Problems: AV-Deepfake1M++ spotlights challenges such as generalization to out-of-domain attacks, developing lightweight boundary-aware detectors, and constructing interpretable feature representations for video-based lip-sync forgeries.
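
A minimal sketch of modality-specific Platt (sigmoid) calibration followed by max-out score fusion, assuming raw per-clip scores from separate audio and visual detectors and using scikit-learn's logistic regression as the calibrator; the numbers are hypothetical and the paper's exact procedure may differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt_calibrator(dev_scores, dev_labels):
    """Fit a per-modality Platt (sigmoid) calibrator on held-out development scores."""
    calibrator = LogisticRegression()
    calibrator.fit(np.asarray(dev_scores, dtype=float).reshape(-1, 1), dev_labels)
    return calibrator

def maxout_fusion(calibrators, clip_scores):
    """Calibrate each modality's raw score, then take the maximum fake probability."""
    probs = [
        cal.predict_proba(np.array([[score]], dtype=float))[0, 1]
        for cal, score in zip(calibrators, clip_scores)
    ]
    return max(probs)

# Hypothetical development scores and labels (1 = fake) for each branch
audio_cal = fit_platt_calibrator([0.1, 0.9, 0.4, 2.3], [0, 1, 0, 1])
video_cal = fit_platt_calibrator([-1.2, 0.8, 0.2, 1.5], [0, 1, 0, 1])

# Fused clip-level probability that the clip is fake
print(maxout_fusion([audio_cal, video_cal], clip_scores=[1.1, -0.3]))
```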

AV-Deepfake1M++ constitutes a comprehensive, fine-grained, and challenging audio–visual benchmark. By integrating diverse manipulation types, precise temporal annotation, and substantial real-world variability, the dataset provides a highly discriminative platform for advancing state-of-the-art deepfake detection and localization research (Kukanov et al., 10 Aug 2025).
