AV-Deepfake1M++ Dataset
- AV-Deepfake1M++ is a comprehensive, multi-type audio–visual dataset featuring over 2 million clips with precise, segment-level annotations.
- It challenges deepfake detection models by supporting both binary classification and fine-grained temporal localization tasks.
- Its multimodal design and diverse manipulation techniques offer a robust benchmark for evaluating detector performance against unseen spoofing attacks.
AV-Deepfake1M++ is a large-scale, multi-type, partially manipulated audio–visual video dataset developed to advance research in deepfake detection and temporal localization. Building on the original AV-Deepfake1M dataset, AV-Deepfake1M++ introduces a higher volume of clips, expanded manipulation diversity, and precise segment-level annotations. The benchmark is specifically constructed to challenge next-generation, multimodal deepfake detectors under real-world variability and unseen attack conditions (Kukanov et al., 10 Aug 2025).
1. Dataset Construction and Statistics
AV-Deepfake1M++ comprises over 2 million audio–visual video clips, systematically partitioned into training, validation, and held-out test splits:
| Statistic | Training Set | Validation Set |
|---|---|---|
| Total clips | 1,099,217 | 77,326 |
| Real | 297,389 (27.05%) | 20,220 (26.15%) |
| Audio modified | 266,116 (24.21%) | 18,938 (24.49%) |
| Visual modified | 269,901 (24.55%) | 19,099 (24.70%) |
| Both modified | 265,812 (24.18%) | 19,069 (24.66%) |
| Mean duration (s) | 9.60 | 9.56 |
Clip duration ranges from 1.32 to 668.48 seconds (min/max in train), with a mean of approximately 9.6 seconds. Each video is labeled as real, audio_modified, visual_modified, or both_modified, with labels balanced across splits. The dataset introduces fine-grained segment-level manipulations, with 0–5 fake segments per clip and a mean segment duration of 0.33 seconds, roughly the length of a single short word. This granularity supports research on detecting brief, context-sensitive tampering, since adversarially selected words are common manipulation targets.
Source data are obtained via large-scale web scraping of speaking-head videos, which are then subjected to partial deepfake attacks. Audio extraction uses torchvision’s read_video utility at 16 kHz sampling, while visual preprocessing employs MediaPipe FaceMesh for mouth region-of-interest (ROI) detection. For model development, clips are trimmed to ≤6 seconds for audio models (trained on 4 s context windows) and ≤10.24 seconds (≤256 frames) for video models.
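A minimal loading sketch consistent with this description is shown below; the helper name, the mono downmix, and the omission of the actual mouth-ROI cropping are illustrative assumptions rather than the dataset’s reference pipeline.

```python
import mediapipe as mp
from torchvision.io import read_video

def load_clip(path, max_audio_s=6.0, max_video_frames=256):
    """Illustrative clip loading: decode audio/video and locate facial landmarks."""
    video, audio, info = read_video(path, pts_unit="sec")   # video: (T, H, W, C) uint8
    # Audio branch: downmix to mono, trim to the audio-model context length.
    sr = int(info["audio_fps"])                              # 16 kHz in this dataset
    wav = audio.mean(dim=0)[: int(max_audio_s * sr)]
    # Visual branch: keep at most 256 frames; FaceMesh gives landmarks from which
    # the mouth ROI can be cropped (cropping itself omitted for brevity).
    frames = video[:max_video_frames]
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1) as fm:
        landmarks = [fm.process(f.numpy()).multi_face_landmarks for f in frames]
    return wav, frames, landmarks
```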
2. Labels, Annotation Scheme, and Ground Truth
Annotation follows a dual-level structure:
- Classification labels (Task 1): Each clip is mapped to a binary real/fake label—“fake” if any audio or visual modification is present. Only video-level labels are available for training.
- Temporal localization labels (Task 2): Frame-level ground truth adopts the BILOU scheme (Begin, Inside, Last, Outside, Unit-length), permitting derivation of exact start and end timestamps for each manipulated segment. Segment boundaries have frame-rate-level or 40 ms audio precision. No per-segment confidence scores are provided.
This annotation strategy ensures both coarse (clip-level) and fine-grained (frame-level) supervision, fundamental for tasks such as detection and temporal boundary estimation in partially deepfaked content.
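A minimal sketch of how segment annotations map onto per-frame BILOU tags is given below; the frame rate, tag letters, and helper name are illustrative assumptions.

```python
def segments_to_bilou(segments, num_frames, fps=25.0):
    """Convert manipulated segments [(start_s, end_s), ...] into per-frame BILOU tags."""
    tags = ["O"] * num_frames                    # Outside (unmanipulated) by default
    for start_s, end_s in segments:
        first = int(round(start_s * fps))
        last = min(int(round(end_s * fps)), num_frames - 1)
        if first > last:
            continue
        if first == last:
            tags[first] = "U"                    # Unit-length: segment within one frame
        else:
            tags[first] = "B"                    # Begin
            for i in range(first + 1, last):
                tags[i] = "I"                    # Inside
            tags[last] = "L"                     # Last
    return tags

# Example: a 0.33 s fake segment starting at 2.0 s in a 10 s clip at 25 fps
tags = segments_to_bilou([(2.0, 2.33)], num_frames=250)
print(tags[49:60])   # ['O', 'B', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'L', 'O']
```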
3. Supported Tasks and Benchmark Metrics
AV-Deepfake1M++ supports two principal evaluation tasks, structured to enable both classification and fine localization:
Task 1 – Deepfake Classification
Single-speaker video clips are classified as either real or fake, with performance measured by the Area Under the ROC Curve (AUC):

$$\mathrm{AUC} = \int_0^1 \mathrm{TPR}\big(\mathrm{FPR}^{-1}(t)\big)\,dt$$
Task 2 – Temporal Localization
Temporal localization requires precise detection of manipulated time intervals. Key metrics include:
- Intersection-over-Union (IoU): for a predicted segment $\hat{s}$ and ground-truth segment $s$, $\mathrm{IoU}(\hat{s}, s) = \frac{|\hat{s} \cap s|}{|\hat{s} \cup s|}$, i.e., the ratio of the temporal overlap to the combined extent of the two intervals (a worked example follows this list).
- Average Precision (AP@IoU): Evaluated at thresholds {0.5, 0.75, 0.9, 0.95}.
- Average Recall (AR@N): Assessed for maximum N predicted segments {50, 30, 20, 10, 5}.
- Combined Score: the AP@IoU values are averaged over the specified thresholds and the AR@N values over the specified segment limits, yielding a single localization score.
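A minimal sketch of the temporal IoU computation, using hypothetical interval values for illustration:

```python
def segment_iou(pred, gt):
    """Temporal IoU between two (start_s, end_s) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction overlapping a 0.33 s ground-truth segment:
print(segment_iou((1.00, 1.40), (1.10, 1.43)))   # 0.30 / 0.43 ≈ 0.70
```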
Baseline Results on TestA
| Method | Val AUC (%) | TestA AUC (%) | TestA IoU |
|---|---|---|---|
| Video TCN (visual) | 88.41 | 73.11 | 0.1139 |
| Audio Wav2Vec-AASIST | 99.71 | 82.91 | – |
| Multimodal avg-fusion | 97.86 | 91.97 | – |
| KLASSify-BAM (audio-only) | – | – | 0.3536 |
| KLASSify (max-out fusion) | 98.04 | 92.78 | – |
KLASSify’s multimodal architecture, employing Platt-calibrated max-out fusion, achieves a TestA classification AUC of 92.78% and an audio-only IoU of 0.3536 (Kukanov et al., 10 Aug 2025).
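The sketch below illustrates modality-wise Platt (sigmoid) calibration followed by max-out fusion, assuming scikit-learn for the sigmoid fit; it is an illustrative stand-in, not the authors’ implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(dev_scores, dev_labels):
    """Fit a Platt sigmoid on a modality's held-out scores; returns a calibrator."""
    lr = LogisticRegression().fit(np.asarray(dev_scores).reshape(-1, 1), dev_labels)
    return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]

def max_out_fusion(audio_probs, visual_probs):
    """Take the most confident calibrated modality probability per clip."""
    return np.maximum(audio_probs, visual_probs)
```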
4. Modalities, Feature Representations, and Modeling Approaches
AV-Deepfake1M++ mandates modeling both audio and visual streams:
- Audio Modality:
- Front-end: Wav2Vec 2.0 XLSR-53 self-supervised encoder (1024-D output).
- Backbone: AASIST spectro-temporal Graph Attention Network for utterance- and frame-level detection.
- Extensive data augmentations: including codec artifacts, RTP-generated room impulse responses (RIRs), band-pass/stop filtering, pitch/time shifts, and additive noise.
- Challenges include detection of extremely short fake segments (partial utterances) and domain mismatch due to varying recording conditions.
- Visual Modality:
- Per-frame handcrafted temporal features, extracted from mouth ROI and background:
- 1. Mouth blurriness (Laplacian variance)
- 2. Non-mouth frame-to-frame mean squared error
- 3. Mouth ROI color-shift in Lab space
- 4. Landmark kinematics (aspect ratio, velocity, acceleration, jerk, jitter)
- Features are processed with a lightweight 1D Temporal Convolutional Network (TCN, ≈124K parameters).
- Key visual confounders involve lighting, head pose, facial occlusions, and backgrounds remaining static under mouth-only manipulations.
This multimodal feature engineering is designed for interpretability, adaptability, and efficient boundary-aware inference; minimal sketches of both pipelines follow.
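First, a sketch of the audio branch assuming the Hugging Face transformers implementation of the XLSR-53 encoder; the AASIST graph-attention back-end is replaced by simple linear heads for brevity, so this is an illustrative stand-in rather than the reported model.

```python
import torch.nn as nn
from transformers import Wav2Vec2Model

class AudioSpoofDetector(nn.Module):
    """Wav2Vec 2.0 XLSR-53 front-end with light utterance- and frame-level heads."""
    def __init__(self, ssl_name="facebook/wav2vec2-large-xlsr-53"):
        super().__init__()
        self.ssl = Wav2Vec2Model.from_pretrained(ssl_name)   # 1024-D frame features
        self.frame_head = nn.Linear(1024, 2)                 # frame-level real/fake logits
        self.utt_head = nn.Linear(1024, 2)                   # utterance-level logits

    def forward(self, wav):                                  # wav: (batch, samples) @ 16 kHz
        feats = self.ssl(wav).last_hidden_state              # (batch, frames, 1024)
        return self.utt_head(feats.mean(dim=1)), self.frame_head(feats)
```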
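Second, a sketch of a lightweight 1-D TCN over the handcrafted per-frame features; the feature dimension, channel width, depth, and BILOU output head are assumptions chosen to stay in the same small-parameter regime as the model described above.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Dilated 1-D convolution with a residual connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x):
        return x + torch.relu(self.norm(self.conv(x)))

class VisualTCN(nn.Module):
    """TCN over per-frame handcrafted features (blurriness, background MSE,
    Lab color shift, landmark kinematics), predicting one logit per BILOU tag."""
    def __init__(self, in_dim=16, channels=64, num_tags=5):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, channels, kernel_size=1)
        self.blocks = nn.Sequential(*[TemporalBlock(channels, 2 ** i) for i in range(4)])
        self.head = nn.Conv1d(channels, num_tags, kernel_size=1)

    def forward(self, x):                         # x: (batch, frames, in_dim)
        h = self.proj(x.transpose(1, 2))          # (batch, channels, frames)
        return self.head(self.blocks(h)).transpose(1, 2)   # (batch, frames, num_tags)
```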
5. Comparison to AV-Deepfake1M and Other Benchmarks
AV-Deepfake1M++ extends the scope and rigor of earlier deepfake benchmarks:
- Scale: Increases corpus size to over 2 million clips, doubling AV-Deepfake1M’s volume.
- Manipulation Granularity: Supports partial, segment-level manipulations (mean duration 0.33 s), rather than solely full-clip fakes.
- Diversity: Augments talking-head lip-sync attacks with advanced TTS (zero-shot, VITS), novel audio codecs, and environmental reverberation effects.
- Labeling: Four-way class structure (real, audio_modified, visual_modified, both_modified) in contrast to binary labeling prevalent in most deepfake datasets.
- Generalization Benchmarking: Held-out TestA/TestB splits include previously unseen attack types, explicitly designed for stress-testing detector robustness and generalization over novel spoofing strategies.
A plausible implication is that AV-Deepfake1M++ enables research into cross-dataset generalization (e.g., PartialSpoof → AV-Deepfake1M++) and fosters evaluation of models’ capacity under distributional shift.
6. Methodological Implications and Research Directions
The scale and structure of AV-Deepfake1M++ require methodologies that move beyond static, full-clip classifiers:
- Temporal Modeling: Frame-level temporal models (such as TCNs and boundary-aware attention mechanisms) become essential for detecting short, localized manipulations.
- Multimodal Architectures: Optimal performance is achieved by combining self-supervised audio backbones (SSL) with graph attention networks and interpretable, handcrafted visual features, fused via score calibration and max-out rules.
- Best Practices: Include modality-specific Platt sigmoid calibration, sliding-window inference to capture brief segments (a minimal sketch follows this list), loss weighting for BILOU frame-tag heads, and aggressive data augmentation to improve robustness on unseen manipulations.
- Open Problems: AV-Deepfake1M++ spotlights challenges such as generalization to out-of-domain attacks, developing lightweight boundary-aware detectors, and constructing interpretable feature representations for video-based lip-sync forgeries.
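As a concrete illustration of the sliding-window inference mentioned above, the sketch below averages per-frame scores over overlapping windows; the window and hop sizes and the `score_fn` interface are hypothetical.

```python
import numpy as np

def sliding_window_scores(score_fn, num_frames, win=100, hop=25):
    """Average per-frame fake probabilities over overlapping windows so that
    very short manipulated segments are not diluted by whole-clip pooling.
    `score_fn(start, end) -> np.ndarray` of per-frame probabilities is assumed."""
    starts = list(range(0, max(num_frames - win, 0) + 1, hop))
    if starts[-1] + win < num_frames:              # make sure the clip tail is covered
        starts.append(max(num_frames - win, 0))
    acc, counts = np.zeros(num_frames), np.zeros(num_frames)
    for s in starts:
        e = min(s + win, num_frames)
        acc[s:e] += score_fn(s, e)
        counts[s:e] += 1
    return acc / counts
```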
AV-Deepfake1M++ constitutes a comprehensive, fine-grained, and challenging audio–visual benchmark. By integrating diverse manipulation types, precise temporal annotation, and substantial real-world variability, the dataset provides a highly discriminative platform for advancing state-of-the-art deepfake detection and localization research (Kukanov et al., 10 Aug 2025).