ER-FF++set: Explainable Deepfake Benchmark
- ER-FF++set is a multimodal benchmark that combines binary deepfake detection with explicit, structured reasoning labels.
- It integrates fine-grained spatio-temporal analysis and facial metrics using dual-task supervision for robust training and evaluation.
- The dataset enables traceable, human-verifiable model predictions by linking pixel-level artifacts with semantic cues.
The Explainable Reasoning FF++ Benchmark Dataset (ER-FF++set) is a multimodal dataset designed to advance the state of deepfake video detection by integrating explicit, human-verifiable reasoning into both model predictions and evaluation. ER-FF++set provides dual-task supervision—with labels for both detection and the reasoning process—enabling the development and rigorous assessment of systems that deliver not only accurate deepfake classification, but also interpretable, traceable justifications anchored in fine-grained spatio-temporal analysis and facial integrity metrics. The benchmark supports robust experiments on detection performance, explainability, and generalization across forgery methods and datasets (Sun et al., 18 Oct 2025).
1. Dataset Structure and Annotation Paradigm
ER-FF++set is constructed from real and forged videos sourced from FaceForensics++, covering five primary manipulation techniques: DeepFake, Face2Face, FaceSwap, FaceShifter, and NeuralTexture. The dataset includes traditional binary labels (real/fake) and structured, multi-level annotations for every video.
Key annotation components:
- Manipulation-type labels: Each video is assigned a target class specifying the forgery technique.
- Structured rationales: Logical, model-generated explanations capturing both pixel-level artifacts and higher-order semantic inconsistencies. These rationales are produced by a dedicated multimodal large language model (MLLM) assistant, guided by prompt templates customized for each manipulation type.
- Localization masks: Region-level annotations mark the suspected forged areas in each frame, providing supervisory signals for spatio-temporal localization.
- Facial feature metrics: For every frame, facial landmark coordinates and derived dynamics are extracted, including blur, color distribution, texture continuity, and boundary artifact amplitude. These are serialized in structured formats such as JSON for direct input to the model's reasoning component.
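The exact annotation schema is not reproduced in this summary; as a rough, hypothetical illustration of how such a record could be organized (all field names and values below are assumptions, not the dataset's actual format), a single per-video entry serialized to JSON might look like:

```python
import json

# Hypothetical ER-FF++set annotation record; field names are illustrative only.
example_record = {
    "video_id": "000_003",
    "label": "fake",                          # binary detection target
    "manipulation_type": "Face2Face",         # one of the five FF++ techniques
    "rationale": "Boundary blur around the jawline increases across frames ...",
    "localization_masks": "masks/000_003/",   # per-frame forged-region annotations
    "facial_metrics": [                       # per-frame serialized metrics
        {
            "frame": 0,
            "landmarks": [[312.4, 198.7], [330.1, 201.3]],  # (x, y) coordinates, truncated
            "blur_variance": 145.2,
            "color_histogram_shift": 0.031,
            "texture_continuity": 0.87,
            "boundary_artifact_amplitude": 0.12,
        },
    ],
}

print(json.dumps(example_record, indent=2))
```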
The annotations support "dual supervision," simultaneously training models to perform video-level detection and to articulate fine-grained chains of reasoning for their decision process.
2. Dual Supervision for Detection and Reasoning
The design philosophy of ER-FF++set is to couple standard detection labels with explicit, rich reasoning targets. Each video thus supports:
- Detection supervision: Binary target (real/deepfake), manipulation type classification, and localization mask matching.
- Reasoning supervision: Model-generated, structured rationales that detail, with supporting metrics and region pointers, why the model reaches its decision. These rationales are required to map visual cues (such as local or global anomalies and facial dynamics) to the underlying manipulation class. Annotation quality control combines prompt engineering (for the MLLM) and post-hoc filtering.
This dual supervision framework enables end-to-end training and evaluation regimes that hold models accountable not only for prediction accuracy, but also for the logical veracity and clarity of their explanatory process.
3. Spatio-Temporal Feature Extraction and Reasoning Pipeline
The EDVD-LLaMA framework, proposed for use with ER-FF++set, illustrates the model pipeline for both feature extraction and explainable reasoning.
Spatio-Temporal Subtle Information Tokenization (ST-SIT):
- Local branch: Utilizes the DSEncoder (Swin Transformer backbone) to encode 3×3 spatial grids of consecutive frames, yielding robust local, deepfake-sensitive features that capture cross-frame subtleties.
- Global branch: Processes global semantic features with SigLiP followed by a Compact Visual Connector, integrating scene-level cues.
- Fusion and Projection: Features from both branches are fused via cross-attention (Algorithm 1), normalized, and projected into compact visual token embeddings.
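Algorithm 1 itself is not reproduced here; the following is a minimal PyTorch sketch of the general pattern the ST-SIT description implies, with local-branch tokens attending to global-branch tokens via cross-attention before normalization and projection. Module names, dimensions, and the residual connection are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class STSITFusionSketch(nn.Module):
    """Illustrative cross-attention fusion of local and global branch features."""

    def __init__(self, dim: int = 768, llm_dim: int = 4096, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, llm_dim)  # project to visual token embeddings

    def forward(self, local_tokens: torch.Tensor, global_tokens: torch.Tensor) -> torch.Tensor:
        # Local (DSEncoder-style) tokens query global (SigLiP-style) tokens.
        fused, _ = self.cross_attn(query=local_tokens, key=global_tokens, value=global_tokens)
        fused = self.norm(fused + local_tokens)  # residual connection + normalization
        return self.proj(fused)                  # compact visual token embeddings

# Toy usage with random features: 49 local and 196 global tokens for one clip.
local_tokens = torch.randn(1, 49, 768)
global_tokens = torch.randn(1, 196, 768)
print(STSITFusionSketch()(local_tokens, global_tokens).shape)  # torch.Size([1, 49, 4096])
```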
Facial Landmark and Metric Extraction:
- Step 1: A facial landmark network processes each frame $I_t$, yielding landmark coordinates $L_t$.
- Step 2: Dynamic integrity metrics are computed, e.g.:
  - Blur variance: $B_t = \mathrm{Var}\!\left(\nabla^2 I_t\right)$, the variance of the Laplacian response of frame $I_t$.
  - Framewise blur difference: $\Delta B_t = \lvert B_t - B_{t-1} \rvert$.
- Analogous procedures for other facial properties (color, texture).
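A minimal sketch of how the per-frame blur metrics above could be computed, assuming the standard variance-of-Laplacian convention (OpenCV-based; this is an illustration, not the authors' released extraction code):

```python
import cv2
import numpy as np

def blur_variance(frame_bgr: np.ndarray) -> float:
    """Variance of the Laplacian response; lower values indicate stronger blur."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def framewise_blur_differences(frames: list) -> list:
    """Absolute change in blur variance between consecutive frames."""
    b = [blur_variance(f) for f in frames]
    return [abs(b[t] - b[t - 1]) for t in range(1, len(b))]

# Toy usage with synthetic frames (real use would decode video frames).
frames = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8) for _ in range(4)]
print(framewise_blur_differences(frames))
```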
Reasoning Synthesis (Fg-MCoT):
- Stage 3: A first LLM instance receives the video tokens, facial metrics, and a thought prompt to produce a metric-grounded rationale $R$.
- Stage 4: A second LLM takes $R$, the fused visual tokens, and a query prompt, returning a final explainable decision in the format

```
<think> ...reasoning trace... </think>
<answer> real/fake </answer>
```
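A small helper for consuming this output format (the tag names follow the template shown above; the parsing code itself is an assumption, not part of the paper):

```python
import re

def parse_decision(output: str) -> tuple:
    """Split an EDVD-LLaMA-style response into (reasoning trace, verdict)."""
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    verdict = answer.group(1).strip().lower() if answer else "unknown"
    return reasoning, verdict

reasoning, verdict = parse_decision(
    "<think> Jawline boundary shows texture discontinuity ... </think> <answer> fake </answer>"
)
print(verdict)  # fake
```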
This architecture ensures that pixel-level, semantic, and facial cues are systematically integrated into the reasoning process, achieving both accuracy and interpretability.
4. Model Training Objectives and Mathematical Formalism
ER-FF++set is used to supervise both detection and reasoning outputs. The loss formulation combines cross-entropy for detection with a likelihood loss for rationale generation. For rationale sequence generation, the objective is

$$\mathcal{L}_{\text{reason}} = -\sum_{t=1}^{T} \log P\!\left(y_t \mid y_{<t}, v, f, p\right),$$

where $y_t$ is the rationale sequence token at time $t$, $v$ is the fused video token, $f$ the vectorized facial information, and $p$ the prompt.
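A minimal PyTorch rendering of this combined objective (detection cross-entropy plus rationale token negative log-likelihood); the weighting term, tensor shapes, and function name are assumptions for illustration only:

```python
import torch
import torch.nn.functional as F

def dual_supervision_loss(det_logits, det_labels, rationale_logits, rationale_tokens, lam=1.0):
    """Detection cross-entropy plus autoregressive rationale NLL (illustrative).

    det_logits:       (batch, 2) real/fake scores
    det_labels:       (batch,) 0 = real, 1 = fake
    rationale_logits: (batch, seq_len, vocab) next-token predictions
    rationale_tokens: (batch, seq_len) ground-truth rationale token ids
    """
    det_loss = F.cross_entropy(det_logits, det_labels)
    reason_loss = F.cross_entropy(
        rationale_logits.reshape(-1, rationale_logits.size(-1)),
        rationale_tokens.reshape(-1),
    )
    return det_loss + lam * reason_loss

# Toy usage with random tensors.
loss = dual_supervision_loss(
    torch.randn(4, 2), torch.randint(0, 2, (4,)),
    torch.randn(4, 16, 32000), torch.randint(0, 32000, (4, 16)),
)
print(loss.item())
```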
Facial quality metrics are strictly defined using Laplacian-based blur detection and other standard operations; all formulas and metric definitions follow conventions in facial forensics and are directly implemented in the dataset construction and training code.
5. Evaluation Protocol and Use Cases
ER-FF++set facilitates evaluation along both standard and novel axes:
- Detection accuracy: Standard binary classification metrics and per-class manipulation accuracy.
- Localization: Overlap between predicted and ground-truth forgery regions (see the IoU sketch after this list).
- Explainability: Rationale quality, measured by the logical and evidentiary linkage between provided structural cues and the system’s classification; typically implemented via template-matching, automatic scoring, or human judgment.
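For the localization axis, overlap between predicted and ground-truth forgery regions is commonly scored with intersection-over-union between binary masks; a minimal sketch (the exact protocol is not specified here, so treat this as illustrative):

```python
import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between binary forgery-region masks of identical shape."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)

# Toy usage with random masks.
pred = np.random.rand(224, 224) > 0.5
gt = np.random.rand(224, 224) > 0.5
print(mask_iou(pred, gt))
```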
The dataset is explicitly designed for:
- Training and benchmarking deepfake detectors with high explainability,
- Comparing alternative reasoning modules and MLLMs for traceable predictions,
- Promoting generalization to unseen forgery types and cross-dataset scenarios.
6. Implications and Future Research Directions
The dual-supervision, structured-annotation design of ER-FF++set directly addresses the lack of transparency and poor generalization observed in prior "black-box" deepfake detectors (Sun et al., 18 Oct 2025). Its explicit articulation of pixel- and attribute-level anomalies in structured rationales enables models to deliver results suitable for high-stakes domains where actionable, verifiable explanations are critical.
A plausible implication is that future research will leverage ER-FF++set to study the alignment of visual-textual evidence in explainable AI, explore new loss functions promoting evidence consistency, and develop cross-modal models robust to evolving manipulation techniques. The dataset serves as a foundation for further work on resilience, interpretability, and forensic trustworthiness in multimedia security.
7. Comparison with Related Benchmarks
While other datasets such as IVY-FAKE (Zhang et al., 1 Jun 2025) and METER (Yang et al., 22 Jul 2025) offer explainable detection for images, videos, audio, or multi-modal forgery settings, ER-FF++set is distinguished by its focus on video deepfakes, formal dual supervision, metric-rich facial annotations, and rigorous prompt-based rationale collection for reasoning processes. It thus occupies a unique role in the landscape of explainable reasoning benchmarks for deepfake detection, particularly in supporting models where traceable reasoning and detection performance are evaluated in concert.