ForensicFlow: Tri-modal Deepfake Detection
- ForensicFlow is a tri-modal deep neural architecture designed for robust video Deepfake detection by integrating spatial, texture, and frequency cues.
- It employs specialized branches—ConvNeXt-tiny for spatial analysis, Swin Transformer-tiny for texture details, and CNN+SE for frequency artifacts—to capture diverse manipulation evidence.
- Attention-based temporal pooling and adaptive fusion yield state-of-the-art metrics, achieving an AUC of 0.9752 and an F1-score of 0.9408 on the Celeb-DF (v2) dataset.
ForensicFlow is a tri-modal deep neural architecture designed for robust video Deepfake detection by explicitly integrating information from spatial, texture, and frequency domains. It addresses the fundamental limitation in prior single-stream convolutional architectures, which cannot capture the full spectrum of multi-scale video forgeries. ForensicFlow fuses global visual, fine-grained, and spectral cues through attention mechanisms to achieve high accuracy and resilience against subtle manipulations produced by advanced generative models (Romani, 18 Nov 2025).
1. Architectural Design and Rationale
ForensicFlow processes each video segment as face-aligned frames, which are simultaneously fed into three modality-specialized branches:
- RGB branch (ConvNeXt-tiny): Extracts global spatial inconsistencies such as shading, geometry, and color distribution.
- Texture branch (Swin Transformer-tiny): Detects fine-grained, sub-pixel manipulation artifacts typical of Deepfake blending boundaries.
- Frequency branch (CNN + Squeeze-and-Excitation): Identifies periodic spectral noise indicative of generative modeling artifacts.
Each pipeline provides a complementary perspective: spatial (RGB), microstructure (texture), and frequency domains are all necessary because Deepfake traces manifest at disparate scales. No single domain is sufficient for robust generalization across the range of synthetic manipulations.
After per-frame feature extraction in each branch, an attention-based temporal pooling mechanism is employed to dynamically weight frames exhibiting the strongest forensic evidence. These aggregated video-level branch embeddings are then fused adaptively via attention, producing a unified representation for subsequent classification.
2. Specialized Feature Extraction Branches
2.1 RGB Branch: ConvNeXt-tiny
The RGB stream accepts a face-aligned crop. The ConvNeXt-tiny backbone comprises four stages, each halving spatial dimensions while doubling channel depth, ultimately yielding a 768-channel feature map. Each block applies a depthwise convolution, layer normalization, GELU activation, and pointwise convolutions in sequence. Global average pooling followed by a linear projection outputs a 512-dimensional embedding per frame:

$$f^{\text{rgb}}_t = W_{\text{rgb}}\,\operatorname{GAP}\!\big(\operatorname{ConvNeXt}(x_t)\big) \in \mathbb{R}^{512}.$$
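A minimal PyTorch sketch of this branch is given below, using the torchvision ConvNeXt-tiny backbone; the input resolution, layer names, and projection head are illustrative assumptions rather than the released implementation.

```python
# Hypothetical sketch of the RGB branch: ConvNeXt-tiny features followed by
# global average pooling and a 768 -> 512 linear projection.
import torch
import torch.nn as nn
from torchvision.models import convnext_tiny, ConvNeXt_Tiny_Weights

class RGBBranch(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        backbone = convnext_tiny(weights=ConvNeXt_Tiny_Weights.DEFAULT)
        self.features = backbone.features        # four ConvNeXt stages
        self.pool = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.proj = nn.Linear(768, embed_dim)    # per-frame 512-d embedding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, 224, 224) face crops (resolution is an assumption)
        h = self.features(x)                     # (B, 768, 7, 7)
        h = self.pool(h).flatten(1)              # (B, 768)
        return self.proj(h)                      # (B, 512)
```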
2.2 Texture Branch: Swin Transformer-tiny
This branch divides the frame into non-overlapping 4×4 patches, each treated as a token and linearly projected to an embedding dimension of C = 96 (the Swin-tiny defaults). Through hierarchical Swin stages utilizing windowed and shifted-window multi-head self-attention, local and global contextual relationships are captured. The attention within each M×M local window is computed as:

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B\right)V,$$

where $d$ is the per-head dimension and $B$ is the learned relative position bias.
After four stages, the final 768-channel feature map is average-pooled and projected into a 512-dimensional texture embedding, analogous to the RGB branch.
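A corresponding sketch of the texture branch, built on the torchvision Swin-tiny backbone, is shown below; the pooling and projection details are assumptions consistent with the description above.

```python
# Hypothetical sketch of the texture branch: Swin-tiny features, layer norm,
# spatial average pooling, and a 768 -> 512 projection.
import torch
import torch.nn as nn
from torchvision.models import swin_t, Swin_T_Weights

class TextureBranch(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        backbone = swin_t(weights=Swin_T_Weights.DEFAULT)
        self.features = backbone.features        # patch embedding + four Swin stages
        self.norm = backbone.norm                # final layer norm over channels
        self.proj = nn.Linear(768, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, 224, 224)
        h = self.norm(self.features(x))          # (B, 7, 7, 768), channels-last
        h = h.mean(dim=(1, 2))                   # average-pool the 7x7 token grid
        return self.proj(h)                      # (B, 512) texture embedding
```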
2.3 Frequency Branch: CNN + Squeeze-and-Excitation (SE)
The frequency branch applies three stacked convolution–ReLU–pooling blocks to highlight high-frequency residuals, allowing spectral artifacts to be discovered without an explicit FFT. After the convolutional stack, the SE block recalibrates channel attention:
- Squeeze: $z_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i, j)$
- Excitation: $s = \sigma\!\big(W_2\,\delta(W_1 z)\big)$, with ReLU $\delta$ and sigmoid $\sigma$
- Reweight: $\tilde{u}_c = s_c \cdot u_c$
The pooled and projected output is a 512-dimensional frequency embedding.
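A small sketch of the frequency branch follows; the channel widths and kernel sizes are illustrative assumptions, since the description above only specifies three convolution–ReLU–pooling blocks followed by an SE block.

```python
# Hypothetical sketch of the frequency branch: a shallow CNN with a
# squeeze-and-excitation block and a 512-d projection.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        z = u.mean(dim=(2, 3))                                 # squeeze: (B, C)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # excitation: (B, C)
        return u * s.unsqueeze(-1).unsqueeze(-1)                # reweight channels

class FrequencyBranch(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.MaxPool2d(2))
        self.cnn = nn.Sequential(block(3, 32), block(32, 64), block(64, 128))
        self.se = SEBlock(128)
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.se(self.cnn(x))                 # (B, 128, H/8, W/8)
        h = h.mean(dim=(2, 3))                   # global average pooling
        return self.proj(h)                      # (B, 512) frequency embedding
```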
3. Temporal Pooling and Attention-based Fusion
Each branch, given $T$ sequential frames, generates per-frame embeddings $f_1, \dots, f_T \in \mathbb{R}^{512}$. A scoring network computes per-frame logits $e_t$, which are normalized into attention scores:

$$\alpha_t = \frac{\exp(e_t)}{\sum_{t'=1}^{T} \exp(e_{t'})}.$$

The branch-level video embedding is the attention-weighted sum:

$$v = \sum_{t=1}^{T} \alpha_t f_t.$$
This temporal pooling enables the pipeline to focus on frames containing the strongest forensic evidence.
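A minimal sketch of this pooling step is shown below; a single linear layer stands in for the scoring network, whose exact form is an assumption.

```python
# Minimal sketch of attention-based temporal pooling over per-frame embeddings.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, 1)           # frame-level scoring network (assumed form)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, T, dim) per-frame embeddings from one branch
        e = self.score(f).squeeze(-1)            # (B, T) logits e_t
        alpha = torch.softmax(e, dim=1)          # attention weights alpha_t
        return (alpha.unsqueeze(-1) * f).sum(1)  # (B, dim) video-level embedding v
```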
Final tri-modal fusion is realized via a lightweight attention mechanism over the three aggregated branch embeddings $v^{\text{rgb}}, v^{\text{tex}}, v^{\text{freq}}$. Per-branch fusion logits $u_i$ are softmax-normalized:

$$\beta_i = \frac{\exp(u_i)}{\sum_{j \in \{\text{rgb},\,\text{tex},\,\text{freq}\}} \exp(u_j)}.$$

The fused representation is:

$$z = \sum_{i \in \{\text{rgb},\,\text{tex},\,\text{freq}\}} \beta_i\, v^{i}.$$
This mechanism permits adaptive re-weighting according to the diagnostic value of each stream for the current video.
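The fusion step can be sketched as follows, assuming a shared linear head that produces one logit per branch embedding; this is an illustrative reconstruction rather than the released code.

```python
# Minimal sketch of adaptive tri-modal fusion: one logit per branch,
# softmax-normalized, then a weighted sum of the branch embeddings.
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, 1)                  # fusion-logit head (assumed shared)

    def forward(self, v_rgb, v_tex, v_freq):
        v = torch.stack([v_rgb, v_tex, v_freq], dim=1)  # (B, 3, dim)
        u = self.score(v).squeeze(-1)                   # (B, 3) fusion logits u_i
        beta = torch.softmax(u, dim=1)                  # branch weights beta_i
        return (beta.unsqueeze(-1) * v).sum(dim=1)      # (B, dim) fused representation z
```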
4. Training Regime and Protocol
ForensicFlow employs the Celeb-DF (v2) dataset with strict subject partitioning: 5,299 Deepfakes and 712 reals for training, 340 Deepfakes and 178 reals for validation. To address class imbalance and emphasize subtle forgeries, Focal Loss is used:

$$\mathcal{L}_{\text{FL}} = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t),$$

where $p_t$ is the predicted probability of the true class, $\alpha_t$ is the class-balancing weight, and $\gamma$ down-weights easy examples.
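A binary focal-loss sketch matching the formula above is given here; the alpha and gamma values are placeholders, not the paper's hyperparameters.

```python
# Minimal sketch of binary Focal Loss (alpha and gamma are assumed values).
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """logits, targets: shape (B,), targets in {0, 1}."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)          # probability of the true class
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    ce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()  # down-weights easy examples
```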
Optimization is performed with AdamW, using weight decay and an initial learning rate that is decayed upon each unfreeze stage. Training proceeds via progressive unfreezing, as sketched in code after the list below:
- Epochs 1–3: only classifier and projection heads trained (backbones frozen)
- Epochs 4–6: unfreeze final blocks (learning rate halved)
- Epochs 7–8: unfreeze mid blocks
- Epochs 9–15: full fine-tuning
A single NVIDIA P100 GPU suffices for the full 15-epoch training regime.
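A schematic of the progressive-unfreezing schedule is sketched below; the module names (heads, final_blocks, mid_blocks) are hypothetical groupings standing in for the actual parameter partitions.

```python
# Hypothetical sketch of the progressive-unfreezing schedule described above.
def set_trainable(model, epoch: int) -> None:
    """Freeze/unfreeze parameter groups according to the current epoch."""
    # Epochs 1-3: only classifier and projection heads are trainable.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.heads.parameters():           # hypothetical head module group
        p.requires_grad = True
    if epoch >= 4:                                # epochs 4-6: final backbone blocks
        for p in model.final_blocks.parameters():
            p.requires_grad = True
    if epoch >= 7:                                # epochs 7-8: mid backbone blocks
        for p in model.mid_blocks.parameters():
            p.requires_grad = True
    if epoch >= 9:                                # epochs 9-15: full fine-tuning
        for p in model.parameters():
            p.requires_grad = True
```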
5. Empirical Results and Ablation Studies
5.1 Quantitative Performance
On the Celeb-DF (v2) validation set, ForensicFlow achieves:
| Metric | Value | 95% CI |
|---|---|---|
| AUC | 0.9752 | [0.9636, 0.9848] |
| F1-score | 0.9408 | [0.9230, 0.9564] |
| Accuracy | 0.9208 | — |
Single-stream and dual-stream baselines at epoch 5 perform markedly worse:
- RGB-only: AUC 0.8261, F1 0.8271
- RGB+Texture: AUC 0.8309, F1 0.8285
- RGB+Freq: AUC 0.8343, F1 0.8223
The tri-modal system thus demonstrates a 15–17% AUC improvement over any simpler alternative, illustrating the necessity of multi-domain evidence integration.
5.2 Ablation Impact
- Removing the frequency branch reduces AUC from 0.9752 (full model) to 0.8309 (RGB+Texture at epoch 5).
- Removing the texture branch reduces AUC similarly, to 0.8343 (RGB+Frequency at epoch 5).
- No two-branch variant exceeds 0.83 F1 after five epochs, whereas the full model surpasses 0.94 F1.
- This substantiates that spatial, textural, and frequency streams supply non-redundant forensic information: RGB captures global inconsistencies, texture detects blending seams, and frequency discovers periodic noise.
A plausible implication is that omitting any modality exposes detection to failure against forgeries optimized in that specific domain.
6. Interpretability and Forensic Focus
Grad-CAM applied to the last depthwise convolution of the ConvNeXt-tiny backbone produces attention heatmaps. For genuine samples, this attention is diffuse; for Deepfake frames, the model fixates on artifact regions such as eyes, mouth corners, and facial boundaries, especially where manipulation traces cluster. This suggests the network is not reliant on trivial low-level statistics but attends to genuine manipulation artifacts, distinguishing it from simpler statistical detectors.
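A hook-based Grad-CAM sketch is included below for reference; the choice of target layer and the model's output convention (a single Deepfake logit) are assumptions.

```python
# Minimal Grad-CAM sketch over a chosen convolutional layer (hypothetical usage:
# the last depthwise convolution of the RGB backbone).
import torch

def grad_cam(model, x, target_layer):
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    score = model(x)                               # assumed (B, 1) Deepfake logit
    model.zero_grad()
    score.sum().backward()
    h1.remove(); h2.remove()
    w = grads["a"].mean(dim=(2, 3), keepdim=True)  # channel importance weights
    cam = torch.relu((w * feats["a"]).sum(dim=1))  # (B, H, W) raw heatmap
    return cam / cam.amax(dim=(1, 2), keepdim=True).clamp(min=1e-8)
```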
7. Significance and Applicability
ForensicFlow demonstrates that robust detection of modern Deepfakes necessitates a multi-modal, attention-driven approach that unifies spatial, textural, and spectral cues (Romani, 18 Nov 2025). Its systematic feature fusion, combined with data-efficient and class-imbalance-tolerant training with Focal Loss, sets a new empirical standard for resilient Deepfake forensics as evidenced by state-of-the-art metrics. Its architectural principles and empirical findings are directly applicable to the design of future multimodal, artifact-sensitive forensic models operating on synthetic media.