ForensicFlow: Tri-modal Deepfake Detection
- ForensicFlow is a tri-modal deep neural architecture designed for robust video Deepfake detection by integrating spatial, texture, and frequency cues.
- It employs specialized branches—ConvNeXt-tiny for spatial analysis, Swin Transformer-tiny for texture details, and CNN+SE for frequency artifacts—to capture diverse manipulation evidence.
- Attention-based temporal pooling and adaptive fusion yield state-of-the-art metrics, achieving an AUC of 0.9752 and an F1-score of 0.9408 on the Celeb-DF (v2) dataset.
ForensicFlow is a tri-modal deep neural architecture designed for robust video Deepfake detection by explicitly integrating information from spatial, texture, and frequency domains. It addresses the fundamental limitation in prior single-stream convolutional architectures, which cannot capture the full spectrum of multi-scale video forgeries. ForensicFlow fuses global visual, fine-grained, and spectral cues through attention mechanisms to achieve high accuracy and resilience against subtle manipulations produced by advanced generative models (Romani, 18 Nov 2025).
1. Architectural Design and Rationale
ForensicFlow processes each video segment as face-aligned frames, which are simultaneously fed into three modality-specialized branches:
- RGB branch (ConvNeXt-tiny): Extracts global spatial inconsistencies such as shading, geometry, and color distribution.
- Texture branch (Swin Transformer-tiny): Detects fine-grained, sub-pixel manipulation artifacts typical of Deepfake blending boundaries.
- Frequency branch (CNN + Squeeze-and-Excitation): Identifies periodic spectral noise indicative of generative modeling artifacts.
Each pipeline provides a complementary perspective: spatial (RGB), microstructure (texture), and frequency domains are all necessary because Deepfake traces manifest at disparate scales. No single domain is sufficient for robust generalization across the range of synthetic manipulations.
After per-frame feature extraction in each branch, an attention-based temporal pooling mechanism is employed to dynamically weight frames exhibiting the strongest forensic evidence. These aggregated video-level branch embeddings are then fused adaptively via attention, producing a unified representation for subsequent classification.
2. Specialized Feature Extraction Branches
2.1 RGB Branch: ConvNeXt-tiny
The RGB stream accepts a face-aligned crop. The ConvNeXt-tiny backbone comprises four stages, each halving spatial dimensions while doubling channel depth, ultimately yielding a 768-channel feature map. Each block applies a depthwise convolution, layer normalization, GELU activation, and pointwise convolutions in sequence. Global average pooling followed by a linear projection outputs a 512-dimensional embedding per frame:

$$f^{\text{rgb}}_t = W_{\text{rgb}}\,\operatorname{GAP}\!\big(\operatorname{ConvNeXt}(x_t)\big) \in \mathbb{R}^{512}.$$
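A minimal PyTorch sketch of this branch is given below, using the torchvision ConvNeXt-tiny backbone; the input resolution, layer names, and projection head are illustrative assumptions rather than the released implementation.

```python
# Hypothetical sketch of the RGB branch: ConvNeXt-tiny features followed by
# global average pooling and a 768 -> 512 linear projection.
import torch
import torch.nn as nn
from torchvision.models import convnext_tiny, ConvNeXt_Tiny_Weights

class RGBBranch(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        backbone = convnext_tiny(weights=ConvNeXt_Tiny_Weights.DEFAULT)
        self.features = backbone.features        # four ConvNeXt stages
        self.pool = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.proj = nn.Linear(768, embed_dim)    # per-frame 512-d embedding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, 224, 224) face crops (resolution is an assumption)
        h = self.features(x)                     # (B, 768, 7, 7)
        h = self.pool(h).flatten(1)              # (B, 768)
        return self.proj(h)                      # (B, 512)
```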
2.2 Texture Branch: Swin Transformer-tiny
This branch divides the frame into non-overlapping 4×4 patches, each treated as a token and linearly projected to an embedding dimension of C = 96 (the Swin-tiny defaults). Through hierarchical Swin stages utilizing windowed and shifted-window multi-head self-attention, local and global contextual relationships are captured. The attention within each M×M local window is computed as:

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B\right)V,$$

where $d$ is the per-head dimension and $B$ is the learned relative position bias.
After four stages, the final 768-channel feature map is average-pooled and projected into a 512-dimensional texture embedding, analogous to the RGB branch.
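A corresponding sketch of the texture branch, built on the torchvision Swin-tiny backbone, is shown below; the pooling and projection details are assumptions consistent with the description above.

```python
# Hypothetical sketch of the texture branch: Swin-tiny features, layer norm,
# spatial average pooling, and a 768 -> 512 projection.
import torch
import torch.nn as nn
from torchvision.models import swin_t, Swin_T_Weights

class TextureBranch(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        backbone = swin_t(weights=Swin_T_Weights.DEFAULT)
        self.features = backbone.features        # patch embedding + four Swin stages
        self.norm = backbone.norm                # final layer norm over channels
        self.proj = nn.Linear(768, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, 224, 224)
        h = self.norm(self.features(x))          # (B, 7, 7, 768), channels-last
        h = h.mean(dim=(1, 2))                   # average-pool the 7x7 token grid
        return self.proj(h)                      # (B, 512) texture embedding
```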
2.3 Frequency Branch: CNN + Squeeze-and-Excitation (SE)
The frequency branch applies three stacked convolution–ReLU–pooling blocks to highlight high-frequency residuals, allowing spectral artifacts to be discovered without an explicit FFT. After the convolutional stack, the SE block recalibrates channel attention:
- Squeeze: $z_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i, j)$
- Excitation: $s = \sigma\!\big(W_2\,\delta(W_1 z)\big)$, with ReLU $\delta$ and sigmoid $\sigma$
- Reweight: $\tilde{u}_c = s_c \cdot u_c$
The pooled and projected output is a 512-dimensional frequency embedding.
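A small sketch of the frequency branch follows; the channel widths and kernel sizes are illustrative assumptions, since the description above only specifies three convolution–ReLU–pooling blocks followed by an SE block.

```python
# Hypothetical sketch of the frequency branch: a shallow CNN with a
# squeeze-and-excitation block and a 512-d projection.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        z = u.mean(dim=(2, 3))                                 # squeeze: (B, C)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # excitation: (B, C)
        return u * s.unsqueeze(-1).unsqueeze(-1)                # reweight channels

class FrequencyBranch(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.MaxPool2d(2))
        self.cnn = nn.Sequential(block(3, 32), block(32, 64), block(64, 128))
        self.se = SEBlock(128)
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.se(self.cnn(x))                 # (B, 128, H/8, W/8)
        h = h.mean(dim=(2, 3))                   # global average pooling
        return self.proj(h)                      # (B, 512) frequency embedding
```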
3. Temporal Pooling and Attention-based Fusion
Each branch, given $T$ sequential frames, generates per-frame embeddings $f_1, \dots, f_T \in \mathbb{R}^{512}$. A scoring network computes per-frame logits $e_t$, which are normalized into attention scores:

$$\alpha_t = \frac{\exp(e_t)}{\sum_{t'=1}^{T} \exp(e_{t'})}.$$

The branch-level video embedding is the attention-weighted sum:

$$v = \sum_{t=1}^{T} \alpha_t f_t.$$
This temporal pooling enables the pipeline to focus on frames containing the strongest forensic evidence.
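A minimal sketch of this pooling step is shown below; a single linear layer stands in for the scoring network, whose exact form is an assumption.

```python
# Minimal sketch of attention-based temporal pooling over per-frame embeddings.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, 1)           # frame-level scoring network (assumed form)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, T, dim) per-frame embeddings from one branch
        e = self.score(f).squeeze(-1)            # (B, T) logits e_t
        alpha = torch.softmax(e, dim=1)          # attention weights alpha_t
        return (alpha.unsqueeze(-1) * f).sum(1)  # (B, dim) video-level embedding v
```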
Final tri-modal fusion is realized via a lightweight attention mechanism over the three aggregated branch embeddings $v^{\text{rgb}}, v^{\text{tex}}, v^{\text{freq}}$. Per-branch fusion logits $u_i$ are softmax-normalized:

$$\beta_i = \frac{\exp(u_i)}{\sum_{j \in \{\text{rgb},\,\text{tex},\,\text{freq}\}} \exp(u_j)}.$$

The fused representation is:

$$z = \sum_{i \in \{\text{rgb},\,\text{tex},\,\text{freq}\}} \beta_i\, v^{i}.$$
This mechanism permits adaptive re-weighting according to the diagnostic value of each stream for the current video.
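The fusion step can be sketched as follows, assuming a shared linear head that produces one logit per branch embedding; this is an illustrative reconstruction rather than the released code.

```python
# Minimal sketch of adaptive tri-modal fusion: one logit per branch,
# softmax-normalized, then a weighted sum of the branch embeddings.
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, 1)                  # fusion-logit head (assumed shared)

    def forward(self, v_rgb, v_tex, v_freq):
        v = torch.stack([v_rgb, v_tex, v_freq], dim=1)  # (B, 3, dim)
        u = self.score(v).squeeze(-1)                   # (B, 3) fusion logits u_i
        beta = torch.softmax(u, dim=1)                  # branch weights beta_i
        return (beta.unsqueeze(-1) * v).sum(dim=1)      # (B, dim) fused representation z
```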
4. Training Regime and Protocol
ForensicFlow employs the Celeb-DF (v2) dataset with strict subject partitioning: 5,299 Deepfakes and 712 reals for training, 340 Deepfakes and 178 reals for validation. To address class imbalance and emphasize subtle forgeries, Focal Loss is used:

$$\mathcal{L}_{\text{FL}} = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t),$$

where $p_t$ is the predicted probability of the true class, $\alpha_t$ is the class-balancing weight, and $\gamma$ down-weights easy examples.
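A binary focal-loss sketch matching the formula above is given here; the alpha and gamma values are placeholders, not the paper's hyperparameters.

```python
# Minimal sketch of binary Focal Loss (alpha and gamma are assumed values).
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """logits, targets: shape (B,), targets in {0, 1}."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)          # probability of the true class
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    ce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()  # down-weights easy examples
```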
Optimization is performed with AdamW, using weight decay and an initial learning rate that is decayed upon each unfreeze stage. Training proceeds via progressive unfreezing, as sketched in code after the list below:
- Epochs 1–3: only classifier and projection heads trained (backbones frozen)
- Epochs 4–6: unfreeze final blocks (learning rate halved)
- Epochs 7–8: unfreeze mid blocks
- Epochs 9–15: full fine-tuning
A single NVIDIA P100 GPU suffices for the full 15-epoch training regime.
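A schematic of the progressive-unfreezing schedule is sketched below; the module names (heads, final_blocks, mid_blocks) are hypothetical groupings standing in for the actual parameter partitions.

```python
# Hypothetical sketch of the progressive-unfreezing schedule described above.
def set_trainable(model, epoch: int) -> None:
    """Freeze/unfreeze parameter groups according to the current epoch."""
    # Epochs 1-3: only classifier and projection heads are trainable.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.heads.parameters():           # hypothetical head module group
        p.requires_grad = True
    if epoch >= 4:                                # epochs 4-6: final backbone blocks
        for p in model.final_blocks.parameters():
            p.requires_grad = True
    if epoch >= 7:                                # epochs 7-8: mid backbone blocks
        for p in model.mid_blocks.parameters():
            p.requires_grad = True
    if epoch >= 9:                                # epochs 9-15: full fine-tuning
        for p in model.parameters():
            p.requires_grad = True
```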
5. Empirical Results and Ablation Studies
5.1 Quantitative Performance
On the Celeb-DF (v2) validation set, ForensicFlow achieves:
| Metric | Value | 95% CI |
|---|---|---|
| AUC | 0.9752 | [0.9636, 0.9848] |
| F1-score | 0.9408 | [0.9230, 0.9564] |
| Accuracy | 0.9208 | — |
Single-stream and dual-stream baselines at epoch 5 perform markedly worse:
- RGB-only: AUC 0.8261, F1 0.8271
- RGB+Texture: AUC 0.8309, F1 0.8285
- RGB+Freq: AUC 0.8343, F1 0.8223
The tri-modal system thus demonstrates a 15–17% AUC improvement over any simpler alternative, illustrating the necessity of multi-domain evidence integration.
5.2 Ablation Impact
- Removing the frequency branch reduces AUC from 0.9752 (full model) to 0.8309 (RGB+Texture at epoch 5).
- Removing the texture branch reduces AUC similarly, to 0.8343 (RGB+Frequency at epoch 5).
- No two-branch variant exceeds 0.83 F1 after five epochs, whereas the full model surpasses 0.94 F1.
- This substantiates that spatial, textural, and frequency streams supply non-redundant forensic information: RGB captures global inconsistencies, texture detects blending seams, and frequency discovers periodic noise.
A plausible implication is that omitting any modality exposes detection to failure against forgeries optimized in that specific domain.
6. Interpretability and Forensic Focus
Grad-CAM applied to the last depthwise convolution of the ConvNeXt-tiny backbone produces attention heatmaps. For genuine samples, this attention is diffuse; for Deepfake frames, the model fixates on artifact regions such as eyes, mouth corners, and facial boundaries, especially where manipulation traces cluster. This suggests the network is not reliant on trivial low-level statistics but attends to genuine manipulation artifacts, distinguishing it from simpler statistical detectors.
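A hook-based Grad-CAM sketch is included below for reference; the choice of target layer and the model's output convention (a single Deepfake logit) are assumptions.

```python
# Minimal Grad-CAM sketch over a chosen convolutional layer (hypothetical usage:
# the last depthwise convolution of the RGB backbone).
import torch

def grad_cam(model, x, target_layer):
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    score = model(x)                               # assumed (B, 1) Deepfake logit
    model.zero_grad()
    score.sum().backward()
    h1.remove(); h2.remove()
    w = grads["a"].mean(dim=(2, 3), keepdim=True)  # channel importance weights
    cam = torch.relu((w * feats["a"]).sum(dim=1))  # (B, H, W) raw heatmap
    return cam / cam.amax(dim=(1, 2), keepdim=True).clamp(min=1e-8)
```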
7. Significance and Applicability
ForensicFlow demonstrates that robust detection of modern Deepfakes necessitates a multi-modal, attention-driven approach that unifies spatial, textural, and spectral cues (Romani, 18 Nov 2025). Its systematic feature fusion, combined with data-efficient and class-imbalance-tolerant training with Focal Loss, sets a new empirical standard for resilient Deepfake forensics as evidenced by state-of-the-art metrics. Its architectural principles and empirical findings are directly applicable to the design of future multimodal, artifact-sensitive forensic models operating on synthetic media.