Swin Transformer-tiny Texture Branch
- The paper demonstrates that a texture branch built on Swin Transformer-tiny significantly enhances Deepfake detection by capturing micro-level, localized texture artifacts, as shown by improved F1 and AUC metrics.
- It employs a hierarchical transformer-based model using patch partitioning, shifted window self-attention, and patch merging to capture fine-grained spatial discrepancies.
- Temporal attention pooling and adaptive fusion with RGB and frequency cues ensure robust video-level feature aggregation for superior forensic performance.
The texture branch based on the Swin Transformer-tiny backbone within the ForensicFlow tri-modal forensic framework is designed to expose fine-grained, local blending or skin-texture artifacts characteristic of advanced Deepfake forgeries. In contrast to single-stream CNNs, this branch leverages hierarchical transformer-based modeling—incorporating patch partition, shifted window attention, and patch merging—to capture structure and texture discrepancies with high spatial precision. Its output, distilled to a video-level feature via attention-based temporal pooling, is adaptively fused with complementary RGB and frequency cues for robust video Deepfake detection (Romani, 18 Nov 2025).
1. Architectural Motivation and Backbone Selection
The principal function of the texture branch is the detection of sub-pixel, locally-blended artifacts in manipulated face frames that typically evade global CNN feature extractors. Swin Transformer-tiny, pretrained on ImageNet, is adopted as the backbone due to its ability to construct multi-scale, hierarchical feature representations via localized self-attention operating over non-overlapping windows, which are shifted between blocks to enable cross-region dependency modeling. This configuration enhances sensitivity to subtle and spatially localized irregularities, outperforming global CNN filters at identifying micro-level inconsistencies in texture and blending.
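Although the paper does not state which implementation library is used, the following is a minimal PyTorch sketch of instantiating an ImageNet-pretrained Swin Transformer-tiny backbone (via torchvision, an assumption) and replacing its classification head so it emits 768-dimensional per-frame features rather than ImageNet logits:

```python
import torch
import torch.nn as nn
from torchvision.models import swin_t, Swin_T_Weights

# Load Swin Transformer-tiny pretrained on ImageNet-1k.
backbone = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1)

# Replace the 1000-way ImageNet classifier with identity so the backbone
# outputs the 768-dim stage-4 feature after global average pooling.
backbone.head = nn.Identity()

# Sanity check: a 224x224 face crop yields a 768-dim per-frame feature.
frame = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    feat = backbone(frame)
print(feat.shape)  # torch.Size([1, 768])
```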
2. Input Preprocessing and Patch Embedding
Each input face frame is resized to $224 \times 224$ and normalized using ImageNet statistics. The frame is partitioned into non-overlapping patches of size $P \times P$, with $P = 4$ by default, resulting in $(224/4)^2 = 56 \times 56 = 3136$ patches. Each patch $x_p \in \mathbb{R}^{P^2 \cdot 3}$ undergoes linear embedding:

$$z_p = E\,x_p + b_E,$$

where $E \in \mathbb{R}^{C \times (P^2 \cdot 3)}$ and $C = 96$ (default). The resulting $56 \times 56 \times 96$ tensor forms the input tokens for downstream transformer blocks.
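A minimal sketch of the patch partition and linear embedding, expressed in the usual equivalent form as a strided convolution with kernel and stride equal to $P = 4$ and $C = 96$ output channels; class and variable names are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Partition a 224x224 frame into 4x4 patches and linearly embed each one."""
    def __init__(self, patch_size: int = 4, in_chans: int = 3, embed_dim: int = 96):
        super().__init__()
        # A conv with kernel = stride = patch_size applies the same linear map
        # to every non-overlapping P x P patch.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (B, 96, 56, 56)
        x = x.flatten(2).transpose(1, 2)      # (B, 56*56, 96) token sequence
        return self.norm(x)

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 3136, 96])
```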
3. Shifted Window Multi-Head Self-Attention
Core to the Swin architecture are two alternating self-attention blocks per stage:
- Windowed Multi-Head Self-Attention (W-MSA): Operates locally on non-overlapping $M \times M$ windows (default $M = 7$), limiting computation while providing localized context.
- Shifted-Window MSA (SW-MSA): Prior to window partitioning, the feature map is cyclically shifted by $\lfloor M/2 \rfloor$ positions, enabling tokens on window boundaries to attend across former partition borders.
For each attention head, self-attention within a window is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + B\right) V,$$

with $Q, K, V \in \mathbb{R}^{M^2 \times d}$, where $d$ is the per-head dimension. The self-attention incorporates a learnable relative position bias $B$ per window, whose entries are taken from a parameter matrix $\hat{B}$, where $\hat{B}$ is of shape $(2M-1) \times (2M-1)$. SW-MSA ensures the model can aggregate evidence of local artifacts even when distributed near boundaries of standard attention windows.
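The window partitioning, cyclic shift, and biased window attention can be sketched as follows; this is a simplified illustration (the attention mask that Swin applies to wrapped-around regions after the cyclic shift is omitted), and all names and tensor layouts are assumptions rather than the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def window_partition(x: torch.Tensor, M: int) -> torch.Tensor:
    """(B, H, W, C) -> (num_windows*B, M*M, C) non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

class WindowAttention(nn.Module):
    """Multi-head self-attention inside one window with relative position bias."""
    def __init__(self, dim: int, num_heads: int, M: int = 7):
        super().__init__()
        self.num_heads, self.M = num_heads, M
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per relative offset, (2M-1)^2 entries per head.
        self.bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))
        coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
        rel = coords.flatten(1)[:, :, None] - coords.flatten(1)[:, None, :]  # (2, M^2, M^2)
        rel = rel.permute(1, 2, 0) + (M - 1)                                  # shift offsets to >= 0
        self.register_buffer("bias_index", rel[..., 0] * (2 * M - 1) + rel[..., 1])

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (nW*B, M*M, C)
        Bn, N, C = x.shape
        qkv = self.qkv(x).reshape(Bn, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                    # each (Bn, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        bias = self.bias_table[self.bias_index.reshape(-1)].view(N, N, -1).permute(2, 0, 1)
        attn = F.softmax(attn + bias.unsqueeze(0), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bn, N, C)
        return self.proj(out)

# Cyclic shift used by SW-MSA before window partitioning (shift of M//2 = 3).
x = torch.randn(1, 56, 56, 96)
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
windows = window_partition(shifted, M=7)                        # (64, 49, 96)
out = WindowAttention(dim=96, num_heads=3)(windows)
print(out.shape)  # torch.Size([64, 49, 96])
```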
4. Swin Transformer-tiny Stagewise Structure
Swin Transformer-tiny is organized in four stages, each progressively reducing spatial resolution while increasing channel depth and attention head count, as detailed below:
| Stage | Output Resolution | Depth | Embed Dim | #Heads | MLP Ratio | Dropout |
|---|---|---|---|---|---|---|
| 1 | 56 × 56 | 2 | 96 | 3 | 4 | 0 |
| 2 | 28 × 28 | 2 | 192 | 6 | 4 | 0 |
| 3 | 14 × 14 | 6 | 384 | 12 | 4 | 0 |
| 4 | 7 × 7 | 2 | 768 | 24 | 4 | 0 |
Within each W-MSA or SW-MSA block, the standard residual layout is applied: LayerNorm, the MSA or MLP sub-layer, DropPath (with total stochastic depth ≈ 0.1), and a residual connection.
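A minimal sketch of this residual block layout, with DropPath implemented inline and any windowed-attention module (such as the WindowAttention sketch above) passed in as the attention sub-layer; names are illustrative:

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: randomly drop the residual branch per sample."""
    def __init__(self, p: float = 0.1):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0.0:
            return x
        keep = 1.0 - self.p
        mask = x.new_empty(x.shape[0], *([1] * (x.dim() - 1))).bernoulli_(keep)
        return x * mask / keep

class SwinBlock(nn.Module):
    """Pre-norm residual block: LN -> (S)W-MSA -> +x, then LN -> MLP -> +x."""
    def __init__(self, dim: int, attn: nn.Module, mlp_ratio: int = 4, drop_path: float = 0.1):
        super().__init__()
        self.norm1, self.attn = nn.LayerNorm(dim), attn
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )
        self.drop_path = DropPath(drop_path)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (num_windows*B, tokens, dim)
        x = x + self.drop_path(self.attn(self.norm1(x)))
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x

block = SwinBlock(dim=96, attn=nn.Identity())   # nn.Identity() stands in for windowed attention
y = block(torch.randn(64, 49, 96))
print(y.shape)  # torch.Size([64, 49, 96])
```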
At the start of each stage (except the first), a patch-merging layer reduces the spatial dimensions by 2 and doubles the embedding dimension. Specifically, for each $2 \times 2$ neighborhood of tokens:

$$\hat{x}_{i,j} = \mathrm{Concat}\big(x_{2i,2j},\, x_{2i+1,2j},\, x_{2i,2j+1},\, x_{2i+1,2j+1}\big) \in \mathbb{R}^{4C},$$

followed by a linear reduction:

$$x'_{i,j} = W_{\mathrm{merge}}\,\hat{x}_{i,j}, \qquad W_{\mathrm{merge}} \in \mathbb{R}^{2C \times 4C}.$$
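A sketch of the patch-merging operation, assuming the standard Swin formulation of a LayerNorm followed by a bias-free linear reduction from $4C$ to $2C$ channels:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 token neighborhood (4C channels) and project back to 2C."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]   # even rows, even cols
        x1 = x[:, 1::2, 0::2, :]   # odd rows,  even cols
        x2 = x[:, 0::2, 1::2, :]   # even rows, odd cols
        x3 = x[:, 1::2, 1::2, :]   # odd rows,  odd cols
        x = torch.cat([x0, x1, x2, x3], dim=-1)            # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))                # (B, H/2, W/2, 2C)

out = PatchMerging(dim=96)(torch.randn(1, 56, 56, 96))
print(out.shape)  # torch.Size([1, 28, 28, 192])
```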
5. Feature Post-processing and Temporal Aggregation
The stage 4 output, of shape $7 \times 7 \times 768$, undergoes adaptive average pooling to yield a per-frame feature $f_t \in \mathbb{R}^{768}$. A linear projection reduces this to $h_t \in \mathbb{R}^{d}$ for each video frame $t$. Across $T$ frames, the set $\{h_1, \dots, h_T\}$ represents the temporal progression of local texture features.
Temporal aggregation is performed via attention-based pooling, where per-frame weights are computed as

$$\alpha_t = \frac{\exp\!\big(e_t\big)}{\sum_{s=1}^{T} \exp\!\big(e_s\big)},$$

with $e_t$ a learnable attention score for frame $t$, resulting in a pooled texture feature:

$$v_{\mathrm{tex}} = \sum_{t=1}^{T} \alpha_t\, h_t.$$
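A minimal sketch of the attention-based temporal pooling, assuming the scores $e_t$ come from a single learnable linear scorer over the projected per-frame features; the scoring network used in the paper may differ, and the feature dimension (256 here) is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionPool(nn.Module):
    """Weight per-frame texture features with softmax attention and sum them."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)          # learnable scoring function e_t = w^T h_t + b

    def forward(self, h: torch.Tensor) -> torch.Tensor:   # h: (B, T, dim)
        alpha = F.softmax(self.score(h), dim=1)            # (B, T, 1), sums to 1 over frames
        return (alpha * h).sum(dim=1)                       # (B, dim) pooled video-level feature

# Example: 16 projected per-frame features of illustrative dimension d = 256.
frames = torch.randn(2, 16, 256)
v_tex = TemporalAttentionPool(dim=256)(frames)
print(v_tex.shape)  # torch.Size([2, 256])
```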
6. Adaptive Cross-Branch Fusion
The video-level representations from each branch, $v_{\mathrm{rgb}}$, $v_{\mathrm{tex}}$, and $v_{\mathrm{freq}}$ (all in $\mathbb{R}^{d}$), are combined using a “fusion attention” mechanism. This network outputs scalar weights $w_{\mathrm{rgb}}, w_{\mathrm{tex}}, w_{\mathrm{freq}}$ (softmax-normalized):

$$(w_{\mathrm{rgb}}, w_{\mathrm{tex}}, w_{\mathrm{freq}}) = \mathrm{softmax}\big(g([v_{\mathrm{rgb}};\, v_{\mathrm{tex}};\, v_{\mathrm{freq}}])\big),$$

where $g(\cdot)$ is a small scoring network over the concatenated branch features. The fused representation is then:

$$v_{\mathrm{fused}} = w_{\mathrm{rgb}}\, v_{\mathrm{rgb}} + w_{\mathrm{tex}}\, v_{\mathrm{tex}} + w_{\mathrm{freq}}\, v_{\mathrm{freq}}.$$
This architecture enables dynamic, content-aware weighting of each cue modality at inference.
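A sketch of the fusion attention mechanism, assuming the three video-level vectors are concatenated and scored by a small two-layer network; layer sizes and names are assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionAttention(nn.Module):
    """Produce softmax weights over the RGB, texture, and frequency branches."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, v_rgb, v_tex, v_freq):                # each: (B, dim)
        w = F.softmax(self.score(torch.cat([v_rgb, v_tex, v_freq], dim=-1)), dim=-1)
        # Content-aware weighted sum of the three video-level representations.
        return w[:, 0:1] * v_rgb + w[:, 1:2] * v_tex + w[:, 2:3] * v_freq

branches = [torch.randn(2, 256) for _ in range(3)]
fused = FusionAttention(dim=256)(*branches)
print(fused.shape)  # torch.Size([2, 256])
```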
7. Training Protocols and Empirical Significance
Optimization employs AdamW with weight decay, combined with layer-wise progressive unfreezing during the initial 15 training epochs. The projection and classification head are trained with an initial learning rate that is halved at each unfreeze stage (a minimal sketch of this schedule follows the list below):
- Epochs 1–3: only projection & head trainable
- Epochs 4–6: unfreeze final backbone blocks
- Epochs 7–8: unfreeze mid-blocks
- Epochs 9–15: full backbone fine-tuning
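A sketch of this progressive-unfreezing schedule, assuming torchvision's Swin-T module layout; the learning rate, weight decay, and the grouping of "final" and "mid" backbone blocks are placeholders, since the exact values are not reproduced here:

```python
import torch
from torchvision.models import swin_t, Swin_T_Weights

# Assumed torchvision layout: model.features holds the patch embedding, the four
# Swin stages, and the patch-merging layers; model.head (plus, in ForensicFlow,
# the projection) stays trainable throughout.
model = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1)
stages = list(model.features.children())

def set_trainable(modules, flag: bool):
    for m in modules:
        for p in m.parameters():
            p.requires_grad = flag

lr = 1e-4             # placeholder initial learning rate (paper value not reproduced)
weight_decay = 1e-2   # placeholder weight decay (paper value not reproduced)

for epoch in range(1, 16):
    if epoch == 1:
        set_trainable(stages, False)          # epochs 1-3: projection & head only
    elif epoch == 4:
        set_trainable(stages[-2:], True)      # epochs 4-6: unfreeze final backbone blocks
        lr *= 0.5
    elif epoch == 7:
        set_trainable(stages[-4:], True)      # epochs 7-8: unfreeze mid-blocks
        lr *= 0.5
    elif epoch == 9:
        set_trainable(stages, True)           # epochs 9-15: full backbone fine-tuning
        lr *= 0.5
    # Optimizer rebuilt each epoch for simplicity; in practice only at unfreeze boundaries.
    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr, weight_decay=weight_decay
    )
    # ... one epoch of training with `optimizer` would run here ...
```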
Ablation at epoch 5 quantifies the contribution of the texture branch:
| Composition | F1 Score | AUC |
|---|---|---|
| RGB + Texture | 0.8285 | 0.8309 |
| Tri-branch (full) | 0.9408 | 0.9752 |
This validates the synergy between the texture branch and other modalities in the ForensicFlow framework.
8. Context and Forensic Impact
The deployment of the Swin Transformer-tiny as a texture branch within ForensicFlow demonstrates the utility of hierarchical, local-attention transformers in forensic tasks requiring extreme detail sensitivity. The branch's design, combining multi-scale representation, patch merging, and shifted window attention, is specifically tailored to the detection of subtle blending imperfections in Deepfake media that global or frequency-oriented models may neglect. Temporal attention pooling and adaptive fusion further ensure that micro-level evidence is neither overwhelmed nor overlooked in the final prediction, underpinning the system's superior AUC (0.9752), F1 (0.9408), and accuracy (0.9208) on challenging datasets such as Celeb-DF (v2) (Romani, 18 Nov 2025).