Swin Transformer-tiny Texture Branch
- The paper demonstrates that a texture branch built on Swin Transformer-tiny significantly enhances Deepfake detection by capturing micro-level, localized texture artifacts, as shown by improved F1 and AUC metrics.
- It employs a hierarchical transformer-based model using patch partitioning, shifted window self-attention, and patch merging to capture fine-grained spatial discrepancies.
- Temporal attention pooling and adaptive fusion with RGB and frequency cues ensure robust video-level feature aggregation for superior forensic performance.
The texture branch based on the Swin Transformer-tiny backbone within the ForensicFlow tri-modal forensic framework is designed to expose fine-grained, local blending or skin-texture artifacts characteristic of advanced Deepfake forgeries. In contrast to single-stream CNNs, this branch leverages hierarchical transformer-based modeling—incorporating patch partition, shifted window attention, and patch merging—to capture structure and texture discrepancies with high spatial precision. Its output, distilled to a video-level feature via attention-based temporal pooling, is adaptively fused with complementary RGB and frequency cues for robust video Deepfake detection (Romani, 18 Nov 2025).
1. Architectural Motivation and Backbone Selection
The principal function of the texture branch is the detection of sub-pixel, locally-blended artifacts in manipulated face frames that typically evade global CNN feature extractors. Swin Transformer-tiny, pretrained on ImageNet, is adopted as the backbone due to its ability to construct multi-scale, hierarchical feature representations via localized self-attention operating over non-overlapping windows, which are shifted between blocks to enable cross-region dependency modeling. This configuration enhances sensitivity to subtle and spatially localized irregularities, outperforming global CNN filters at identifying micro-level inconsistencies in texture and blending.
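Although the paper does not state which implementation library is used, the following is a minimal PyTorch sketch of instantiating an ImageNet-pretrained Swin Transformer-tiny backbone (via torchvision, an assumption) and replacing its classification head so it emits 768-dimensional per-frame features rather than ImageNet logits:

```python
import torch
import torch.nn as nn
from torchvision.models import swin_t, Swin_T_Weights

# Load Swin Transformer-tiny pretrained on ImageNet-1k.
backbone = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1)

# Replace the 1000-way ImageNet classifier with identity so the backbone
# outputs the 768-dim stage-4 feature after global average pooling.
backbone.head = nn.Identity()

# Sanity check: a 224x224 face crop yields a 768-dim per-frame feature.
frame = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    feat = backbone(frame)
print(feat.shape)  # torch.Size([1, 768])
```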
2. Input Preprocessing and Patch Embedding
Each input face frame is resized to $224 \times 224$ and normalized using ImageNet statistics. The frame is partitioned into non-overlapping patches of size $P \times P$, with $P = 4$ by default, resulting in $(224/4)^2 = 56 \times 56 = 3136$ patches. Each patch $x_p \in \mathbb{R}^{P^2 \cdot 3}$ undergoes linear embedding:

$$z_p = E\,x_p + b_E,$$

where $E \in \mathbb{R}^{C \times (P^2 \cdot 3)}$ and $C = 96$ (default). The resulting $56 \times 56 \times 96$ tensor forms the input tokens for downstream transformer blocks.
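A minimal sketch of the patch partition and linear embedding, expressed in the usual equivalent form as a strided convolution with kernel and stride equal to $P = 4$ and $C = 96$ output channels; class and variable names are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Partition a 224x224 frame into 4x4 patches and linearly embed each one."""
    def __init__(self, patch_size: int = 4, in_chans: int = 3, embed_dim: int = 96):
        super().__init__()
        # A conv with kernel = stride = patch_size applies the same linear map
        # to every non-overlapping P x P patch.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (B, 96, 56, 56)
        x = x.flatten(2).transpose(1, 2)      # (B, 56*56, 96) token sequence
        return self.norm(x)

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 3136, 96])
```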
3. Shifted Window Multi-Head Self-Attention
Core to the Swin architecture are two alternating self-attention blocks per stage:
- Windowed Multi-Head Self-Attention (W-MSA): Operates locally on non-overlapping $M \times M$ windows (default $M = 7$), limiting computation while providing localized context.
- Shifted-Window MSA (SW-MSA): Prior to window partitioning, the feature map is cyclically shifted by $\lfloor M/2 \rfloor$ positions, enabling tokens on window boundaries to attend across former partition borders.
For each attention head, self-attention within a window is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + B\right) V,$$

with $Q, K, V \in \mathbb{R}^{M^2 \times d}$, where $d$ is the per-head dimension. The self-attention incorporates a learnable relative position bias $B$ per window, whose entries are taken from a parameter matrix $\hat{B}$, where $\hat{B}$ is of shape $(2M-1) \times (2M-1)$. SW-MSA ensures the model can aggregate evidence of local artifacts even when distributed near boundaries of standard attention windows.
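The window partitioning, cyclic shift, and biased window attention can be sketched as follows; this is a simplified illustration (the attention mask that Swin applies to wrapped-around regions after the cyclic shift is omitted), and all names and tensor layouts are assumptions rather than the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def window_partition(x: torch.Tensor, M: int) -> torch.Tensor:
    """(B, H, W, C) -> (num_windows*B, M*M, C) non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

class WindowAttention(nn.Module):
    """Multi-head self-attention inside one window with relative position bias."""
    def __init__(self, dim: int, num_heads: int, M: int = 7):
        super().__init__()
        self.num_heads, self.M = num_heads, M
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per relative offset, (2M-1)^2 entries per head.
        self.bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))
        coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
        rel = coords.flatten(1)[:, :, None] - coords.flatten(1)[:, None, :]  # (2, M^2, M^2)
        rel = rel.permute(1, 2, 0) + (M - 1)                                  # shift offsets to >= 0
        self.register_buffer("bias_index", rel[..., 0] * (2 * M - 1) + rel[..., 1])

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (nW*B, M*M, C)
        Bn, N, C = x.shape
        qkv = self.qkv(x).reshape(Bn, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                    # each (Bn, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        bias = self.bias_table[self.bias_index.reshape(-1)].view(N, N, -1).permute(2, 0, 1)
        attn = F.softmax(attn + bias.unsqueeze(0), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bn, N, C)
        return self.proj(out)

# Cyclic shift used by SW-MSA before window partitioning (shift of M//2 = 3).
x = torch.randn(1, 56, 56, 96)
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
windows = window_partition(shifted, M=7)                        # (64, 49, 96)
out = WindowAttention(dim=96, num_heads=3)(windows)
print(out.shape)  # torch.Size([64, 49, 96])
```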
4. Swin Transformer-tiny Stagewise Structure
Swin Transformer-tiny is organized in four stages, each progressively reducing spatial resolution while increasing channel depth and attention head count, as detailed below:
| Stage | Output Resolution | Depth | Embed Dim | #Heads | MLP Ratio | Dropout |
|---|---|---|---|---|---|---|
| 1 | 56 × 56 | 2 | 96 | 3 | 4 | 0 |
| 2 | 28 × 28 | 2 | 192 | 6 | 4 | 0 |
| 3 | 14 × 14 | 6 | 384 | 12 | 4 | 0 |
| 4 | 7 × 7 | 2 | 768 | 24 | 4 | 0 |
Within each W-MSA or SW-MSA block, the standard residual layout is applied: LayerNorm, the MSA or MLP sub-layer, DropPath (with total stochastic depth ≈ 0.1), and a residual connection.
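A minimal sketch of this residual block layout, with DropPath implemented inline and any windowed-attention module (such as the WindowAttention sketch above) passed in as the attention sub-layer; names are illustrative:

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: randomly drop the residual branch per sample."""
    def __init__(self, p: float = 0.1):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0.0:
            return x
        keep = 1.0 - self.p
        mask = x.new_empty(x.shape[0], *([1] * (x.dim() - 1))).bernoulli_(keep)
        return x * mask / keep

class SwinBlock(nn.Module):
    """Pre-norm residual block: LN -> (S)W-MSA -> +x, then LN -> MLP -> +x."""
    def __init__(self, dim: int, attn: nn.Module, mlp_ratio: int = 4, drop_path: float = 0.1):
        super().__init__()
        self.norm1, self.attn = nn.LayerNorm(dim), attn
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )
        self.drop_path = DropPath(drop_path)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (num_windows*B, tokens, dim)
        x = x + self.drop_path(self.attn(self.norm1(x)))
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x

block = SwinBlock(dim=96, attn=nn.Identity())   # nn.Identity() stands in for windowed attention
y = block(torch.randn(64, 49, 96))
print(y.shape)  # torch.Size([64, 49, 96])
```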
At the start of each stage (except the first), a patch-merging layer reduces the spatial dimensions by 2 and doubles the embedding dimension. Specifically, for each $2 \times 2$ neighborhood of tokens:

$$\hat{x}_{i,j} = \mathrm{Concat}\big(x_{2i,2j},\, x_{2i+1,2j},\, x_{2i,2j+1},\, x_{2i+1,2j+1}\big) \in \mathbb{R}^{4C},$$

followed by a linear reduction:

$$x'_{i,j} = W_{\mathrm{merge}}\,\hat{x}_{i,j}, \qquad W_{\mathrm{merge}} \in \mathbb{R}^{2C \times 4C}.$$
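A sketch of the patch-merging operation, assuming the standard Swin formulation of a LayerNorm followed by a bias-free linear reduction from $4C$ to $2C$ channels:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 token neighborhood (4C channels) and project back to 2C."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]   # even rows, even cols
        x1 = x[:, 1::2, 0::2, :]   # odd rows,  even cols
        x2 = x[:, 0::2, 1::2, :]   # even rows, odd cols
        x3 = x[:, 1::2, 1::2, :]   # odd rows,  odd cols
        x = torch.cat([x0, x1, x2, x3], dim=-1)            # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))                # (B, H/2, W/2, 2C)

out = PatchMerging(dim=96)(torch.randn(1, 56, 56, 96))
print(out.shape)  # torch.Size([1, 28, 28, 192])
```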
5. Feature Post-processing and Temporal Aggregation
The stage 4 output, of shape $7 \times 7 \times 768$, undergoes adaptive average pooling to yield a per-frame feature $f_t \in \mathbb{R}^{768}$. A linear projection reduces this to $h_t \in \mathbb{R}^{d}$ for each video frame $t$. Across $T$ frames, the set $\{h_1, \dots, h_T\}$ represents the temporal progression of local texture features.
Temporal aggregation is performed via attention-based pooling, where per-frame weights are computed as

$$\alpha_t = \frac{\exp\!\big(e_t\big)}{\sum_{s=1}^{T} \exp\!\big(e_s\big)},$$

with $e_t$ a learnable attention score for frame $t$, resulting in a pooled texture feature:

$$v_{\mathrm{tex}} = \sum_{t=1}^{T} \alpha_t\, h_t.$$
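A minimal sketch of the attention-based temporal pooling, assuming the scores $e_t$ come from a single learnable linear scorer over the projected per-frame features; the scoring network used in the paper may differ, and the feature dimension (256 here) is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionPool(nn.Module):
    """Weight per-frame texture features with softmax attention and sum them."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)          # learnable scoring function e_t = w^T h_t + b

    def forward(self, h: torch.Tensor) -> torch.Tensor:   # h: (B, T, dim)
        alpha = F.softmax(self.score(h), dim=1)            # (B, T, 1), sums to 1 over frames
        return (alpha * h).sum(dim=1)                       # (B, dim) pooled video-level feature

# Example: 16 projected per-frame features of illustrative dimension d = 256.
frames = torch.randn(2, 16, 256)
v_tex = TemporalAttentionPool(dim=256)(frames)
print(v_tex.shape)  # torch.Size([2, 256])
```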
6. Adaptive Cross-Branch Fusion
The video-level representations from each branch, $v_{\mathrm{rgb}}$, $v_{\mathrm{tex}}$, and $v_{\mathrm{freq}}$ (all in $\mathbb{R}^{d}$), are combined using a “fusion attention” mechanism. This network outputs scalar weights $w_{\mathrm{rgb}}, w_{\mathrm{tex}}, w_{\mathrm{freq}}$ (softmax-normalized):

$$(w_{\mathrm{rgb}}, w_{\mathrm{tex}}, w_{\mathrm{freq}}) = \mathrm{softmax}\big(g([v_{\mathrm{rgb}};\, v_{\mathrm{tex}};\, v_{\mathrm{freq}}])\big),$$

where $g(\cdot)$ is a small scoring network over the concatenated branch features. The fused representation is then:

$$v_{\mathrm{fused}} = w_{\mathrm{rgb}}\, v_{\mathrm{rgb}} + w_{\mathrm{tex}}\, v_{\mathrm{tex}} + w_{\mathrm{freq}}\, v_{\mathrm{freq}}.$$
This architecture enables dynamic, content-aware weighting of each cue modality at inference.
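A sketch of the fusion attention mechanism, assuming the three video-level vectors are concatenated and scored by a small two-layer network; layer sizes and names are assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionAttention(nn.Module):
    """Produce softmax weights over the RGB, texture, and frequency branches."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, v_rgb, v_tex, v_freq):                # each: (B, dim)
        w = F.softmax(self.score(torch.cat([v_rgb, v_tex, v_freq], dim=-1)), dim=-1)
        # Content-aware weighted sum of the three video-level representations.
        return w[:, 0:1] * v_rgb + w[:, 1:2] * v_tex + w[:, 2:3] * v_freq

branches = [torch.randn(2, 256) for _ in range(3)]
fused = FusionAttention(dim=256)(*branches)
print(fused.shape)  # torch.Size([2, 256])
```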
7. Training Protocols and Empirical Significance
Optimization employs AdamW with weight decay, combined with layer-wise progressive unfreezing during the initial 15 training epochs. The projection and classification head are trained with an initial learning rate that is halved at each unfreeze stage (a minimal sketch of this schedule follows the list below):
- Epochs 1–3: only projection & head trainable
- Epochs 4–6: unfreeze final backbone blocks
- Epochs 7–8: unfreeze mid-blocks
- Epochs 9–15: full backbone fine-tuning
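A sketch of this progressive-unfreezing schedule, assuming torchvision's Swin-T module layout; the learning rate, weight decay, and the grouping of "final" and "mid" backbone blocks are placeholders, since the exact values are not reproduced here:

```python
import torch
from torchvision.models import swin_t, Swin_T_Weights

# Assumed torchvision layout: model.features holds the patch embedding, the four
# Swin stages, and the patch-merging layers; model.head (plus, in ForensicFlow,
# the projection) stays trainable throughout.
model = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1)
stages = list(model.features.children())

def set_trainable(modules, flag: bool):
    for m in modules:
        for p in m.parameters():
            p.requires_grad = flag

lr = 1e-4             # placeholder initial learning rate (paper value not reproduced)
weight_decay = 1e-2   # placeholder weight decay (paper value not reproduced)

for epoch in range(1, 16):
    if epoch == 1:
        set_trainable(stages, False)          # epochs 1-3: projection & head only
    elif epoch == 4:
        set_trainable(stages[-2:], True)      # epochs 4-6: unfreeze final backbone blocks
        lr *= 0.5
    elif epoch == 7:
        set_trainable(stages[-4:], True)      # epochs 7-8: unfreeze mid-blocks
        lr *= 0.5
    elif epoch == 9:
        set_trainable(stages, True)           # epochs 9-15: full backbone fine-tuning
        lr *= 0.5
    # Optimizer rebuilt each epoch for simplicity; in practice only at unfreeze boundaries.
    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr, weight_decay=weight_decay
    )
    # ... one epoch of training with `optimizer` would run here ...
```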
Ablation at epoch 5 quantifies the contribution of the texture branch:
| Composition | F1 Score | AUC |
|---|---|---|
| RGB + Texture | 0.8285 | 0.8309 |
| Tri-branch (full) | 0.9408 | 0.9752 |
This validates the synergy between the texture branch and other modalities in the ForensicFlow framework.
8. Context and Forensic Impact
The deployment of the Swin Transformer-tiny as a texture branch within ForensicFlow demonstrates the utility of hierarchical, local-attention transformers in forensic tasks requiring extreme detail sensitivity. The branch's design, combining multi-scale representation, patch merging, and shifted window attention, is specifically tailored to the detection of subtle blending imperfections in Deepfake media that global or frequency-oriented models may neglect. Temporal attention pooling and adaptive fusion further ensure that micro-level evidence is neither overwhelmed nor overlooked in the final prediction, underpinning the system's superior AUC (0.9752), F1 (0.9408), and accuracy (0.9208) on challenging datasets such as Celeb-DF (v2) (Romani, 18 Nov 2025).