
Swin Transformer-tiny Texture Branch

Updated 25 November 2025
  • The paper demonstrates that integrating the texture branch with Swin Transformer-tiny significantly enhances Deepfake detection by capturing micro-level, localized texture artifacts, as shown by improved F1 and AUC metrics.
  • It employs a hierarchical transformer-based model using patch partitioning, shifted window self-attention, and patch merging to capture fine-grained spatial discrepancies.
  • Temporal attention pooling and adaptive fusion with RGB and frequency cues ensure robust video-level feature aggregation for superior forensic performance.

The texture branch based on the Swin Transformer-tiny backbone within the ForensicFlow tri-modal forensic framework is designed to expose fine-grained, local blending or skin-texture artifacts characteristic of advanced Deepfake forgeries. In contrast to single-stream CNNs, this branch leverages hierarchical transformer-based modeling—incorporating patch partition, shifted window attention, and patch merging—to capture structure and texture discrepancies with high spatial precision. Its output, distilled to a video-level feature via attention-based temporal pooling, is adaptively fused with complementary RGB and frequency cues for robust video Deepfake detection (Romani, 18 Nov 2025).

1. Architectural Motivation and Backbone Selection

The principal function of the texture branch is the detection of sub-pixel, locally-blended artifacts in manipulated face frames that typically evade global CNN feature extractors. Swin Transformer-tiny, pretrained on ImageNet, is adopted as the backbone due to its ability to construct multi-scale, hierarchical feature representations via localized self-attention operating over non-overlapping windows, which are shifted between blocks to enable cross-region dependency modeling. This configuration enhances sensitivity to subtle and spatially localized irregularities, outperforming global CNN filters at identifying micro-level inconsistencies in texture and blending.

2. Input Preprocessing and Patch Embedding

Each input face frame is resized to 224×224 and normalized using ImageNet statistics. The frame is partitioned into non-overlapping patches of size P×P, with P = 4 by default, resulting in N = (224/4)^2 = 3136 patches. Each patch x_i \in \mathbb{R}^{C \cdot P^2} undergoes linear embedding:

X_0 = [x_1, x_2, \ldots, x_N] W_e, \quad W_e \in \mathbb{R}^{(C \cdot P^2) \times D}

where D = 96 (default). The resulting tensor X_0 \in \mathbb{R}^{N \times D} forms the input tokens for downstream transformer blocks.
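
A minimal PyTorch sketch of this patch-partition-and-embed step (module and variable names are illustrative, not the paper's code; the strided convolution is mathematically equivalent to flattening each patch and applying the shared projection W_e):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a frame into non-overlapping P x P patches and linearly embed them."""
    def __init__(self, img_size=224, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 56 * 56 = 3136
        # Strided conv == per-patch flatten followed by the shared linear map W_e.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 96, 56, 56)
        return x.flatten(2).transpose(1, 2)    # (B, 3136, 96) token matrix X_0

frame = torch.randn(1, 3, 224, 224)
print(PatchEmbed()(frame).shape)               # torch.Size([1, 3136, 96])
```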

3. Shifted Window Multi-Head Self-Attention

Core to the Swin architecture are two alternating self-attention blocks per stage:

  • Windowed Multi-Head Self-Attention (W-MSA): Operates locally on non-overlapping M \times M windows (default M = 7), limiting computation while providing localized context.
  • Shifted-Window MSA (SW-MSA): Prior to window partitioning, the feature map is cyclically shifted by (\lfloor M/2 \rfloor, \lfloor M/2 \rfloor) positions, enabling tokens on window boundaries to attend across former partition borders.

For each attention head:

Q_h = X W^Q_h, \quad K_h = X W^K_h, \quad V_h = X W^V_h

with W^Q_h, W^K_h, W^V_h \in \mathbb{R}^{D \times (D/H)}. The self-attention incorporates a learnable relative position bias B_{ij} per window:

A = \text{Softmax}\!\left( \frac{Q K^T}{\sqrt{D/H}} + B \right) V

where A has shape M^2 \times (D/H). SW-MSA ensures the model can aggregate evidence of local artifacts even when they are distributed near the boundaries of standard attention windows.
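
A condensed sketch of the cyclic shift and window partitioning that precede SW-MSA (tensor layout and helper names are illustrative assumptions; standard attention with the relative position bias is then applied independently within each window):

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, D) feature map into non-overlapping M x M windows."""
    B, H, W, D = x.shape
    x = x.view(B, H // M, M, W // M, M, D)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, D)  # (num_windows*B, M^2, D)

def shifted_window_tokens(x, M=7, shift=True):
    """Cyclic shift by floor(M/2) before partitioning (SW-MSA); W-MSA uses shift=False."""
    if shift:
        x = torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))
    return window_partition(x, M)

feat = torch.randn(1, 56, 56, 96)            # stage-1 feature map
win_tokens = shifted_window_tokens(feat)     # (64, 49, 96): 8x8 windows of 7x7 tokens
print(win_tokens.shape)
```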

4. Swin Transformer-tiny Stagewise Structure

Swin Transformer-tiny is organized in four stages, each progressively reducing spatial resolution while increasing channel depth and attention head count, as detailed below:

| Stage | Output Resolution | Depth L_i | Embed Dim D_i | #Heads H_i | MLP Ratio | Dropout |
|-------|-------------------|-----------|---------------|------------|-----------|---------|
| 1     | 56 × 56           | 2         | 96            | 3          | 4         | 0       |
| 2     | 28 × 28           | 2         | 192           | 6          | 4         | 0       |
| 3     | 14 × 14           | 6         | 384           | 12         | 4         | 0       |
| 4     | 7 × 7             | 2         | 768           | 24         | 4         | 0       |

Within each block, the W-MSA or SW-MSA operation and the MLP that follows it are each preceded by LayerNorm, regularized with DropPath (total stochastic depth ≈ 0.1), and wrapped in residual connections.

At the start of each stage (except the first), a patch-merging layer reduces the spatial resolution by a factor of 2 and doubles the embedding dimension. Specifically, for each 2 \times 2 neighborhood:

x'_{i,j} = \text{Concat}(X_{2i,2j},\, X_{2i+1,2j},\, X_{2i,2j+1},\, X_{2i+1,2j+1}), \quad x'_{i,j} \in \mathbb{R}^{4D_i}

followed by layer normalization and a linear down-projection:

X_{i+1} = \text{LayerNorm}(x') W_{\text{down}}, \quad W_{\text{down}} \in \mathbb{R}^{4D_i \times 2D_i}
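
A minimal sketch of this patch-merging downsampling (the 2×2 concatenation plus LayerNorm and the W_down projection); the class name and tensor layout are assumptions:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 neighborhood (4*D_i channels), normalize, project to 2*D_i."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.down = nn.Linear(4 * dim, 2 * dim, bias=False)   # W_down

    def forward(self, x):                       # x: (B, H, W, D_i)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4*D_i)
        return self.down(self.norm(x))          # (B, H/2, W/2, 2*D_i)

x = torch.randn(1, 56, 56, 96)                  # stage-1 output
print(PatchMerging(96)(x).shape)                # torch.Size([1, 28, 28, 192])
```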

5. Feature Post-processing and Temporal Aggregation

The stage-4 output, of shape 7 \times 7 \times 768, undergoes adaptive average pooling to yield a per-frame feature z_t \in \mathbb{R}^{768}. A linear projection reduces this to f_t^{\text{Tex}} \in \mathbb{R}^{512} for each video frame. Across K frames, the set \{f_t^{\text{Tex}}\}_{t=1}^K represents the temporal progression of local texture features.

Temporal aggregation is performed via attention-based pooling, where per-frame weights \alpha_t are computed as:

\alpha_t = \text{Softmax}\big(u^T \tanh(W_f [f_t^{\text{RGB}}; f_t^{\text{Tex}}; f_t^{\text{Freq}}])\big)

resulting in a pooled texture feature:

f^{\text{Tex}}_{\text{pooled}} = \sum_{t=1}^{K} \alpha_t\, f_t^{\text{Tex}}
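
A small sketch of this attention-based temporal pooling, assuming the per-frame features of all three branches are available as (B, K, 512) tensors; the hidden width of W_f and the module layout are assumptions, only the weighting formula above is prescribed:

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """alpha_t = Softmax_t(u^T tanh(W_f [f_RGB; f_Tex; f_Freq])); pool the texture stream."""
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.W_f = nn.Linear(3 * dim, hidden)      # acts on the concatenated tri-branch feature
        self.u = nn.Linear(hidden, 1, bias=False)  # u^T

    def forward(self, f_rgb, f_tex, f_freq):       # each: (B, K, 512)
        scores = self.u(torch.tanh(self.W_f(torch.cat([f_rgb, f_tex, f_freq], dim=-1))))
        alpha = torch.softmax(scores, dim=1)       # (B, K, 1) weights over frames
        return (alpha * f_tex).sum(dim=1)          # pooled texture feature, (B, 512)

B, K = 2, 16
pool = TemporalAttentionPool()
print(pool(torch.randn(B, K, 512), torch.randn(B, K, 512), torch.randn(B, K, 512)).shape)
```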

6. Adaptive Cross-Branch Fusion

The video-level representations from each branch, f_{\text{RGB}}, f_{\text{Tex}}, f_{\text{Freq}} (all in \mathbb{R}^{512}), are combined using a "fusion attention" mechanism. This network outputs softmax-normalized scalar weights [\alpha_{\text{RGB}}, \alpha_{\text{Tex}}, \alpha_{\text{Freq}}]:

[\alpha_{\text{RGB}}, \alpha_{\text{Tex}}, \alpha_{\text{Freq}}] = \text{Softmax}\big(W_a [f_{\text{RGB}}; f_{\text{Tex}}; f_{\text{Freq}}] + b_a\big)

The fused representation is then:

f_{\text{fused}} = \alpha_{\text{RGB}} f_{\text{RGB}} + \alpha_{\text{Tex}} f_{\text{Tex}} + \alpha_{\text{Freq}} f_{\text{Freq}}

This architecture enables dynamic, content-aware weighting of each cue modality at inference.
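
A minimal sketch of this fusion-attention step (the single-linear-layer realization of W_a, b_a is an assumption; only the softmax-weighted sum is prescribed by the equations above):

```python
import torch
import torch.nn as nn

class FusionAttention(nn.Module):
    """Softmax-normalized scalar weights over the RGB, texture, and frequency branch features."""
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.Linear(3 * dim, 3)        # W_a, b_a -> one logit per branch

    def forward(self, f_rgb, f_tex, f_freq):     # each: (B, 512)
        logits = self.attn(torch.cat([f_rgb, f_tex, f_freq], dim=-1))
        a = torch.softmax(logits, dim=-1)        # [alpha_RGB, alpha_Tex, alpha_Freq]
        return a[:, 0:1] * f_rgb + a[:, 1:2] * f_tex + a[:, 2:3] * f_freq  # f_fused, (B, 512)

f = [torch.randn(4, 512) for _ in range(3)]
print(FusionAttention()(*f).shape)               # torch.Size([4, 512])
```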

7. Training Protocols and Empirical Significance

Optimization employs AdamW with weight decay 1 \times 10^{-4}, and layer-wise progressive unfreezing during the initial 15 training epochs. The projection and classification head are trained with an initial learning rate of 2 \times 10^{-5}, which is halved at each unfreeze stage:

  • Epochs 1–3: only projection & head trainable
  • Epochs 4–6: unfreeze final backbone blocks
  • Epochs 7–8: unfreeze mid-blocks
  • Epochs 9–15: full backbone fine-tuning
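
One way this staged unfreezing and learning-rate halving could be wired up; the epoch boundaries follow the list above, while parameter-name prefixes such as `proj`, `head`, and `backbone.stages` are assumptions about the model layout rather than the paper's code:

```python
def apply_unfreeze_schedule(model, optimizer, epoch, base_lr=2e-5):
    """Progressively unfreeze backbone groups and halve the LR at each unfreeze stage."""
    # (first epoch of the stage, parameter-name prefixes that become trainable)
    stages = [(1, ("proj", "head")),
              (4, ("backbone.stages.3",)),          # final backbone blocks
              (7, ("backbone.stages.2",)),          # mid blocks
              (9, ("backbone",))]                   # full backbone
    n_unfrozen = sum(1 for start, _ in stages if epoch >= start)
    trainable = tuple(p for _, prefixes in stages[:n_unfrozen] for p in prefixes)
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable)
    for group in optimizer.param_groups:
        group["lr"] = base_lr / (2 ** (n_unfrozen - 1))  # halved at each unfreeze stage
```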

Ablation at epoch 5 quantifies the contribution of the texture branch:

| Composition        | F1 Score | AUC    |
|--------------------|----------|--------|
| RGB + Texture      | 0.8285   | 0.8309 |
| Tri-branch (full)  | 0.9408   | 0.9752 |

This validates the synergy between the texture branch and other modalities in the ForensicFlow framework.

8. Context and Forensic Impact

The deployment of the Swin Transformer-tiny as a texture branch within ForensicFlow demonstrates the utility of hierarchical, local-attention transformers in forensic tasks requiring extreme detail sensitivity. The branch's design, combining multi-scale representation, patch merging, and shifted window attention, is specifically tailored to the detection of subtle blending imperfections in Deepfake media that global or frequency-oriented models may neglect. Temporal attention pooling and adaptive fusion further ensure that micro-level evidence is neither overwhelmed nor overlooked in the final prediction, underpinning the system's superior AUC (0.9752), F1 (0.9408), and accuracy (0.9208) on challenging datasets such as Celeb-DF (v2) (Romani, 18 Nov 2025).
