Residual Swin Transformer Block Overview

Updated 2 March 2026

Residual Swin Transformer Block (RSTB) is a hierarchical attention-based module that integrates interleaved residual connections and windowed self-attention for improved image restoration and multi-modal fusion.
RSTB employs both intra-layer and block-level residuals alongside convolutional projections to stabilize training and capture fine high-frequency details.
RSTB is central to models like SwinIR, TransEM, and SwinFuse, demonstrating its efficiency in tasks ranging from image restoration to PET reconstruction and infrared-visible fusion.

A Residual Swin Transformer Block (RSTB) is a hierarchical attention-based module that forms a core building block in several transformer-based deep learning models for image restoration, multi-modal image fusion, and inverse problems. RSTB enhances the Swin Transformer architecture by introducing interleaved residual connections at both intra-layer and inter-block granularity and complements window-based local self-attention with convolutional inductive biases. RSTB is central to SwinIR for image restoration (Liang et al., 2021), TransEM for PET reconstruction (Hu et al., 2022), and SwinFuse for infrared-visible fusion (Wang et al., 2022).

1. Architectural Composition and Flow

RSTB is defined by a sequence of Swin Transformer Layers (STLs), each comprising windowed/shifted-window multi-head self-attention (W-MSA/SW-MSA) and a two-layer MLP, both wrapped by pre-layer normalization and residual skip connections. The block is further enveloped by an outer residual path that includes a $3{\times}3$ convolutional projection. The canonical forward pipeline for a single RSTB in SwinIR (Liang et al., 2021) is:

Input: Feature map $X_0\in\mathbb{R}^{H\times W\times C}$.
Processing:
- Partition $X_0$ into $M{\times}M$ non-overlapping windows: $\{X_0^p\}_{p=1}^{N_w}$.
- For $j=1,\dots,L$ STLs, alternating between W-MSA and SW-MSA layers:
  - In SW-MSA layers, cyclically shift $X_{j-1}$ by $(\lfloor M/2\rfloor,\lfloor M/2\rfloor)$ before partitioning.
  - Apply window-wise multi-head self-attention and then the MLP, each preceded by LayerNorm and wrapped in a residual connection.
  - Intra-layer equations:
$\begin{aligned} Y_j &= X_{j-1} + \mathrm{(S)W\text{-}MSA}(\mathrm{LN}(X_{j-1})), \\ Z_j &= Y_j + \mathrm{MLP}(\mathrm{LN}(Y_j)), \\ X_j &= Z_j \end{aligned}$
Block-level Processing:
- Final $3{\times}3$ convolution: $Y = \operatorname{Conv}_{3\times 3}(X_L)$.
- Add outer residual: $X_{out} = Y + X_0$.
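
The two residual levels in this pipeline can be sketched in a few lines of NumPy. This is a minimal structural sketch, not the SwinIR implementation: the function names `stl`, `rstb_forward`, and `layer_norm` are hypothetical, the attention, MLP, and convolution are passed in as placeholder callables, and LayerNorm is shown without learnable scale and shift.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the channel axis (no learnable scale/shift in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def stl(x, attn, mlp, norm):
    # One Swin Transformer Layer: two pre-norm residual sublayers.
    y = x + attn(norm(x))      # Y_j = X_{j-1} + (S)W-MSA(LN(X_{j-1}))
    return y + mlp(norm(y))    # Z_j = Y_j + MLP(LN(Y_j))

def rstb_forward(x0, layers, conv):
    # layers: list of (attn, mlp, norm) callables, one tuple per STL.
    x = x0
    for attn, mlp, norm in layers:
        x = stl(x, attn, mlp, norm)
    return conv(x) + x0        # X_out = Conv(X_L) + X_0

# Placeholder sublayers that output zeros: every residual branch is inactive,
# so the whole block reduces to the identity mapping.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8, 4))
zero = lambda x: np.zeros_like(x)
layers = [(zero, zero, layer_norm)] * 6
out = rstb_forward(x0, layers, conv=zero)
```

With all branches returning zero, `out` equals `x0`, which illustrates why the block only has to learn a correction on top of its input.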

This block design is widely adopted, with each task setting its own hyperparameters: the number of STLs per RSTB, feature channel width, window size, number of attention heads, and MLP expansion ratio (Liang et al., 2021, Hu et al., 2022, Wang et al., 2022).

2. Mathematical Framework

The RSTB’s operation is mathematically formalized as follows:

Window Partition/Unpartition: Transform $X\in\mathbb{R}^{H\times W\times C}$ into $N_w$ windows $X^p\in\mathbb{R}^{M^2\times C}$ and reconstruct the full map post-attention.
Multi-head Self-Attention in a Window:
- For each head $t=1,\dots,h$:
$\begin{aligned} Q^p_t &= X^pW^t_Q, \quad K^p_t = X^pW^t_K, \quad V^p_t = X^pW^t_V, \\ A^p_t &= \mathrm{Softmax}\Big(Q^p_t (K^p_t)^\top /\sqrt{d} + B_t\Big), \\ \mathrm{Head}^p_t &= A^p_t V^p_t \end{aligned}$
where $d$ is the per-head feature dimension and $B_t$ is the learnable relative position bias of head $t$.
- Output from the window: $\operatorname{Concat}_t(\mathrm{Head}^p_t)\, W_O$.
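
The per-window attention equations can be traced with a small NumPy sketch. It assumes row-vector tokens (weights multiply on the right), a zero position bias, and randomly initialized weights; the function name `window_msa` and the array layouts are illustrative choices, not taken from any of the cited implementations.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_msa(xp, Wq, Wk, Wv, Wo, B):
    # xp: (M*M, C) tokens of one window
    # Wq, Wk, Wv: (h, C, d) per-head projections; Wo: (h*d, C)
    # B: (h, M*M, M*M) additive position bias per head
    h, _, d = Wq.shape
    heads = []
    for t in range(h):
        q, k, v = xp @ Wq[t], xp @ Wk[t], xp @ Wv[t]
        a = softmax(q @ k.T / np.sqrt(d) + B[t])   # A_t: (M*M, M*M)
        heads.append(a @ v)                        # Head_t: (M*M, d)
    return np.concatenate(heads, axis=-1) @ Wo     # Concat_t(Head_t) W_O

rng = np.random.default_rng(0)
M, C, h = 4, 8, 2          # illustrative sizes only
d = C // h
xp = rng.standard_normal((M * M, C))
Wq, Wk, Wv = (rng.standard_normal((h, C, d)) for _ in range(3))
Wo = rng.standard_normal((h * d, C))
B = np.zeros((h, M * M, M * M))
out = window_msa(xp, Wq, Wk, Wv, Wo, B)
print(out.shape)  # (16, 8)
```

Each window is processed independently, so in practice the loop over windows is folded into a batch dimension.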
MLP Block (applied independently to each token):

$\mathrm{MLP}(U) = \mathrm{GELU}(U W_1 + b_1)\,W_2 + b_2$

with $W_1\in\mathbb{R}^{C\times rC}$, $W_2\in\mathbb{R}^{rC\times C}$, biases $b_1\in\mathbb{R}^{rC}$ and $b_2\in\mathbb{R}^{C}$, and $r$ the expansion factor.
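
As a shape check for the MLP, the following sketch applies the two-layer map to a window of tokens. It assumes row-vector tokens so that weights multiply on the right, includes a first-layer bias `b1`, uses the tanh approximation of GELU rather than the exact erf form, and picks $r=2$ purely for illustration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an approximation, not the exact erf form).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(U, W1, b1, W2, b2):
    # U: (tokens, C); W1: (C, r*C); W2: (r*C, C). Applied row by row.
    return gelu(U @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
C, r = 8, 2                           # illustrative sizes only
U = rng.standard_normal((16, C))
W1, b1 = rng.standard_normal((C, r * C)), np.zeros(r * C)
W2, b2 = rng.standard_normal((r * C, C)), np.zeros(C)
out = mlp(U, W1, b1, W2, b2)
print(out.shape)  # (16, 8)
```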

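Shifted Windowing: Successive layers alternate between regular windowing and windowing applied after a cyclic shift of the feature map by $(\lfloor M/2\rfloor,\lfloor M/2\rfloor)$, so that tokens separated by a window boundary in one layer share a window in the next. After the cyclic shift, some windows contain tokens from regions that are not adjacent in the original map; attention between such tokens is masked out.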
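
Window partitioning, its inverse, and the cyclic shift are all reshapes and rolls. The NumPy sketch below assumes $H$ and $W$ are multiples of $M$; the helper names `window_partition`, `window_reverse`, and `cyclic_shift` are hypothetical.

```python
import numpy as np

def window_partition(x, M):
    # (H, W, C) -> (N_w, M*M, C) with N_w = (H/M) * (W/M)
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)

def window_reverse(windows, M, H, W):
    # (N_w, M*M, C) -> (H, W, C), the inverse of window_partition
    C = windows.shape[-1]
    x = windows.reshape(H // M, W // M, M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def cyclic_shift(x, s):
    # Roll the map up and to the left by s pixels; pass -s to undo.
    return np.roll(x, shift=(-s, -s), axis=(0, 1))

H = W = 8
M = 4
s = M // 2
x = np.arange(H * W * 2, dtype=float).reshape(H, W, 2)

windows = window_partition(cyclic_shift(x, s), M)
print(windows.shape)  # (4, 16, 2)

# Undo the partition, then undo the shift: the original map is recovered.
restored = cyclic_shift(window_reverse(windows, M, H, W), -s)
```

Attention would be applied to `windows` between the two halves of this round trip.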

3. Residual Design Rationale

Residual connections are employed at two granularities within RSTB:

Intra-layer residuals (within each STL): Facilitate local gradient propagation and stabilize deep network training.
Block-level residual: The output of the $L$ STLs (after local convolution) is added back to the original RSTB input. This block-wise structure enables the block to focus on learning residual corrections relative to an identity mapping, targeting high-frequency details while maintaining low-frequency signal integrity.
Convolution prior to block residual: The $3{\times}3$ convolution applies a spatially invariant filter that restores local translational equivariance, modulates feature statistics, and makes the feature distribution more stable for subsequent processing. This matters because self-attention acts like a spatially varying filter and does not by itself guarantee translational equivariance (Liang et al., 2021).
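
The identity-mapping argument can be written out explicitly. Writing $F$ for the composition of the $L$ STLs (a symbol introduced here only for illustration), the block output and its Jacobian with respect to the input are:

$\begin{aligned} X_{out} &= X_0 + \operatorname{Conv}_{3\times 3}\big(F(X_0)\big), \\ \frac{\partial X_{out}}{\partial X_0} &= I + \frac{\partial\, \operatorname{Conv}_{3\times 3}\big(F(X_0)\big)}{\partial X_0} \end{aligned}$

The identity term $I$ gives gradients a direct path to $X_0$ that does not depend on the learned branch, and the branch itself only needs to model the difference $X_{out} - X_0$.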

4. Window-based Self-Attention and Shifted Windows

The core attention mechanism in RSTB restricts self-attention computations to local, non-overlapping windows, reducing computational complexity from $O((HW)^2)$ (global attention) to $O(HW\,M^2)$ for window size $M$.
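
A quick count of query-key pairs makes the saving concrete. The sketch below counts only attention-matrix entries (it ignores the channel dimension and the linear projections) for a window of side length $M$; the function names and the example sizes are illustrative.

```python
def attention_pairs_global(H, W):
    # Every token attends to every token: (HW)^2 pairs.
    n = H * W
    return n * n

def attention_pairs_windowed(H, W, M):
    # (HW / M^2) windows, each with (M^2)^2 pairs: HW * M^2 in total.
    if H % M or W % M:
        raise ValueError("H and W must be multiples of M")
    num_windows = (H // M) * (W // M)
    return num_windows * (M * M) ** 2

H, W, M = 64, 64, 8
g = attention_pairs_global(H, W)        # 16_777_216
w = attention_pairs_windowed(H, W, M)   # 262_144
print(g // w)  # 64, i.e. HW / M^2
```

The ratio $HW/M^2$ grows with image size, which is why windowed attention scales linearly in the number of pixels for fixed $M$.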

Windowed Self-Attention (W-MSA): Tokens within $M{\times}M$ partitions attend only to other tokens in the same window.
Shifted-Window Self-Attention (SW-MSA): Every other layer applies a spatial shift to the feature map before forming windows, which enhances cross-window feature mixing over depth. To prevent information leakage across windows during the cyclic shift, attention logits are masked accordingly.
Window size, feature dimension, and number of attention heads are adapted to domain needs (e.g., $M=8$ in SwinIR, $M=4$ in TransEM, $M=7$ in SwinFuse) (Liang et al., 2021, Hu et al., 2022, Wang et al., 2022).
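
The masking step for SW-MSA can be sketched as follows: label each region of the shifted map, partition the labels into windows, and block every token pair whose labels differ. This is a NumPy sketch under assumptions: the function name `shifted_window_mask` is hypothetical, $H$ and $W$ are multiples of $M$, and the large negative constant `-100.0` is an illustrative choice that drives masked logits to near-zero weight after the softmax.

```python
import numpy as np

def shifted_window_mask(H, W, M, s, neg=-100.0):
    # Label the regions that end up sharing a window after a cyclic shift by s.
    img = np.zeros((H, W), dtype=np.int64)
    cuts = (slice(0, -M), slice(-M, -s), slice(-s, None))
    label = 0
    for hs in cuts:
        for ws in cuts:
            img[hs, ws] = label
            label += 1
    # Partition the label map into (N_w, M*M) windows.
    win = img.reshape(H // M, M, W // M, M).transpose(0, 2, 1, 3).reshape(-1, M * M)
    # Token pairs with different labels come from non-adjacent regions.
    diff = win[:, None, :] - win[:, :, None]
    return np.where(diff != 0, neg, 0.0)   # (N_w, M*M, M*M), added to the logits

mask = shifted_window_mask(H=8, W=8, M=4, s=2)
print(mask.shape)                 # (4, 16, 16)
print(int((mask[0] != 0).sum()))  # 0: the top-left window needs no masking
```

Only windows that touch the wrapped-around borders receive non-zero mask entries; interior windows attend freely.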

5. Implementation Variants Across Applications

RSTB has been adapted to address various image-related tasks, demonstrating cross-domain utility:

Model (Paper)	Window Size	Number of RSTBs	Feature Dim $C$	Attention Heads per RSTB	RSTB Structure
SwinIR (Liang et al., 2021)	8 (SR, DN), 7 (JPEG)	6	180/60	6	6 STLs + Conv + Residual
TransEM (Hu et al., 2022)	4	1 per EM step	varies	4 or 8	Conv + 1 STL + Conv + Residual
SwinFuse (Wang et al., 2022)	7	3 (best)	96	1/2/4	6 STLs + Residual (no conv)

In SwinIR, RSTB forms the deep feature extraction stage for image super-resolution, denoising, and artifact reduction. In TransEM, a single RSTB with shallow and deep feature fusion acts as a learned regularizer within an iteratively unrolled EM reconstruction algorithm for PET. SwinFuse uses stacked RSTBs as a global feature encoder to aggregate context and modality-specific patterns for infrared-visible fusion.
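
The hyperparameters in the table map naturally onto a small configuration object. The sketch below is illustrative only: the class `RSTBConfig` and its field names are hypothetical, and the values are the SwinIR row of the table (window size 8, 6 RSTBs of 6 STLs, 6 heads, $C=180$ or $C=60$).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RSTBConfig:
    window_size: int   # M
    num_rstb: int      # number of stacked RSTBs
    stl_per_rstb: int  # L, Swin Transformer Layers per RSTB
    dim: int           # feature channels C
    num_heads: int     # attention heads

    @property
    def head_dim(self):
        # Channels are split evenly across heads.
        if self.dim % self.num_heads:
            raise ValueError("dim must be divisible by num_heads")
        return self.dim // self.num_heads

    @property
    def total_stl(self):
        return self.num_rstb * self.stl_per_rstb

# Values taken from the SwinIR row of the table above.
swinir_wide = RSTBConfig(window_size=8, num_rstb=6, stl_per_rstb=6, dim=180, num_heads=6)
swinir_narrow = RSTBConfig(window_size=8, num_rstb=6, stl_per_rstb=6, dim=60, num_heads=6)

print(swinir_wide.head_dim, swinir_wide.total_stl)  # 30 36
print(swinir_narrow.head_dim)                       # 10
```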

6. Theoretical and Practical Implications

The structural advantages of RSTB include:

Scalable depth: Residual pathways support deep stacking (dozens of STLs across several RSTBs) while limiting gradient attenuation.
Efficient attention: Windowed and shifted-window attention balance locality and wider context, keeping complexity at $O(HW\,M^2)$ and making large images tractable.
Translation-awareness: End-block convolutions compensate for the lack of spatial equivariance in transformers.
Task adaptability: Hyperparameterization (window size, head count, channel width, stack depth) tailors the block for low-level restoration, fusion, or inverse problem constraints.

The recurrent presence of RSTB in diverse state-of-the-art models corroborates its effectiveness in settings previously dominated by CNNs. These models match or surpass prior state-of-the-art restoration quality with fewer parameters (up to 67% reduction reported in SwinIR) and generalize across domains (Liang et al., 2021, Hu et al., 2022, Wang et al., 2022).

7. Summary and Comparative Analysis

RSTB synthesizes the merits of hierarchical vision transformers and residual convolutional networks, providing a modular, reproducible, and extensible template for a wide array of vision and imaging tasks. Empirical ablations on the number of RSTBs, window size, and attention heads document a trade-off landscape between computational cost and global information integration (Liang et al., 2021, Hu et al., 2022, Wang et al., 2022). Its integration into leading models for image restoration, PET reconstruction, and multi-modal fusion exemplifies its generality and impact on modern attention-based architectures.

References

Liang et al. (2021). SwinIR: Image Restoration Using Swin Transformer.

Hu et al. (2022). TransEM: Residual Swin-Transformer based regularized PET image reconstruction.

Wang et al. (2022). SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images.
