Multi-size Swin Transformer Block (MSTB)
- Multi-size Swin Transformer Block (MSTB) is a specialized architecture that enhances vision Transformers with parallel multi-scale self-attention for efficient feature extraction.
- It employs multiple Swin-style MSA modules with varied window sizes and shifts to capture diverse spatial scales while reducing computational cost.
- Integrated in frameworks like MSwinSR and MS-UNet, MSTB improves model performance in tasks such as single image super-resolution and wide-angle portrait correction.
The Multi-size Swin Transformer Block (MSTB) is a specialized architecture developed to enhance the representational efficiency and computational performance of vision Transformers, with principal applications in single image super-resolution and wide-angle portrait correction. MSTB leverages multi-window, parallel, and multi-scale self-attention mechanisms based on Swin Transformer principles to capture information at diverse spatial scales within lightweight models. Two prominent MSTB variants are introduced in the contexts of MSwinSR (Zhang et al., 2022) (image super-resolution) and MS-UNet (Zhu et al., 2021) (portrait correction), each exhibiting distinct configurations tailored to its respective domain.
1. Multi-size Swin Transformer Block: Core Architectural Principles
The MSTB, as instantiated in MSwinSR (Zhang et al., 2022), operates on an input tensor $X \in \mathbb{R}^{H \times W \times C}$. It simultaneously applies four parallel Swin-style Multi-Head Self-Attention (MSA) modules, each differing in window size and shift configuration:
- W-MSA: regular window, size $M \times M$, no shift.
- SW-MSA: shifted window (by $\lfloor M/2 \rfloor$ pixels), same size $M \times M$.
- W-MSA-½: regular window, size $\frac{M}{2} \times \frac{M}{2}$.
- SW-MSA-½: shifted window (by $\lfloor M/4 \rfloor$), size $\frac{M}{2} \times \frac{M}{2}$.
Each path begins with LayerNorm, then computes MSA, and concludes with a residual addition of $X$. The four outputs are concatenated along the channel dimension ($C \to 4C$), followed by an additional LayerNorm and a compact MLP. The MLP reduces the channel dimension in two steps ($4C \to 2C \to C$), applying GELU activation after each linear layer. The MLP output is finally added back to $X$ as a residual. This architecture enhances width-wise expressivity by fusing multi-scale contexts via parallel processing.
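The four-branch layout described above can be sketched in PyTorch. This is a minimal illustration, not the authors' code: the per-branch window partitioning and shifting are elided, and a plain `nn.MultiheadAttention` stands in for the Swin-style windowed MSA.

```python
import torch
import torch.nn as nn

class MSTB(nn.Module):
    """Sketch: four parallel (LayerNorm, MSA, residual) branches -> concat -> shared MLP."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        # One (LayerNorm, MSA) pair per branch; in the real block the branches
        # differ only in window size/shift, which we elide here.
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(4)
        )
        self.post_norm = nn.LayerNorm(4 * dim)
        # Shared MLP reduces channels in two GELU-activated steps: 4C -> 2C -> C.
        self.mlp = nn.Sequential(
            nn.Linear(4 * dim, 2 * dim), nn.GELU(),
            nn.Linear(2 * dim, dim), nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C)
        outs = []
        for norm, attn in zip(self.norms, self.attns):
            y = norm(x)
            y, _ = attn(y, y, y)   # stand-in for (shifted) window MSA
            outs.append(x + y)     # per-branch residual addition of X
        cat = torch.cat(outs, dim=-1)              # (B, N, 4C)
        return x + self.mlp(self.post_norm(cat))   # final residual back to X
```

Note how the channel width temporarily quadruples at the concatenation and is restored by the shared MLP, so the block preserves the input shape for stacking.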
2. Self-Attention Mechanism and Multi-size Operation
Within each parallel MSA branch, the feature tensor (pre-LN) is linearly projected into queries, keys, and values $Q, K, V$ using weight matrices $W_Q, W_K, W_V \in \mathbb{R}^{C \times C}$. For $h$ heads, each head operates on $d$-dimensional $Q, K, V$ with $d = C/h$. Attention in each window is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B\right)V,$$
where $B$ is a learnable relative position bias per head and branch. The attended representations from all heads are concatenated and projected via an output matrix $W_O \in \mathbb{R}^{C \times C}$. Multi-size operation is realized primarily by using different window partitions (i.e., $M \times M$ and $\frac{M}{2} \times \frac{M}{2}$, with and without shifts), enabling the block to concurrently process information at varied receptive-field scales.
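A minimal functional form of the per-window attention, assuming the shapes implied above (tokens are the $M^2$ positions of one window, and the bias is shared across windows):

```python
import torch

def window_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     bias: torch.Tensor) -> torch.Tensor:
    """Compute softmax(Q K^T / sqrt(d) + B) V per window and head.

    q, k, v: (num_windows, heads, tokens, d); bias: (heads, tokens, tokens).
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5 + bias  # bias broadcasts over windows
    return torch.softmax(logits, dim=-1) @ v
```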
3. Positional Encoding, Normalization, and MLP Design
Positional encoding in MSTB strictly follows the Swin Transformer scheme. Each branch maintains a learnable bias table $\hat{B}$ of size $(2M-1) \times (2M-1)$ per head for the larger windows, and $(M-1) \times (M-1)$ for the smaller $\frac{M}{2} \times \frac{M}{2}$ windows. During attention, the appropriate slice of $\hat{B}$ is indexed by the relative coordinates of token pairs within the window.
Layer normalization is applied before every MSA in each branch and again after concatenation, before the shared MLP. The MLP employs two linear layers with GELU activations, performing the channel reduction described above. These design choices ensure the block output retains the original spatial and channel dimensions for the subsequent residual addition.
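The relative-coordinate indexing into the bias table can be illustrated with the standard Swin-style construction (not taken from either paper's code): every ordered pair of tokens in an $M \times M$ window maps to one of the $(2M-1)^2$ table entries.

```python
import torch

def relative_position_index(M: int) -> torch.Tensor:
    """Index into a (2M-1)^2 bias table for every token pair in an MxM window."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(M), torch.arange(M), indexing="ij")).flatten(1)  # (2, M*M)
    rel = coords[:, :, None] - coords[:, None, :]   # pairwise offsets, (2, M*M, M*M)
    rel = rel.permute(1, 2, 0) + (M - 1)            # shift each axis into [0, 2M-2]
    return rel[..., 0] * (2 * M - 1) + rel[..., 1]  # flat table index, (M*M, M*M)
```

Setting $M \to M/2$ in this construction gives a table of side $2\cdot\frac{M}{2}-1 = M-1$, matching the smaller-window table size stated above.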
4. Integration into Network Architectures and Computational Impact
In the MSwinSR framework (Zhang et al., 2022), MSTBs are organized into stages that replace the Residual Swin Transformer Blocks (RSTBs) of SwinIR. Each stage contains a fixed number of MSTBs, followed by a convolution and a long-skip residual connection. The default model utilizes three stages with depth vector $(2, 2, 2)$, totaling six MSTBs. Since each MSTB holds four parallel MSAs, this arrangement yields a total of 24 MSA calls, comparable to SwinIR's six Swin layers per RSTB across four RSTBs.
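The stage layout can be sketched as follows (hypothetical names; any module implementing the MSTB interface can be plugged in via the factory argument):

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """Sketch of an MSwinSR-style stage: several MSTBs, a 3x3 conv, long skip."""
    def __init__(self, dim: int, depth: int, make_block):
        super().__init__()
        self.blocks = nn.ModuleList(make_block() for _ in range(depth))
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        y = x
        for blk in self.blocks:
            y = blk(y)           # each MSTB (stand-in module here)
        return x + self.conv(y)  # long-skip residual around the whole stage
```

With the default depth vector $(2,2,2)$, three such stages would be chained, each wrapping two MSTBs.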
Crucially, placing four parallel MSAs before a single, shared MLP reduces the MLP count fourfold relative to one MLP per attention layer, leading to substantial parameter and FLOP savings. On the CelebA super-resolution benchmark, MSwinSR with MSTB reduces parameters by roughly 31% (897.2K to 621.9K) and FLOPs by about 10% (4.187G to 3.771G), with a 0.07 dB gain in PSNR compared to SwinIR (see Table 1).
| Model Variant | Parameter Count | Multi-adds | PSNR (dB) |
|---|---|---|---|
| SwinIR (4×6 layers) | 897,200 | 4.187G | 29.47 |
| MSwinSR 2,2,2 | 621,900 | 3.771G | 29.54 |
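The parameter saving from sharing one channel-reducing MLP across four branches can be checked with simple arithmetic. The channel width below is illustrative (not a figure from the paper), and the per-branch alternative uses Swin's conventional $4\times$-expanding MLP for comparison.

```python
def mlp_params(dims):
    """Parameter count of a chain of Linear layers (weights + biases)."""
    return sum(i * o + o for i, o in zip(dims, dims[1:]))

C = 60  # illustrative channel width, not MSwinSR's published value

# One shared MLP after concatenation: 4C -> 2C -> C.
shared = mlp_params([4 * C, 2 * C, C])

# Swin-style alternative: one 4x-expanding MLP (C -> 4C -> C) per branch.
per_branch = 4 * mlp_params([C, 4 * C, C])

assert shared < per_branch  # the shared reducing MLP is markedly cheaper
```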
5. Multi-Scale Swin Transformer Block in Portrait Correction Networks
In wide-angle portrait correction (Zhu et al., 2021), MSTB denotes a related but distinct block: the Multi-Scale Swin Transformer Block. Here, each MSTB comprises two serial “sub-blocks” that alternate between normal and shifted window partitioning, window-based MSA, residual plus LN, and an MLP with GELU activation. Importantly, each sub-block fuses two query/key/value branches:
- A global window branch derives $Q$, $K$, and $V$ directly from the windowed (possibly shifted) input.
- A local Dense Connection Module (DCM) branch extracts $K$ and $V$ through a small residual module employing three depthwise separable convolutions with varied dilation rates, then windows and flattens the output.
The attention function fuses these scales by computing attention globally over windows while incorporating local structural priors through the DCM-based keys and values. Alternating shifted and regular windows ensures cross-window information percolation and alleviates blocking artifacts. The MLP maintains a two-layer, channel-expanding structure (commonly a $4\times$ expansion). Per-block parameter and FLOP complexity scales quadratically with the channel dimension, i.e., $O(C^2)$.
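A hedged sketch of the DCM branch: the class name is hypothetical, and the dilation rates (1, 2, 3) are assumed for illustration since the exact values are not given here.

```python
import torch
import torch.nn as nn

class DCMBranch(nn.Module):
    """Hypothetical Dense Connection Module: three dilated depthwise-separable
    convolutions with residual accumulation; output is later windowed and
    flattened to serve as local keys/values (K, V)."""
    def __init__(self, dim: int, dilations=(1, 2, 3)):  # dilation rates assumed
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Sequential(
                # Depthwise 3x3 conv; padding=d keeps the spatial size constant.
                nn.Conv2d(dim, dim, 3, padding=d, dilation=d, groups=dim),
                nn.Conv2d(dim, dim, 1),  # pointwise conv completes the separable pair
            )
            for d in dilations
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        out = x
        for conv in self.convs:
            out = out + conv(out)  # residual accumulation of local context
        return out
```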
6. Hyperparameters and Ablation Findings
In the MSwinSR MSTB (Zhang et al., 2022), the default settings are:
- Embedding dimension $C$
- Number of heads $h$, so each head has dimension $d = C/h$
- Window sizes: $M \times M$ for the "full" branches; $\frac{M}{2} \times \frac{M}{2}$ for the "half" branches
- MLP path: $4C \to 2C \to C$
For depth ablations with a constant total of 24 attention calls per model, the default $(2,2,2)$ staging yields superior PSNR/parameter trade-offs over alternative depth distributions. In the portrait-correction MSTB (Zhu et al., 2021), architectural choices are less explicit; values for key hyperparameters (window size, channel count, head number) are not specified, but the design follows typical Swin-T-sized backbones. Each MSTB is preceded and followed by standard Swin Transformer network elements.
7. Comparative Analysis and Applicability
MSTBs, across both MSwinSR and MS-UNet, advance Transformer-based vision models by concurrently mixing multi-scale (and, in MS-UNet, local-global) feature extraction within single blocks. By enabling parallel windowed self-attention computations over multiple partitionings, they grant the network enhanced feature aggregation without the parameter or computational penalties of stacking many deep blocks. The MSTB design results in more parameter- and computation-efficient models with either improved or equivalent reconstruction metrics relative to conventional Swin Transformer architectures (Zhang et al., 2022, Zhu et al., 2021). These advantages suggest MSTBs are well suited for real-time or resource-constrained vision tasks requiring rich hierarchical representation synthesis at modest computational cost.