Multi-size Swin Transformer Block (MSTB)
- Multi-size Swin Transformer Block (MSTB) is a specialized architecture that enhances vision Transformers with parallel multi-scale self-attention for efficient feature extraction.
- It employs multiple Swin-style MSA modules with varied window sizes and shifts to capture diverse spatial scales while reducing computational cost.
- Integrated in frameworks like MSwinSR and MS-UNet, MSTB improves model performance in tasks such as single image super-resolution and wide-angle portrait correction.
The Multi-size Swin Transformer Block (MSTB) is a specialized architecture developed to enhance the representational efficiency and computational performance of vision Transformers, with principal applications in single image super-resolution and wide-angle portrait correction. MSTB leverages multi-window, parallel, and multi-scale self-attention mechanisms based on Swin Transformer principles to capture information at diverse spatial scales within lightweight models. Two prominent MSTB variants are introduced in the contexts of MSwinSR (Zhang et al., 2022) (image super-resolution) and MS-UNet (Zhu et al., 2021) (portrait correction), each exhibiting distinct configurations tailored to its respective domain.
1. Multi-size Swin Transformer Block: Core Architectural Principles
The MSTB, as instantiated in MSwinSR (Zhang et al., 2022), operates on an input tensor $X \in \mathbb{R}^{H \times W \times C}$. It simultaneously applies four parallel Swin-style Multi-Head Self-Attention (MSA) modules, each differing in window size and shift configuration:
- W-MSA: regular window, size $M \times M$, no shift.
- SW-MSA: shifted window (by $\lfloor M/2 \rfloor$ pixels), same size $M \times M$.
- W-MSA-½: regular window, size $\frac{M}{2} \times \frac{M}{2}$.
- SW-MSA-½: shifted window (by $\lfloor M/4 \rfloor$), size $\frac{M}{2} \times \frac{M}{2}$.
Each path begins with LayerNorm, then computes MSA, and concludes with a residual addition of $X$. The four outputs are concatenated along the channel dimension ($C \to 4C$), followed by an additional LayerNorm and a compact MLP. The MLP reduces the channel dimension in two steps ($4C \to 2C \to C$), applying GELU activation after each linear layer. The MLP output is finally added back to $X$ as a residual. This architecture enhances width-wise expressivity by fusing multi-scale contexts via parallel processing.
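The four-branch layout described above can be sketched in PyTorch. This is a minimal illustration, not the authors' code: the per-branch window partitioning and shifting are elided, and a plain `nn.MultiheadAttention` stands in for the Swin-style windowed MSA.

```python
import torch
import torch.nn as nn

class MSTB(nn.Module):
    """Sketch: four parallel (LayerNorm, MSA, residual) branches -> concat -> shared MLP."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        # One (LayerNorm, MSA) pair per branch; in the real block the branches
        # differ only in window size/shift, which we elide here.
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(4)
        )
        self.post_norm = nn.LayerNorm(4 * dim)
        # Shared MLP reduces channels in two GELU-activated steps: 4C -> 2C -> C.
        self.mlp = nn.Sequential(
            nn.Linear(4 * dim, 2 * dim), nn.GELU(),
            nn.Linear(2 * dim, dim), nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C)
        outs = []
        for norm, attn in zip(self.norms, self.attns):
            y = norm(x)
            y, _ = attn(y, y, y)   # stand-in for (shifted) window MSA
            outs.append(x + y)     # per-branch residual addition of X
        cat = torch.cat(outs, dim=-1)              # (B, N, 4C)
        return x + self.mlp(self.post_norm(cat))   # final residual back to X
```

Note how the channel width temporarily quadruples at the concatenation and is restored by the shared MLP, so the block preserves the input shape for stacking.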
2. Self-Attention Mechanism and Multi-size Operation
Within each parallel MSA branch, the feature tensor (pre-LN) is linearly projected into queries, keys, and values $Q, K, V$ using weight matrices $W_Q, W_K, W_V \in \mathbb{R}^{C \times C}$. For $h$ heads, each head operates on $d$-dimensional $Q, K, V$ with $d = C/h$. Attention in each window is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B\right)V,$$
where $B$ is a learnable relative position bias per head and branch. The attended representations from all heads are concatenated and projected via an output matrix $W_O \in \mathbb{R}^{C \times C}$. Multi-size operation is realized primarily by using different window partitions (i.e., $M \times M$ and $\frac{M}{2} \times \frac{M}{2}$, with and without shifts), enabling the block to concurrently process information at varied receptive-field scales.
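A minimal functional form of the per-window attention, assuming the shapes implied above (tokens are the $M^2$ positions of one window, and the bias is shared across windows):

```python
import torch

def window_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     bias: torch.Tensor) -> torch.Tensor:
    """Compute softmax(Q K^T / sqrt(d) + B) V per window and head.

    q, k, v: (num_windows, heads, tokens, d); bias: (heads, tokens, tokens).
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5 + bias  # bias broadcasts over windows
    return torch.softmax(logits, dim=-1) @ v
```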
3. Positional Encoding, Normalization, and MLP Design
Positional encoding in MSTB strictly follows the Swin Transformer scheme. Each branch maintains a learnable bias table $\hat{B}$ of size $(2M-1) \times (2M-1)$ per head for the larger windows, and $(M-1) \times (M-1)$ for the smaller $\frac{M}{2} \times \frac{M}{2}$ windows. During attention, the appropriate slice of $\hat{B}$ is indexed by the relative coordinates of token pairs within the window.
Layer normalization is applied before every MSA in each branch and again after concatenation, before the shared MLP. The MLP employs two linear layers with GELU activations, performing the channel reduction described above. These design choices ensure the block output retains the original spatial and channel dimensions for the subsequent residual addition.
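The relative-coordinate indexing into the bias table can be illustrated with the standard Swin-style construction (not taken from either paper's code): every ordered pair of tokens in an $M \times M$ window maps to one of the $(2M-1)^2$ table entries.

```python
import torch

def relative_position_index(M: int) -> torch.Tensor:
    """Index into a (2M-1)^2 bias table for every token pair in an MxM window."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(M), torch.arange(M), indexing="ij")).flatten(1)  # (2, M*M)
    rel = coords[:, :, None] - coords[:, None, :]   # pairwise offsets, (2, M*M, M*M)
    rel = rel.permute(1, 2, 0) + (M - 1)            # shift each axis into [0, 2M-2]
    return rel[..., 0] * (2 * M - 1) + rel[..., 1]  # flat table index, (M*M, M*M)
```

Setting $M \to M/2$ in this construction gives a table of side $2\cdot\frac{M}{2}-1 = M-1$, matching the smaller-window table size stated above.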
4. Integration into Network Architectures and Computational Impact
In the MSwinSR framework (Zhang et al., 2022), MSTBs are organized into stages that replace the Residual Swin Transformer Blocks (RSTBs) of SwinIR. Each stage contains a fixed number of MSTBs, followed by a convolution and a long-skip residual connection. The default model utilizes three stages with depth vector $(2, 2, 2)$, totaling six MSTBs. Since each MSTB holds four parallel MSAs, this arrangement yields a total of 24 MSA calls, comparable to SwinIR's six Swin layers per RSTB across four RSTBs.
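The stage layout can be sketched as follows (hypothetical names; any module implementing the MSTB interface can be plugged in via the factory argument):

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """Sketch of an MSwinSR-style stage: several MSTBs, a 3x3 conv, long skip."""
    def __init__(self, dim: int, depth: int, make_block):
        super().__init__()
        self.blocks = nn.ModuleList(make_block() for _ in range(depth))
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        y = x
        for blk in self.blocks:
            y = blk(y)           # each MSTB (stand-in module here)
        return x + self.conv(y)  # long-skip residual around the whole stage
```

With the default depth vector $(2,2,2)$, three such stages would be chained, each wrapping two MSTBs.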
Crucially, placing four parallel MSAs before a single, shared MLP reduces the MLP count fourfold relative to one MLP per attention layer, leading to substantial parameter and FLOP savings. On the CelebA super-resolution benchmark, MSwinSR with MSTB reduces parameters by roughly 31% (897.2K to 621.9K) and FLOPs by about 10% (4.187G to 3.771G), with a 0.07 dB gain in PSNR compared to SwinIR (see Table 1).
| Model Variant | Parameter Count | Multi-adds | PSNR (dB) |
|---|---|---|---|
| SwinIR (4×6 layers) | 897,200 | 4.187G | 29.47 |
| MSwinSR 2,2,2 | 621,900 | 3.771G | 29.54 |
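The parameter saving from sharing one channel-reducing MLP across four branches can be checked with simple arithmetic. The channel width below is illustrative (not a figure from the paper), and the per-branch alternative uses Swin's conventional $4\times$-expanding MLP for comparison.

```python
def mlp_params(dims):
    """Parameter count of a chain of Linear layers (weights + biases)."""
    return sum(i * o + o for i, o in zip(dims, dims[1:]))

C = 60  # illustrative channel width, not MSwinSR's published value

# One shared MLP after concatenation: 4C -> 2C -> C.
shared = mlp_params([4 * C, 2 * C, C])

# Swin-style alternative: one 4x-expanding MLP (C -> 4C -> C) per branch.
per_branch = 4 * mlp_params([C, 4 * C, C])

assert shared < per_branch  # the shared reducing MLP is markedly cheaper
```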
5. Multi-Scale Swin Transformer Block in Portrait Correction Networks
In wide-angle portrait correction (Zhu et al., 2021), MSTB denotes a related but distinct block: the Multi-Scale Swin Transformer Block. Here, each MSTB comprises two serial “sub-blocks” that alternate between normal and shifted window partitioning, window-based MSA, residual plus LN, and an MLP with GELU activation. Importantly, each sub-block fuses two query/key/value branches:
- A global window branch derives $Q$, $K$, and $V$ directly from the windowed (possibly shifted) input.
- A local Dense Connection Module (DCM) branch extracts $K$ and $V$ through a small residual module employing three depthwise separable convolutions with varied dilation rates, then windows and flattens the output.
The attention function fuses these scales by computing attention globally over windows while incorporating local structural priors through the DCM-based keys and values. Alternating shifted and regular windows ensures cross-window information percolation and alleviates blocking artifacts. The MLP maintains a two-layer, channel-expanding structure (commonly a $4\times$ expansion). Per-block parameter and FLOP complexity scales quadratically with the channel dimension, i.e., $O(C^2)$.
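A hedged sketch of the DCM branch: the class name is hypothetical, and the dilation rates (1, 2, 3) are assumed for illustration since the exact values are not given here.

```python
import torch
import torch.nn as nn

class DCMBranch(nn.Module):
    """Hypothetical Dense Connection Module: three dilated depthwise-separable
    convolutions with residual accumulation; output is later windowed and
    flattened to serve as local keys/values (K, V)."""
    def __init__(self, dim: int, dilations=(1, 2, 3)):  # dilation rates assumed
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Sequential(
                # Depthwise 3x3 conv; padding=d keeps the spatial size constant.
                nn.Conv2d(dim, dim, 3, padding=d, dilation=d, groups=dim),
                nn.Conv2d(dim, dim, 1),  # pointwise conv completes the separable pair
            )
            for d in dilations
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        out = x
        for conv in self.convs:
            out = out + conv(out)  # residual accumulation of local context
        return out
```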
6. Hyperparameters and Ablation Findings
In the MSwinSR MSTB (Zhang et al., 2022), the default settings are:
- Embedding dimension $C$
- Number of heads $h$, so each head has dimension $d = C/h$
- Window sizes: $M \times M$ for the "full" branches; $\frac{M}{2} \times \frac{M}{2}$ for the "half" branches
- MLP path: $4C \to 2C \to C$
For depth ablations with a constant total of 24 attention calls per model, the default $(2,2,2)$ staging yields superior PSNR/parameter trade-offs over alternative depth distributions. In the portrait-correction MSTB (Zhu et al., 2021), architectural choices are less explicit; values for key hyperparameters (window size, channel count, head number) are not specified, but the design follows typical Swin-T-sized backbones. Each MSTB is preceded and followed by standard Swin Transformer network elements.
7. Comparative Analysis and Applicability
MSTBs, across both MSwinSR and MS-UNet, advance Transformer-based vision models by concurrently mixing multi-scale (and, in MS-UNet, local-global) feature extraction within single blocks. By enabling parallel windowed self-attention computations over multiple partitionings, they grant the network enhanced feature aggregation without the parameter or computational penalties of stacking many deep blocks. The MSTB design results in more parameter- and computation-efficient models with either improved or equivalent reconstruction metrics relative to conventional Swin Transformer architectures (Zhang et al., 2022, Zhu et al., 2021). These advantages suggest MSTBs are well suited for real-time or resource-constrained vision tasks requiring rich hierarchical representation synthesis at modest computational cost.