
Hierarchical Swin Transformer Encoder

Updated 22 January 2026
  • Hierarchical Swin Transformer Encoder is a multi-resolution transformer architecture that jointly captures local details and long-range dependencies using parallel CNN-based branches and transformer blocks.
  • It employs Residual Swin Transformer Blocks with shifted window self-attention to enhance features, ensuring robust low-level artifact removal and effective global context modeling.
  • Cross-scale feature fusion via PixelShuffle and adjacent-scale concatenation delivers state-of-the-art performance in compressed image super-resolution, as evidenced by improvements in PSNR and SSIM.

A Hierarchical Swin Transformer Encoder is a multi-resolution transformer architecture designed to enable efficient, expressive feature extraction for high-resolution visual restoration tasks, particularly compressed image super-resolution. It uses a multi-branch, multi-scale pathway structure to jointly capture local details and long-range dependencies at several spatial resolutions, leveraging a series of Swin Transformer blocks integrated with CNN-based downsampling, hierarchical feature fusion, and cross-scale interaction mechanisms (Li et al., 2022).

1. Multi-Branch Hierarchical Architecture

The encoder begins by extracting three parallel feature branches at decreasing spatial resolutions from a low-resolution compressed image $I_l \in \mathbb{R}^{H \times W \times C}$:

  • High-scale branch: $F_h = \mathrm{Conv}_{7 \times 7,\, s=1,\, p=3}(I_l) \in \mathbb{R}^{H \times W \times 60}$
  • Middle-scale branch: $F_m = \mathrm{Conv}_{5 \times 5,\, s=2,\, p=2}(I_l) \in \mathbb{R}^{H/2 \times W/2 \times 60}$
  • Low-scale branch: $F_l = \mathrm{Conv}_{3 \times 3,\, s=2,\, p=1}(F_m) \in \mathbb{R}^{H/4 \times W/4 \times 60}$

Each branch thus processes the image at a distinct resolution, with 60 embedding channels per feature map. Convolutional strides and kernel sizes are selected to subsample the feature maps hierarchically.
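As a quick check of these shapes, the standard convolution output-size formula reproduces the three branch resolutions. This is a minimal sketch; the helper `conv_out_size` and the example resolution `H = W = 64` are illustrative, not from the paper:

```python
def conv_out_size(n: int, k: int, s: int, p: int) -> int:
    # Standard convolution output-size formula: floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

# Hypothetical input resolution; the actual H, W depend on the dataset.
H = W = 64

# High-scale branch: 7x7 conv, stride 1, padding 3 -> resolution preserved
Hh = conv_out_size(H, k=7, s=1, p=3)
# Middle-scale branch: 5x5 conv, stride 2, padding 2 -> halved
Hm = conv_out_size(H, k=5, s=2, p=2)
# Low-scale branch: 3x3 conv, stride 2, padding 1, applied to F_m -> quartered
Hl = conv_out_size(Hm, k=3, s=2, p=1)

print(Hh, Hm, Hl)  # 64 32 16
```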

Each scale-specific branch passes through a Feature Enhancement Module (FEM), comprising stacks of Residual Swin Transformer Blocks (RSTBs). The numbers of RSTBs per branch are set to $\{2, 4, 6\}$ for the low, middle, and high scales, respectively.

2. Swin Transformer Block Design and Residual Pathways

2.1 Residual Swin Transformer Block (RSTB)

Within a FEM, a single RSTB consists of $K = 6$ consecutive Swin Transformer Layers (STLs), followed by a $3 \times 3$ convolution and a skip connection:

$$\begin{aligned} &\text{for } i = 1, \ldots, K:\quad F_i = \mathrm{STL}(F_{i-1}) \\ &F_{\mathrm{out}} = \mathrm{Conv}_{3 \times 3}(F_K) + F_0 \end{aligned}$$
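The residual wiring of an RSTB can be sketched as follows. Here `stl` and `conv3x3` are toy stand-ins for the real Swin Transformer Layer and convolution, used only to show the loop and the long skip connection:

```python
import numpy as np

def rstb(F0, stl, conv3x3, K=6):
    """Residual Swin Transformer Block: K STLs, a 3x3 conv, and a long skip.

    `stl` and `conv3x3` are stand-ins for the real layers; only the
    residual wiring is demonstrated here.
    """
    F = F0
    for _ in range(K):        # F_i = STL(F_{i-1}), i = 1..K
        F = stl(F)
    return conv3x3(F) + F0    # F_out = Conv_{3x3}(F_K) + F_0

# Toy stand-ins to check the wiring (not real layers):
stl = lambda x: x + 1.0      # each "STL" adds 1
conv = lambda x: 2.0 * x     # the "conv" doubles
F0 = np.zeros((8, 8, 60))    # H x W x C feature map
out = rstb(F0, stl, conv)
print(out[0, 0, 0])  # 2 * (0 + 6) + 0 = 12.0
```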

2.2 Swin Transformer Layer (STL)

Each STL processes $X \in \mathbb{R}^{h \times w \times d}$ as follows:

$$\begin{aligned} Y &= \mathrm{W\text{-}MSA}(\mathrm{LN}(X)) + X \\ Z &= \mathrm{SW\text{-}MSA}(\mathrm{LN}(Y)) + Y \\ X' &= \mathrm{MLP}(\mathrm{LN}(Z)) + Z \end{aligned}$$

  • $\mathrm{W\text{-}MSA}$: window-based multi-head self-attention.
  • $\mathrm{SW\text{-}MSA}$: shifted-window MSA, using a cyclic shift and attention masking.
  • $\mathrm{MLP}$: feed-forward network with hidden ratio 4 and GELU nonlinearity.
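A minimal numpy sketch of this pre-norm residual structure follows. The two attention operators are passed in as placeholders, and the identity "attention" and random MLP weights are illustrative only:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm over the channel (last) dimension
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def stl(x, wmsa, swmsa, w1, w2):
    """Pre-norm STL step as written above; wmsa/swmsa are placeholders."""
    y = wmsa(layer_norm(x)) + x           # Y = W-MSA(LN(X)) + X
    z = swmsa(layer_norm(y)) + y          # Z = SW-MSA(LN(Y)) + Y
    h = gelu(layer_norm(z) @ w1) @ w2     # MLP with hidden ratio 4, GELU
    return h + z                          # X' = MLP(LN(Z)) + Z

# Toy check with identity "attention" and d = 60 channels:
rng = np.random.default_rng(0)
d = 60
x = rng.standard_normal((8, 8, d))
out = stl(x, wmsa=lambda t: t, swmsa=lambda t: t,
          w1=rng.standard_normal((d, 4 * d)) * 0.01,
          w2=rng.standard_normal((4 * d, d)) * 0.01)
print(out.shape)  # (8, 8, 60)
```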

2.3 Window-based Self-Attention (W-MSA) and Shifted Window MSA (SW-MSA)

Non-overlapping windows of size $M \times M$ (here $M = 8$) are used. Within each window, $X_w \in \mathbb{R}^{M^2 \times d}$ is projected to queries, keys, and values; $H \approx 6$ attention heads split $d = 60$ into $d/H \approx 10$ channels per head. Scaled dot-product attention with a learned relative position bias is used:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^T}{\sqrt{d/H}}\right) V$$

In SW-MSA, prior to windowing, features are cyclically shifted by $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$. A precomputed mask restricts attention to tokens within the same shifted window.
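The window partition and cyclic shift can be sketched in numpy as below. This simplified single-head version omits the relative position bias and the attention mask that the real model uses, so it is a structural sketch rather than a faithful implementation:

```python
import numpy as np

def window_partition(x, M):
    # (H, W, d) -> (num_windows, M*M, d), non-overlapping M x M windows
    H, W, d = x.shape
    x = x.reshape(H // M, M, W // M, M, d).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, d)

def window_reverse(w, M, H, W):
    # Inverse of window_partition
    d = w.shape[-1]
    w = w.reshape(H // M, W // M, M, M, d).transpose(0, 2, 1, 3, 4)
    return w.reshape(H, W, d)

def softmax(a):
    a = a - a.max(-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(-1, keepdims=True)

def sw_msa(x, Wq, Wk, Wv, M=8, shift=True):
    """Single-head shifted-window attention (no position bias or mask)."""
    H, W, d = x.shape
    if shift:  # cyclic shift by (M//2, M//2) before windowing
        x = np.roll(x, (-(M // 2), -(M // 2)), axis=(0, 1))
    w = window_partition(x, M)
    q, k, v = w @ Wq, w @ Wk, w @ Wv
    out = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d)) @ v
    out = window_reverse(out, M, H, W)
    if shift:  # undo the cyclic shift
        out = np.roll(out, (M // 2, M // 2), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
d = 60
x = rng.standard_normal((16, 16, d))
y = sw_msa(x, *(rng.standard_normal((d, d)) * 0.01 for _ in range(3)))
print(y.shape)  # (16, 16, 60)
```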

3. Cross-Scale Feature Fusion

Following enhancement, the outputs at coarser scales are upsampled using PixelShuffle (sub-pixel convolution, $\times 2$ upscaling) and concatenated channel-wise with the finer-scale features:

$$\begin{aligned} F'_m &= \mathrm{Conv}_{3 \times 3}(\mathrm{concat}[F_m, \mathrm{PixelShuffle}_2(F^*_l)]) \\ F'_h &= \mathrm{Conv}_{3 \times 3}(\mathrm{concat}[F_h, \mathrm{PixelShuffle}_2(F^*_m)]) \end{aligned}$$

After final refinement at the highest scale, two further sub-pixel upsampling layers (PixelShuffle $\times 2$ each) followed by a $3 \times 3$ convolution reconstruct a $4\times$ super-resolved high-resolution output.

Fusion is restricted to adjacent scales only (i.e., no direct low-to-high long skip).
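The PixelShuffle rearrangement itself can be sketched in numpy (channels-first layout, following the usual sub-pixel convolution convention; the toy input is illustrative):

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Sub-pixel upsampling: (C * r^2, H, W) -> (C, H * r, W * r)."""
    Cr2, H, W = x.shape
    C = Cr2 // (r * r)
    x = x.reshape(C, r, r, H, W)    # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)  # (C, H, r, W, r): interleave into space
    return x.reshape(C, H * r, W * r)

# Toy example: 4 channels at 2x2 spatial -> 1 channel at 4x4 spatial
x = np.arange(16).reshape(4, 2, 2)
y = pixel_shuffle(x, r=2)
print(y.shape)   # (1, 4, 4)
print(y[0, 0])   # [0 4 1 5]
```

Each output $2 \times 2$ patch is assembled from the four channel slices at the corresponding input pixel, which is why channel width must be a multiple of $r^2$.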

4. Mathematical Properties and Implementation

  • Strided convolution implements patch merging (downsampling), maintaining a channel width of 60 at all scales.
  • Window size $M = 8$, heads $H \approx 6$, token dimension $d = 60$.
  • PixelShuffle is used throughout for spatial upsampling, rearranging features from channel to space.
  • Unlike canonical Swin, the channel dimension is fixed across scales; only the spatial dimension shrinks via stride.

5. Distinctive Design Choices and Comparative Analysis

  • CNN-based Multi-Scale Branching: Rather than Swin’s patch embedding plus four-level patch merging (with channel doubling per merge), HST (the Hierarchical Swin Transformer) extracts three parallel resolutions using fixed-width convolutions, fusing them progressively upward.
  • Residual Swin Transformer per Scale: Local inductive bias (convolutions) is exploited for robust low-level feature extraction, while global self-attention (shifted windows) enables context modeling within each scale.
  • PixelShuffle Fusion: Finer-scale branches receive contextual cues from coarser-scale branches, increasing effective receptive field without excessively deepening any single branch.
  • Performance Gains: In ablation, a three-branch HST encoder exceeds one-branch SwinIR-style backbones by up to +0.15 dB PSNR on JPEG Q=40, with consistent SSIM improvements across compression quality levels.

6. Practical Usage and Performance

The HST encoder, in conjunction with super-resolution pretraining, reached 5th place (PSNR: 23.51 dB) in the AIM 2022 low-quality compressed image super-resolution challenge. The combination of hierarchical convolutional and transformer-based enhancement, narrow-channel multi-scale design, and cross-scale fusion delivers state-of-the-art compressed image restoration with robust generalization and competitive computational efficiency (Li et al., 2022).

7. Broader Context and Implications

The architectural innovations in the Hierarchical Swin Transformer Encoder highlight the trend toward hybrid, multi-scale transformer backbones for challenging low-level vision tasks. By integrating convolutional and attention mechanisms in a hierarchical schema, the encoder realizes both strong locality (necessary for artifact removal) and scalable non-local modeling (critical for reconstruction from highly compressed signals). The empirical evidence underlines the utility of such hybrid designs in domains where purely flat or single-branch transformers struggle to jointly handle heterogeneous, multi-scale distortions. The HST approach is foundational for extensions in other domains—such as video or multimodal hierarchical transformers—where analogous multi-scale fusion and residual self-attention modules have appeared.
