
Hierarchical Swin Transformer Encoder

Updated 22 January 2026
  • Hierarchical Swin Transformer Encoder is a multi-resolution transformer architecture that jointly captures local details and long-range dependencies using parallel CNN-based branches and transformer blocks.
  • It employs Residual Swin Transformer Blocks with shifted window self-attention to enhance features, ensuring robust low-level artifact removal and effective global context modeling.
  • Cross-scale feature fusion via PixelShuffle and adjacent-scale concatenation delivers state-of-the-art performance in compressed image super-resolution, as evidenced by improvements in PSNR and SSIM.

A Hierarchical Swin Transformer Encoder is a multi-resolution transformer architecture designed to enable efficient, expressive feature extraction for high-resolution visual restoration tasks, particularly compressed image super-resolution. It uses a multi-branch, multi-scale pathway structure to jointly capture local details and long-range dependencies at several spatial resolutions, leveraging a series of Swin Transformer blocks integrated with CNN-based downsampling, hierarchical feature fusion, and cross-scale interaction mechanisms (Li et al., 2022).

1. Multi-Branch Hierarchical Architecture

The encoder begins by extracting three parallel feature branches at decreasing spatial resolutions from a low-resolution compressed image $I_l \in \mathbb{R}^{H \times W \times C}$:

  • High-scale branch: $F_h = \mathrm{Conv}_{7 \times 7,\, s=1,\, p=3}(I_l) \in \mathbb{R}^{H \times W \times 60}$
  • Middle-scale branch: $F_m = \mathrm{Conv}_{5 \times 5,\, s=2,\, p=2}(I_l) \in \mathbb{R}^{H/2 \times W/2 \times 60}$
  • Low-scale branch: $F_l = \mathrm{Conv}_{3 \times 3,\, s=2,\, p=1}(F_m) \in \mathbb{R}^{H/4 \times W/4 \times 60}$

Each branch thus processes the image at a distinct resolution, with 60 embedding channels per feature map. Convolutional strides and kernel sizes are selected to subsample the feature maps hierarchically.
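As a quick check of these shapes, the standard convolution output-size formula reproduces the three branch resolutions. This is a minimal sketch; the helper `conv_out_size` and the example resolution `H = W = 64` are illustrative, not from the paper:

```python
def conv_out_size(n: int, k: int, s: int, p: int) -> int:
    # Standard convolution output-size formula: floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

# Hypothetical input resolution; the actual H, W depend on the dataset.
H = W = 64

# High-scale branch: 7x7 conv, stride 1, padding 3 -> resolution preserved
Hh = conv_out_size(H, k=7, s=1, p=3)
# Middle-scale branch: 5x5 conv, stride 2, padding 2 -> halved
Hm = conv_out_size(H, k=5, s=2, p=2)
# Low-scale branch: 3x3 conv, stride 2, padding 1, applied to F_m -> quartered
Hl = conv_out_size(Hm, k=3, s=2, p=1)

print(Hh, Hm, Hl)  # 64 32 16
```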

Each scale-specific branch passes through a Feature Enhancement Module (FEM), comprising stacks of Residual Swin Transformer Blocks (RSTBs). The numbers of RSTBs per branch are set to $\{2, 4, 6\}$ for the low, middle, and high scales, respectively.

2. Swin Transformer Block Design and Residual Pathways

2.1 Residual Swin Transformer Block (RSTB)

Within a FEM, a single RSTB consists of $K = 6$ consecutive Swin Transformer Layers (STLs), followed by a $3 \times 3$ convolution and a skip connection:

$$\begin{aligned} &\text{for } i = 1, \ldots, K:\quad F_i = \mathrm{STL}(F_{i-1}) \\ &F_{\mathrm{out}} = \mathrm{Conv}_{3 \times 3}(F_K) + F_0 \end{aligned}$$
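The residual wiring of an RSTB can be sketched as follows. Here `stl` and `conv3x3` are toy stand-ins for the real Swin Transformer Layer and convolution, used only to show the loop and the long skip connection:

```python
import numpy as np

def rstb(F0, stl, conv3x3, K=6):
    """Residual Swin Transformer Block: K STLs, a 3x3 conv, and a long skip.

    `stl` and `conv3x3` are stand-ins for the real layers; only the
    residual wiring is demonstrated here.
    """
    F = F0
    for _ in range(K):        # F_i = STL(F_{i-1}), i = 1..K
        F = stl(F)
    return conv3x3(F) + F0    # F_out = Conv_{3x3}(F_K) + F_0

# Toy stand-ins to check the wiring (not real layers):
stl = lambda x: x + 1.0      # each "STL" adds 1
conv = lambda x: 2.0 * x     # the "conv" doubles
F0 = np.zeros((8, 8, 60))    # H x W x C feature map
out = rstb(F0, stl, conv)
print(out[0, 0, 0])  # 2 * (0 + 6) + 0 = 12.0
```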

2.2 Swin Transformer Layer (STL)

Each STL processes $X \in \mathbb{R}^{h \times w \times d}$ as follows:

$$\begin{aligned} Y &= \mathrm{W\text{-}MSA}(\mathrm{LN}(X)) + X \\ Z &= \mathrm{SW\text{-}MSA}(\mathrm{LN}(Y)) + Y \\ X' &= \mathrm{MLP}(\mathrm{LN}(Z)) + Z \end{aligned}$$

  • $\mathrm{W\text{-}MSA}$: window-based multi-head self-attention.
  • $\mathrm{SW\text{-}MSA}$: shifted-window MSA, using a cyclic shift and attention masking.
  • $\mathrm{MLP}$: feed-forward network with hidden ratio 4 and GELU nonlinearity.
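A minimal numpy sketch of this pre-norm residual structure follows. The two attention operators are passed in as placeholders, and the identity "attention" and random MLP weights are illustrative only:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm over the channel (last) dimension
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def stl(x, wmsa, swmsa, w1, w2):
    """Pre-norm STL step as written above; wmsa/swmsa are placeholders."""
    y = wmsa(layer_norm(x)) + x           # Y = W-MSA(LN(X)) + X
    z = swmsa(layer_norm(y)) + y          # Z = SW-MSA(LN(Y)) + Y
    h = gelu(layer_norm(z) @ w1) @ w2     # MLP with hidden ratio 4, GELU
    return h + z                          # X' = MLP(LN(Z)) + Z

# Toy check with identity "attention" and d = 60 channels:
rng = np.random.default_rng(0)
d = 60
x = rng.standard_normal((8, 8, d))
out = stl(x, wmsa=lambda t: t, swmsa=lambda t: t,
          w1=rng.standard_normal((d, 4 * d)) * 0.01,
          w2=rng.standard_normal((4 * d, d)) * 0.01)
print(out.shape)  # (8, 8, 60)
```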

2.3 Window-based Self-Attention (W-MSA) and Shifted Window MSA (SW-MSA)

Non-overlapping windows of size $M \times M$ (here $M = 8$) are used. Within each window, $X_w \in \mathbb{R}^{M^2 \times d}$ is projected to queries, keys, and values; $H \approx 6$ attention heads split $d = 60$ into $d/H \approx 10$ channels per head. Scaled dot-product attention with a learned relative position bias is used:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^T}{\sqrt{d/H}}\right) V$$

In SW-MSA, prior to windowing, features are cyclically shifted by $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$. A precomputed mask restricts attention to tokens within the same shifted window.
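The window partition and cyclic shift can be sketched in numpy as below. This simplified single-head version omits the relative position bias and the attention mask that the real model uses, so it is a structural sketch rather than a faithful implementation:

```python
import numpy as np

def window_partition(x, M):
    # (H, W, d) -> (num_windows, M*M, d), non-overlapping M x M windows
    H, W, d = x.shape
    x = x.reshape(H // M, M, W // M, M, d).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, d)

def window_reverse(w, M, H, W):
    # Inverse of window_partition
    d = w.shape[-1]
    w = w.reshape(H // M, W // M, M, M, d).transpose(0, 2, 1, 3, 4)
    return w.reshape(H, W, d)

def softmax(a):
    a = a - a.max(-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(-1, keepdims=True)

def sw_msa(x, Wq, Wk, Wv, M=8, shift=True):
    """Single-head shifted-window attention (no position bias or mask)."""
    H, W, d = x.shape
    if shift:  # cyclic shift by (M//2, M//2) before windowing
        x = np.roll(x, (-(M // 2), -(M // 2)), axis=(0, 1))
    w = window_partition(x, M)
    q, k, v = w @ Wq, w @ Wk, w @ Wv
    out = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d)) @ v
    out = window_reverse(out, M, H, W)
    if shift:  # undo the cyclic shift
        out = np.roll(out, (M // 2, M // 2), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
d = 60
x = rng.standard_normal((16, 16, d))
y = sw_msa(x, *(rng.standard_normal((d, d)) * 0.01 for _ in range(3)))
print(y.shape)  # (16, 16, 60)
```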

3. Cross-Scale Feature Fusion

Following enhancement, the outputs at coarser scales are upsampled using PixelShuffle (sub-pixel convolution, $\times 2$ upscaling) and concatenated channel-wise with the finer-scale features:

$$\begin{aligned} F'_m &= \mathrm{Conv}_{3 \times 3}(\mathrm{concat}[F_m, \mathrm{PixelShuffle}_2(F^*_l)]) \\ F'_h &= \mathrm{Conv}_{3 \times 3}(\mathrm{concat}[F_h, \mathrm{PixelShuffle}_2(F^*_m)]) \end{aligned}$$

After final refinement at the highest scale, two further sub-pixel upsampling layers (PixelShuffle $\times 2$ each) followed by a $3 \times 3$ convolution reconstruct a $4\times$ super-resolved high-resolution output.

Fusion is restricted to adjacent scales only (i.e., no direct low-to-high long skip).
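The PixelShuffle rearrangement itself can be sketched in numpy (channels-first layout, following the usual sub-pixel convolution convention; the toy input is illustrative):

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Sub-pixel upsampling: (C * r^2, H, W) -> (C, H * r, W * r)."""
    Cr2, H, W = x.shape
    C = Cr2 // (r * r)
    x = x.reshape(C, r, r, H, W)    # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)  # (C, H, r, W, r): interleave into space
    return x.reshape(C, H * r, W * r)

# Toy example: 4 channels at 2x2 spatial -> 1 channel at 4x4 spatial
x = np.arange(16).reshape(4, 2, 2)
y = pixel_shuffle(x, r=2)
print(y.shape)   # (1, 4, 4)
print(y[0, 0])   # [0 4 1 5]
```

Each output $2 \times 2$ patch is assembled from the four channel slices at the corresponding input pixel, which is why channel width must be a multiple of $r^2$.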

4. Mathematical Properties and Implementation

  • Strided convolution implements patch merging (downsampling), maintaining a channel width of 60 at all scales.
  • Window size $M = 8$, heads $H \approx 6$, token dimension $d = 60$.
  • PixelShuffle is used throughout for spatial upsampling, rearranging features from channel to space.
  • Unlike canonical Swin, the channel dimension is fixed across scales; only the spatial dimension shrinks via stride.

5. Distinctive Design Choices and Comparative Analysis

  • CNN-based Multi-Scale Branching: Rather than Swin’s patch embedding plus four-level patch merging (with channel doubling per merge), HST (the Hierarchical Swin Transformer) extracts three parallel resolutions using fixed-width convolutions, fusing them progressively upward.
  • Residual Swin Transformer per Scale: Local inductive bias (convolutions) is exploited for robust low-level feature extraction, while global self-attention (shifted windows) enables context modeling within each scale.
  • PixelShuffle Fusion: Finer-scale branches receive contextual cues from coarser-scale branches, increasing effective receptive field without excessively deepening any single branch.
  • Performance Gains: In ablation, a three-branch HST encoder exceeds one-branch SwinIR-style backbones by up to +0.15 dB PSNR on JPEG Q=40, with consistent SSIM improvements across compression quality levels.

6. Practical Usage and Performance

The HST encoder, in conjunction with super-resolution pretraining, reached 5th place (PSNR: 23.51 dB) in the AIM 2022 low-quality compressed image super-resolution challenge. The combination of hierarchical convolutional and transformer-based enhancement, narrow-channel multi-scale design, and cross-scale fusion delivers state-of-the-art compressed image restoration with robust generalization and competitive computational efficiency (Li et al., 2022).

7. Broader Context and Implications

The architectural innovations in the Hierarchical Swin Transformer Encoder highlight the trend toward hybrid, multi-scale transformer backbones for challenging low-level vision tasks. By integrating convolutional and attention mechanisms in a hierarchical schema, the encoder realizes both strong locality (necessary for artifact removal) and scalable non-local modeling (critical for reconstruction from highly compressed signals). The empirical evidence underlines the utility of such hybrid designs in domains where purely flat or single-branch transformers struggle to jointly handle heterogeneous, multi-scale distortions. The HST approach is foundational for extensions in other domains—such as video or multimodal hierarchical transformers—where analogous multi-scale fusion and residual self-attention modules have appeared.
