Customized Residual SwinTransformerV2
- The paper introduces RSwinV2, a modified Swin Transformer V2 that integrates residual connections and post-norm stabilization to improve efficiency and task adaptation.
- It employs windowed scalable self-attention with scaled cosine attention and IRB-based local module enhancements, ensuring improved convergence and performance.
- Empirical results across image compression, restoration, and medical classification reveal RSwinV2 reduces complexity and iterations while achieving superior accuracy.
The Customized Residual SwinTransformerV2 (RSwinV2) refers to a class of hierarchical vision transformer backbones that augment and modify the original Swin Transformer V2 (SwinV2) design for improved efficiency, stability, and task adaptation. Core technical advances include the integration of residual connections at multiple levels, post-norm stabilization, windowed scalable self-attention (often with Scaled Cosine Attention), and selective convolutional enhancements or local-invariant modules. RSwinV2 has been successfully deployed in domains such as learned image compression, compressed image super-resolution, and medical diagnosis from skin imagery, demonstrating gains in efficiency, convergence, and rate-distortion or classification performance relative to both convolutional and prior transformer-based architectures (Wang et al., 2023, Conde et al., 2022, Iqbal et al., 5 Jan 2026).
1. Architectural Overview and Variants
Customized RSwinV2 backbones share a hierarchical structure that processes inputs as sequences of non-overlapping image patches with patch embedding and positional encoding. They employ a multi-stage pipeline:
- Feature Enhancement: A preliminary module comprises three successive convolutions (1×1 → 3×3 → 1×1) to enrich non-linear representations, with a lightweight increase in runtime and parameter budget (Wang et al., 2023).
- Main Backbone: The core analysis and synthesis transforms are composed of stacks of Residual SwinV2 Transformer Blocks (RS2TBs or RSTBs), each containing windowed or shifted-window multi-head self-attention, and MLP or convolutional local modules. In some variants (e.g., for classification), an Inverse Residual Block (IRB) is employed after the attention sublayer, leveraging depthwise convolution and pointwise expansion for local pattern extraction and stabilizing gradient flow (Iqbal et al., 5 Jan 2026).
- Down/Up-Sampling: Patch-merge layers reduce spatial resolution and double feature dimensions between stages.
- Task-Specific Heads: For image compression, the model concludes with an entropy-coded hyperprior branch. For classification, a global pooled or [CLS] token is linearly projected to class logits (Wang et al., 2023, Iqbal et al., 5 Jan 2026).
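The patch-merge step in the pipeline above can be sketched in numpy; the map size, channel count, and random projection weight are illustrative stand-ins, not the papers' exact configuration.

```python
import numpy as np

def patch_merge(x, w):
    """Merge each 2x2 patch neighborhood and project channels.

    x: feature map of shape (H, W, C) with even H, W
    w: projection matrix of shape (4*C, 2*C)
    Returns: (H//2, W//2, 2*C) -- halved resolution, doubled channels.
    """
    H, W, C = x.shape
    # Concatenate the four spatial neighbors of every 2x2 block
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]],
        axis=-1,
    )  # (H//2, W//2, 4*C)
    return merged @ w  # linear reduction 4*C -> 2*C

# Example: an 8x8 map with 16 channels -> a 4x4 map with 32 channels
x = np.random.randn(8, 8, 16)
w = np.random.randn(64, 32)
y = patch_merge(x, w)
assert y.shape == (4, 4, 32)
```

Halving the resolution while doubling the channels keeps the per-stage compute roughly balanced across the hierarchy.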
A typical block diagram for RSwinV2 (classification variant) is captured in the following pseudocode:
```
function RSwinV2(input_image):
    x = PatchEmbed(input_image)
    x = x + PosEmbed
    x = prepend_cls_token(x)
    for stage in [1, 2, 3, 4]:
        for i in range(num_blocks[stage]):
            # attention sublayer, residual post-norm
            y = W-MHA(x, shift=(i % 2 == 1))
            x = x + LayerNorm(y)
            # local feed-forward sublayer (IRB), residual post-norm
            z = IRB(x)
            x = x + LayerNorm(z)
        if stage < 4:
            x = PatchMerge(x)
    cls_token = x[0]
    logits = LinearHead(cls_token)
    return Softmax(logits)
```
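The residual post-norm update applied in each sublayer of the pseudocode can be illustrated with a toy numpy sketch; the random linear sublayer `f` is a hypothetical stand-in for W-MHA or the IRB.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the channel (last) axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def res_post_norm(x, sublayer):
    """SwinV2-style residual post-norm: normalize the sublayer's
    output *before* adding it back, so each block contributes a
    bounded-variance increment to the residual stream."""
    return x + layer_norm(sublayer(x))

# Toy sublayer standing in for W-MHA or the IRB (assumed shapes)
f = lambda x: x @ np.random.randn(16, 16)

x = np.random.randn(10, 16)
for _ in range(50):              # stack 50 blocks
    x = res_post_norm(x, f)
# Activations grow roughly like sqrt(depth), not exponentially
assert np.isfinite(x).all()
```

With pre-norm, the un-normalized sublayer output is added directly to the stream, so activation magnitudes can compound with depth; post-norm caps each increment, which matches the stability claims in the surrounding text.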
2. Key Modifications to SwinV2 and RSwinV2 Block Formulations
All RSwinV2 variants introduce crucial modifications over the standard SwinV2 block design:
- Residual ("skip join") connections are added around each sublayer—both the attention module and MLP or IRB—to improve gradient flow and feature aggregation. In variants focused on image restoration, a block-level skip is added post-convolution (Conde et al., 2022).
- Post-Norm Stabilization: Every sublayer applies LayerNorm to its output before the residual addition (residual post-norm), rather than normalizing the input as in pre-norm designs, which curbs exploding feature variance in deeper networks (Wang et al., 2023, Conde et al., 2022).
- Scaled Cosine Attention: The dot-product similarity is replaced with scaled cosine attention, $\mathrm{Sim}(q_i, k_j) = \cos(q_i, k_j)/\tau + B_{ij}$, where $\tau$ is a learnable scalar and $B_{ij}$ is a relative positional bias (Wang et al., 2023, Conde et al., 2022).
- Windowed and Shifted-Window Attention: Each stage partitions the feature map into windows (e.g., 8×8 or 7×7 regions), computing self-attention within each, and alternates between "regular" and "shifted" windows to propagate information across region boundaries (Iqbal et al., 5 Jan 2026).
- MLP/IRB Layer: The feed-forward stage employs either a standard MLP (with the usual channel-expansion ratio) or, in some customized versions (e.g., for medical imaging), an IRB comprising linear expansion, GELU activation, depthwise convolution, and projection, all wrapped in a residual add. This captures both local spatial and global contextual dependencies (Iqbal et al., 5 Jan 2026).
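The scaled cosine attention described above can be sketched for a single head in a few lines of numpy; the window size, head dimension, and the zero bias below are illustrative stand-ins, not the papers' exact configuration.

```python
import numpy as np

def scaled_cosine_attention(q, k, v, tau, bias):
    """Scaled cosine attention for one head (a minimal sketch).

    q, k, v: (N, d) token matrices for one window
    tau:     learnable positive scalar (here a plain float)
    bias:    (N, N) relative positional bias B_ij
    """
    # Cosine similarity = dot product of L2-normalized vectors
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    sim = qn @ kn.T / tau + bias               # (N, N)
    # Softmax over keys, numerically stabilized
    e = np.exp(sim - sim.max(axis=-1, keepdims=True))
    attn = e / e.sum(axis=-1, keepdims=True)
    return attn @ v

N, d = 49, 32                                  # e.g. one 7x7 window
q, k, v = (np.random.randn(N, d) for _ in range(3))
out = scaled_cosine_attention(q, k, v, tau=0.1, bias=np.zeros((N, N)))
assert out.shape == (N, d)
```

Because cosine similarity is bounded in [-1, 1], the pre-softmax logits cannot blow up with depth or resolution the way raw dot products can, which is the motivation for this substitution in SwinV2.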
3. Formal Mathematical Structure
Given a feature map $X \in \mathbb{R}^{H \times W \times C}$, the RSwinV2 block operates as follows (notation adapted for clarity):
- Feature Embedding and Sequencing: Partition $X$ into non-overlapping $M \times M$ windows and reshape to $X_w \in \mathbb{R}^{(HW/M^2) \times M^2 \times C}$, so that attention operates on sequences of $M^2$ tokens.
- Attention Sublayer: $X \leftarrow X + \mathrm{LN}(\text{W-MSA}(X))$, where W-MSA denotes (shifted-)window multi-head self-attention with scaled cosine similarity.
- Feed-Forward (MLP or IRB) Sublayer: $X \leftarrow X + \mathrm{LN}(\mathrm{FFN}(X))$, with the FFN instantiated as an MLP or an IRB.
- Feature Unembedding: Reverse the window partition to map back to $\mathbb{R}^{H \times W \times C}$ for the next stage.
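The embedding/unembedding pair above amounts to a window partition and its exact inverse; a minimal numpy sketch (map size, window size, and channel count are arbitrary examples):

```python
import numpy as np

def window_partition(x, M):
    """(H, W, C) feature map -> (num_windows, M*M, C) token sequences."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)

def window_reverse(wins, M, H, W):
    """Inverse of window_partition: back to the (H, W, C) map."""
    C = wins.shape[-1]
    x = wins.reshape(H // M, W // M, M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

x = np.random.randn(16, 16, 8)
wins = window_partition(x, M=8)
assert wins.shape == (4, 64, 8)    # HW/M^2 windows of M^2 tokens each
assert np.allclose(window_reverse(wins, 8, 16, 16), x)  # exact round-trip
```

The round-trip is lossless, so per-window attention can be freely interleaved with spatial operations such as patch merging.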
The IRB specifically performs
$\mathrm{IRB}(X) = X + W_p\,\mathrm{DWConv}(\mathrm{GELU}(W_e X)),$
where $W_e$ expands the channels by ratio $r$, DWConv is a depthwise convolution, and $W_p$ projects back to $C$ channels, as described in (Iqbal et al., 5 Jan 2026).
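The IRB computation can be sketched directly in numpy; the 3×3 kernel size, map size, and expansion ratio below are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def depthwise_conv3x3(x, k):
    """Per-channel 3x3 convolution with zero padding.
    x: (H, W, C), k: (3, 3, C) -- one 3x3 kernel per channel."""
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + H, j:j + W] * k[i, j]
    return out

def irb(x, w_expand, k_dw, w_project):
    """Inverse Residual Block: pointwise expansion -> GELU ->
    depthwise conv -> pointwise projection, with a residual add."""
    h = gelu(x @ w_expand)           # expand channels C -> r*C
    h = depthwise_conv3x3(h, k_dw)   # local spatial mixing
    return x + h @ w_project         # project back r*C -> C

H, W, C, r = 8, 8, 16, 4
x = np.random.randn(H, W, C)
y = irb(x, np.random.randn(C, r * C), np.random.randn(3, 3, r * C),
        np.random.randn(r * C, C))
assert y.shape == x.shape
```

The depthwise convolution mixes only spatially within each channel, while the two pointwise projections mix channels, which keeps the cost linear in the expansion ratio.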
4. Training Regimes and Hyperparameters
Training procedures are adapted to task and data modality:
- Image Compression (Wang et al., 2023): Optimized with the rate–distortion objective $\mathcal{L} = \mathcal{R} + \lambda \mathcal{D}$, where $\mathcal{R}$ is the bitrate (entropy of the quantized codes), $\mathcal{D}$ the distortion (MSE or an MS-SSIM-based term), and $\lambda$ controls the rate–distortion tradeoff. Training uses the Adam optimizer for 400 epochs, with a smaller embedding width for the low-bitrate model and a larger one for the high-bitrate model.
- Classification (Medical Imaging) (Iqbal et al., 5 Jan 2026): Multi-class cross-entropy loss, Adam optimizer, weight decay $0.04$, learning-rate decay every 20 epochs, batch size 16, and dropout on the final classification head.
- Super-Resolution/Restoration (Conde et al., 2022): L1 reconstruction loss combined with auxiliary downsampling and high-frequency losses weighted α = β = 0.1; trained with Adam (β1 = 0.9, β2 = 0.99) for 800k iterations (classical SR), with self-ensemble at test time.
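The rate–distortion objective for the compression variant can be sketched as follows; the entropy-model likelihoods and the MSE distortion here are a generic illustration, not the paper's exact entropy model.

```python
import numpy as np

def rate_distortion_loss(x, x_hat, likelihoods, lam):
    """L = R + lambda * D for learned image compression (a sketch).

    x, x_hat:     original / reconstructed images, (H, W, C)
    likelihoods:  probabilities the entropy model assigns to the
                  quantized latents (values in (0, 1])
    lam:          rate-distortion tradeoff weight lambda
    """
    num_pixels = x.shape[0] * x.shape[1]
    # Rate: estimated bits per pixel from the entropy model
    rate_bpp = -np.log2(likelihoods).sum() / num_pixels
    # Distortion: MSE here; an MS-SSIM term could be substituted
    mse = ((x - x_hat) ** 2).mean()
    return rate_bpp + lam * mse

x = np.random.rand(64, 64, 3)
x_hat = x + 0.01 * np.random.randn(64, 64, 3)
p = np.full(4096, 0.5)          # toy latents, each costing 1 bit
loss = rate_distortion_loss(x, x_hat, p, lam=0.01)
assert loss >= 1.0              # rate term alone is 1 bpp here
```

Raising λ shifts the optimum toward lower distortion at higher bitrates, which is how the low- and high-bpp model variants are obtained.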
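The stepped learning-rate decay used in the classification regime reduces to a one-line schedule; the decay factor 0.5 below is an assumed value, as the source specifies only the 20-epoch cadence.

```python
def step_lr(base_lr, epoch, step=20, gamma=0.5):
    """Step-decay schedule: multiply the learning rate by `gamma`
    every `step` epochs. gamma=0.5 is an assumed decay factor;
    the source fixes the cadence (20 epochs), not the factor."""
    return base_lr * gamma ** (epoch // step)

# Learning rate at epochs 0, 19, 20, and 40 for base_lr = 1e-3
lrs = [step_lr(1e-3, e) for e in (0, 19, 20, 40)]
assert lrs == [1e-3, 1e-3, 5e-4, 2.5e-4]
```

A schedule like this is typically paired with Adam by passing the per-epoch value into the optimizer before each epoch.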
5. Empirical Performance and Model Efficiency
RSwinV2 achieves significant efficiency and performance improvements:
| Task/Domain | Model Size / Reduction | Accuracy/PSNR Gain | Notable Metrics/Comparisons |
|---|---|---|---|
| Image Compression (Wang et al., 2023) | 56–57% vs. prior art | +0.23 dB PSNR (Kodak, high bpp) | Matches/outperforms VVC, BPG, JPEG2000 in MS-SSIM and PSNR |
| Image Restoration (Conde et al., 2022) | 1M–12M (lightweight) | +0.05 dB PSNR (SR) | 33% fewer iters vs. SwinIR, faster convergence |
| Medical Classification (Iqbal et al., 5 Jan 2026) | N/A | 96.2% acc, 95.6 F1 (Mpox) | Beats ResNet18, LeViT, SwinT by +1.4% acc. |
The results indicate that model complexity is reduced by over half versus comparable transformer-based methods with equivalent or superior accuracy, faster convergence (up to 33% fewer iterations in super-resolution), and better feature-space class separation. The lightweight feature enhancement in (Wang et al., 2023) further leads to up to 0.34 dB PSNR improvement with negligible added runtime.
6. Practical Implementations and Compute Considerations
RSwinV2 scales efficiently:
- Computational Complexity: Each windowed-attention block scales as $O(M^2 \cdot HW \cdot C)$ in the attention term (window size $M$, embedding dimension $C$), versus $O((HW)^2 \cdot C)$ for global attention; the IRB adds $O(HW \cdot r \cdot C^2)$ for expansion ratio $r$. Overall complexity is therefore nearly linear in image area thanks to the windowed decomposition, scaling far better than global-attention vision transformers (Iqbal et al., 5 Jan 2026).
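The windowed-versus-global gap can be made concrete with a rough multiply-accumulate count; the resolution, channel count, and window size below are illustrative, and the shared QKV/output projections are deliberately omitted.

```python
def attention_flops(H, W, C, M=None):
    """Rough multiply-accumulate count for one attention layer.

    Global attention: 2 * (HW)^2 * C for QK^T plus attn @ V.
    Windowed (M x M): 2 * HW * M^2 * C -- linear in image area.
    QKV/output projections (~4*HW*C^2) are common to both and omitted.
    """
    n = H * W
    if M is None:
        return 2 * n * n * C          # global attention
    return 2 * n * (M * M) * C        # windowed attention

global_cost = attention_flops(224, 224, 96)
windowed_cost = attention_flops(224, 224, 96, M=7)
# Windowed attention is cheaper by a factor of (HW) / M^2
assert global_cost // windowed_cost == (224 * 224) // 49
```

At 224×224 with 7×7 windows, the attention term is roughly 1000× cheaper, which is why the backbone stays practical at high resolutions.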
- Hardware Utilization: Exploits parallel batch mat-multiplications and window processing for GPU efficiency; practical training and inference benchmarks demonstrate sub-10 ms per image at batch size 16 on modern GPUs (RTX 4070 Ti), and model sizes as small as 33 MB (low-bpp image compression) or 1M parameters for lightweight super-resolution (Wang et al., 2023, Conde et al., 2022, Iqbal et al., 5 Jan 2026).
- Stability: Residual post-norm and the added skip connections yield smoother loss curves and avoid the training instabilities observed in standard ViT/SwinT baselines.
7. Applications and Extensions
- Learned Image Compression: RSwinV2 yields superior rate-distortion tradeoffs and practical deployment advantages over classical and learning-based codecs, matching VVC/Cheng2020 with less than half the parameter count (Wang et al., 2023).
- Compressed Image Super-Resolution and Artifact Removal: Swin2SR demonstrates robust performance (top-5 in AIM 2022 challenge) and generalization across compression levels, outperforming SwinIR and CNN-based competitors in both restoration quality and training speed (Conde et al., 2022).
- Medical Image Classification: Customized RSwinV2 with patch and positional embeddings, windowed attention, and IRB enhancement surpasses standard CNN and transformer backbones on diverse lesion classification tasks, with notable gains in both accuracy and F1-score for Mpox, Chickenpox, Measles, and Cowpox identification (Iqbal et al., 5 Jan 2026).
These results collectively establish RSwinV2 as a flexible, high-efficiency transformer backbone suitable for a range of high-performance image analysis and compression tasks, combining scalable attention architectures with judicious residual and convolutional enhancements.