Inverse Residual Block (IRB)
- The Inverse Residual Block (IRB) is a neural network module that integrates pointwise and depthwise convolutions with a residual connection to enhance local feature extraction.
- The IRB replaces standard feedforward layers in transformer architectures, improving gradient flow and nonlinearity and leading to more efficient model convergence.
- Empirical results in RSwinV2 show that incorporating IRB boosts accuracy and F1 scores in dense vision tasks such as medical image classification and learned image compression.
An Inverse Residual Block (IRB) is a neural network building block that replaces conventional feedforward network (FFN) layers within transformer-style architectures, notably the Residual SwinTransformerV2 (RSwinV2). IRB fuses pointwise and depthwise convolutions via a residual connection to enhance local feature extraction, improve gradient flow, and boost nonlinearity while maintaining computational efficiency. The IRB has been systematically deployed in hierarchically staged, windowed self-attention models designed for dense vision tasks including medical image classification and learned image compression (Iqbal et al., 5 Jan 2026, Wang et al., 2023).
1. Formal Definition and Structural Motivation
The IRB addresses two principal architectural challenges in vision transformers: the inability of pure self-attention layers to fully capture locally correlated patterns and the risk of vanishing gradients in deep cascades. In RSwinV2, every transformer block replaces its standard FFN with an IRB, creating a composite mechanism that integrates both transformer-based long-range dependencies and convolutional local priors (Iqbal et al., 5 Jan 2026). The IRB’s skip connection ensures the preservation of the input information stream across layers, directly mitigating gradient attenuation.
2. IRB Layer Composition and Mathematical Formulation
Given an input $X \in \mathbb{R}^{H \times W \times C}$, each IRB consists of the following operations:
- 1×1 Pointwise Convolution (Expansion): Expands the feature channels from $C$ to $tC$ for an expansion ratio $t > 1$, promoting higher-rank feature mixing.
- 3×3 Depthwise Convolution: Spatially local filtering with padding = 1, allowing pixel-level context aggregation within each expanded channel.
- GELU Activation and Batch Normalization: Follow both convolutions, increasing nonlinearity and stabilizing training.
- 1×1 Pointwise Convolution (Projection): Projects the feature channels back from $tC$ to $C$, restoring the original tensor dimensionality.
- Residual Addition: The final output is $Y = X + \mathcal{F}(X)$, where $\mathcal{F}$ denotes the composite convolutional processing described above.
This structure systematically introduces a convolutional skip connection, contrasting with the linear transformations in a vanilla FFN, and is always paired with post-layer normalization (Iqbal et al., 5 Jan 2026).
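The sequence above can be sketched as a small PyTorch module. This is a minimal reconstruction from the description, not the authors' released code: the class name, the default expansion ratio of 4, and the exact placement of BatchNorm after the projection are assumptions.

```python
import torch
import torch.nn as nn


class InverseResidualBlock(nn.Module):
    """Sketch of an IRB: pointwise expansion, depthwise 3x3, pointwise
    projection, with a residual skip (hyperparameters are assumed)."""

    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.body = nn.Sequential(
            # 1x1 pointwise expansion: C -> t*C
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            # 3x3 depthwise convolution; padding=1 preserves spatial size
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            # 1x1 pointwise projection: t*C -> C
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual addition: y = x + F(x)
        return x + self.body(x)
```

Because the skip connection is an identity addition, the block's input and output shapes match, which is what allows it to be dropped in wherever an FFN previously sat.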
3. Integration Within RSwinV2 Transformer Blocks
In the RSwinV2 paradigm, a block alternates self-attention and IRB operations. The canonical sequence is:

$$\hat{x} = x + \mathrm{LN}\big(\mathrm{MSA}(x)\big), \qquad y = \hat{x} + \mathrm{LN}\big(\mathrm{IRB}(\hat{x})\big)$$

Here, $\mathrm{MSA}$ denotes multi-head self-attention and $\mathrm{LN}$ is layer normalization. The IRB directly replaces the FFN segment of earlier transformer designs (Iqbal et al., 5 Jan 2026). Several such blocks are stacked per stage in a four-stage hierarchical model with patch partitioning and merging, facilitating both spatial dimension reduction and feature enrichment (Iqbal et al., 5 Jan 2026).
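The post-norm block structure can be sketched as follows. As an assumption for illustration, plain global `nn.MultiheadAttention` stands in for the paper's windowed, scaled-cosine attention, and the IRB sub-module is passed in as any map over `(B, C, H, W)` feature maps; the token-to-feature-map reshape is the glue between the two.

```python
import torch
import torch.nn as nn


class RSwinV2Block(nn.Module):
    """Post-norm block sketch: x = x + LN(MSA(x)); y = x + LN(IRB(x)).
    The attention module here is a global-attention stand-in, not the
    windowed attention used in RSwinV2."""

    def __init__(self, dim: int, num_heads: int, irb: nn.Module):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.irb = irb  # any module mapping (B, C, H, W) -> (B, C, H, W)

    def forward(self, x: torch.Tensor, hw: tuple) -> torch.Tensor:
        h, w = hw
        attn_out, _ = self.msa(x, x, x)       # (B, N, C) tokens
        x = x + self.ln1(attn_out)            # post-norm residual
        b, n, c = x.shape
        # Tokens -> spatial feature map for the convolutional IRB
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        irb_out = self.irb(feat).flatten(2).transpose(1, 2)
        return x + self.ln2(irb_out)          # post-norm residual
```

The two residual additions with normalization applied after each sub-module mirror the SwinV2 post-norm convention noted above.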
4. Comparative Computational Characteristics
The IRB is computationally efficient relative to vanilla FFNs due to its use of pointwise-plus-depthwise convolutions, substantially lowering parameterization and operational cost. In RSwinV2 applied to 224×224 image classification, the total model parameter count is approximately 50M, similar to Swin-V2 baselines but with improved empirical convergence and accuracy metrics (Iqbal et al., 5 Jan 2026). In learned compression, replacement of deeper convolutional backbones with residual SwinV2 blocks incorporating similar patterns (though not always labeled "IRB") reduces model complexity by over 56% compared to prior art (Wang et al., 2023).
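The efficiency argument can be made concrete with back-of-envelope parameter arithmetic. The channel width $C$ and expansion ratio $t$ below are illustrative symbols, not values from the paper, and biases and normalization parameters are ignored: the depthwise stage adds spatial mixing at only a marginal overhead relative to an FFN, and at a small fraction of the cost of a hypothetical full 3×3 convolution at the expanded width.

```python
def ffn_params(c: int, t: int) -> int:
    # Two linear layers: C -> t*C and t*C -> C  (= 2*t*C^2)
    return c * (t * c) + (t * c) * c

def irb_params(c: int, t: int) -> int:
    # 1x1 expansion + 3x3 depthwise + 1x1 projection  (= 2*t*C^2 + 9*t*C)
    return c * (t * c) + 9 * (t * c) + (t * c) * c

def full_conv_params(c: int, t: int) -> int:
    # Hypothetical full 3x3 convolution at the expanded width, for contrast
    return 9 * (t * c) * (t * c)

C, t = 96, 4  # illustrative values only
print(ffn_params(C, t))        # 73728
print(irb_params(C, t))        # 77184
print(full_conv_params(C, t))  # 1327104
```

At these illustrative settings the depthwise term contributes under 5% of the IRB's parameters, while a dense 3×3 convolution at the same width would cost over 17× more.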
5. Empirical Performance and Practical Implications
In skin lesion classification for monkeypox and related diseases, RSwinV2 with IRB achieved 96.21% test accuracy and a 95.62% F1 score on the Kaggle dermatology dataset (five-class task), outperforming ResNet-18, Swin-T, vanilla Swin-V2, DenseNet-201, and ViT backbones by 1–3% absolute margins (Iqbal et al., 5 Jan 2026). This was attributed in part to superior modeling of both local (e.g., lesion texture) and global (e.g., lesion distribution) features.
The table below summarizes comparative results:
| Model | Accuracy | F1 Score |
|---|---|---|
| ResNet-18 | 94.77% | 94.17% |
| Swin-T | 95.31% | 94.96% |
| Swin-V2 | 96.03% | 95.42% |
| RSwinV2 (with IRB) | 96.21% | 95.62% |
The IRB's local pathway reduces intra-class variability and enhances inter-class discrimination, key for fine-grained visual identification tasks (Iqbal et al., 5 Jan 2026).
6. Broader Applications and Variants
Elements of the IRB architecture—convolutional residual pathways, pointwise-depthwise sequences, and block-level skip connections—are present in variants of SwinV2 transformers for image restoration, super-resolution, and learned compression (Conde et al., 2022, Wang et al., 2023). In these settings, "Residual SwinTransformerV2 Block" (RSTB) or "Residual SwinV2 Transformer Block" (RS2TB) modules encapsulate similar designs, enhancing convergence and training stability, notably with post-norm and scaled-cosine attention mechanisms (Conde et al., 2022, Wang et al., 2023).
Although the explicit IRB label is specific to RSwinV2-MD, the core design principle (local convolutional residual sub-block) is a generalizable pattern within the SwinV2 family.
7. Limitations and Future Prospects
Despite improved efficiency and robustness, IRB-augmented networks remain subject to limitations such as performance drops under class imbalance, and potential generalization gaps across imaging modalities (e.g., moving from 2D skin images to low-contrast or volumetric data). Subsequent work proposes lighter IRB designs for on-device inference and dynamic window-shifting strategies for better cross-domain adaptability. Future research directions include semi-supervised pretraining and dynamic architectural variants to further leverage IRB's benefits in resource-constrained or multi-modal clinical contexts (Iqbal et al., 5 Jan 2026).