Inverted Residual Block in CNNs

Updated 23 June 2026

The inverted residual block is a CNN module that reorders the classic bottleneck structure: it first expands channels, then applies depthwise convolutions, and finally projects back through a linear bottleneck.
It achieves a balance between expressiveness and efficiency, reducing computational cost and memory footprint while maintaining performance, as demonstrated by MobileNetV2.
Architectural variants like the Sandglass Block and attention-augmented designs further optimize gradient flow and accuracy, especially for resource-constrained applications.

An inverted residual block is a convolutional neural network (CNN) module that reorders the classic residual bottleneck structure—expanding first into a high-dimensional space, applying spatial filtering via depthwise convolutions, and then projecting back to a low-dimensional bottleneck—while routing the residual (skip) connection through the bottleneck rather than the expanded representation. Initially popularized by MobileNetV2 for efficient mobile vision, this paradigm underpins numerous lightweight CNN and hybrid attention-CNN architectures, and has influenced network design across diverse domains by enabling a favorable trade-off between expressiveness, parameter/FLOP efficiency, and memory footprint (Sandler et al., 2018, Daquan et al., 2020, Zhang et al., 2023, Chiang et al., 2022).

1. Core Structure and Mathematical Formulation

Let the input be $\mathbf{x} \in \mathbb{R}^{C_{\text{in}}\times H\times W}$. The canonical inverted residual block with expansion factor $t$ performs the following operations:

Expansion (1×1 Conv): $\mathbf{u} = \mathrm{Conv}_{1\times1}(\mathbf{x};\,C_{\text{in}}\rightarrow tC_{\text{in}})$, followed by normalization (e.g., BN or GN) and activation (e.g., ReLU6 or ReLU) (Sandler et al., 2018, Daquan et al., 2020, Chiang et al., 2022).
Depthwise Convolution (k×k): Applies a spatial $k\times k$ kernel per channel: $\mathbf{v} = \mathrm{DWConv}_{k\times k}(\mathbf{u})$, followed by normalization and activation.
Projection (1×1 Conv, linear): Reduces channels back to $C_{\text{out}}$: $\mathbf{y} = \mathrm{Conv}_{1\times1}(\mathbf{v};\,tC_{\text{in}}\rightarrow C_{\text{out}})$, with normalization and no activation in most designs (“linear bottleneck”).
Residual Addition: If $C_{\text{out}} = C_{\text{in}}$ and stride $= 1$, the block outputs $\mathbf{z} = \mathbf{x} + \mathbf{y}$. Otherwise $\mathbf{z} = \mathbf{y}$ (Sandler et al., 2018, Pendse et al., 2021, Chiang et al., 2022).

This sequence is generalized in 3D (e.g., for medical imaging) by extending the kernel dimensions and using group normalization when batch sizes are small (Pendse et al., 2021).
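The four 2D steps listed above can be traced end to end in a small framework-free sketch. It is an illustration under stated assumptions rather than a reference implementation: tensors are nested Python lists, normalization is omitted, the stride is fixed at 1, the weights are random, and all function names (`conv1x1`, `depthwise_conv`, `inverted_residual`) are hypothetical. In practice a deep learning framework would be used instead.

```python
import random

def relu6(v):
    return min(max(v, 0.0), 6.0)

def apply_act(x, fn):
    """Apply an elementwise activation to a [C][H][W] tensor."""
    return [[[fn(v) for v in row] for row in plane] for plane in x]

def conv1x1(x, w):
    """Pointwise conv. x: [C_in][H][W], w: [C_out][C_in] -> [C_out][H][W]."""
    h, wd = len(x[0]), len(x[0][0])
    return [[[sum(w_o[c] * x[c][i][j] for c in range(len(x)))
              for j in range(wd)] for i in range(h)] for w_o in w]

def depthwise_conv(x, kernels):
    """Depthwise k x k conv, zero padding, stride 1. kernels: [C][k][k]."""
    k = len(kernels[0])
    p = k // 2
    h, wd = len(x[0]), len(x[0][0])
    out = []
    for ch, ker in zip(x, kernels):  # one kernel per channel
        plane = [[0.0] * wd for _ in range(h)]
        for i in range(h):
            for j in range(wd):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - p, j + dj - p
                        if 0 <= ii < h and 0 <= jj < wd:
                            acc += ker[di][dj] * ch[ii][jj]
                plane[i][j] = acc
        out.append(plane)
    return out

def inverted_residual(x, w_expand, dw_kernels, w_project):
    u = apply_act(conv1x1(x, w_expand), relu6)           # expansion + activation
    v = apply_act(depthwise_conv(u, dw_kernels), relu6)  # depthwise + activation
    y = conv1x1(v, w_project)                            # linear projection, no activation
    if len(y) == len(x):                                 # C_out == C_in (stride is 1 here)
        return [[[a + b for a, b in zip(rx, ry)] for rx, ry in zip(px, py)]
                for px, py in zip(x, y)]
    return y

# Tiny example: C_in = C_out = 2, expansion factor t = 3, 4 x 4 input, 3 x 3 kernel.
rng = random.Random(0)
C_in, t, H, W, k = 2, 3, 4, 4, 3
x = [[[rng.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C_in)]
w_expand = [[rng.uniform(-1, 1) for _ in range(C_in)] for _ in range(t * C_in)]
dw_kernels = [[[rng.uniform(-1, 1) for _ in range(k)] for _ in range(k)]
              for _ in range(t * C_in)]
w_project = [[rng.uniform(-1, 1) for _ in range(t * C_in)] for _ in range(C_in)]

z = inverted_residual(x, w_expand, dw_kernels, w_project)
print(len(z), len(z[0]), len(z[0][0]))  # channels, height, width of the output
```

Because the skip connection joins the two low-dimensional ends, a projection with all-zero weights reduces the whole block to the identity, which is a convenient sanity check for an implementation.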

2. Design Principles: Inversion and Linear Bottlenecks

Inverted residuals invert the ResNet bottleneck: standard blocks compress channels before expensive spatial convolutions, while the inverted approach expands channels before the depthwise convolution and then compresses them. The skip connection is inverted as well: it connects the low-dimensional bottleneck endpoints rather than the wide layers (Sandler et al., 2018, Daquan et al., 2020).

Linear bottlenecks omit nonlinearity after the final projection, based on the empirical and theoretical finding that applying ReLU to a low-dimensional representation destabilizes or collapses the signal manifold, leading to information loss and an observable accuracy drop (>2 percentage points on ImageNet when a ReLU is added after the projection) (Sandler et al., 2018).
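A toy calculation makes the information-loss argument concrete. The sketch below applies ReLU directly to 2D points on the unit circle, then repeats the experiment with an expand, ReLU, linear-project path. The expansion weights `[I; -I]` and projection weights `[I, -I]` are hand-picked for illustration (an assumption, not learned weights and not taken from the cited papers), and the function names are hypothetical.

```python
import math

def relu(v):
    return max(v, 0.0)

N = 360
# Points on the unit circle, offset by half a degree so that none lies on an axis.
points = [(math.cos(math.radians(d + 0.5)), math.sin(math.radians(d + 0.5)))
          for d in range(N)]

# Case 1: ReLU applied directly in the 2D (low-dimensional) space.
activated = [(relu(px), relu(py)) for px, py in points]
unchanged = sum(a == p for a, p in zip(activated, points))
collapsed_to_origin = sum(a == (0.0, 0.0) for a in activated)
distinct_outputs = len(set(activated))

# Case 2: expand 2 -> 4 channels, apply ReLU there, project back linearly.
def expand_relu_project(p):
    px, py = p
    hidden = [relu(px), relu(-px), relu(py), relu(-py)]    # expansion [I; -I] + ReLU
    return (hidden[0] - hidden[1], hidden[2] - hidden[3])  # linear projection [I, -I]

recovered = sum(expand_relu_project(p) == p for p in points)

print(unchanged, collapsed_to_origin, distinct_outputs, recovered)
```

With ReLU applied in 2D, only the first-quadrant points pass through unchanged and every third-quadrant point maps to the origin, so distinct inputs become indistinguishable. With the nonlinearity applied in the expanded space and a linear projection afterwards, every point is recovered exactly in this hand-built case.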

By decoupling representational expressiveness (enabled by the high-dimensional expansion and non-linear depthwise convolution) from the network's capacity (bottleneck size and skip path), the block achieves resource efficiency without significant loss in accuracy (Sandler et al., 2018, Chiang et al., 2022).

3. Variants and Architectural Extensions

Several variants extend or challenge the canonical design:

Sandglass Block (Daquan et al., 2020): Moves the skip connection to the expanded high-dimensional representation and places the depthwise spatial convolutions at both ends of the block, after channel expansion and before channel reduction. This mitigates the information loss and gradient confusion associated with classical inverted bottlenecks and yields improved gradient flow and accuracy (+1.7% on ImageNet at constant cost).
Attention-augmented Inverted Residuals (Pan et al., 27 May 2025): The AIR block integrates hybrid channel–spatial attention between expansion and projection, using spatial and channel gating followed by an additive self-attention mechanism. This design amplifies discriminative features while further reducing parameters and FLOPs, and it empirically improves precision and mAP on detection tasks.
DPD Block (Li et al., 2019): Replaces the initial pointwise expansion convolution with a depthwise expansion, further decreasing the parameter and FLOP counts; empirical results show similar accuracy at ~60% of the MobileNetV2 cost for the same layer count.

Block-level reversibility, as in reversible MBConv (Pendse et al., 2021), uses paired inverted bottlenecks to enable memory-efficient gradient computation by recomputing activations from outputs, allowing larger volumes or higher channel count under fixed memory budgets.
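The recompute-from-outputs idea can be sketched with an additive coupling of two branches. The coupling form below (the RevNet-style pairing `y1 = x1 + F(x2)`, `y2 = x2 + G(y1)`) is an assumption, since the text above does not spell out the exact scheme, and `f_branch` and `g_branch` are hypothetical stand-ins for the two inverted bottleneck branches.

```python
import math

def f_branch(v):
    # Hypothetical stand-in for the first inverted bottleneck branch.
    return [math.tanh(0.5 * a) for a in v]

def g_branch(v):
    # Hypothetical stand-in for the second inverted bottleneck branch.
    return [0.1 * a * a for a in v]

def rev_forward(x1, x2):
    """Additive coupling: the input is split into two halves, x1 and x2."""
    y1 = [a + b for a, b in zip(x1, f_branch(x2))]
    y2 = [a + b for a, b in zip(x2, g_branch(y1))]
    return y1, y2

def rev_inverse(y1, y2):
    """Recompute the inputs from the outputs, so activations need not be stored."""
    x2 = [a - b for a, b in zip(y2, g_branch(y1))]
    x1 = [a - b for a, b in zip(y1, f_branch(x2))]
    return x1, x2

x1, x2 = [0.3, -1.2, 2.0], [1.5, 0.0, -0.7]
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)
max_err = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
print(max_err)  # reconstruction error, limited only by floating-point rounding
```

The branches themselves do not have to be invertible. Only the additive coupling is inverted, which is why arbitrary sub-networks can sit inside `f_branch` and `g_branch`.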

4. Computational Efficiency and Resource Utilization

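The parameter and computational cost of a standard block with expansion factor $t$ and kernel size $k$ is: $\text{Params} = tC_{\text{in}}\,(C_{\text{in}} + k^2 + C_{\text{out}})$ and $\text{MAdds} = H\,W\cdot tC_{\text{in}}\,(C_{\text{in}} + k^2 + C_{\text{out}})$ (stride 1, ignoring biases and normalization parameters), with the maximal tensor size always set by the bottleneck width, not the expanded width (Zhang et al., 2023, Sandler et al., 2018).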
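The cost of one block can be tallied directly from the three convolutions defined in Section 1. The helper below is a minimal sketch under stated assumptions: stride 1, no biases, and normalization parameters ignored. The function name `inverted_residual_cost` and the example sizes are hypothetical.

```python
def inverted_residual_cost(c_in, c_out, t, k, h, w):
    """Weights and multiply-adds of one block (stride 1, no bias, no norm params)."""
    hidden = t * c_in              # expanded channel count
    expand = c_in * hidden         # 1x1 expansion conv
    depthwise = k * k * hidden     # one k x k kernel per expanded channel
    project = hidden * c_out       # 1x1 linear projection
    params = expand + depthwise + project
    madds = h * w * params         # every weight is applied at every spatial position
    return {"expand": expand, "depthwise": depthwise,
            "project": project, "params": params, "madds": madds}

# Example sizes: 24 -> 24 channels, t = 6, 3 x 3 kernel, 56 x 56 feature map.
cost = inverted_residual_cost(c_in=24, c_out=24, t=6, k=3, h=56, w=56)
print(cost["params"], cost["madds"])  # → 8208 25740288
```

In this example the depthwise stage accounts for 1296 of the 8208 weights; the two pointwise convolutions carry the rest, so the cost grows roughly linearly in $t$ and quadratically in the bottleneck width.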

For memory-constrained and edge settings, the block structure enables aggressive pruning, partial fine-tuning, and low activation storage, as demonstrated by the MobileTL framework, which reduces memory by ≈50% and FLOPs by ~36% during fine-tuning (Chiang et al., 2022).

5. Empirical Performance and Infrastructure Impact

The inverted residual block structure is the backbone of efficient models such as MobileNetV2, MnasNet, and EfficientNet (as MBConv), forming the basis for both CNN-centric and hybrid vision architectures. On standard ImageNet settings, MobileNetV2 achieves 72.0% top-1 at 300M MAdd and 3.4M parameters, outperforming or matching earlier models at similar or higher compute (Sandler et al., 2018); variants introducing attention (Pan et al., 27 May 2025) or redesigned skip locations (Daquan et al., 2020) further improve trade-offs.

In practical detection tasks (e.g., YOLO-FireAD), AIR blocks halve parameter and FLOP counts while increasing mAP by 1.8 percentage points over YOLOv8n (Pan et al., 27 May 2025). Reversible architectures leveraging inverted residuals allow for 3× larger training volumes or 2× the number of channels under fixed memory, with comparable or superior segmentation accuracy (Pendse et al., 2021).

6. Limitations and Ongoing Developments

Identified risks include information loss at the projection step and “gradient confusion” in the narrow residual path, which can stall optimization; both motivate ongoing research into alternative skip topologies (e.g., sandglass, iRMB) and operator placement. Deeper blocks that use additional DWConv layers for expansion and filtering are reported to extract spatial features better at equal or lower compute (Li et al., 2019).

The compatibility of the pattern with hybrid (e.g., attention-infused) or reversible designs supports ongoing architectural unification and NAS-driven optimization (Zhang et al., 2023).

7. Summary Table: Canonical Block Operations

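Stage	Operator	Output Shape (stride 1)
Expansion	1×1 Conv (+Norm + Act)	$tC_{\text{in}}\times H\times W$
Depthwise Conv	k×k DWConv (+Norm + Act)	$tC_{\text{in}}\times H\times W$
Projection	1×1 Conv (+Norm, no Act)	$C_{\text{out}}\times H\times W$
Residual Add	$\mathbf{z} = \mathbf{x} + \mathbf{y}$ (if $C_{\text{out}} = C_{\text{in}}$, stride=1)	$C_{\text{out}}\times H\times W$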