Fused-MBConv in EfficientNetV2
- Fused-MBConv is an operator that replaces the 1×1 expansion convolution and the k×k depthwise convolution of MBConv with a single regular k×k convolution, improving training throughput and accelerator utilization.
- It streamlines the convolutional block into a fused k×k convolution followed by batch normalization, activation, optional squeeze-and-excitation, and a 1×1 projection.
- Empirical results indicate that using Fused-MBConv in early network stages reduces training time while slightly increasing parameter counts and boosting overall accuracy.
Fused-MBConv is an operator introduced in the EfficientNetV2 convolutional architecture, designed to optimize training throughput and parameter efficiency by rethinking the structure of the canonical MBConv (Mobile Inverted Bottleneck Convolution) block. Fused-MBConv eliminates the traditional separation between the 1×1 expansion and the depthwise convolution by merging them into a single convolution with increased output channels, followed by batch normalization, activation, optional squeeze-and-excitation (SE), a projection, and skip connection (when applicable). This fused structure trades slightly higher parameter counts for significantly improved hardware utilization and faster throughput, especially in the early, narrow layers of modern convolutional neural networks (Tan et al., 2021).
1. Block Structure and Mathematical Formulation
The standard MBConv architecture consists of a sequence of expansion (1×1 convolution), depthwise (spatial k×k) convolution, squeeze-and-excitation, projection (1×1 convolution), and residual addition when stride and channel constraints allow. In contrast, Fused-MBConv executes the expansion and spatial filtering in a single convolution. The block’s structure can be precisely described as:
- Fused Conv: k×k convolution, input channels C_in, output channels t·C_in, stride s.
- BatchNorm and Activation (e.g., Swish)
- Squeeze-and-Excitation (optional): channel-wise gating with reduction ratio r
- Projection: 1×1 convolution, t·C_in → C_out, followed by BatchNorm.
- Residual: if s = 1 and C_in = C_out, add the input.
Mathematically, for input X ∈ ℝ^{H×W×C_in}, expansion ratio t, kernel size k, stride s, and output channels C_out:

Z₁ = σ(BN(Conv_{k×k, stride s}(X))), with t·C_in output channels
Z₂ = SE(Z₁) (optional; otherwise Z₂ = Z₁)
Y = BN(Conv_{1×1}(Z₂)), with Y ← Y + X if s = 1 and C_in = C_out
This architecture eliminates the depthwise convolution and expansion 1×1 convolution, integrating both into a single dense convolution, which results in improved hardware efficiency (Tan et al., 2021).
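As a concrete illustration of the block structure above, the following sketch (a hypothetical pure-Python helper, not from the original paper) traces tensor shapes through each stage of a Fused-MBConv block, assuming channels-last layout and "same" padding so that spatial size shrinks only by the stride:

```python
def fused_mbconv_shapes(H, W, C_in, C_out, t, k, s):
    """Trace (H, W, C) through a Fused-MBConv block, assuming 'same' padding."""
    shapes = {"input": (H, W, C_in)}
    # Fused k×k conv: C_in -> t*C_in channels; spatial dims divided by stride
    H2, W2 = -(-H // s), -(-W // s)          # ceil division for 'same' padding
    shapes["fused_conv"] = (H2, W2, t * C_in)
    # BN, activation, and SE leave shapes unchanged; 1×1 projection maps to C_out
    shapes["projection"] = (H2, W2, C_out)
    # Residual addition is only legal when input and output shapes match
    shapes["residual_ok"] = (s == 1 and C_in == C_out)
    return shapes

# A Stage-1-like block of EfficientNetV2-S: 24 -> 24 channels, stride 1
print(fused_mbconv_shapes(112, 112, 24, 24, t=1, k=3, s=1))
```

Note that the kernel size k affects parameter and FLOP counts but not output shape under "same" padding, which is why it does not appear in the shape arithmetic.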
2. Implementation Workflow and Pseudocode
The Fused-MBConv block can be instantiated with the following pseudocode that details tensor shapes and parameterization:
```
def Fused_MBConv(X, C_in, C_out, t, k, s, use_se, r=4):
    # 1. Fused k×k convolution (replaces 1×1 expand + k×k depthwise)
    Z1 = Conv2D(X, in_channels=C_in, out_channels=C_in * t,
                kernel_size=k, stride=s, bias=False)
    Z1 = BatchNorm(Z1)
    Z1 = Activation(Z1)  # e.g., Swish

    # 2. Optional squeeze-and-excitation with reduction ratio r
    if use_se:
        u = GlobalAvgPool(Z1)
        v1 = FullyConnected(u, out_dim=(t * C_in) // r)
        v1 = Activation(v1)
        v2 = FullyConnected(v1, out_dim=t * C_in)
        e = Sigmoid(v2).reshape(1, 1, t * C_in)  # channel-wise gates
        Z2 = Z1 * e
    else:
        Z2 = Z1

    # 3. Project back to C_out
    Y = Conv2D(Z2, in_channels=C_in * t, out_channels=C_out,
               kernel_size=1, stride=1, bias=False)
    Y = BatchNorm(Y)

    # 4. Residual connection
    if (s == 1) and (C_in == C_out):
        Y = Y + X
    return Y
```
For example, in Stage 1 of EfficientNetV2-S, with C_in = C_out = 24, k = 3, s = 1, and t = 1, the fused conv contributes 3·3·24·24 = 5184 parameters and the projection 1·1·24·24 = 576 parameters, totaling approximately 5760 parameters per block. Later stages with t = 4 (e.g., C_in = 24: fused conv 3·3·24·96 = 20,736 parameters, plus projection to C_out = 48, 1·1·96·48 = 4608 parameters) total approximately 25,344 per block (Tan et al., 2021).
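The parameter arithmetic above can be checked with a short pure-Python helper (illustrative only: it counts bias-free convolution weights and ignores BatchNorm and SE parameters, matching the counts quoted in the text):

```python
def fused_mbconv_params(C_in, C_out, t, k):
    """Weight count of a Fused-MBConv block (conv weights only; no bias/BN/SE)."""
    fused = k * k * C_in * (t * C_in)    # fused k×k conv: C_in -> t*C_in
    proj = 1 * 1 * (t * C_in) * C_out    # 1×1 projection: t*C_in -> C_out
    return fused, proj, fused + proj

print(fused_mbconv_params(24, 24, t=1, k=3))  # (5184, 576, 5760)
print(fused_mbconv_params(24, 48, t=4, k=3))  # (20736, 4608, 25344)
```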
3. Computational Complexity and Parameter Analysis
Fused-MBConv and MBConv differ in their FLOPs and parameter composition. Let k be the kernel size, H × W the spatial dimensions, C_in the input channels, t the expansion ratio, and C_out the output channels.
- MBConv:
  - Expand 1×1: H·W·C_in·(t·C_in) FLOPs
  - Depthwise k×k: H·W·(t·C_in)·k² FLOPs
  - Project 1×1: H·W·(t·C_in)·C_out FLOPs
  - Parameters: C_in·(t·C_in) + k²·(t·C_in) + (t·C_in)·C_out
- Fused-MBConv:
  - Fused k×k: H·W·k²·C_in·(t·C_in) FLOPs
  - Project 1×1: H·W·(t·C_in)·C_out FLOPs
  - Parameters: k²·C_in·(t·C_in) + (t·C_in)·C_out
Although Fused-MBConv generally incurs a greater parameter and FLOP cost than standard depthwise-separable convolution, the increase is limited in the early stages (where C_in is small) and is offset by improved throughput due to better accelerator utilization (Tan et al., 2021).
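Under the formulas above, a small sketch compares the two blocks' conv weights and FLOPs (counted as multiply-accumulates, stride 1, "same" padding, BN/SE ignored; the channel widths below are illustrative, not taken from a specific EfficientNetV2 stage). It shows that while the relative overhead of fusing is similar at all widths, the absolute cost grows sharply with channel count, which is why fusing is cheap only in narrow early stages:

```python
def mbconv_cost(H, W, C_in, C_out, t, k):
    """(FLOPs, params) of an MBConv block: expand 1×1 + depthwise k×k + project 1×1."""
    Ce = t * C_in
    flops = H * W * C_in * Ce + H * W * Ce * k * k + H * W * Ce * C_out
    params = C_in * Ce + k * k * Ce + Ce * C_out
    return flops, params

def fused_cost(H, W, C_in, C_out, t, k):
    """(FLOPs, params) of a Fused-MBConv block: fused k×k + project 1×1."""
    Ce = t * C_in
    flops = H * W * k * k * C_in * Ce + H * W * Ce * C_out
    params = k * k * C_in * Ce + Ce * C_out
    return flops, params

# Narrow (early-stage-like) vs wide (late-stage-like) channel counts
for C in (24, 256):
    _, p_m = mbconv_cost(56, 56, C, C, t=4, k=3)
    _, p_f = fused_cost(56, 56, C, C, t=4, k=3)
    print(f"C={C}: MBConv params={p_m}, Fused params={p_f}, extra={p_f - p_m}")
```

At C = 24 the fused block costs only a few thousand extra weights, while at C = 256 the absolute overhead is on the order of millions, consistent with the paper's finding that fusing all stages inflates the model.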
Empirical data from Tan et al. (2021), comparing an EfficientNet-B4 baseline against a variant with Fused-MBConv in the early stages:

| Configuration | Params | FLOPs | Top-1 | Images/sec (TPUv3) |
|---------------------------|--------|-------|-------|--------------------|
| No Fused (all MBConv) | 19.3M | 4.5B | 82.8% | 262 |
| Fused in Stages 1–3 only | 20.0M | 7.5B | 83.1% | 362 |
Fully replacing all MBConv blocks with Fused-MBConv increases the parameter count substantially (e.g., to 132M) and degrades training efficiency, motivating a hybrid approach (Tan et al., 2021).
4. Neural Architecture Search and Block Selection
Fused-MBConv arose from training-aware neural architecture search (NAS) utilizing a stage-wise and factorized search over operator type, kernel size, expansion ratio, and repeat count. The search reward for a configuration is A · S^w · P^v, where A is Top-1 accuracy, S is normalized step time, P is parameter count, w = −0.07, and v = −0.05.
The search space included the operator choice {MBConv, Fused-MBConv}, kernel size {3×3, 5×5}, and expansion ratio {1, 4, 6}. Empirical search observations demonstrated:
- Early stages (1–3) consistently favor Fused-MBConv for improved throughput.
- Later stages (4–7), where channel counts are larger, favor conventional MBConv to maintain parameter and FLOP efficiency and exploit depthwise separation.
This hybrid pattern, adopting Fused-MBConv in early blocks and MBConv in later, high-channel-depth blocks, was computationally validated to offer better speed-accuracy tradeoffs than homogeneous block choices (Tan et al., 2021).
5. Empirical Results and Practical Impact
Empirical evaluation of the Fused-MBConv operator within EfficientNetV2 demonstrates:
- Training step time is reduced by 30–40% when Fused-MBConv is used in early network stages (e.g., EfficientNetV2-S achieves 20 ms/step for 83.9% Top-1, compared to EfficientNet-V1’s 45 ms/step for similar accuracy).
- Selective use of Fused-MBConv in stages 1–3 increases throughput by 38% and Top-1 accuracy by 0.3 percentage points compared to an all-MBConv baseline.
- End-to-end model comparison (EfficientNetV2-S: 83.9% Top-1, 22M params, 8.8B FLOPs, 7h train-time) shows superior efficiency relative to earlier architectures (EfficientNet-B7: 84.7% Top-1, 66M params, 38B FLOPs, 139h train-time), with EfficientNetV2-M matching or exceeding B7 accuracy at roughly 11× faster training and 20% fewer parameters.
- Overuse of Fused-MBConv (in later, wide stages) severely increases parameter count and can degrade accuracy, justifying its selective adoption (Tan et al., 2021).
6. Significance and Architectural Implications
Fused-MBConv represents an evolution in mobile and resource-aware convolutional block design. By merging the expand and depthwise operations into a single dense convolution, it addresses accelerator memory access bottlenecks prevalent in depthwise kernels for early-stage, low-channel layers. The block’s design enables modern GPUs/TPUs to operate at higher throughput on these stages with only minor overhead in parameter count, as confirmed by NAS-informed block selection. The significance is particularly evident in training efficiency; EfficientNetV2 models with selectively integrated Fused-MBConv blocks train 3–11× faster end-to-end while preserving or even improving state-of-the-art accuracy across diverse datasets (Tan et al., 2021). The block thus provides a principled, empirically grounded operator that supports both speed and accuracy targets in contemporary convolutional architectures.