Fused-MBConv Operator in EfficientNetV2
- The Fused-MBConv operator is a CNN primitive that fuses the expansion and depthwise convolutions of the MBConv block into a single regular convolution, simplifying block structure.
- It eliminates the squeeze-and-excitation module and replaces memory-bound depthwise operations with dense convolutions, improving throughput on modern hardware despite a higher FLOP count.
- Empirical evaluations in EfficientNetV2 reveal faster training and inference in early stages, despite increased parameter cost in deeper layers.
The Fused-MBConv operator is a convolutional neural network (CNN) architectural primitive introduced as part of the EfficientNetV2 search space. It is designed to optimize both training speed and parameter efficiency by streamlining block structure and maximizing the effectiveness of hardware-optimized dense convolutional kernels. Fused-MBConv fuses the expansion and depthwise convolutions of the standard MBConv block into a single regular convolution, omits squeeze-and-excitation (SE) modules, and demonstrates empirical benefits when applied selectively in early network stages, especially on contemporary accelerator hardware (Tan et al., 2021).
1. Block Structure and Comparison with MBConv
Fused-MBConv modifies the traditional MBConv block used in MobileNet-like architectures. In standard MBConv, the computational flow is: 1×1 expansion convolution, depthwise k×k convolution, SE module, 1×1 projection convolution, and optional residual connection. In contrast, Fused-MBConv fuses the expansion (1×1) and depthwise k×k operations into a single regular k×k convolution, followed by a 1×1 projection convolution. The block omits the SE module entirely.
Structural Details
| Component | MBConv (Standard) | Fused-MBConv |
|---|---|---|
| Expansion | 1×1 conv (C_in→C_exp) | k×k conv (C_in→C_exp) |
| Depthwise convolution | k×k depthwise (C_exp) | — |
| Squeeze-and-Excite | Yes (ratio 0.25) | No |
| Projection | 1×1 conv (C_exp→C_out) | 1×1 conv (C_exp→C_out) |
| Residual | Yes (if stride=1, C_in=C_out) | Yes (if stride=1, C_in=C_out) |
From a workflow perspective, the Fused-MBConv reduces the number of convolutions and removes the SE path, thereby lowering architectural complexity and memory access overhead (Tan et al., 2021).
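The structural contrast in the table can be sketched as plain data. The following framework-free Python sketch (function names and the example channel count of 24 are illustrative, not from the source) lists each block's layers as `(op, in_ch, out_ch, kernel)` tuples:

```python
# Sketch of the two block layouts; e is the expansion ratio, k the kernel size.

def mbconv_layers(c_in, c_out, e=4, k=3):
    c_exp = c_in * e
    return [
        ("1x1 expand", c_in, c_exp, 1),
        (f"{k}x{k} depthwise", c_exp, c_exp, k),
        ("SE (ratio 0.25)", c_exp, c_exp, None),
        ("1x1 project", c_exp, c_out, 1),
    ]

def fused_mbconv_layers(c_in, c_out, e=4, k=3):
    c_exp = c_in * e
    return [
        # One regular k×k conv replaces the 1×1 expand + k×k depthwise pair.
        (f"{k}x{k} fused expand", c_in, c_exp, k),
        # SE is omitted entirely; only the 1×1 projection remains.
        ("1x1 project", c_exp, c_out, 1),
    ]

for layer in mbconv_layers(24, 24):
    print(layer)
print("---")
for layer in fused_mbconv_layers(24, 24):
    print(layer)
```

The fused variant halves the number of distinct operations in the block, which is the source of the reduced memory access overhead noted above.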
2. Formal Parameter and FLOP Cost Analysis
Let $C_{\text{in}}$ denote the input channels, $C_{\text{out}}$ the output channels (often $C_{\text{out}} = C_{\text{in}}$), $e$ the expansion ratio, $C_{\text{exp}} = e \cdot C_{\text{in}}$ the expanded width, $k$ the kernel size (typically 3), and $H \times W$ the spatial resolution.
Parameter Count
- Standard MBConv (ignoring SE): $C_{\text{in}} C_{\text{exp}} + k^2 C_{\text{exp}} + C_{\text{exp}} C_{\text{out}}$
- Fused-MBConv: $k^2 C_{\text{in}} C_{\text{exp}} + C_{\text{exp}} C_{\text{out}}$
FLOPs
- MBConv: $H W \left( C_{\text{in}} C_{\text{exp}} + k^2 C_{\text{exp}} + C_{\text{exp}} C_{\text{out}} \right)$ multiply–accumulate operations (stride 1)
- Fused-MBConv: $H W \left( k^2 C_{\text{in}} C_{\text{exp}} + C_{\text{exp}} C_{\text{out}} \right)$
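These counts can be checked numerically. A minimal sketch in plain Python (function names and the example width of 24 are illustrative; BN parameters and biases are ignored, and stride 1 is assumed so every weight fires once per spatial position):

```python
def mbconv_cost(c_in, c_out, e=4, k=3, h=56, w=56):
    """Params and multiply-accumulate FLOPs for standard MBConv (SE ignored)."""
    c_exp = c_in * e
    params = c_in * c_exp + k * k * c_exp + c_exp * c_out
    flops = h * w * params  # stride-1 assumption: FLOPs = H*W * params
    return params, flops

def fused_mbconv_cost(c_in, c_out, e=4, k=3, h=56, w=56):
    """Params and multiply-accumulate FLOPs for Fused-MBConv."""
    c_exp = c_in * e
    params = k * k * c_in * c_exp + c_exp * c_out
    flops = h * w * params
    return params, flops

print(mbconv_cost(24, 24))        # narrow, early-stage block
print(fused_mbconv_cost(24, 24))  # fused variant at the same width
```

At this narrow width the fused block already carries roughly 4× the parameters, foreshadowing the relative-overhead analysis below.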
Relative Overhead
For $C_{\text{out}} = C_{\text{in}} = C$, the parameter (and FLOP) ratio of Fused-MBConv to MBConv is $\frac{k^2 C + C}{2C + k^2} = \frac{(k^2 + 1)\,C}{2C + k^2}$. For small $C$, the ratio stays close to $1$, so the fused convolution adds little overhead in shallow, narrow stages. As $C$ grows, the ratio approaches $(k^2 + 1)/2$, i.e. $5$ for $k = 3$, indicating comparatively higher parameter cost for Fused-MBConv in deep, wide stages (Tan et al., 2021).
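The limiting behavior of this ratio can be verified directly; a small sketch with $k = 3$ (the helper name is illustrative):

```python
def param_ratio(c, k=3):
    """Fused-MBConv / MBConv parameter ratio for c_in == c_out == c."""
    return (k * k * c + c) / (2 * c + k * k)

for c in (1, 8, 24, 256):
    print(c, round(param_ratio(c), 2))
# The ratio rises from ~1 at small widths toward (k**2 + 1) / 2 == 5.0,
# matching the observation that fusing is cheap early and costly late.
```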
3. Implementation Example
A canonical PyTorch-style Fused-MBConv operator is structured as follows:
```python
import torch
import torch.nn as nn


class FusedMBConv(nn.Module):
    """Fused-MBConv: conv(k×k)→BN→Swish→conv(1×1)→BN + optional residual."""

    def __init__(self, in_ch, out_ch, expansion=4, kernel_size=3, stride=1):
        super().__init__()
        hidden_ch = in_ch * expansion
        self.use_res_connect = (stride == 1 and in_ch == out_ch)
        # Fused expansion: one regular k×k convolution replaces the
        # 1×1 expansion + k×k depthwise pair of standard MBConv.
        self.conv1 = nn.Conv2d(in_ch, hidden_ch, kernel_size=kernel_size,
                               stride=stride, padding=kernel_size // 2,
                               bias=False)
        self.bn1 = nn.BatchNorm2d(hidden_ch)
        # 1×1 projection back to the output channel count; no SE module.
        self.conv2 = nn.Conv2d(hidden_ch, out_ch, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.activation = nn.SiLU(inplace=True)  # Swish

    def forward(self, x):
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.activation(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.use_res_connect:
            return x + out
        return out
```
The block applies a regular k×k convolution (serving as the expansion) followed by a 1×1 projection convolution, with batch normalization, Swish activation, and an optional residual connection (Tan et al., 2021).
4. Empirical Trade-offs and Throughput
Empirical evaluation within EfficientNet-B4–scale models demonstrates stage-dependent trade-offs as follows:
| Scenario (stages replaced with Fused-MBConv) | Params | FLOPs | TPU throughput | V100 throughput | Top-1 Accuracy |
|---|---|---|---|---|---|
| Stages 1–3 | 20.0M | 7.5B | 362 imgs/s | 216 imgs/s | 83.1% |
| Baseline (all MBConv) | 19.3M | 4.5B | 262 imgs/s | 155 imgs/s | 82.8% |
| All stages 1–7 | 132M | 34.4B | 254 imgs/s | — | 81.7% |
Selective use of Fused-MBConv in early stages (where $C_{\text{in}}$ is small) increases training throughput by 38% (TPU) and 39% (V100) with minimal parameter growth or accuracy drop versus MBConv. Replacing all MBConv blocks with Fused-MBConv leads to a substantial parameter/FLOP increase and a decrease in accuracy and throughput, indicating diminishing returns in later, wide-channel network regions. Fused-MBConv thus optimizes early-stage training and inference efficiency by leveraging regular convolution primitives that are more heavily optimized on modern hardware (Tan et al., 2021).
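The quoted speedups follow directly from the throughput columns of the table; a quick arithmetic check:

```python
# Throughputs (imgs/s) from the table: baseline vs. stages 1–3 fused.
tpu_base, tpu_fused = 262, 362
v100_base, v100_fused = 155, 216

print(f"TPU speedup:  {tpu_fused / tpu_base - 1:.0%}")   # ≈ 38%
print(f"V100 speedup: {v100_fused / v100_base - 1:.0%}")  # ≈ 39%
```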
5. Role in EfficientNetV2 Architecture Search and Scalability
The introduction of Fused-MBConv as a new search space primitive was instrumental in the EfficientNetV2 neural architecture search (NAS). The automated search algorithm predominantly selected Fused-MBConv for stages 1–3 (where $C_{\text{in}}$ is small), while retaining standard MBConv (with SE) in deeper stages to manage resource utilization. In the EfficientNetV2-S configuration, the stage allocation is:
- Stages 1–3: Fused-MBConv ($e = 1$ in stage 1, $e = 4$ in stages 2–3)
- Stages 4–6: MBConv with SE ($e = 4$ or $e = 6$)
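This allocation can be written down as a configuration sketch. The dictionary below is illustrative and encodes only the block type, expansion ratio, and SE usage for EfficientNetV2-S; per-stage channel counts and layer depths are omitted, as they vary with the model variant:

```python
# Stage-wise operator allocation for EfficientNetV2-S (illustrative sketch).
EFFNETV2_S_STAGES = [
    {"stage": 1, "block": "FusedMBConv", "e": 1, "se": False},
    {"stage": 2, "block": "FusedMBConv", "e": 4, "se": False},
    {"stage": 3, "block": "FusedMBConv", "e": 4, "se": False},
    {"stage": 4, "block": "MBConv",      "e": 4, "se": True},
    {"stage": 5, "block": "MBConv",      "e": 6, "se": True},
    {"stage": 6, "block": "MBConv",      "e": 6, "se": True},
]

fused = [s["stage"] for s in EFFNETV2_S_STAGES if s["block"] == "FusedMBConv"]
print(fused)  # fused blocks appear only in the early stages
```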
This staged approach achieves a training step time 3× faster than EfficientNetV1 of similar size, supporting both faster convergence and improved parameter efficiency. The reliance on regular convolutions aligns with the performance advantages of dense kernel implementations in GPU/TPU environments. Overall, Fused-MBConv enables training time reductions of up to 11× and inference latencies up to 3× lower compared to previous baselines, without a parameter efficiency trade-off in the critical early layers (Tan et al., 2021).
6. Context, Limitations, and Hardware Considerations
Fused-MBConv demonstrates maximal benefit in shallow layers where input channel count is low. This is attributed to the avoidance of memory-bound depthwise convolution operations and the ability to exploit highly optimized routines for regular convolutions on GPU and TPU architectures. Excessive use throughout the network causes inefficient parameter scaling and reduced throughput, especially in wide, late-stage blocks. As such, Fused-MBConv is most effective when deployed selectively as part of a hybrid strategy with MBConv. This design choice is reflected in the optimal architectures discovered through NAS for EfficientNetV2 (Tan et al., 2021).
A plausible implication is that future architecture searches may further customize operator selection on a per-stage basis in resource-constrained and hardware-specific scenarios.