Fused-MBConv Operator in EfficientNetV2
- The Fused-MBConv operator is a CNN primitive that fuses the expansion and depthwise convolutions of the MBConv block into a single regular convolution, simplifying block structure.
- It eliminates the squeeze-and-excitation module and replaces memory-bound depthwise operations with dense convolutions, improving throughput on modern hardware despite a higher FLOP count.
- Empirical evaluations in EfficientNetV2 reveal faster training and inference in early stages, despite increased parameter cost in deeper layers.
The Fused-MBConv operator is a convolutional neural network (CNN) architectural primitive introduced as part of the EfficientNetV2 search space. It is designed to optimize both training speed and parameter efficiency by streamlining block structure and maximizing the effectiveness of hardware-optimized dense convolutional kernels. Fused-MBConv fuses the expansion and depthwise convolutions of the standard MBConv block into a single regular convolution, omits squeeze-and-excitation (SE) modules, and demonstrates empirical benefits when applied selectively in early network stages, especially on contemporary accelerator hardware (Tan et al., 2021).
1. Block Structure and Comparison with MBConv
Fused-MBConv modifies the traditional MBConv block used in MobileNet-like architectures. In standard MBConv, the computational flow is: 1×1 expansion convolution, depthwise k×k convolution, SE module, 1×1 projection convolution, and optional residual connection. In contrast, Fused-MBConv fuses the expansion (1×1) and depthwise k×k operations into a single regular k×k convolution, followed by a 1×1 projection convolution. The block omits the SE module entirely.
Structural Details
| Component | MBConv (Standard) | Fused-MBConv |
|---|---|---|
| Expansion | 1×1 conv (C_in→C_exp) | k×k conv (C_in→C_exp) |
| Depthwise convolution | k×k depthwise (C_exp) | — |
| Squeeze-and-Excite | Yes (ratio 0.25) | No |
| Projection | 1×1 conv (C_exp→C_out) | 1×1 conv (C_exp→C_out) |
| Residual | Yes (if stride=1, C_in=C_out) | Yes (if stride=1, C_in=C_out) |
From a workflow perspective, the Fused-MBConv reduces the number of convolutions and removes the SE path, thereby lowering architectural complexity and memory access overhead (Tan et al., 2021).
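The structural contrast in the table can be sketched as plain data. The following framework-free Python sketch (function names and the example channel count of 24 are illustrative, not from the source) lists each block's layers as `(op, in_ch, out_ch, kernel)` tuples:

```python
# Sketch of the two block layouts; e is the expansion ratio, k the kernel size.

def mbconv_layers(c_in, c_out, e=4, k=3):
    c_exp = c_in * e
    return [
        ("1x1 expand", c_in, c_exp, 1),
        (f"{k}x{k} depthwise", c_exp, c_exp, k),
        ("SE (ratio 0.25)", c_exp, c_exp, None),
        ("1x1 project", c_exp, c_out, 1),
    ]

def fused_mbconv_layers(c_in, c_out, e=4, k=3):
    c_exp = c_in * e
    return [
        # One regular k×k conv replaces the 1×1 expand + k×k depthwise pair.
        (f"{k}x{k} fused expand", c_in, c_exp, k),
        # SE is omitted entirely; only the 1×1 projection remains.
        ("1x1 project", c_exp, c_out, 1),
    ]

for layer in mbconv_layers(24, 24):
    print(layer)
print("---")
for layer in fused_mbconv_layers(24, 24):
    print(layer)
```

The fused variant halves the number of distinct operations in the block, which is the source of the reduced memory access overhead noted above.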
2. Formal Parameter and FLOP Cost Analysis
Let $C_{\text{in}}$ denote the input channels, $C_{\text{out}}$ the output channels (often $C_{\text{out}} = C_{\text{in}}$), $e$ the expansion ratio, $C_{\text{exp}} = e \cdot C_{\text{in}}$ the expanded width, $k$ the kernel size (typically 3), and $H \times W$ the spatial resolution.
Parameter Count
- Standard MBConv (ignoring SE): $C_{\text{in}} C_{\text{exp}} + k^2 C_{\text{exp}} + C_{\text{exp}} C_{\text{out}}$
- Fused-MBConv: $k^2 C_{\text{in}} C_{\text{exp}} + C_{\text{exp}} C_{\text{out}}$
FLOPs
- MBConv: $H W \left( C_{\text{in}} C_{\text{exp}} + k^2 C_{\text{exp}} + C_{\text{exp}} C_{\text{out}} \right)$ multiply–accumulate operations (stride 1)
- Fused-MBConv: $H W \left( k^2 C_{\text{in}} C_{\text{exp}} + C_{\text{exp}} C_{\text{out}} \right)$
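These counts can be checked numerically. A minimal sketch in plain Python (function names and the example width of 24 are illustrative; BN parameters and biases are ignored, and stride 1 is assumed so every weight fires once per spatial position):

```python
def mbconv_cost(c_in, c_out, e=4, k=3, h=56, w=56):
    """Params and multiply-accumulate FLOPs for standard MBConv (SE ignored)."""
    c_exp = c_in * e
    params = c_in * c_exp + k * k * c_exp + c_exp * c_out
    flops = h * w * params  # stride-1 assumption: FLOPs = H*W * params
    return params, flops

def fused_mbconv_cost(c_in, c_out, e=4, k=3, h=56, w=56):
    """Params and multiply-accumulate FLOPs for Fused-MBConv."""
    c_exp = c_in * e
    params = k * k * c_in * c_exp + c_exp * c_out
    flops = h * w * params
    return params, flops

print(mbconv_cost(24, 24))        # narrow, early-stage block
print(fused_mbconv_cost(24, 24))  # fused variant at the same width
```

At this narrow width the fused block already carries roughly 4× the parameters, foreshadowing the relative-overhead analysis below.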
Relative Overhead
For $C_{\text{out}} = C_{\text{in}} = C$, the parameter (and FLOP) ratio of Fused-MBConv to MBConv is $\frac{k^2 C + C}{2C + k^2} = \frac{(k^2 + 1)\,C}{2C + k^2}$. For small $C$, the ratio stays close to $1$, so the fused convolution adds little overhead in shallow, narrow stages. As $C$ grows, the ratio approaches $(k^2 + 1)/2$, i.e. $5$ for $k = 3$, indicating comparatively higher parameter cost for Fused-MBConv in deep, wide stages (Tan et al., 2021).
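The limiting behavior of this ratio can be verified directly; a small sketch with $k = 3$ (the helper name is illustrative):

```python
def param_ratio(c, k=3):
    """Fused-MBConv / MBConv parameter ratio for c_in == c_out == c."""
    return (k * k * c + c) / (2 * c + k * k)

for c in (1, 8, 24, 256):
    print(c, round(param_ratio(c), 2))
# The ratio rises from ~1 at small widths toward (k**2 + 1) / 2 == 5.0,
# matching the observation that fusing is cheap early and costly late.
```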
3. Implementation Example
A canonical PyTorch-style Fused-MBConv operator is structured as follows:
```python
import torch
import torch.nn as nn


class FusedMBConv(nn.Module):
    """Fused-MBConv: conv(k×k)→BN→Swish→conv(1×1)→BN + optional residual."""

    def __init__(self, in_ch, out_ch, expansion=4, kernel_size=3, stride=1):
        super().__init__()
        hidden_ch = in_ch * expansion
        self.use_res_connect = (stride == 1 and in_ch == out_ch)
        # Fused expansion: one regular k×k convolution replaces the
        # 1×1 expansion + k×k depthwise pair of standard MBConv.
        self.conv1 = nn.Conv2d(in_ch, hidden_ch, kernel_size=kernel_size,
                               stride=stride, padding=kernel_size // 2,
                               bias=False)
        self.bn1 = nn.BatchNorm2d(hidden_ch)
        # 1×1 projection back to the output channel count; no SE module.
        self.conv2 = nn.Conv2d(hidden_ch, out_ch, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.activation = nn.SiLU(inplace=True)  # Swish

    def forward(self, x):
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.activation(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.use_res_connect:
            return x + out
        return out
```
The block applies a regular k×k convolution (serving as the expansion) followed by a 1×1 projection convolution, with batch normalization, Swish activation, and an optional residual connection (Tan et al., 2021).
4. Empirical Trade-offs and Throughput
Empirical evaluation within EfficientNet-B4–scale models demonstrates stage-dependent trade-offs as follows:
| Scenario (stages replaced with Fused-MBConv) | Params | FLOPs | TPU throughput | V100 throughput | Top-1 Accuracy |
|---|---|---|---|---|---|
| Stages 1–3 | 20.0M | 7.5B | 362 imgs/s | 216 imgs/s | 83.1% |
| Baseline (all MBConv) | 19.3M | 4.5B | 262 imgs/s | 155 imgs/s | 82.8% |
| All stages 1–7 | 132M | 34.4B | 254 imgs/s | — | 81.7% |
Selective use of Fused-MBConv in early stages (where $C_{\text{in}}$ is small) increases training throughput by 38% (TPU) and 39% (V100) with minimal parameter growth or accuracy drop versus MBConv. Replacing all MBConv blocks with Fused-MBConv leads to a substantial parameter/FLOP increase and a decrease in accuracy and throughput, indicating diminishing returns in later, wide-channel network regions. Fused-MBConv thus optimizes early-stage training and inference efficiency by leveraging regular convolution primitives that are more heavily optimized on modern hardware (Tan et al., 2021).
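The quoted speedups follow directly from the throughput columns of the table; a quick arithmetic check:

```python
# Throughputs (imgs/s) from the table: baseline vs. stages 1–3 fused.
tpu_base, tpu_fused = 262, 362
v100_base, v100_fused = 155, 216

print(f"TPU speedup:  {tpu_fused / tpu_base - 1:.0%}")   # ≈ 38%
print(f"V100 speedup: {v100_fused / v100_base - 1:.0%}")  # ≈ 39%
```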
5. Role in EfficientNetV2 Architecture Search and Scalability
The introduction of Fused-MBConv as a new search space primitive was instrumental in the EfficientNetV2 neural architecture search (NAS). The automated search algorithm predominantly selected Fused-MBConv for stages 1–3 (where $C_{\text{in}}$ is small), while retaining standard MBConv (with SE) in deeper stages to manage resource utilization. In the EfficientNetV2-S configuration, the stage allocation is:
- Stages 1–3: Fused-MBConv ($e = 1$ in stage 1, $e = 4$ in stages 2–3)
- Stages 4–6: MBConv with SE ($e = 4$ or $e = 6$)
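This allocation can be written down as a configuration sketch. The dictionary below is illustrative and encodes only the block type, expansion ratio, and SE usage for EfficientNetV2-S; per-stage channel counts and layer depths are omitted, as they vary with the model variant:

```python
# Stage-wise operator allocation for EfficientNetV2-S (illustrative sketch).
EFFNETV2_S_STAGES = [
    {"stage": 1, "block": "FusedMBConv", "e": 1, "se": False},
    {"stage": 2, "block": "FusedMBConv", "e": 4, "se": False},
    {"stage": 3, "block": "FusedMBConv", "e": 4, "se": False},
    {"stage": 4, "block": "MBConv",      "e": 4, "se": True},
    {"stage": 5, "block": "MBConv",      "e": 6, "se": True},
    {"stage": 6, "block": "MBConv",      "e": 6, "se": True},
]

fused = [s["stage"] for s in EFFNETV2_S_STAGES if s["block"] == "FusedMBConv"]
print(fused)  # fused blocks appear only in the early stages
```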
This staged approach achieves a training step time 3× faster than EfficientNetV1 of similar size, supporting both faster convergence and improved parameter efficiency. The reliance on regular convolutions aligns with the performance advantages of dense kernel implementations in GPU/TPU environments. Overall, Fused-MBConv enables training time reductions of up to 11× and inference latencies up to 3× lower compared to previous baselines, without a parameter efficiency trade-off in the critical early layers (Tan et al., 2021).
6. Context, Limitations, and Hardware Considerations
Fused-MBConv demonstrates maximal benefit in shallow layers where input channel count is low. This is attributed to the avoidance of memory-bound depthwise convolution operations and the ability to exploit highly optimized routines for regular convolutions on GPU and TPU architectures. Excessive use throughout the network causes inefficient parameter scaling and reduced throughput, especially in wide, late-stage blocks. As such, Fused-MBConv is most effective when deployed selectively as part of a hybrid strategy with MBConv. This design choice is reflected in the optimal architectures discovered through NAS for EfficientNetV2 (Tan et al., 2021).
A plausible implication is that future architecture searches may further customize operator selection on a per-stage basis in resource-constrained and hardware-specific scenarios.