Fused-MBConv Blocks: Design & Efficiency
- Fused-MBConv blocks are architectural primitives that fuse channel expansion and spatial aggregation into a single convolution to optimize compute efficiency.
- They accelerate training and inference in early network stages by leveraging enhanced hardware parallelism and reducing memory bandwidth constraints.
- Recent advances in CUDA kernel fusion further reduce global memory access and energy consumption, significantly increasing throughput on GPUs and TPUs.
Fused-MBConv blocks are architectural primitives that merge the channel expansion and spatial aggregation steps of the inverted bottleneck (MBConv) design into a single convolutional operation, improving computational throughput and hardware utilization in modern deep neural networks. They were popularized by the EfficientNetV2 family, where their use in the early stages of the network significantly accelerated training and inference, particularly on accelerators such as GPUs and TPUs. More recently, advances in CUDA kernel fusion and analytic memory-traffic models have enabled highly optimized implementations of Fused-MBConv and related modules, further reducing memory access and energy consumption.
1. Motivation and Architectural Rationale
Standard MBConv blocks, introduced in MobileNetV2 and popularized by EfficientNet, combine four steps: (1) a 1×1 pointwise “expansion” convolution, (2) a k×k depthwise convolution, (3) an optional squeeze-and-excitation (SE) module, and (4) a 1×1 projection convolution. While this design efficiently parameterizes channel expansion and spatial aggregation, the reliance on depthwise convolutions leads to suboptimal hardware utilization due to a low compute-to-memory-access ratio, which is especially problematic in early layers with large spatial maps.
Fused-MBConv was introduced to address this bottleneck by merging the first two convolutions into a single k×k dense convolution. This reorganization increases FLOPs but yields larger matrix-multiplication payloads per kernel invocation, enhancing accelerator efficiency. Consequently, Fused-MBConv offers faster training and inference in regimes where hardware parallelism and memory bandwidth are the primary scalability constraints (Tan et al., 2021).
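To make this trade-off concrete, the following sketch compares multiply-accumulate (MAC) counts for the depthwise-separable pair (1×1 expansion followed by a k×k depthwise convolution) against the single fused k×k dense convolution that replaces it; the shared 1×1 projection is omitted since it is identical in both variants. The function names and the early-stage example dimensions are illustrative assumptions, not values from the cited papers.

```python
def separable_macs(h, w, c_in, expand, k):
    """MACs for the MBConv-style pair: 1x1 expansion followed by kxk depthwise."""
    c_mid = c_in * expand
    expand_macs = h * w * c_in * c_mid       # 1x1 pointwise expansion
    depthwise_macs = h * w * c_mid * k * k   # kxk depthwise (one filter per channel)
    return expand_macs + depthwise_macs

def fused_macs(h, w, c_in, expand, k):
    """MACs for the single fused kxk dense convolution used by Fused-MBConv."""
    return h * w * c_in * (c_in * expand) * k * k

# Early-stage regime: large spatial map, modest channel count (illustrative values).
h = w = 112
c_in, expand, k = 24, 4, 3
sep, fused = separable_macs(h, w, c_in, expand, k), fused_macs(h, w, c_in, expand, k)
print(f"separable: {sep / 1e6:.0f} MMACs, fused: {fused / 1e6:.0f} MMACs "
      f"(~{fused / sep:.1f}x more FLOPs, but one dense, GEMM-friendly kernel)")
```

The extra arithmetic is the price paid for replacing a memory-bound depthwise operation with a dense convolution that accelerators can execute as a single large matrix multiplication.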
2. Internal Structure and Mathematical Definition
A Fused-MBConv block is parameterized as follows:
- $C_{in}$: Number of input channels
- $C_{out}$: Number of output channels
- $E$: Expansion ratio ($E \geq 1$, typically 4 or 6)
- $k$: Convolution kernel size (typically $3 \times 3$)
- $W_{\text{fused}}$: Weights of the fused convolution, shape $k \times k \times C_{in} \times (E \cdot C_{in})$
- $W_{\text{proj}}$: Weights of the projection 1×1 convolution, shape $1 \times 1 \times (E \cdot C_{in}) \times C_{out}$
Forward propagation proceeds as:
- Expansion and Spatial Convolution (Fused): $X_1 = \mathrm{SiLU}\!\left(\mathrm{BN}\!\left(\mathrm{Conv}_{k \times k}(X; W_{\text{fused}})\right)\right)$.
Resulting shape: $H \times W \times (E \cdot C_{in})$ (spatial dimensions reduced when the block uses stride 2).
- Squeeze-and-Excitation (SE, optional, ratio 0.25):
- $X_2 = X_1 \odot \sigma\!\left(W_2\, \delta\!\left(W_1\, \mathrm{GAP}(X_1)\right)\right)$, where $\mathrm{GAP}$ denotes global average pooling over the spatial dimensions, $\delta$ and $\sigma$ are the SE nonlinearity and sigmoid gate, and the SE bottleneck width is $0.25 \cdot C_{in}$.
- Projection: $Y = \mathrm{BN}\!\left(\mathrm{Conv}_{1 \times 1}(X_2; W_{\text{proj}})\right)$.
Output: $H \times W \times C_{out}$.
- Residual:
- If $C_{in} = C_{out}$ and stride $= 1$, output $Y + X$; else, output $Y$.
In the special case $E = 1$, the fused convolution becomes a standard $k \times k$ convolution without channel expansion. No activation function is applied after the projection convolution. The structure preserves the inverted bottleneck's expressive capacity but executes with higher arithmetic intensity per operation (Tan et al., 2021).
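As a concrete illustration, the following PyTorch module is a minimal sketch of the block described above (fused k×k convolution, optional SE, linear 1×1 projection, identity skip). The class name, argument names, and the simplified handling of the $E = 1$ case are illustrative assumptions; stochastic depth and other EfficientNetV2 training details are omitted.

```python
import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """Minimal Fused-MBConv sketch: fused kxk conv -> optional SE -> linear 1x1 projection."""

    def __init__(self, c_in, c_out, expand=4, k=3, stride=1, se_ratio=0.0):
        super().__init__()
        self.use_residual = stride == 1 and c_in == c_out
        has_expansion = expand != 1
        c_mid = c_in * expand if has_expansion else c_out

        # Fused expansion + spatial convolution (replaces 1x1 expand + kxk depthwise).
        self.fused = nn.Sequential(
            nn.Conv2d(c_in, c_mid, k, stride=stride, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.SiLU(),
        )

        # Optional squeeze-and-excitation on the expanded feature map.
        self.se = None
        if se_ratio > 0:
            c_se = max(1, int(c_in * se_ratio))
            self.se = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(c_mid, c_se, 1), nn.SiLU(),
                nn.Conv2d(c_se, c_mid, 1), nn.Sigmoid(),
            )

        # Linear 1x1 projection with no trailing activation; when expand == 1 the
        # fused conv already maps directly to c_out, so the projection is skipped.
        self.project = (
            nn.Sequential(nn.Conv2d(c_mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out))
            if has_expansion else nn.Identity()
        )

    def forward(self, x):
        y = self.fused(x)
        if self.se is not None:
            y = y * self.se(y)          # channel-wise gating
        y = self.project(y)
        return x + y if self.use_residual else y


# Example: a stage-2 block from the EfficientNetV2-S table in Section 4.
block = FusedMBConv(24, 48, expand=4, k=3, stride=2)
out = block(torch.randn(1, 24, 112, 112))   # -> torch.Size([1, 48, 56, 56])
```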
3. Hardware Efficiency: Empirical Evaluation
A controlled ablation in EfficientNet-B4 compares networks with varying proportions of MBConv and Fused-MBConv blocks:
| Stages with Fused-MBConv | Parameters | FLOPs | Throughput (imgs/sec/core, TPUv3) |
|---|---|---|---|
| None (all MBConv) | 19.3 M | 4.5 B | 262 |
| Stage 1–3 | 20.0 M | 7.5 B | 362 |
| All Stages | 132.0 M | 34.4 B | 254 |
Replacing MBConv with Fused-MBConv in early layers increases FLOPs, but throughput improves by ~1.4× due to better accelerator utilization. Full replacement (all stages) leads to excessive parameter growth, especially in later layers where spatial extents are small and channel counts large, ultimately degrading efficiency. This establishes the principle of using Fused-MBConv primarily in early, broad-feature stages where large workloads leverage hardware parallelism most effectively (Tan et al., 2021).
4. Automated Search and Model Integration
EfficientNetV2 integrates Fused-MBConv by incorporating it into the architecture search space with the following flexible parameters:
- Operator: {MBConv, Fused-MBConv}
- Kernel size: {3×3, 5×5}
- Expansion ratio: {1, 4, 6}
- Layer count per stage
Search reward is computed as $A \cdot S^w \cdot P^v$, where $A$ is model accuracy, $S$ is the normalized training step time, and $P$ is the parameter count, with $w = -0.07$ and $v = -0.05$. The discovered EfficientNetV2-S model, as a canonical example, uses Fused-MBConv exclusively in the first three stages, switching to SE-equipped MBConv in later stages:
| Stage | Op | Channels | Stride | Layers |
|---|---|---|---|---|
| 0 | Conv3×3 | 24 | 2 | 1 |
| 1 | Fused-MBConv₁ | 24→24 (k=3) | 1 | 2 |
| 2 | Fused-MBConv₄ | 24→48 (k=3) | 2 | 4 |
| 3 | Fused-MBConv₄ | 48→64 (k=3) | 2 | 4 |
| 4–6 | MBConv (SE 0.25) | 64→256 | varies | varies |
| 7 | Conv1×1 → Pool → FC | – | – | – |
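To connect the table to concrete tensor shapes, the short sketch below walks an illustrative 224×224 input through the stem and the three Fused-MBConv stages listed above, tracking output resolution and an approximate convolution-weight count per stage; BatchNorm parameters are ignored, and the helper names and stride handling are simplifying assumptions.

```python
# Illustrative walk through the EfficientNetV2-S stem and early stages from the table above.
STAGES = [
    # (name,            c_in, c_out, expand, k, stride, layers)
    ("Conv3x3 stem",      3,   24,   None,  3, 2, 1),
    ("Fused-MBConv1",    24,   24,   1,     3, 1, 2),
    ("Fused-MBConv4",    24,   48,   4,     3, 2, 4),
    ("Fused-MBConv4",    48,   64,   4,     3, 2, 4),
]

def fused_block_params(c_in, c_out, expand, k):
    if expand == 1:                                  # single kxk conv, no projection
        return k * k * c_in * c_out
    c_mid = c_in * expand
    return k * k * c_in * c_mid + c_mid * c_out      # fused kxk conv + 1x1 projection

h = w = 224                                          # illustrative input resolution
for name, c_in, c_out, expand, k, stride, layers in STAGES:
    params = 0
    for i in range(layers):
        cin_i, stride_i = (c_in, stride) if i == 0 else (c_out, 1)
        if expand is None:                           # plain stem convolution
            params += k * k * cin_i * c_out
        else:
            params += fused_block_params(cin_i, c_out, expand, k)
        h, w = h // stride_i, w // stride_i
    print(f"{name:>14}: out {h}x{w}x{c_out}, ~{params / 1e3:.0f}K conv params")
```

The output makes the placement principle visible: the fused stages operate on large spatial maps yet contribute only a small fraction of the network's parameters.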
Scaling to EfficientNetV2-M/L/XL is achieved by compound scaling of depth, width, and resolution—with an upper bound on the maximum training image size (380–480 px) and a preference for adding new layers to later stages. This prevents bottlenecking on early stages already optimized by Fused-MBConv (Tan et al., 2021).
5. GPU Kernel Fusion and Practical Implementation
Recent work on Fused Convolutional Modules (FCMs) extends the principles of Fused-MBConv by proposing CUDA kernels that fuse depthwise and pointwise convolutions, guided by analytic memory-traffic cost models (FusePlanner) (Qararyah et al., 30 Apr 2024). A practical implementation on CUDA platforms proceeds as follows:
- Compute the depthwise and pointwise convolution operations in a single kernel launch, maximizing on-chip data reuse.
- Maintain all intermediate activations in shared memory (“commBuffer”) and registers, never writing intermediate results to global DRAM.
- Choose tiling parameters to fit all shared-memory buffers and weight tiles within L1/shared memory, ensuring grid occupancy matches or exceeds the number of streaming multiprocessors (SMs).
- Use INT8 inference where accuracy permits, allowing larger tiles and more aggressive fusion.
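The scale of these savings can be approximated with a back-of-the-envelope traffic model. The sketch below is a deliberately simplified, first-order estimate of DRAM traffic for a depthwise + pointwise pair executed layer-by-layer versus in one fused kernel; it is not the FusePlanner cost model from the cited work, and it ignores tile halo re-reads, caching, and occupancy constraints.

```python
def layerwise_traffic_bytes(h, w, c, c_out, k, dtype_bytes=2):
    """DRAM traffic when a kxk depthwise and a 1x1 pointwise conv run as separate
    kernels: the intermediate activation is written to and re-read from DRAM."""
    inp = h * w * c * dtype_bytes                     # read input feature map
    mid = h * w * c * dtype_bytes                     # depthwise output (same channel count)
    out = h * w * c_out * dtype_bytes                 # pointwise output
    weights = (k * k * c + c * c_out) * dtype_bytes   # depthwise + pointwise weights
    return inp + 2 * mid + out + weights              # intermediate written, then re-read

def fused_traffic_bytes(h, w, c, c_out, k, dtype_bytes=2):
    """DRAM traffic when both convolutions share one kernel and the intermediate
    stays in shared memory/registers."""
    inp = h * w * c * dtype_bytes
    out = h * w * c_out * dtype_bytes
    weights = (k * k * c + c * c_out) * dtype_bytes
    return inp + out + weights

# Illustrative layer pair (FP16 tensors): 56x56 map, 96 -> 24 channels, 3x3 depthwise.
lw = layerwise_traffic_bytes(56, 56, 96, 24, 3)
fu = fused_traffic_bytes(56, 56, 96, 24, 3)
print(f"fusion avoids ~{100 * (1 - fu / lw):.0f}% of this layer pair's DRAM traffic")
```

Even this crude model shows that eliminating the intermediate activation's round trip to DRAM dominates the savings for bandwidth-bound layers.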
Empirical results demonstrate:
- Up to 83% reduction in global memory access compared to layer-by-layer cuDNN implementations.
- Up to 3.7× speedup on bottleneck fused layers.
- Full-model end-to-end inference speedups of 1.4–1.8×, and 30–65% reduction in energy consumption, relative to TVM+cuDNN baselines (Qararyah et al., 30 Apr 2024).
6. Comparative Summary: Standard MBConv vs. Fused-MBConv
| Block Type | Expansion Phase | Spatial Conv | Subsequent Steps | Accelerator Utilization | Best Placement |
|---|---|---|---|---|---|
| MBConv | Conv1×1 ($C_{in} \to E \cdot C_{in}$) | Depthwise $k \times k$ | SE, Conv1×1 ($\to C_{out}$) | Low (for large maps, depthwise convolutions are memory-bound) | Later, narrow stages |
| Fused-MBConv | Conv $k \times k$ ($C_{in} \to E \cdot C_{in}$, fused) | – (merged into expansion) | SE, Conv1×1 ($\to C_{out}$) | High (dense convolution leverages hardware better) | Early, broad stages |
The fusion preserves architectural expressivity while optimizing resource utilization. A plausible implication is that further work on operator fusion and kernel scheduling could generalize the performance benefits of Fused-MBConv beyond convolutional networks to vision transformers and other compact DNNs (Qararyah et al., 30 Apr 2024).
7. Significance, Impact, and Limitations
Fused-MBConv is a foundational element in EfficientNetV2, enabling up to 6.8× reduction in parameters and 11× training speedup relative to previous state-of-the-art CNNs on ImageNet and transfer learning tasks, while matching or exceeding previous accuracies (Tan et al., 2021). However, the approach is architecture- and hardware-dependent: benefits are substantial only when used in wide, early layers with large spatial footprints, and parameter overhead constrains its application in late, channel-rich network segments. State-of-the-art GPU implementations require careful alignment of tiling, data precision, and shared memory constraints using analytic tools like FusePlanner to maximize realized hardware gains (Qararyah et al., 30 Apr 2024).
In summary, Fused-MBConv embodies both a conceptual and practical advance in the design and acceleration of deep convolutional architectures, acting as a template for future work on cross-layer kernel fusion in resource-constrained inference and training environments.