Fused-MBConv Blocks: Design & Efficiency

Updated 9 November 2025
  • Fused-MBConv blocks are architectural primitives that fuse channel expansion and spatial aggregation into a single convolution to optimize compute efficiency.
  • They accelerate training and inference in early network stages by leveraging enhanced hardware parallelism and reducing memory bandwidth constraints.
  • Recent advances in CUDA kernel fusion further reduce global memory access and energy consumption, significantly increasing throughput on GPUs and TPUs.

Fused-MBConv blocks are architectural primitives that merge the channel expansion and spatial aggregation steps of the inverted bottleneck (MBConv) design into a single convolutional operation, improving computational throughput and hardware utilization in modern deep neural networks. They were popularized by the EfficientNetV2 family, where their integration in the early stages of the network led to significantly accelerated training and inference, particularly on hardware such as GPUs and TPUs. More recently, advances in CUDA kernel fusion and memory-traffic models have enabled algorithmically optimal implementations of Fused-MBConv and related modules, offering further reductions in memory access and energy consumption.

1. Motivation and Architectural Rationale

Standard MBConv blocks, introduced in MobileNetV2 and popularized by EfficientNet-V1, combine four steps: (1) a 1×1 pointwise “expansion” convolution, (2) a k×k depthwise convolution, (3) an optional squeeze-and-excitation (SE) module, and (4) a 1×1 projection convolution. While this design efficiently parameterizes channel expansion and spatial aggregation, the reliance on depthwise convolutions leads to suboptimal hardware utilization due to a low compute-to-memory-access ratio—especially problematic in early layers with large spatial maps.

Fused-MBConv was introduced to address this bottleneck by merging the first two convolutions into a single k×k dense convolution. This reorganization increases FLOPs but yields larger matrix-multiplication payloads per kernel invocation, enhancing accelerator efficiency. Consequently, Fused-MBConv offers faster training and inference in regimes where hardware parallelism and memory bandwidth are the primary scalability constraints (Tan et al., 2021).
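To make this trade-off concrete, the following back-of-the-envelope sketch (an illustration using assumed shapes and an assumed expansion ratio, not figures from the cited papers) counts multiply-accumulate operations for the two block types on an early-stage feature map at stride 1.

```python
# Illustrative MAC counts for MBConv vs. Fused-MBConv on a large early-stage map.
# Shapes and expansion ratio below are assumptions for illustration only.

def mbconv_macs(h, w, c_in, c_out, k=3, r=4):
    """1x1 expansion conv + kxk depthwise conv + 1x1 projection conv."""
    c_mid = r * c_in
    expand = h * w * c_in * c_mid        # 1x1 pointwise expansion
    depthwise = h * w * c_mid * k * k    # kxk depthwise (one filter per channel)
    project = h * w * c_mid * c_out      # 1x1 pointwise projection
    return expand + depthwise + project

def fused_mbconv_macs(h, w, c_in, c_out, k=3, r=4):
    """Single dense kxk conv (expansion + spatial) + 1x1 projection conv."""
    c_mid = r * c_in
    fused = h * w * c_in * c_mid * k * k   # dense kxk convolution
    project = h * w * c_mid * c_out
    return fused + project

if __name__ == "__main__":
    h = w = 112            # large early-stage spatial map (assumed)
    c_in = c_out = 24
    print("MBConv MACs:      ", mbconv_macs(h, w, c_in, c_out))
    print("Fused-MBConv MACs:", fused_mbconv_macs(h, w, c_in, c_out))
    # Fused-MBConv performs more arithmetic, but it does so in one large dense
    # convolution that maps onto accelerator matrix units far better than the
    # memory-bound depthwise convolution it replaces.
```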

2. Internal Structure and Mathematical Definition

A Fused-MBConv block is parameterized as follows:

  • C_{in}: Number of input channels
  • C_{out}: Number of output channels
  • r: Expansion ratio (C_{mid} = r \cdot C_{in})
  • k: Convolution kernel size (typically 3 \times 3)
  • s: Stride of the fused convolution
  • W_{fused}: Weights of the fused k \times k convolution, shape [C_{mid}, C_{in}, k, k]
  • W_{proj}: Weights of the projection 1 \times 1 convolution, shape [C_{out}, C_{mid}, 1, 1]

Forward propagation proceeds as:

  1. Expansion and Spatial Convolution (Fused):

y = \text{Swish}(\text{Conv}_{k \times k}(x; W_{fused}))

Resulting shape: [H/s, W/s, C_{mid}].

  2. Squeeze-and-Excitation (SE, optional, ratio 0.25):
    • a = \text{GlobalAvgPool}(y)
    • s = \sigma(W_2(\delta(W_1 a))), where W_1 maps C_{mid} \to C_{mid}/4, W_2 maps C_{mid}/4 \to C_{mid}, \delta is a nonlinearity, and \sigma is the sigmoid
    • y' = y \odot s
  3. Projection:

z = \text{Conv}_{1 \times 1}(y'; W_{proj})

Output: [H/s, W/s, C_{out}].

  4. Residual:
    • If s = 1 and C_{out} = C_{in}, output z + x; otherwise, output z.

In the special case r = 1, W_{fused} reduces to a standard 3 \times 3 convolution without channel expansion. No activation function is applied after the projection convolution. The structure preserves the inverted bottleneck's expressive capacity but executes with higher arithmetic intensity per operation (Tan et al., 2021).
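The structure above can be expressed as a short PyTorch sketch (a minimal illustration, assuming BatchNorm placement and SiLU/Swish activations in the style of EfficientNet blocks; the class and argument names are ours, not from the cited papers):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Squeeze-and-Excitation on the expanded features (ratio 0.25 assumed)."""
    def __init__(self, channels, se_ratio=0.25):
        super().__init__()
        squeezed = max(1, int(channels * se_ratio))
        self.fc1 = nn.Conv2d(channels, squeezed, kernel_size=1)  # W_1: C_mid -> C_mid/4
        self.fc2 = nn.Conv2d(squeezed, channels, kernel_size=1)  # W_2: C_mid/4 -> C_mid
        self.act = nn.SiLU()

    def forward(self, y):
        a = y.mean(dim=(2, 3), keepdim=True)                 # global average pool
        s = torch.sigmoid(self.fc2(self.act(self.fc1(a))))   # channel-wise gates
        return y * s

class FusedMBConv(nn.Module):
    """Fused-MBConv: dense kxk conv (expansion + spatial) followed by 1x1 projection."""
    def __init__(self, c_in, c_out, expand_ratio=4, kernel_size=3, stride=1, use_se=False):
        super().__init__()
        c_mid = c_in * expand_ratio
        self.use_residual = (stride == 1 and c_in == c_out)
        # Fused expansion + spatial convolution (replaces 1x1 expand + kxk depthwise).
        self.fused = nn.Sequential(
            nn.Conv2d(c_in, c_mid, kernel_size, stride, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.SiLU(),
        )
        self.se = SqueezeExcite(c_mid) if use_se else nn.Identity()
        # 1x1 projection; no activation afterwards (linear bottleneck).
        self.project = nn.Sequential(
            nn.Conv2d(c_mid, c_out, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        y = self.fused(x)
        y = self.se(y)
        z = self.project(y)
        return z + x if self.use_residual else z

# Example: an early-stage block on a 112x112 feature map.
block = FusedMBConv(c_in=24, c_out=24, expand_ratio=4, kernel_size=3, stride=1)
out = block(torch.randn(1, 24, 112, 112))   # -> shape [1, 24, 112, 112]
```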

3. Hardware Efficiency: Empirical Evaluation

A controlled ablation in EfficientNet-B4 compares networks with varying proportions of MBConv and Fused-MBConv blocks:

| Stages with Fused-MBConv | Parameters | FLOPs | Throughput (imgs/sec/core, TPUv3) |
|---|---|---|---|
| None (all MBConv) | 19.3 M | 4.5 B | 262 |
| Stages 1–3 | 20.0 M | 7.5 B | 362 |
| All stages | 132.0 M | 34.4 B | 254 |

Replacing MBConv with Fused-MBConv in early layers increases FLOPs, but throughput improves by ~1.4× due to better accelerator utilization. Full replacement (all stages) leads to excessive parameter growth, especially in later layers where spatial extents are small and channel counts large, ultimately degrading efficiency. This establishes the principle of using Fused-MBConv primarily in early, broad-feature stages where large workloads leverage hardware parallelism most effectively (Tan et al., 2021).

4. Automated Search and Model Integration

EfficientNetV2 integrates Fused-MBConv by incorporating it into the architecture search space with the following flexible parameters:

  • Operator: \{\text{MBConv}, \text{Fused-MBConv}\}
  • Kernel size: \{3 \times 3, 5 \times 5\}
  • Expansion ratio: \{1, 4, 6\}
  • Layer count per stage

Search reward is computed as \text{Accuracy} \times \text{Speed}^{w} \times \text{Params}^{v} with w = -0.07 and v = -0.05, where Speed is measured as training step time so that the negative exponents penalize slow, parameter-heavy candidates. The discovered EfficientNetV2-S model, as a canonical example, uses Fused-MBConv exclusively in the first three stages, switching to SE-equipped MBConv in later stages:

| Stage | Op | Channels | Stride | Layers |
|---|---|---|---|---|
| 0 | Conv3×3 | 24 | 2 | 1 |
| 1 | Fused-MBConv₁ (k=3) | 24 → 24 | 1 | 2 |
| 2 | Fused-MBConv₄ (k=3) | 24 → 48 | 2 | 4 |
| 3 | Fused-MBConv₄ (k=3) | 48 → 64 | 2 | 4 |
| 4–6 | MBConv (SE 0.25) | 64 → 256 | varies | varies |
| 7 | Conv1×1 → Pool → FC | – | – | – |
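For concreteness, the search reward described above can be written as a small helper (a minimal sketch with our own function and argument names, assuming Speed is the normalized training step time and Params is a normalized parameter count):

```python
# Minimal sketch of the multi-objective search reward described above.
# Function and argument names are ours; inputs are assumed to be normalized.
def search_reward(accuracy, step_time, params, w=-0.07, v=-0.05):
    """Accuracy * Speed^w * Params^v; the negative exponents penalize
    slow training steps and large parameter counts."""
    return accuracy * (step_time ** w) * (params ** v)

# Two hypothetical candidates with equal accuracy: the faster, smaller one wins.
print(search_reward(accuracy=0.83, step_time=1.0, params=1.0))  # baseline
print(search_reward(accuracy=0.83, step_time=0.8, params=0.9))  # faster and smaller
```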

Scaling to EfficientNetV2-M/L/XL is achieved by compound scaling of depth, width, and resolution—with an upper bound on the maximum training image size (380–480 px) and a preference for adding new layers to later stages. This prevents bottlenecking on early stages already optimized by Fused-MBConv (Tan et al., 2021).

5. GPU Kernel Fusion and Practical Implementation

Recent work on Fused Convolutional Modules (FCMs) extends the principles of Fused-MBConv by proposing CUDA kernels that fuse depthwise and pointwise convolutions, guided by analytic memory-traffic cost models (FusePlanner) (Qararyah et al., 30 Apr 2024). The practical implementation on CUDA platforms follows:

  • Compute the depthwise and pointwise convolution operations in a single kernel launch, maximizing on-chip data reuse.
  • Maintain all intermediate activations in shared memory (“commBuffer”) and registers, never writing intermediate results to global DRAM.
  • Choose tiling parameters (T_H, T_W, T_C, T_M) to fit all shared-memory buffers and weight tiles within L1/shared memory, ensuring grid occupancy matches or exceeds the number of streaming multiprocessors (SMs); a simplified feasibility check is sketched after this list.
  • Use INT8 inference where accuracy permits, allowing larger tiles and more aggressive fusion.
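The sketch below illustrates the kind of feasibility check such a planner performs (a simplified model of our own, not the FusePlanner implementation; the shared-memory budget, SM count, and INT8 byte width are assumptions):

```python
# Simplified, illustrative tiling check in the spirit of an analytic planner.
# Hardware constants below are assumptions, not measured device parameters.
import math

SHARED_MEM_PER_BLOCK = 96 * 1024   # bytes of shared memory per block (assumed)
NUM_SMS = 80                        # streaming multiprocessors (assumed)
BYTES = 1                           # INT8 activations and weights (assumed)

def tile_fits(t_h, t_w, t_c, t_m, k):
    """Shared-memory footprint of one fused tile: input tile (with halo for the
    kxk conv), intermediate tile kept on-chip, and the weight tiles."""
    input_tile = (t_h + k - 1) * (t_w + k - 1) * t_c * BYTES
    interm_tile = t_h * t_w * t_c * BYTES        # never spilled to global DRAM
    dw_weights = k * k * t_c * BYTES             # depthwise filter tile
    pw_weights = t_c * t_m * BYTES               # pointwise filter tile
    return input_tile + interm_tile + dw_weights + pw_weights <= SHARED_MEM_PER_BLOCK

def occupancy_ok(h, w, c, m, t_h, t_w, t_c, t_m):
    """Require at least one thread block per SM so the grid covers the GPU."""
    blocks = (math.ceil(h / t_h) * math.ceil(w / t_w)
              * math.ceil(c / t_c) * math.ceil(m / t_m))
    return blocks >= NUM_SMS

# Example: screen candidate tiles for a 56x56x96 layer with 24 output channels.
for tile in [(8, 8, 32, 8), (16, 16, 96, 24), (4, 4, 16, 4)]:
    print(tile, tile_fits(*tile, k=3), occupancy_ok(56, 56, 96, 24, *tile))
```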

Empirical results demonstrate:

  • Up to 83% reduction in global memory access compared to layer-by-layer cuDNN implementations.
  • Up to 3.7× speedup on bottleneck fused layers.
  • Full-model end-to-end inference speedups of 1.4–1.8×, and 30–65% reduction in energy consumption, relative to TVM+cuDNN baselines (Qararyah et al., 30 Apr 2024).

6. Comparative Summary: Standard MBConv vs. Fused-MBConv

| Block Type | Expansion Phase | Spatial Conv | Subsequent Steps | Accelerator Utilization | Best Placement |
|---|---|---|---|---|---|
| MBConv | Conv1×1 (C_in → C_mid) | Depthwise k×k | SE, Conv1×1 (C_mid → C_out) | Low for large maps (depthwise is memory-bound) | Later, narrow stages |
| Fused-MBConv | Conv k×k (C_in → C_mid) | Fused into the expansion conv | SE, Conv1×1 (C_mid → C_out) | High (dense conv leverages hardware better) | Early, broad stages |

The fusion preserves architectural expressivity while optimizing resource utilization. A plausible implication is that further work on operator fusion and kernel scheduling could generalize the performance benefits of Fused-MBConv beyond convolutional networks to vision transformers and other compact DNNs (Qararyah et al., 30 Apr 2024).

7. Significance, Impact, and Limitations

Fused-MBConv is a foundational element in EfficientNetV2, enabling up to 6.8× reduction in parameters and 11× training speedup relative to previous state-of-the-art CNNs on ImageNet and transfer learning tasks, while matching or exceeding previous accuracies (Tan et al., 2021). However, the approach is architecture- and hardware-dependent: benefits are substantial only when used in wide, early layers with large spatial footprints, and parameter overhead constrains its application in late, channel-rich network segments. State-of-the-art GPU implementations require careful alignment of tiling, data precision, and shared memory constraints using analytic tools like FusePlanner to maximize realized hardware gains (Qararyah et al., 30 Apr 2024).

In summary, Fused-MBConv embodies both a conceptual and practical advance in the design and acceleration of deep convolutional architectures, acting as a template for future work on cross-layer kernel fusion in resource-constrained inference and training environments.
