EfficientNetV2-M: Medium-Scale CNN Model
- EfficientNetV2-M is a medium-scale convolutional neural network that balances accuracy and computational cost using compound scaling and progressive learning.
- It employs innovative Fused-MBConv blocks to enhance hardware utilization and integrate regularization strategies during training.
- With 54 million parameters and 24 billion FLOPs, it achieves an 85.1% top-1 accuracy on ImageNet while reducing training time significantly.
EfficientNetV2-M is a convolutional neural network model that forms part of the EfficientNetV2 family, designed for superior training speed and parameter efficiency relative to its predecessors and contemporary architectures. Developed using training-aware neural architecture search and progressive model scaling, EfficientNetV2-M leverages architectural innovations such as Fused-MBConv blocks to optimize hardware utilization and regularization strategies adapted for progressive image resizing during training. The model is positioned as a “Medium” size instance within the V2 series, balancing accuracy and computational costs for large-scale vision tasks (Tan et al., 2021).
1. Building-Block Primitives and Architectural Structure
EfficientNetV2-M is constructed by scaling up the EfficientNetV2-S “base” architecture via compound scaling. The base architecture is composed of two fundamental building blocks:
- MBConv block: Implements an expand–project depthwise separable convolution stack (expand with a 1×1 convolution, 3×3 depthwise convolution, project with a 1×1 convolution), with optional Squeeze-and-Excitation (SE) and residual connections with drop-connect. Expansion factors are 4 or 6, kernel sizes are 3×3, and the SE ratio is 0.25.
- Fused-MBConv block: A single regular 3×3 convolution that simultaneously expands feature maps and performs spatial convolution, followed by a 1×1 projection convolution and a residual connection. This yields better accelerator utilization, particularly in early layers.
Both blocks use SiLU (Swish) activations, batch normalization, and drop-connect (stochastic depth).
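The following PyTorch sketch illustrates the two block types as described above (expand, depthwise, SE, and project for MBConv; a single fused 3×3 convolution plus a 1×1 projection for Fused-MBConv). It is a minimal illustration rather than the reference implementation: drop-connect, stride handling, and channel changes are omitted, and the class and layer names are chosen here for clarity.

```python
# Simplified PyTorch sketches of the MBConv and Fused-MBConv blocks described
# above. Assumes stride 1 and equal in/out channels so the residual connection
# applies; drop-connect (stochastic depth) is omitted for brevity.
import torch
import torch.nn as nn


class SqueezeExcite(nn.Module):
    """Squeeze-and-Excitation with reduction ratio 0.25, as stated in the text."""
    def __init__(self, channels: int, se_ratio: float = 0.25):
        super().__init__()
        squeezed = max(1, int(channels * se_ratio))
        self.fc1 = nn.Conv2d(channels, squeezed, 1)
        self.fc2 = nn.Conv2d(squeezed, channels, 1)
        self.act = nn.SiLU()

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)                 # squeeze
        s = torch.sigmoid(self.fc2(self.act(self.fc1(s))))   # excite
        return x * s


class MBConv(nn.Module):
    """Expand (1x1) -> depthwise (3x3) -> SE -> project (1x1), residual add."""
    def __init__(self, channels: int, expand: int = 4):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(),
            SqueezeExcite(hidden),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)


class FusedMBConv(nn.Module):
    """Single fused 3x3 conv (expand + spatial) -> project (1x1), residual add."""
    def __init__(self, channels: int, expand: int = 4):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)


if __name__ == "__main__":
    x = torch.randn(1, 48, 56, 56)
    print(MBConv(48)(x).shape, FusedMBConv(48)(x).shape)
```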
Base Architecture: Stage-Wise Summary
| Stage | Operator | Stride | Output Channels | Repeat |
|---|---|---|---|---|
| 0 | Conv 3×3 | 2 | 24 | 1 |
| 1 | Fused-MBConv (expand 1, k3×3) | 1 | 24 | 2 |
| 2 | Fused-MBConv (expand 4, k3×3) | 2 | 48 | 4 |
| 3 | Fused-MBConv (expand 4, k3×3) | 2 | 64 | 4 |
| 4 | MBConv (expand 4, k3×3, SE=0.25) | 2 | 128 | 6 |
| 5 | MBConv (expand 6, k3×3, SE=0.25) | 1 | 160 | 9 |
| 6 | MBConv (expand 6, k3×3, SE=0.25) | 2 | 256 | 15 |
| 7 | Conv 1×1 & Pooling & FC | — | 1280 | 1 |
[Adapted from (Tan et al., 2021)]
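The stage table can be restated directly as a configuration list of the kind typically used to drive stage-by-stage model construction. The snippet below is only a transcription of the table; the field layout and names are an illustrative assumption.

```python
# The base (V2-S) stage configuration from the table above, expressed as data.
# Field layout is illustrative; expand=None marks the stem/head stages that are
# plain convolutions rather than (Fused-)MBConv blocks.
BASE_STAGES = [
    # (operator,      expand, kernel, stride, out_channels, repeat)
    ("conv",           None,   3,      2,      24,           1),
    ("fused_mbconv",   1,      3,      1,      24,           2),
    ("fused_mbconv",   4,      3,      2,      48,           4),
    ("fused_mbconv",   4,      3,      2,      64,           4),
    ("mbconv",         4,      3,      2,      128,          6),
    ("mbconv",         6,      3,      1,      160,          9),
    ("mbconv",         6,      3,      2,      256,          15),
    ("conv_pool_fc",   None,   1,      None,   1280,         1),
]

total_repeats = sum(repeat for (*_, repeat) in BASE_STAGES)
print(f"Total stage repeats in the base network: {total_repeats}")
```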
2. Compound Scaling and V2-M Derivation
EfficientNetV2-M is derived by applying compound scaling to the base model. Compound scaling simultaneously scales network depth $d$, width $w$, and input resolution $r$ using the relations:

$$d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi}$$

where $\phi$ is the compound coefficient selected for V2-M, and typical coefficient choices carried over from the original EfficientNet scaling rule are $\alpha \approx 1.2$, $\beta \approx 1.1$, $\gamma \approx 1.15$. V2-M is additionally capped at a maximum inference resolution of 480×480. Each stage's channel counts and block repeats are computed as scaled and rounded versions of the base (a sketch of this rounding arithmetic follows the list below).
This scaling leads to:
- Parameters: 54 million
- FLOPs: 24 billion (at the 480×480 inference resolution)
- Input resolution at inference: 480×480
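The rounding conventions used when scaling channel counts and block repeats can be sketched as follows, in the style of the original EfficientNet reference code (channels rounded to a multiple of 8, repeats rounded up). The coefficient values passed in the example are placeholders for illustration, not the exact values used to derive V2-M.

```python
# Illustrative compound-scaling arithmetic: depth is scaled and rounded up,
# width is scaled and rounded to a multiple of a divisor (8 here). The
# coefficients below are placeholders, not the published V2-M values.
import math


def round_repeats(repeats: int, depth_coefficient: float) -> int:
    """Scale the number of block repeats in a stage, rounding up."""
    return int(math.ceil(depth_coefficient * repeats))


def round_filters(filters: int, width_coefficient: float, divisor: int = 8) -> int:
    """Scale a channel count and round it to the nearest multiple of `divisor`."""
    filters *= width_coefficient
    new_filters = max(divisor, int(filters + divisor / 2) // divisor * divisor)
    if new_filters < 0.9 * filters:  # never round down by more than 10%
        new_filters += divisor
    return int(new_filters)


# Example: scaling a base stage with 48 channels and 4 repeats.
print(round_filters(48, width_coefficient=1.1))   # -> 56
print(round_repeats(4, depth_coefficient=1.2))    # -> 5
```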
3. Training Protocols and Regularization
The training of EfficientNetV2-M is structured around progressive learning, which schedules image resolution and regularization magnitude to increase linearly during training. The training protocol for V2-M is specifically characterized by:
- Progressive Learning: Four stages, each spanning approximately 87 epochs, with image sizes increasing from 128×128 to 380×380.
- Regularization Scheduling:
- Dropout rate: Increasing linearly from 0.10 to 0.40
- RandAugment magnitude: 5 to 20
- Mixup α: 0.00 to 0.20
- Stochastic depth survival probability: fixed at 0.8
Optimization is performed with RMSProp (momentum 0.9, decay 0.9), weight decay 1e-5, batch normalization momentum 0.99, and batch size 4096 (TPU). The learning rate is warmed up from 0 to 0.256 over 5000 steps, followed by exponential decay by a factor of 0.97 every 2.4 epochs. An exponential moving average of the weights is maintained with decay 0.9999.
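A minimal sketch of the four-stage progressive schedule is given below, using the ranges listed above (image size 128→380, dropout 0.10→0.40, RandAugment magnitude 5→20, mixup α 0→0.2). The linear-interpolation helper and stage boundaries are an assumed simplification of how such a schedule can be implemented, not the published training code.

```python
# Sketch of the four-stage progressive-learning schedule described above.
# Each setting is linearly interpolated from its starting value to its final
# value across the stages; the helper itself is a simplified assumption.
from dataclasses import dataclass

TOTAL_EPOCHS = 350
NUM_STAGES = 4


@dataclass
class StageSettings:
    image_size: int
    dropout: float
    randaug_magnitude: int
    mixup_alpha: float


def settings_for_stage(stage: int) -> StageSettings:
    """Interpolate training settings for a stage index in [0, NUM_STAGES)."""
    t = stage / (NUM_STAGES - 1)  # 0.0 at the first stage, 1.0 at the last

    def lerp(lo: float, hi: float) -> float:
        return lo + t * (hi - lo)

    return StageSettings(
        image_size=int(lerp(128, 380)),
        dropout=lerp(0.10, 0.40),
        randaug_magnitude=int(round(lerp(5, 20))),
        mixup_alpha=lerp(0.0, 0.2),
    )


for stage in range(NUM_STAGES):
    epochs_per_stage = TOTAL_EPOCHS // NUM_STAGES  # roughly 87 epochs per stage
    print(stage, epochs_per_stage, settings_for_stage(stage))
```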
4. Model Efficiency and Computation
EfficientNetV2-M is designed for enhanced efficiency:
- Parameters: 54 million
- FLOPs: 24B (per forward pass at the 480×480 inference resolution)
- Inference Latency: ≈57 ms (NVIDIA V100, FP16, batch size 16)
- Training Time: 13 hours for 350 epochs on ImageNet-1k, using 32 TPUv3 cores with progressive learning
For comparison, EfficientNet-V1-B7 requires 139 hours for training under similar conditions, and V1-B6 requires 75 hours, both with lower top-1 accuracy.
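The headline parameter count can be checked against the torchvision implementation of the architecture (assuming torchvision ≥ 0.13, which provides efficientnet_v2_m); this verifies model size only, not the FLOPs, latency, or training-time figures.

```python
# Quick check of the ~54M-parameter figure using torchvision's
# EfficientNetV2-M implementation (requires torchvision >= 0.13).
from torchvision.models import efficientnet_v2_m

model = efficientnet_v2_m(weights=None)  # architecture only, no weight download
num_params = sum(p.numel() for p in model.parameters())
print(f"EfficientNetV2-M parameters: {num_params / 1e6:.1f}M")  # ~54M
```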
5. Performance Metrics and Transfer Results
On the ImageNet ILSVRC2012 dataset (1.28M train, 50K val), EfficientNetV2-M achieves:
- Top-1 Accuracy: 85.1%
- Top-5 Accuracy: ≃97.3%
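For reference, a minimal inference sketch using the ImageNet-1k pre-trained weights distributed with torchvision is shown below (assumes torchvision ≥ 0.13; "example.jpg" is a placeholder path). The bundled weight transforms apply the eval-time resizing and normalization.

```python
# Minimal ImageNet-1k inference sketch with torchvision's pre-trained
# EfficientNetV2-M weights (torchvision >= 0.13). "example.jpg" is a
# placeholder image path.
import torch
from PIL import Image
from torchvision.models import EfficientNet_V2_M_Weights, efficientnet_v2_m

weights = EfficientNet_V2_M_Weights.IMAGENET1K_V1
model = efficientnet_v2_m(weights=weights).eval()
preprocess = weights.transforms()  # bundled eval-time resize/crop/normalize

image = Image.open("example.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)

top5 = logits.softmax(dim=1).topk(5)
for prob, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{weights.meta['categories'][idx.item()]}: {prob.item():.3f}")
```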
Transfer learning (finetuned from ImageNet-1k) yields:
| Dataset | V2-M Top-1 (%) | V1-B7 Top-1 (%) | ViT-B/16 Top-1 (%) |
|---|---|---|---|
| CIFAR-10 | 99.0 ±0.08 | 98.9 | 98.1 |
| CIFAR-100 | 92.2 ±0.08 | 91.7 | 87.1 |
| Flowers102 | 98.5 ±0.08 | 98.8 | 89.5 |
| Stanford Cars | 94.6 ±0.10 | 94.7 | – |
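These transfer results are obtained by fine-tuning the ImageNet-1k checkpoint on each dataset. A minimal sketch of the usual head-replacement step is shown below for CIFAR-100, assuming the torchvision model layout; the fine-tuning schedule itself (optimizer, epochs, augmentation) is not reproduced from the paper.

```python
# Sketch of preparing EfficientNetV2-M for fine-tuning on CIFAR-100 by
# replacing the ImageNet classifier head. Assumes the torchvision layout:
# model.classifier = Sequential(Dropout, Linear).
import torch.nn as nn
from torchvision.models import EfficientNet_V2_M_Weights, efficientnet_v2_m

model = efficientnet_v2_m(weights=EfficientNet_V2_M_Weights.IMAGENET1K_V1)
in_features = model.classifier[1].in_features      # 1280, per the stage table
model.classifier[1] = nn.Linear(in_features, 100)  # CIFAR-100 has 100 classes

# All parameters are left trainable here; the actual fine-tuning recipe
# (optimizer, learning rate, epochs, augmentation) is not shown.
```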
6. Comparative Analysis and Trade-offs
EfficientNetV2-M demonstrates a top-1 accuracy approximately 1.0% higher than a comparably sized V1 model, while attaining a 6× speedup in training. At a fixed computational budget (24B FLOPs), V2-M yields a superior accuracy-latency balance relative to larger contemporary architectures, including ResNeSt and NFNet, as measured on standard benchmarks.
A plausible implication is that the introduction of Fused-MBConv blocks and progressive regularization scheduling is central to reconciling accelerator utilization with high generalization performance; the architectural details and empirical metrics above are reported strictly as published.
7. Key Formulas and Block Structure
The essential architectural building blocks can be formalized as follows:
- MBConv: $y = x + \mathrm{DropConnect}\big(\mathrm{Conv}_{1\times1}^{\text{project}}\big(\mathrm{SE}\big(\mathrm{DWConv}_{3\times3}\big(\mathrm{Conv}_{1\times1}^{\text{expand}}(x)\big)\big)\big)\big)$
  (with SE and drop-connect integrated in the sequence; the residual term applies when the stride is 1 and input/output channels match)
- Fused-MBConv: $y = x + \mathrm{DropConnect}\big(\mathrm{Conv}_{1\times1}^{\text{project}}\big(\mathrm{Conv}_{3\times3}^{\text{expand}}(x)\big)\big)$
These formulations define the core computational paths and are repeated according to stagewise configuration, with parameter counts and repeat factors set by compound scaling rules detailed above (Tan et al., 2021).