EfficientNetV2-M: Medium-Scale CNN Model
- EfficientNetV2-M is a medium-scale convolutional neural network that balances accuracy and computational cost using compound scaling and progressive learning.
- It employs innovative Fused-MBConv blocks to enhance hardware utilization and integrate regularization strategies during training.
- With 54 million parameters and 24 billion FLOPs, it achieves an 85.1% top-1 accuracy on ImageNet while reducing training time significantly.
EfficientNetV2-M is a convolutional neural network model that forms part of the EfficientNetV2 family, designed for superior training speed and parameter efficiency relative to its predecessors and contemporary architectures. Developed using training-aware neural architecture search and progressive model scaling, EfficientNetV2-M leverages architectural innovations such as Fused-MBConv blocks to optimize hardware utilization and regularization strategies adapted for progressive image resizing during training. The model is positioned as a “Medium” size instance within the V2 series, balancing accuracy and computational costs for large-scale vision tasks (Tan et al., 2021).
1. Building-Block Primitives and Architectural Structure
EfficientNetV2-M is constructed by scaling up the EfficientNetV2-S “base” architecture via compound scaling. The base architecture is composed of two fundamental building blocks:
- MBConv block: Implements an expand–project depthwise separable convolution stack (expand with a 1×1 convolution, 3×3 depthwise convolution, project with a 1×1 convolution), with optional Squeeze-and-Excitation (SE) and residual connections with drop-connect. Expansion factors are 4 or 6, kernel sizes are 3×3, and the SE ratio is 0.25.
- Fused-MBConv block: A single regular 3×3 convolution that simultaneously expands feature maps and performs spatial convolution, followed by a 1×1 projection convolution and a residual connection. This yields better accelerator utilization, particularly in early layers.
Both blocks use SiLU (Swish) activations, batch normalization, and drop-connect (stochastic depth).
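The following PyTorch sketch illustrates the two block types as described above (expand, depthwise, SE, and project for MBConv; a single fused 3×3 convolution plus a 1×1 projection for Fused-MBConv). It is a minimal illustration rather than the reference implementation: drop-connect, stride handling, and channel changes are omitted, and the class and layer names are chosen here for clarity.

```python
# Simplified PyTorch sketches of the MBConv and Fused-MBConv blocks described
# above. Assumes stride 1 and equal in/out channels so the residual connection
# applies; drop-connect (stochastic depth) is omitted for brevity.
import torch
import torch.nn as nn


class SqueezeExcite(nn.Module):
    """Squeeze-and-Excitation with reduction ratio 0.25, as stated in the text."""
    def __init__(self, channels: int, se_ratio: float = 0.25):
        super().__init__()
        squeezed = max(1, int(channels * se_ratio))
        self.fc1 = nn.Conv2d(channels, squeezed, 1)
        self.fc2 = nn.Conv2d(squeezed, channels, 1)
        self.act = nn.SiLU()

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)                 # squeeze
        s = torch.sigmoid(self.fc2(self.act(self.fc1(s))))   # excite
        return x * s


class MBConv(nn.Module):
    """Expand (1x1) -> depthwise (3x3) -> SE -> project (1x1), residual add."""
    def __init__(self, channels: int, expand: int = 4):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(),
            SqueezeExcite(hidden),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)


class FusedMBConv(nn.Module):
    """Single fused 3x3 conv (expand + spatial) -> project (1x1), residual add."""
    def __init__(self, channels: int, expand: int = 4):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)


if __name__ == "__main__":
    x = torch.randn(1, 48, 56, 56)
    print(MBConv(48)(x).shape, FusedMBConv(48)(x).shape)
```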
Base Architecture: Stage-Wise Summary
| Stage | Operator | Stride | Output Channels | Repeat |
|---|---|---|---|---|
| 0 | Conv 3×3 | 2 | 24 | 1 |
| 1 | Fused-MBConv (expand 1, k3×3) | 1 | 24 | 2 |
| 2 | Fused-MBConv (expand 4, k3×3) | 2 | 48 | 4 |
| 3 | Fused-MBConv (expand 4, k3×3) | 2 | 64 | 4 |
| 4 | MBConv (expand 4, k3×3, SE=0.25) | 2 | 128 | 6 |
| 5 | MBConv (expand 6, k3×3, SE=0.25) | 1 | 160 | 9 |
| 6 | MBConv (expand 6, k3×3, SE=0.25) | 2 | 256 | 15 |
| 7 | Conv 1×1 & Pooling & FC | — | 1280 | 1 |
[Adapted from (Tan et al., 2021)]
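The stage table can be restated directly as a configuration list of the kind typically used to drive stage-by-stage model construction. The snippet below is only a transcription of the table; the field layout and names are an illustrative assumption.

```python
# The base (V2-S) stage configuration from the table above, expressed as data.
# Field layout is illustrative; expand=None marks the stem/head stages that are
# plain convolutions rather than (Fused-)MBConv blocks.
BASE_STAGES = [
    # (operator,      expand, kernel, stride, out_channels, repeat)
    ("conv",           None,   3,      2,      24,           1),
    ("fused_mbconv",   1,      3,      1,      24,           2),
    ("fused_mbconv",   4,      3,      2,      48,           4),
    ("fused_mbconv",   4,      3,      2,      64,           4),
    ("mbconv",         4,      3,      2,      128,          6),
    ("mbconv",         6,      3,      1,      160,          9),
    ("mbconv",         6,      3,      2,      256,          15),
    ("conv_pool_fc",   None,   1,      None,   1280,         1),
]

total_repeats = sum(repeat for (*_, repeat) in BASE_STAGES)
print(f"Total stage repeats in the base network: {total_repeats}")
```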
2. Compound Scaling and V2-M Derivation
EfficientNetV2-M is derived by applying compound scaling to the base model. Compound scaling simultaneously scales network depth $d$, width $w$, and input resolution $r$ using the relations:

$$d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi}$$

where $\phi$ is the compound coefficient selected for V2-M, and typical coefficient choices carried over from the original EfficientNet scaling rule are $\alpha \approx 1.2$, $\beta \approx 1.1$, $\gamma \approx 1.15$. V2-M is additionally capped at a maximum inference resolution of 480×480. Each stage's channel counts and block repeats are computed as scaled and rounded versions of the base (a sketch of this rounding arithmetic follows the list below).
This scaling leads to:
- Parameters: 54 million
- FLOPs: 24 billion (at the 480×480 inference resolution)
- Input resolution at inference: 480×480
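The rounding conventions used when scaling channel counts and block repeats can be sketched as follows, in the style of the original EfficientNet reference code (channels rounded to a multiple of 8, repeats rounded up). The coefficient values passed in the example are placeholders for illustration, not the exact values used to derive V2-M.

```python
# Illustrative compound-scaling arithmetic: depth is scaled and rounded up,
# width is scaled and rounded to a multiple of a divisor (8 here). The
# coefficients below are placeholders, not the published V2-M values.
import math


def round_repeats(repeats: int, depth_coefficient: float) -> int:
    """Scale the number of block repeats in a stage, rounding up."""
    return int(math.ceil(depth_coefficient * repeats))


def round_filters(filters: int, width_coefficient: float, divisor: int = 8) -> int:
    """Scale a channel count and round it to the nearest multiple of `divisor`."""
    filters *= width_coefficient
    new_filters = max(divisor, int(filters + divisor / 2) // divisor * divisor)
    if new_filters < 0.9 * filters:  # never round down by more than 10%
        new_filters += divisor
    return int(new_filters)


# Example: scaling a base stage with 48 channels and 4 repeats.
print(round_filters(48, width_coefficient=1.1))   # -> 56
print(round_repeats(4, depth_coefficient=1.2))    # -> 5
```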
3. Training Protocols and Regularization
The training of EfficientNetV2-M is structured around progressive learning, which schedules image resolution and regularization magnitude to increase linearly during training. The training protocol for V2-M is specifically characterized by:
- Progressive Learning: Four stages, each spanning approximately 87 epochs, with image sizes increasing from 128×128 to 380×380.
- Regularization Scheduling:
- Dropout rate: Increasing linearly from 0.10 to 0.40
- RandAugment magnitude: 5 to 20
- Mixup α: 0.00 to 0.20
- Stochastic depth survival probability: fixed at 0.8
Optimization is performed with RMSProp (momentum 0.9, decay 0.9), weight decay 1e-5, batch normalization momentum 0.99, and batch size 4096 (TPU). The learning rate is warmed up from 0 to 0.256 over 5000 steps, followed by exponential decay by a factor of 0.97 every 2.4 epochs. An exponential moving average of the weights is maintained with decay 0.9999.
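A minimal sketch of the four-stage progressive schedule is given below, using the ranges listed above (image size 128→380, dropout 0.10→0.40, RandAugment magnitude 5→20, mixup α 0→0.2). The linear-interpolation helper and stage boundaries are an assumed simplification of how such a schedule can be implemented, not the published training code.

```python
# Sketch of the four-stage progressive-learning schedule described above.
# Each setting is linearly interpolated from its starting value to its final
# value across the stages; the helper itself is a simplified assumption.
from dataclasses import dataclass

TOTAL_EPOCHS = 350
NUM_STAGES = 4


@dataclass
class StageSettings:
    image_size: int
    dropout: float
    randaug_magnitude: int
    mixup_alpha: float


def settings_for_stage(stage: int) -> StageSettings:
    """Interpolate training settings for a stage index in [0, NUM_STAGES)."""
    t = stage / (NUM_STAGES - 1)  # 0.0 at the first stage, 1.0 at the last

    def lerp(lo: float, hi: float) -> float:
        return lo + t * (hi - lo)

    return StageSettings(
        image_size=int(lerp(128, 380)),
        dropout=lerp(0.10, 0.40),
        randaug_magnitude=int(round(lerp(5, 20))),
        mixup_alpha=lerp(0.0, 0.2),
    )


for stage in range(NUM_STAGES):
    epochs_per_stage = TOTAL_EPOCHS // NUM_STAGES  # roughly 87 epochs per stage
    print(stage, epochs_per_stage, settings_for_stage(stage))
```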
4. Model Efficiency and Computation
EfficientNetV2-M is designed for enhanced efficiency:
- Parameters: 54 million
- FLOPs: 24B (per forward pass at the 480×480 inference resolution)
- Inference Latency: ≈57 ms (NVIDIA V100, FP16, batch size 16)
- Training Time: 13 hours for 350 epochs on ImageNet-1k, using 32 TPUv3 cores with progressive learning
For comparison, EfficientNet-V1-B7 requires 139 hours for training under similar conditions, and V1-B6 requires 75 hours, both with lower top-1 accuracy.
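The headline parameter count can be checked against the torchvision implementation of the architecture (assuming torchvision ≥ 0.13, which provides efficientnet_v2_m); this verifies model size only, not the FLOPs, latency, or training-time figures.

```python
# Quick check of the ~54M-parameter figure using torchvision's
# EfficientNetV2-M implementation (requires torchvision >= 0.13).
from torchvision.models import efficientnet_v2_m

model = efficientnet_v2_m(weights=None)  # architecture only, no weight download
num_params = sum(p.numel() for p in model.parameters())
print(f"EfficientNetV2-M parameters: {num_params / 1e6:.1f}M")  # ~54M
```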
5. Performance Metrics and Transfer Results
On the ImageNet ILSVRC2012 dataset (1.28M train, 50K val), EfficientNetV2-M achieves:
- Top-1 Accuracy: 85.1%
- Top-5 Accuracy: ≃97.3%
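For reference, a minimal inference sketch using the ImageNet-1k pre-trained weights distributed with torchvision is shown below (assumes torchvision ≥ 0.13; "example.jpg" is a placeholder path). The bundled weight transforms apply the eval-time resizing and normalization.

```python
# Minimal ImageNet-1k inference sketch with torchvision's pre-trained
# EfficientNetV2-M weights (torchvision >= 0.13). "example.jpg" is a
# placeholder image path.
import torch
from PIL import Image
from torchvision.models import EfficientNet_V2_M_Weights, efficientnet_v2_m

weights = EfficientNet_V2_M_Weights.IMAGENET1K_V1
model = efficientnet_v2_m(weights=weights).eval()
preprocess = weights.transforms()  # bundled eval-time resize/crop/normalize

image = Image.open("example.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)

top5 = logits.softmax(dim=1).topk(5)
for prob, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{weights.meta['categories'][idx.item()]}: {prob.item():.3f}")
```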
Transfer learning (finetuned from ImageNet-1k) yields:
| Dataset | V2-M Top-1 (%) | V1-B7 Top-1 (%) | ViT-B/16 Top-1 (%) |
|---|---|---|---|
| CIFAR-10 | 99.0 ±0.08 | 98.9 | 98.1 |
| CIFAR-100 | 92.2 ±0.08 | 91.7 | 87.1 |
| Flowers102 | 98.5 ±0.08 | 98.8 | 89.5 |
| Stanford Cars | 94.6 ±0.10 | 94.7 | – |
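These transfer results are obtained by fine-tuning the ImageNet-1k checkpoint on each dataset. A minimal sketch of the usual head-replacement step is shown below for CIFAR-100, assuming the torchvision model layout; the fine-tuning schedule itself (optimizer, epochs, augmentation) is not reproduced from the paper.

```python
# Sketch of preparing EfficientNetV2-M for fine-tuning on CIFAR-100 by
# replacing the ImageNet classifier head. Assumes the torchvision layout:
# model.classifier = Sequential(Dropout, Linear).
import torch.nn as nn
from torchvision.models import EfficientNet_V2_M_Weights, efficientnet_v2_m

model = efficientnet_v2_m(weights=EfficientNet_V2_M_Weights.IMAGENET1K_V1)
in_features = model.classifier[1].in_features      # 1280, per the stage table
model.classifier[1] = nn.Linear(in_features, 100)  # CIFAR-100 has 100 classes

# All parameters are left trainable here; the actual fine-tuning recipe
# (optimizer, learning rate, epochs, augmentation) is not shown.
```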
6. Comparative Analysis and Trade-offs
EfficientNetV2-M demonstrates a top-1 accuracy approximately 1.0% higher than a comparably sized V1 model, while attaining a 6× speedup in training. At a fixed computational budget (24B FLOPs), V2-M yields a superior accuracy-latency balance relative to larger contemporary architectures, including ResNeSt and NFNet, as measured on standard benchmarks.
A plausible implication is that the introduction of Fused-MBConv blocks and progressive regularization scheduling is central to reconciling accelerator utilization with high generalization performance; the architectural details and empirical metrics above are reported strictly as published.
7. Key Formulas and Block Structure
The essential architectural building blocks can be formalized as follows:
- MBConv: $y = x + \mathrm{DropConnect}\big(\mathrm{Conv}_{1\times1}^{\text{project}}\big(\mathrm{SE}\big(\mathrm{DWConv}_{3\times3}\big(\mathrm{Conv}_{1\times1}^{\text{expand}}(x)\big)\big)\big)\big)$
  (with SE and drop-connect integrated in the sequence; the residual term applies when the stride is 1 and input/output channels match)
- Fused-MBConv: $y = x + \mathrm{DropConnect}\big(\mathrm{Conv}_{1\times1}^{\text{project}}\big(\mathrm{Conv}_{3\times3}^{\text{expand}}(x)\big)\big)$
These formulations define the core computational paths and are repeated according to stagewise configuration, with parameter counts and repeat factors set by compound scaling rules detailed above (Tan et al., 2021).