EfficientNetV2: Optimized CNN Architecture
- EfficientNetV2 is a family of CNNs that optimizes training speed, parameter efficiency, and accuracy by combining new operators such as Fused-MBConv with training-aware NAS.
- It employs a progressive learning strategy that adaptively increases resolution and regularization, significantly reducing training time while boosting performance.
- Empirical evaluations show state-of-the-art results on benchmarks like ImageNet, outperforming comparable CNN and transformer models in speed and accuracy.
EfficientNetV2 is a family of convolutional neural networks designed to optimize training speed, parameter efficiency, and accuracy for large-scale visual recognition tasks. Originating as a successor to the original EfficientNet, EfficientNetV2 introduces fundamental architectural and methodological innovations, including a richer operator set, training-aware neural architecture search (NAS), compound model scaling adapted for modern accelerators, and a progressive learning schedule that adaptively adjusts regularization in tandem with input resolution. Empirical evaluations demonstrate that EfficientNetV2 models achieve significantly faster training—up to 11×—while maintaining or exceeding top-1 accuracy compared to both convolutional and transformer-based models on ImageNet, CIFAR, Flowers, and Cars benchmarks (Tan et al., 2021).
1. Architectural Innovations and Operator Choices
The foundational building blocks of EfficientNetV2 are the MBConv (Mobile Inverted Bottleneck Convolution) and the novel Fused-MBConv operators.
The MBConv block comprises the following components (a code sketch follows the list):
- A 1×1 pointwise expansion convolution (expansion ratio typically 4 or 6)
- A 3×3 depthwise convolution
- A 1×1 pointwise projection
- Optional Squeeze-and-Excitation (SE) gating
- A skip connection when the input and output dimensions match
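As a concrete reference for this composition, the PyTorch sketch below implements an MBConv block along the lines described above; the class names, default expansion ratio, SE reduction size, and use of SiLU activations are illustrative choices rather than the reference implementation.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel-wise gating: global pool -> reduce -> expand -> sigmoid scale."""
    def __init__(self, channels, reduced):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, reduced, 1)
        self.fc2 = nn.Conv2d(reduced, channels, 1)
        self.act = nn.SiLU()

    def forward(self, x):
        s = x.mean((2, 3), keepdim=True)          # global average pool
        s = self.act(self.fc1(s))
        s = torch.sigmoid(self.fc2(s))
        return x * s

class MBConv(nn.Module):
    """Mobile inverted bottleneck: 1x1 expand -> 3x3 depthwise -> SE -> 1x1 project."""
    def __init__(self, in_ch, out_ch, expand_ratio=4, stride=1, use_se=True):
        super().__init__()
        mid = in_ch * expand_ratio
        self.use_skip = stride == 1 and in_ch == out_ch
        self.expand = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU())
        self.depthwise = nn.Sequential(
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU())
        self.se = SqueezeExcite(mid, max(1, in_ch // 4)) if use_se else nn.Identity()
        self.project = nn.Sequential(
            nn.Conv2d(mid, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))               # no activation after projection

    def forward(self, x):
        out = self.project(self.se(self.depthwise(self.expand(x))))
        return x + out if self.use_skip else out
```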
While MBConvs are FLOPs- and parameter-efficient, their depthwise layers can bottleneck throughput on modern hardware, particularly in early network stages where spatial resolution is high. To address this, Fused-MBConv merges the expansion and depthwise operations into a single (fused) 3×3 regular convolution, optionally followed by a 1×1 projection. Concretely:
- For an expansion ratio greater than 1, a single 3×3 convolution (absorbing both the expansion and the spatial filtering) is followed by a 1×1 projection convolution
- For an expansion ratio of 1, the block reduces to a single 3×3 convolution with no separate projection
This operator mitigates memory bottlenecks, achieving up to 40% higher throughput in early stages with only minimal increases in parameter count and computational cost. EfficientNetV2 networks employ a stage-wise combination of MBConv and Fused-MBConv blocks, with kernel sizes of 3×3 and 5×5, expansion ratios of 1, 4, or 6, and optional SE gating, all subject to search via NAS.
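Under the same assumptions as the MBConv sketch above, a minimal Fused-MBConv can be written as follows (the `FusedMBConv` class and its defaults are illustrative):

```python
import torch.nn as nn

class FusedMBConv(nn.Module):
    """Fused-MBConv: one 3x3 regular conv replaces the 1x1 expansion plus 3x3
    depthwise pair; a 1x1 projection follows only when expand_ratio > 1."""
    def __init__(self, in_ch, out_ch, expand_ratio=4, stride=1):
        super().__init__()
        mid = in_ch * expand_ratio
        self.use_skip = stride == 1 and in_ch == out_ch
        if expand_ratio > 1:
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, mid, 3, stride=stride, padding=1, bias=False),  # fused expand + spatial conv
                nn.BatchNorm2d(mid), nn.SiLU(),
                nn.Conv2d(mid, out_ch, 1, bias=False),                           # 1x1 projection
                nn.BatchNorm2d(out_ch))
        else:
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
                nn.BatchNorm2d(out_ch), nn.SiLU())

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out
```

Because the fused block trades depthwise separability for a dense 3×3 convolution, it is most beneficial in early stages where channel counts are small and memory access, not FLOPs, dominates runtime.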
2. Training-Aware Neural Architecture Search and Compound Scaling
EfficientNetV2 introduces a training-aware NAS objective that incorporates:
- Validation accuracy after brief training
- Normalized per-step training time
- Parameter count
The reward for each candidate architecture combines its validation accuracy $A$, normalized per-step training time $S$, and parameter count $P$:

$\text{reward} = A \cdot S^{w} \cdot P^{v}$, with $w = -0.07$ and $v = -0.05$.

The small negative exponents act as soft penalties that trade accuracy against training speed and model size. Approximately 1,000 architectures are sampled, and the backbone with maximal reward is selected as EfficientNetV2-S.
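In code, this reward is a simple product; the sketch below uses the exponent values quoted above, while the example candidates and their numbers are hypothetical.

```python
def nas_reward(accuracy, step_time, params, w=-0.07, v=-0.05):
    """Training-aware NAS reward: A * S^w * P^v.

    accuracy : validation accuracy after brief training (e.g., 0.78)
    step_time: normalized per-step training time of the candidate
    params   : normalized parameter count of the candidate
    Negative exponents softly penalize slow or parameter-heavy models.
    """
    return accuracy * (step_time ** w) * (params ** v)

# Example: a slightly less accurate but faster, smaller candidate can win.
fast_small = nas_reward(accuracy=0.780, step_time=0.90, params=0.95)
slow_big   = nas_reward(accuracy=0.785, step_time=1.30, params=1.40)
print(f"fast/small reward: {fast_small:.4f}, slow/big reward: {slow_big:.4f}")
```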
Scaling to medium and large variants (M, L, XL) employs a modified compound scaling scheme:

$d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}$

where $\alpha$, $\beta$, and $\gamma$ are the multipliers for depth, width, and input resolution, and $\phi$ is the global compound coefficient. Compared to the original EfficientNet, EfficientNetV2 caps the maximum inference image size at 480 pixels to restrict memory usage and further biases additional depth toward later network stages (especially stages 5–6) to preserve early-stage throughput.
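A small sketch of this scaling rule is shown below; the base multipliers and base resolution are placeholder values for illustration, while the 480-pixel cap reflects the constraint described above.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15,
                   base_resolution=300, max_resolution=480):
    """Scale depth, width, and resolution from a single coefficient phi.

    alpha, beta, gamma are per-dimension multipliers (placeholder values,
    not the searched constants). Resolution is capped at 480 px.
    """
    depth_mult = alpha ** phi
    width_mult = beta ** phi
    resolution = min(round(base_resolution * gamma ** phi), max_resolution)
    return depth_mult, width_mult, resolution

for phi in (0, 1, 2):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution {r}px")
```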
3. Progressive Learning and Adaptive Regularization
EfficientNetV2 introduces a progressive learning strategy in which image resolution and regularization are both adaptively increased over multiple training stages. Training is divided into $M$ stages (typically $M = 4$); for stage $i$, the input resolution and each regularization magnitude are interpolated linearly from minimum to maximum values:

$R_i = R_0 + (R_e - R_0)\cdot\frac{i-1}{M-1}, \qquad \phi_i^k = \phi_0^k + (\phi_e^k - \phi_0^k)\cdot\frac{i-1}{M-1}$

Here, $R_0$ and $R_e$ are the starting and ending image sizes, and $\phi_0^k$, $\phi_e^k$ are the corresponding strengths for each regularization type $k$ (dropout, RandAugment, or mixup). For example, V2-L is trained with resolution increasing from 128 to 380 pixels, dropout from 0.1 to 0.5, RandAugment magnitude from 5 to 25, and mixup from 0 to 0.4. Ablation studies show that this approach recovers over 0.7% top-1 accuracy lost to naive progressive resizing or random resolution sampling, while reducing training time by 30%–65% on large models.
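The stagewise interpolation can be expressed compactly; the sketch below uses the EfficientNetV2-L endpoints quoted above and assumes $M = 4$ stages, with the helper name and returned dictionary format chosen purely for illustration.

```python
def progressive_schedule(num_stages=4,
                         image_size=(128, 380),
                         dropout=(0.1, 0.5),
                         randaug=(5, 25),
                         mixup=(0.0, 0.4)):
    """Linearly interpolate image size and regularization strengths per stage."""
    def interp(lo, hi, i):
        return lo + (hi - lo) * i / (num_stages - 1)

    stages = []
    for i in range(num_stages):
        stages.append({
            "image_size": round(interp(*image_size, i)),
            "dropout":    interp(*dropout, i),
            "randaug":    interp(*randaug, i),
            "mixup":      interp(*mixup, i),
        })
    return stages

for idx, cfg in enumerate(progressive_schedule(), start=1):
    print(f"stage {idx}: {cfg}")
```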
4. Training Protocol and Benchmark Performance
EfficientNetV2 models are trained on ImageNet ILSVRC2012 with the following protocol:
- Optimizer: RMSProp (decay 0.9, momentum 0.9)
- Batch size: 4096
- Epochs: 350
- Learning rate: warmup to 0.256, exponential decay (factor 0.97 every 2.4 epochs)
- BatchNorm momentum: 0.99
- Weight decay: 1e-5
- Exponential moving average: 0.9999
- Data augmentation: RandAugment, mixup, dropout, stochastic depth (survival rate 0.8)
Training is performed on 32 TPUv3 cores; inference is measured on V100 GPU (FP16, batch 16).
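Collected as a configuration sketch, the protocol above might be organized as follows; the dictionary layout, the helper function, and the five-epoch warmup length are assumptions rather than the authors' configuration format.

```python
# Hyperparameters as listed above; this layout is illustrative only.
TRAIN_CONFIG = {
    "optimizer":          {"name": "rmsprop", "decay": 0.9, "momentum": 0.9},
    "batch_size":         4096,
    "epochs":             350,
    "learning_rate":      {"peak": 0.256, "decay_factor": 0.97,
                           "decay_every_epochs": 2.4},
    "batchnorm_momentum": 0.99,
    "weight_decay":       1e-5,
    "ema_decay":          0.9999,
    "regularization":     {"randaugment": True, "mixup": True, "dropout": True,
                           "stochastic_depth_survival": 0.8},
}

def learning_rate_at(epoch, cfg=TRAIN_CONFIG["learning_rate"], warmup_epochs=5):
    """Exponential decay after a linear warmup (warmup length is an assumption)."""
    if epoch < warmup_epochs:
        return cfg["peak"] * (epoch + 1) / warmup_epochs
    steps = (epoch - warmup_epochs) / cfg["decay_every_epochs"]
    return cfg["peak"] * cfg["decay_factor"] ** steps
```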
| Model Variant | Top-1 Accuracy | Parameters | FLOPs | Train Time | Inference Latency |
|---|---|---|---|---|---|
| EfficientNetV2-S | 83.9% | 22 M | 8.8 B | 7.1 h | 24 ms |
| EfficientNetV2-M | 85.1% | 54 M | 24 B | 13 h | 57 ms |
| EfficientNetV2-L | 85.7% | 120 M | 53 B | 24 h | 98 ms |
| V2-S (ImageNet21k) | 84.9% | 22 M | 8.8 B | 9.0 h | 24 ms |
| V2-M (ImageNet21k) | 86.2% | 54 M | 24 B | 15 h | 57 ms |
| V2-L (ImageNet21k) | 86.8% | 120 M | 53 B | 26 h | 98 ms |
| V2-XL (ImageNet21k) | 87.3% | 208 M | 94 B | 45 h | — |
On transfer learning tasks (with ImageNet ILSVRC2012 pretraining only, no ImageNet21k), V2-L achieves top-1 accuracies of 99.1% on CIFAR-10, 92.3% on CIFAR-100, 98.8% on Flowers, and 95.1% on Cars, outperforming both convolutional baselines and Vision Transformers of comparable or larger scale.
5. Applications and Extensions
EfficientNetV2 architectures have been adapted to domains beyond natural image classification. For example, in the super-resolution of digital elevation models (DEMs), an EfficientNetV2-S-inspired backbone outperformed previous interpolation and deep learning methods in mean squared error (MSE). In this application, the architecture preserved the full spatial resolution through the body before two pixel-shuffle upsampling stages and a high-resolution residual skip, achieving a 16× upsampling factor with an MSE reduction of approximately 20% relative to prior CNN and GAN architectures (Demiray et al., 2021).
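A minimal sketch of such an upsampling tail is given below; the channel counts, the 4×-then-4× factorization of the 16× upsampling, and the bicubic residual skip are assumptions made for illustration, not the exact architecture of Demiray et al.

```python
import torch.nn as nn

class UpsamplingTail(nn.Module):
    """Two pixel-shuffle stages giving a 16x upsampling factor (4x then 4x),
    followed by a single-channel output conv. Channel counts are illustrative."""
    def __init__(self, in_ch=64, mid_ch=64):
        super().__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch * 16, 3, padding=1),
            nn.PixelShuffle(4),            # (mid_ch*16, H, W) -> (mid_ch, 4H, 4W)
            nn.SiLU())
        self.stage2 = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch * 16, 3, padding=1),
            nn.PixelShuffle(4),            # -> (mid_ch, 16H, 16W)
            nn.SiLU())
        self.out = nn.Conv2d(mid_ch, 1, 3, padding=1)

    def forward(self, features, low_res_dem):
        x = self.stage2(self.stage1(features))
        # Residual skip: bicubic upsample of the input DEM (an assumption here).
        skip = nn.functional.interpolate(low_res_dem, scale_factor=16,
                                         mode="bicubic", align_corners=False)
        return self.out(x) + skip
```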
6. Insights, Limitations, and Future Directions
EfficientNetV2 demonstrates that co-designing architectural choices, operator selection, and training procedures can lead to models that exhibit both accelerated convergence and strong generalization. The work highlights the importance of memory and throughput considerations in operator design, the value of explicitly optimizing training time during NAS, and the empirical gains from progressive, adaptive learning schedules.
Identified limitations include the requirement to recompile computational graphs at each new resolution on TPU hardware (partially addressed by stagewise resizing), as well as the increased complexity in hyperparameter tuning for regularization schedules. Future research directions include automated metalearning approaches to schedule design, extending training-aware NAS to optimize quantization or sparsity objectives, and applying the architectural principles to dense prediction tasks such as object detection and semantic segmentation, especially where early-layer computational efficiency is crucial.
EfficientNetV2’s suite of design principles and empirical performance reinforce the relevance of convolution-based models as strong contenders to transformer-based vision models for both efficiency and accuracy in large-scale visual learning (Tan et al., 2021, Demiray et al., 2021).