EfficientNetV2S: Efficient CNN Variant
- EfficientNetV2S is a compact convolutional neural network designed for parameter efficiency and rapid training using training-aware NAS and compound scaling.
- It utilizes a staged architecture with Fused-MBConv and MBConv blocks, integrating Swish activations and batch normalization for robust, hardware-efficient performance.
- On ImageNet-1K, it achieves 83.9% Top-1 accuracy with 22M parameters and serves effectively as a transferable backbone in tasks like plant disease classification.
EfficientNetV2S is the "small" variant of the EfficientNetV2 family of convolutional neural networks designed with a focus on both parameter efficiency and fast training. Developed using training-aware neural architecture search (NAS), EfficientNetV2S combines advanced building blocks, progressive learning strategies, and compound model scaling to achieve strong accuracy metrics on large-scale benchmarks while remaining computationally efficient. It is widely employed as a transferable backbone for domain-specific fine-tuning, especially in tasks with limited data availability.
1. Architectural Components
EfficientNetV2S employs a staged convolutional architecture optimized for fast training and inference. The network is composed of an initial convolutional layer followed by six principal blocks—Fused-MBConv and MBConv types—culminating in a 1×1 convolution and a dense classification head. The layer-wise specifications are as follows (Tan et al., 2021):
| Stage | Block Type | Kernel/Stride | Channels (In→Out) | Repeats |
|---|---|---|---|---|
| 0 | Conv 3×3 | 3×3 / 2 | 3 → 24 | 1 |
| 1 | Fused-MBConv, r=1 | 3×3 / 1 | 24 → 24 | 2 |
| 2 | Fused-MBConv, r=4 | 3×3 / 2 | 24 → 48 | 4 |
| 3 | Fused-MBConv, r=4 | 3×3 / 2 | 48 → 64 | 4 |
| 4 | MBConv, r=4, SE(0.25) | 3×3 / 2 | 64 → 128 | 6 |
| 5 | MBConv, r=6, SE(0.25) | 3×3 / 1 | 128 → 160 | 9 |
| 6 | MBConv, r=6, SE(0.25) | 3×3 / 2 | 160 → 256 | 15 |
| 7 | Conv1×1 + Pool + FC | 1×1 / – | 256 → 1280 | 1 |
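As a consistency check on the stage table, the per-stage strides can be composed to track how spatial resolution shrinks through the backbone. The sketch below is plain Python with the stage list transcribed from the table; it assumes "same" padding and is an illustration, not a framework implementation:

```python
# Stage specs transcribed from the table above:
# (name, stride, out_channels, repeats). Only the first repeat of a
# stage downsamples; later repeats use stride 1.
STAGES = [
    ("conv3x3",       2,  24,  1),
    ("fused_mbconv1", 1,  24,  2),
    ("fused_mbconv4", 2,  48,  4),
    ("fused_mbconv4", 2,  64,  4),
    ("mbconv4_se",    2, 128,  6),
    ("mbconv6_se",    1, 160,  9),
    ("mbconv6_se",    2, 256, 15),
]

def trace_shapes(resolution: int):
    """Propagate (stage name, H == W, C) through the backbone stages."""
    shapes, size = [], resolution
    for name, stride, out_ch, repeats in STAGES:
        size = -(-size // stride)  # ceil division models 'same' padding
        shapes.append((name, size, out_ch))
    return shapes

# Total downsampling is 2^5 = 32, so a 300px input reaches the final
# 1x1 conv at roughly 10x10 spatial resolution with 256 channels.
print(trace_shapes(300)[-1])
```

Running the tracer at other input sizes (e.g. 224) shows the same 32× reduction, which is why the head's global pooling step is resolution-agnostic.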
The key block types are:
- MBConv (Mobile Inverted Residual Bottleneck): Utilizes expansion (1×1 conv), depthwise 3×3 convolution, optional squeeze-and-excitation, projection (1×1 conv), and skip connections.
- Fused-MBConv: Merges expansion and depthwise convs into a single 3×3 regular convolution followed by projection, improving performance in early layers and on edge/TPU hardware.
All convolutions use Swish activation unless specified otherwise, and batch normalization is applied after each convolutional operation (Tan et al., 2021).
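To make the difference between the two block types concrete, the following rough parameter-count sketch compares one MBConv block (1×1 expansion, k×k depthwise, SE, 1×1 projection) with one Fused-MBConv block (expansion and depthwise merged into a single k×k conv). Biases and BatchNorm parameters are deliberately omitted, so these are illustrative approximations rather than reference figures:

```python
def mbconv_params(c_in, c_out, expand=4, se_ratio=0.25, k=3):
    """Approximate weight count of one MBConv block (no biases/BN)."""
    c_mid = c_in * expand
    expansion  = c_in * c_mid                 # 1x1 expansion conv
    depthwise  = c_mid * k * k                # kxk depthwise conv
    c_se = max(1, int(c_in * se_ratio))       # SE squeeze width
    se         = c_mid * c_se + c_se * c_mid  # squeeze + excite FCs
    projection = c_mid * c_out                # 1x1 projection conv
    return expansion + depthwise + se + projection

def fused_mbconv_params(c_in, c_out, expand=4, k=3):
    """Fused variant: expansion and depthwise merged into one kxk conv."""
    c_mid = c_in * expand
    return c_in * c_mid * k * k + c_mid * c_out

# Fused blocks spend more weights in the full kxk conv but map better
# to dense matmul hardware, which is why they appear in early stages.
print(mbconv_params(64, 128), fused_mbconv_params(24, 48))
```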
2. Model Scaling and Progressive Learning
EfficientNetV2S employs the compound scaling strategy introduced in EfficientNetV1, scaling network depth $d = \alpha^{\phi}$, width $w = \beta^{\phi}$, and input resolution $r = \gamma^{\phi}$ according to base coefficients $\alpha, \beta, \gamma \geq 1$ and a global scaling factor $\phi$, with the constraint $\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2$. This allows for principled scaling across model sizes with a balanced tradeoff between accuracy and efficiency (Tan et al., 2021).
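The compound-scaling arithmetic can be written out in a few lines. Note the coefficient values below ($\alpha=1.2$, $\beta=1.1$, $\gamma=1.15$) are the ones reported for the original EfficientNetV1 grid search, used here only to illustrate the mechanism:

```python
# EfficientNetV1 grid-search coefficients (illustrative values).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi: float):
    """Depth/width/resolution multipliers for scaling factor phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# The constraint alpha * beta^2 * gamma^2 ~= 2 keeps total FLOPs
# roughly doubling with each unit increase of phi (depth scales FLOPs
# linearly; width and resolution scale them quadratically).
constraint = ALPHA * BETA ** 2 * GAMMA ** 2
print(round(constraint, 2))
```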
To accelerate training and improve generalization, progressive learning is incorporated: model training is split into stages with gradually increasing input resolutions and regularization strengths (dropout, RandAugment, mixup). For EfficientNetV2S, training progresses from a $128$-pixel to a $300$-pixel input size, with corresponding increases in regularization: dropout rate from $0.1$ to $0.3$ and RandAugment magnitude from $5$ to $15$ (Tan et al., 2021).
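A minimal sketch of such a schedule, assuming linear interpolation between the endpoint values quoted above (the number of training stages, four here, is a placeholder rather than a reported setting):

```python
def progressive_schedule(stage: int, num_stages: int = 4,
                         size_rng=(128, 300), drop_rng=(0.1, 0.3),
                         randaug_rng=(5, 15)):
    """Linearly interpolate input size and regularization per stage."""
    t = stage / (num_stages - 1)  # 0.0 at the first stage, 1.0 at the last
    lerp = lambda lo, hi: lo + t * (hi - lo)
    return {
        "image_size": int(lerp(*size_rng)),
        "dropout": round(lerp(*drop_rng), 3),
        "randaug_magnitude": round(lerp(*randaug_rng), 1),
    }

# Early stages: small images, weak regularization; late stages: the
# full resolution with the strongest regularization.
for s in range(4):
    print(progressive_schedule(s))
```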
3. Training Setup and Evaluation
EfficientNetV2S was originally benchmarked on ImageNet-1K with a training protocol involving RMSProp optimization (momentum $0.9$, decay $0.9$, weight decay $1\times 10^{-5}$), batch normalization momentum $0.99$, and an exponential moving average decay of $0.9999$. Training was conducted for $350$ epochs with a batch size of $4096$ across $32$ TPUv3 cores, with a learning rate warm-up to $0.256$ followed by exponential decay (Tan et al., 2021). Regularization includes RandAugment, mixup, dropout, and stochastic depth (survival $0.8$). No Top-5 accuracies were directly reported, but Top-1 performance was emphasized.
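The learning-rate schedule can be sketched as linear warm-up to the $0.256$ peak followed by stepwise exponential decay. The warm-up length ($5$ epochs) and the decay cadence ($0.97$ every $2.4$ epochs) follow the EfficientNet family's published recipe but should be treated as assumptions here:

```python
def learning_rate(epoch: float, peak: float = 0.256,
                  warmup_epochs: float = 5.0,
                  decay_rate: float = 0.97, decay_every: float = 2.4):
    """Linear warm-up to `peak`, then stepwise exponential decay."""
    if epoch < warmup_epochs:
        return peak * epoch / warmup_epochs  # ramp from 0 to peak
    steps = (epoch - warmup_epochs) // decay_every
    return peak * decay_rate ** steps

# LR starts at 0, peaks at 0.256 after warm-up, then decays slowly
# enough to remain non-trivial across the 350-epoch run.
print(learning_rate(0.0), learning_rate(5.0), learning_rate(350.0))
```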
On ImageNet-1K, EfficientNetV2S achieves $83.9\%$ Top-1 accuracy with $22$ million parameters and $8.8$ billion FLOPs, an inference latency of $24$ ms per image (NVIDIA V100, FP16, batch $16$), and a total training time of $7.1$ hours under the reported regime. This demonstrates improved parameter and training efficiency over EfficientNetV1 and competitive architectures, with V2-S outperforming V1-B4 ($82.9\%$ Top-1, $19$M params) while training substantially faster than V1-B3 (Tan et al., 2021).
4. Transfer Learning: Apple Leaf Disease Classification Case Study
A representative application of EfficientNetV2S as a transfer learning backbone is provided in the classification of apple leaf diseases (Ashmafee et al., 2023). In this domain-specific use case, EfficientNetV2S was employed as the feature extractor, initialized with ImageNet-1K pretrained weights and fine-tuned on the PlantVillage apple leaf subset. The key pipeline elements are:
- Input Preprocessing: All input images resized to a fixed square resolution and pixel values normalized to a bounded range before entering the backbone.
- Head Architecture: The $1280$-dimensional feature vector output from the final convolutional layer feeds into one or two Dense → BatchNorm → Dropout stacks (dropout rate and layer width not numerically specified), followed by a final Dense layer with Softmax activation for four-class probability output.
- Augmentation: To address class imbalance, runtime (on-the-fly) augmentation was applied per batch, including random rotation, horizontal flip, width/height shift, and shear transform, reducing the risk of test leakage and overfitting.
- Optimization: Training used the Adam optimizer, with the learning rate reduced by a factor of $0.1$ on validation stagnation, early stopping on prolonged stagnation, and a batch size of $32$.
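The patience-based callbacks in the optimization bullet can be sketched generically; the patience values below are placeholders, since the study's exact settings are not reproduced here:

```python
class PlateauController:
    """Track validation loss; cut LR on stagnation, stop after more."""

    def __init__(self, lr, lr_patience=3, stop_patience=6, factor=0.1):
        self.lr, self.factor = lr, factor
        self.lr_patience, self.stop_patience = lr_patience, stop_patience
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        """Call once per epoch; returns False when training should stop."""
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
            return True
        self.bad_epochs += 1
        if self.bad_epochs == self.lr_patience:
            self.lr *= self.factor  # ReduceLROnPlateau-style cut
        return self.bad_epochs < self.stop_patience

# Simulated run: validation loss improves twice, then stagnates; the
# LR is cut after 3 bad epochs and training halts after 6.
ctrl = PlateauController(lr=1e-3)
losses = [0.9, 0.7, 0.7, 0.71, 0.72, 0.72, 0.73, 0.74]
running = [ctrl.step(l) for l in losses]
```

In Keras the same behaviour comes from the `ReduceLROnPlateau` and `EarlyStopping` callbacks; this stand-alone version just makes the logic explicit.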
Evaluation on a $60/20/20$ train/val/test split (total $3171$ images, $4$ classes) yielded high test accuracy, with per-class precision, recall, and F1 scores all at comparably high levels. This result outperformed previous works employing ResNet18, LeNet, or FCNN-LDA architectures on the same dataset (Ashmafee et al., 2023).
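The per-class metrics cited above can be computed from a confusion matrix in a few lines. The $4\times 4$ matrix below is a toy example, not the paper's data:

```python
def per_class_metrics(confusion):
    """Precision/recall/F1 per class from a square confusion matrix
    (rows = true class, columns = predicted class)."""
    n = len(confusion)
    out = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp
        fn = sum(confusion[c]) - tp
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out.append((prec, rec, f1))
    return out

# Toy confusion matrix for four hypothetical apple-leaf classes.
cm = [[50, 0, 0, 0],
      [1, 48, 1, 0],
      [0, 0, 49, 1],
      [0, 1, 0, 49]]
metrics = per_class_metrics(cm)
```

Reporting all three quantities per class, rather than only overall accuracy, is what makes the class-imbalance claims in the study checkable.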
5. Comparative Metrics and State-of-the-Art Performance
On large benchmarks, EfficientNetV2S demonstrates advantageous tradeoffs in accuracy, model size, and training/inference speed. The table below summarizes comparative ImageNet-1K performance (Tan et al., 2021):
| Model | Top-1 (%) | Params (M) | FLOPs (B) | Latency (ms) | Train Time (h) |
|---|---|---|---|---|---|
| V2-S | 83.9 | 22 | 8.8 | 24 | 7.1 |
| V2-M | 85.1 | 54 | 24 | 57 | 13 |
| V2-L | 85.7 | 120 | 53 | 98 | 24 |
| V1-B4 | 82.9 | 19 | 4.2 | 30 | 21 |
| V1-B7 | 84.7 | 66 | 38 | 170 | 139 |
EfficientNetV2-M matches the Top-1 accuracy of V1-B7 while training roughly $11\times$ faster and running inference roughly $3\times$ faster (cf. the $13$ vs. $139$ hour and $57$ vs. $170$ ms entries in the table). Compared to recent Vision Transformer models (ViT/DeiT), EfficientNetV2S achieves higher Top-1 accuracy with approximately half the FLOPs and substantially faster training (Tan et al., 2021).
For transfer learning, EfficientNetV2S in plant disease detection surpasses previous methods (ResNet18, FCNN-LDA, LeNet-based) in absolute accuracy, with runtime augmentation and a slightly elevated input resolution shown to yield robust performance gains (Ashmafee et al., 2023).
6. Design Rationale and Recommendations
EfficientNetV2S is characterized by:
- An architecture that alternates Fused-MBConv (for early layers) and MBConv (for deeper layers with squeeze-and-excitation), leading to favorable hardware efficiency and representational power.
- Non-uniform stage scaling and progressive regularization, mitigating overfitting in deep networks while expediting training.
- Applicability as a pretrained backbone for transfer learning applications, with default configurations ("plug-and-play") sufficing for downstream fine-grained tasks when combined with shallow dense heads and aggressive runtime augmentation.
Evidence from domain applications indicates that direct reuse of EfficientNetV2S without pruning or topology changes is sufficient for high accuracy in transfer settings, contingent upon appropriate image resolution adaptation, data augmentation, and conservative optimization hyperparameters (Ashmafee et al., 2023).
7. Implications and Limitations
The use of EfficientNetV2S in transfer learning tasks, as demonstrated in plant disease classification, highlights its utility for tasks with modest sample sizes and class imbalances, given adequate runtime augmentation and pretraining. While not all regularization and optimizer details generalize across domains (e.g., the original recipe used RMSProp, whereas domain adaptations may use Adam), the core architectural decisions and scaling principles remain applicable.
No direct head-to-head significance testing is presented in these transfer studies. Performance comparisons are based on absolute accuracy improvements. A plausible implication is that for domains similar in scale and complexity to PlantVillage, EfficientNetV2S will generally offer a strong baseline without substantial architecture modifications (Ashmafee et al., 2023, Tan et al., 2021).