
EfficientNet-B0: Compact CNN Architecture

Updated 19 December 2025
  • EfficientNet-B0 is a compact convolutional neural network architecture designed with compound scaling to balance depth, width, and resolution.
  • It utilizes MBConv blocks with Squeeze-and-Excitation modules, achieving 77.1% top-1 ImageNet accuracy with only 0.39B FLOPs.
  • Adapted for transfer learning, self-supervised distillation, and multioutput prediction, it serves versatile roles in scientific property prediction.

EfficientNet-B0 is a compact convolutional neural network architecture serving as the baseline model in the EfficientNet family, designed through neural architecture search and optimized by compound scaling rules. It achieves high accuracy and computational efficiency by balancing depth, width, and resolution, and has been successfully adapted for transfer learning, self-supervised learning via knowledge distillation, and multioutput scientific property prediction.

1. Compound Scaling and Architectural Foundations

EfficientNet-B0 implements a compound scaling method, wherein the depth ($d$), width ($w$), and input resolution ($r$) of the network are scaled jointly using a single user-specified coefficient $\phi$. The scaling factors are defined as

$$d = \alpha^\phi, \qquad w = \beta^\phi, \qquad r = \gamma^\phi$$

where the base coefficients $(\alpha, \beta, \gamma)$ for B0 are $(1.2, 1.1, 1.15)$, chosen such that $\alpha\,\beta^2\,\gamma^2 \approx 2$, so that each unit increase in $\phi$ roughly doubles the FLOPs. For $\phi = 0$ (EfficientNet-B0), the scaling leaves the baseline unchanged, while larger $\phi$ values yield the B1–B7 variants with increased model capacity and computational complexity (Tan et al., 2019).

EfficientNet-B0 was derived from a multi-objective neural architecture search trading off accuracy against computational cost (targeting roughly 400M FLOPs); the resulting network attains 77.1% top-1 ImageNet accuracy. This optimized baseline acts as the "seed" from which the scalable B1–B7 variants are produced via compound scaling.
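As a concrete illustration, the following Python sketch applies the scaling rules above to a hypothetical stage. The rounding scheme is an assumption of this example: real implementations additionally round channel counts to multiples of 8 and hand-pick the B1–B7 input resolutions.

```python
import math

# Base coefficients from Tan et al. (2019); alpha * beta^2 * gamma^2 ~= 2,
# so each unit increase of phi roughly doubles the FLOPs.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi, base_layers, base_channels, base_resolution=224):
    """Scale a stage's depth, width, and input resolution jointly by phi."""
    depth = math.ceil(base_layers * ALPHA ** phi)        # more layers per stage
    width = math.ceil(base_channels * BETA ** phi)       # wider channels
    resolution = round(base_resolution * GAMMA ** phi)   # larger input images
    return depth, width, resolution

print(compound_scale(0, base_layers=2, base_channels=24))  # (2, 24, 224): B0 unchanged
print(compound_scale(1, base_layers=2, base_channels=24))  # roughly B1-scale capacity
```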

2. Layerwise Structure and Implementation Details

EfficientNet-B0 employs Mobile Inverted Bottleneck Convolution (MBConv) blocks augmented with Squeeze-and-Excitation (SE) modules (reduction ratio 0.25). Expansion factors are 6 throughout, except for the initial MBConv block, which uses an expansion factor of 1. The activation function is Swish (SiLU). Training uses batch normalization, stochastic depth (survival probability 0.8), dropout (0.2 for B0, increasing linearly to 0.5 for B7), and AutoAugment (Tan et al., 2019).

| Stage | Operator | Input Resolution | Output Channels | #Layers | Expansion | Kernel | SE ratio |
|-------|----------|------------------|-----------------|---------|-----------|--------|----------|
| 1 | Conv3×3 (stride 2) | 224×224 | 32 | 1 | – | 3×3 | – |
| 2 | MBConv1 (stride 1) | 112×112 | 16 | 1 | ×1 | 3×3 | 0.25 |
| 3 | MBConv6 (stride 2) | 112×112 | 24 | 2 | ×6 | 3×3 | 0.25 |
| 4 | MBConv6 (stride 2) | 56×56 | 40 | 2 | ×6 | 5×5 | 0.25 |
| 5 | MBConv6 (stride 2) | 28×28 | 80 | 3 | ×6 | 3×3 | 0.25 |
| 6 | MBConv6 (stride 1) | 14×14 | 112 | 3 | ×6 | 5×5 | 0.25 |
| 7 | MBConv6 (stride 2) | 14×14 | 192 | 4 | ×6 | 5×5 | 0.25 |
| 8 | MBConv6 (stride 1) | 7×7 | 320 | 1 | ×6 | 3×3 | 0.25 |
| 9 | Conv1×1 → Pool → FC | 7×7 | 1280 → 1000 | 1 | – | 1×1 | – |

Strides refer to the first layer of each stage; repeated layers within a stage use stride 1.
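A minimal PyTorch sketch of one MBConv block with an SE module, following the description and table above, is given below. It omits stochastic depth and batch-norm hyperparameters, and the exact layer ordering is modeled on common open-source implementations rather than taken verbatim from the paper.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """SE module: global pool -> reduce -> SiLU -> expand -> sigmoid gate."""
    def __init__(self, channels, se_channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.reduce = nn.Conv2d(channels, se_channels, kernel_size=1)
        self.expand = nn.Conv2d(se_channels, channels, kernel_size=1)
        self.act = nn.SiLU()

    def forward(self, x):
        s = self.act(self.reduce(self.pool(x)))
        return x * torch.sigmoid(self.expand(s))

class MBConv(nn.Module):
    """Mobile inverted bottleneck block with SE, per the layout above."""
    def __init__(self, cin, cout, kernel, stride, expansion, se_ratio=0.25):
        super().__init__()
        mid = cin * expansion
        layers = []
        if expansion != 1:  # stage 2's MBConv1 skips the expansion conv
            layers += [nn.Conv2d(cin, mid, 1, bias=False),
                       nn.BatchNorm2d(mid), nn.SiLU()]
        layers += [
            nn.Conv2d(mid, mid, kernel, stride, padding=kernel // 2,
                      groups=mid, bias=False),               # depthwise conv
            nn.BatchNorm2d(mid), nn.SiLU(),
            SqueezeExcite(mid, max(1, int(cin * se_ratio))),  # SE ratio 0.25
            nn.Conv2d(mid, cout, 1, bias=False),              # linear projection
            nn.BatchNorm2d(cout),
        ]
        self.block = nn.Sequential(*layers)
        self.use_residual = (stride == 1 and cin == cout)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

# Example: one stage-3 block (16 -> 24 channels, 3x3 kernel, stride 2, expansion 6)
block = MBConv(cin=16, cout=24, kernel=3, stride=2, expansion=6)
print(block(torch.randn(1, 16, 112, 112)).shape)  # torch.Size([1, 24, 56, 56])
```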

The network has 5.3 million parameters and requires 0.39 billion FLOPs for a standard $224 \times 224$ single-crop inference. ImageNet performance reaches 77.1% top-1 and 93.3% top-5 validation accuracy. The default ImageNet training recipe uses RMSProp (momentum 0.9, decay 0.9) with an initial learning rate of 0.256, weight decay of $1 \times 10^{-5}$, and early stopping on a held-out minival set (Tan et al., 2019).
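A sketch of this training configuration in PyTorch, using torchvision's stock B0 definition; the per-epoch exponential approximation of the paper's step schedule (decay 0.97 every 2.4 epochs) is an assumption of this example.

```python
import torch
from torchvision.models import efficientnet_b0

model = efficientnet_b0()  # torchvision's standard B0 definition

# Reported recipe: RMSProp with decay 0.9 and momentum 0.9,
# initial learning rate 0.256, weight decay 1e-5.
optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=0.256,
    alpha=0.9,        # RMSProp decay
    momentum=0.9,
    weight_decay=1e-5,
)
# The paper decays the LR by 0.97 every 2.4 epochs; approximated
# here by an equivalent per-epoch exponential factor.
scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer, gamma=0.97 ** (1 / 2.4)
)
```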

3. EfficientNet-B0 in Self-Supervised Learning and Knowledge Distillation

EfficientNet-B0's compact footprint makes it a natural candidate for self-supervised learning and knowledge distillation. In the DisCo framework, EfficientNet-B0 acts as the "student" model and is paired with larger ResNet "teacher" models, whose final 128-dimensional embeddings are distilled onto the student via a mean squared error (MSE) penalty. This objective is combined with a contrastive InfoNCE loss:

$$\mathcal{L} = \mathcal{L}_{\rm dis} + \lambda\,\mathcal{L}_{\rm co}$$

where $\lambda = 1$, $\mathcal{L}_{\rm dis}$ matches student and teacher embeddings across two augmentations, and $\mathcal{L}_{\rm co}$ applies contrastive learning (MoCo-v2) (Gao et al., 2021).
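A minimal sketch of this combined objective, assuming precomputed student/teacher embeddings and MoCo-v2-style InfoNCE logits; the tensor names and shapes here are illustrative, not DisCo's actual API.

```python
import torch.nn.functional as F

def disco_loss(student_emb, teacher_emb, contrastive_logits, labels, lam=1.0):
    """Combined objective L = L_dis + lambda * L_co from the DisCo setup.

    student_emb / teacher_emb: (N, 128) projected embeddings of the same
    augmented views; contrastive_logits / labels: MoCo-v2 InfoNCE inputs.
    """
    l_dis = F.mse_loss(student_emb, teacher_emb.detach())  # teacher is frozen
    l_co = F.cross_entropy(contrastive_logits, labels)     # InfoNCE term
    return l_dis + lam * l_co
```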

EfficientNet-B0, with 4.0 million backbone parameters and a 1280-dimensional pooled feature, uses an MLP projection head of 1280 → 2048 → 128 during pre-training. At inference, the MLP is discarded and only the backbone is used. With ResNet-101 as the teacher, B0 achieves 68.9% linear-evaluation accuracy on ImageNet, closely matching the teacher's 69.1% (Gao et al., 2021). The study identifies a "Distilling Bottleneck" effect, in which a narrow projection head limits what the student can absorb; it is remedied by increasing the projection MLP's hidden dimension.
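The projection head itself reduces to a two-layer MLP. A sketch, assuming the standard Linear–ReLU–Linear form used by MoCo-v2-style heads (the choice of nonlinearity is an assumption of this example):

```python
import torch.nn as nn

# Pre-training projection head (discarded at inference); the 1280 -> 2048 -> 128
# dimensions follow the text, with the widened 2048-d hidden layer addressing
# the "Distilling Bottleneck" effect.
projection_head = nn.Sequential(
    nn.Linear(1280, 2048),
    nn.ReLU(inplace=True),
    nn.Linear(2048, 128),
)
```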

4. Multioutput Adaptations for Scientific Property Prediction

EfficientNet-B0 has been adapted for multioutput regression and classification in scientific domains, as exemplified by prediction of lithium manganese nickel oxide (LMNO) crystal properties (Wong et al., 2024). The model's classification head is replaced by four parallel "heads":

  • Regression for bandgap energy ($E_g$): $\hat{y} = W x + b$
  • Regression for energy above the convex hull ($E_{\rm hull}$): $\hat{y} = W x + b$
  • Softmax classification over crystal systems ($C = 7$)
  • Softmax classification over space groups ($C = 19$)

Losses for the four heads are summed:

$$\mathcal{L}_{\rm total} = \mathcal{L}_{\rm MSE}(E_g) + \mathcal{L}_{\rm MSE}(E_{\rm hull}) + \mathcal{L}_{\rm CE}(\text{systems}) + \mathcal{L}_{\rm CE}(\text{space groups})$$

Images (224 × 224 × 3 RGB) are generated from the crystal structures via random rotations, normalized, and passed to the network without handcrafted features. The adapted EfficientNet-B0 achieves $R^2 = 0.9773$ for bandgap, $R^2 = 0.9650$ for hull energy, and classification accuracies of 99.45% (systems) and 99.27% (space groups) (Wong et al., 2024).
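A sketch of such a four-head adaptation on top of torchvision's B0 backbone; the single-linear-layer heads are an assumption of this example, not necessarily Wong et al.'s exact layout.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import efficientnet_b0

class MultiHeadEfficientNet(nn.Module):
    """B0 backbone with four parallel heads for LMNO property prediction."""
    def __init__(self):
        super().__init__()
        backbone = efficientnet_b0(weights=None)
        backbone.classifier = nn.Identity()    # keep the 1280-d pooled feature
        self.backbone = backbone
        self.bandgap = nn.Linear(1280, 1)      # E_g regression
        self.hull = nn.Linear(1280, 1)         # E_hull regression
        self.system = nn.Linear(1280, 7)       # 7 crystal systems
        self.spacegroup = nn.Linear(1280, 19)  # 19 space groups

    def forward(self, x):
        f = self.backbone(x)
        return self.bandgap(f), self.hull(f), self.system(f), self.spacegroup(f)

def total_loss(preds, targets):
    """Sum of the two MSE and two cross-entropy losses over the four heads."""
    eg, eh, sys_logits, sg_logits = preds
    eg_t, eh_t, sys_t, sg_t = targets
    return (F.mse_loss(eg.squeeze(1), eg_t)
            + F.mse_loss(eh.squeeze(1), eh_t)
            + F.cross_entropy(sys_logits, sys_t)
            + F.cross_entropy(sg_logits, sg_t))
```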

5. Interpretability and Saliency Analysis

Gradient-based class saliency maps computed with Keras-vis are used to interpret model attention in the crystal-structure tasks (Wong et al., 2024). The saliency for class $c$ at pixel $p$,

$$\frac{\partial\, \text{score}_c}{\partial\, \text{input}_p},$$

reveals:

  • High saliency in lattice regions occupied by larger ions
  • Distinct attention patterns for different crystal systems (structural similarity indices $\lesssim 0.74$)
  • Focus on overall lattice shape and symmetry axes as key discriminative features

This suggests that EfficientNet-B0’s hierarchical convolutional features capture physically meaningful motifs relevant to domain-specific property prediction.
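The underlying computation can be sketched with plain autograd; this generic implementation stands in for the Keras-vis call and assumes a model that returns a single logits tensor.

```python
import torch

def class_saliency(model, image, class_idx):
    """Gradient of the class score w.r.t. input pixels (vanilla saliency).

    image: (1, 3, 224, 224) normalized input tensor.
    Returns a (224, 224) per-pixel attention map.
    """
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image)[0, class_idx]  # scalar class score
    score.backward()
    # Max of the absolute gradient over color channels gives the map.
    return image.grad.abs().amax(dim=1).squeeze(0)
```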

6. Transferability, Benchmarking, and Efficiency

EfficientNet-B0 exhibits strong transfer-learning performance on multiple datasets with an order of magnitude fewer parameters than prior architectures. The principled joint scaling of architecture dimensions maintains efficiency and accuracy across the B0–B7 model family. Because B0's design and parameterization were selected by multi-objective neural architecture search, subsequent scaling requires no manual per-dimension tuning, preserving efficiency and competitive accuracy (Tan et al., 2019). Its light weight (9.4%–16.3% of the size of the teacher ResNets in the distilled settings above) facilitates deployment in resource-constrained environments.

7. Contexts of Use and Limitations

EfficientNet-B0 serves as a baseline for scalable ConvNet architectures, as a backbone for downstream tasks in scientific imaging, and as a lightweight student in self-supervised and distillation pipelines. While achieving near-teacher-level accuracy in distilled contrastive learning and outperforming non-distilled baselines by large margins (Gao et al., 2021), its compact width and depth can induce bottlenecks in representation learning without appropriate architectural adjustments. A plausible implication is that further improvements for small models require explicit architectural mitigation (e.g., increasing projection head hidden sizes) to maintain information richness.

EfficientNet-B0 continues to influence applications where computation and model size are constrained, and where direct end-to-end feature learning from raw image data is required.
