nnMobileNet++: Hybrid CNN-ViT for Retinal Analysis

Updated 9 December 2025
  • nnMobileNet++ is a hybrid CNN-ViT architecture that integrates dynamic convolution and transformer modules to capture both local details and global context in retinal images.
  • It introduces dynamic snake convolution in Stage 2 to preserve thin vascular structures and boundary continuity during downsampling.
  • Domain-specific self-supervised pretraining using SimMIM and a stage-wise design yield state-of-the-art accuracy with low computational cost on multiple retinal benchmarks.

nnMobileNet++ is a four-stage hybrid convolutional neural network (CNN) and Vision Transformer (ViT) architecture designed for efficient and highly accurate retinal image analysis. Developed to address the limitations of purely convolutional models in capturing long-range dependencies and modeling the complex anatomical structures in retinal images, nnMobileNet++ bridges local and global feature representation via stage-wise integration of dynamic convolutional and transformer modules, underpinned by domain-specific self-supervised pretraining (Li et al., 1 Dec 2025).

1. Architectural Overview

nnMobileNet++ preserves the lightweight and efficient principles of the original nnMobileNet, extending its representational capacity through a hybrid design. The network is organized into four distinct stages, summarized below:

| Stage | Output Resolution | Block Type |
|-------|-------------------|------------|
| Stem  | 224×224 | 3×3 Conv, ReLU6 |
| 1     | 112×112 | IRLB (DW 3×3 + PW 1×1) |
| 2     | 56×56   | IRLB + Dynamic Snake Conv |
| 3     | 28×28   | Conv↓ + ViT (MHSA + FFN) |
| 4     | 14×14   | Conv↓ + ViT (MHSA + FFN) |
| Head  | 1×1     | GlobalPool + Linear FC |

  • Stages 1–2: Employ Inverted Residual Linear Bottleneck (IRLB) blocks with a depthwise 3×3 followed by pointwise 1×1 convolutions, retaining computational efficiency.
  • Dynamic Snake Convolution (DSC): Replaces standard downsampling convolution at the end of Stage 2 to preserve boundary continuity in curvilinear structures such as vessels.
  • Stages 3–4: Transition to transformer-based representation using local convolutions, multi-head self-attention (MHSA), and feed-forward networks (FFN), with feature map dimensions and channel depth adjusted per stage.
  • Final layers: A global pooling operation followed by a fully connected (linear) prediction head.

Total spatial downsampling across the network is ×32, and channels increase progressively by stage (Li et al., 1 Dec 2025).
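
To make the stage-wise layout concrete, the following PyTorch sketch mirrors the table above. Channel widths, block counts, the stride-1 stem, the plain convolution standing in for the dynamic snake downsampling, the addition-based fusion, and the five-class head are illustrative assumptions rather than the authors' released implementation; only the stage ordering, resolutions, and CNN-to-ViT transition follow the description.

```python
# Minimal sketch of the four-stage hybrid layout (illustrative assumptions,
# not the paper's released code).
import torch
import torch.nn as nn

def irlb(c_in, c_out, stride=1, expand=4):
    """Inverted residual linear bottleneck: 1x1 expand -> DW 3x3 -> 1x1 project."""
    hidden = c_in * expand
    return nn.Sequential(
        nn.Conv2d(c_in, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
        nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
        nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
        nn.Conv2d(hidden, c_out, 1, bias=False), nn.BatchNorm2d(c_out),
    )

class HybridBackbone(nn.Module):
    def __init__(self, num_classes=5):                # class count is arbitrary here
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 16, 3, 1, 1, bias=False),
                                  nn.BatchNorm2d(16), nn.ReLU6(inplace=True))
        self.stage1 = irlb(16, 24, stride=2)          # 224 -> 112
        self.stage2 = irlb(24, 32, stride=2)          # 112 -> 56
        self.dsc_down = nn.Conv2d(32, 64, 3, 2, 1)    # 56 -> 28; stand-in for snake conv (Sec. 2)
        self.vit3 = nn.TransformerEncoderLayer(64, 4, dim_feedforward=128,
                                               activation="gelu", batch_first=True, norm_first=True)
        self.down4 = nn.Conv2d(64, 128, 3, 2, 1)      # 28 -> 14
        self.vit4 = nn.TransformerEncoderLayer(128, 8, dim_feedforward=256,
                                               activation="gelu", batch_first=True, norm_first=True)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(128, num_classes))

    def _attend(self, x, block):
        """Tokenize (B, C, H, W) -> (B, HW, C), run the ViT block, reshape back."""
        b, c, h, w = x.shape
        tokens = block(x.flatten(2).transpose(1, 2))
        return tokens.transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x):
        x = self.stage2(self.stage1(self.stem(x)))    # local CNN stages 1-2
        x = self.dsc_down(x)
        x = x + self._attend(x, self.vit3)            # Stage 3: fuse global context by addition
        x = self.down4(x)
        x = x + self._attend(x, self.vit4)            # Stage 4
        return self.head(x)

logits = HybridBackbone()(torch.randn(1, 3, 224, 224))   # -> shape (1, 5)
```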

2. Dynamic Snake Convolution (DSC)

Dynamic Snake Convolution is introduced to address the fragmentation of thin, tortuous vascular features in retinal images often caused by conventional downsampling. At the Stage 2 downsampling point, DSC is applied as follows:

y(x_0) = \sum_{i=1}^{K} w_i \cdot x\left(x_0 + p_i + \Delta p_i\right)

where

  • K: Number of sampling points (e.g., 9 for a 3×3 kernel)
  • w_i: Convolution weights
  • p_i: Fixed grid offsets
  • \Delta p_i: Learnable offsets, predicted from the input feature map

DSC learns deformable, adaptive sampling “snakes” along vessel-like paths, preserving vessel continuity and boundary sharpness with minimal computational overhead. This operation is applied only once, at the critical downsampling transition after Stage 2 (Li et al., 1 Dec 2025).
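
A closely related operation, deformable convolution, can illustrate the learned-offset sampling in the formula above. The sketch below uses torchvision's generic deform_conv2d; it is not the exact snake-constrained offset scheme of the paper, which further restricts the \Delta p_i so that sampling points trace a continuous path along the kernel axes. Channel sizes follow the Stage 2 transition.

```python
# Illustrative learned-offset downsampling in the spirit of the DSC formula,
# built on torchvision's generic deformable convolution (an approximation of,
# not a substitute for, the paper's snake-constrained offsets).
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class OffsetConv2d(nn.Module):
    def __init__(self, c_in, c_out, k=3, stride=2):
        super().__init__()
        self.k, self.stride = k, stride
        # Delta p_i: two (y, x) offsets per sampling point, predicted from the input.
        self.offset_pred = nn.Conv2d(c_in, 2 * k * k, k, stride, k // 2)
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.01)   # w_i

    def forward(self, x):
        offsets = self.offset_pred(x)                 # Δp_i, shape (B, 2K, H', W')
        return deform_conv2d(x, offsets, self.weight,
                             stride=self.stride, padding=self.k // 2)

feat = torch.randn(1, 32, 56, 56)                     # Stage 2 feature map
down = OffsetConv2d(32, 64)(feat)                     # -> (1, 64, 28, 28)
```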

3. Stage-Specific Transformer Modules

Following Stage 2, nnMobileNet++ converts to a transformer-based representation for global context encoding:

  • The feature map X \in \mathbb{R}^{56 \times 56 \times 32} is downsampled to X' \in \mathbb{R}^{28 \times 28 \times 64}.
  • X' is tokenized and positionally embedded to form Z_0 = \mathrm{Flatten}(X') + E_{\mathrm{pos}}.
  • Each transformer block consists of layer normalization, MHSA, and a two-layer MLP with GELU activation and expansion ratio 2:

Z'_\ell = \mathrm{MHSA}(\mathrm{LN}(Z_{\ell-1})) + Z_{\ell-1}, \quad Z_\ell = \mathrm{FFN}(\mathrm{LN}(Z'_\ell)) + Z'_\ell

  • Embedding dimension: d_{\mathrm{model}} = 64 (Stage 3), 128 (Stage 4)
  • Number of attention heads: 4 (Stage 3), 8 (Stage 4)

The transformer output is then reshaped back into a feature map and fused with the local convolution branch via elementwise addition, propagating both local and global cues downstream (Li et al., 1 Dec 2025).
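
The equations above translate directly into a pre-norm block. The sketch below writes out the Stage 3 case (d_model = 64, 4 heads, FFN expansion 2, 28×28 tokens); the zero-initialized learned positional embedding and the addition-based fusion with the local branch are simple assumptions consistent with the description.

```python
# One Stage-3 transformer block following the equations above (assumed
# initialization and fusion details; shapes follow Section 3).
import torch
import torch.nn as nn

class StageViTBlock(nn.Module):
    def __init__(self, dim=64, heads=4, n_tokens=28 * 28):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))          # E_pos (learned)
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(),    # expansion ratio 2
                                 nn.Linear(2 * dim, dim))

    def forward(self, x_local):                        # x_local: (B, 64, 28, 28)
        b, c, h, w = x_local.shape
        z = x_local.flatten(2).transpose(1, 2) + self.pos    # Z_0 = Flatten(X') + E_pos
        y = self.ln1(z)
        z = z + self.mhsa(y, y, y, need_weights=False)[0]    # Z'_l = MHSA(LN(Z_{l-1})) + Z_{l-1}
        z = z + self.ffn(self.ln2(z))                        # Z_l  = FFN(LN(Z'_l)) + Z'_l
        z = z.transpose(1, 2).reshape(b, c, h, w)            # tokens back to a feature map
        return x_local + z                             # elementwise fusion with the local branch

out = StageViTBlock()(torch.randn(2, 64, 28, 28))      # -> (2, 64, 28, 28)
```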

4. Domain-Specific Pretraining via SimMIM

nnMobileNet++ employs retinal-specific self-supervised pretraining using SimMIM masked image modeling on 114,275 UK Biobank fundus images. The pretext task involves:

  • Masking 60% of non-overlapping 32×32 patches in each image.
  • Reconstructing the RGB pixel values at masked locations by minimizing the masked L1 loss:

\mathcal{L}_{\mathrm{SimMIM}} = \frac{1}{|\mathcal{I}_M|} \sum_{i \in \mathcal{I}_M} \|\hat{X}_i - X_i\|_1

where \mathcal{I}_M indexes the masked patches.
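
A minimal sketch of this masked L1 objective, assuming the decoder already produces a full-resolution RGB reconstruction and using a simplified uniform-random 60% patch mask:

```python
# Masked-patch L1 loss in the spirit of the SimMIM objective above
# (simplified masking; the encoder/decoder producing `pred` is omitted).
import torch
import torch.nn.functional as F

def simmim_loss(pred, target, patch=32, mask_ratio=0.6):
    """pred, target: (B, 3, H, W); the loss is averaged over masked pixels only."""
    B, _, H, W = target.shape
    gh, gw = H // patch, W // patch
    mask = (torch.rand(B, 1, gh, gw, device=target.device) < mask_ratio).float()
    mask = F.interpolate(mask, scale_factor=patch, mode="nearest")   # patch grid -> pixel grid
    l1 = (pred - target).abs()
    return (l1 * mask).sum() / (mask.sum() * 3 + 1e-8)

loss = simmim_loss(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
```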

Pretraining is conducted with AdamW (lr=1e-3, weight decay=0.05), cosine decay, batch size 32, 800 epochs, and automatic mixed precision (AMP). A plausible implication is that this regimen imparts strong structural priors tailored to retinal imagery, enhancing out-of-distribution generalization (Li et al., 1 Dec 2025).
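
The optimization setup can be sketched as follows; the backbone, data loading, and any warmup are placeholders or omissions, since only the optimizer, schedule, batch size, epoch count, and AMP are specified in the text.

```python
# Pretraining optimization sketch (AdamW, cosine decay, AMP); model and data
# are placeholders, scheduler details beyond cosine decay are assumptions.
import torch

model = torch.nn.Linear(8, 8)                                  # placeholder for the backbone
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=800)   # 800 epochs
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)           # AMP only when a GPU is present

for epoch in range(800):
    # for images in loader:                                    # batch size 32 in the paper
    optimizer.zero_grad()
    with torch.autocast("cuda" if use_cuda else "cpu", enabled=use_cuda):
        loss = model(torch.randn(32, 8)).pow(2).mean()         # stand-in for the SimMIM loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    break                                                      # demo: a single optimization step
```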

5. Training Protocols and Benchmark Evaluation

The training and evaluation pipeline consists of:

  • Inputs resized to 224×224, with standard fundus normalization.
  • Classification finetuning for 300 epochs with AdamW (lr=1e-3, weight decay=5e-4), a cosine decay schedule, and batch size 32; a configuration sketch follows this list.
  • Baseline and ablation models are trained from scratch using the timm library.
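
The configuration sketch referenced above; the timm model name, the class count, and the normalization statistics are illustrative assumptions ("standard fundus normalization" is not given numerically in the text).

```python
# Finetuning/evaluation configuration sketch (illustrative values where the
# text is unspecific; the baseline model name and class count are assumed).
import timm
import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # assumed stats
])

# Example baseline trained from scratch via timm, as in the ablations.
baseline = timm.create_model("mobilenetv2_100", pretrained=False, num_classes=8)

optimizer = torch.optim.AdamW(baseline.parameters(), lr=1e-3, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)   # 300 epochs
```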

Datasets and tasks include:

  • MuReD (multi-label, 2,208 images)
  • ODIR (multi-class, 7,000 images)
  • MMAC 2023 (myopic maculopathy grading)
  • UWF4DR 2024 (ultra-widefield DR & DME)
  • MuCaRD 2025 (multi-camera robustness, 5-fold CV)

6. Quantitative Results and Ablation

nnMobileNet++ demonstrates state-of-the-art or highly competitive results across public datasets while maintaining low computational cost:

| Model | AUC (MuReD) | F1 (MuReD) | FLOPs (GMac) | Params (M) |
|-------|-------------|------------|--------------|------------|
| nnMobileNet | 0.690 | 0.091 | 0.428 | 3.522 |
| MobileViT | 0.718 | 0.123 | 1.420 | 4.929 |
| Swin-Tiny | 0.714 | 0.218 | 4.371 | 27.500 |
| nnMobileNet++ (scratch) | 0.805 | 0.334 | 2.538 | 2.100 |
| nnMobileNet++ (+SSL) | 0.842 | 0.445 | 2.538 | 2.100 |

| Model | AUC (ODIR) | F1 | Acc. |
|-------|------------|----|------|
| MobileNetV2 | 0.869 | 0.304 | 0.449 |
| Swin-Transformer | 0.856 | 0.277 | 0.442 |
| nnMobileNet++ (scratch) | 0.873 | 0.295 | 0.445 |
| nnMobileNet++ (+SSL) | 0.906 | 0.494 | 0.536 |

On the MMAC 2023 dataset, ablation studies quantify incremental improvements:

  • Adding transformer blocks (ViT): +0.009 AUC, +0.046 F1.
  • Incorporating DSC: +0.007 AUC, +0.039 F1.
  • Self-supervised pretraining: +0.011 AUC, +0.016 F1.

nnMobileNet++ consistently achieves comparable or superior AUC, F1, and accuracy/AUPRC relative to both CNN and pure transformer baselines on five public retinal analysis benchmarks, at under 3 GMac of computation and 2.1M parameters (Li et al., 1 Dec 2025).

7. Significance of Hybrid CNN–ViT Approach in Retinal Analysis

The hybrid architecture of nnMobileNet++ addresses limitations inherent to purely convolutional or transformer-only models for retinal imaging:

  • DSC ensures faithful capture of thin, irregular vessel and lesion boundaries during spatial reduction.
  • MHSA layers in transformer modules encode long-range dependencies critical for detecting distributed retinal features such as microaneurysms or hemorrhages.
  • Self-supervised pretraining on retinal data imparts domain priors and augments robustness to dataset heterogeneity and limited supervision.
  • The stage-wise integration methodology balances local spatial detail and global contextual awareness, accounting for both accuracy and efficiency in resource-constrained settings.

Together, these mechanisms establish nnMobileNet++ as an effective, lightweight framework for clinical retinal image analysis and support broader investigation of hybrid architectures in medical imaging (Li et al., 1 Dec 2025).
