RETFound Transformer in Retinal Imaging

Updated 14 November 2025
  • RETFound Transformer is a family of retinal-domain foundation models built on Vision Transformers that extract robust features from fundus and OCT images.
  • It employs self-supervised pretraining objectives like Masked Autoencoding and self-distillation to learn effective representations from nearly one million unlabelled images.
  • The models achieve high performance in ocular disease detection, dense segmentation, and oculomics, while offering computational efficiency and adaptability for diverse retinal tasks.

RETFound Transformer is a family of retinal-domain foundation models based on Vision Transformers (ViTs) and pre-trained in a self-supervised fashion on large-scale color fundus and optical coherence tomography (OCT) images. RETFound models mark a shift in medical imaging toward foundation-model workflows by enabling label-efficient adaptation (fine-tuning or linear probing) to diverse retinal analysis tasks—including ocular disease detection, systemic disease prediction (oculomics), and, as demonstrated recently, dense segmentation tasks such as optic disc delineation. Their development addresses the scarcity of annotated medical data by leveraging masked autoencoding or self-distillation to learn robust retinal representations from nearly a million unlabelled images.

1. Model Architecture and Variants

RETFound adopts a ViT-Large backbone as per Dosovitskiy et al. (2020), featuring 24 Transformer encoder layers, an embedding dimension $D=1024$, and $H=16$ self-attention heads (each with $d_h=64$). Patch embedding is performed on non-overlapping $16\times16$ input regions, producing $1024$-dimensional tokens. Positional encodings are learned and added to each patch token. The architecture is standardized across RETFound variants; the primary distinction is the pretraining objective:

  • RETFound-MAE: Employs a Masked Autoencoder (MAE) pretext task, replacing 75% of input patches with a mask token and reconstructing them with a lightweight pixel-level decoder.
  • RETFound-DINOv2: Uses self-distillation (DINOv2) with heavy data augmentation, employing a teacher-student architecture to maximize view-invariant representations.

LayerNorm is applied in pre-norm style before the attention and MLP sublayers. The backbone is flexible across downstream tasks: the original classification head (used for ocular/systemic disease detection) can be removed and replaced with task-specific adapters or segmentation decoders.

RETFound-Green is a data- and parameter-efficient variant based on ViT-Small (12 layers, $d=384$, $H=6$); it employs a lightweight token reconstruction pretext objective and four auxiliary register tokens for attention stability. This reduction yields dramatic gains in computational and environmental efficiency without systematic performance loss (Engelmann et al., 30 Apr 2024).
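
For orientation, the sketch below instantiates the two backbone configurations using the generic ViT builders in `timm`; this is only a shape-level illustration, and the actual RETFound weights (and RETFound-Green's register tokens) are distributed as separate checkpoints that are not reproduced here.

```python
import timm

# ViT-Large/16 backbone as used by RETFound-MAE and RETFound-DINOv2:
# 24 encoder layers, embedding dim 1024, 16 heads, 16x16 patches.
# num_classes=0 drops the classification head, leaving a feature extractor
# that can be paired with a linear probe or a segmentation decoder.
retfound_backbone = timm.create_model(
    "vit_large_patch16_224", pretrained=False, num_classes=0
)

# ViT-Small/16 backbone matching the RETFound-Green configuration:
# 12 layers, embedding dim 384, 6 heads.
green_backbone = timm.create_model(
    "vit_small_patch16_224", pretrained=False, num_classes=0
)

print(sum(p.numel() for p in retfound_backbone.parameters()))  # roughly 3e8 parameters
print(sum(p.numel() for p in green_backbone.parameters()))     # roughly 2e7 parameters
```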

2. Pretraining Objectives and Data

RETFound pretraining occurs in two phases:

  • Phase 1: Natural-image pretraining on ImageNet-1k ($\sim$1.4M images) with MAE, forming strong generic visual priors.
  • Phase 2: Retinal-domain pretraining on $\sim$900,000 color fundus photos and $\sim$736,000 OCT volumes (for the main ViT-Large model). During MAE pretraining, 75% of patches are masked, and the network reconstructs pixel values for the missing regions, minimizing mean squared error (a minimal masking sketch follows this list). For RETFound-DINOv2, self-distillation leverages teacher-student consistency via augmented views.
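
As a concrete illustration of the MAE masking step, the sketch below performs standard per-sample random masking (keep 25% of patch tokens, reconstruct the rest). It follows the generic MAE recipe rather than the exact RETFound training code, and the tensor names are illustrative.

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Randomly drop a fraction of patch tokens, as in MAE pretraining.

    tokens: (batch, num_patches, dim) patch embeddings (no [CLS] token).
    Returns the kept tokens, the binary mask (1 = masked), and the indices
    needed to restore the original patch order for the decoder.
    """
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=tokens.device)   # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation of patches
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation

    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))

    mask = torch.ones(B, N, device=tokens.device)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # align mask with original patch order

    return kept, mask, ids_restore

# Example: 196 patches from a 224x224 image with 16x16 patches, 75% masked.
x = torch.randn(2, 196, 1024)
kept, mask, ids_restore = random_masking(x)
print(kept.shape, mask.sum(dim=1))  # torch.Size([2, 49, 1024]); 147 patches masked per image
```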

RETFound-Green is trained solely on 75,000 public fundus images, employing Gaussian pixel noise and patch erasure as corruption, with a token-level quadratic reconstruction loss $\mathcal{L}_{\rm rec} = \frac{1}{N_{\rm tok}\,d}\,\|\gamma\odot h(z')-z_0\|_2^2$, where $z_0$ and $z'$ are the clean and corrupted backbone outputs, $h(\cdot)$ is a projection head, and $\gamma$ is a learned gating vector.
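
A minimal sketch of this reconstruction objective, assuming the clean and corrupted backbone outputs are available as token tensors; the linear projection head and gating vector shown here are illustrative stand-ins, not the released implementation.

```python
import torch
import torch.nn as nn

class TokenReconstructionLoss(nn.Module):
    """Gated, token-level quadratic reconstruction loss (RETFound-Green style)."""

    def __init__(self, dim: int = 384):
        super().__init__()
        self.proj = nn.Linear(dim, dim)            # projection head h(.)
        self.gate = nn.Parameter(torch.ones(dim))  # learned gating vector gamma

    def forward(self, z_corrupted: torch.Tensor, z_clean: torch.Tensor) -> torch.Tensor:
        # z_*: (batch, num_tokens, dim) backbone outputs for corrupted / clean views
        diff = self.gate * self.proj(z_corrupted) - z_clean
        n_tok, dim = z_clean.shape[1], z_clean.shape[2]
        # Squared L2 norm per sample, normalized by N_tok * d, averaged over the batch
        return (diff.pow(2).sum(dim=(1, 2)) / (n_tok * dim)).mean()

# Example with illustrative shapes (ViT-Small: 384-dim tokens)
loss_fn = TokenReconstructionLoss(dim=384)
z0 = torch.randn(2, 196, 384)   # clean backbone output (targets, typically detached)
zc = torch.randn(2, 196, 384)   # corrupted-view backbone output
print(loss_fn(zc, z0.detach()).item())
```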

This large-scale self-supervision endows the ViT backbone with domain-specific recognition of retinal features—vessel branching, foveal structure, lesion morphologies—without requiring explicit annotation.

3. Adaptation Strategies for Downstream Tasks

For classification (disease detection), RETFound feeds the final [CLS] token into a lightweight, single-layer MLP head. In full fine-tuning all backbone weights are updated; in linear probing only the classifier is trained. Key adaptation protocols:

  • Fine-tuning (classification tasks): All ViT parameters are updated. Data augmentations include random flipping, rotation ($\pm$30°), and scaling/zoom, with images resized to $224\times224$ or $256\times256$.
  • Linear probing: Only the final FC head is trained, offering extreme sample and compute efficiency (see the sketch after this list).
  • Segmentation (e.g., optic disc delineation): RETFound is frozen as an encoder; the classification head is dropped. A two-block Mask Transformer (adapted from Segmenter [Strudel et al., 2021]) serves as the decoder, using one mask token per class. The decoder jointly attends over patch and mask tokens, outputting scalar products per class, followed by bilinear upsampling and softmax to $224\times224$ pixel masks.
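
A minimal linear-probing sketch under the assumptions above (frozen backbone, single fully connected head on the [CLS]/pooled embedding); `backbone` stands for any loaded RETFound encoder that returns a feature vector per image, and the training loop itself is omitted.

```python
import torch
import torch.nn as nn

def build_linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int) -> nn.Module:
    """Freeze the RETFound encoder and train only a single FC classifier."""
    for p in backbone.parameters():
        p.requires_grad = False              # encoder stays frozen

    head = nn.Linear(feat_dim, num_classes)  # the only trainable parameters

    class LinearProbe(nn.Module):
        def __init__(self):
            super().__init__()
            self.backbone, self.head = backbone, head

        def forward(self, x):
            with torch.no_grad():
                feats = self.backbone(x)     # assumed to return the pooled/[CLS] embedding
            return self.head(feats)

    return LinearProbe()

# Usage sketch: ViT-Large features (1024-dim), binary disease detection.
# model = build_linear_probe(retfound_backbone, feat_dim=1024, num_classes=2)
# optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-3)
```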

Losses for segmentation aggregate Dice and binary cross-entropy terms,

$$L_{\rm total} = L_{\rm Dice} + L_{\rm BCE},$$

where

$$L_{\rm Dice} = 1 - \frac{2\,TP}{2\,TP + FP + FN}, \qquad L_{\rm BCE} = -\frac{1}{N}\sum_i \big[\,y_i\log\hat y_i + (1 - y_i)\log(1-\hat y_i)\,\big].$$
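
A compact sketch of this combined objective for a binary mask (e.g., optic disc vs. background), using a soft Dice term so the loss is differentiable; the smoothing constant is an assumption for numerical stability, not a value reported in the papers.

```python
import torch
import torch.nn.functional as F

def dice_bce_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Combined soft-Dice + binary cross-entropy segmentation loss.

    logits, target: (batch, H, W); target contains {0, 1} pixel labels.
    """
    probs = torch.sigmoid(logits)

    # Soft Dice: 1 - 2*|P ∩ G| / (|P| + |G|), computed per image
    inter = (probs * target).sum(dim=(1, 2))
    denom = probs.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)

    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="mean")
    return dice.mean() + bce

# Example on random 224x224 masks
logits = torch.randn(4, 224, 224)
target = (torch.rand(4, 224, 224) > 0.5).float()
print(dice_bce_loss(logits, target).item())
```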

Hyperparameters for fine-tuning include the Adam optimizer, learning rates in the $10^{-3}$ to $10^{-4}$ range, batch sizes of 4–32, and early stopping on validation-set performance. For segmentation, a fixed learning rate (no decay schedule) is resource-efficient and avoids “grokking” delays.
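
An illustrative fine-tuning loop reflecting these settings (Adam, fixed learning rate, early stopping on a validation metric); the patience value and the `evaluate_auc` callback are assumptions for the sketch, not reported training code.

```python
import copy
import torch

def finetune(model, train_loader, val_loader, evaluate_auc,
             epochs=100, lr=1e-4, patience=10, device="cuda"):
    """Full fine-tuning with a fixed learning rate and early stopping."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # no LR decay schedule
    criterion = torch.nn.CrossEntropyLoss()

    best_auc, best_state, stale = 0.0, None, 0
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

        auc = evaluate_auc(model, val_loader)   # user-supplied validation metric
        if auc > best_auc:
            best_auc, best_state, stale = auc, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:               # early stopping on validation performance
                break

    if best_state is not None:
        model.load_state_dict(best_state)
    return model, best_auc
```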

4. Quantitative Performance and Comparative Analysis

RETFound-based systems exhibit high data efficiency, competitive accuracy, and robust generalization properties across diverse tasks.

Optic Disc Segmentation (Zhao et al., 15 Aug 2025): RETFound with a 2-block Mask Transformer decoder, trained with as few as 50–100 cases, achieves Dice coefficients of $\sim$96% on five datasets (GoDARTS, IDRID, Drishti-GS, RIM-ONE-r3, REFUGE), matching or exceeding task-specific CNN and Transformer segmentation baselines. For domain generalization (leave-one-out across datasets), spatial augmentation yields an average Dice of 95.53%, outperforming top baselines (DOFE 92.57%, TVConv 93.87%). For domain adaptation, RETFound equals or outperforms S-CUDA, ISFA, and ECSD-Net across source-target pairs.

Ocular/Systemic Disease Detection (Yew et al., 21 Jan 2025, Hou et al., 10 Feb 2025, Zhou et al., 3 Sep 2025): With ample labeled data, RETFound and strong traditional supervised models (ResNet50, ViT-base, SwinV2) achieve statistically indistinguishable AUCs (0.92–0.97) for ocular tasks. In extreme low-data regimes, RETFound achieves AUC gains (+0.1–0.2) for diabetic retinopathy and glaucoma over ResNet50. For systemic disease predictions (heart failure, myocardial infarction, stroke), RETFound's advantage is marked: e.g., heart failure AUROC 0.796 vs. DINOv2-Large 0.767 (p < 0.001).

Head-to-head with large natural-image foundation models (DINOv2/DINOv3): On retinal disease detection, DINOv2-Large slightly outperforms RETFound, especially on diabetic retinopathy and multi-class tasks, but RETFound surpasses all DINOv2 and DINOv3 models for oculomic endpoints and maintains superior label efficiency for systemic prediction (heart failure: AlzEye AUROC 0.796 vs. DINOv2 0.771; UK Biobank 0.674 vs. 0.615–0.623).

RETFound-Green, despite being $\sim$14× smaller and requiring $\sim$400× less compute in pretraining, outperformed prior retinal foundation models in the majority of tasks across datasets from Brazil, India, and China (Engelmann et al., 30 Apr 2024).

5. Computational Footprint and Practical Considerations

RETFound-ViT-Large models impose GPU memory requirements similar to other ViT-Large architectures: $\sim$60–66 GiB for batch size 24 during fine-tuning (classification task), with inference throughput of $\sim$61 imgs/s (compared to 101 imgs/s for ResNet50 and 70 imgs/s for SwinV2). Linear probes require fewer than 1,000 trainable parameters, resulting in minimal memory use and fast runtime.

RETFound-Green materially reduces these demands: 83.8 MB model size (vs. 1.12 GB), feature extraction at 16 img/s (vs. 6 img/s in baseline), embedding storage at 14.6 GB per million images (vs. 39.1 GB), and a pretraining carbon footprint of 0.4 kg CO2e (vs. 81–234 kg for prior models).

Thus, in resource-constrained environments or eco-sensitive deployments, RETFound-Green presents an attractive trade-off.

6. Insights, Domain Implications, and Future Directions

RETFound demonstrates that large-scale self-supervised pre-training in the retinal domain enables strong label and adaptation efficiency, especially when task labels are sparse or when fine-grained features (e.g., subtle vascular pathology) are required for systemic prediction. Ablation studies confirm that spatial augmentations (matching the pretraining augmentations), the combined BCE+Dice loss, and fixed learning-rate schedules are optimal for segmentation adapters.

Comparative studies with generalist vision foundation models (DINOv2/DINOv3) indicate that the specialist advantage narrows as pretraining scale rises, but, as of 2025, RETFound-DINOv2 embeddings still yield lower intra-class feature similarity and higher data efficiency for oculomics. Hybrid strategies that combine generalist pretraining with short domain-specific adaptation may synthesize broad transfer learning with retinal specificity.

Future research directions include extending RETFound and its adapters to further retinal tasks (vessel segmentation, lesion detection), scaling multimodal pretraining (image + EHR), and open benchmarking for real-world deployment beyond fundus/OCT, as well as more sustainable, compute-efficient training objectives as exemplified by RETFound-Green.
