
RETFound Models in Ophthalmic Imaging

Updated 4 December 2025
  • RETFound models are a family of domain-specific retinal vision foundation models based on ViT architectures using self-supervised techniques such as masked autoencoding and self-distillation.
  • They leverage millions of fundus and OCT images to achieve robust performance in ocular disease detection, anatomical segmentation, and oculomics with high label efficiency.
  • Comparative evaluations reveal nuanced trade-offs in performance, efficiency, and resource demands versus generalist models, highlighting benefits in domains like fine-grained DR grading and multimodal fusion.

RETFound models are a family of large, self-supervised retinal vision foundation models designed to serve as universal backbones for diverse ophthalmic imaging tasks, including eye-specific disease detection, oculomics (systemic disease prediction), and anatomical structure segmentation. Built on the vision transformer (ViT) architecture and pretrained on millions of unlabeled fundus or OCT images, RETFound models pursue generalizable representations of retinal structure through generative masked image modeling, with recent variants incorporating self-distillation. Domain-specific pretraining and large parameter counts characterize RETFound’s principal design strategy relative to generalist, natural-image FMs. However, empirical evaluations reveal nuanced trade-offs in performance, efficiency, and resource requirements compared to compact generalist models and alternative pretraining regimes.

1. Model Architecture and Pretraining Paradigms

RETFound models are based on ViT architectures at varying scales: predominantly ViT-Large (ViT-L/16, ~303–307 million parameters, 24 encoder blocks, patch size 16×16, embedding d=1024) and some deployments with ViT-Base (12 blocks, d=768) or ViT-Small (12 blocks, d=384). The canonical pretraining objective is masked autoencoding (MAE), wherein a high proportion (typically 75–80%) of input image patches are masked and a lightweight transformer decoder reconstructs pixel values of the masked regions. The mean squared error loss over masked patches is minimized:

$$\mathcal{L}_{\mathrm{MAE}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \lVert x_i - \hat{x}_i \rVert_2^2$$

where $\mathcal{M}$ indexes the masked patches (Engelmann et al., 30 Apr 2024; Zhao et al., 15 Aug 2025; Isztl et al., 27 Nov 2025).
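
A minimal PyTorch sketch of this objective (shapes and names are illustrative, not the released RETFound code):

```python
import torch

def mae_loss(pred, target, mask):
    """MSE reconstruction loss computed only over masked patches.

    pred, target: (B, N, P) predicted and ground-truth pixel values per patch
    mask:         (B, N) binary, 1 where a patch was masked, 0 where visible
    """
    per_patch = ((pred - target) ** 2).mean(dim=-1)   # (B, N) per-patch MSE
    return (per_patch * mask).sum() / mask.sum()      # average over masked patches only
```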

Recent expansions include the use of self-distillation (DINOv2) objectives on retinal data (RETFound-DINOv2), in which a student ViT is trained to match the output distribution of a teacher ViT via Kullback–Leibler divergence on multiple augmented crops:

$$\mathcal{L}_{\mathrm{DINO}} = \mathbb{E}_{x}\,\mathrm{KL}\bigl(\sigma(z_{\mathrm{teacher}}(x)/T)\,\Vert\,\sigma(z_{\mathrm{student}}(\tilde{x})/T)\bigr)$$

Alternative data- and compute-efficient variants have emerged: RETFound-Green employs a ViT-S backbone and a token reconstruction loss, matching the high-level, projected DINOv2 token embeddings of a frozen reference encoder using only 75,000 images and orders of magnitude less compute than prior MAE-based variants. This eliminates the generative MAE decoder and instead reconstructs latent representations (Engelmann et al., 30 Apr 2024).
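
A schematic sketch of such a latent token-reconstruction objective (the exact projection and loss follow Engelmann et al.; the module and dimension names here are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenReconstructionLoss(nn.Module):
    """Match student patch tokens to tokens of a frozen reference encoder."""

    def __init__(self, d_student, d_teacher):
        super().__init__()
        # Hypothetical linear head projecting student tokens into the teacher's space.
        self.proj = nn.Linear(d_student, d_teacher)

    def forward(self, student_tokens, teacher_tokens):
        # student_tokens: (B, N, d_student) from the ViT-S backbone being trained
        # teacher_tokens: (B, N, d_teacher) from a frozen DINOv2 reference encoder
        target = teacher_tokens.detach()        # no gradient flows into the teacher
        pred = self.proj(student_tokens)
        return F.mse_loss(pred, target)         # reconstruct latents, not pixels
```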

Pretraining data exclusively comprises unlabeled retinal images, typically color fundus photographs (~900,000–1,600,000) and/or OCT B-scans (~730,000). In some pipelines, a preceding ImageNet-1k stage is employed, but primary emphasis is on retina-specific signal. No synthetic or GAN-generated data are used in RETFound’s original formulation (Zhao et al., 15 Aug 2025; Hou et al., 10 Feb 2025).

2. Task Adaptation: Fine-Tuning, Linear Probing, and Segmentation

Downstream tasks leverage the pretrained ViT encoder in several modes:

  • Supervised Fine-Tuning (SFT): All encoder and head weights are adapted via cross-entropy (classification) or combined Dice-binary cross-entropy loss (segmentation) (Zhao et al., 15 Aug 2025; Arellano et al., 8 Oct 2025).
  • Linear Probing (LP): The RETFound encoder is frozen; a regularized linear classifier is fit to CLS token embeddings or pooled features. Multiple regularization strengths and optional PCA are explored (Arellano et al., 8 Oct 2025); a minimal sketch follows this list.
  • Segmentation: For structural segmentation (e.g., optic disc), the MAE decoder is replaced with a “mask transformer” segmentation head (e.g., 2 transformer blocks plus mask tokens), with only the decoder head trained, achieving high Dice coefficients (>95%) even with <100 labeled images (Zhao et al., 15 Aug 2025).
  • Multimodal Fusion: In multimodal systems (e.g., HyMNet), RETFound’s features are combined (post-linear projection) with demographic vectors, with subsequent fully connected fusion heads trained end-to-end (Baharoon et al., 2023).
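
A minimal linear-probing sketch in the spirit of the LP mode above (the backbone is a generic timm ViT-L stand-in rather than an actual RETFound checkpoint, and `train_loader` / `val_loader` are assumed DataLoaders of retinal images with binary labels):

```python
import numpy as np
import torch
import timm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-in backbone: ViT-L/16 with the classification head removed; in practice
# the weights would be loaded from a RETFound checkpoint instead.
encoder = timm.create_model("vit_large_patch16_224", pretrained=False, num_classes=0).eval()

@torch.no_grad()
def extract_features(loader):
    feats, labels = [], []
    for images, y in loader:            # images: (B, 3, 224, 224)
        z = encoder(images)             # (B, 1024) pooled/CLS embedding
        feats.append(z.numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

X_train, y_train = extract_features(train_loader)
X_val, y_val = extract_features(val_loader)
clf = LogisticRegression(max_iter=2000, C=1.0).fit(X_train, y_train)  # C tuned on validation
print("val AUROC:", roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))
```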

For classification, common data augmentations include rotation, flipping, color jitter, and Gaussian blur. Augmentation strength and strategies are tuned to maximize AUC-PR on validation data (Arellano et al., 8 Oct 2025). Fine-tuning typically employs Adam or AdamW optimizers, with learning rates selected by grid search or adaptive schedules and batch sizes ranging from 4 (for high-resolution OCT) to 32.
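
A hedged sketch of a fine-tuning configuration in this spirit (hyperparameters and the timm stand-in backbone are illustrative, not values reported in any specific paper):

```python
import torch
import timm
from torch import nn
from torchvision import transforms

# Stand-in for a RETFound ViT-L/16 backbone with a new 5-class DR grading head.
model = timm.create_model("vit_large_patch16_224", pretrained=False, num_classes=5)

# Augmentations of the kind described above: rotation, flips, color jitter, blur.
train_tf = transforms.Compose([
    transforms.RandomRotation(15),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 1.0)),
    transforms.ToTensor(),
])

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One supervised fine-tuning step updating all encoder and head weights."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```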

3. Comparative Evaluation: Conventional Models, Generalist FMs, and RETFound

Direct empirical comparisons have been conducted across a spectrum of benchmarks:

  • Ocular Disease Detection (DR, DME, Glaucoma):
    • On full and moderate-size datasets, RETFound’s performance (AUC, F1) is generally equivalent to or slightly below that of ImageNet-pretrained baselines (EfficientNet-B0, ResNet50, SwinV2) and general-purpose ViTs for most ocular diseases (AUC differences 0.0–1.5 points), except under severe class imbalance or fine-grained DR grading, where RETFound’s pretraining yields a modest benefit (+1.5 pp in 5-class DR) (Isztl et al., 27 Nov 2025; Arellano et al., 8 Oct 2025; Yew et al., 21 Jan 2025).
    • Linear probing on frozen RETFound features, without fine-tuning, is always markedly inferior (AUC-PR typically <0.8, AUC-ROC <0.9 for DME detection) (Arellano et al., 8 Oct 2025).
  • Systemic Disease Prediction (Oculomics):
    • RETFound surpasses generalist models and CNNs in predicting incident heart failure, myocardial infarction, and stroke (AUROC gains +0.05–0.08), especially with limited labeled data (≤400 images) (Yew et al., 21 Jan 2025; Hou et al., 10 Feb 2025).
    • The retina-specific representations confer label efficiency and calibration robustness for subtle, diffuse systemic signals.
  • Segmentation:
    • With only decoder head training and tens–hundreds of labeled images, RETFound achieves ~96% Dice (optic disc), outperforming or matching state-of-the-art segmentation networks even in domain-generalization and adaptation settings (Zhao et al., 15 Aug 2025).
  • Generalist FMs (DINOv2, DINOv3):
    • Specialist RETFound-DINOv2 models consistently outperform DINOv2-ViT-giant and DINOv3-ViT-large generalist FMs on fine-tuned ocular and systemic disease tasks, with mean AUROC improvements of +0.014 to +0.030 (Zhou et al., 3 Sep 2025). However, the performance difference is narrowing as generalist FMs scale in data and parameters.

4. Efficiency, Compute, and Environmental Metrics

RETFound models’ efficiency and computational burden vary significantly by variant:

| Model | Params (M) | Pretrain Data (# images) | Pretrain Compute (A100-days) | File Size (GB) | Inference Speed (img/s) |
|---|---|---|---|---|---|
| RETFound-MEH | 303 | 904,170 | 112 | 1.12 | 6 |
| DERETFound | 303 | 150,786 (+1M synthetic) | 163 | 1.12 | 6 |
| RETFound-Green | 22 | 75,000 | 0.27 | 0.09 | 16 |

RETFound-Green can be pretrained in full for under $100 of compute, emits an estimated 0.39 kg CO₂e, and produces smaller, faster embeddings. Despite this efficiency, linear probes on RETFound-Green outperform or match the much larger prior FMs in most downstream settings, particularly for fine-grained DR grading (Engelmann et al., 30 Apr 2024).

In downstream transfer, inference speed and storage can be limiting for the standard RETFound (ViT-L), with >10× greater resource requirements than compact models (SwinV2-tiny, ConvNeXtV2-tiny). Only for the most challenging tasks does the higher resource demand become justified (Isztl et al., 27 Nov 2025).
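
The resource gap can be illustrated with a rough parameter-count and throughput comparison of the two backbone scales (generic timm ViTs as stand-ins for the actual checkpoints; absolute numbers depend on hardware):

```python
import time
import torch
import timm

for name in ["vit_large_patch16_224", "vit_small_patch16_224"]:
    model = timm.create_model(name, pretrained=False, num_classes=0).eval()
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(8, 3, 224, 224)
    with torch.no_grad():
        model(x)                                  # warm-up pass
        start = time.time()
        for _ in range(5):
            model(x)
    throughput = 5 * x.shape[0] / (time.time() - start)
    print(f"{name}: {n_params:.0f}M params, ~{throughput:.1f} img/s on this machine")
```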

5. Limitations, Adaptation Strategies, and Prospective Improvements

  • Task-specific Gaps: RETFound underperforms CNN baselines in detecting subtle, localized pathologies such as DME when trained with the generic MAE objective, likely due to the lack of locality and translation equivariance and the absence of explicit lesion-centric pretext tasks (Arellano et al., 8 Oct 2025).
  • Label Efficiency Ceiling: Linear probing of RETFound alone is insufficient for discriminative tasks; extensive fine-tuning or adoption of adaptation modules (adapters, prompt tuning) is mandatory in low-data regimes (Arellano et al., 8 Oct 2025; Turkan et al., 7 Nov 2025); a minimal adapter sketch follows this list.
  • Generalist vs. Specialist FM Boundaries: DINOv2- or DINOv3-based generalist FMs approach RETFound’s performance in classical ophthalmic classification, occasionally exceeding it in DR/glaucoma, but RETFound remains superior for oculomics and highly subtle systemic tasks (Hou et al., 10 Feb 2025; Zhou et al., 3 Sep 2025).
  • Vision–Language Integration: Architectures such as RetFiner demonstrate that appending vision–language heads and cross-modal losses (ITC, ITM, MLM, GM) to RETFound produces semantically richer and more robust OCT representations, yielding 5.8 pp gains in balanced accuracy on downstream linear probes (Fecso et al., 27 Jun 2025). Pooling strategies concatenating CLS and patch-averaged features maximize discriminability.
  • Segmentation Extensions: RETFound’s strong segmentation adaptation with parameter-efficient decoder heads suggests similar strategies can generalize to vessels, multilabel, and multimodal settings, under minimal annotation (Zhao et al., 15 Aug 2025).
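
The parameter-efficient adaptation route mentioned in the label-efficiency item above can be sketched as a bottleneck adapter inserted after each frozen transformer block (a generic adapter design, not the specific module used in the cited works; `dim=1024` assumes the ViT-L embedding width):

```python
import torch
from torch import nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""

    def __init__(self, dim=1024, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)        # start as identity so training is stable
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                     # x: (B, N, dim) token sequence
        return x + self.up(self.act(self.down(x)))

# Usage idea: freeze the RETFound encoder, wrap each block's output with an Adapter,
# and train only the adapters plus the task head (a small fraction of the ~300M weights).
```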

6. Impact, Clinical Utility, and Future Directions

RETFound catalyzed the diffusion of domain-specific foundation models in ophthalmology, establishing that self-supervised ViT-Large models pretrained on millions of images can drive state-of-the-art performance for both structural and diagnostic tasks—especially in limited-label, cross-domain, or systemic prediction contexts (Yew et al., 21 Jan 2025; Shi et al., 18 May 2024). RETFound’s representation is robust to domain shifts and annotation sparseness, and the lack of synthetic data enhances clinician trust.

Recent developments highlight the importance of optimal trade-offs between domain specificity, model scale, computational cost, and cross-task generalizability. Derivatives like RETFound-Green render foundation models more equitable and sustainable by reducing data and compute barriers (Engelmann et al., 30 Apr 2024), while RETFound’s success in multimodal and segmentation transfer motivates future research into multimodal, cross-attentional self-supervision and hierarchical vision-structure fusion. In parallel, the narrowing gap between large-scale generalist FMs and specialist models signals a possible convergence in optimal FM design for future medical AI, pending further investigation on scale, cross-modal representation, and task coverage (Zhou et al., 3 Sep 2025; Shi et al., 18 May 2024).
