
EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis (2409.06644v2)

Published 10 Sep 2024 in cs.CV and cs.AI

Abstract: Early detection of eye diseases like glaucoma, macular degeneration, and diabetic retinopathy is crucial for preventing vision loss. While AI foundation models hold significant promise for addressing these challenges, existing ophthalmic foundation models primarily focus on a single modality, whereas diagnosing eye diseases requires multiple modalities. A critical yet often overlooked aspect is harnessing the multi-view information across various modalities for the same patient. Additionally, due to the long-tail nature of ophthalmic diseases, standard fully supervised or unsupervised learning approaches often struggle. Therefore, it is essential to integrate clinical text to capture a broader spectrum of diseases. We propose EyeCLIP, a visual-language foundation model developed using over 2.77 million multi-modal ophthalmology images with partial text data. To fully leverage the large multi-modal unlabeled and labeled data, we introduced a pretraining strategy that combines self-supervised reconstructions, multi-modal image contrastive learning, and image-text contrastive learning to learn a shared representation of multiple modalities. Through evaluation using 14 benchmark datasets, EyeCLIP can be transferred to a wide range of downstream tasks involving ocular and systemic diseases, achieving state-of-the-art performance in disease classification, visual question answering, and cross-modal retrieval. EyeCLIP represents a significant advancement over previous methods, especially showcasing few-shot, even zero-shot capabilities in real-world long-tail scenarios.

Summary

  • The paper introduces EyeCLIP, a visual-language foundation model that integrates 2.77M multi-modal images and 11,180 reports for enhanced ophthalmic diagnosis.
  • It employs self-supervised reconstruction and contrastive learning techniques to achieve state-of-the-art performance across 14 benchmark datasets, excelling in zero-shot and few-shot settings.
  • The model extends its capabilities to predict systemic diseases and supports tasks like VQA and cross-modal retrieval, highlighting its broad clinical and research potential.

EyeCLIP: A Visual-Language Foundation Model for Multi-Modal Ophthalmic Image Analysis

Abstract

The paper presents EyeCLIP, a visual-language foundation model specifically designed for ophthalmic image analysis. The model is built on a large-scale dataset of 2.77 million multi-modal images and 11,180 reports from 128,554 patients, integrating visual and textual data for improved diagnosis of eye and systemic diseases. The paper showcases EyeCLIP's proficiency in downstream tasks such as disease classification, visual question answering (VQA), and cross-modal retrieval across 14 benchmark datasets. The experimental results show that EyeCLIP achieves state-of-the-art performance, outperforming existing models and excelling particularly in zero-shot and few-shot learning scenarios.

Introduction

Ophthalmic diseases like glaucoma, macular degeneration, and diabetic retinopathy significantly affect global vision health, often leading to severe impairment or blindness. The uneven distribution of medical resources exacerbates the problem, particularly in underserved regions. AI foundation models have the potential to mitigate these issues through automatic analysis of ophthalmic images, aiding timely diagnosis and treatment. Existing models, such as RETFound, primarily focus on a single modality, which restricts their real-world applicability. EyeCLIP addresses this limitation by learning a shared representation across multiple modalities, including color fundus photography (CFP), optical coherence tomography (OCT), fundus fluorescein angiography (FFA), and fundus autofluorescence (FAF), while incorporating ophthalmic text data to extend the model's diagnostic capabilities.

Methodology

EyeCLIP's pretraining combines self-supervised reconstruction, multi-modal image contrastive learning, and image-text contrastive learning. These objectives enable the model to learn from a diverse, partially labeled dataset and to capture the multi-view information needed for accurate diagnosis. The model uses a CLIP-based framework with an additional image decoder akin to Masked Autoencoders (MAE), enhancing its capacity to exploit large amounts of unlabeled data. The training strategy aligns multiple modalities and integrates textual descriptions, further refining the model's understanding and applicability across clinical scenarios.
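
The following is a minimal PyTorch-style sketch of how the three pretraining objectives described above could be combined. The `mae`, `image_encoder`, and `text_encoder` modules, the batch keys, and the loss weights are illustrative placeholders, not the authors' released implementation.

```python
# Hypothetical sketch of EyeCLIP-style joint pretraining losses.
# All module names, batch keys, and weights are placeholders.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (B x B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = positive pairs
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def pretraining_loss(batch, mae, image_encoder, text_encoder,
                     w_rec=1.0, w_img=1.0, w_txt=1.0):
    # 1) Self-supervised reconstruction: an MAE-style module assumed to return
    #    its own masked-patch reconstruction loss.
    loss_rec = mae(batch["view_a"])

    # 2) Multi-modal image contrastive loss: two examinations of the same
    #    patient (e.g. a CFP image and an FFA frame) form a positive pair.
    z_a = image_encoder(batch["view_a"])
    z_b = image_encoder(batch["view_b"])
    loss_img = info_nce(z_a, z_b)

    # 3) Image-text contrastive loss, applied only to images that have reports.
    loss_txt = z_a.new_zeros(())
    if batch.get("report_tokens") is not None:
        z_t = text_encoder(batch["report_tokens"])
        loss_txt = info_nce(z_a, z_t)

    return w_rec * loss_rec + w_img * loss_img + w_txt * loss_txt
```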

Results

Zero-shot and Few-shot Classification

EyeCLIP demonstrated significant superiority in zero-shot classification tasks across multiple datasets. For example, in diagnosing ophthalmic diseases from CFP images, EyeCLIP's AUROC ranged from 0.681 to 0.757 for diabetic retinopathy and glaucoma, outperforming other models with statistical significance (P<0.001). Similarly strong performance was observed on OCT images, where EyeCLIP achieved AUROCs of 0.800 on the OCTID dataset and 0.776 on the OCTDL dataset. In few-shot classification, EyeCLIP maintained high data efficiency, surpassing other models even with minimal training samples, as evidenced by its performance on the Retina Image Bank subset of rare diseases.
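
As a concrete illustration of how CLIP-style zero-shot classification works, the sketch below scores an image against text prompts built from candidate diagnoses; the prompt template, class names, and encoder/tokenizer interfaces are assumptions rather than the paper's exact prompts.

```python
# Hypothetical sketch of zero-shot disease classification via image-text similarity.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_probs(image, class_names, image_encoder, text_encoder, tokenizer,
                    template="a fundus photograph showing {}"):
    # One text embedding per candidate diagnosis.
    prompts = [template.format(name) for name in class_names]
    text_emb = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)

    # Embed the image and score it against every class prompt.
    img_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)
    logits = 100.0 * img_emb @ text_emb.t()      # temperature-scaled similarities
    return logits.softmax(dim=-1).squeeze(0)     # one probability per class

# Illustrative usage:
# probs = zero_shot_probs(img, ["diabetic retinopathy", "glaucoma", "a normal fundus"],
#                         image_encoder, text_encoder, tokenizer)
```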

Full-data Fine-tuning

In the full-data supervised training paradigm, EyeCLIP outperformed competing models across the majority of single and multi-modality tasks. For example, in multi-modality classification on the AngioReport dataset, EyeCLIP achieved an AUROC of 0.721 compared to the next best model's 0.705 (P<0.001). The model's ability to generalize well on large, diverse datasets like Retina Image Bank underscores its potential for real-world clinical applications.
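
A minimal sketch of the full-data fine-tuning setup, assuming a pretrained image encoder whose pooled embedding feeds a linear classification head; the embedding dimension, learning rate, and module names are illustrative choices, not details reported in the paper.

```python
# Hypothetical sketch of supervised fine-tuning on top of a pretrained encoder.
import torch
import torch.nn as nn

class FineTunedClassifier(nn.Module):
    def __init__(self, image_encoder, embed_dim=768, num_classes=5):
        super().__init__()
        self.encoder = image_encoder              # pretrained EyeCLIP-style encoder
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, images):
        return self.head(self.encoder(images))   # class logits

# Illustrative training step:
# model = FineTunedClassifier(image_encoder)
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# loss = nn.CrossEntropyLoss()(model(images), labels)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```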

Systemic Disease Prediction

EyeCLIP extended its capabilities to predicting systemic diseases from ophthalmic images. On UK Biobank data, the model achieved AUROC scores of 0.641, 0.536, 0.580, and 0.596 for stroke, dementia, Parkinson's disease, and myocardial infarction, respectively, surpassing other models (P<0.05). This highlights the model's potential beyond ophthalmology in systemic disease prediction.

Cross-modal Retrieval and VQA

EyeCLIP's architecture facilitated zero-shot cross-modal retrieval, a feature crucial for biomedical applications. The model outperformed BioMedCLIP, achieving higher recall on both the AngioReport and Retina Image Bank datasets. In VQA, EyeCLIP paired with Llama2-7b achieved top metrics on the OphthalVQA dataset without task-specific vision-language alignment, indicating robust generalization ability.
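
To make the retrieval evaluation concrete, the sketch below ranks report embeddings by cosine similarity to each query image and computes Recall@K; it assumes precomputed, row-aligned image and text embeddings and is not the paper's evaluation code.

```python
# Hypothetical sketch of zero-shot image-to-text retrieval evaluated with Recall@K.
import torch
import torch.nn.functional as F

@torch.no_grad()
def recall_at_k(image_emb, text_emb, k=5):
    """image_emb[i] and text_emb[i] are assumed to be a matched image-report pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sims = image_emb @ text_emb.t()                      # (N images x N reports)
    topk = sims.topk(k, dim=-1).indices                  # indices of top-K reports
    targets = torch.arange(image_emb.size(0), device=sims.device).unsqueeze(1)
    hits = (topk == targets).any(dim=-1)                 # true pair within the top K?
    return hits.float().mean().item()
```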

Discussion

EyeCLIP offers significant improvements over existing ophthalmic foundation models by integrating multi-modal data from multiple examination types. Unlike conventional models that focus on a single examination type, EyeCLIP's comprehensive approach is better suited to the varied and complex presentations encountered in clinical settings. The model's integration of clinical text enhances its diagnostic capabilities, especially in zero-shot scenarios, making it highly valuable in resource-limited contexts.

EyeCLIP's advancements in systemic disease prediction emphasize the expanding role of ophthalmic imaging in general health monitoring. The model's proficiency in retrieving and interpreting data across modalities is a significant step towards more automated and precise medical care.

Limitations and Future Work

Despite its strengths, EyeCLIP's performance is tied to the quality and scope of its training data. Future work should focus on diversifying the training datasets to include broader demographic and clinical variations. Additionally, standardizing textual data in clinical reports could further enhance model performance. Practical and ethical considerations will be crucial for real-world deployment, including ensuring model transparency and interpretability for healthcare providers.

Conclusion

EyeCLIP represents a significant contribution to ophthalmic AI by effectively utilizing visual and textual data for comprehensive multi-modal image analysis. Its remarkable performance across various ophthalmic and systemic disease tasks highlights its potential as an invaluable tool in clinical practice and medical research. The methodologies developed in EyeCLIP offer valuable insights that can inform the creation of foundation models in other medical fields, extending the potential benefits of AI in healthcare.