- The paper introduces EyeCLIP, a visual-language foundation model that integrates 2.77M multi-modal images and 11,180 reports for enhanced ophthalmic diagnosis.
- It employs self-supervised reconstruction and contrastive learning techniques to achieve state-of-the-art performance across 14 benchmark datasets, excelling in zero-shot and few-shot settings.
- The model extends its capabilities to predict systemic diseases and supports tasks like VQA and cross-modal retrieval, highlighting its broad clinical and research potential.
EyeCLIP: A Visual–Language Foundation Model for Multi-Modal Ophthalmic Image Analysis
Abstract
The paper presents EyeCLIP, a visual-language foundation model specifically designed for ophthalmic image analysis. The model is built on a large-scale dataset of 2.77 million multi-modal images and 11,180 reports from 128,554 patients, integrating visual and textual data for improved diagnosis of eye and systemic diseases. The paper showcases EyeCLIP's proficiency in downstream tasks such as disease classification, visual question answering (VQA), and cross-modal retrieval across 14 benchmark datasets. Experimental results show that EyeCLIP achieves state-of-the-art performance, outperforming existing models and excelling particularly in zero-shot and few-shot learning scenarios.
Introduction
Ophthalmic diseases such as glaucoma, macular degeneration, and diabetic retinopathy significantly affect global vision health, often leading to severe impairment or blindness. The uneven distribution of medical resources exacerbates the problem, particularly in underserved regions. AI models, especially foundation models, have the potential to mitigate these issues through automated analysis of ophthalmic images, aiding timely diagnosis and treatment. Existing models, such as RETFound, primarily focus on a single modality, which restricts their real-world applicability. EyeCLIP addresses this limitation by learning a shared representation across multiple modalities, including color fundus photography (CFP), optical coherence tomography (OCT), fundus fluorescein angiography (FFA), and fundus autofluorescence (FAF), while incorporating ophthalmic text data to extend the model's diagnostic capabilities.
Methodology
EyeCLIP's training combines self-supervised reconstruction, multi-modal image contrastive learning, and image-text contrastive learning. Together, these objectives let the model learn from a diverse and only partially labeled dataset, capturing the multi-view information needed for accurate diagnosis. The model uses a CLIP-based framework augmented with an image decoder akin to Masked Autoencoders (MAE), enhancing its capacity to exploit large amounts of unlabeled data. The training strategy aligns multiple modalities and integrates textual descriptions, further refining the model's understanding and applicability across clinical scenarios.
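The paper does not include source code, but a minimal PyTorch sketch can illustrate how these three training signals might be combined in one step. The encoder/decoder interfaces, pooling, masking ratio, and equal loss weights below are placeholder assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a combined training step: MAE-style reconstruction,
# multi-modal image contrastive learning, and image-text contrastive learning.
import torch
import torch.nn.functional as F


def patchify(imgs: torch.Tensor, p: int = 16) -> torch.Tensor:
    """Split (B, C, H, W) images into flattened patches of shape (B, N, p*p*C)."""
    B, C, H, W = imgs.shape
    x = imgs.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of embeddings (B, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def training_step(image_encoder, image_decoder, text_encoder,
                  cfp, ffa, tokens, mask_ratio: float = 0.75):
    """One hypothetical step combining the three objectives.

    cfp, ffa : two examinations of the same patient, shape (B, C, H, W)
    tokens   : tokenized report text, shape (B, L)
    The encoders/decoder are assumed to follow a ViT/MAE-style interface and
    to share the same embedding dimension (projection heads omitted here).
    """
    # 1) Self-supervised reconstruction (MAE-style): mask patches, encode, decode.
    latent, mask = image_encoder(cfp, mask_ratio=mask_ratio)   # mask: (B, N) bool
    recon = image_decoder(latent)                              # (B, N, p*p*C)
    loss_recon = F.mse_loss(recon[mask], patchify(cfp)[mask])

    # 2) Multi-modal image contrastive loss: pull together embeddings of
    #    different examinations from the same patient.
    z_cfp = image_encoder(cfp, mask_ratio=0.0)[0].mean(dim=1)  # (B, D) pooled tokens
    z_ffa = image_encoder(ffa, mask_ratio=0.0)[0].mean(dim=1)
    loss_img = info_nce(z_cfp, z_ffa)

    # 3) Image-text contrastive loss against the clinical report (CLIP-style).
    z_txt = text_encoder(tokens)                               # (B, D)
    loss_txt = info_nce(z_cfp, z_txt)

    # Equal weighting of the three losses is an assumption, not a paper value.
    return loss_recon + loss_img + loss_txt
```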
Results
Zero-shot and Few-shot Classification
EyeCLIP demonstrated clear superiority in zero-shot classification across multiple datasets. For example, in diagnosing ophthalmic diseases from CFP images, EyeCLIP's AUROC ranged from 0.681 to 0.757 for diabetic retinopathy and glaucoma, outperforming other models with statistical significance (P<0.001). Similarly strong performance was observed with OCT images, where EyeCLIP achieved AUROCs of 0.800 on the OCTID dataset and 0.776 on OCTDL. In few-shot classification, EyeCLIP remained highly data-efficient, surpassing other models even with minimal training samples, as evidenced by its performance on the Retina Image Bank subset for rare diseases.
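As a rough illustration of the zero-shot setup, the sketch below scores an image against text prompts built from class names, CLIP-style. The `image_encoder`, `text_encoder`, `tokenize`, prompt wording, and logit scale are hypothetical stand-ins rather than EyeCLIP's actual components.

```python
# Hypothetical CLIP-style zero-shot classification: no task-specific training,
# classes are described by text prompts and ranked by cosine similarity.
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_predict(image, class_names, image_encoder, text_encoder, tokenize):
    """Return class probabilities for one image given only class-name prompts."""
    prompts = [f"a fundus photograph showing {name}" for name in class_names]
    text_emb = F.normalize(text_encoder(tokenize(prompts)), dim=-1)   # (K, D)
    img_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)  # (1, D)
    logits = 100.0 * img_emb @ text_emb.t()                           # scaled cosine similarity
    return logits.softmax(dim=-1).squeeze(0)                          # (K,) probabilities
```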
Full-data Fine-tuning
In the full-data supervised training paradigm, EyeCLIP outperformed competing models across the majority of single and multi-modality tasks. For example, in multi-modality classification on the AngioReport dataset, EyeCLIP achieved an AUROC of 0.721 compared to the next best model's 0.705 (P<0.001). The model's ability to generalize well on large, diverse datasets like Retina Image Bank underscores its potential for real-world clinical applications.
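A minimal sketch of this supervised paradigm, assuming a linear classification head on top of the pretrained image encoder and standard cross-entropy training; the head design, hyperparameters, and unfreezing policy are assumptions rather than the paper's recipe.

```python
# Hypothetical full-data fine-tuning: pretrained encoder + linear head,
# trained end to end with cross-entropy on a labeled downstream dataset.
import torch
import torch.nn as nn


def build_finetune_model(image_encoder: nn.Module, embed_dim: int, num_classes: int) -> nn.Module:
    """Wrap a pretrained encoder (assumed to output (B, embed_dim)) with a task head."""
    return nn.Sequential(image_encoder, nn.Linear(embed_dim, num_classes))


def finetune_epoch(model: nn.Module, loader, optimizer, device: str = "cuda") -> None:
    """Run one supervised epoch over the labeled downstream dataset."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```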
Systemic Disease Prediction
EyeCLIP extended its capabilities to predicting systemic diseases from ophthalmic images. Using UK Biobank data, the model achieved AUROC scores of 0.641, 0.536, 0.580, and 0.596 for stroke, dementia, Parkinson's disease, and myocardial infarction, respectively, surpassing other models (P<0.05). This highlights the model's potential beyond ophthalmology in systemic disease prediction.
Cross-modal Retrieval and VQA
EyeCLIP's architecture supports zero-shot cross-modal retrieval, a capability valuable for biomedical applications. The model outperformed BioMedCLIP, achieving higher recall on both the AngioReport and Retina Image Bank datasets. For VQA, EyeCLIP paired with Llama2-7b achieved the top metrics on the OphthalVQA dataset without specialized alignment training, indicating robust generalization ability.
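For the retrieval side, a common way to report such results is Recall@K over paired image and report embeddings. The sketch below illustrates that evaluation under the assumption that row i of each embedding matrix forms a matched image-report pair; it is not the paper's evaluation code.

```python
# Hypothetical Recall@K evaluation for zero-shot image-to-text retrieval:
# rank all candidate reports by cosine similarity to each query image.
import torch
import torch.nn.functional as F


@torch.no_grad()
def recall_at_k(image_emb: torch.Tensor, text_emb: torch.Tensor, k: int = 5) -> float:
    """image_emb: (N, D), text_emb: (N, D); row i of each is assumed to be a matched pair."""
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()  # (N, N)
    topk = sims.topk(k, dim=-1).indices                                        # (N, k) ranked candidates
    targets = torch.arange(image_emb.size(0), device=image_emb.device).unsqueeze(1)
    return (topk == targets).any(dim=-1).float().mean().item()                 # fraction of hits in top k
```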
Discussion
EyeCLIP offers significant improvements over existing ophthalmic foundation models by integrating multiple imaging modalities and examination types. Unlike conventional models that focus on a specific examination type, EyeCLIP's comprehensive approach is better suited to the varied and complex presentations encountered in clinical settings. The model's integration of clinical text enhances its diagnostic capabilities, especially in zero-shot scenarios, making it highly valuable in resource-limited contexts.
EyeCLIP's advancements in systemic disease prediction emphasize the expanding role of ophthalmic imaging in general health monitoring. The model's proficiency in retrieving and interpreting data across modalities is a significant step towards more automated and precise medical care.
Limitations and Future Work
Despite its strengths, EyeCLIP's performance is tied to the quality and scope of its training data. Future work should focus on diversifying the training datasets to include broader demographic and clinical variations. Additionally, standardizing textual data in clinical reports could further enhance model performance. Practical and ethical considerations will be crucial for real-world deployment, including ensuring model transparency and interpretability for healthcare providers.
Conclusion
EyeCLIP represents a significant contribution to ophthalmic AI by effectively utilizing visual and textual data for comprehensive multi-modal image analysis. Its remarkable performance across various ophthalmic and systemic disease tasks highlights its potential as an invaluable tool in clinical practice and medical research. The methodologies developed in EyeCLIP offer valuable insights that can inform the creation of foundation models in other medical fields, extending the potential benefits of AI in healthcare.