The paper "RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-LLMs" introduces RetinalGPT, a multimodal conversational assistant designed for quantitative analysis of retinal images using multimodal large language models (MLLMs). The paper addresses the limitations of general-domain MLLMs on specialized tasks such as retinal image interpretation, which is crucial for diagnosing ocular diseases. The authors highlight the gap between general-domain and medical-domain MLLMs and propose RetinalGPT to bridge it by strengthening retinal disease diagnosis capabilities.
Key Contributions:
- Retinal-Specific Dataset and Pipeline:
  - The paper details the creation of a large, diverse dataset of approximately 38,000 retinal images, enriched with disease labels, lesion bounding boxes, and vascular features. The data pipeline extracts clinical data using tools such as AutoMorph for fractal analysis of retinal vascular structures, assigning clinically meaningful features to each image.
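AutoMorph's actual pipeline is more involved, but the kind of fractal analysis it performs on vascular structures can be illustrated with a standard box-counting estimate on a binary vessel mask. This is a sketch under our own assumptions; the function name and box sizes below are illustrative and not AutoMorph's API.

```python
import numpy as np

def box_counting_dimension(mask: np.ndarray, box_sizes=(2, 4, 8, 16, 32)) -> float:
    """Estimate the fractal dimension of a binary vessel mask by box counting.

    Fits log N(s) against log(1/s), where N(s) is the number of s x s boxes
    that contain at least one foreground (vessel) pixel; the slope of the
    fit is the box-counting dimension.
    """
    counts = []
    for s in box_sizes:
        # Crop so the image tiles exactly into s x s boxes.
        h, w = (mask.shape[0] // s) * s, (mask.shape[1] // s) * s
        tiles = mask[:h, :w].reshape(h // s, s, w // s, s)
        # Count boxes that contain any vessel pixel.
        counts.append(int(tiles.any(axis=(1, 3)).sum()))
    slope, _ = np.polyfit(np.log(1.0 / np.array(box_sizes)), np.log(counts), 1)
    return float(slope)

# Toy example: a straight vessel-like line has dimension close to 1.
mask = np.zeros((64, 64), dtype=bool)
mask[32, :] = True
print(round(box_counting_dimension(mask), 2))  # → 1.0
```

A filled region would instead yield a dimension near 2; real vascular trees fall in between, which is what makes the measure clinically descriptive.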
- Instruction Tuning and Training:
  - RetinalGPT uses a customized visual instruction tuning method to enhance its retinal analysis capabilities. A two-stage training strategy aligns generic-domain VLMs to retinal-domain tasks while preserving broader biomedical knowledge:
    - Stage 1 (Feature Alignment): a mixup of retinal-specific and general biomedical datasets is used to tune the model, maintaining generic medical-domain knowledge.
    - Stage 2 (Mixup Instruction-Tuning): fine-tuning on a mixed dataset that combines retinal-specific instruction data with generic medical data retains general medical understanding alongside retinal-specific capabilities.
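The dataset mixup in both stages can be sketched as a simple batch sampler that draws from the two pools at a fixed ratio. The 70/30 ratio, names, and batch size below are illustrative assumptions; the paper's exact mixing proportions are not reproduced here.

```python
import random

def build_mixed_batch(retinal_data, generic_data, retinal_ratio=0.7,
                      batch_size=8, rng=None):
    """Sample one instruction-tuning batch mixing retinal-specific and
    generic medical examples at a fixed (hypothetical) ratio."""
    rng = rng or random.Random(0)
    n_retinal = round(batch_size * retinal_ratio)
    batch = (rng.sample(retinal_data, n_retinal)
             + rng.sample(generic_data, batch_size - n_retinal))
    rng.shuffle(batch)  # interleave the two sources within the batch
    return batch

# Usage with toy instruction examples tagged by source.
retinal = [{"source": "retinal", "id": i} for i in range(20)]
generic = [{"source": "generic", "id": i} for i in range(20)]
batch = build_mixed_batch(retinal, generic)
print(sum(ex["source"] == "retinal" for ex in batch))  # → 6 of 8
```

Keeping a constant share of generic examples in every batch is one straightforward way to guard against catastrophic forgetting of general biomedical knowledge while specializing on retinal data.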
- Performance and Evaluation:
  - RetinalGPT is evaluated against several state-of-the-art models on eight benchmark datasets covering multiple ophthalmic diseases, demonstrating superior performance in disease diagnosis, lesion localization, and vascular structure analysis.
  - Results show that RetinalGPT localizes lesions successfully, predicting bounding boxes that closely match ground-truth annotations, and accurately estimates vascular feature values, validating the precision of its analysis.
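Agreement between predicted and ground-truth lesion boxes is conventionally scored with intersection-over-union (IoU). The paper's exact evaluation protocol may differ; this is a minimal sketch of the standard metric.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle (empty if boxes don't overlap).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes overlapping in a 5x5 corner: 25 / 175.
print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))  # → 0.143
```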
- Generalization to Generic Medical Domain:
  - When tested on generic medical questions, RetinalGPT's responses remain similar to those of LLaVA-Med, indicating that it preserves knowledge beyond the retinal domain and suggesting broad applicability in medical imaging contexts.
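Response similarity between two models can be approximated crudely with token-set overlap. The Jaccard measure below is an illustrative stand-in only; the paper's actual similarity metric is not specified in this summary.

```python
def response_similarity(resp_a: str, resp_b: str) -> float:
    """Jaccard similarity over lowercase token sets -- a rough,
    hypothetical proxy for agreement between two model responses."""
    a, b = set(resp_a.lower().split()), set(resp_b.lower().split())
    # Two empty responses are considered identical.
    return len(a & b) / len(a | b) if a | b else 1.0

print(response_similarity("No lesions are visible.",
                          "no lesions are visible."))  # → 1.0
```

In practice, embedding-based measures (e.g. cosine similarity of sentence embeddings) are more robust to paraphrase than token overlap, at the cost of requiring an encoder model.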
Conclusion:
RetinalGPT marks a significant advance in retinal image analysis, leveraging large-scale multimodal models to improve both the quantitative and interpretative dimensions of clinical diagnostics. It stands out for integrating broad biomedical domain knowledge with focused retinal expertise to support detailed, interpretable end-to-end clinical frameworks. The authors note a limitation regarding modality-centric initial responses and plan to address it in future work to improve conversational dynamics.