- The paper introduces MM-Retinal V2, a new multimodal dataset of CFP, FFA, and OCT image-text pairs with expert-verified annotations for fundus analysis.
- It presents KeepFIT V2, a pretraining model that fuses contrastive and generative learning to effectively inject hybrid image-text knowledge.
- Experimental evaluations across zero-shot, few-shot, and linear probing settings demonstrate competitive performance on multiple ophthalmology benchmarks.
Overview of "MM-Retinal V2: Transfer an Elite Knowledge Spark into Fundus Vision-Language Pretraining"
The paper "MM-Retinal V2: Transfer an Elite Knowledge Spark into Fundus Vision-Language Pretraining" proposes significant advancements in the pretraining of vision-LLMs (VLPs) for fundus image analysis, addressing the limitations inherent in current methods that depend on large-scale, private image-text data. The authors introduce MM-Retinal V2, a high-quality image-text paired dataset that incorporates multiple imaging modalities relevant to fundus analysis, namely color fundus photography (CFP), fundus fluorescein angiography (FFA), and optical coherence tomography (OCT).
Data Acquisition and Dataset Construction
MM-Retinal V2 stands out for providing approximately 5,000 image-text pairs for each of the CFP, FFA, and OCT modalities, covering more than 96 fundus diseases and abnormalities. The pairs are curated through a semi-automated pipeline for the CFP and FFA modalities and through expert assessment for the OCT modality, ensuring high-quality annotations. The dataset is complemented by MM-Retinal-Text, a corpus of aggregated ophthalmology-related texts that strengthens the text encoder's grasp of domain-specific vocabulary and context.
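To make the data layout concrete, below is a minimal PyTorch-style loading sketch for one modality's image-text pairs. The manifest format (a JSON list of `{"image": ..., "caption": ...}` records) and field names are hypothetical choices for illustration, not the released dataset's actual structure.

```python
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class FundusImageTextDataset(Dataset):
    """Minimal image-text pair dataset for one fundus modality (CFP, FFA, or OCT).

    Assumes a JSON manifest of {"image": <relative path>, "caption": <text>} records;
    this layout is illustrative only.
    """

    def __init__(self, manifest_path, image_root, transform=None, tokenizer=None):
        self.records = json.loads(Path(manifest_path).read_text())
        self.image_root = Path(image_root)
        self.transform = transform    # e.g. resize/normalize for the image encoder
        self.tokenizer = tokenizer    # e.g. a domain-adapted text tokenizer

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        record = self.records[idx]
        image = Image.open(self.image_root / record["image"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        caption = record["caption"]
        if self.tokenizer is not None:
            caption = self.tokenizer(caption)
        return image, caption
```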
KeepFIT V2: A Novel Fundus Vision-Language Pretraining Model
The authors introduce KeepFIT V2, a fundus vision-language pretraining model that combines the MM-Retinal V2 dataset with public categorical (label-only) datasets through a novel knowledge transfer approach. The pretraining methodology has two key components:
- Preliminary Textual Knowledge Pretraining: The text encoder is first enriched with extensive medical knowledge by pretraining on MM-Retinal-Text, improving its handling of complex medical terminology.
- Hybrid Image-Text Knowledge Injection: KeepFIT V2 injects knowledge through a complementary scheme that combines high-level semantic representations from contrastive learning with detailed appearance features from generative learning. This dual extraction supports comprehensive representation learning and effective knowledge transfer from MM-Retinal V2 to the public datasets (a minimal sketch of such a hybrid objective follows this list).
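The following sketch illustrates the general recipe of pairing a CLIP-style contrastive alignment term with a generative (captioning) term. The weighting, architecture, and exact formulation used by KeepFIT V2 may differ; this is only an assumption-laden illustration of the hybrid objective described above.

```python
import torch
import torch.nn.functional as F


def hybrid_pretraining_loss(image_feats, text_feats, caption_logits, caption_targets,
                            temperature=0.07, generative_weight=1.0):
    """Illustrative hybrid objective: contrastive alignment plus a generative term.

    Not the exact KeepFIT V2 loss; a sketch of the contrastive + generative recipe.
    """
    # Contrastive term: pull matched image/text embeddings together,
    # push mismatched pairs apart (symmetric InfoNCE).
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2

    # Generative term: token-level cross-entropy for reconstructing the report,
    # encouraging fine-grained appearance details beyond global semantics.
    generative = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1),
        ignore_index=-100,  # skip padded token positions
    )
    return contrastive + generative_weight * generative
```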
Experimental Evaluations and Comparative Analysis
Extensive experiments across multiple settings (zero-shot, few-shot, and linear probing) demonstrate KeepFIT V2's competitive performance against state-of-the-art VLP models. Evaluated on benchmarks including REFUGE, ODIR200×3, Retina, and iChallenge-AMD for the CFP modality, as well as datasets specific to the FFA and OCT modalities, KeepFIT V2 consistently matches or outperforms models pretrained on significantly larger private datasets.
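For readers unfamiliar with the zero-shot protocol, the sketch below shows the generic CLIP-style evaluation: each class name is turned into a text prompt, both prompts and images are embedded, and the most similar prompt determines the prediction. The `encode_image` / `encode_text` methods and the prompt template are assumed interfaces for illustration, not the exact KeepFIT V2 API.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_classify(model, images, class_names, prompt="a fundus photograph of {}"):
    """Generic CLIP-style zero-shot classification (illustrative interface)."""
    text_prompts = [prompt.format(name) for name in class_names]
    text_feats = F.normalize(model.encode_text(text_prompts), dim=-1)   # (C, D)
    image_feats = F.normalize(model.encode_image(images), dim=-1)       # (B, D)
    similarity = image_feats @ text_feats.t()                           # (B, C)
    return similarity.argmax(dim=-1)  # predicted class index per image
```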
Implications and Future Directions
The implications of this research are substantial in both theoretical and practical realms. The integration of smaller, high-quality datasets into the training regime of fundus VLPs exemplifies an effective strategy for overcoming data scarcity issues commonly encountered in medical image analysis. Moreover, by publicly releasing the MM-Retinal V2 dataset and the pre-trained models, the authors contribute valuable resources to the research community.
This work opens up possibilities for future developments, such as extending similar pretraining strategies to other domains of medical imaging where large-scale image-text paired data may not be readily available. The hybrid knowledge injection framework may also be adapted or expanded to leverage new forms of medical data, such as wearable health-monitoring imagery or multi-organ diagnostic datasets.
In conclusion, this paper advances the field of fundus VLPs by introducing a data-efficient approach that does not rely on resources from large-scale proprietary datasets, thereby democratizing the availability and application of advanced diagnostic tools in ophthalmology and potentially beyond.