
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

Published 18 Oct 2022 in cs.CV and cs.CL | (2210.10163v1)

Abstract: Existing vision-text contrastive learning like CLIP aims to match the paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude below the general images and captions from the internet. Moreover, previous methods encounter many false negatives, i.e., images and reports from separate patients probably carry the same semantics but are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning thus scaling the usable training data in a combinatorial magnitude with low cost. We also propose to replace the InfoNCE loss with semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We prove that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Surprisingly, we observe that with only 20K pre-training data, MedCLIP wins over the state-of-the-art method (using around 200K data). Our code is available at https://github.com/RyanWangZf/MedCLIP.

Citations (295)

Summary

  • The paper introduces a decoupled contrastive learning framework that replaces InfoNCE with a semantic matching loss to tackle unpaired medical datasets.
  • It leverages state-of-the-art vision and text encoders, including Swin Transformer and BioClinicalBERT, for effective multimodal semantic alignment.
  • Experiments on datasets like CheXpert and MIMIC-CXR show over a 10% accuracy improvement in zero-shot classification using only 10% of the data.


Introduction

The paper "MedCLIP: Contrastive Learning from Unpaired Medical Images and Text" addresses the challenges of using contrastive learning in the medical domain, where the availability of paired image-text data is significantly limited compared to general-domain datasets. Traditional methods like CLIP are limited by their dependence on large amounts of such paired data. In contrast, MedCLIP introduces a novel approach that decouples images and texts, leveraging combinatorial data expansion and medical semantics to improve data efficiency and model performance.

Challenges and Novel Approaches

Medical image-text corpora are small, and contrastive training on them produces many false negatives. The paper identifies two key challenges:

  1. Data Insufficiency: The scarcity of paired medical image-text data poses a significant hurdle. MedCLIP effectively addresses this by decoupling images and texts to utilize vast unpaired datasets, scaling the training data to a combinatorial magnitude.
  2. Semantic Mismatch and False Negatives: Contrastive objectives such as InfoNCE treat semantically related but unpaired samples as negatives, introducing false negatives. MedCLIP replaces InfoNCE with a semantic matching loss, grounded in medical domain knowledge, to align image and text semantics accurately.

    Figure 1: Illustration of challenges in medical image-text contrastive learning such as ignored datasets and false negatives.
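The data-scaling claim behind decoupling is simple arithmetic: once images and reports no longer need to come from the same patient, any of the n images can be supervised against any of the m reports through shared medical-entity labels, yielding up to n × m training pairs instead of the original paired count. The numbers below are purely illustrative, not the paper's dataset sizes:

```python
# Illustrative counts only: decoupling turns the originally paired samples
# into n * m candidate image-text combinations supervised by shared labels.
n_images, n_reports, n_paired = 20_000, 20_000, 20_000
candidate_pairs = n_images * n_reports
expansion = candidate_pairs // n_paired
print(expansion)  # 20000x more candidate supervision signals
```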

Methodology

MedCLIP's architecture consists of a vision and text encoder system that utilizes domain-specific knowledge for semantic alignment between data sources.

  • Vision and Text Encoders: MedCLIP uses a Swin Transformer as its vision encoder and BioClinicalBERT as its text encoder, producing embeddings suited to medical images and clinical reports.
  • Semantic Matching Loss: This loss uses medical knowledge to build a semantic similarity matrix between independently encoded images and texts, and trains against these soft targets rather than one-hot pairings, reducing falsely classified negatives.

    Figure 2: Workflow of MedCLIP highlighting knowledge extraction, semantic similarity matrix construction, and embedding alignment.
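The semantic matching loss can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes multi-hot medical-entity labels (e.g., the 14 CheXpert findings) have already been extracted from images and reports, builds a row-normalized semantic similarity matrix as soft targets, and computes cross-entropy against the model's predicted image-text similarities in place of InfoNCE's one-hot diagonal targets.

```python
import numpy as np

def soft_semantic_targets(img_labels, txt_labels, eps=1e-8):
    """Build soft targets from medical-entity labels.

    img_labels: (N, K) multi-hot findings per image
    txt_labels: (M, K) multi-hot findings per report
    Returns an (N, M) row-normalized semantic similarity matrix.
    """
    # Cosine similarity between label vectors approximates semantic overlap.
    a = img_labels / (np.linalg.norm(img_labels, axis=1, keepdims=True) + eps)
    b = txt_labels / (np.linalg.norm(txt_labels, axis=1, keepdims=True) + eps)
    sim = a @ b.T
    return sim / (sim.sum(axis=1, keepdims=True) + eps)

def semantic_matching_loss(img_emb, txt_emb, targets, temperature=0.07):
    """Cross-entropy between predicted image-text similarities and the soft
    semantic targets (instead of InfoNCE's one-hot diagonal targets)."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()
```

Because semantically matching but unpaired samples receive nonzero target mass, they are no longer penalized as negatives, which is the mechanism by which false negatives are eliminated.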

Experiments and Results

Extensive experimentation across multiple datasets (CheXpert, MIMIC-CXR) demonstrates MedCLIP's superiority in several key metrics:

  • Zero-shot Classification: MedCLIP achieved an average accuracy improvement of over 10% compared to baseline models using only 10% of the data used by competitors, demonstrating exceptional data efficiency and semantic transferability.
  • Fine-tuning and Transferability: In downstream fine-tuning tasks, MedCLIP maintained strong performance, remaining comparable to state-of-the-art methods even in zero-shot settings.

    Figure 3: Comparison of MedCLIP's zero-shot performance against baseline models using different volumes of pre-training data.
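Zero-shot prediction follows the CLIP recipe: each class is written as a clinical prompt (e.g., "chest X-ray showing pneumonia"), encoded with the text encoder, and each image is assigned the class whose prompt embedding is most similar. The sketch below uses placeholder embedding arrays standing in for the actual encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_embs, prompt_embs, class_names):
    """Assign each image the class whose prompt embedding is most similar.

    image_embs:  (N, D) image embeddings from the vision encoder
    prompt_embs: (C, D) text embeddings of one prompt per class
    Returns a list of N predicted class names.
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    scores = img @ txt.T  # cosine similarities between images and prompts
    return [class_names[i] for i in scores.argmax(axis=1)]
```

In practice the prompt embeddings would come from BioClinicalBERT and the image embeddings from the Swin Transformer; no labeled images are needed at inference time.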

Implications and Future Directions

MedCLIP’s approach facilitates broader application of multimodal learning in the medical domain, showcasing notable improvements in data efficiency and semantic understanding. By effectively harnessing unpaired datasets and medical knowledge, MedCLIP sets a foundation for future innovations in medical AI, suggesting potential for further exploration in automated diagnosis and cross-domain knowledge applications.

Conclusion

MedCLIP successfully introduces a framework for decoupled contrastive learning, overcoming traditional limitations in medical image-text pre-training. Its innovative use of domain semantics and efficient data utilization positions MedCLIP as a strategic advancement in medical AI, facilitating more semantic-rich and resource-efficient model training. Future works could optimize semantic processing and further expand on this foundational work.
