Contrastive Learning of Medical Visual Representations from Paired Images and Text
The paper presents ConVIRT, an approach for learning medical visual representations through contrastive learning on paired medical images and their descriptive textual reports. The method addresses a central challenge in medical image understanding: annotated data, which is essential for training robust machine learning models, is scarce and costly to obtain in healthcare.
Methodology
ConVIRT leverages naturally occurring paired data, such as chest X-rays and associated radiology reports. The approach maximizes agreement between image representations and their corresponding text descriptors through a bidirectional contrastive objective. Several key aspects distinguish ConVIRT:
- Bidirectional Contrastive Learning: The method contrasts paired images and reports, encouraging the embeddings of true pairs to agree in both the image-to-text and text-to-image directions (a minimal loss sketch follows this list). This differs from image-only contrastive methods, which show limited benefit because medical images exhibit high inter-class similarity.
- Architecture: The framework uses a ResNet50 image encoder and a BERT-based text encoder, making the approach domain-agnostic and free of additional expert input requirements.
- Data Efficiency: A notable claim of ConVIRT is that it requires only 10% of the labeled training data to match or exceed the performance of models initialized with ImageNet pretraining.
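The bidirectional objective can be made concrete with a short sketch. The following PyTorch code is a minimal illustration, assuming the image and text features have already passed through the ResNet50 and BERT encoders and their projection heads; the function name, temperature, and loss-weighting value are illustrative choices rather than the authors' exact settings.

```python
# Minimal sketch of a ConVIRT-style bidirectional contrastive loss.
# Hyperparameter values here are illustrative, not the paper's exact settings.
import torch
import torch.nn.functional as F


def bidirectional_contrastive_loss(image_emb, text_emb, temperature=0.1, lambda_weight=0.75):
    """image_emb, text_emb: (batch, dim) outputs of the projection heads."""
    # L2-normalize so dot products equal cosine similarities.
    v = F.normalize(image_emb, dim=-1)
    u = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities scaled by the temperature.
    logits = v @ u.t() / temperature                      # (batch, batch)
    targets = torch.arange(v.size(0), device=v.device)    # matching pairs lie on the diagonal

    # Image-to-text direction: each image should match its own report.
    loss_v2u = F.cross_entropy(logits, targets)
    # Text-to-image direction: each report should match its own image.
    loss_u2v = F.cross_entropy(logits.t(), targets)

    return lambda_weight * loss_v2u + (1 - lambda_weight) * loss_u2v


# Toy usage: random features standing in for encoder + projection outputs.
if __name__ == "__main__":
    image_emb = torch.randn(8, 512)
    text_emb = torch.randn(8, 512)
    print(bidirectional_contrastive_loss(image_emb, text_emb))
```

Cross-entropy over cosine-similarity logits recovers an InfoNCE-style loss in each direction, and the weight balances the image-to-text and text-to-image terms.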
Experimental Results
The authors evaluate the ConVIRT pretraining strategy on four medical image classification tasks and two zero-shot retrieval tasks, reporting consistent improvements across all settings:
- Classification Tasks: On tasks such as RSNA Pneumonia Detection and CheXpert, ConVIRT surpasses ImageNet initialization and other pretraining baselines. It is robust in both linear classification and fine-tuning settings, with the gains most pronounced when labeled data is limited.
- Retrieval Tasks: Zero-shot image-image and text-image retrieval also show that ConVIRT produces superior image representations; the retrieved results align closely with human annotations, confirming the quality of the learned representations (a retrieval sketch follows this list).
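As a rough illustration of how the zero-shot retrieval evaluations operate, the sketch below ranks a gallery of candidate embeddings by cosine similarity to a query embedding produced by the frozen pretrained encoders. The helper name, embedding dimension, and gallery size are assumptions made for this example, not taken from the paper.

```python
# Minimal sketch of zero-shot retrieval with frozen encoders: rank candidate
# image embeddings by cosine similarity to a query embedding. The query can
# come from the image encoder (image-image retrieval) or the text encoder
# (text-image retrieval).
import torch
import torch.nn.functional as F


def retrieve_top_k(query_emb, gallery_embs, k=5):
    """Return indices of the k gallery items most similar to the query."""
    q = F.normalize(query_emb, dim=-1)        # (dim,)
    g = F.normalize(gallery_embs, dim=-1)     # (num_items, dim)
    scores = g @ q                            # cosine similarities, (num_items,)
    return torch.topk(scores, k=k).indices


# Toy usage with random vectors standing in for encoder outputs.
if __name__ == "__main__":
    query = torch.randn(512)
    gallery = torch.randn(100, 512)
    print(retrieve_top_k(query, gallery, k=5))
```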
Implications and Future Directions
The implications of ConVIRT are significant for the healthcare domain, where labeled data is costly and scarce. The framework's reliance on multimodal data naturally occurring in clinical settings offers a pathway to substantially reduce annotation costs while maintaining or enhancing model performance.
Furthermore, this work has inspired larger-scale studies such as the CLIP and ALIGN models, demonstrating that the idea extends beyond medical imaging. Future research could explore applying ConVIRT to other healthcare data modalities, such as genetic data or patient history metadata, to further enhance predictive capabilities across diverse medical tasks.
In summary, ConVIRT represents a practical step towards more efficient machine learning in healthcare by leveraging the synergy between image and text data. This approach holds promise for widespread applicability and further exploration into cross-modality learning frameworks in healthcare and beyond.