Overview of CLEFT: Language-Image Contrastive Learning with Efficient LLM and Prompt Fine-Tuning
The paper introduces a new approach to language-image contrastive learning, dubbed CLEFT (Contrastive Learning with Efficient LLM and Prompt Fine-Tuning). The method seeks to extend existing contrastive language-image pre-training (CLIP) models to the medical domain, where datasets are constrained in both size and accessibility. To that end, CLEFT leverages pre-trained models and prompt fine-tuning to balance performance and efficiency.
The authors emphasize the need for more efficient models that do not sacrifice performance despite limited data availability, which is a common issue in medical applications. Traditional CLIP models, while effective for natural image-text pairs, often require copious amounts of data and computational resources, rendering them inefficient for the medical field. CLEFT addresses these limitations by incorporating a pre-trained LLM and an efficient strategy for learning context-based prompts.
The paper demonstrates that CLEFT achieves state-of-the-art results across various chest X-ray and mammography datasets while significantly reducing model size and training requirements. Specifically, the framework reduces the total trainable model size by 39% and shrinks the trainable LLM to a mere 4% compared to models using a BERT encoder. This is particularly valuable given the computational constraints common in healthcare settings.
Methodological Innovations
The core of the CLEFT framework involves combining contrastive learning with efficient fine-tuning strategies:
- Integration with LLMs: CLEFT utilizes a large, pre-trained LLM as the text encoder, capitalizing on the robust feature space such models offer. LLMs such as GPT-2 are known for their capacity to generalize effectively even without extensive re-training.
- Parameter-Efficient Fine-Tuning (PEFT): To mitigate the risk of overfitting on limited medical datasets, CLEFT employs PEFT techniques that fine-tune only a small number of additional parameters within each transformer block rather than updating the entire model. Preserving the broader knowledge embedded in the pre-trained weights allows better generalization to unseen data with minimal resource expenditure (see the first sketch after this list).
- Context-Based Prompt Fine-Tuning: A second training phase optimizes a series of trainable prompt tokens. By adapting these prompts for the classification task at hand, CLEFT advances beyond static, handcrafted prompts, enabling the model to generalize more robustly across different tasks and data subsets (see the second sketch after this list).
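The snippet below is a minimal, illustrative sketch of the first training stage's two ingredients: a parameter-efficient adapter (here a LoRA-style low-rank update, used as a stand-in for whichever PEFT variant is chosen) attached to a frozen linear layer of the text encoder, and a symmetric CLIP-style contrastive (InfoNCE) loss over matched image-text pairs. It is not the authors' implementation; class names and hyperparameters such as the rank r, scaling alpha, and the temperature are assumptions for illustration.

```python
# Hedged sketch: LoRA-style adapter on a frozen linear layer + CLIP-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer and adds a small trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pre-trained weights frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def clip_contrastive_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Symmetric InfoNCE loss: matched image-text pairs are the positives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Because only the low-rank adapters (and the image encoder) receive gradients, the number of trainable text-encoder parameters stays small, which is the point of the PEFT stage.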
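The second sketch illustrates context-based prompt tuning under the assumption of a CoOp-like design: a small set of learned context embeddings is prepended to the embedded class-name tokens, the text encoder stays frozen, and only the context vectors are trained. Names and dimensions are illustrative, not taken from the paper.

```python
# Hedged sketch: trainable context tokens prepended to class-name embeddings.
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    def __init__(self, n_ctx: int = 8, dim: int = 768):
        super().__init__()
        # Trainable context tokens shared across all classes.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_token_embeds: torch.Tensor) -> torch.Tensor:
        # class_token_embeds: (num_classes, n_class_tokens, dim)
        n_cls = class_token_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # Each class prompt = [learned context tokens] + [class-name tokens].
        return torch.cat([ctx, class_token_embeds], dim=1)
```

In use, the concatenated prompt embeddings would be passed through the frozen text encoder, and an image is classified by cosine similarity between its embedding and each class-prompt embedding; only `ctx` receives gradients.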
Experimental Results and Implications
The empirical evaluations on datasets such as CheXpert and RSNA underscore the efficacy of the CLEFT approach. The model performs strongly in zero-shot, linear-probing, and full fine-tuning settings, with substantial improvements over existing methods such as MedCLIP and MGCA. Furthermore, the results highlight a significant reduction in computational overhead, positioning CLEFT as a practical solution for real-world applications where computational resources are at a premium.
Future Prospects
The implications of CLEFT extend beyond immediate gains in performance and efficiency. Its ability to function effectively with limited labels and data, coupled with its inherent adaptability to diverse medical imaging datasets, points toward broader applications in other constrained domains. It also opens avenues for further research into integrating ever-larger LLMs into specialized domains through innovative adaptation techniques.
In conclusion, the development of CLEFT presents a meaningful step forward in the adaptation of multimodal learning models to specialized fields like medical imaging. Its emphasis on efficiency without sacrificing performance makes it a valuable addition to the toolkit of researchers and practitioners operating within data and resource-constrained environments.