Retrieval-augmented in-context learning for multimodal large language models in disease classification (2505.02087v1)

Published 4 May 2025 in cs.AI

Abstract: Objectives: We aim to dynamically retrieve informative demonstrations, enhancing in-context learning in multimodal LLMs (MLLMs) for disease classification. Methods: We propose a Retrieval-Augmented In-Context Learning (RAICL) framework, which integrates retrieval-augmented generation (RAG) and in-context learning (ICL) to adaptively select demonstrations with similar disease patterns, enabling more effective ICL in MLLMs. Specifically, RAICL examines embeddings from diverse encoders, including ResNet, BERT, BioBERT, and ClinicalBERT, to retrieve appropriate demonstrations, and constructs conversational prompts optimized for ICL. We evaluated the framework on two real-world multi-modal datasets (TCGA and IU Chest X-ray), assessing its performance across multiple MLLMs (Qwen, Llava, Gemma), embedding strategies, similarity metrics, and varying numbers of demonstrations. Results: RAICL consistently improved classification performance. Accuracy increased from 0.7854 to 0.8368 on TCGA and from 0.7924 to 0.8658 on IU Chest X-ray. Multi-modal inputs outperformed single-modal ones, with text-only inputs being stronger than images alone. The richness of information embedded in each modality determined which embedding model yielded better results. Few-shot experiments showed that increasing the number of retrieved examples further enhanced performance. Across different similarity metrics, Euclidean distance achieved the highest accuracy while cosine similarity yielded better macro-F1 scores. RAICL demonstrated consistent improvements across various MLLMs, confirming its robustness and versatility. Conclusions: RAICL provides an efficient and scalable approach to enhance in-context learning in MLLMs for multimodal disease classification.

Summary

Retrieval-Augmented In-Context Learning for Multimodal LLMs in Disease Classification

The paper "Retrieval-augmented in-context learning for multimodal LLMs in disease classification" presents an innovative framework aiming to enhance the capabilities of multimodal LLMs (MLLMs) in medical disease classification. The proposed method, termed the Retrieval-Augmented In-Context Learning (RAICL) framework, seeks to dynamically select informative demonstrations to improve the in-context learning (ICL) process of MLLMs. This integration utilizes retrieval-augmented generation (RAG) techniques to effectively adapt and refine disease pattern analysis by leveraging more relevant contextual information.

Methodological Framework

RAICL combines RAG with ICL, exploring several embedding strategies and similarity metrics. Encoders include ResNet for image embeddings and BERT, BioBERT, and ClinicalBERT for textual embeddings. For each query case, the framework computes embedding similarity against a pool of labeled examples, selects the most relevant ones, and integrates them into a conversational prompt, which improves the MLLM's ability to discriminate between disease classes.
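
The paper does not provide reference code; the sketch below is a minimal, hypothetical illustration of the retrieval and prompt-construction steps described above. It assumes per-modality embeddings (e.g. ResNet features for images, ClinicalBERT features for clinical text) have already been computed for a pool of labeled demonstrations; the names embed_example, retrieve_demonstrations, and build_prompt are illustrative, not from the paper.

```python
# Hypothetical RAICL-style retrieval sketch (not the authors' code).
import numpy as np

def embed_example(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Fuse per-modality embeddings (e.g. ResNet for the image, ClinicalBERT for the text)."""
    return np.concatenate([image_emb, text_emb])

def retrieve_demonstrations(query_emb, pool_embs, pool_records, k=2, metric="euclidean"):
    """Return the k labeled demonstrations most similar to the query embedding."""
    if metric == "euclidean":
        scores = -np.linalg.norm(pool_embs - query_emb, axis=1)  # smaller distance = higher score
    else:  # cosine similarity
        scores = pool_embs @ query_emb / (
            np.linalg.norm(pool_embs, axis=1) * np.linalg.norm(query_emb) + 1e-12
        )
    top_idx = np.argsort(scores)[::-1][:k]
    return [pool_records[i] for i in top_idx]

def build_prompt(demos, query_text):
    """Interleave retrieved demonstrations as user/assistant turns, ending with the query."""
    turns = []
    for d in demos:
        turns.append({"role": "user", "content": f"Findings: {d['text']}\nDiagnosis?"})
        turns.append({"role": "assistant", "content": d["label"]})
    turns.append({"role": "user", "content": f"Findings: {query_text}\nDiagnosis?"})
    return turns
```

In a fully multimodal setting, the demonstration images would also be attached to the corresponding user turns; the structure above shows only the text side for brevity.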

Dataset and Evaluation

The paper evaluates RAICL on two real-world multi-modal datasets: The Cancer Genome Atlas (TCGA) and the IU Chest X-ray dataset. These datasets cover clinically distinct scenarios, pairing histopathological and radiological images with textual clinical descriptions. Performance was assessed across several MLLMs (Qwen, Llava, and Gemma), with consistent improvements in disease classification metrics such as accuracy and macro-F1.
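
The reported metrics are accuracy and macro-averaged F1; a small helper like the following (a sketch assuming scikit-learn, which the paper does not prescribe) computes both from gold and predicted labels.

```python
# Hypothetical metric helper; scikit-learn is an assumption, not stated in the paper.
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    """Compute the two reported metrics: accuracy and macro-averaged F1."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }

# Toy usage with made-up labels
print(evaluate(["pneumonia", "normal", "normal"], ["pneumonia", "normal", "pneumonia"]))
```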

Key Results

RAICL substantially improves classification accuracy, from 0.7854 to 0.8368 on the TCGA dataset and from 0.7924 to 0.8658 on the IU Chest X-ray dataset. A noteworthy finding is the differential impact of input modalities: text-only inputs outperform image-only inputs, while combining the two modalities yields the best results. This underscores the importance of multimodal integration in clinical contexts.

Few-shot experiments show that increasing the number of retrieved demonstrations further improves performance, underscoring the framework's adaptability in settings where labeled data is sparse. Improvements hold across different similarity metrics, with Euclidean distance achieving the highest accuracy and cosine similarity the better macro-F1 scores (illustrated below), reinforcing the robustness and versatility of RAICL across clinical scenarios.
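
To make the metric comparison concrete, the toy example below (made-up two-dimensional vectors, not data from the paper) shows why Euclidean and cosine retrieval can rank candidate demonstrations differently: cosine similarity ignores embedding magnitude, whereas Euclidean distance does not.

```python
# Toy illustration of Euclidean vs. cosine ranking (hypothetical vectors).
import numpy as np

query = np.array([1.0, 1.0])
a = np.array([2.0, 2.0])   # same direction as the query, larger norm
b = np.array([1.0, 0.8])   # slightly different direction, similar norm

def euclidean(u, v):
    return float(np.linalg.norm(u - v))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(euclidean(query, a), euclidean(query, b))  # b is the closer neighbor by distance
print(cosine(query, a), cosine(query, b))        # a is the more similar neighbor by angle
```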

Implications and Future Directions

The RAICL framework has implications for both clinical and computational domains. Practically, it offers a scalable way to enhance AI-assisted diagnostic tools, which is especially valuable in resource-constrained environments where labeled clinical data is limited. Theoretically, it adds to the discussion of how multimodal data and LLMs interact, showing how contextually relevant retrieval improves model efficacy.

Future research might extend RAICL to other complex multimodal datasets, such as those involving time-series or more intricate biomedical signals. Exploring more sophisticated similarity metrics and embedding methods could drive further refinements. Improving computational efficiency while maintaining high accuracy remains a critical area, especially for point-of-care deployment in clinical settings.

In conclusion, this paper presents a substantive advance in the deployment of multimodal LLMs for medical disease classification, with potential to influence both the evolution of AI technology and its application to improving healthcare outcomes.