
Multimodal Foundation Models Exploit Text to Make Medical Image Predictions (2311.05591v2)

Published 9 Nov 2023 in cs.CV, cs.AI, and cs.CL

Abstract: Multimodal foundation models have shown compelling but conflicting performance in medical image interpretation. However, the mechanisms by which these models integrate and prioritize different data modalities, including images and text, remain poorly understood. Here, using a diverse collection of 1014 multimodal medical cases, we evaluate the unimodal and multimodal image interpretation abilities of proprietary (GPT-4, Gemini Pro 1.0) and open-source (Llama-3.2-90B, LLaVA-Med-v1.5) multimodal foundation models with and without the use of text descriptions. Across all models, image predictions were largely driven by exploiting text, with accuracy increasing monotonically with the amount of informative text. By contrast, human performance on medical image interpretation did not improve with informative text. Exploitation of text is a double-edged sword; we show that even mild suggestions of an incorrect diagnosis in text diminish image-based classification, reducing performance dramatically in cases the model could previously answer with images alone. Finally, we conducted a physician evaluation of model performance on long-form medical cases, finding that the provision of images either reduced or had no effect on model performance when text is already highly informative. Our results suggest that multimodal AI models may be useful in medical diagnostic reasoning but that their accuracy is largely driven, for better and worse, by their exploitation of text.

Assessment of GPT-4V's Diagnostic Accuracy on Complex Medical Cases

The study by Buckley et al. provides a comprehensive analysis of the diagnostic performance of a vision-language model, GPT-4V, across a variety of challenging medical cases presented in the New England Journal of Medicine (NEJM) Image Challenge. The model was evaluated against human respondents in terms of accuracy, stratified by case difficulty, image type, and patient demographics, including skin tone. The model's performance was also assessed on clinicopathological conferences (CPCs) to further gauge its diagnostic reasoning capabilities.

Key Findings

GPT-4V outperformed human respondents, achieving an overall accuracy of 61% versus 49% for human participants. This advantage persisted across levels of case difficulty, degrees of disagreement among human respondents, skin tones, and image types, with the exception of radiographic images, where human and GPT-4V accuracy were equivalent.

Multimodal Versus Unimodal Performance

The analysis uncovered notable differences in performance when GPT-4V utilized text, images, or both. The model showed enhanced accuracy when leveraging both text and images, particularly as the richness of the textual information increased. However, in cases with highly informative text, the addition of images paradoxically led to a slight decrease in accuracy. This effect was particularly evident in the CPCs, where models using text alone outperformed those using both text and images.
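The kind of modality-stratified comparison described above can be sketched as a simple aggregation over per-case evaluation records. This is a minimal, hypothetical illustration: the record structure, field names, and outcomes below are invented for the example and are not the paper's data or code.

```python
from collections import defaultdict

# Hypothetical evaluation records: each case is answered under three
# input conditions ("image", "text", "image+text"). Values are illustrative.
records = [
    {"case": 1, "condition": "image",      "correct": False},
    {"case": 1, "condition": "text",       "correct": True},
    {"case": 1, "condition": "image+text", "correct": True},
    {"case": 2, "condition": "image",      "correct": True},
    {"case": 2, "condition": "text",       "correct": True},
    {"case": 2, "condition": "image+text", "correct": False},
]

def accuracy_by_condition(records):
    """Return accuracy per input condition, e.g. {"image": 0.5, ...}."""
    totals = defaultdict(lambda: [0, 0])  # condition -> [n_correct, n_total]
    for r in records:
        totals[r["condition"]][0] += int(r["correct"])
        totals[r["condition"]][1] += 1
    return {cond: n_correct / n_total
            for cond, (n_correct, n_total) in totals.items()}

print(accuracy_by_condition(records))
```

Stratifying further (by case difficulty, image type, or how informative the text is) only changes the grouping key; the paper's central observation is that the "text" and "image+text" rows converge as text informativeness rises.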

Assessment of Image and Skin Tone Impact

The paper's exploration of performance across different image types and skin tones is particularly insightful. GPT-4V consistently surpassed human accuracy for cutaneous, oral, and ocular images, and its accuracy was consistent across skin tones, with no significant disparities in performance.

Theoretical and Practical Implications

This paper advances the understanding of how multimodal AI models can be employed in medical diagnostic reasoning. The results highlight the importance of contextual information, suggesting that AI models may benefit from detailed textual data to make accurate diagnoses. On the practical front, although the model's capacity to incorporate multimodal inputs is promising, its current limitations, particularly related to the integration of images in text-rich cases, must be acknowledged and addressed.

Future Directions

Future research could focus on refining the capabilities of models like GPT-4V to better handle and integrate diverse data inputs without experiencing performance dilution. Additionally, understanding the dynamics of AI-assisted diagnosis and its influence on human decision-making remains crucial for improving the symbiosis between AI systems and medical practitioners.

In summary, GPT-4V demonstrates robust potential in medical diagnostics, although further advancements and studies are needed to optimize its integration and application in clinical settings. This paper contributes significantly to the ongoing discourse on the role of multimodal AI in enhancing diagnostic accuracy, underscoring the potential impact on healthcare delivery and clinical decision-making processes.

Authors: Thomas Buckley, James A. Diao, Adam Rodman, Arjun K. Manrai, Pranav Rajpurkar
Citations (15)