GPT-4V for Multimodal Medical Diagnosis
The paper, "Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis," examines the potential of OpenAI's GPT-4V(ision) for medical diagnosis across diverse imaging modalities and anatomical systems. The investigation evaluates GPT-4V's effectiveness on five clinical tasks: imaging modality and anatomy recognition, disease diagnosis, report generation, disease localization, and patient history integration.
Evaluation of GPT-4V's Core Competencies
- Imaging Modality Identification: GPT-4V demonstrates proficiency in identifying imaging modalities such as X-ray, CT, and MRI. This competence extends to distinguishing anatomical structures across an extensive range of body systems, from the central nervous system to musculoskeletal regions. The model's ability to correctly identify imaging planes further indicates a substantial understanding of medical imaging basics.
- Report Generation: While GPT-4V generates structured reports consistently, its observations are frequently generic and lack the specific findings required for detailed medical evaluation. The reports follow a standard template, but the content often fails to capture complex pathologies.
- Disease Diagnosis: The model struggles significantly with accurate disease detection and diagnosis. Although GPT-4V can list potential diseases when prompted, it frequently defaults to conservative estimates that fail to pinpoint exact abnormalities identified by medical experts. This highlights a critical limitation in its diagnostic capacity, underscoring the gap between GPT-4V's outputs and expert diagnostic practices.
- Disease Localization: The ability to localize abnormalities or anatomical structures within medical images remains underdeveloped. Across repeated trials, GPT-4V's bounding-box predictions exhibit high variance and inconsistency, yielding low intersection-over-union (IoU) scores against ground-truth annotations.
- Patient History Integration: Including patient history often helps GPT-4V produce more targeted analyses, suggesting that context-rich text prompts can moderately improve diagnostic accuracy. This sensitivity to detailed prompts, while useful, also indicates that the model leans heavily on textual inputs rather than image-based evidence.
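The IoU metric used for the localization evaluation above is a simple ratio of overlap to combined area. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (the paper does not specify its box format):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Overlap width and height, clamped at zero for disjoint boxes.
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

An IoU of 1.0 means a perfect match with the ground-truth box, 0.0 means no overlap; the low scores reported for GPT-4V indicate predictions far from the annotated regions.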
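The patient-history finding above amounts to prepending structured clinical context to the diagnostic question. A minimal sketch of such prompt assembly; the function name, field names, and wording are illustrative assumptions, not taken from the paper:

```python
def build_diagnostic_prompt(history, question):
    """Prepend context-rich patient history to a diagnostic question.

    `history` is a dict of hypothetical clinical fields (age, symptoms, ...);
    the actual prompt wording used in the paper is not specified here.
    """
    lines = ["Patient history:"]
    for field, value in history.items():
        lines.append(f"- {field}: {value}")
    lines.append("")  # blank line between context and question
    lines.append(question)
    return "\n".join(lines)

prompt = build_diagnostic_prompt(
    {"age": 58, "sex": "female", "symptoms": "persistent cough, weight loss"},
    "Given the attached chest X-ray, list the most likely findings.",
)
```

The resulting text would accompany the image in a multimodal request; per the paper's observation, richer `history` fields tend to yield more targeted analyses.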
Implications and Future Directions
The findings clarify that while GPT-4V can act as a supportive tool in identifying modalities and producing structured output, its diagnostic utility is limited. This underscores the necessity for further research and model refinement, particularly focusing on enhancing the model's ability to interpret and correlate visual and textual data accurately.
Future work might explore:
- Advanced Training: Incorporating more specialized datasets and enhancing multimodal learning frameworks could improve disease detection capabilities.
- Integration with Clinical Systems: Developing plug-in functionality for seamless integration with clinical decision-support systems could provide medical professionals with enhanced diagnostic tools.
- Safety and Regulatory Compliance: Addressing safety concerns and ensuring models meet stringent regulatory standards are prerequisites for broader clinical application.
This paper advocates caution in deploying GPT-4V, as it currently stands, for real-world medical applications. However, its structured report generation and imaging-modality recognition are promising preliminary steps, and continued development could pave the way for reliable multimodal AI systems in healthcare.