
Capabilities of GPT-5 on Multimodal Medical Reasoning (2508.08224v2)

Published 11 Aug 2025 in cs.CL and cs.AI

Abstract: Recent advances in LLMs have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.26% and +26.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5's ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.


Summary

  • The paper demonstrates unprecedented improvements in text and multimodal medical reasoning, with GPT-5 achieving up to 95.84% accuracy and significant gains on MedXpertQA.
  • The study employs a unified zero-shot chain-of-thought prompting method to integrate heterogeneous data sources, including clinical text, structured indicators, and medical images.
  • The paper highlights GPT-5’s super-human performance on multimodal tasks, surpassing pre-licensed human experts and setting new benchmarks for clinical decision support.

Evaluation of GPT-5 for Multimodal Medical Reasoning

Introduction

The paper presents a comprehensive evaluation of GPT-5 and its variants (GPT-5-mini, GPT-5-nano) on multimodal medical reasoning tasks, benchmarking their performance against GPT-4o-2024-11-20 and pre-licensed human experts. The paper addresses the critical challenge of integrating heterogeneous medical data—textual narratives, structured indicators, and medical images—within a unified reasoning framework. The authors employ standardized zero-shot chain-of-thought (CoT) prompting across diverse datasets, including MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD, to isolate model improvements from prompt engineering or dataset idiosyncrasies.

Datasets and Evaluation Protocol

The evaluation spans both text-based and multimodal medical QA/VQA datasets:

  • MedQA: Multiple-choice questions from US, Mainland China, and Taiwan medical licensing exams.
  • MMLU-Medical: Subset of MMLU focused on medical knowledge and reasoning.
  • USMLE Self Assessment: Official practice questions for Steps 1, 2 CK, and 3.
  • MedXpertQA: Expert-level benchmark with text-only and multimodal subsets, the latter incorporating complex clinical images and patient records.
  • VQA-RAD: Radiology-focused VQA dataset with binary yes/no questions linked to curated clinical images.

The unified prompting protocol utilizes zero-shot CoT reasoning, with explicit step-by-step rationale generation followed by a discrete answer selection. For multimodal items, images are appended to the initial user message, enabling integrated vision-language reasoning (Figure 1).

Figure 1: A prompting design sample from MedXpertQA, illustrating the integration of clinical text and medical imaging in the input.
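
As a concrete illustration of this protocol, the sketch below shows one way such a zero-shot CoT query could be issued through the OpenAI Chat Completions API. The model identifier, prompt wording, and message structure here are assumptions for illustration only and are not taken from the paper.

```python
import base64
from openai import OpenAI

client = OpenAI()

def ask_medical_mcq(question: str, options: dict[str, str], image_path: str | None = None) -> str:
    """Zero-shot CoT query: reason step by step, then state a single option letter."""
    option_text = "\n".join(f"{k}. {v}" for k, v in options.items())
    prompt = (
        f"{question}\n\nOptions:\n{option_text}\n\n"
        "Let's think step by step, then state the final answer as a single letter."
    )
    content = [{"type": "text", "text": prompt}]
    if image_path:  # multimodal items: append the image to the initial user message
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    resp = client.chat.completions.create(
        model="gpt-5",  # assumed identifier; substitute whichever model variant is being evaluated
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```

A downstream parser would then map the free-form rationale to one of the discrete answer choices before accuracy is computed.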

Results: Text-Based Medical Reasoning

GPT-5 demonstrates consistent and substantial improvements over GPT-4o and its own smaller variants across all text-based benchmarks. On MedQA (US 4-option), GPT-5 achieves 95.84% accuracy, a 4.80% absolute gain over GPT-4o. The most pronounced improvements are observed in MedXpertQA Text, with reasoning and understanding scores increasing by 26.33% and 25.30%, respectively. In MMLU medical subdomains, GPT-5 maintains near-ceiling performance (>91%), with incremental gains in high-baseline categories, indicating that the model's upgrades primarily benefit complex reasoning tasks rather than factual recall.
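
To make the headline number concrete: a 4.80-point absolute gain implies a GPT-4o baseline of roughly 95.84% − 4.80% = 91.04% on MedQA (US 4-option). The exact baseline is reported in the paper's result tables; this back-of-the-envelope figure is only a consistency check, and the gains quoted throughout appear to be absolute percentage-point differences rather than relative improvements.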

Results: USMLE Self Assessment

GPT-5 outperforms all baselines on USMLE Steps 1, 2, and 3, with the largest margin (+4.17%) on Step 2, which emphasizes clinical decision-making. The average score across steps is 95.22%, exceeding typical human passing thresholds and demonstrating readiness for high-stakes clinical reasoning.

Results: Multimodal Medical Reasoning

GPT-5 achieves dramatic improvements in multimodal reasoning, particularly on MedXpertQA MM, with reasoning and understanding gains of +29.26% and +26.18% over GPT-4o. This magnitude of improvement suggests enhanced cross-modal attention and alignment within the model architecture. Notably, GPT-5 surpasses pre-licensed human experts by +24.23% (reasoning) and +29.40% (understanding) on MedXpertQA MM, marking a shift from human-comparable to super-human performance.

A representative case from MedXpertQA MM demonstrates GPT-5's ability to synthesize clinical narratives, laboratory data, and imaging findings to recommend appropriate high-stakes interventions (Figure 2).

Figure 2: GPT-5 reasoning output and final answer for MedXpertQA case MM-1993, showing stepwise integration of multimodal evidence and exclusion of incorrect options.

In contrast, GPT-5 scores slightly lower on VQA-RAD (70.92%) compared to GPT-5-mini (74.90%), possibly reflecting conservative reasoning calibration in the larger model for small-domain tasks.
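
Since VQA-RAD items are scored on closed yes/no answers, the final step of the protocol reduces to extracting a discrete label from the model's free-form rationale and computing exact-match accuracy. The snippet below is a minimal, hypothetical extraction rule, not the authors' actual parser, illustrating how such scoring could work.

```python
import re

def extract_yes_no(rationale: str) -> str | None:
    """Pull the final yes/no verdict from a chain-of-thought response (hypothetical rule)."""
    # Prefer an explicit "final answer: yes/no" statement if one is present.
    m = re.search(r"final answer\s*[:\-]?\s*(yes|no)\b", rationale, re.IGNORECASE)
    if m:
        return m.group(1).lower()
    # Otherwise fall back to the last standalone yes/no mention in the text.
    hits = re.findall(r"\b(yes|no)\b", rationale, re.IGNORECASE)
    return hits[-1].lower() if hits else None

def accuracy(predictions: list[str | None], gold: list[str]) -> float:
    """Exact-match accuracy over binary VQA items."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```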

Comparison with Human Experts

GPT-5 not only closes the performance gap with pre-licensed human experts but exceeds their scores by substantial margins in both text and multimodal settings. GPT-4o remains below human expert performance in most dimensions, underperforming by 5.03–15.90%. GPT-5's lead is most pronounced in multimodal reasoning, where its unified vision-language pipeline integrates textual and visual evidence more effectively than pre-licensed human experts working under time-limited test conditions.

Discussion

The evaluation reveals several key findings:

  • Substantial Gains in Multimodal Reasoning: GPT-5's improvements are most pronounced in tasks requiring tight integration of image-derived and textual evidence, suggesting architectural or training enhancements in cross-modal attention.
  • Strength in Reasoning-Intensive Tasks: Chain-of-thought prompting synergizes with GPT-5's internal reasoning capacity, enabling more accurate multi-hop inference, especially in complex clinical scenarios.
  • Super-Human Benchmark Performance: GPT-5 consistently exceeds pre-licensed human expert performance in controlled QA/VQA evaluations, highlighting its potential for clinical decision support. However, these results are obtained under idealized testing conditions and may not fully capture the complexity and uncertainty of real-world medical practice.
  • Scaling-Related Calibration Effects: The slight underperformance of GPT-5 on VQA-RAD compared to GPT-5-mini suggests that larger models may adopt more cautious reasoning strategies in small-domain tasks, warranting further investigation into adaptive prompting and calibration techniques.

Implications and Future Directions

The demonstrated capabilities of GPT-5 have significant implications for the design of future clinical decision-support systems. Its proficiency in integrating complex multimodal information streams and delivering accurate, well-justified recommendations positions it as a reliable core component for medical AI applications. However, the transition from benchmark evaluations to real-world deployment necessitates further research into prospective clinical trials, domain-adapted fine-tuning, and robust calibration methods to ensure safety, transparency, and ethical compliance.

Conclusion

This paper provides a rigorous, systematic evaluation of GPT-5's capabilities in multimodal medical reasoning, establishing its superiority over GPT-4o, smaller GPT-5 variants, and pre-licensed human experts across diverse QA and VQA benchmarks. The model's substantial gains in reasoning-intensive and multimodal tasks mark a qualitative shift in LLM capabilities, narrowing the gap between research prototypes and practical clinical tools. Future work should focus on validating these results in real-world clinical environments and developing strategies for safe and effective deployment.
