MedXpertQA-Multi Clinical QA Benchmark

Updated 17 November 2025
  • MedXpertQA-Multi is a challenging multimodal benchmark that evaluates advanced clinical reasoning using integrated patient narratives, images, and structured data.
  • It employs rigorous AI and human expert filtering with modern LLM rewriting to ensure high difficulty and minimize memorization risk.
  • Benchmark results highlight significant gaps in current models, emphasizing the need for modular architectures and enhanced uncertainty calibration.

MedXpertQA-Multi is a high-difficulty, multimodal benchmark and clinical QA testbed that evaluates expert-level reasoning capabilities of both large models and hybrid human–AI systems in medicine. It is specifically engineered to assess advanced clinical judgment, multi-step reasoning, and integration of heterogeneous data types—such as patient narratives, laboratory tables, and diverse imaging modalities—mirroring the challenges faced in real diagnostic workflows.

1. Scope, Dataset Construction, and Filtering

MedXpertQA-Multi originates from MedXpertQA (Zuo et al., 30 Jan 2025) and consists of over 2,000 rigorously filtered multimodal questions drawn from an initial pool of 37,543 items sourced from leading medical exams (USMLE, COMLEX-USA, and 17 specialty board exams) plus high-image-content datasets (ACR DXIT/TXIT, EDiR, NEJM Image Challenge). Each item incorporates one or more high-resolution images (radiology, pathology, clinical photos, ECGs, tables, diagrams), alongside rich patient histories, vital signs, and lab values.

Filtering occurs in multiple rounds:

  • AI Expert Filtering: Questions correctly answered by baseline LLMs (basic and advanced, vision-enabled) in repeated sampling are discarded; only items confounding both traditional and advanced AI models are retained, driving dataset difficulty higher than previous benchmarks.
  • Human Expert Filtering: Board-certified physicians assign difficulty ratings and option distributions, with the Brier score $B = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ (where $\hat{y}_i$ is the fraction of humans choosing option $i$ and $y_i$ indicates the correct option) guiding adaptive, stratified retention (a worked sketch follows this list).
  • Data Synthesis: All questions are rewritten using modern LLMs to minimize memorization risk; options are expanded and difficult distractors inserted, while text-similarity and perplexity metrics (e.g., ROUGE-L, edit distance) quantify the defense against leakage.
  • Expert Review: Physicians audit every rewritten question and option for factuality, plausibility, and clarity. Images are validated for authenticity and match with diagnostic context.
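
To make the Brier-based retention step concrete, the following is a minimal sketch of difficulty stratification from per-question human vote counts. The data layout, strata boundaries, and variable names are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of Brier-score-based difficulty stratification.
# Assumes each question records how many human annotators chose each of its
# five options and which option is correct; thresholds are illustrative.

def brier_score(option_counts: list[int], correct_index: int) -> float:
    """B = (1/N) * sum_i (y_i - y_hat_i)^2, with y_hat_i the fraction of humans
    choosing option i and y_i = 1 for the correct option, 0 otherwise."""
    total = sum(option_counts)
    n = len(option_counts)
    score = 0.0
    for i, count in enumerate(option_counts):
        y_hat = count / total
        y = 1.0 if i == correct_index else 0.0
        score += (y - y_hat) ** 2
    return score / n

def difficulty_stratum(b: float) -> str:
    # Hypothetical strata: a low score means humans converge on the right answer
    # (easy); a high score means disagreement or confident wrong answers (hard).
    if b < 0.05:
        return "easy"
    if b < 0.15:
        return "moderate"
    return "hard"

# Example: 20 annotators, correct answer is option C (index 2), votes split widely.
votes = [3, 4, 6, 4, 3]
b = brier_score(votes, correct_index=2)
print(f"Brier score = {b:.3f} -> {difficulty_stratum(b)}")
```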

Multimodal diversity is maximized: the 2,005-question set covers all clinical specialties and body systems, every question offers exactly five answer options, and the associated images span ten medical categories.

2. Question & Answer Format and Evaluation Protocols

Each MedXpertQA-Multi question presents a complex clinical vignette: a patient’s history, relevant exam/lab findings, plus one or more diagnostic images with interpretive cues. Prompts demand high-order reasoning, rarely answerable by lookup or pattern-matching. Representative questions ask for:

  • Differential diagnosis based on visual pathology (e.g., image histology, radiology panels)
  • Next-step management given structured data and imaging
  • Interpretation of advanced physiological or mechanical measures (e.g., Poisson ratio from image analysis)

Answer format is strictly multiple-choice (exactly five options for each multimodal item). Evaluation distinguishes "Reasoning" (multi-step inference, unfamiliar conditions, novel cases) from "Understanding" (recognition, simple factual recall) subsets. Metrics include absolute accuracy and composite averages (Reasoning $\mathit{Avg_R}$, Understanding $\mathit{Avg_K}$), along with a modality split (text vs. image-guided).
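
A minimal sketch of how per-subset metrics could be tabulated from model predictions is given below; the record schema, field names, and example values are assumptions for illustration, not the benchmark's official evaluation harness.

```python
# Sketch: per-subset accuracy tabulation (Reasoning vs. Understanding,
# text-only vs. image-guided). The record schema below is illustrative.
from collections import defaultdict

records = [
    # each record: (question_id, subset, modality, predicted_option, correct_option)
    ("q1", "reasoning", "image", "C", "C"),
    ("q2", "understanding", "text", "B", "D"),
    ("q3", "reasoning", "image", "A", "A"),
]

hits, totals = defaultdict(int), defaultdict(int)
for _, subset, modality, pred, gold in records:
    for key in ("overall", subset, modality):
        totals[key] += 1
        hits[key] += int(pred == gold)

for key in sorted(totals):
    print(f"{key:>13}: {hits[key] / totals[key]:.1%} ({hits[key]}/{totals[key]})")
```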

Benchmarking includes inference-scaled proprietary LLMs, vanilla multimodal models, and pure text models. Top-scoring inference-enhanced LMMs (e.g., o1, DeepSeek-R1) reach only ~50% overall accuracy; vanilla models and open-source baselines remain below 40%, substantially beneath human expert levels (≥80%).

3. Underlying Challenges in Multimodal Medical Reasoning

MedXpertQA-Multi systematically reveals weak points of current systems:

  • Visual Perception Bottleneck: Models often misread critical diagnostic details in pathology slides, radiologic images, or ECGs, leading to perceptual errors. Image-rich modalities present greater challenges: accuracy is significantly lower on "image-guided" problems versus text-only cases.
  • Complex Reasoning Failure Modes: Up to 50% of errors stem from incorrect differential diagnosis and multi-step reasoning, particularly when structured knowledge, temporal progression, or cross-modal integration are required.
  • Knowledge Gaps and Temporal Reasoning: Some errors indicate outdated or incomplete knowledge bases, or misunderstanding of the significance of findings over time.
  • Calibration and Uncertainty: Many systems exhibit overconfidence on high-risk questions (poor reliability curves), with little explicit uncertainty quantification or risk calibration (see the calibration sketch after this list).
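
The calibration point can be made concrete with a standard reliability-style check: bin answers by the model's stated confidence and compare average confidence against empirical accuracy (expected calibration error). The sketch below is a generic implementation of that check, not a protocol defined by the benchmark; the confidence values are placeholders.

```python
# Sketch: expected calibration error (ECE) over binned model confidences.
# A well-calibrated model has average confidence close to accuracy in every bin.

def expected_calibration_error(confidences, correct, n_bins=10):
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / total) * abs(avg_conf - accuracy)
    return ece

# Placeholder confidences and correctness flags for a handful of answers.
confidences = [0.95, 0.90, 0.85, 0.99, 0.70, 0.65]
correct = [1, 0, 1, 0, 1, 0]
print(f"ECE = {expected_calibration_error(confidences, correct):.3f}")
```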

A plausible implication is that further improvement hinges on modular architectures that separately enhance visual perception and symbolic reasoning, with structured clinical contextualization.

4. Hybrid System Integration and Human-AI Workflow Strategies

MedXpertQA-Multi serves as both a benchmark and a target for adaptive, hybrid QA systems. Recent work (Bary et al., 4 Oct 2025) proposes a hybrid coalition-of-experts system for streamlining expert attention in real-time QA:

  • Incoming queries are processed sequentially; candidate answers are generated via information retrieval (IR) or automatic model proposals, then refined by a coalition of human experts with unknown proficiencies.
  • Expert reliabilities are tracked online via Beta posteriors, with update formula $r_{(k)} = (s_{(k)} + 1) / (t_{(k)} + 2)$, dynamically estimated for each medical subdomain.
  • Querying is adaptive: only the minimal set of experts is sampled per instance, modulated by latent difficulty $d(x)$ (derived, e.g., from model entropy or subdomain uncertainty). Additional experts are queried until the aggregate consensus confidence $c^*_n$ surpasses a user-chosen threshold $\tau$.
  • Label aggregation formally weights expert responses by reliability under independent-error assumptions, and stopping criteria reflect confidence thresholds, expert agreement, or model certainty (a compressed code sketch follows this list).
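
The loop described above can be compressed into a short sketch, assuming binary correctness tracking per expert and a log-odds-weighted consensus under the independence assumption; the normalization, stopping rule, and data structures are simplifications for illustration, not the reference implementation from Bary et al.

```python
# Sketch: reliability-weighted expert aggregation with an adaptive stopping rule.
# Reliability uses the Beta-posterior mean r_k = (s_k + 1) / (t_k + 2), where
# s_k = past correct answers and t_k = past total answers for expert k.
import math
from collections import defaultdict

def reliability(s_k: int, t_k: int) -> float:
    return (s_k + 1) / (t_k + 2)

def query_experts(expert_stats, get_label, options, tau=0.9):
    """Query experts in order of estimated reliability until the consensus
    confidence for the leading option reaches the threshold tau."""
    ranked = sorted(expert_stats, key=lambda k: -reliability(*expert_stats[k]))
    log_odds = defaultdict(float)
    best, conf = None, 0.0
    for k in ranked:
        r = min(max(reliability(*expert_stats[k]), 1e-3), 1 - 1e-3)
        label = get_label(k)                      # ask expert k for this question
        log_odds[label] += math.log(r / (1 - r))  # reliability-weighted vote
        # Normalize accumulated log-odds into a consensus confidence over all options.
        z = sum(math.exp(log_odds.get(o, 0.0)) for o in options)
        best = max(log_odds, key=log_odds.get)
        conf = math.exp(log_odds[best]) / z
        if conf >= tau:                           # stop: consensus is confident enough
            break
    return best, conf

# Example: three experts with different track records answer one question.
stats = {"expA": (45, 50), "expB": (30, 50), "expC": (10, 50)}
answers = {"expA": "C", "expB": "C", "expC": "B"}
print(query_experts(stats, answers.get, options="ABCDE"))
```

In this toy run all three experts end up being consulted because the consensus confidence never reaches τ=0.9, which is one reason agreement- and certainty-based stopping criteria are also considered alongside confidence thresholds.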

Empirically, such frameworks reduce expert annotation load by 30–50% at matched answer accuracy, suggesting practical resource savings for screening workflows, triage, and continuous QA pipelines.

5. Benchmark Results: State-of-the-Art Model and Ensemble Performance

Performance on MedXpertQA-Multi reveals system-level strengths and limitations across several axes:

| Model | Top-1 Acc. (%) | Δ vs Top Baseline | Reasoning (%) | Understanding (%) |
|---|---|---|---|---|
| GPT-5 | 72.18 | +29 over GPT-4o | 69.99 | 74.37 |
| GPT-4o-2024-11-20 | 42.80 | baseline | 40.73 | 48.19 |
| Human Experts | 45.53 | – | 45.76 | 44.97 |
| Sully Consensus (2505.23075) | 61.2 | +8.2 over O3 | ~66.1 (diagnosis) | ~60.7 (treatment) |

GPT-5 demonstrates super-human benchmark performance: +24 to +29 percentage points over pre-licensed human experts on reasoning and understanding metrics (Wang et al., 11 Aug 2025). Ensemble consensus mechanisms, such as Sully MedCon-1, attain 61.2% Top-1 accuracy, outperforming single-model baselines (O3: 53%, Gemini 2.5 Pro: 44.1%) with statistically significant improvements ($p<0.01$, $p<0.001$) and >90% Top-4 accuracy. These ensemble systems are especially effective on diagnosis and treatment questions, but underperform on basic science and rare specialties due to correlated expert failures or artifact boosting.

This suggests that in high-stakes, multimodal medical QA, diversification—across both model architecture and expert agent composition—substantially increases reliability and accuracy, while preventing overfitting to model-specific blind spots.
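
As a rough illustration of consensus scoring, the sketch below applies a simple plurality rule over several model votes per question and reports Top-1 and Top-4 accuracy; the votes and the ranking rule are hypothetical and do not reproduce the Sully MedCon-1 mechanism.

```python
# Sketch: plurality consensus across an ensemble of models, plus Top-k accuracy.
from collections import Counter

def consensus_ranking(votes: list[str]) -> list[str]:
    """Rank answer options by how many models chose them (ties broken arbitrarily)."""
    return [option for option, _ in Counter(votes).most_common()]

questions = [
    # (votes from an ensemble of models, correct option) -- illustrative values
    (["C", "C", "B", "C", "A"], "C"),
    (["A", "B", "B", "D", "B"], "A"),
]

top1 = top4 = 0
for votes, gold in questions:
    ranking = consensus_ranking(votes)
    top1 += int(ranking[0] == gold)
    top4 += int(gold in ranking[:4])

n = len(questions)
print(f"Top-1: {top1 / n:.0%}, Top-4: {top4 / n:.0%}")
```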

6. Methodological Innovations and Directions for Extension

MedXpertQA-Multi is positioned as a reference standard for evaluating future clinical QA systems. Methodological best practices include:

  • Rigorous data synthesis to prevent LLM memorization and overfitting (see the leakage-check sketch after this list)
  • Multi-modal item construction with enforced difficulty, verified by both AI and human experts in iterative review cycles
  • Strict separation of subsets (Reasoning/Understanding, Text/Image-guided) for fine-grained error analysis
  • Modular evaluation protocols facilitating assessment of ensemble, adaptive, and purely automatic systems
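
The leakage-defense metrics cited in Section 1 (ROUGE-L, edit distance) can be computed with a short similarity check between each original item and its rewrite. The sketch below uses an LCS-based ROUGE-L recall and a normalized Levenshtein distance; the example strings and any implied thresholds are illustrative, not the authors' exact tooling.

```python
# Sketch: quantify similarity between an original exam item and its LLM rewrite.
# Low ROUGE-L recall and high normalized edit distance suggest lower leakage risk.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(original: str, rewrite: str) -> float:
    ref, hyp = original.split(), rewrite.split()
    return lcs_length(ref, hyp) / len(ref) if ref else 0.0

def normalized_edit_distance(a: str, b: str) -> float:
    """Character-level Levenshtein distance divided by the longer string length."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[len(b)] / max(len(a), len(b), 1)

original = "A 45-year-old man presents with crushing chest pain radiating to the left arm."
rewrite = "A middle-aged patient reports severe substernal pain spreading to the arm."
print(f"ROUGE-L recall: {rouge_l_recall(original, rewrite):.2f}")
print(f"Normalized edit distance: {normalized_edit_distance(original, rewrite):.2f}")
```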

Extensions may leverage insights from related benchmarks: MedFrameQA (Yu et al., 22 May 2025) offers multi-image grounding and explicit temporal progression, while EHRXQA (Bae et al., 2023) demonstrates neural-symbolic integration for cross-modal table/image QA. Adapting these pipelines can further enrich MedXpertQA-Multi—e.g., composing multi-frame scenarios, implementing automated item difficulty calibration via in-the-loop filtering, and triaging unanswerable or ambiguous cases.

A plausible implication is that evolving MedXpertQA-Multi toward more realistic, tri-modal, and conversational settings will require unified vision-language-clinical modeling, calibrated uncertainty estimation, and continuous human-in-the-loop validation.

7. Impact, Limitations, and Prospects

MedXpertQA-Multi elevates the difficulty and breadth of medical QA evaluation, capturing core challenges in vision-language reasoning, differential diagnosis, and complex management decision-making. Its multifaceted integration of clinical images, structured data, and expert review standards advances the field’s capacity to benchmark both automatic and hybrid human-in-the-loop systems.

Limitations include possible non-generalizability due to sourcing (e.g., reliance on US/EU board exams, specific image sets), absence of free-form or conversational response formats, and current lack of protocols for uncertain or non-answerable queries. State-of-the-art models, despite exceeding human expert accuracy on benchmarked tasks, are still constrained by laboratory test conditions—real clinical deployments require enhanced risk calibration, human validation, and prospective trials.

Continued development of MedXpertQA-Multi and related benchmarks is needed to drive progress on clinically robust, explainable, and generalizable medical AI, supporting safe, effective clinical decision support and accelerating translation from research to practice.
