MedXpertQA: Clinical AI Benchmark

Updated 15 August 2025
  • MedXpertQA is a comprehensive benchmark designed to evaluate advanced medical reasoning in AI by integrating complex text and image-based clinical cases.
  • It employs multi-stage data curation, hierarchical filtering, and expert review to ensure high data quality and clinical realism.
  • The benchmark covers 17 specialties and 11 body systems, challenging models with exam-level questions and real-world clinical scenarios.

MedXpertQA is a comprehensive, expert-level benchmark for evaluating advanced medical knowledge and clinical reasoning in artificial intelligence systems. Designed to address limitations of prior medical QA datasets, MedXpertQA integrates a wide range of real-world clinical scenarios—including multimodal, image-based cases and high-difficulty specialty exam questions—to rigorously assess both text-only and multimodal medical reasoning. This benchmark has catalyzed the development and evaluation of cutting-edge systems targeting human-expert or super-human performance in clinical decision support.

1. Benchmark Composition and Structure

MedXpertQA consists of 4,460 questions spanning 17 distinct medical specialties and covering 11 human body systems (Zuo et al., 30 Jan 2025). The dataset is divided into two principal subsets:

  • Text Subset: Focuses on text-only multiple-choice questions sourced from standardized examinations (USMLE, COMLEX, specialty board exams) and advanced medical textbooks. These questions encompass core clinical knowledge, subspecialty nuances, and complex diagnostic scenarios.
  • MM (Multimodal) Subset: Contains 2,005 challenging exam-style cases, richly augmented with 2,839 images. Image modalities include radiology (plain films, CT, MRI), pathology, medical optical images, diagrams, charts, tables, and structured clinical documents. Questions in this subset are constructed to replicate real-world clinical workflows and are accompanied by detailed patient records or laboratory findings.

All items are comprehensively annotated across multiple axes: medical specialty, body system, primary task (Diagnosis, Treatment Planning, Basic Science), and fine-grained subtask (differential diagnosis, etiology, etc.), leveraging both LLM annotation and expert domain knowledge (Zuo et al., 30 Jan 2025).
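
For illustration, a single annotated item can be pictured along these axes roughly as follows; the field names and values below are assumptions for readability, not the dataset's released schema.

```python
# Illustrative annotation record; keys and values are hypothetical,
# not the actual MedXpertQA schema.
example_item = {
    "id": "mm-0001",
    "subset": "MM",                    # "Text" or "MM"
    "specialty": "Cardiology",         # one of 17 specialties
    "body_system": "Cardiovascular",   # one of 11 body systems
    "task": "Diagnosis",               # Diagnosis / Treatment Planning / Basic Science
    "subtask": "Differential diagnosis",
    "question_type": "Reasoning",      # Reasoning vs. Understanding
    "question": "...",                 # stem with history, exam, and lab findings
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "images": ["case_0001_cxr.png"],
    "answer": "C",
}
```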

2. Data Curation, Filtering, and Synthesis

MedXpertQA implements a multi-stage data curation process to ensure question difficulty, accuracy, and novelty:

  • Data Synthesis: Three proprietary LLMs are used to paraphrase, rewrite, or augment both question stems and distractor options. This “question and option augmentation” reduces the risk of data leakage from training corpora and increases diversity in phrasing and distractors.
  • Hierarchical Filtering: An automated filtering stage first removes questions deemed too easy, as identified by repeated correct predictions across several leading LLMs. Next, a calibrated adaptive Brier score is employed:

$$B = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

where $N$ is the number of answer choices, $y_i$ is the ground-truth label (0/1) for option $i$, and $\hat{y}_i$ is the proportion of human responses selecting option $i$.

  • Similarity Filtering: MedCPT-based cosine similarity (derived from PubMedBERT embeddings) is used to eliminate near-duplicate or excessively similar question content; a minimal sketch of both automated filtering steps appears after this list.
  • Expert Review: Each question undergoes iterative review by licensed physicians. The review protocol targets factual errors, ambiguity, misleading distractors, and hallucinated content. Multiple revision cycles ensure high clinical validity and applicability (Zuo et al., 30 Jan 2025).
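
A minimal Python sketch of the two automated filtering stages is given below. The thresholds, the assumption of precomputed embeddings (e.g., from a MedCPT/PubMedBERT-style encoder), and the greedy deduplication rule are illustrative choices, not the authors' released implementation.

```python
import numpy as np

def brier_score(truth_onehot, response_props):
    """Brier score for one question: mean squared gap between the one-hot
    ground-truth vector and the distribution of human responses."""
    truth = np.asarray(truth_onehot, dtype=float)
    props = np.asarray(response_props, dtype=float)
    return float(np.mean((truth - props) ** 2))

def keep_question(llm_correct_flags, truth_onehot, response_props,
                  max_llm_correct=3, min_brier=0.05):
    """Hierarchical filter with illustrative thresholds: drop items that most
    reference LLMs already answer correctly, then drop items whose low Brier
    score marks them as too easy for human test-takers."""
    if sum(llm_correct_flags) > max_llm_correct:
        return False
    return brier_score(truth_onehot, response_props) >= min_brier

def deduplicate(embeddings, max_cosine=0.95):
    """Similarity filter: greedily keep a question only if its (precomputed)
    embedding is not too close to any already-kept question."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(emb):
        if all(float(vec @ emb[j]) < max_cosine for j in kept):
            kept.append(i)
    return kept
```

As a worked example of the score itself: for a four-option question whose correct answer A is chosen by 40% of respondents, with 30%, 20%, and 10% choosing the others, B = ((1 - 0.4)^2 + 0.3^2 + 0.2^2 + 0.1^2) / 4 = 0.125; a value near zero would instead indicate a question that almost everyone answers correctly.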

3. Coverage, Clinical Realism, and Task Diversity

Distinct from previous benchmarks (such as MedQA or MedMCQA), MedXpertQA emphasizes:

  • Extensive Specialty and Subdomain Coverage: 17 specialties, 11 body systems, and comprehensive task mapping ensure the dataset represents a wide spectrum of real practice, from primary care through advanced subspecialties.
  • Expert-Level and Real-World Scenarios: By incorporating official board-level questions and imaging-augmented cases, MedXpertQA aims for clinical fidelity in its scenarios. The MM subset, in particular, contains rich patient records and image-based reasoning requirements that more accurately simulate actual patient management.
  • Task Diversity: Annotations distinguish between basic knowledge (Understanding) and complex, multi-step clinical reasoning (Reasoning). Dedicated “reasoning-oriented” questions challenge models in differential diagnosis, treatment selection with co-morbidities, and integration of heterogeneous clinical findings—tasks relevant for real-world decision-making (Zuo et al., 30 Jan 2025).

4. Model Benchmarking and Performance Insights

A systematic evaluation protocol has been adopted, comparing 16+ state-of-the-art systems using both proprietary and open-source LLMs and LMMs:

  • Comparative Results: Even the best-performing models in early evaluations (e.g., GPT-4o, o1, QVQ-72B-Preview) exhibited substantial errors on reasoning-intensive or multimodal tasks. The most advanced models remain below 70% accuracy on the hardest subsets.
  • Ensemble and Consensus Approaches: Recent research demonstrates that ensemble-based consensus mechanisms, mimicking multidisciplinary clinical decision-making, further enhance accuracy on MedXpertQA, achieving 61.0% versus 53.5% for leading single-model systems (o3, Gemini 2.5 Pro) (2505.23075); a minimal consensus-voting sketch appears after this list.
  • Super-Human Model Advances: Notably, GPT-5 displayed super-human performance on the MedXpertQA MM subset, with reasoning and understanding scores improved by +29.26% and +26.18% over GPT-4o, and by +24.23% (reasoning) and +29.40% (understanding) over pre-licensed human experts (Wang et al., 11 Aug 2025).
  • Zero-Shot Chain-of-Thought Protocol: Evaluation adheres to a unified zero-shot CoT approach, prompting models to generate full rationales before answer selection. For multimodal items, structured JSON prompts combine text and image URLs so that models process all modalities cohesively, as sketched below.
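
As an illustration of the zero-shot CoT protocol, the sketch below assembles a request for one multimodal item in an OpenAI-style chat format; the prompt wording, JSON schema, field names, and the example case are assumptions for illustration, not the benchmark's released templates.

```python
import json

def build_cot_request(question, options, image_urls, model="gpt-4o"):
    """Assemble a zero-shot chain-of-thought prompt for a multimodal item:
    the model is asked for a full rationale before its final choice."""
    instruction = (
        "Think through the case step by step, explaining your reasoning, "
        "then state the single best answer as 'Final answer: <letter>'."
    )
    option_text = "\n".join(f"{k}. {v}" for k, v in options.items())
    content = [{"type": "text",
                "text": f"{instruction}\n\n{question}\n\n{option_text}"}]
    # Attach each image so text and images are processed in one message.
    content += [{"type": "image_url", "image_url": {"url": url}}
                for url in image_urls]
    return {"model": model, "messages": [{"role": "user", "content": content}]}

# Hypothetical item, for illustration only.
request = build_cot_request(
    question="A 54-year-old presents with acute chest pain. Imaging is shown.",
    options={"A": "Aortic dissection", "B": "Pulmonary embolism",
             "C": "Tension pneumothorax", "D": "Myocardial infarction"},
    image_urls=["https://example.org/case_0001_ct.png"],
)
print(json.dumps(request, indent=2))
```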
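
The ensemble results cited above rely on multi-model consensus; in its simplest form this reduces to majority voting over independent answers, sketched below as an illustrative baseline rather than the cited frameworks' actual deliberation mechanism.

```python
from collections import Counter

def consensus_answer(model_answers):
    """Majority vote over per-model answers; ties fall to the answer seen
    first (an illustrative tie-breaking rule)."""
    winner, _ = Counter(model_answers).most_common(1)[0]
    return winner

# Hypothetical votes from five independent models on one question.
print(consensus_answer(["B", "B", "A", "B", "D"]))  # -> B
```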

5. Technical Highlights, Data Access, and Reproducibility

  • Advanced Filtering and Sampling Algorithms: The filtering structure—combining LLM-based hardness assessments with adaptive Brier scoring and semantic similarity checks—ensures both coverage and robust challenge level.
  • Open-Source Release: Both the data and codebase are publicly available at https://github.com/TsinghuaC3I/MedXpertQA (Zuo et al., 30 Jan 2025), supporting transparency and reproducibility.
  • Protocols and Tools: Evaluation scripts, prompt templates (including structured JSON chain-of-thought formats), and comprehensive documentation are provided, enabling direct assessment and benchmarking by the broader research community.
  • Integration Pathways: The design allows direct interoperability with emerging frameworks for clinical visual question answering, ensemble consensus engines, and new large multimodal models.

6. Impact, Applications, and Future Directions

MedXpertQA has set a new standard for assessing reasoning, understanding, and generalization in medical AI systems:

  • Model Development: As the hardest widely used medical QA benchmark, it has become the principal testbed for new LLMs and LMMs targeting “expert-level” clinical reasoning, driving innovations demonstrated in systems like GPT-5 (Wang et al., 11 Aug 2025), MedGemma (Sellergren et al., 7 Jul 2025), and advanced ensemble frameworks (2505.23075).
  • Clinical Decision Support: The benchmark’s ability to discriminate between mere factual recall and genuine reasoning over complex, multimodal patient data positions it as a foundation for trustworthy clinical AI evaluation.
  • Research Opportunities: Open questions remain in bridging the gap between model and expert performance on the most difficult cases, handling rare or ambiguous presentations, and better integrating longitudinal and multimodal patient data as exemplified in related multimodal and progression datasets (e.g., MMXU (Mu et al., 17 Feb 2025)).
  • Extensibility: MedXpertQA’s modular design and annotation scheme facilitate extensions to dialog-based, open-ended, or non-English medical QA challenges, thereby broadening its utility as the field advances.

In summary, MedXpertQA provides a rigorous, multifaceted, and clinically realistic evaluation suite for expert-level medical question answering and reasoning, shaping both the design and benchmarking of next-generation clinical AI systems.
