
MedXpertQA-Text: Medical QA Benchmark

Updated 17 November 2025
  • MedXpertQA-Text is a board-level medical question answering benchmark offering 2,450 text-only multiple-choice items to assess expert clinical reasoning.
  • It employs a multi-stage process—including AI, human expert, and similarity filtering—to ensure clinical authenticity and high diagnostic challenge.
  • The benchmark’s rigorous evaluation protocols highlight current model limitations in multi-step reasoning, differential diagnosis, and biomedical synthesis.

MedXpertQA-Text is a board-level medical question answering benchmark designed to evaluate expert-level medical reasoning and advanced understanding. Developed as a subset of the broader MedXpertQA suite (Zuo et al., 30 Jan 2025), it provides a comprehensive set of multiple-choice exam items spanning the major medical specialties and body systems, and is engineered for the rigorous assessment of both human and machine performance in clinical diagnosis, management, and biomedical knowledge synthesis.

1. Dataset Composition and Coverage

MedXpertQA-Text consists of 2,450 board-level, text-only multiple-choice questions, each with ten plausible answer options. Items were curated from core licensing exams (USMLE and COMLEX-USA), 17 American specialty board examinations, and leading textbooks and review banks. The benchmark covers 17 American Board-certified specialties, including Internal Medicine, Pediatrics, Cardiology, Neurology, and Emergency Medicine, and is further stratified across 11 human body systems (Cardiovascular, Respiratory, Nervous, Digestive, Urinary, and others), following Liachovitzky (2015). No single system or specialty is under-represented; for example, Cardiovascular and Respiratory subjects each account for roughly 15% of the corpus, providing balanced anatomical and clinical coverage.

All questions in this split use a ten-option multiple-choice format; no open-response, free-text, or fill-in-the-blank items are present.
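
For orientation, a minimal sketch of how a single item might be represented is shown below. The field names and values are illustrative assumptions based on the description in this article, not the released file format.

```python
# Hypothetical record layout for one MedXpertQA-Text item; field names are
# assumptions for illustration and may not match the released data files.
example_item = {
    "id": "medxpertqa-text-000001",      # illustrative identifier
    "question": "A 3-year-old boy presents with a 'seal-like barking' cough ...",
    "options": {                          # always ten options, A through J
        "A": "Thumbprint sign",
        "B": "Lobar consolidation",
        "D": "Steeple sign",
        # ... remaining options C, E-J omitted for brevity ...
    },
    "label": "D",                         # single correct option letter
    "medical_task": "Diagnosis",          # Diagnosis / Treatment / Basic Science
    "body_system": "Respiratory",         # one of the 11 body systems
    "question_type": "Reasoning",         # "Reasoning" or "Understanding"
}
```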

2. Data Curation, Filtering, and Augmentation

A multi-stage process ensures that MedXpertQA-Text maintains clinical authenticity and difficulty:

  • AI Expert Filtering: Eight diverse open-source and proprietary LLMs and VLMs (including GPT-4o and Claude-3.5) attempted each candidate question multiple times. Questions that these models solved too easily or too consistently were excluded, retaining only items that remained difficult for all of them.
  • Human Expert Filtering: Candidate questions were distributed to practicing physicians and expert annotators. For each question, a Brier score was computed to quantify uncertainty and discriminability:

$$B = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

where $N$ is the number of options, $\hat{y}_i$ is the proportion of respondents selecting option $i$, and $y_i$ is the ground-truth indicator for option $i$. Adaptive filtering retained items with high expert-rated difficulty and sufficient annotation coverage (a minimal computation is sketched after this list).

  • Similarity Filtering: MedCPT-Query-Encoder was used to compute pairwise cosine similarity, removing near-duplicate pairs above a robust statistical threshold (75th percentile minus 1.5·IQR).
  • Augmentation: To mitigate public source leakage and improve distractor quality, all questions and answer options were rewritten and expanded via prompt engineering with large LLMs (GPT-4o, Claude). Options lacking clinical plausibility or suffering from semantic overlap were replaced. Manual audits by physician-reviewers corrected hallucinations or mis-phrasings.
  • Leakage Analysis: Instance-level perplexity (PPL) and N-gram similarity metrics (ROUGE-L, edit distance) were analyzed pre- and post-augmentation. Post-intervention, PPL increased substantially (from $1.03 \times 10^{218}$ to $1.35 \times 10^{247}$), confirming effective leakage reduction.
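
The expert-difficulty filter above can be made concrete in a few lines. The sketch below computes the Brier score for a single question from physician response counts, assuming a simple counts-per-option input; the retention threshold is an illustrative placeholder, not a value reported by the authors.

```python
import numpy as np

def brier_score(response_counts: dict, correct: str) -> float:
    """Brier score for one question: mean squared gap between the observed
    proportion of annotators choosing each option and the one-hot gold label."""
    options = sorted(response_counts)
    total = sum(response_counts.values())
    y_hat = np.array([response_counts[o] / total for o in options])   # choice proportions
    y = np.array([1.0 if o == correct else 0.0 for o in options])     # ground-truth indicator
    return float(np.mean((y - y_hat) ** 2))

# Ten-option item answered by 20 annotators; the gold option (D) attracted
# only a minority of votes, so the score is relatively high (item is hard).
counts = {"A": 6, "B": 3, "C": 2, "D": 4, "E": 1,
          "F": 1, "G": 1, "H": 1, "I": 1, "J": 0}
score = brier_score(counts, correct="D")   # ~0.078

BRIER_THRESHOLD = 0.05   # placeholder cut-off for "high expert-rated difficulty"
keep_item = score >= BRIER_THRESHOLD
```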

3. Reasoning-Focused Subset

Approximately 60% of MedXpertQA-Text items are classified as "Reasoning" questions. Annotation was performed by GPT-4o, distinguishing multi-step clinical inference and synthesis (e.g., differential diagnosis, lab integration, test selection) from "Understanding" items targeting single-fact recall, anatomy, or terminology. The reasoning subset probes challenges such as:

  • Multi-step differential diagnosis and confirmatory testing
  • Epidemiological study design controlling for confounders
  • Pathophysiological integration across clinical stem and lab cascade

These distinctions enable the benchmark to transcend superficial fact recall, requiring genuine application and synthesis in clinical decision-making.

4. Evaluation Protocols and Metric Definitions

MedXpertQA-Text uses a five-question few-shot "dev" set and a 2,445-question test set; no dedicated training partition is included. The evaluation is conducted under a zero-shot protocol unless otherwise noted.
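
As one concrete way to run the zero-shot protocol, the sketch below builds a plain ten-option prompt and extracts a letter answer from free-form model output. The prompt wording and the answer-parsing regex are assumptions for illustration, not the official evaluation harness.

```python
import re

OPTION_LETTERS = "ABCDEFGHIJ"

def build_prompt(question: str, options: list) -> str:
    """Format a ten-option MedXpertQA-Text item as a zero-shot prompt (illustrative wording)."""
    lines = [question, ""]
    for letter, text in zip(OPTION_LETTERS, options):
        lines.append(f"{letter}. {text}")
    lines += ["", "Answer with the single letter of the correct option."]
    return "\n".join(lines)

def extract_answer(model_output: str):
    """Return the first standalone option letter (A-J) in the model's reply, or None."""
    match = re.search(r"\b([A-J])\b", model_output.strip())
    return match.group(1) if match else None
```

Scoring then reduces to comparing the extracted letter against the gold label for each of the 2,445 test items.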

Primary metrics:

  • Accuracy (top-1): Proportion of items for which the model predicts the single correct answer:

$$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{\hat{y}_i = y_i\}$$

  • Weighted F1 (optional): Accounts for class imbalance (diagnosis, treatment, basic science), computed as:

$$\text{Weighted-}F_1 = \sum_{c=1}^{C} \frac{N_c}{N} \cdot \frac{2 P_c R_c}{P_c + R_c}$$

where $N_c$ is the number of items of class $c$, and $P_c$, $R_c$ are the precision and recall for class $c$. A short computation sketch is given below.
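
Both metrics reduce to a few library calls once predicted and gold option letters (and, for the weighted F1, a class label) are available. The sketch below uses scikit-learn on toy lists; whether the benchmark weights by answer label or by question category (diagnosis, treatment, basic science) is an interpretation, so treat this as one reasonable reading of the formula.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy gold and predicted option letters for five items.
gold = ["D", "C", "A", "B", "D"]
pred = ["D", "C", "B", "B", "A"]

top1_accuracy = accuracy_score(gold, pred)        # fraction of exact matches (0.6 here)

# Weighted F1: per-class F1 scores averaged with weights proportional to class support,
# matching the formula above. Here the "classes" are the gold labels themselves.
weighted_f1 = f1_score(gold, pred, average="weighted", zero_division=0)
```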

Benchmark performance is summarized in the following table (from Table 4), which averages over Reasoning and Understanding slices:

Model                     Accuracy (%)
o1 (10-option CoT)        44.67
DeepSeek-R1               37.76
GPT-4o                    35.96
LLaMA-3.3-70B             24.49
DeepSeek-V3               24.16
Qwen2.5-72B-Instruct      18.90
Claude-3.5-Haiku          17.76
QwQ-32B                   18.00
Qwen2.5-32B               15.06

On reasoning-only items, all models drop by 6–10 points, illustrating the challenge posed by genuine clinical inference.

5. Comparison to Prior Medical QA Benchmarks

MedXpertQA-Text is designed to exceed the scope and difficulty of preceding datasets:

  • MedQA-USMLE (1,273 items, 4–5 options) and MedMCQA (4,183 items) focus primarily on general medical knowledge; top models now exceed 95% on these datasets, indicating insufficient discriminative power.
  • PubMedQA and MMLU-Medical probe literature-based or recall-centric questions but lack comprehensive specialty coverage and multi-step reasoning.
  • MedXpertQA-Text supplies 2,450 newly curated, specialty-diverse, clinically realistic vignettes—each with information-rich stems (mean length: 257 tokens vs. 215 for MedQA-USMLE) and ten plausible distractors per item. It includes a dedicated reasoning subset and strict filtering, resulting in all leading models (including GPT-4o and o1) performing below 50% accuracy.

6. Illustrative Examples and Task Characteristics

Questions on MedXpertQA-Text cover domains such as diagnosis, treatment, and basic science, formatted as ten-option multiple-choice items. Two representative examples:

  • Reasoning Example:

Question: A 3-year-old boy presents with a "seal-like barking" cough, inspiratory stridor, low-grade fever, and intercostal retractions. Suspecting viral croup, which finding is most characteristic on a frontal neck X-ray?

Answer: (D) Steeple sign. Rationale: Subglottic narrowing in croup produces a "steeple" appearance on the AP view; the thumbprint sign, in contrast, is seen on the lateral view in epiglottitis.

  • Understanding Example:

Question: A researcher matches 75 BRUE infants and 75 controls by age, SES, and family history, then compares prior URIs. Which design element controls confounding?

Answer: (C) Matching. Rationale: Matching on potential confounders holds those variables constant across groups; randomization is not available in a retrospective design.

The average stem length, distractor complexity, and rigorous post-processing distinguish MedXpertQA-Text from earlier resources.

7. Significance for Model Development and Assessment

MedXpertQA-Text is the first board-level QA benchmark to systematically sample from 17 specialties and 11 organ systems, with content validated and diversified to minimize information leakage and optimize clinical realism. Evaluation on this benchmark exposes the limits of current LLMs (including GPT-4o), especially in clinical reasoning, differential diagnosis, and multi-step inference.

The dataset facilitates fine-grained analysis of both correctness and reasoning quality—essential for the development of next-generation clinical text reasoning models, ensemble approaches (e.g., consensus mechanisms (2505.23075)), and chain-of-thought evaluators. As illustrated by results from (Wang et al., 11 Aug 2025), continuous benchmarking on MedXpertQA-Text informs advances in clinical decision-support systems and highlights the importance of calibration, interpretability, and robust biomedical understanding in both zero-shot and few-shot settings.
