MedQA (USMLE) Benchmark Overview
- MedQA (USMLE) is a benchmark of clinical vignettes requiring multi-hop reasoning, derived from official medical licensure exams.
- It supports evaluation of diverse models, from retrieval-based systems to domain-adapted fine-tuning and chain-of-thought LLMs.
- Results from MedQA drive advancements in automated clinical reasoning while highlighting challenges in evidence integration and error handling.
MedQA (USMLE) is a large-scale, multilingual, high-difficulty medical question answering benchmark derived from professional medical licensure examinations, with a particular focus on the English-language United States Medical Licensing Examination. It serves as a primary evaluation standard for automated medical reasoning models, encompassing multi-step clinical vignettes and diagnostic reasoning, and has become the de facto benchmark for assessing the capabilities of pre-trained LLMs, retrieval-based systems, and hybrid neural-symbolic architectures in expert-level medical QA settings.
1. Dataset Construction and Characteristics
MedQA was originally introduced as a free-form, multiple-choice OpenQA benchmark compiled from the official USMLE, the Mainland China Medical Licensing Examination (MCMLE), and the Taiwan Medical Licensing Examination (TWMLE) (Jin et al., 2020). The English USMLE subset contains 12,723 questions split into 10,178 training, 1,272 development, and 1,273 test instances. Each question is paired with four candidate answers (reduced from five in the source where necessary) and a single gold label.
USMLE MedQA questions predominantly assess complex multi-step reasoning (98% clinical vignettes, 2% single-fact recall), requiring the integration of patient history, laboratory findings, imaging/pathological data, and specialist-level knowledge in areas such as pathology, pharmacology, and genetics. The document collection for USMLE includes 18 authoritative medical textbooks (12.7 million tokens), indexed and used for retrieval-based answering (Jin et al., 2020).
Key dataset properties:
- Four-choice, single-answer per question, with randomized option order.
- High prevalence of "Type 2" multi-hop reasoning.
- High evidence requirement: 88% of USMLE items were judged "covered" by the available documents by MD annotators; the remainder demand knowledge integration or inference beyond the text.
- Stringent evaluation: accuracy is the primary metric, scored as exact match to the gold-standard option (see the sketch below).
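To make the format concrete, the sketch below shows an illustrative record in the question/options/answer_idx layout used by common MedQA releases (the field names are assumptions, not a spec) together with the exact-match accuracy computation.

```python
# Illustrative MedQA-style record and exact-match scoring.
# Field names ("question", "options", "answer_idx") follow commonly
# released JSONL layouts; treat them as assumptions, not a spec.
record = {
    "question": "A 27-year-old man presents with fever, fatigue, and a new "
                "holosystolic murmur ... Which is the most likely diagnosis?",
    "options": {"A": "Infective endocarditis", "B": "Rheumatic fever",
                "C": "Viral myocarditis", "D": "Acute pericarditis"},
    "answer_idx": "A",  # single gold label per question
}

def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Primary MedQA metric: exact match of the chosen option letter."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

print(accuracy(["A", "C", "B"], ["A", "C", "D"]))  # 0.666...
```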
2. Modelling Paradigms and State-of-the-Art Methods
MedQA (USMLE) catalyzed a diverse methodological landscape, including document retrieval plus neural reading comprehension (Jin et al., 2020; Zhang et al., 2018), Transformer-based language modeling, domain adaptation, knowledge graph integration, and large-scale chain-of-thought prompting.
2.1 Retrieval-Augmented and Reading Comprehension Models
Baseline architectures combined IR techniques (BM25 and re-weighted variants) for document selection with context-aware neural readers (BiGRU, max-out, and BERT-derived models). The best 2020–2022 neural readers (BioBERT-Large) plateaued at ~36.7% test accuracy on English USMLE, and retrieval was identified as the principal bottleneck: top-25 paragraph retrieval covered only 24% of USMLE items (Jin et al., 2020). SeaReader introduced dual-path attention and document gating, achieving 75.3% on the Chinese MedQA subset, but these results do not transfer to the English USMLE subset (Zhang et al., 2018).
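As a concrete illustration of the first stage of this pipeline, the hedged sketch below uses the rank_bm25 package to select top-k textbook paragraphs for a question; the toy corpus is a placeholder for the 18-textbook collection, and the neural reader that consumes the passages is omitted.

```python
# Minimal BM25 retrieval stage for an IR-then-read pipeline.
# The corpus stands in for the textbook collection; the reader model
# (BiGRU, BERT, etc.) that scores options against passages is omitted.
from rank_bm25 import BM25Okapi

corpus = [
    "Infective endocarditis typically presents with fever and a new murmur ...",
    "Rheumatic fever follows group A streptococcal pharyngitis ...",
]
bm25 = BM25Okapi([p.lower().split() for p in corpus])

def retrieve(question: str, k: int = 25) -> list[str]:
    """Return the top-k paragraphs for a question (top-25, as in the baseline)."""
    return bm25.get_top_n(question.lower().split(), corpus, n=k)

print(retrieve("fever and new murmur", k=1)[0][:40])
```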
2.2 LLMs and Prompt-Based Paradigms
The introduction of LLMs dramatically advanced performance. Prompt engineering, especially chain-of-thought (CoT) and few-shot CoT, yields substantial gains (a self-consistency sketch follows this list):
- GPT-3.5, Codex: self-consistency ensemble (k=100, 5-shot CoT) attains 60.2% test accuracy; zero-shot CoT yields 46.1%; direct prompting 46.0% (Liévin et al., 2022).
- Open-source LLMs (Llama-2 70B): 5-shot CoT + self-consistency yields 62.5%, surpassing the physician passing bar (60%) (Liévin et al., 2022).
- Prompt ensembling, temperature tuning, and retrieval-augmented prompting add incremental improvements (~1–3 percentage points with basic retrieval).
- Current SOTA: GPT-4 ~81.4% (Bhatti et al., 2023), GPT-4o 87.7%–90.9% without hallucination mitigation (Garcia-Fernandez et al., 10 Jun 2025), GPT-5 95.2% (Wang et al., 11 Aug 2025).
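The self-consistency recipe behind several of these numbers is simple to sketch: sample k chain-of-thought completions at nonzero temperature, parse an option letter from each, and return the majority vote. `sample_cot` below is a placeholder for any LLM sampling call, and the answer-extraction regex is an assumption.

```python
import re
from collections import Counter

def extract_answer(completion: str) -> str | None:
    """Pull an option letter from a CoT completion (pattern is illustrative)."""
    m = re.search(r"answer is\s*\(?([A-D])\)?", completion, re.IGNORECASE)
    return m.group(1).upper() if m else None

def self_consistency(sample_cot, prompt: str, k: int = 100) -> str:
    """Majority vote over k sampled chain-of-thought completions."""
    votes = Counter()
    for _ in range(k):
        answer = extract_answer(sample_cot(prompt, temperature=0.7))
        if answer:
            votes[answer] += 1
    return votes.most_common(1)[0][0]
```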
2.3 Domain-Adaptive Fine-Tuning
Domain adaptation via fine-tuning on specialist corpora achieves robust improvements, especially for small and medium LLMs (a minimal fine-tuning sketch follows this list):
- MedGemma 4B (USMLE): domain tuning yields +6.8 pp gain (53.3% vs. 46.4%, p < 10⁻⁴) over general-purpose Gemma 4B (Buskila, 26 Apr 2026).
- SM70 (Llama-2 70B + MedAlpaca): 60.8%, a clear +7.2 point gain over GPT-3.5 (53.6%), but still ~20 points behind GPT-4 (Bhatti et al., 2023).
- MedMobile (phi-3-mini, 3.8B): +8.4 points from medical SFT; reaches 75.7%, outperforming Flan-PaLM 540B (67.6%) (Vishwanath et al., 2024).
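A minimal causal-LM fine-tuning loop in this spirit, using the Hugging Face Trainer API, is sketched below; the base model, data file, and hyperparameters are illustrative placeholders, not the recipes of the cited papers.

```python
# Hedged domain-SFT sketch: continue training a small causal LM on a
# medical instruction/QA corpus. Model name, data file, and
# hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "google/gemma-2b"  # stand-in for the small models discussed
tok = AutoTokenizer.from_pretrained(model_name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

ds = load_dataset("json", data_files="medical_sft.jsonl")["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=1024),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", num_train_epochs=1,
                           per_device_train_batch_size=2, learning_rate=2e-5),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```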
2.4 Retrieval-Augmented Generation (RAG) and Iterative RAG
RAG appends relevant passages to the prompt or uses them during answer generation (an iterative-RAG sketch follows this list):
- For small (4B) models, RAG generally fails to provide statistically significant gains on USMLE (Δ<2%, p>0.05) and may slightly decrease accuracy in domain-adapted setups (Buskila, 26 Apr 2026).
- On large models or more powerful pipelines, vanilla RAG improves test performance by up to 3 points, but iterative RAG (i-MedRAG) that enables multi-turn, adaptive information-seeking and multi-hop reasoning achieves stronger results: 69.7% for GPT-3.5 zero-shot, besting all prior few-shot and fine-tuning methods (Xiong et al., 2024).
- In MedMobile, RAG reduced accuracy by 12.6% due to context window overload (Vishwanath et al., 2024).
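The core loop that distinguishes iterative from vanilla RAG can be sketched as follows; `llm` and `retrieve` are placeholder callables, and the prompts are illustrative rather than the exact i-MedRAG templates.

```python
# Hedged sketch of an iterative (multi-turn) RAG loop: the model proposes
# follow-up queries, evidence accumulates across rounds, and only then is
# the question answered. Prompts and round count are illustrative.
def iterative_rag(llm, retrieve, question: str, n_rounds: int = 3) -> str:
    evidence: list[str] = []
    for _ in range(n_rounds):
        query = llm(
            f"Question: {question}\n"
            f"Evidence so far: {evidence}\n"
            "Write one follow-up search query that would most help."
        )
        evidence.extend(retrieve(query, k=3))  # adaptive information-seeking
    return llm(
        f"Question: {question}\nEvidence: {evidence}\n"
        "Answer with a single option letter."
    )
```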
2.5 Reasoning-Selective and Chain-of-Thought Techniques
Always-on chain-of-thought boosts interpretability and accuracy but incurs significant inference latency and token cost. Selective CoT, which learns to generate rationales only when needed, achieves 64.0% on MedQA (Llama-3.1-8B), reducing latency by 29% for just a 0.6-point accuracy drop (Zhan et al., 23 Feb 2026).
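A gate of this kind can be sketched as below; both model callables and the confidence signal are hypothetical stand-ins (the cited work trains the model to decide when to reason, whereas this sketch uses a simple confidence threshold as a proxy).

```python
# Hedged selective-CoT sketch: answer directly when confident, fall back
# to full chain-of-thought otherwise. `answer_direct` and `answer_with_cot`
# are hypothetical callables; the threshold gate is a simple proxy for the
# learned gating in the cited work.
def selective_cot(answer_direct, answer_with_cot,
                  question: str, threshold: float = 0.9) -> str:
    answer, confidence = answer_direct(question)  # e.g., option logprob
    if confidence >= threshold:
        return answer                 # cheap path: no rationale generated
    return answer_with_cot(question)  # pay the token cost only when needed
```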
2.6 Knowledge Graph-Driven Hybrids and Knowledge Injection
Structured knowledge models (Kformer, QAT, GreaseLM) augment Transformer representations with explicit injection of medical KGs or multi-hop relation tokens:
- QAT (Meta-Path token construction and relation-aware self-attention): 39.3% (Park et al., 2022)
- Kformer (FFN-level injection): 30.0% (Yao et al., 2022)
- GreaseLM (cross-modal fusion): 38.5% (Zhang et al., 2022)

These outperform non-KG baselines by 1–2 points in this regime but have been surpassed by LLM-centric, CoT-based pipelines.
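To make the injection idea concrete, the PyTorch sketch below adds retrieved knowledge embeddings inside a feed-forward sublayer, roughly in the spirit of Kformer's FFN-level injection; the dimensions, projection, and attention mixing are illustrative assumptions, not the published architecture.

```python
# Hedged sketch of FFN-level knowledge injection (Kformer-style):
# knowledge embeddings are projected into the FFN hidden space and mixed
# into the hidden states via attention. All shapes are illustrative.
import torch
import torch.nn as nn

class KnowledgeFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int, d_know: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)
        self.k_proj = nn.Linear(d_know, d_ff)  # knowledge -> FFN hidden space

    def forward(self, x: torch.Tensor, know: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); know: (batch, n_facts, d_know)
        h = torch.relu(self.w1(x))                           # (b, seq, d_ff)
        k = self.k_proj(know)                                # (b, n_facts, d_ff)
        attn = torch.softmax(h @ k.transpose(1, 2), dim=-1)  # (b, seq, n_facts)
        h = h + attn @ k                # add knowledge-weighted mixture
        return self.w2(h)

ffn = KnowledgeFFN(d_model=768, d_ff=3072, d_know=128)
out = ffn(torch.randn(2, 16, 768), torch.randn(2, 8, 128))
print(out.shape)  # torch.Size([2, 16, 768])
```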
2.7 Reasoning Distillation and Small Model Compression
Knowledge-Augmented Reasoning Distillation (KARD) enables small models (T5-250M, 780M) to match or surpass larger LM baselines by using high-quality LLM rationales and Wikipedia-augmented, reranked knowledge at both training and inference time, reaching 44.6% for T5-780M versus 39.1% for fine-tuned 3B models (Kang et al., 2023).
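The training-example construction at the heart of this recipe can be sketched simply: condition the small seq2seq model on the question plus reranked passages, and train it to emit the teacher's rationale. Field names and the prompt layout below are illustrative assumptions.

```python
# Hedged sketch of a KARD-style distillation example: the small model
# learns to generate the teacher LLM's rationale given the question plus
# retrieved, reranked knowledge. Prompt layout is illustrative.
def build_distillation_example(question: str, options: dict[str, str],
                               passages: list[str],
                               teacher_rationale: str) -> dict[str, str]:
    opts = " ".join(f"({k}) {v}" for k, v in options.items())
    source = (f"Question: {question}\nOptions: {opts}\n"
              f"Knowledge: {' '.join(passages)}\n"
              "Reason step by step and answer:")
    # Target is the teacher's rationale, which ends in the gold option.
    return {"input": source, "target": teacher_rationale}
```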
3. Error Taxonomy and Hallucination Analysis
High-performing LLMs, notably GPT-4, exhibit error rates of about 13.3–14% on the MedQA-USMLE dataset; Med-PaLM 2 achieves a similar profile (Roy et al., 2024). Annotated qualitative studies reveal:
- 49% of strict-criterion errors are reasoning-based, including "sticking with the wrong diagnosis" (26.9%), vague or incorrect conclusions (22.2%), and ignoring missing evidence.
- Knowledge-based errors (unsupported claims, rare non-medical slips) make up ~13%, while hallucination and reading comprehension errors are rarer.
- 28.8% of GPT-4's errors are judged "reasonable responses," reflecting ambiguous or borderline items that stump even clinicians.
Advanced hallucination detection and suppression frameworks, such as CHECK (continuous-learning, information-theoretic), reduce hallucination from 31% to 0.3% on open LLMs and boost GPT-4o USMLE accuracy from 87.7% to 92.1% with chain-of-thought and selective majority voting (Garcia-Fernandez et al., 10 Jun 2025).
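While CHECK itself is a continuous-learning framework, its information-theoretic flavor can be illustrated with a simple proxy: sample several answers and flag high answer-distribution entropy as hallucination risk. The signal and threshold below are assumptions, not the published detector.

```python
# Hedged, information-theoretic proxy for hallucination flagging: high
# entropy over sampled answers signals unreliable output. Threshold is
# illustrative; the CHECK framework itself is considerably richer.
import math
from collections import Counter

def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (bits) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def flag_hallucination_risk(answers: list[str], threshold: float = 1.0) -> bool:
    return answer_entropy(answers) > threshold

print(flag_hallucination_risk(["A", "A", "B", "A"]))  # entropy ≈ 0.81 -> False
```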
4. Open-Ended and Multimodal Extensions
Recent work extends MedQA (USMLE) to open-ended (no-options) and multimodal variants:
- MedQA-OPEN/No-Opt eliminates the answer set, requiring free-text diagnosis or management decisions. Prompting with incremental chain-of-thought (CLINICR) and reward-verifier models closes most of the gap to MCQ performance; agreement with expert-annotated answers is ~80–89% for best approaches (Nachane et al., 2024).
- DERA formalizes dialog-enabled resolving agents and demonstrates GPT-4 and dialogic ensemble performance stabilizing at 74.4–74.6% in open-ended MedQA, well above the 60% passing bar but with little marginal improvement from dialog on ultra-short QA (Nair et al., 2023).
5. Comparative Benchmarks and Clinical Relevance
Performance on MedQA (USMLE) is now one of the most prominent markers for "medical expert-level" AI, but recent benchmarks highlight its limits:
- MedXpertQA, released in 2025, substantially increases challenge and clinical specialty coverage—top models barely reach 49% accuracy, with expanded option sets (10 per text question), real images, and rigorous AI+human filtering (Zuo et al., 30 Jan 2025).
- Zero-shot GPT-5 achieves >95% average accuracy on USMLE self-assessments, overtaking GPT-4o by almost 3 points, with multimodal and chain-of-thought input protocols (Wang et al., 11 Aug 2025).
- MedQA remains the most widely used medical QA benchmark for core reasoning, but does not yet cover the full spectrum of board-style specialty, multimodal, or longitudinal patient tasks seen in clinical practice.
MedQA-USMLE Benchmark: Selected Model Accuracies
| Model/System | Parameters | Accuracy (%) | Reference |
|---|---|---|---|
| BioBERT-Large (Reader only) | 345M | 36.7 | (Jin et al., 2020) |
| QAT (KG-Aware Transformer) | 110M | 39.3 | (Park et al., 2022) |
| Codex 5-shot CoT + SC ensemble | 175B | 60.2 | (Liévin et al., 2022) |
| Llama-2 70B, CoT + ensemble | 70B | 62.5 | (Liévin et al., 2022) |
| SM70 (Llama 2 70B, MedAlpaca) | 70B | 60.8 | (Bhatti et al., 2023) |
| MedGemma 4B (domain-tuned) | 4B | 53.3 | (Buskila, 26 Apr 2026) |
| MedMobile (phi-3-mini SFT+CoT) | 3.8B | 75.7 | (Vishwanath et al., 2024) |
| GPT-3.5, ensemble | 175B | 60.2 | (Liévin et al., 2022) |
| GPT-4 (5-shot CoT) | undisclosed | 81.4 | (Bhatti et al., 2023) |
| GPT-4o + CoT + CHECK | undisclosed | 92.1 | (Garcia-Fernandez et al., 10 Jun 2025) |
| GPT-5 (zero-shot CoT) | undisclosed | 95.2 | (Wang et al., 11 Aug 2025) |
6. Methodological Lessons and Future Directions
- Domain fine-tuning remains critical at low parameter counts and is more effective than context-level retrieval for small models (Buskila, 26 Apr 2026).
- Retrieval-augmented methods excel chiefly when retrieval and reasoning are both multi-step, adaptive, and tightly integrated (i-MedRAG) (Xiong et al., 2024).
- Chain-of-thought and self-consistency/ensemble strategies robustly boost medical LLM accuracy and calibration, with marginal gains from advanced selection/reranking and prompt engineering (Liévin et al., 2022).
- Failure cases are dominated by anchoring errors, insufficient uncertainty communication, or missing/incomplete evidence in the model's context (Roy et al., 2024).
- Scaling to more advanced, specialty-diverse, and multimodal benchmarks (MedXpertQA) will be pivotal for further progress and closing the human–AI clinical reasoning gap (Zuo et al., 30 Jan 2025).
7. Reproducibility, Open Resources, and Benchmark Impact
- Official implementations, evaluation scripts, and query traces are routinely released for reproducibility, e.g., experiment JSONL and code for the Gemma/MedGemma comparison (Buskila, 26 Apr 2026), MedQA-OPEN (Nachane et al., 2024), KARD (Kang et al., 2023), GPT-4 error annotations (Roy et al., 2024), and DERA open-answer splits (Nair et al., 2023).
- The MedQA (USMLE) benchmark continues to set the standard for research in medical QA, catalyzing advances in modeling, reasoning, interpretability, and clinical AI safety, and serving as a reference point for new datasets and methods in medical AI.