MedQA Clinical QA Benchmark

Updated 3 July 2026

MedQA is a comprehensive benchmark derived from board exam questions, evaluating NLP models' clinical reasoning and factual recall.
The task is multi-option open-domain QA; the original baselines pair BM25 retrieval with Transformer-based readers to select the correct answer.
Advances such as dual-path attention and retrieval augmentation boost performance, while challenges remain in multi-turn reasoning and interpretability.

MedQA is a family of challenging large-scale question answering (QA) benchmarks and associated tasks designed to evaluate the factual knowledge and clinical reasoning capabilities of NLP models in the medical domain. Originating as a multiple-choice, open-domain QA corpus derived from professional medical licensing exams, MedQA and its derivatives have become primary benchmarks for clinical LLMs, spawning extensive research on retrieval-augmented reasoning, safety, explanation, and robustness.

1. Benchmark Construction and Dataset Structure

MedQA, as originally introduced, is compiled from real-world board exam questions and covers English (USMLE), simplified Chinese (MCMLE), and traditional Chinese (TWMLE) (Jin et al., 2020). For the English version (USMLE):

Scale: 12,723 questions (10,178 train, 1,272 dev, 1,273 test)
Question format: Multi-sentence clinical vignettes or factual queries, with 4 options each (A–D), exactly one correct.
Topics: Comprehensive coverage including internal medicine, surgery, pediatrics, psychiatry, pharmacology, pathology, and more; USMLE is 98% clinical vignettes.
Evidence corpus: 18 major medical textbooks OCRed into 231,581 paragraphs (12.7M tokens, English).
Annotation: Board-certified clinicians audited sample splits, confirming ~88% of USMLE items are answerable from the curated corpus.

MedQA’s non-English subsets (MCMLE, TWMLE) are comparable in scale and coverage, drawing evidence from 33 and 18 textbooks, respectively.
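As a minimal sketch of the structure described above, one item and the English split sizes could be represented as follows. The class name `MedQAItem` and its field names are illustrative assumptions, not the schema of the released files; only the split counts come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MedQAItem:
    """One multiple-choice item (hypothetical field names, not the released schema)."""
    question: str
    options: dict[str, str]  # option letter -> option text
    answer: str              # letter of the single correct option

    def __post_init__(self):
        # The English subset described above has 4 options, exactly one correct.
        if len(self.options) != 4:
            raise ValueError("expected exactly 4 options")
        if self.answer not in self.options:
            raise ValueError("answer must be one of the option letters")


# Split sizes reported for the English (USMLE) subset.
USMLE_SPLITS = {"train": 10_178, "dev": 1_272, "test": 1_273}


def split_fractions(splits):
    """Return the total item count and each split's share of it."""
    total = sum(splits.values())
    return total, {name: n / total for name, n in splits.items()}


total, fractions = split_fractions(USMLE_SPLITS)
print(total)  # 12723, matching the reported scale
```

The split shares work out to roughly 80/10/10 for train/dev/test.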

2. Task Formulation, Baselines, and Evaluation

The canonical MedQA task is multi-option open-domain QA, requiring reading comprehension and retrieval:

Pipeline: For each question-option pair, retrieve top-N relevant paragraphs (BM25-based IR), then score each option using a document reader model and select the argmax.
Baseline readers: PMI, maxout-BiGRU, and pretrained Transformer variants (BioBERT, RoBERTa, mBERT); the pretrained Transformers perform best on Chinese (e.g., RoBERTa-Large-wwm-ext reaches 70.1% on MCMLE).
Metrics: Accuracy is defined as the fraction of questions for which the model selects the correct answer. On English MedQA, BioBERT-Large reaches 36.7% test accuracy.
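The retrieve-then-score pipeline and the accuracy metric can be sketched in a few lines. This is an illustration under stated assumptions, not the original implementation: the BM25 variant, the parameters `k1` and `b`, the whitespace tokenizer, and the function names are all choices made here, and the "reader" is replaced by a simple stand-in that sums retrieval scores, where the baselines use a neural document reader.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class BM25:
    """Small Okapi BM25 scorer over a list of paragraphs (illustrative variant)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(term for d in self.docs for term in set(d))
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}
        self.tf = [Counter(d) for d in self.docs]

    def score(self, query, i):
        tf, dl = self.tf[i], len(self.docs[i])
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            f = tf[term]
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_n(self, query, n):
        order = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:n]


def answer(question, options, bm25, n=2):
    """Build one query per question-option pair, retrieve top-n, return the argmax letter.

    Stand-in reader: the option score is the summed BM25 score of its
    retrieved paragraphs (an assumption made for this sketch).
    """
    scores = {}
    for letter, text in options.items():
        query = f"{question} {text}"
        scores[letter] = sum(bm25.score(query, i) for i in bm25.top_n(query, n))
    return max(scores, key=scores.get)


def accuracy(predictions, gold):
    """Fraction of questions answered correctly."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

In a fuller system, the `answer` function would pass each option's retrieved paragraphs to a trained reader and take the argmax over the reader's scores instead.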

Challenges include low retrieval rates (e.g., only 24% “full” evidence recall @25 for USMLE), a significant proportion of questions that require multi-hop or cross-paragraph reasoning, and performance bottlenecks even for strong pretrained baselines (Jin et al., 2020).

3. Advances and Model Architectures

Subsequent studies leverage MedQA as a gold standard for evaluation and as a source for new methodologies:

3.1 SeaReader: Modular Dual-Path Attention

The SeaReader approach (Zhang et al., 2018) introduces a dual-path attention LSTM model that aligns question-option “statements” both to individual evidence documents and across documents, incorporating gating and interpretability features (e.g., importance-penalized gating, noisy gating for robustness). SeaReader achieves 73.6% test accuracy (ensemble: 75.3%) on the Chinese MedQA, outperforming baseline models by ~10 percentage points.

3.2 Retrieval-Augmented and Hybrid Approaches

Self-MedRAG (Ryan et al., 8 Jan 2026): Combines BM25 and Contriever dense retrieval via Reciprocal Rank Fusion (RRF), then generates answers with supporting rationales, which are iteratively verified by NLI or LLM-based critics. Hybrid retrieval alone raises MedQA accuracy from ~41–43% (single retriever) to 80.0%; with self-reflective reasoning, accuracy reaches 83.33%.
LLM-MedQA (multi-agent CoT + case generation) (Yang et al., 2024): Employs a multi-phase expert decomposition pipeline culminating in agent voting on generated reports, achieving a gain of 5.3 points (77.2% vs. 71.9% for CoT+Self-Consistency, roughly 7% relative) in zero-shot settings with Llama3.1-70B.
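Reciprocal Rank Fusion, the step used above to merge BM25 and dense rankings, is simple enough to sketch directly. Each document receives the sum of 1 / (k + rank) over the rankings that contain it. The constant `k=60` is a commonly used default and an assumption here; the source does not state the value used, and the document IDs below are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant (60 is a conventional default, assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["d1", "d2", "d3"]   # hypothetical sparse results
dense_ranking = ["d3", "d1", "d4"]  # hypothetical dense results
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# ['d1', 'd3', 'd2', 'd4']
```

Because only ranks are used, the two retrievers' raw scores never need to be calibrated against each other, which is the usual reason for choosing RRF over score averaging.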

3.3 Chain-of-Thought and Efficiency

Selective Chain-of-Thought (Zhan et al., 23 Feb 2026): Generates rationales only when a gating prompt (“Do you need step-by-step reasoning?”) indicates they are required, reducing MedQA inference time and token usage by up to 45% with ≤4 percentage points of accuracy loss (Qwen-7B: 57.1%→54.6%).
CLINICR framework (MEDQA-OPEN) (Nachane et al., 2024): Converts MedQA into open-ended queries, using additive incremental reasoning and a reward model to match smaller models to larger ones on clinician-graded outputs.
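The gating idea behind Selective Chain-of-Thought can be sketched as a dispatcher that asks a gate first and only then pays for a rationale. This is a hypothetical outline, not the authors' code: in the method described above the gate is the model's own answer to the gating prompt, whereas here `needs_reasoning`, `answer_direct`, and `answer_with_cot` are placeholder callables supplied by the caller.

```python
from typing import Callable


def selective_cot(
    question: str,
    needs_reasoning: Callable[[str], bool],
    answer_direct: Callable[[str], str],
    answer_with_cot: Callable[[str], str],
) -> tuple[str, bool]:
    """Answer a question, generating a rationale only when the gate asks for one.

    Returns the answer and whether the chain-of-thought path was taken.
    """
    use_cot = needs_reasoning(question)
    answer = answer_with_cot(question) if use_cot else answer_direct(question)
    return answer, use_cot


def cot_rate(questions, needs_reasoning):
    """Share of questions routed to the (more expensive) reasoning path."""
    return sum(needs_reasoning(q) for q in questions) / len(questions)
```

The token savings come from the questions routed to the direct path, so `cot_rate` gives a rough sense of how much of the full chain-of-thought cost is still being paid.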

4. Robustness, Explainability, and Skill Evaluation

4.1 Multifaceted and Multi-Turn Probing

MultifacetEval (Zhou et al., 2024): Derives MultiMedQA by reframing MedQA items into four cognitive facets (comparison, rectification, discrimination, verification). Although top LLMs score >60% on MCQ (comparison), true mastery (all facets) drops by 40–50 points due to lack of robust meta-reasoning and correction.
MedQA-Followup (Manczak et al., 14 Oct 2025): Probes shallow (single-turn) versus deep (multi-turn) robustness using controlled misleading interventions. Multi-turn context-based attacks cause catastrophic drops (e.g., Claude Sonnet 4: 91.2%→13.5%), revealing a critical vulnerability under conversational pressure.

4.2 Open-ended Clinical Skills

MedQA-CS (Yao et al., 2024): Adapts MedQA towards structured clinical skill assessment via AI-Structured Clinical Exam (AI-SCE) formats. Unlike MCQ, this framework emphasizes open-ended history-taking, physical exam, closure, and differential diagnosis, evaluated by LLM or human rubrics. GPT-4 scores roughly 60/100 here, well below its MCQ accuracy (>90%).

4.3 Explanation Benchmarks

MedExQA (Kim et al., 2024): Focuses on underrepresented specialties and requires two distinct reference rationales per MedQA-style item. Models are evaluated not only by MCQ accuracy but also by BLEU, ROUGE-L, METEOR, BERTScore against both references. MedPhi-2 (2.7B) surpasses Llama2-70B in explanation quality.
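To make the two-reference evaluation concrete, here is a small ROUGE-L F1 sketch based on the longest common subsequence, scored against each reference separately. It is an illustration only: the whitespace tokenization and the function names are assumptions, how the benchmark aggregates the two per-reference scores is not stated in the text, and the example sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two strings, using lowercase whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def score_against_references(candidate, references):
    """One ROUGE-L F1 score per reference rationale."""
    return [rouge_l_f1(candidate, ref) for ref in references]


explanation = "metformin lowers hepatic glucose output"
references = [
    "metformin lowers hepatic glucose production",
    "it reduces gluconeogenesis in the liver",
]
print(score_against_references(explanation, references))
```

The second reference shares no tokens with the candidate despite describing the same mechanism, which illustrates why overlap metrics are usually paired with embedding-based ones such as BERTScore.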

5. Comparative Evaluation and State-of-the-Art Performance

MedQA remains a pivotal scaling benchmark for both open-source and closed models. Comparative results:

Model	Params	MedQA Accuracy (%)
BioGPT-large	1.5B	40.7
BioMedLM	2.7B	46.3
LLaMA 2 7B	7B	47.3
Mistral 7B	7B	59.1
Mistral 7B (+MedMCQA)	7B	63.0
Med-PaLM 2 ensemble	-	86.5
GPT-4 (Medprompt)	-	90.2
LLM-MedQA (Llama3.1-70B)	70B	77.2*
Self-MedRAG (hybrid+reflection)	-	83.3

*accuracies reflect model-specific settings (few-shot/zero-shot, retrieval, voting, etc.) (Yang et al., 2024, Ryan et al., 8 Jan 2026, Bolton et al., 2024)

Empirical trends:

Model scaling and in-domain adaptation improve MedQA performance (e.g., Mistral 7B at 59.1% surpasses smaller specialist models such as BioMedLM at 46.3%, and its +MedMCQA variant reaches 63.0%) (Bolton et al., 2024).
Retrieval augmentation, multi-agent consensus, and explicit reflection/critiquing consistently boost accuracy.
Even top models retain significant error rates and require interpretability and safety scaffolds before clinical deployment.

6. Multimodal and Multispan Extensions

Recent work pushes MedQA tasks beyond text-only MCQ:

M³ QuestionIng (Saha et al., 19 May 2026): Introduces a multi-span, multi-modal benchmark requiring answer selection over both text and images, with user intent and query-type labels. The M³QAFrame model fuses BiomedBERT and vision transformers, reaching 94.34% macro F1, well above unimodal and off-the-shelf VLMs.
RJUA-MedDQA (Jin et al., 2024): Chinese medical report images + QA pairs (entity, table, numerical, reasoning). With advanced structural annotation (ESRA), ESRA+GPT-4 outperforms best LMMs by >20 points on extractive and reasoning tasks.

7. Current Limitations and Directions

MedQA and its derivatives force models to go beyond memorized recall, demanding authentic, multi-step reasoning under open-domain uncertainty. However:

Performance on MCQ does not imply depth of clinical mastery, interpretability, or robustness (Zhou et al., 2024).
Multi-agent and reflective methods improve performance, but at the cost of high computational overhead and the risk of latent bias or inconsistent peer assessment (Zhan et al., 23 Feb 2026, Yang et al., 2024).
Multimodal and multi-turn challenges—such as integrating structured images, noisy reports, or persistent context—remain unsolved at scale (Manczak et al., 14 Oct 2025, Jin et al., 2024).
Real-world deployment must prioritize multi-faceted validation, safety triage, and human oversight (Lechner et al., 2023, Manczak et al., 14 Oct 2025).

Active research continues to adapt MedQA-style tasks for richer, clinically meaningful benchmarks (e.g., open-ended skills, explainable reasoning, multimodal QA), closing the gap between academic evaluation and safe, trustworthy medical AI applications.