
MedQA Clinical QA Benchmark

Updated 3 July 2026
  • MedQA is a comprehensive benchmark derived from board exam questions, evaluating NLP models' clinical reasoning and factual recall.
  • The baseline pipeline treats the task as multi-option open-domain QA, pairing BM25 retrieval with Transformer-based readers to select the correct answer.
  • Advances such as dual-path attention and retrieval augmentation boost performance while challenges remain in multi-turn reasoning and interpretability.

MedQA is a family of challenging large-scale question answering (QA) benchmarks and associated tasks designed to evaluate the factual knowledge and clinical reasoning capabilities of NLP models in the medical domain. Originating as a multiple-choice, open-domain QA corpus derived from professional medical licensing exams, MedQA and its derivatives have become primary benchmarks for clinical LLMs, spawning extensive research on retrieval-augmented reasoning, safety, explanation, and robustness.

1. Benchmark Construction and Dataset Structure

MedQA, as originally introduced, is compiled from real-world board exam questions and covers English (USMLE), simplified Chinese (MCMLE), and traditional Chinese (TWMLE) (Jin et al., 2020). For the English version (USMLE):

  • Scale: 12,723 questions (10,178 train, 1,272 dev, 1,273 test)
  • Question format: Multi-sentence clinical vignettes or factual queries, with 4 options each (A–D), exactly one correct.
  • Topics: Comprehensive coverage including internal medicine, surgery, pediatrics, psychiatry, pharmacology, pathology, and more; USMLE is 98% clinical vignettes.
  • Evidence corpus: 18 major medical textbooks OCRed into 231,581 paragraphs (12.7M tokens, English).
  • Annotation: Board-certified clinicians audited sample splits, confirming ~88% of USMLE items are answerable from the curated corpus.

MedQA’s non-English subsets (MCMLE, TWMLE) possess analogous scale and coverage, with evidence from 33 and 18 textbooks, respectively.

2. Task Formulation, Baselines, and Evaluation

The canonical MedQA task is multi-option open-domain QA, requiring reading comprehension and retrieval:

  • Pipeline: For each question-option pair, retrieve top-N relevant paragraphs (BM25-based IR), then score each option using a document reader model and select the argmax.
  • Baseline readers: PMI, maxout-BiGRU, and pretrained Transformer variants (BioBERT, RoBERTa, mBERT), with the latter yielding best performance on Chinese (e.g., RoBERTa-Large-wwm-ext achieves 70.1% on MCMLE).
  • Metrics: Accuracy is defined as the fraction of questions for which the model selects the correct answer. For English MedQA (BioBERT-Large): 36.7% test accuracy.
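The retrieve-then-read pipeline and the accuracy metric above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the original baseline: the BM25 class is a small in-memory implementation, toy_reader is a hypothetical co-occurrence heuristic standing in for a neural reader such as BioBERT, and the three-paragraph corpus is invented for the example.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

class BM25:
    """Tiny in-memory Okapi BM25 index (illustrative, not optimized)."""
    def __init__(self, paragraphs, k1=1.5, b=0.75):
        self.docs = [tokenize(p) for p in paragraphs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query_tokens, doc):
        tf = Counter(doc)
        norm = self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
        return sum(self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                   for t in query_tokens if t in tf)

    def top_n(self, query, n):
        q = tokenize(query)
        return sorted(self.docs, key=lambda d: self.score(q, d), reverse=True)[:n]

def toy_reader(question, option, paragraph_tokens):
    # Hypothetical stand-in for a neural reader: rewards paragraphs that
    # mention both the question terms and the option terms.
    p = set(paragraph_tokens)
    return len(p & set(tokenize(question))) * len(p & set(tokenize(option)))

def answer(question, options, index, n=2):
    scores = {}
    for label, text in options.items():
        # Retrieve evidence for each question-option pair, then score the option.
        evidence = index.top_n(question + " " + text, n)
        scores[label] = max(toy_reader(question, text, p) for p in evidence)
    return max(scores, key=scores.get)  # argmax over options

def accuracy(predictions, gold):
    # Fraction of questions answered correctly.
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Invented toy corpus and question, for illustration only.
corpus = [
    "metformin is first line therapy for type 2 diabetes mellitus",
    "amoxicillin treats streptococcal pharyngitis",
    "levothyroxine is used for hypothyroidism",
]
index = BM25(corpus)
question = "first line therapy for type 2 diabetes"
options = {"A": "amoxicillin", "B": "metformin", "C": "levothyroxine", "D": "warfarin"}
prediction = answer(question, options, index)
```

In a realistic setting the reader would be a fine-tuned Transformer that scores each option given its retrieved paragraphs; only the per-option retrieval and the final argmax follow the description above.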

Challenges include low retrieval rates (e.g., only 24% “full” evidence recall @25 for USMLE), a significant proportion of questions requiring multi-hop or cross-paragraph reasoning, and performance bottlenecks even for strong pretrained baselines (Jin et al., 2020).

3. Advances and Model Architectures

Subsequent studies leverage MedQA as a gold standard for evaluation and as a source for new methodologies:

3.1 SeaReader: Modular Dual-Path Attention

The SeaReader approach (Zhang et al., 2018) introduces a dual-path attention LSTM model that aligns question-option “statements” both to individual evidence documents and across documents, incorporating gating and interpretability features (e.g., importance-penalized gating, noisy gating for robustness). SeaReader achieves 73.6% test accuracy (ensemble: 75.3%) on the Chinese MedQA, outperforming baseline models by ~10 percentage points.

3.2 Retrieval-Augmented and Hybrid Approaches

  • Self-MedRAG (Ryan et al., 8 Jan 2026): Combines BM25 and Contriever dense retrieval via Reciprocal Rank Fusion (RRF), then generates answers with supporting rationales, which are iteratively verified by NLI or LLM-based critics. Hybrid retrieval alone raises MedQA accuracy from ~41–43% (single retriever) to 80.0%; with self-reflective reasoning, accuracy reaches 83.33%.
  • LLM-MedQA (multi-agent CoT + case generation) (Yang et al., 2024): Employs a multi-phase expert decomposition pipeline culminating in agent voting on generated reports, achieving a 5.3-point gain (77.2% vs. 71.9% for CoT+Self-Consistency, roughly 7% relative) in zero-shot settings with Llama3.1-70B.
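Reciprocal Rank Fusion, the step used above to merge BM25 and dense rankings, is simple enough to sketch directly. The constant k=60 is the conventional default for RRF and is an assumption here, since the value used by Self-MedRAG is not stated above; the paragraph IDs are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs.

    Each document receives sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is the conventional default (assumed).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical rankings from a sparse and a dense retriever.
bm25_ranking = ["p3", "p1", "p7"]
dense_ranking = ["p1", "p9", "p3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because RRF uses only rank positions, it needs no calibration between BM25 scores and dense similarity scores, which is one reason it is a common choice for hybrid retrieval.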

3.3 Chain-of-Thought and Efficiency

  • Selective Chain-of-Thought (Zhan et al., 23 Feb 2026): Gated generation of rationales only when required (“Do you need step-by-step reasoning?”), reducing MedQA inference time and token usage by up to 45% with ≤4% accuracy loss (Qwen-7B: 57.1%→54.6%).
  • CLINICR framework (MEDQA-OPEN) (Nachane et al., 2024): Converts MedQA into open-ended queries, using additive incremental reasoning and a reward model to match smaller models to larger ones on clinician-graded outputs.
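The gating idea behind Selective Chain-of-Thought can be illustrated as a two-call routine. This is a rough sketch, not the authors' implementation: generate is a hypothetical callable that maps a prompt string to a completion, the prompt wording beyond the quoted gate question is invented, and fake_generate is a stub used only to make the example runnable.

```python
def selective_cot(question, generate):
    """Ask the model whether it needs a rationale, then prompt accordingly.

    `generate` is a hypothetical prompt -> text callable (an assumption).
    Returns (mode, answer_text), where mode is "cot" or "direct".
    """
    gate = generate(
        "Do you need step-by-step reasoning? Answer yes or no.\n\n" + question
    )
    if gate.strip().lower().startswith("yes"):
        mode = "cot"
        prompt = question + "\nThink step by step, then give the final answer."
    else:
        mode = "direct"
        prompt = question + "\nGive only the final answer."
    return mode, generate(prompt)

def fake_generate(prompt):
    # Stub model: requests reasoning only for vignette-style questions.
    if prompt.startswith("Do you need"):
        return "Yes" if "year-old" in prompt else "No"
    return "B"

vignette_mode, _ = selective_cot("A 54-year-old man presents with chest pain.", fake_generate)
recall_mode, _ = selective_cot("Which vitamin deficiency causes scurvy?", fake_generate)
```

The token savings come from the direct branch, where no rationale is generated; the gate call itself adds a small fixed cost per question.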

4. Robustness, Explainability, and Skill Evaluation

4.1 Multifaceted and Multi-Turn Probing

  • MultifacetEval (Zhou et al., 2024): Derives MultiMedQA by reframing MedQA items into four cognitive facets (comparison, rectification, discrimination, verification). Although top LLMs score >60% on MCQ (comparison), true mastery (all facets) drops by 40–50 points due to lack of robust meta-reasoning and correction.
  • MedQA-Followup (Manczak et al., 14 Oct 2025): Probes shallow (single-turn) versus deep (multi-turn) robustness using controlled misleading interventions. Multi-turn context-based attacks cause catastrophic drops (e.g., Claude Sonnet 4: 91.2%→13.5%), revealing a critical vulnerability under conversational pressure.

4.2 Open-ended Clinical Skills

  • MedQA-CS (Yao et al., 2024): Adapts MedQA towards structured clinical skill assessment via AI-Structured Clinical Exam (AI-SCE) formats. Unlike MCQ, this framework emphasizes open-ended history-taking, physical exam, closure, and differential diagnosis, evaluated by LLM or human rubrics. LLMs achieve only ~60/100 (GPT-4) in this setting, well below their MCQ accuracy (>90%).

4.3 Explanation Benchmarks

  • MedExQA (Kim et al., 2024): Focuses on underrepresented specialties and requires two distinct reference rationales per MedQA-style item. Models are evaluated not only by MCQ accuracy but also by BLEU, ROUGE-L, METEOR, BERTScore against both references. MedPhi-2 (2.7B) surpasses Llama2-70B in explanation quality.
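Scoring a generated explanation against two reference rationales can be illustrated with ROUGE-L, one of the metrics listed above. The sketch below is a simplified version (whitespace tokenization, no stemming, F1 form) and is not the MedExQA evaluation code; the example sentences are invented, and how per-reference scores are aggregated is left open because the text above does not specify it.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 on lowercased whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

# Invented explanation and two reference rationales.
explanation = "beta blockers reduce heart rate"
references = [
    "beta blockers reduce heart rate",
    "beta blockers lower the heart rate",
]
per_reference = [rouge_l_f1(explanation, ref) for ref in references]
```

Keeping one score per reference, rather than collapsing them immediately, makes it possible to report either the mean or the best match afterwards.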

5. Comparative Evaluation and State-of-the-Art Performance

MedQA remains a pivotal scaling benchmark for both open-source and closed models. Comparative results:

Model | Params | MedQA Accuracy (%)
--- | --- | ---
BioGPT-large | 1.5B | 40.7
BioMedLM | 2.7B | 46.3
LLaMA 2 7B | 7B | 47.3
Mistral 7B | 7B | 59.1
Mistral 7B (+MedMCQA) | 7B | 63.0
Med-PaLM 2 ensemble | - | 86.5
GPT-4 (Medprompt) | - | 90.2
LLM-MedQA (Llama3.1-70B) | 70B | 77.2*
Self-MedRAG (hybrid+reflection) | - | 83.3

*accuracies reflect model-specific settings (few-shot/zero-shot, retrieval, voting, etc.) (Yang et al., 2024, Ryan et al., 8 Jan 2026, Bolton et al., 2024)

Empirical trends:

  • Model scaling and in-domain adaptation improve MedQA performance (e.g., Mistral 7B matches or surpasses small specialist models) (Bolton et al., 2024).
  • Retrieval augmentation, multi-agent consensus, and explicit reflection/critiquing consistently boost accuracy.
  • Even top models retain significant error rates and require interpretability and safety scaffolds before clinical deployment.

6. Multimodal and Multispan Extensions

Recent work pushes MedQA tasks beyond text-only MCQ:

  • QuestionIng (Saha et al., 19 May 2026): Introduces a multi-span, multi-modal benchmark requiring answer selection over both text and images, with user intent and query-type labels. The M³QAFrame model fuses BiomedBERT and vision transformers, reaching 94.34% macro F1, well above unimodal and off-the-shelf VLMs.
  • RJUA-MedDQA (Jin et al., 2024): Chinese medical report images + QA pairs (entity, table, numerical, reasoning). With advanced structural annotation (ESRA), ESRA+GPT-4 outperforms best LMMs by >20 points on extractive and reasoning tasks.

7. Current Limitations and Directions

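MedQA and its derivatives force models to go beyond memorized recall, demanding authentic, multi-step reasoning under open-domain uncertainty. However, as the results above show, limitations persist: evidence retrieval is weak, accuracy collapses under multi-turn pressure, and open-ended clinical skill scores lag well behind MCQ accuracy.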

Active research continues to adapt MedQA-style tasks for richer, clinically meaningful benchmarks (e.g., open-ended skills, explainable reasoning, multimodal QA), closing the gap between academic evaluation and safe, trustworthy medical AI applications.
