MedQA & MedMCQA: Medical MCQA Benchmarks
- MedQA and MedMCQA are large-scale MCQA benchmarks that assess LLMs on clinical reasoning and medical reading comprehension.
- MedQA sources USMLE-style board questions, while MedMCQA aggregates MCQs from Indian postgraduate entrance exams, together providing diverse material that approximates real-world clinical decision-making.
- Advanced strategies like chain-of-thought, multi-agent ensembles, and retrieval-augmented generation have driven significant accuracy gains in these benchmarks.
MedQA and MedMCQA are large-scale, multiple-choice question-answering (MCQA) benchmarks widely used to evaluate machine reading comprehension and medical reasoning in LLMs within the healthcare domain. These resources have driven advances in medical NLP, enabling systematic assessment and ablation of retrieval-augmented models, in-context learning, prompt engineering, ensemble reasoning, and cross-lingual generalization for clinical applications.
1. Dataset Origins, Structure, and Coverage
MedQA is rooted in United States Medical Licensing Examination (USMLE)-style board questions. Its canonical format consists of a clinical vignette (case stem) followed by 4–5 answer choices. The dataset comprises approximately 12,723 multiple-choice items. Of these, only 300 are extracted verbatim from official USMLE tutorials; the remainder are sourced from commercial exam-preparation websites, covering domains from anatomy and pathophysiology to ethics, pharmacology, and clinical management (Alwakeel et al., 11 Jul 2025, Singhal et al., 2022). Questions are open-domain and generally come without structured supporting evidence, validated explanations, or explicit authoring metadata.
MedMCQA aggregates >193,000 MCQs from Indian postgraduate entrance exams (AIIMS-PG/NEET-PG) and high-quality mock resources (Pal et al., 2022). Each question has a stem and four answer options (one correct answer plus three distractors), with topics spanning 21 medical subjects and ~2,400 healthcare themes (internal medicine, surgery, pharmacology, radiology, etc.). Items are short (mean stem length ≈12.7 tokens), and each is accompanied by a human-written explanation and exhaustive topic labeling. To ensure exam realism, MedMCQA uses exam-based splits (separating real AIIMS-PG/NEET-PG items from mocks) and near-duplicate filtering across splits, keeping only items whose Levenshtein similarity to any training item falls below 0.9 (Pal et al., 2022).
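As a minimal sketch of this style of near-duplicate filtering, the snippet below uses Python's standard-library `difflib` ratio as a stand-in for a Levenshtein-based similarity; the exact metric, threshold handling, and data layout in MedMCQA's pipeline are assumptions here, not the paper's implementation.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1]; a stand-in for a Levenshtein ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def filter_near_duplicates(train_stems, eval_items, threshold=0.9):
    """Drop dev/test items whose stem is too similar to any training stem."""
    kept = []
    for item in eval_items:
        if all(similarity(item["question"], t) < threshold for t in train_stems):
            kept.append(item)
    return kept

# Hypothetical usage with toy items
train = ["Which drug is first-line for uncomplicated hypertension?"]
dev = [
    {"question": "Which drug is first line for uncomplicated hypertension?", "answer": "A"},
    {"question": "Which nerve innervates the diaphragm?", "answer": "C"},
]
print(len(filter_near_duplicates(train, dev)))  # -> 1 (the near-duplicate is removed)
```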
Comparative Table of Core Properties
| Dataset | # Questions | Source(s) | Choices per Q | Subjects / Topics | Explanation | Public Split Method |
|---|---|---|---|---|---|---|
| MedQA | ~12,723 | USMLE tutorials + test-prep | 4–5 | Broad/mixed | No | Random |
| MedMCQA | ~193,155 | AIIMS-PG/NEET-PG + mock | 4 | 21 / ~2,400 | Yes | Exam-based, no overlaps |
MedMCQA is structurally richer, with explicit explanations and subject/topic labeling, while MedQA offers longer, vignette-style clinical stems with greater case diversity (Pal et al., 2022, Ferrazzi et al., 5 Dec 2025).
2. Benchmarking Protocols and Evaluation Metrics
Model evaluation on MedQA and MedMCQA overwhelmingly uses accuracy as the primary endpoint:

$$\text{Accuracy} = \frac{\text{number of correctly answered questions}}{\text{total number of questions}}$$

(Singhal et al., 2022, Alwakeel et al., 11 Jul 2025)
Precision, recall, and F₁-score are reported less frequently:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

(Alwakeel et al., 11 Jul 2025, Singhal et al., 2022)
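For concreteness, a short sketch of how these metrics are typically computed over MCQA predictions with scikit-learn; the gold and predicted answer lists are placeholders.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder gold answers and model predictions over answer letters A-D
gold = ["A", "C", "B", "D", "A"]
pred = ["A", "C", "D", "D", "B"]

acc = accuracy_score(gold, pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    gold, pred, average="macro", zero_division=0
)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} F1={f1:.3f}")
```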
Prompting strategies include zero-shot, few-shot, chain-of-thought (CoT), self-consistency voting, and ensemble methods (Maharjan et al., 29 Feb 2024, Elshaer et al., 16 Oct 2025). Some frameworks, such as OpenMedLM, systematically vary prompt templates (random vs. kNN-selected examples, CoT explanations, ensemble voting) to optimize performance (Maharjan et al., 29 Feb 2024). More specialized approaches leverage confidence-aware routing, multi-model aggregation, and modular collaboration among expert LLMs (Elshaer et al., 16 Oct 2025, Mishra et al., 11 Aug 2025).
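To make the self-consistency strategy concrete, the sketch below samples several chain-of-thought completions and takes a majority vote over the extracted answer letters; `sample_cot_answer` is a hypothetical stand-in for an actual LLM call, and the answer-extraction regex is an assumption, not any paper's parser.

```python
import re
from collections import Counter

def extract_choice(completion: str):
    """Pull the final answer letter (A-E) out of a chain-of-thought completion."""
    match = re.search(r"answer\s*(?:is|:)?\s*\(?([A-E])\)?", completion, re.IGNORECASE)
    return match.group(1).upper() if match else None

def self_consistency(question: str, sample_cot_answer, n_samples: int = 11) -> str:
    """Majority vote over n sampled chain-of-thought completions (temperature > 0)."""
    votes = Counter()
    for _ in range(n_samples):
        choice = extract_choice(sample_cot_answer(question))
        if choice is not None:
            votes[choice] += 1
    return votes.most_common(1)[0][0] if votes else "A"  # arbitrary fallback if nothing parses

# Hypothetical usage: sample_cot_answer(q) would call an LLM with a CoT prompt at temperature ~0.7
```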
3. Modeling Paradigms: Baselines, Ensembles, and Retrieval-Augmented Systems
3.1 Foundation and Fine-Tuned LLMs
OpenMedLM establishes open-source SOTA on both MedQA and MedMCQA (e.g., Yi-34B achieves 72.6% and 68.3% accuracy, respectively), primarily via prompt-engineering and voting strategies, not additional fine-tuning (Maharjan et al., 29 Feb 2024). Clinical Camel (13B/70B), based on LLaMA-2 with dialogue-based knowledge encoding and QLoRA adapters, achieves 60.7% (MedQA) and 54.2% (MedMCQA) in 5-shot evaluation, surpassing GPT-3.5 in both settings (Toma et al., 2023).
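A rough sketch of kNN-based few-shot example selection of the kind OpenMedLM reports, using cosine similarity over precomputed question embeddings; the embedding source, training pool, and prompt template are placeholder assumptions, not the paper's actual pipeline.

```python
import numpy as np

def select_knn_examples(query_emb, train_embs, train_items, k=5):
    """Return the k training items whose question embeddings are closest to the query (cosine similarity)."""
    q = query_emb / np.linalg.norm(query_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ q
    top = np.argsort(-sims)[:k]
    return [train_items[i] for i in top]

def build_prompt(query_question, examples):
    """Prepend the retrieved examples (with their CoT explanations) as few-shot demonstrations."""
    shots = "\n\n".join(
        f"Question: {ex['question']}\nReasoning: {ex['explanation']}\nAnswer: {ex['answer']}"
        for ex in examples
    )
    return f"{shots}\n\nQuestion: {query_question}\nReasoning:"
```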
3.2 Retrieval-Augmented Generation (RAG)
AMG-RAG incorporates an agentic, dynamically updated medical knowledge graph alongside external retrieval (PubMed, Wikipedia), propelling an 8B-parameter LLM to 74.1% F1 on MedQA and 66.3% accuracy on MedMCQA (Rezaei et al., 18 Feb 2025). MedTrust-RAG enforces citation-aware reasoning with iterative retrieval-verification and hallucination-aware alignment, delivering up to +4.2 pp over standard RAG with 8B models (Ning et al., 16 Oct 2025).
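Stripped of the knowledge-graph and verification machinery these systems add, the core retrieve-then-answer loop can be sketched as follows; `retriever.search` and `llm.generate` are hypothetical interfaces for illustration, not the APIs of AMG-RAG or MedTrust-RAG.

```python
def answer_with_rag(question: str, options: dict, retriever, llm, k: int = 5) -> str:
    """Retrieve top-k passages for the question, then ask the LLM to answer while citing them."""
    passages = retriever.search(question, top_k=k)            # hypothetical retriever interface
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    option_block = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    prompt = (
        "Use the evidence below to answer the medical question. "
        "Cite passage numbers and finish with 'Answer: <letter>'.\n\n"
        f"Evidence:\n{context}\n\nQuestion: {question}\n{option_block}\n"
    )
    return llm.generate(prompt)                               # hypothetical LLM interface
```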
Variational Open-Domain QA (VOD) applies a Rényi variational bound to jointly optimize both retriever and reader, yielding 62.9% on MedMCQA—outperforming domain-tuned Med-PaLM (540B) with 2,500× fewer parameters—and 55.0% on MedQA-USMLE with transfer learning (Liévin et al., 2022).
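For reference, the generic Rényi variational (VR) bound that VOD instantiates for retrieval can be written, for a latent retrieved document $z$, observation $x$, variational distribution $q(z)$, and $\alpha \in (0,1)$, as

$$\mathcal{L}_\alpha(q; x) = \frac{1}{1-\alpha}\,\log \mathbb{E}_{q(z)}\!\left[\left(\frac{p(x, z)}{q(z)}\right)^{1-\alpha}\right] \le \log p(x),$$

which recovers the standard ELBO as $\alpha \to 1$. This is the textbook form of the bound, not VOD's exact retrieval-specific objective.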
3.3 Multilingual and Reasoning-Traces Pipelines
Recent research extends MedQA/MedMCQA to Italian/Spanish via machine translation and dense retrieval over medical Wikipedia, generating 500k+ grounded reasoning traces (Ferrazzi et al., 5 Dec 2025). Fine-tuning on these traces produces +2–5% accuracy gains on both datasets for 8B-parameter LLMs compared to prior open baselines.
4. Multi-Agent, Ensemble, and Consensus-Based Architectures
Consensus-mechanism frameworks (e.g., Sully MedCon-1) deploy multiple specialist LLM “experts” and an aggregation agent to improve both calibration and accuracy. Ensemble logic (weighted log-opinion pools, softmax-style normalization, and boosting by most-frequent rank) yields 96.8% on MedQA and 94.2% on MedMCQA on 500-question samples, surpassing top single-model systems (OpenAI o3, Gemini 2.5 Pro) by up to 9.1% (2505.23075).
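A minimal sketch of a weighted log-opinion pool over per-expert answer distributions, i.e., a weighted geometric mean renormalized in a softmax-style step; the weights and distributions below are placeholders, not Sully MedCon-1's actual aggregation rule.

```python
import numpy as np

def log_opinion_pool(expert_probs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Combine expert answer distributions (n_experts x n_options) via a weighted log-opinion pool."""
    log_p = np.log(np.clip(expert_probs, 1e-12, 1.0))   # avoid log(0)
    pooled_log = weights @ log_p                          # weighted sum of log-probabilities
    pooled = np.exp(pooled_log - pooled_log.max())        # softmax-style renormalization
    return pooled / pooled.sum()

# Hypothetical example: three experts scoring options A-D, with reliability weights
experts = np.array([[0.70, 0.10, 0.10, 0.10],
                    [0.40, 0.30, 0.20, 0.10],
                    [0.25, 0.25, 0.25, 0.25]])
weights = np.array([0.5, 0.3, 0.2])
print(log_opinion_pool(experts, weights))  # pooled distribution; argmax gives the consensus answer
```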
TeamMedAgents applies agentic teamwork mechanisms (team leadership, mutual performance monitoring, mutual trust, shared mental models, team orientation, closed-loop communication) for modular LLM collaboration (Mishra et al., 11 Aug 2025). On MedQA, the best configuration (n=3, tailored teamwork) achieves 92.6±2.7% (17.6 pp over single-agent GPT-4o); on MedMCQA, 82.4±2.0% (a 5.4 pp gain). Ablations reveal that “mutual trust” and “team orientation” drive the majority of the performance improvement on these benchmarks.
CURE introduces confidence-driven adaptive routing that selectively invokes helper LLMs for low-confidence items, yielding accuracy gains (MedQA: 74.1%, MedMCQA: 78.0%) without any fine-tuning or additional retrieval (Elshaer et al., 16 Oct 2025). The strategy is computationally efficient and more robust to answer variance than uniform ensemble voting.
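A simplified sketch of confidence-driven routing in this spirit: the primary model answers every item, and only low-agreement items are escalated to a helper model. The agreement-based confidence proxy, threshold, and model interfaces are assumptions for illustration, not CURE's exact design.

```python
from collections import Counter

def route_by_confidence(question, primary_sample, helper_answer, n_samples=5, threshold=0.8):
    """Answer with the primary model; escalate to the helper when self-agreement is low."""
    votes = Counter(primary_sample(question) for _ in range(n_samples))  # e.g., sampled CoT answers
    best, count = votes.most_common(1)[0]
    confidence = count / n_samples                                       # agreement-based confidence proxy
    if confidence >= threshold:
        return best, "primary", confidence
    return helper_answer(question), "helper", confidence                 # invoke the helper LLM

# Hypothetical usage: primary_sample and helper_answer wrap LLM calls returning answer letters
```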
5. Performance, State-of-the-Art, and Comparative Benchmarks
Reported results across model families are summarized below, with metrics and reference points as published in the source literature.
| Method/Model | MedQA (%) | MedMCQA (%) | Notable Features | Reference |
|---|---|---|---|---|
| OpenMedLM (Yi-34B) | 72.6 | 68.3 | kNN-CoT + self-consistency voting | (Maharjan et al., 29 Feb 2024) |
| AMG-RAG (8B) | 73.9 | 66.3 | Med KG + Ext. retrieval + CoT | (Rezaei et al., 18 Feb 2025) |
| Clinical Camel (70B/5-shot) | 60.7 | 54.2 | DBKE, LoRA fine-tune | (Toma et al., 2023) |
| Med-PaLM 2 (proprietary) | 79.7 | 71.3 | Instruction tuning, large-scale SFT | (Toma et al., 2023) |
| Flan-PaLM 540B SC | 67.6 | 57.6 | Few-shot + 11x CoT samples (vote) | (Singhal et al., 2022) |
| VOD (BioLinkBERT) | 55.0 | 62.9 | Variational retriever/reader | (Liévin et al., 2022) |
| LLaMA3.1-8B + DPO | 62.3 | 57.5 | Hallucination-aware RAG+alignment | (Ning et al., 16 Oct 2025) |
| Qwen3-8B FT + trace | 76.7 | 67.6 | Multilingual reasoning traces | (Ferrazzi et al., 5 Dec 2025) |
| CURE (Qwen3-30B+help) | 74.1 | 78.0 | Confidence-driven ensemble CoT | (Elshaer et al., 16 Oct 2025) |
| TeamMedAgents | 92.6±2.7 | 82.4±2.0 | Structured teamwork, agentic ensemble | (Mishra et al., 11 Aug 2025) |
| Sully MedCon-1 (ensemble) | 96.8 | 94.2 | Weighted specialty consensus system | (2505.23075) |
Note: Datasets, sample sizes, and evaluation splits may differ across studies. The highest reported values come from multi-model ensemble architectures evaluated on controlled subsamples.
6. Critical Appraisal, Limitations, and Recommendations
Transparency, Realism, and Contamination: Both MedQA and MedMCQA are subject to unresolved provenance issues. MedQA originates largely from commercial test-prep material, with only ~2% drawn from official sources, and the remainder lacks documentation of review or validation (Alwakeel et al., 11 Jul 2025). MedMCQA is claimed to consist of genuine Indian exam items, but the legal chain of custody remains uncertain and reuse permissions are ambiguous. Neither dataset documents rigorous multi-round clinical expert panel review, psychometric validation, or evidence-based rationale vetting (Alwakeel et al., 11 Jul 2025). Brief stems and limited contextual richness (notably in MedMCQA, with a mean stem length of ~12 tokens) constrain the depth of clinical reasoning tested.
Metric Narrowness: Nearly all studies report only end-to-end accuracy; few address error stratification, calibration, or partial-credit scoring. Reliability in real clinical settings would additionally require calibration analysis, fidelity of reasoning (explanation grounding), and assessment of domain transfer (e.g., breakdown by specialty such as neurology or pharmacology) (Alwakeel et al., 11 Jul 2025).
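Since calibration analysis is repeatedly flagged as missing, a short sketch of expected calibration error (ECE) over MCQA predictions is given below; the equal-width binning is the standard textbook variant, and the confidence scores are placeholder values.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Standard equal-width ECE: weighted average gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Hypothetical usage: per-question confidence of the chosen option vs. whether it was correct
print(expected_calibration_error([0.9, 0.8, 0.95, 0.6], [1, 1, 0, 1]))
```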
Risk of Data Leakage and Overestimated Performance: Many items are accessible via public web sources, increasing the likelihood of contamination in both supervised and foundation LLM pretraining (Alwakeel et al., 11 Jul 2025, Singhal et al., 2022).
Proposals for Improvement:
- Adopt standardized item-writing protocols: multi-clinician committees, iterative multi-round expert review, psychometric benchmarking.
- Embrace RAG and synthetic question generation with explicit paraphrasing to emulate multi-morbidity and authentic clinical complexity, followed by expert curation.
- Expand reporting to include explanation fidelity, confidence estimation, and stratified performance metrics.
- Construct embargoed, version-tracked benchmarks distributed under professional society or regulatory oversight (Alwakeel et al., 11 Jul 2025).
7. Impact and Future Directions
MedQA and MedMCQA have catalyzed substantial advances in medical LLM research:
- They have served as the principal benchmarks for developing and evaluating end-to-end, retrieval-augmented, multilingual, and ensemble-based clinical QA systems.
- Their constraints have motivated architectures robust to input truncation, routing-based ensembles, and explainable reasoning chains.
- Limitations in transparency have also driven research on dataset curation protocols, hallucination mitigation, and evaluation beyond single-answer accuracy.
Active research targets include dataset diversification, increased clinical fidelity, multilingual expansion, calibration and uncertainty quantification, and benchmarking with real-world, patient-centered clinical tasks (Ferrazzi et al., 5 Dec 2025, Singhal et al., 2022).
References:
(Pal et al., 2022, Liévin et al., 2022, Singhal et al., 2022, Toma et al., 2023, Maharjan et al., 29 Feb 2024, Rezaei et al., 18 Feb 2025, 2505.23075, Alwakeel et al., 11 Jul 2025, Mishra et al., 11 Aug 2025, Elshaer et al., 16 Oct 2025, Ning et al., 16 Oct 2025, Ferrazzi et al., 5 Dec 2025, Thiprak et al., 13 Sep 2024)