MedMCQA Dataset: Indian Medical MCQs
- MedMCQA is a large-scale MCQA dataset compiled from Indian medical entrance exams with over 190K real-world questions.
- It features detailed metadata, expert explanations, and annotations for 12 distinct reasoning types, supporting diverse clinical tasks.
- Published models reach up to 62.9% test accuracy, well below human merit-candidate performance (~90%), underscoring the dataset's difficulty and value for advancing medical AI research.
MedMCQA is a large-scale, multi-subject, multiple-choice question answering (MCQA) dataset curated from Indian medical entrance examinations. It has been developed to benchmark and promote research in automated medical reasoning, especially for natural language understanding and knowledge-grounded medical QA across a broad spectrum of clinical subjects and reasoning abilities.
1. Dataset Scope and Structure
MedMCQA comprises 193,155 real-world MCQs (≈194K) drawn from AIIMS and NEET-PG entrance examinations and related mock series dating back to 1991. Each instance is structured as a tuple (Q, O, a), where Q is the question stem, O = {o1, o2, o3, o4} is the set of four candidate answers, and a ∈ O is the single correct answer. Coverage extends to 21 high-level medical subjects and approximately 2,400 fine-grained topics, including but not limited to Medicine, Surgery, Pediatrics, Ophthalmology, Biochemistry, and ENT.
Each sample further includes metadata:
- id: unique identifier
- subject: one of 21 subjects
- topic: fine-grained medical topic
- question: stem
- options: list of four candidate answers (mean length ≈2.7 tokens)
- answer: index of correct answer
- explanation: expert-provided justification (mean 66.2 tokens), critical for interpretability and error attribution
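For concreteness, here is a minimal sketch of one record using the field names listed above. The values are invented for illustration, and the released files may use slightly different key names.

```python
# Hypothetical MedMCQA record, following the schema listed above.
# Values are invented for illustration; released field names may differ.
sample = {
    "id": "e9ad821a-c438-4965-9f77-760819dfa155",  # illustrative identifier
    "subject": "Pharmacology",                      # one of 21 subjects
    "topic": "NSAIDs",                              # fine-grained topic
    "question": "Which drug is a selective COX-2 inhibitor?",
    "options": ["Aspirin", "Celecoxib", "Ibuprofen", "Naproxen"],
    "answer": 1,                                    # index of "Celecoxib"
    "explanation": "Celecoxib selectively inhibits COX-2, sparing COX-1 ...",
}
```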
The average question length is 12.77 tokens in the training split, with variation across splits (dev: 14.09, test: 9.93). The vocabulary comprises 97,694 unique tokens.
2. Data Collection, Curation, and Annotation Protocols
Source data was compiled from:
- AIIMS PG exams (1991–present)
- NEET PG exams (2001–present)
- Curated mock tests and questions authored by medical professionals
Selection and curation enforce strict quality control:
- Only items with exactly four answer choices and a single best answer key are retained
- Items requiring interpretation of images, tables, or external media are excluded
- Blacklisting removes questions that reference sensitive content or are vaguely formulated
- Automatic and human-in-the-loop normalization and standardization correct typos and formatting issues
- Duplicates are removed via exact match, and near-duplicate leakage across splits is controlled using Levenshtein similarity
Preprocessing removes any personal or sensitive data, so only purely textual question content remains. Annotation of reasoning types is ongoing.
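A minimal sketch of the near-duplicate screening step described above, using a standard dynamic-programming Levenshtein distance; the similarity threshold is an assumption here, not the authors' documented setting.

```python
# Sketch of near-duplicate screening with normalized Levenshtein similarity.
# The 0.9 threshold is an assumption, not the authors' documented setting.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for maximally different ones."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def is_near_duplicate(question: str, kept: list[str], threshold: float = 0.9) -> bool:
    """Flag a candidate question if it is too similar to any retained one."""
    return any(similarity(question, k) >= threshold for k in kept)
```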
3. Reasoning Taxonomy and Diversity
A key attribute is the explicit annotation of reasoning abilities for ≈25% of samples. Twelve distinct reasoning types are defined, including factual recall, multi-hop inference, treatment selection, diagnosis, explanation/definition selection, mathematical reasoning, analogy, comparison, NLI, and distractor elimination. The distribution of reasoning abilities ensures coverage from simple fact lookup to sophisticated clinical inference.
Examples (editor-selected for illustration):
- Factual recall: "Which drug is a selective COX-2 inhibitor?"
- Treatment selection: "A 10-year-old boy with sensorineural deafness and no improvement with conventional aids. Most appropriate management?"
- Mathematical reasoning: "A population study shows mean glucose = 86 mg/dL with normal distribution. What percent of people have glucose above 86 mg/dL?" (answer: 50%, since a normal distribution is symmetric about its mean)
- Multi-hop diagnosis/treatment: Chaining clinical context and pharmacology guidelines
4. Dataset Splits and Leakage Avoidance
Splits are exam-based to avoid contamination from near-duplicate or semantically similar items:
- Train: 182,822 items, dominated by mock exams and online test series
- Dev: 4,183 questions from recent NEET PG exams
- Test: 6,150 questions from historical AIIMS PG exams
Splitting by temporal and institutional source minimizes overlap, and pairwise Levenshtein similarity is checked across splits so that near-duplicates are excluded and evaluation remains independent.
5. Baseline Model Performance and Contextualization
The original dataset paper (Pal et al., 2022) benchmarks transformer-based models across three context conditions (no context, Wikipedia, PubMed):
- BERT Base (no context): 33%
- BioBERT (no context): 37%
- SciBERT (no context): 39%
- PubMedBERT (PubMed context): 47%

By comparison, human merit candidates average ≈90% accuracy.
Topical error analysis shows model failures primarily in multi-hop reasoning, context retrieval errors, and quantitative subproblems, attributing most difficulty to the breadth and compositionality required.
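To make the evaluation protocol concrete, here is a sketch of how a BERT-style encoder scores one item as a 4-way choice, in the spirit of the no-context baselines above. The checkpoint name is a placeholder (an untuned head would first need fine-tuning on the training split), and this is not the exact setup of Pal et al. (2022).

```python
# Sketch: scoring a 4-option MedMCQA item with a BERT-style multiple-choice head.
# Checkpoint name is illustrative; in practice the head must be fine-tuned first.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

question = "Which drug is a selective COX-2 inhibitor?"
options = ["Aspirin", "Celecoxib", "Ibuprofen", "Naproxen"]

# Encode the question paired with each option, then add a batch dimension
# so tensors have shape (batch=1, num_choices=4, seq_len).
enc = tokenizer([question] * 4, options, return_tensors="pt",
                padding=True, truncation=True)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 4): one score per option

pred = logits.argmax(dim=-1).item()
print(options[pred])
```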
6. Advances in Modeling: Methods and Results
Several recent works have leveraged MedMCQA as a benchmark for scalable automatic medical QA:
- MQ-SequenceBERT (Ponce-López, 2024): Fine-tunes Sentence-BERT with mean pooling and a classification head for 21-way subject classification (without external context), reaching 68% dev / 60% test accuracy. The architecture involves standard tokenization, mean + L2 pooling, and a single linear classifier trained with softmax cross-entropy, with no data augmentation or regularization beyond base fine-tuning. As reported, the dev result is 5 points above Codex (CoT) and on par on test, indicating that non-generative, discriminative subject classification is effective.
- Variational Open-Domain (VOD) QA (Liévin et al., 2022): Introduces an end-to-end variational framework for retrieval-augmented MCQA, combining a BioLinkBERT dual-encoder retriever with a cross-attention reader, trained with a Rényi variational bound and importance sampling. The model achieves 62.9% test accuracy, outperforming much larger LLM baselines such as Med-PaLM (+5.3 points) with roughly 2,500× fewer parameters. The framework is distinctive in jointly optimizing retriever and reader, with support-set truncation and task-adapted semantic representations, as reflected in improved MRR/Hit@20 metrics on downstream rare-disease retrieval; a simplified view of its retrieve-then-read scoring appears after this list.
- Comparison with strong baselines: Codex (5-shot CoT, 175B) achieves 62.7% test; Med-PaLM (540B) yields 57.6%, highlighting MedMCQA's continued relevance for comparing large-scale parametric and retrieval-based methods.
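At inference time, the retrieve-then-read scoring that VOD optimizes can be summarized as marginalizing the reader's answer distribution over retrieved documents. Below is a toy sketch with invented scores; the actual training objective (a Rényi variational bound with importance sampling) is more involved and not reproduced here.

```python
# Toy sketch of retrieval-augmented answer scoring:
# p(a | q) = sum_d p(a | q, d) * p(d | q). All numbers are invented.
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Retriever scores for K=3 documents given the question
# (e.g. dual-encoder dot products).
retriever_logits = np.array([2.1, 1.4, 0.3])
p_doc = softmax(retriever_logits)                  # p(d | q)

# Reader logits over the 4 options, conditioned on each retrieved document.
reader_logits = np.array([[0.2, 1.9, 0.1, 0.0],    # doc 1
                          [0.5, 1.2, 0.4, 0.1],    # doc 2
                          [1.0, 0.8, 0.9, 0.2]])   # doc 3
p_ans_given_doc = softmax(reader_logits, axis=-1)  # p(a | q, d)

# Marginal answer distribution over the four options.
p_ans = p_doc @ p_ans_given_doc
print(p_ans.argmax())  # predicted option index
```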
Summarized Test Accuracy Table:
| Method | Test Accuracy |
|---|---|
| PubMedBERT + DPR | 47.0% |
| BioLinkBERT + BM25 | 55.3% |
| Codex 5-shot CoT (175B) | 62.7% |
| Med-PaLM (540B) | 57.6% |
| VOD (2×BioLinkBERT, 220M) | 62.9% |
| MQ-SequenceBERT (21-way subject classification) | 60.0% |

Note that the MQ-SequenceBERT figure measures subject classification rather than answer accuracy, so it is not directly comparable to the other rows.
7. Access, Licensing, and Benchmark Role
MedMCQA is publicly available under the CC BY-NC-SA license at https://medmcqa.github.io, distributed as JSON objects with fields covering splits, explanations, and annotation metadata. It remains among the largest open-domain, multi-subject MCQA datasets with rich annotation, fine-grained topics, detailed expert explanations, and robust exam-based splits that minimize leakage and maximize evaluation fidelity. Its broad coverage, annotation depth, and standardized splits make it foundational for progress in medical AI, both for subject classification and full-answer retrieval/justification tasks.
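A minimal loading sketch, assuming the released files store one JSON object per line (adjust if the release uses JSON arrays); the filename is illustrative.

```python
# Sketch: reading the released files, assuming one JSON object per line.
# Filename and packaging are assumptions; adjust to the actual release.
import json

def load_split(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

train = load_split("train.json")  # illustrative filename
print(len(train), train[0]["question"])
```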
8. Comparative Position and Future Directions
Relative to contemporaries such as MedQA (≈61K questions across languages, randomized splits, no subject annotation) and HEAD-QA (13.5K; 6 clinical specialties), MedMCQA distinguishes itself through its focus on multi-subject coverage, real-world exam provenance, granularity of reasoning labels, and structured expert explanations. Its design directly supports evaluation of both generative and discriminative models, as well as retrieval-augmented approaches.
Ongoing and future directions highlighted include extending annotation coverage (reasoning types), leveraging context-enhanced retrieval (not yet systematically explored for subject classification), and exploring transfer from MCQA to decision-support or clinical inference. Its high difficulty ceiling and linguistic diversity ensure enduring utility for robust benchmarking of LLMs in the biomedical domain.
References:
- Pal et al. (2022): MedMCQA dataset introduction and benchmarks
- Ponce-López (2024): MQ-SequenceBERT for medical question classification
- Liévin et al. (2022): VOD framework for retrieval-augmented medical MCQA