Med-PaLM 2: Advanced Medical QA LLM
- Med-PaLM 2 is a large language model tailored for medical Q&A, using domain-specific fine-tuning and ensemble refinement prompting to achieve expert-level performance.
- It builds on PaLM 2 improvements like multilingual pretraining and adaptive optimization, setting state-of-the-art benchmarks on MedQA, PubMedQA, and more.
- Human evaluations show Med-PaLM 2’s outputs are often preferred over physician answers, highlighting its promise for clinical decision-support applications.
Med-PaLM 2 is an LLM specifically adapted for medical question answering, representing an advance toward expert-level performance through domain-specific alignment and prompting strategies. Building on the progress of the original Med-PaLM, Med-PaLM 2 integrates improvements in foundation models, instruction fine-tuning on curated medical datasets, and a novel ensemble refinement prompting scheme. Evaluated on the MultiMedQA benchmark suite and in extensive human assessments, Med-PaLM 2 demonstrates state-of-the-art performance, matching or exceeding both prior models and physician-generated answers on several clinically relevant metrics (Singhal et al., 2023).
1. Foundation Model and Medical Domain Adaptation
Med-PaLM 2 leverages PaLM 2 as its backbone model. PaLM 2 introduces several enhancements over its predecessor PaLM: a broader multilingual pretraining corpus, augmented with scientific articles and code, and architectural improvements including refined layer scaling, activation functions, and adaptive optimizers. These changes yield better zero- and few-shot capabilities on standard LLM benchmarks.
To specialize PaLM 2 for medical reasoning and knowledge, domain-specific instruction fine-tuning is performed. This process follows the methodology of Chung et al. (2022), yielding a unified Med-PaLM 2 model that supports both multiple-choice and long-form answer generation. The instruction fine-tuning mixture comprises medical exam and consumer question datasets as illustrated below:
| Dataset | Examples | Mixture Proportion |
|---|---|---|
| MedQA (USMLE style) | 10,178 | 37.5% |
| MedMCQA (Indian exam) | 182,822 | 37.5% |
| LiveQA (consumer queries) | 10 | 3.9% |
| MedicationQA (drug queries) | 9 | 3.5% |
| HealthSearchQA (search) | 45 | 17.6% |
Fine-tuning minimizes the standard cross-entropy loss using typical Transformer training hyperparameters (AdamW optimizer, learning-rate warm-up and decay, TPU-optimized batch sizes) until validation performance plateaus.
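As an illustration, the mixture proportions in the table above can be realized by weighted sampling of a source dataset for each fine-tuning example. This is a minimal sketch, not the authors' pipeline; the `sample_dataset` helper and the use of Python's `random.choices` are illustrative assumptions:

```python
import random

# Instruction fine-tuning mixture proportions, taken from the table above.
MIXTURE = {
    "MedQA": 0.375,
    "MedMCQA": 0.375,
    "LiveQA": 0.039,
    "MedicationQA": 0.035,
    "HealthSearchQA": 0.176,
}

def sample_dataset(rng: random.Random) -> str:
    """Pick the source dataset for the next fine-tuning example,
    weighted by its mixture proportion."""
    names = list(MIXTURE)
    weights = [MIXTURE[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
batch = [sample_dataset(rng) for _ in range(8)]  # datasets feeding one small batch
```

Over many batches, each dataset's share of examples converges to its stated proportion regardless of its absolute size, which is why tiny datasets like MedicationQA (9 examples) can still hold a fixed share of the mixture.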
2. Ensemble Refinement Prompting
Med-PaLM 2 introduces Ensemble Refinement (ER), a two-stage prompting method designed to improve LLM reliability in medical question answering, particularly for multiple-choice formats. The process entails:
Stage 1: Ensemble Generation
- Given a question and a few-shot chain-of-thought (CoT) prompt, sample multiple stochastic outputs via temperature sampling, each comprising a reasoning chain and an answer.
Stage 2: Refinement
- Concatenate the original prompt, the question, and all first-stage outputs, then ask the model to generate a refined response; repeat this several times to produce multiple refined answers.
- Select the final answer by plurality vote among the refined outputs: the answer produced most often wins.
The paper reports defaults of 11 first-stage generations and 33 second-stage refinements. This approach surpasses plain CoT prompting and self-consistency, particularly on complex questions.
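The two stages above can be sketched as follows. Here `generate` stands in for a stochastic (temperature-sampled) LLM call returning a reasoning chain and an answer; the function name, prompt formatting, and default counts are illustrative assumptions, not the paper's implementation:

```python
from collections import Counter
from typing import Callable, Tuple

def ensemble_refinement(
    question: str,
    generate: Callable[[str], Tuple[str, str]],  # prompt -> (reasoning, answer), sampled stochastically
    k: int = 4,   # number of first-stage generations (illustrative default)
    m: int = 8,   # number of second-stage refinements (illustrative default)
) -> str:
    # Stage 1: sample k chain-of-thought generations for the question.
    stage1 = [generate(question) for _ in range(k)]

    # Stage 2: condition on the question plus all first-stage outputs,
    # then sample m refined answers.
    context = question + "\n" + "\n".join(
        f"Reasoning: {r}\nAnswer: {a}" for r, a in stage1
    )
    refined = [generate(context)[1] for _ in range(m)]

    # Final answer: plurality vote over the refined answers.
    return Counter(refined).most_common(1)[0][0]
```

Because stage 2 sees all first-stage reasoning paths at once, it can reconcile conflicting chains rather than merely counting their answers, which is the key difference from plain self-consistency.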
3. Evaluation Frameworks and Benchmarks
Med-PaLM 2 is evaluated using both automatic and human protocols. Automatic assessment employs the MultiMedQA suite:
| Benchmark (Items) | Med-PaLM 2 (ER) Accuracy |
|---|---|
| MedQA (1,273) | 85.4% (unified), 86.5% (MedQA-only) |
| MedMCQA (4,183) | 72.3% |
| PubMedQA (500) | 81.8% (self-consistency) |
| MMLU Clinical Topics | 88.7% |
| MMLU Medical Genetics | 92.0% |
| MMLU Anatomy | 84.4% |
| MMLU Prof. Medicine | 95.2% |
| MMLU College Biology | 95.8% |
| MMLU College Medicine | 83.2% |
Med-PaLM 2 attains state-of-the-art performance on MedQA (+19.0% over Med-PaLM) and shows parity with or improvement over GPT-4-base on key medical benchmarks. Overlap analysis indicates test-set memorization is not a significant factor: accuracy differences between overlapping and non-overlapping test instances are small and lie within the 95% confidence intervals.
4. Human Evaluation Protocols
Human evaluation includes rigorous long-form answer scoring and pairwise physician preference assessments.
Long-Form Ratings:
- Datasets: MultiMedQA 140 (140 consumer medical questions), Adversarial General (58), Health Equity (182).
- Each answer is rated by 3–4 physicians across 12 axes (e.g., alignment with medical consensus, content coverage, likelihood of harm, demographic bias). Interrater reliability (Randolph's κ) exceeds 0.6 on all axes, and is higher still on the majority.
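For reference, Randolph's free-marginal multirater kappa can be computed as in the following sketch (the helper assumes an equal number of raters per item; the function name is ours):

```python
from typing import List

def randolphs_kappa(ratings: List[List[str]], categories: int) -> float:
    """Randolph's free-marginal multirater kappa.

    ratings: one list of rater labels per item (same number of raters per item).
    categories: number of possible rating categories.
    """
    n = len(ratings)
    r = len(ratings[0])
    # Mean observed pairwise agreement across items.
    total = 0.0
    for item in ratings:
        counts: dict = {}
        for label in item:
            counts[label] = counts.get(label, 0) + 1
        total += sum(c * (c - 1) for c in counts.values()) / (r * (r - 1))
    p_obs = total / n
    p_exp = 1.0 / categories  # chance agreement under free marginals
    return (p_obs - p_exp) / (1.0 - p_exp)
```

Unlike Fleiss' kappa, the chance-agreement term here does not depend on the observed label distribution, which suits tasks where raters are not constrained to produce fixed label proportions.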
Pairwise Rankings:
- Physicians compare Med-PaLM 2 and physician answers on 1,066 consumer and 240 adversarial questions across nine axes, such as alignment with medical consensus, knowledge recall, reasoning, omission of content, irrelevant content, bias, extent of harm, and likelihood of harm.
- Preferences are statistically analyzed using a nonparametric bootstrap (10,000 resamples) and two-tailed permutation tests blocked by question; significance is assessed at the 0.05 level.
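The bootstrap step can be sketched as follows; the function and its percentile-interval construction are illustrative assumptions (the permutation tests follow the same resampling pattern, with labels shuffled instead of resampled):

```python
import random
from typing import List, Tuple

def bootstrap_preference_ci(
    prefers_model: List[bool],   # per question: True if the model's answer was preferred
    resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Point estimate and percentile confidence interval for the
    preference rate, resampling whole questions with replacement."""
    rng = random.Random(seed)
    n = len(prefers_model)
    point = sum(prefers_model) / n
    stats = []
    for _ in range(resamples):
        resample = [prefers_model[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * resamples)]
    hi = stats[int((1 - alpha / 2) * resamples) - 1]
    return point, lo, hi
```

Resampling at the question level (rather than per rating) keeps correlated judgments about the same question together, which is the same motivation as blocking the permutation test by question.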
Med-PaLM 2's answers are preferred over physician-written answers on eight of the nine axes related to clinical utility.
5. Key Results and Analysis
Med-PaLM 2 achieves notable advances in medical QA performance:
- MedQA: 86.5% accuracy, exceeding Med-PaLM by over 19%; state-of-the-art on the dataset.
- PubMedQA: 81.8% (human: 78.0%).
- MedMCQA: 72.3%.
- Multiple MMLU medical subtopics up to 95.8%.
In human evaluations, Med-PaLM 2 improves over its predecessor on all axes for both standard and adversarial question sets. In blinded pairwise preference tests, physicians select Med-PaLM 2's responses more often than those written by physicians themselves on the majority of axes. These results underscore substantial progress toward physician-level accuracy and clinical relevance for LLM-driven question answering in medicine.
6. Limitations and Open Directions
Despite these results, Med-PaLM 2 has several articulated limitations:
- Real-world clinical validation remains outstanding; further deployment-oriented studies are needed.
- More extensive evaluation in richer, multi-turn dialogue settings is necessary.
- Expanded adversarial and equity-focused assessments are required.
- Full characterization of physician benchmarking protocols is not yet achieved.
Advancements in factuality, alignment, and human preference metrics highlight the pace of progress in domain-aligned LLMs. A plausible implication is the model’s potential for integration into decision-support systems, contingent on future clinical validation and continued scrutiny regarding safety, equity, and generalization across diverse medical contexts (Singhal et al., 2023).