Med-PaLM 2: Advanced Medical QA LLM

Updated 4 April 2026
  • Med-PaLM 2 is a large language model tailored for medical Q&A, using domain-specific fine-tuning and ensemble refinement prompting to achieve expert-level performance.
  • It builds on PaLM 2 improvements like multilingual pretraining and adaptive optimization, setting state-of-the-art benchmarks on MedQA, PubMedQA, and more.
  • Human evaluations show Med-PaLM 2’s outputs are often preferred over physician answers, highlighting its promise for clinical decision-support applications.

Med-PaLM 2 is an LLM specifically adapted for medical question answering, representing an advance toward expert-level performance through domain-specific alignment and prompting strategies. Building on the progress of the original Med-PaLM, Med-PaLM 2 combines an improved foundation model, instruction fine-tuning on curated medical datasets, and a novel ensemble refinement prompting scheme. Evaluated on the MultiMedQA benchmark suite and in extensive human assessments, Med-PaLM 2 demonstrates state-of-the-art performance, approaching or exceeding both prior models and physician-generated answers on several clinically relevant metrics (Singhal et al., 2023).

1. Foundation Model and Medical Domain Adaptation

Med-PaLM 2 leverages PaLM 2 as its backbone model. PaLM 2 introduces several enhancements over its predecessor PaLM: a broader multilingual pretraining corpus augmented with scientific articles and code, plus architectural and training improvements including refined layer scaling, activation functions, and adaptive optimizers. These changes yield better zero- and few-shot capabilities on standard LLM benchmarks.

To specialize PaLM 2 for medical reasoning and knowledge, domain-specific instruction fine-tuning is performed. This process follows the methodology of Chung et al. (2022), yielding a unified Med-PaLM 2 model that supports both multiple-choice and long-form answer generation. The instruction fine-tuning mixture comprises medical exam and consumer question datasets as illustrated below:

Dataset                           Examples   Mixture Proportion
MedQA (USMLE style)               10,178     37.5%
MedMCQA (Indian exam)             182,822    37.5%
LiveQA (consumer queries)         10         3.9%
MedicationQA (drug queries)       9          3.5%
HealthSearchQA (search queries)   45         17.6%
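
As a rough illustration of how such a fixed mixture can drive batch construction, the sketch below samples training examples in proportion to the table above. The sampler itself is an illustrative assumption, not the authors' actual data pipeline:

```python
import random

# Mixture proportions from the instruction fine-tuning table above.
MIXTURE = {
    "MedQA": 0.375,
    "MedMCQA": 0.375,
    "LiveQA": 0.039,
    "MedicationQA": 0.035,
    "HealthSearchQA": 0.176,
}

def sample_batch(datasets, batch_size, rng=random):
    """Draw a fine-tuning batch whose dataset composition follows MIXTURE.

    datasets: dict mapping dataset name -> list of (instruction, answer) pairs.
    """
    names = list(MIXTURE)
    weights = [MIXTURE[n] for n in names]
    batch = []
    for _ in range(batch_size):
        # Pick a source dataset with probability equal to its mixture weight,
        # then draw one example uniformly from that dataset.
        name = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(datasets[name]))
    return batch
```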

Fine-tuning minimizes the standard cross-entropy loss $\mathcal{L} = -\sum_{(x,y)}\log p_\theta(y \mid x)$ using standard Transformer fine-tuning hyperparameters (AdamW optimizer, learning-rate warm-up and decay, TPU-optimized batch sizes), training until validation performance plateaus.
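
The sketch below illustrates this objective in PyTorch, assuming a generic causal language model that returns token logits; the model interface and label-masking convention are assumptions for illustration, not the actual Med-PaLM 2 training stack:

```python
import torch
import torch.nn.functional as F

def instruction_tuning_loss(model, input_ids, labels):
    """Cross-entropy loss L = -sum log p_theta(y | x) over answer tokens.

    input_ids: (batch, seq) prompt + answer tokens.
    labels:    (batch, seq) same tokens, with prompt positions set to -100
               so that only answer tokens y contribute to the loss.
    """
    logits = model(input_ids).logits             # (batch, seq, vocab)
    # Shift so position t predicts token t+1, as in standard causal LM training.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,                       # skip masked prompt positions
    )

# Typical use with AdamW plus warm-up/decay scheduling, as described above:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```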

2. Ensemble Refinement Prompting

Med-PaLM 2 introduces Ensemble Refinement (ER), a two-stage prompting method designed to improve LLM reliability in medical question answering, particularly for multiple-choice formats. The process entails:

Stage 1: Ensemble Generation

  • Given a question $q$ and a few-shot chain-of-thought (CoT) prompt, generate $n_1$ stochastic outputs $\{r_i\}_{i=1}^{n_1}$, each accompanied by reasoning and an answer.

Stage 2: Refinement

  • Concatenate the original prompt, $q$, and all first-stage outputs, then ask the model to generate a refined response. Repeat $n_2$ times, producing $\{s_j\}_{j=1}^{n_2}$.
  • Select the final answer $\hat{a}$ by plurality vote among the $n_2$ outputs:

$$\hat{a} = \mathop{\mathrm{arg\,max}}_{a\in\mathcal{A}} \sum_{j=1}^{n_2}\mathbf{1}[s_j \text{ ends with answer } a]$$

Default values in the reported configuration are $n_1 = 11$ and $n_2 = 33$. This approach surpasses plain CoT prompting and self-consistency, particularly on complex questions.
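
A minimal sketch of the two-stage ER procedure follows. Here generate(prompt, temperature), which returns one sampled completion, and parse_answer, which extracts the final multiple-choice option from a completion, are hypothetical placeholders for a real model API; the prompt wiring is likewise an illustrative assumption:

```python
from collections import Counter

def ensemble_refinement(generate, parse_answer, cot_prompt, question,
                        n1=11, n2=33):
    """Two-stage Ensemble Refinement for multiple-choice QA.

    Stage 1: sample n1 chain-of-thought answers for the question.
    Stage 2: condition on all stage-1 outputs and sample n2 refined
             answers; the final answer is the plurality vote over them.
    """
    base = f"{cot_prompt}\n\nQuestion: {question}\n"

    # Stage 1: stochastic ensemble generation.
    stage1 = [generate(base, temperature=0.7) for _ in range(n1)]

    # Stage 2: refinement conditioned on the whole stage-1 ensemble.
    refine_prompt = (
        base
        + "\nCandidate reasoning paths:\n"
        + "\n---\n".join(stage1)
        + "\n\nConsidering the candidates above, give a refined answer.\n"
    )
    stage2 = [generate(refine_prompt, temperature=0.7) for _ in range(n2)]

    # Plurality vote: argmax_a sum_j 1[s_j ends with answer a].
    votes = Counter(parse_answer(s) for s in stage2)
    return votes.most_common(1)[0][0]
```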

3. Evaluation Frameworks and Benchmarks

Med-PaLM 2 is evaluated using both automatic and human protocols. Automatic assessment employs the MultiMedQA suite:

Benchmark (Items)               Med-PaLM 2 (ER) Accuracy
MedQA (1,273)                   85.4% (unified), 86.5% (MedQA-only)
MedMCQA (4,183)                 72.3%
PubMedQA (500)                  81.8% (self-consistency)
MMLU Clinical Knowledge         88.7%
MMLU Medical Genetics           92.0%
MMLU Anatomy                    84.4%
MMLU Professional Medicine      95.2%
MMLU College Biology            95.8%
MMLU College Medicine           83.2%

Med-PaLM 2 attains state-of-the-art performance on MedQA (+19.0% over Med-PaLM) and shows parity with or improvement over GPT-4-base on key medical benchmarks. Overlap analysis indicates that test-set memorization is not a significant factor: accuracy differences between overlapping and non-overlapping test instances are minor on each benchmark (for example, on MedMCQA the difference is small relative to its 95% confidence interval).
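
A sketch of one way such an overlap analysis can be run, assuming overlap is defined by shared fixed-length character segments; the 120-character window and the precomputed segment-set representation are illustrative assumptions, not necessarily the paper's exact criterion:

```python
def has_overlap(question, training_segments, window=120):
    """Flag a test question as 'overlapping' if any contiguous character
    segment of length `window` also appears in the training data.

    training_segments: set of all length-`window` character segments
    precomputed from the (whitespace-normalized) training corpus.
    """
    q = " ".join(question.split())  # normalize whitespace
    return any(q[i:i + window] in training_segments
               for i in range(max(1, len(q) - window + 1)))

def accuracy_by_overlap(items, training_segments):
    """Compare accuracy on overlapping vs. non-overlapping test items.

    items: list of (question_text, is_correct) pairs from one benchmark.
    """
    groups = {True: [], False: []}
    for question, is_correct in items:
        groups[has_overlap(question, training_segments)].append(is_correct)
    return {f"overlap={k}": sum(v) / len(v) for k, v in groups.items() if v}
```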

4. Human Evaluation Protocols

Human evaluation includes rigorous long-form answer scoring and pairwise physician preference assessments.

Long-Form Ratings:

  • Datasets: MultiMedQA 140 (140 consumer medical questions), Adversarial General (58 questions), and Adversarial Health Equity (182 questions).
  • Each answer is rated by 3–4 physicians on 12 axes (e.g., alignment with medical consensus, content coverage, likelihood of harm, demographic bias). Inter-rater agreement is strong: Randolph's κ exceeds 0.6 on every axis and is higher still on the majority (the κ computation is sketched after this list).
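
For reference, a minimal sketch of Randolph's free-marginal multirater κ, the agreement statistic named above; the handling of variable rater counts per item is a straightforward generalization assumed here for the 3–4 raters per question:

```python
from collections import Counter

def randolph_kappa(ratings, num_categories):
    """Randolph's free-marginal multirater kappa.

    ratings: list of per-item rating lists, e.g.
             [["agree", "agree", "disagree"], ...];
             items may have different numbers of raters.
    kappa = (P_o - 1/k) / (1 - 1/k), where P_o is the mean proportion of
    agreeing rater pairs per item and k is the number of rating categories.
    """
    per_item = []
    for item in ratings:
        n = len(item)
        if n < 2:
            continue  # pairwise agreement undefined with a single rater
        counts = Counter(item).values()
        agree_pairs = sum(c * (c - 1) for c in counts)
        per_item.append(agree_pairs / (n * (n - 1)))
    p_o = sum(per_item) / len(per_item)
    p_e = 1.0 / num_categories  # free-marginal chance agreement
    return (p_o - p_e) / (1.0 - p_e)
```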

Pairwise Rankings:

  • Physicians compare Med-PaLM 2 answers with physician-written answers on 1,066 consumer and 240 adversarial questions along nine axes, including alignment with medical consensus, recall, reasoning, omission of content, irrelevance, bias, extent of possible harm, and likelihood of harm.
  • Preferences are analyzed using a nonparametric bootstrap (10,000 resamples) and two-tailed permutation tests blocked by question, with significance assessed at p < 0.05; both procedures are sketched after this list.
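
A sketch of both procedures under simplifying assumptions: each question's comparisons are collapsed to one preference score (+1 if Med-PaLM 2 is preferred, -1 if the physician answer is preferred, 0 for a tie), so blocking by question reduces to sign-flipping per-question scores:

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, rng=random):
    """Nonparametric bootstrap CI for the mean per-question preference."""
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n  # resample w/ replacement
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def permutation_pvalue(scores, n_permutations=10_000, rng=random):
    """Two-tailed sign-flip permutation test, blocked by question.

    Under H0 (no systematic preference), each question's score is equally
    likely to carry either sign, so we compare the observed |sum| against
    the distribution of |sum| over random sign assignments.
    """
    observed = abs(sum(scores))
    hits = sum(
        abs(sum(s * rng.choice((-1, 1)) for s in scores)) >= observed
        for _ in range(n_permutations)
    )
    return hits / n_permutations
```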

Overall, Med-PaLM 2 is preferred over physician-written answers on eight of nine axes related to clinical utility (p < 0.001).

5. Key Results and Analysis

Med-PaLM 2 achieves notable advances in medical QA performance:

  • MedQA: 86.5%, exceeding Med-PaLM by more than 19 percentage points and setting a new state of the art on the dataset.
  • PubMedQA: 81.8% (human: 78.0%).
  • MedMCQA: 72.3%.
  • Multiple MMLU medical subtopics up to 95.8%.

In human evaluations, Med-PaLM 2 improves over its predecessor on all axes for both standard and adversarial questions, with statistically significant gains. In blinded pairwise preference tests, physicians select Med-PaLM 2's responses more often than those written by physicians themselves on the majority of axes. These results underscore substantial progress toward physician-level accuracy and clinical relevance for LLM-driven question answering in medicine.

6. Limitations and Open Directions

Despite these results, Med-PaLM 2 has several acknowledged limitations:

  • Real-world clinical validation remains outstanding; further deployment-oriented studies are needed.
  • More extensive evaluation in richer, multi-turn dialogue settings is necessary.
  • Expanded adversarial and equity-focused assessments are required.
  • Physician benchmarking protocols are not yet fully characterized.

Advancements in factuality, alignment, and human preference metrics highlight the pace of progress in domain-aligned LLMs. A plausible implication is the model’s potential for integration into decision-support systems, contingent on future clinical validation and continued scrutiny regarding safety, equity, and generalization across diverse medical contexts (Singhal et al., 2023).

References

Singhal, K., Tu, T., Gottweis, J., et al. (2023). Towards Expert-Level Medical Question Answering with Large Language Models. arXiv:2305.09617.
