HealthBench: Benchmark for Medical LLMs

Updated 20 November 2025
  • HealthBench is an open-source, physician-vetted benchmark that measures a wide spectrum of clinical reasoning, communication quality, and uncertainty handling in medical LLMs.
  • The dataset comprises 5,000 multi-turn clinical dialogues across 26 specialties and 49 languages, each annotated with detailed rubric criteria and metadata.
  • HealthBench introduces specialized subsets and automated GPT-4 grading to test model robustness, despite noted limitations in global guideline alignment and multi-turn reasoning.

HealthBench is an open-source, physician-vetted benchmark for evaluating LLMs in medical and clinical contexts. Comprising 5,000 realistic, multi-turn dialogues spanning 26 specialties and 49 languages, HealthBench is distinguished by its rubric-driven, behavior-level annotation framework. Unlike conventional multiple-choice or recall-based datasets, it measures a spectrum of advanced competencies, such as contextual reasoning, uncertainty handling, and communication quality, all underpinned by explicit, scenario-specific criteria authored by a global physician cohort (Arora et al., 13 May 2025, Mutisya et al., 31 Jul 2025, Ravichandran et al., 29 Aug 2025, Hisada et al., 22 Sep 2025).

1. Dataset Composition and Structure

HealthBench contains 5,000 physician-refined clinical conversations, each annotated with a mean of 11–12 rubric criteria, totaling 48,562 unique checklist items across the corpus. Dialogues are both single- and multi-turn (mean 2.6 turns; range 1–19), with each utterance role-tagged (User or Model) and supplemented by rich metadata for theme, subcategory, urgency, uncertainty, and information sufficiency.

The dataset’s scope encompasses 26 medical specialties (e.g., emergency medicine, pediatrics, infectious disease) and covers seven thematic categories:

  • Emergency referrals
  • Expertise-tailored communication
  • Responding under uncertainty
  • Response depth
  • Health data tasks
  • Global health
  • Context seeking

The language distribution is skewed, with over 50% of examples in Western European languages; however, annotations were contributed by physicians from 60 countries (Arora et al., 13 May 2025, Mutisya et al., 31 Jul 2025).

Dialogues are derived from three principal data streams:

  • High-stakes, physician-drafted scenarios
  • Adversarial “red-team” prompts probing known model failure modes
  • HealthSearchQA: consumer queries re-cast as conversations

Metadata tags operationalize clinical urgency with gradations such as high urgency (e.g., suspected sepsis), medium urgency (routine follow-up), and low urgency or informational advice (preventive counseling) (Mutisya et al., 31 Jul 2025, Arora et al., 13 May 2025).
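
The released data is distributed as JSON (see Section 7). The sketch below illustrates the kind of record described above; the field names and values are assumptions for illustration, not the published schema.

```python
# Illustrative record shape for one HealthBench example.
# Field names are assumptions, not the released schema.
example = {
    "prompt_id": "hb_000123",                      # hypothetical identifier
    "theme": "emergency_referrals",
    "specialty": "emergency medicine",
    "language": "en",
    "metadata": {
        "urgency": "high",
        "uncertainty": "low",
        "info_sufficiency": "partial",
    },
    "conversation": [                              # role-tagged turns (User / Model)
        {"role": "user",
         "content": "My father is confused, feverish, and breathing fast."},
    ],
    "rubric": [                                    # criteria with point values and axes
        {"criterion": "Advises immediate emergency evaluation for possible sepsis",
         "points": 10, "axis": "accuracy"},
        {"criterion": "Fails to mention any red-flag symptoms to monitor",
         "points": -6, "axis": "completeness"},
    ],
}
```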

2. Annotation, Rubric Design, and Scoring Framework

A network of 262 vetted physicians authored the dataset and designed context-specific rubrics over an 11-month period, using iterative scenario proposal and editing to ensure clinical fidelity. Each rubric comprises multiple checklist criteria formulated in clinical language, targeting both desired and undesirable behaviors. Each criterion maps to one of five behavioral axes:

  • Accuracy
  • Completeness
  • Context awareness
  • Communication quality
  • Instruction following

Criteria are scored on a –10 … +10 scale; additive points are granted for fulfilled desiderata, and penalties are imposed for harmful behavior or omissions. Roughly 13% of items underwent double review by consensus panels, with the rest singly authored (Arora et al., 13 May 2025, Mutisya et al., 31 Jul 2025).

Scoring operates as follows. For each conversation $i$ with $M_i$ criteria:

$$ s_i = \frac{\sum_{j=1}^{M_i} r_{ij}\, p_{ij}}{D_i} $$

where $r_{ij}$ is an indicator for criterion $j$ being met, $p_{ij}$ is the criterion point value, and $D_i = \sum_{j=1}^{M_i} \max(0, p_{ij})$. The aggregate HealthBench score is

$$ S = \operatorname{clip}\!\left(\frac{1}{N}\sum_{i=1}^{N} s_i,\ 0,\ 1\right) $$

Axis-level, theme-level, and subset analyses (see below) proceed similarly, based on per-axis or per-criterion grouping. Scoring code and JSON schemas are publicly provided (Arora et al., 13 May 2025, Ravichandran et al., 29 Aug 2025).
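
A minimal sketch of this scoring rule, assuming hypothetical per-criterion fields ("met", "points") rather than the released schema:

```python
from typing import Dict, List

def example_score(criteria: List[Dict]) -> float:
    """Per-conversation score s_i: points earned on met criteria,
    normalized by the maximum achievable (positive) points D_i."""
    earned = sum(c["points"] for c in criteria if c["met"])
    max_positive = sum(max(0, c["points"]) for c in criteria)
    return earned / max_positive if max_positive > 0 else 0.0

def healthbench_score(per_example_scores: List[float]) -> float:
    """Aggregate score S: mean of per-example scores, clipped to [0, 1]."""
    mean = sum(per_example_scores) / len(per_example_scores)
    return min(max(mean, 0.0), 1.0)

# Usage: two positive criteria met; the negative criterion (a harmful behavior)
# was not triggered, so no penalty is applied.
criteria = [
    {"met": True, "points": 5},
    {"met": True, "points": 3},
    {"met": False, "points": -8},
]
print(example_score(criteria))  # 1.0 (8 of 8 achievable positive points)
```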

3. Benchmark Variations and Subsets

HealthBench features specialized subsets to probe specific model competencies and failure modes:

  • HealthBench Consensus: 34 consensus criteria of high clinical impact, validated via multi-physician agreement, filtered to 3,671 examples where at least one such criterion is assigned. Used to compute model error rates on critical behaviors (e.g., emergency referrals).
  • HealthBench Hard: 1,000 most challenging cases, empirically selected using low mean scores of frontier LLMs. Functions as a stress-test for clinical complexity and ambiguity (Arora et al., 13 May 2025, Ravichandran et al., 29 Aug 2025).

Each variant is directly accessible via open-source platform APIs (e.g., simple-evals), with evaluation scripts provided.
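
A hedged sketch of how the Consensus subset can be used to compute an error rate on critical behaviors; the field names ("consensus", "met") are illustrative, not the released schema:

```python
def consensus_error_rate(graded_examples):
    """Fraction of assigned consensus criteria a model fails to satisfy."""
    failed = total = 0
    for ex in graded_examples:
        for crit in ex["criteria"]:
            if crit.get("consensus"):        # only consensus-validated criteria count
                total += 1
                failed += 0 if crit["met"] else 1
    return failed / total if total else 0.0
```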

4. Limitations, Biases, and Regional Adaptation

While HealthBench advances model evaluation, several limitations have been documented:

  • Evidence Pyramid Inversion: Rubric criteria are sourced from individual clinician opinion rather than systematic reviews or guideline-based evidence, risking solidification of local practice biases.
  • Geospatial and Domain Gaps: Neglected tropical diseases (e.g., HIV, malaria, schistosomiasis) are under-represented (<3% each). Guidelines and rubrics are predominantly based on US/UK standards, frequently mismatching LMIC practices and legal or resource constraints. Immunization rubrics, for instance, specify Western vaccination schedules misaligned with actual practice in Kenya and other LMICs.
  • Single-turn Dominance: A preponderance of single-turn dialogues insufficiently challenges multi-turn reasoning or memory consistency.
  • Static Snapshot Risk: Rubrics may become outdated as guidelines evolve (e.g., COVID-19 protocols), leading to optimization against obsolete practices (Mutisya et al., 31 Jul 2025).

Localization studies, such as the Japanese adaptation (“J-HealthBench”), quantified contextual misalignment: 14.3% of scenario/rubric pairs required adjustment and 2.7% were inapplicable to Japanese guidelines. Performance of frontier multilingual models (e.g., GPT-4.1) drops modestly under machine-translated rubrics (Δ_EN→JP = 0.033), but native Japanese LLMs failed to meet completeness targets (overall score –0.062), demonstrating both the challenge and necessity of context-specific adaptation (Hisada et al., 22 Sep 2025).

5. Automated Grading and Reliability

An automated GPT-4-based grader applies rubrics at scale, with spot-checked human auditing (5–10% of samples) for quality control. Identified issues with automated grading include model bias reinforcement, hallucinated judgements, codification of individual annotator idiosyncrasies, and sensitivity to architectural mismatch (grading should ideally be model-agnostic) (Mutisya et al., 31 Jul 2025).
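
The grading loop amounts to posing each rubric criterion to a grader model as a binary met/not-met judgment. The sketch below illustrates this pattern with the OpenAI Python client; the prompt wording, grader model name, and output parsing are placeholders, not the official grading setup shipped with simple-evals.

```python
from openai import OpenAI

client = OpenAI()

def grade_criterion(conversation: str, response: str, criterion: str) -> bool:
    """Ask a grader model whether one rubric criterion is met (True/False)."""
    prompt = (
        "You are grading a medical chatbot response against one rubric criterion.\n"
        f"Conversation:\n{conversation}\n\n"
        f"Response:\n{response}\n\n"
        f"Criterion: {criterion}\n"
        "Answer strictly 'yes' if the criterion is met, otherwise 'no'."
    )
    completion = client.chat.completions.create(
        model="gpt-4.1",   # placeholder grader model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip().lower().startswith("yes")
```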

Meta-evaluation metrics are computed to assess grading reliability (F1 score, macro-F1 across classes), and “worst-at-k” statistics quantify the reliability of model outputs under response sampling regimes (Arora et al., 13 May 2025, Ravichandran et al., 29 Aug 2025).
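
One plausible reading of the worst-at-k statistic, sketched under that assumption: for each example, take the minimum score among k sampled responses, then average across examples (estimated here by random resampling).

```python
import random

def worst_at_k(scores, k, trials=1000, seed=0):
    """scores[i][j]: rubric score of the j-th sampled response to example i.
    Returns an estimate of the mean, over examples, of the worst score among k samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        per_example_min = [min(rng.sample(row, k)) for row in scores]
        total += sum(per_example_min) / len(per_example_min)
    return total / trials
```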

6. Proposed Advances: Evidence-Robust and Ethical Benchmarking

Recent work recommends remedying evidence-tier and regionalization deficits by anchoring rubric criteria in version-controlled Clinical Practice Guidelines (CPGs) and GRADE ratings, enabling:

  • Rubric-to-CPG Linkage: Direct mapping of each criterion to persistent guideline identifiers (e.g., “WHO-Sepsis-2023-Rec-1.2”), machine-readable transformations (FHIR CQL), and transparent traceability.
  • Evidence-Weighted Scoring: Assigning weights to rubric items proportional to evidence quality (e.g., +3 for high-quality meta-analyses, +1 for consensus/expert opinion), providing total scores as

$$ S_{\text{total}} = \sum_{i=1}^{n} w_i \, s_i $$

where $w_i$ corresponds to the GRADE evidence tier of rubric item $i$; a minimal sketch of this weighting follows the list.

  • Reward Function Contextualization: Incorporating resource availability, patient comorbidities, and override logic for justified guideline deviation, thereby embedding penalty attenuation and equity guardrails.
  • Outcome-Linked Feedback and Ethics: Integrating real-world outcomes (readmission rate, morbidity) into scoring; formal requirements for public audit trails, explainability, and compliance with privacy laws, particularly in LMIC contexts.
  • Continuous Dataset Maintenance: Scheduled versioning, dynamic response to guideline updates, and systematic curation of scenario/case portfolio to maintain clinical relevance (Mutisya et al., 31 Jul 2025).
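
A hedged sketch of the evidence-weighted scoring proposal referenced above; the GRADE-tier weights and guideline identifier are illustrative values, not part of the released benchmark.

```python
# Illustrative GRADE-tier weights (assumed values).
GRADE_WEIGHTS = {"high": 3.0, "moderate": 2.0, "low": 1.5, "expert_opinion": 1.0}

def evidence_weighted_score(criteria):
    """S_total = sum_i w_i * s_i, where s_i is the points earned on rubric item i
    (its point value if met, else 0) and w_i is its GRADE-tier weight.
    Each item may also carry a CPG identifier (e.g., 'WHO-Sepsis-2023-Rec-1.2')
    for traceability."""
    return sum(
        GRADE_WEIGHTS[c["grade"]] * (c["points"] if c["met"] else 0)
        for c in criteria
    )
```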

7. Access, Tooling, and Community Adoption

HealthBench is freely available under an open-source license at github.com/openai/simple-evals, with data in JSON and code for loading, rubric application, and metric computation. A higher-level API (pip install healthbench-eval) supports per-subset analysis and script-based benchmarking (Arora et al., 13 May 2025, Ravichandran et al., 29 Aug 2025).

Licensing prohibits commercial use without explicit consent and encourages citation of primary research (Ravichandran et al., 29 Aug 2025). Canary strings and private held-out sets help monitor for training data leakage and benchmark overfitting.

HealthBench has become the principal standard for holistic medical LLM evaluation, providing substantial longitudinal insight into model progress. For example, the smaller GPT-4.1 nano now outperforms the earlier, larger GPT-4o while reducing inference cost by a factor of 25, and the consensus error rate for critical behaviors (e.g., emergency referrals) improved more than four-fold over two years (Arora et al., 13 May 2025, Ravichandran et al., 29 Aug 2025).

A plausible implication is that HealthBench, with continued evidence-tiering, localization, and outcome integration, will increasingly serve not only as an evaluative resource but as a blueprint for regulatory-grade, globally relevant AI safety benchmarks in healthcare.
