HealthBench: Clinical LLM Benchmark
- HealthBench is an open-source, physician-grounded benchmark that evaluates LLM performance using 5,000 expert-annotated, open-ended dialogues (single- and multi-turn) across diverse clinical scenarios.
- It employs a detailed rubric-based annotation framework with weighted scoring across key behavioral axes such as accuracy, completeness, and context awareness.
- HealthBench supports specialized subsets including hard and consensus partitions, aiding incremental model alignment and robust evaluation of clinical AI capabilities.
HealthBench is an open-source, physician-grounded benchmark designed to evaluate the performance, safety, and clinical alignment of LLMs in healthcare contexts. Developed by a consortium of clinicians and researchers, HealthBench advances model assessment beyond conventional multiple-choice question (MCQ) or short-form answer evaluations by leveraging 5,000 open-ended, multi-turn conversations spanning global medical scenarios. Through a fine-grained rubric-based annotation framework and a multidimensional scoring methodology, HealthBench quantitatively captures model behaviors critical to real-world clinical utility, and it is increasingly adopted as the default leaderboard for medical LLM development and alignment.
1. Benchmark Structure, Scope, and Annotation
HealthBench consists of 5,000 conversational examples—each a complete dialogue between a user (patient or clinician) and an LLM. Dialogues are both single- and multi-turn (mean 2.6 turns, median 1, up to 9,853 characters)—providing breadth across consultation types and interaction depths (Arora et al., 13 May 2025).
Each conversation is assigned to one of seven real-world health “themes”:
| Theme | Proportion of Examples |
|---|---|
| Global health | 21.9% |
| Responding under uncertainty | 21.4% |
| Expertise-tailored communication | 18.4% |
| Context seeking | 11.9% |
| Emergency referrals | 9.6% |
| Health data tasks | 9.5% |
| Response depth | 7.2% |
For every conversation, expert physicians author a set of case-specific rubric criteria (totaling 48,562 unique items), each assigned an integer point value in [-10, 10], with negative values penalizing harmful or undesirable behavior. Criteria are distributed across five key behavioral axes: Accuracy (33%), Completeness (39%), Context Awareness (16%), Communication Quality (8%), and Instruction Following (4%). This rubric architecture enables granular scoring of diverse LLM outputs along clinically salient dimensions (Arora et al., 13 May 2025).
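The rubric architecture described above can be pictured as a simple data model. The class and field names below are illustrative only, not the benchmark's released schema:

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    # Hypothetical representation of a HealthBench-style criterion.
    text: str    # physician-authored criterion, e.g. "advises emergency referral"
    points: int  # weight in [-10, 10]; negative points penalize bad behavior
    axis: str    # one of: accuracy, completeness, context_awareness,
                 # communication_quality, instruction_following

def axis_share(criteria: list[RubricCriterion], axis: str) -> float:
    """Fraction of total positive points attributable to one behavioral axis."""
    total = sum(c.points for c in criteria if c.points > 0)
    axis_pts = sum(c.points for c in criteria if c.points > 0 and c.axis == axis)
    return axis_pts / total if total else 0.0
```

Aggregating `axis_share` over all 48,562 criteria is what yields the per-axis percentages (33% Accuracy, 39% Completeness, and so on) reported above.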
2. Scoring Methodology and Equations
Scoring in HealthBench is formally defined via physician-authored, criterion-weighted rubrics applied to a model's response. For the $i$-th example, with $n_i$ criteria, the point value for criterion $j$ is $p_{ij}$, and the binary indicator $m_{ij} = 1$ if criterion $j$ is met, 0 otherwise. Define the total positive-point sum as

$$P_i = \sum_{j \,:\, p_{ij} > 0} p_{ij}.$$

The normalized item score

$$s_i = \frac{\sum_{j=1}^{n_i} p_{ij}\, m_{ij}}{P_i}$$

is then clipped to $[0, 1]$ to prevent negative normalization, and the aggregate model score is

$$S = \frac{1}{N} \sum_{i=1}^{N} s_i.$$
Theme- and axis-specific scores are defined by restricting the sum to the relevant criteria or examples (Arora et al., 13 May 2025). This framework applies equally to baseline, consensus, or hard partitions.
For the hard and consensus subsets, performance is typically reported via the same formalism, supplemented by error rates on critical, safety-focused themes (e.g., emergency referrals).
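The scoring formalism above can be sketched in a few lines. This is a minimal illustration, not the official grader; in practice the met/unmet judgments come from a model-based grader rather than boolean inputs:

```python
def example_score(points: list[float], met: list[bool]) -> float:
    """Normalized per-example score s_i: earned points over the maximum
    achievable (sum of positive points), clipped to [0, 1]."""
    max_pts = sum(p for p in points if p > 0)
    if max_pts == 0:
        return 0.0
    earned = sum(p for p, m in zip(points, met) if m)
    return min(max(earned / max_pts, 0.0), 1.0)

def benchmark_score(examples: list[tuple[list[float], list[bool]]]) -> float:
    """Aggregate score S: unweighted mean of per-example scores."""
    return sum(example_score(p, m) for p, m in examples) / len(examples)
```

Note how a met negative-point criterion subtracts from the numerator but not the denominator, which is why clipping at zero is needed.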
3. Variations: Consensus, Hard, and Localization Adaptations
HealthBench supports two key benchmark subsets:
- Consensus Subset: 3,671 examples flagged for high-priority, physician-agreed criteria (34 core items). This focuses on critical safety, correctness, and guideline adherence (e.g., explicit emergency escalation instructions) (Arora et al., 13 May 2025).
- Hard Subset: 1,000 examples systematically selected for the lowest average performance among frontier models—comprising complex, ambiguous, or rare scenarios. The top current score on Hard is approximately 44.7 (Shanzhi-M1), while most other models and even strong proprietary systems remain below this mark (Jin et al., 20 Nov 2025).
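The Hard subset construction described above (examples with the lowest average performance among frontier models) can be sketched as follows; this is a simplified reading of the selection procedure, not the exact published pipeline:

```python
def select_hard_subset(example_scores: dict[str, list[float]], n: int) -> list[str]:
    """Pick the n example IDs with the lowest mean score across a
    panel of frontier models (one score per model per example)."""
    by_difficulty = sorted(
        example_scores,
        key=lambda eid: sum(example_scores[eid]) / len(example_scores[eid]),
    )
    return by_difficulty[:n]
```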
Localization efforts, such as J-HealthBench, highlight that approximately 17% of examples or rubric criteria require context-aware modification due to mismatches with local clinical guidelines, healthcare systems, or societal norms (e.g., insurance schemes, drug legality, hotline numbers) (Hisada et al., 22 Sep 2025, Dey et al., 12 Nov 2025). These efforts involve adjusting scenario details, translating and adapting rubrics, and developing localized evaluation weights.
4. Comparative Model Performance and Benchmark Role
HealthBench has established itself as the canonical benchmark for health-LLM evaluation, measuring both closed- and open-source models. Notable recent results include:
| Model | HealthBench Full | HealthBench Hard |
|---|---|---|
| Shanzhi-M1 (Qwen-32B+MR-RML) | 62.7 | 44.7 |
| Baichuan-M3 | 65.1 | 44.4 |
| Baichuan-M2 | 33.2 (32B OSS), 63.6 (32B w/ verification) | 34.7 |
| Doctor-R1 (8B) | 36.3 (Main) | 18.7 (Hard, 300 cases) |
| GPT-5 | 97 (1000-item mini) | 46.2 |
| o3 | 60.0 | 32.0 |
| GPT-4.1 | 31.2 (Main, 500) | 16.9 (Hard, 300) |
Frontier generalist LLMs—especially the OpenAI GPT-5 family—currently set the global upper bound, substantially outperforming both domain-specific clinical tools and open-source models across all axes, including completeness and context awareness (Vishwanath et al., 1 Dec 2025). Cost-efficient small models (e.g., GPT-4.1 nano) have made recent gains, achieving substantial performance increases at 1/25 the cost (Arora et al., 13 May 2025).
5. Critical Insights and Limitations
Performance breakdowns consistently reveal that model deficits cluster in Completeness and Context Awareness (jointly ≈ 55% of scoring criteria), while themes such as Emergency Referrals and Expertise-Tailored Communication are less saturated but remain unsolved (Arora et al., 13 May 2025). Behavioral analysis under “worst-at-k” sampling demonstrates that even high-performing models display considerable unreliability on worst-case, safety-critical queries.
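One simplified reading of worst-at-k is: for each example, draw k graded responses, keep the lowest score, and average over examples. The estimator below is a sketch under that assumption, not the paper's exact procedure:

```python
import random

def worst_at_k(scores_per_example: list[list[float]], k: int,
               trials: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of worst-at-k: for each example, sample k of
    its graded response scores, keep the minimum, average over all
    examples and trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        for scores in scores_per_example:
            total += min(rng.sample(scores, k))
    return total / (trials * len(scores_per_example))
```

Because the minimum is taken before averaging, a model that answers well on average but fails occasionally scores far lower under worst-at-k than under the standard mean, which is exactly the unreliability the analysis surfaces.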
HealthBench’s rubric-based scoring structure, while rigorous, codifies physician expert opinion (GRADE D), and thus may reinforce regional, cultural, or idiosyncratic biases. These limitations become pronounced in global deployment: Western-centric requirements (e.g., defaulting to U.S. legislation, billing codes) systematically penalize responses that are otherwise medically and culturally appropriate in non-Western settings (Dey et al., 12 Nov 2025, Mutisya et al., 31 Jul 2025).
Calls for improvement emphasize localization of criteria, anchoring scoring in version-controlled Clinical Practice Guidelines with GRADE evidence weighting, and integrating dynamic “override” logic to account for contextually justified deviations (e.g., drug stockouts, idiosyncratic local protocols) (Mutisya et al., 31 Jul 2025). Evidence-weighted scoring and traceable guideline linkage are proposed as solutions for more clinically and ethically robust evaluation.
6. Extensions, RLHF, and Reward Modeling
HealthBench underpins recent RL and reward modeling frameworks targeting health-aligned LLM improvement. Approaches include:
- MR-RML with GPRC: Multidimensional rubric-oriented reward model learning with geometric constraints, leading to substantial absolute (+17.7) and relative (+39%) gains on the full set, and the highest open-source performance on Hard (Jin et al., 20 Nov 2025).
- ORBIT Incremental RL: Dynamic creation of synthetic rubrics to train small models (Qwen3-4B) from 7.0 to 27.2 on Hard using only 2k cases, sharply improving sample efficiency for complex clinical behaviors (Wang et al., 17 Oct 2025).
- Agentic RL Architectures (Doctor-R1): Joint optimization of “soft” communicative skills and “hard” medical accuracy yields improvements in communication quality and completeness, establishing new open-source state-of-the-art at low parameter count (Lai et al., 5 Oct 2025).
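These approaches share a common mechanic: the rubric score of a sampled response becomes the scalar reward for policy optimization. The sketch below illustrates only that mechanic; the grader `grade_fn` and the group-relative baseline are illustrative assumptions, not any specific paper's method:

```python
def rubric_rewards(responses, criteria_points, grade_fn):
    """Convert rubric grades into centered scalar rewards for a batch.

    grade_fn(response, criterion_index) -> bool stands in for a
    model-based grader; criteria_points are the rubric weights.
    """
    max_pts = sum(p for p in criteria_points if p > 0)
    rewards = []
    for r in responses:
        earned = sum(p for j, p in enumerate(criteria_points) if grade_fn(r, j))
        rewards.append(min(max(earned / max_pts, 0.0), 1.0))
    # Subtract the batch mean as a group-relative baseline (GRPO-style);
    # this normalization is an assumption for illustration.
    mean = sum(rewards) / len(rewards)
    return [x - mean for x in rewards]
```

Because the reward is exactly the benchmark's clipped rubric score, any grader blind spot is directly optimizable, which is one concrete route to the "reward hacking" failure mode noted below.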
These advances confirm the utility of rubric-based reward signals for incremental model alignment, but also reveal persistent vulnerabilities—“reward hacking,” context-misunderstanding, and static benchmark drift—which necessitate continual rubric renewal and integration with outcome-centric reinforcement loops.
7. Impact, Adoption, and Future Directions
HealthBench has become the de facto benchmark for end-to-end, behavior-level evaluation of medical LLMs and agentic systems, cited in nearly all state-of-the-art model releases and RL methodologies since 2025. Its influence extends to clinical tool assessment, where generalist LLMs have been shown to eclipse marketed decision support tools in alignment, completeness, and context sensitivity (Vishwanath et al., 1 Dec 2025). Despite this, domain-specific deployment demands ongoing localization and dynamic update of both scenario content and scoring logic.
Key recommended future directions include:
- Development of regionally tuned and context-aware HealthBench variants (e.g., J-HealthBench for Japan).
- Integration of guideline-based, evidence-weighted scoring pipelines to replace or augment static expert rubrics.
- Expansion to longitudinal, multimodal, and workflow-linked evaluation to bridge the gap between snapshot QA and full clinical pathways (Team et al., 6 Feb 2026).
- Interactive, hybrid human–AI grading protocols to ensure rubrics remain current, unbiased, and generalizable.
In summary, HealthBench provides the most rigorous, comprehensive, and clinically relevant evaluation suite for health-capable LLMs, catalyzing objective benchmark-driven progress in medical AI while surfacing essential limitations for trustworthy global deployment (Arora et al., 13 May 2025, Dey et al., 12 Nov 2025, Mutisya et al., 31 Jul 2025, Jin et al., 20 Nov 2025, Vishwanath et al., 1 Dec 2025, Team et al., 6 Feb 2026, Hisada et al., 22 Sep 2025).