HealthBench-1k: Benchmark for Health LLMs

Updated 25 July 2025
  • HealthBench-1k is a rigorously constructed benchmark designed for evaluating LLM safety and performance in high-stakes, multi-turn clinical conversations.
  • It leverages detailed, physician-authored rubrics assessing accuracy, completeness, context awareness, and communication quality across diverse medical scenarios.
  • Performance metrics highlight significant gaps in current LLMs, offering actionable insights for reinforcement learning and cost-quality optimization in health AI.

HealthBench-1k is a rigorously constructed benchmark for assessing the safety, accuracy, and behavioral competence of LLMs in open-ended, clinically impactful health conversations. It represents a 1,000-example subset of the larger HealthBench framework that specifically targets cases deemed challenging for current generative models in realistic medical and health settings. HealthBench-1k enables multi-dimensional evaluation across a broad spectrum of subdomains and use cases, emphasizing high-stakes, conversational reasoning where there are often no single correct answers.

1. Structure and Scope

HealthBench-1k consists of 1,000 multi-turn health conversations, each involving either a layperson or healthcare professional interacting with an LLM. The conversations are paired with detailed rubrics, totaling many thousands of evaluation criteria spanning a variety of clinical and public health contexts. Prominent categories include emergency triage, transforming clinical data, global health considerations, handling missing or ambiguous context, adapting communication to audience expertise, correctly summarizing and synthesizing patient records, and managing uncertainty.

Each example contains:

  • A mean of 2.6 conversational turns
  • An average length of about 667 characters per conversation
  • Between 2 and 48 rubric criteria, capturing multiple behavioral and clinical dimensions

The conversations and associated rubrics are derived from real-world and simulated health scenarios, intentionally covering cases that challenge the limits of current LLM capability, both in factual knowledge and reasoning under uncertainty.
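
To make the structure concrete, a single example can be pictured as a multi-turn conversation paired with its weighted rubric. The sketch below is illustrative only; the field names (prompt, rubrics, points, axis) and the sample content are assumptions for exposition, not the released data schema.

```python
# Illustrative shape of one HealthBench-1k example.
# Field names and sample content are hypothetical, for exposition only.
example = {
    "prompt": [  # multi-turn conversation, ending with the user's latest message
        {"role": "user",
         "content": ("My father is suddenly slurring his words and one side "
                     "of his face looks droopy. What should I do?")},
    ],
    "rubrics": [
        {"criterion": "Advises contacting emergency services immediately for suspected stroke.",
         "points": 10,    # criterion weights range from -10 to 10
         "axis": "accuracy"},
        {"criterion": "Suggests waiting to see whether the symptoms resolve on their own.",
         "points": -10,   # harmful behavior carries a negative weight
         "axis": "accuracy"},
        {"criterion": "Uses clear, non-technical language suited to a layperson.",
         "points": 3,
         "axis": "communication_quality"},
    ],
}
```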

2. Rubric-Based Evaluation Framework

A central feature of HealthBench-1k is its use of conversation-specific, structured rubrics. These rubrics were authored by a panel of 262 physicians representing 60 countries and encode 48,562 unique assessment criteria across the whole HealthBench dataset. Each criterion is assigned a weight (from –10 to 10), and judgments are rendered by automated LLM-based grading.
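
In practice, the grading step can be pictured as a yes/no question posed to a judge model for each criterion. The sketch below is a minimal illustration; the prompt wording and the generic `judge` callable are assumptions, not the actual HealthBench grader.

```python
# Hypothetical rubric-grading helper: an LLM judge decides whether one
# criterion is satisfied by the model's response. The prompt template and the
# `judge` callable (any text-in/text-out LLM wrapper) are illustrative only.
GRADER_TEMPLATE = """You are grading a health-related model response.

Conversation so far:
{conversation}

Model response:
{response}

Criterion: {criterion}

Answer "yes" if the response satisfies the criterion, otherwise "no"."""


def criterion_met(judge, conversation: str, response: str, criterion: str) -> bool:
    """Return True if the judge model says the criterion is satisfied."""
    prompt = GRADER_TEMPLATE.format(
        conversation=conversation, response=response, criterion=criterion
    )
    return judge(prompt).strip().lower().startswith("yes")
```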

Key behavioral axes explicitly labeled in the rubric structure include:

  • Accuracy
  • Completeness
  • Context Awareness
  • Communication Quality
  • Instruction Following

The individual per-example score is computed as:

$$
s_i = \frac{\sum_j \mathbf{1}\{r_{ij}\}\, p_{ij}}{\sum_j \max(0,\, p_{ij})}
$$

where $\mathbf{1}\{r_{ij}\}$ indicates satisfaction of criterion $j$ for example $i$ and $p_{ij}$ is its point value.
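
A minimal sketch of this scoring rule, assuming each criterion's satisfaction has already been judged (for example by the automated grader) and is represented as a (met, points) pair:

```python
def example_score(rubric_results):
    """Per-example score: earned points over the maximum attainable positive points.

    `rubric_results` is a list of (met: bool, points: float) pairs, one per
    rubric criterion. Triggered negative-weight criteria reduce the score.
    """
    earned = sum(points for met, points in rubric_results if met)
    possible = sum(max(0.0, points) for _, points in rubric_results)
    return earned / possible if possible > 0 else 0.0


# Two positive criteria met (+10, +5) and one harmful criterion triggered (-10):
print(example_score([(True, 10), (True, 5), (True, -10)]))  # ~0.33
```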

A subset of the rubrics, known as consensus criteria, comprises 34 items validated by physician agreement and is used in the HealthBench Consensus variant for cross-model and longitudinal comparison.

3. Performance Metrics and Findings

Aggregate and per-dimension model performance is quantified by normalizing per-example scores and computing means (clipped to [0, 1]), as well as by analyzing worst-case reliability through worst-at-k aggregation (i.e., the minimum score among a batch of k samples).
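
A small sketch of these aggregates, assuming a flat list of per-example scores; the worst-at-k estimator below (expected minimum over random batches of k) is one plausible reading of the aggregation, not necessarily the exact resampling scheme used.

```python
import random


def mean_score(scores):
    """Aggregate performance: mean of per-example scores, clipped to [0, 1]."""
    return min(1.0, max(0.0, sum(scores) / len(scores)))


def worst_at_k(scores, k, trials=2000, seed=0):
    """Worst-case reliability: expected minimum score over random batches of k samples.

    Illustrative estimator only; requires len(scores) >= k.
    """
    rng = random.Random(seed)
    return sum(min(rng.sample(scores, k)) for _ in range(trials)) / trials
```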

Historical performance shows notable improvement: GPT-3.5 Turbo scored 16%, GPT-4o about 32%, and the recent o3 model achieved 60%. On the HealthBench-1k "hard" subset, even state-of-the-art models topped out at 32%, showing substantial headroom. Also, smaller models such as GPT-4.1 nano have closed performance gaps while yielding a 25-fold reduction in inference cost compared to larger models, underscoring the progress in cost-quality optimization.

The table below summarizes key model results (excerpted from the full HealthBench paper):

| Model | Overall Score (HealthBench-1k) | Relative Cost |
|---|---|---|
| GPT-3.5 Turbo | ~16% | Baseline |
| GPT-4o | ~32% | 25× |
| o3 | 60% (full), 32% (hard) | |
| GPT-4.1 nano | >32% (beats GPT-4o) | 1/25× |

4. Comparison to Other Health Benchmarks

HealthBench-1k distinguishes itself from prior work such as LongHealth and CHBench by employing open-ended, multi-turn conversational interactions rather than multiple-choice or single-turn open-response formats (Adams et al., 25 Jan 2024; Guo et al., 24 Sep 2024). The evaluation rubric is also richer and more clinically aligned, with explicit multi-axis criteria tailored to each interaction, whereas prior benchmarks typically score only accuracy or semantic similarity.

Furthermore, HealthBench includes dedicated "hard" and "consensus" subsets: HealthBench-1k (also referred to as "HealthBench Hard") contains 1,000 examples selected for their adversarial nature—i.e., they systematically expose weaknesses common to multiple top-performing models, not just statistical outliers.
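
One plausible way to operationalize that selection, assuming per-example scores are available for several frontier models, is sketched below; this is an illustration of the stated criterion, not the documented selection procedure.

```python
def select_hard_examples(scores_by_model, threshold=0.2):
    """Illustrative hard-subset filter: keep examples on which even the
    best-performing candidate model scores below `threshold`.

    `scores_by_model` maps model name -> list of per-example scores,
    with all lists aligned to the same example order.
    """
    n_examples = len(next(iter(scores_by_model.values())))
    return [
        i for i in range(n_examples)
        if max(scores[i] for scores in scores_by_model.values()) < threshold
    ]
```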

5. Reinforcement Learning and Rubric-Based Training

HealthBench-1k has facilitated methodologies that leverage structured rubrics as reward signals in reinforcement learning. The Rubrics as Rewards (RaR) framework (Gunjal et al., 23 Jul 2025) employs explicit (weighted checklist) and implicit (holistic rubric-guided) reward computations:

$$
r(x, \hat{y}) = \frac{\sum_{j=1}^{k} w_j \, c_j(x, \hat{y})}{\sum_{j=1}^{k} w_j}
$$

where $w_j$ denotes the weight of criterion $j$ and $c_j(x, \hat{y})$ indicates its satisfaction. On HealthBench-1k, RaR yields a 28% relative improvement over simple Likert-based reward aggregation. Rubric-based signals also outperform traditional reference-alignment and preference-based scores, and align well with human expert evaluation, offering scalable, interpretable, and efficient fine-tuning approaches for instruction-following models in subjective health domains.
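
A minimal sketch of the explicit, weighted-checklist reward under these definitions; the function name and the (weight, satisfied) input format are assumptions for illustration.

```python
def rar_explicit_reward(criteria):
    """Weighted-checklist reward: weighted fraction of satisfied rubric criteria.

    `criteria` is a list of (weight, satisfied) pairs, one per rubric item.
    """
    total_weight = sum(w for w, _ in criteria)
    if total_weight == 0:
        return 0.0
    return sum(w for w, satisfied in criteria if satisfied) / total_weight


# Criteria weighted 3, 2, 1 with only the first two satisfied -> 5/6
print(rar_explicit_reward([(3, True), (2, True), (1, False)]))  # ~0.83
```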

6. Limitations and Future Trajectory

Despite significant progress, results on HealthBench-1k suggest major unsolved challenges: even top models reach only about one-third of the possible score on adversarial health cases, and critical behavioral dimensions such as context awareness, risk minimization, and completeness remain stubbornly difficult. The open-ended nature of conversations, the weighting of both harmful and beneficial criteria, and the diversity of interaction contexts mean that HealthBench-1k remains an unsaturated benchmark—i.e., it continues to reveal real model limitations.

A plausible implication is that further gains will be driven by:

  • Improved rubric-guided RL training and curriculum-aware reward adjustment
  • Incorporation of domain-specific expert feedback into new rubric items
  • Systematic evaluation of worst-case and outlier performance for deployment safety
  • Deeper integration with physician consensus criteria to maximize clinical relevance and minimize hallucination risk

7. Access and Community Implications

The data and grading code for HealthBench, including the 1k "hard" subset, are openly released via OpenAI’s simple-evals repository. This enables broad participation in model evaluation and methodological innovation, supporting the shared goal of developing trustworthy, effective AI in health. By providing standardized, clinically grounded, and behaviorally granular evaluation infrastructure, HealthBench-1k sets a foundational standard for progress in medical LLM development and deployment.