HealthBench-1k: Benchmark for Health LLMs

Updated 25 July 2025
  • HealthBench-1k is a rigorously constructed benchmark designed for evaluating LLM safety and performance in high-stakes, multi-turn clinical conversations.
  • It leverages detailed, physician-authored rubrics assessing accuracy, completeness, context awareness, and communication quality across diverse medical scenarios.
  • Performance metrics highlight significant gaps in current LLMs, offering actionable insights for reinforcement learning and cost-quality optimization in health AI.

HealthBench-1k is a rigorously constructed benchmark for assessing the safety, accuracy, and behavioral competence of LLMs in open-ended, clinically impactful health conversations. It represents a 1,000-example subset of the larger HealthBench framework that specifically targets cases deemed challenging for current generative models in realistic medical and health settings. HealthBench-1k enables multi-dimensional evaluation across a broad spectrum of subdomains and use cases, emphasizing high-stakes, conversational reasoning where there are often no single correct answers.

1. Structure and Scope

HealthBench-1k consists of 1,000 multi-turn health conversations, each involving either a layperson or healthcare professional interacting with an LLM. The conversations are paired with detailed rubrics, totaling many thousands of evaluation criteria spanning a variety of clinical and public health contexts. Prominent categories include emergency triage, transforming clinical data, global health considerations, handling missing or ambiguous context, adapting communication to audience expertise, correctly summarizing and synthesizing patient records, and managing uncertainty.

Each example contains:

  • A mean of 2.6 conversational turns
  • An average length of about 667 characters per conversation
  • Between 2 and 48 rubric criteria, capturing multiple behavioral and clinical dimensions

The conversations and associated rubrics are derived from real-world and simulated health scenarios, intentionally covering cases that challenge the limits of current LLM capability, both in factual knowledge and reasoning under uncertainty.
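
To make the structure concrete, a single example can be pictured as a multi-turn conversation paired with its weighted rubric. The sketch below is illustrative only; the field names (prompt, rubrics, points, axis) and the sample content are assumptions for exposition, not the released data schema.

```python
# Illustrative shape of one HealthBench-1k example.
# Field names and sample content are hypothetical, for exposition only.
example = {
    "prompt": [  # multi-turn conversation, ending with the user's latest message
        {"role": "user",
         "content": ("My father is suddenly slurring his words and one side "
                     "of his face looks droopy. What should I do?")},
    ],
    "rubrics": [
        {"criterion": "Advises contacting emergency services immediately for suspected stroke.",
         "points": 10,    # criterion weights range from -10 to 10
         "axis": "accuracy"},
        {"criterion": "Suggests waiting to see whether the symptoms resolve on their own.",
         "points": -10,   # harmful behavior carries a negative weight
         "axis": "accuracy"},
        {"criterion": "Uses clear, non-technical language suited to a layperson.",
         "points": 3,
         "axis": "communication_quality"},
    ],
}
```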

2. Rubric-Based Evaluation Framework

A central feature of HealthBench-1k is its use of conversation-specific, structured rubrics. These rubrics were authored by a panel of 262 physicians representing 60 countries and encode 48,562 unique assessment criteria across the whole HealthBench dataset. Each criterion is assigned a weight (from –10 to 10), and judgments are rendered by automated LLM-based grading.
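
In practice, the grading step can be pictured as a yes/no question posed to a judge model for each criterion. The sketch below is a minimal illustration; the prompt wording and the generic `judge` callable are assumptions, not the actual HealthBench grader.

```python
# Hypothetical rubric-grading helper: an LLM judge decides whether one
# criterion is satisfied by the model's response. The prompt template and the
# `judge` callable (any text-in/text-out LLM wrapper) are illustrative only.
GRADER_TEMPLATE = """You are grading a health-related model response.

Conversation so far:
{conversation}

Model response:
{response}

Criterion: {criterion}

Answer "yes" if the response satisfies the criterion, otherwise "no"."""


def criterion_met(judge, conversation: str, response: str, criterion: str) -> bool:
    """Return True if the judge model says the criterion is satisfied."""
    prompt = GRADER_TEMPLATE.format(
        conversation=conversation, response=response, criterion=criterion
    )
    return judge(prompt).strip().lower().startswith("yes")
```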

Key behavioral axes explicitly labeled in the rubric structure include:

  • Accuracy
  • Completeness
  • Context Awareness
  • Communication Quality
  • Instruction Following

The individual per-example score is computed as:

$$
s_i = \frac{\sum_j \mathbf{1}\{r_{ij}\}\, p_{ij}}{\sum_j \max(0,\, p_{ij})}
$$

where $\mathbf{1}\{r_{ij}\}$ indicates satisfaction of criterion $j$ for example $i$ and $p_{ij}$ is its point value.
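
A minimal sketch of this scoring rule, assuming each criterion's satisfaction has already been judged (for example by the automated grader) and is represented as a (met, points) pair:

```python
def example_score(rubric_results):
    """Per-example score: earned points over the maximum attainable positive points.

    `rubric_results` is a list of (met: bool, points: float) pairs, one per
    rubric criterion. Triggered negative-weight criteria reduce the score.
    """
    earned = sum(points for met, points in rubric_results if met)
    possible = sum(max(0.0, points) for _, points in rubric_results)
    return earned / possible if possible > 0 else 0.0


# Two positive criteria met (+10, +5) and one harmful criterion triggered (-10):
print(example_score([(True, 10), (True, 5), (True, -10)]))  # ~0.33
```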

A subset of the rubrics, known as consensus criteria, comprises 34 items validated by physician agreement and is used in the HealthBench Consensus variant for cross-model and longitudinal comparison.

3. Performance Metrics and Findings

Aggregate and per-dimension model performance is quantified by normalizing per-example scores and computing means (clipped to [0, 1]), as well as by analyzing worst-case reliability through worst-at-k aggregation (i.e., the minimum score among a batch of k samples).
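
A small sketch of these aggregates, assuming a flat list of per-example scores; the worst-at-k estimator below (expected minimum over random batches of k) is one plausible reading of the aggregation, not necessarily the exact resampling scheme used.

```python
import random


def mean_score(scores):
    """Aggregate performance: mean of per-example scores, clipped to [0, 1]."""
    return min(1.0, max(0.0, sum(scores) / len(scores)))


def worst_at_k(scores, k, trials=2000, seed=0):
    """Worst-case reliability: expected minimum score over random batches of k samples.

    Illustrative estimator only; requires len(scores) >= k.
    """
    rng = random.Random(seed)
    return sum(min(rng.sample(scores, k)) for _ in range(trials)) / trials
```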

Historical performance shows notable improvement: GPT-3.5 Turbo scored 16%, GPT-4o about 32%, and the recent o3 model achieved 60%. On the HealthBench-1k "hard" subset, even state-of-the-art models topped out at 32%, showing substantial headroom. Also, smaller models such as GPT-4.1 nano have closed performance gaps while yielding a 25-fold reduction in inference cost compared to larger models, underscoring the progress in cost-quality optimization.

The table below summarizes key model results (excerpted from the full HealthBench paper):

| Model | Overall Score (HealthBench-1k) | Relative Cost |
|---|---|---|
| GPT-3.5 Turbo | ~16% | Baseline |
| GPT-4o | ~32% | 25× |
| o3 | 60% (full), 32% (hard) | |
| GPT-4.1 nano | >32% (beats GPT-4o) | 1/25× |

4. Comparison to Other Health Benchmarks

HealthBench-1k distinguishes itself from prior work such as LongHealth and CHBench by employing open-ended, multi-turn conversational interactions rather than multiple-choice or single-turn open-response formats (Adams et al., 25 Jan 2024; Guo et al., 24 Sep 2024). The evaluation rubric is also richer and more clinically aligned, with explicit multi-axis criteria tailored to each interaction, whereas prior benchmarks typically score only accuracy or semantic similarity.

Furthermore, HealthBench includes dedicated "hard" and "consensus" subsets: HealthBench-1k (also referred to as "HealthBench Hard") contains 1,000 examples selected for their adversarial nature—i.e., they systematically expose weaknesses common to multiple top-performing models, not just statistical outliers.
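
One plausible way to operationalize that selection, assuming per-example scores are available for several frontier models, is sketched below; this is an illustration of the stated criterion, not the documented selection procedure.

```python
def select_hard_examples(scores_by_model, threshold=0.2):
    """Illustrative hard-subset filter: keep examples on which even the
    best-performing candidate model scores below `threshold`.

    `scores_by_model` maps model name -> list of per-example scores,
    with all lists aligned to the same example order.
    """
    n_examples = len(next(iter(scores_by_model.values())))
    return [
        i for i in range(n_examples)
        if max(scores[i] for scores in scores_by_model.values()) < threshold
    ]
```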

5. Reinforcement Learning and Rubric-Based Training

HealthBench-1k has facilitated methodologies that leverage structured rubrics as reward signals in reinforcement learning. The Rubrics as Rewards (RaR) framework (Gunjal et al., 23 Jul 2025) employs explicit (weighted checklist) and implicit (holistic rubric-guided) reward computations:

$$
r(x, \hat{y}) = \frac{\sum_{j=1}^{k} w_j \, c_j(x, \hat{y})}{\sum_{j=1}^{k} w_j}
$$

where $w_j$ denotes the weight of criterion $j$ and $c_j(x, \hat{y})$ indicates its satisfaction. On HealthBench-1k, RaR yields a 28% relative improvement over simple Likert-based reward aggregation. Rubric-based signals also outperform traditional reference-alignment and preference-based scores, and align well with human expert evaluation, offering scalable, interpretable, and efficient fine-tuning approaches for instruction-following models in subjective health domains.
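
A minimal sketch of the explicit, weighted-checklist reward under these definitions; the function name and the (weight, satisfied) input format are assumptions for illustration.

```python
def rar_explicit_reward(criteria):
    """Weighted-checklist reward: weighted fraction of satisfied rubric criteria.

    `criteria` is a list of (weight, satisfied) pairs, one per rubric item.
    """
    total_weight = sum(w for w, _ in criteria)
    if total_weight == 0:
        return 0.0
    return sum(w for w, satisfied in criteria if satisfied) / total_weight


# Criteria weighted 3, 2, 1 with only the first two satisfied -> 5/6
print(rar_explicit_reward([(3, True), (2, True), (1, False)]))  # ~0.83
```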

6. Limitations and Future Trajectory

Despite significant progress, results on HealthBench-1k suggest major unsolved challenges: even top models reach only about one-third of the possible score on adversarial health cases, and critical behavioral dimensions such as context awareness, risk minimization, and completeness remain stubbornly difficult. The open-ended nature of conversations, the weighting of both harmful and beneficial criteria, and the diversity of interaction contexts mean that HealthBench-1k remains an unsaturated benchmark—i.e., it continues to reveal real model limitations.

A plausible implication is that further gains will be driven by:

  • Improved rubric-guided RL training and curriculum-aware reward adjustment
  • Incorporation of domain-specific expert feedback into new rubric items
  • Systematic evaluation of worst-case and outlier performance for deployment safety
  • Deeper integration with physician consensus criteria to maximize clinical relevance and minimize hallucination risk

7. Access and Community Implications

The data and grading code for HealthBench, including the 1k "hard" subset, are openly released via OpenAI’s simple-evals repository. This enables broad participation in model evaluation and methodological innovation, supporting the shared goal of developing trustworthy, effective AI in health. By providing standardized, clinically grounded, and behaviorally granular evaluation infrastructure, HealthBench-1k sets a foundational standard for progress in medical LLM development and deployment.