- The paper introduces HealthBench, an open-source benchmark that uses 5,000 realistic multi-turn healthcare conversations and physician-designed rubrics to assess LLM performance and safety.
- It scores responses against 48,562 physician-written rubric criteria, organized along seven themes (e.g., emergency referrals, context-seeking, expertise-tailored communication) and five behavioral axes for detailed analysis.
- Evaluation results show substantial recent progress: OpenAI's o3 scores 60% versus 32% for GPT-4o (Aug 2024), error rates on critical safety behaviors have fallen sharply, and smaller models such as GPT-4.1 nano deliver large cost-efficiency gains over earlier iterations.
HealthBench (2505.08775) is an open-source benchmark designed to evaluate the performance and safety of LLMs specifically in healthcare settings. The benchmark aims to address limitations in previous medical evaluations, which often relied on narrow multiple-choice questions, lacked validation against expert medical opinions, and were becoming saturated by state-of-the-art models. HealthBench strives to be meaningful (reflecting real-world impact), trustworthy (validated by physician judgment), and unsaturated (providing headroom for future model improvement).
The core of HealthBench consists of 5,000 realistic multi-turn conversations between an LLM and a user, who could be an individual seeking health information or a healthcare professional. The task for the LLM is to generate a response to the final user message in the conversation. Unlike evaluations relying on fixed answers, HealthBench uses a rubric evaluation system. For each conversation, a unique rubric is created by physicians, containing specific criteria that describe attributes a model response should be rewarded or penalized for. These criteria encompass specific factual details, clarity of communication, safety considerations, and adherence to instructions. There are 48,562 unique rubric criteria across the benchmark, each with a point value between -10 and 10.
To score a model response for a given conversation, a model-based grader (which the authors validate against physician judgment) assesses whether the response meets each criterion in the conversation-specific rubric. The score for the example is calculated by summing the points for all criteria met and dividing by the sum of positive points available in the rubric. This per-example score can be negative if penalties outweigh positive points. The overall HealthBench score for a model is the mean of its per-example scores, clipped to the range [0, 1].
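A minimal sketch of this scoring procedure, assuming a simple in-memory representation of criteria and grader verdicts; the class and function names below are illustrative and not the benchmark's actual API:

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str   # physician-written attribute to reward or penalize
    points: float      # point value in [-10, 10]; negative values are penalties

def score_example(criteria: list[RubricCriterion], met: list[bool]) -> float:
    """Per-example score: points earned divided by the positive points available.

    Can be negative when penalties outweigh the positive points earned.
    """
    earned = sum(c.points for c, is_met in zip(criteria, met) if is_met)
    max_positive = sum(c.points for c in criteria if c.points > 0)
    return earned / max_positive

def healthbench_score(per_example_scores: list[float]) -> float:
    """Overall score: mean of per-example scores, clipped to [0, 1]."""
    mean = sum(per_example_scores) / len(per_example_scores)
    return min(max(mean, 0.0), 1.0)

# Illustrative rubric: two positive criteria and one penalty.
rubric = [
    RubricCriterion("Recommends emergency care for red-flag symptoms", 10),
    RubricCriterion("Uses language appropriate for a layperson", 5),
    RubricCriterion("States a specific drug dose without sufficient context", -5),
]
print(score_example(rubric, met=[True, False, True]))  # (10 - 5) / 15 ≈ 0.33
```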
The benchmark conversations and criteria are organized along seven themes and five axes to provide granular performance analysis (see the aggregation sketch after the two lists below). The themes represent different areas of real-world health interactions:
- Emergency referrals: Evaluating the model's ability to recognize potential emergencies and appropriately guide users towards urgent care.
- Context-seeking: Assessing if the model identifies missing crucial information and asks appropriate clarifying questions, essential for real-world interactions where users may not provide full details upfront.
- Global health: Measuring the model's capacity to adapt responses to varied healthcare contexts, considering resource availability, local norms, and regional disease patterns.
- Health data tasks: Evaluating the model's accuracy and safety when performing structured tasks related to health data, such as summarizing clinical notes or extracting information.
- Expertise-tailored communication: Determining if the model can adapt its language, technical depth, and tone based on whether the user is a healthcare professional or a layperson.
- Responding under uncertainty: Examining the model's ability to recognize when medical knowledge or user input is uncertain and respond appropriately, avoiding overconfidence.
- Response depth: Assessing if the model adjusts the level of detail in its response to match the user's needs and the complexity of the query.
The five axes define the behavioral dimensions measured by the rubric criteria:
- Accuracy: Whether the information provided is factually correct and aligned with medical consensus, including acknowledging uncertainty when evidence is weak.
- Completeness: Whether the response includes all necessary information to be safe and helpful.
- Communication quality: Whether the response is clear, well-structured, concise, and uses appropriate language for the user.
- Context awareness: Whether the model uses relevant contextual cues (user role, location, resources) and seeks clarification when needed.
- Instruction following: Whether the model adheres to specific user instructions while maintaining safety priorities.
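One plausible way to compute theme- and axis-level breakdowns from graded examples is sketched below. The record fields (`theme`, `score`, `criteria`, `axis`, `points`, `met`) are hypothetical, and the paper's exact aggregation may differ:

```python
from collections import defaultdict

def theme_scores(examples: list[dict]) -> dict[str, float]:
    """Average the per-example scores within each theme."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for ex in examples:
        buckets[ex["theme"]].append(ex["score"])
    return {theme: sum(s) / len(s) for theme, s in buckets.items()}

def axis_score(examples: list[dict], axis: str) -> float:
    """Re-score each example using only criteria tagged with `axis`,
    then average across examples that have at least one such criterion."""
    per_example = []
    for ex in examples:
        tagged = [c for c in ex["criteria"] if c["axis"] == axis]
        positive = sum(c["points"] for c in tagged if c["points"] > 0)
        if positive > 0:
            earned = sum(c["points"] for c in tagged if c["met"])
            per_example.append(earned / positive)
    return sum(per_example) / len(per_example)
```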
HealthBench was developed with input from 262 physicians from diverse backgrounds (60 countries, 26 specialties), who were involved in defining the situations, writing rubrics, and annotating data. Conversations were generated synthetically based on physician-defined realistic scenarios, supplemented with data from physician red teaming and modified health search queries. The generated conversations were filtered for realism, consistency, completeness of messages, and relevance to physical health using model-based classifiers. Physicians then wrote the example-specific rubrics.
HealthBench includes two variations:
- HealthBench Consensus: A subset of examples and criteria focusing on 34 critical, pre-defined criteria validated by multiple physicians (e.g., clear emergency referral in an emergent situation). This subset offers higher physician validation for key behaviors.
- HealthBench Hard: A subset of 1,000 examples identified as particularly challenging for current frontier models, designed to provide a difficult target for future development.
Evaluation results presented in the paper show significant progress in LLM performance on HealthBench over time, especially among recent models. For example, OpenAI's o3 model achieves a score of 60%, a substantial improvement over GPT-4o (Aug 2024) at 32% and GPT-3.5 Turbo at 16%. Performance varies by theme, with Emergency referrals and Expertise-tailored communication generally scoring higher than Context-seeking and Health data tasks. By axis, models tend to perform better on Accuracy, Communication quality, and Instruction following than on Completeness and Context awareness.
The benchmark also reveals improvements in performance-cost efficiency: models such as GPT-4.1 nano outperform older, more expensive models like GPT-4o (Aug 2024) while being roughly 25 times cheaper. Reliability, measured by worst-at-k performance (the score of the worst response among k sampled responses), has also improved, though substantial headroom remains, indicating models are not yet fully reliable in high-stakes situations. Analysis of the score distribution suggests HealthBench examples span a wide range of difficulty, with few examples being either completely unsolvable or already solved by all models. While response length correlates somewhat with score, a win-rate analysis controlling for length suggests that the performance gains are not merely due to verbosity.
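A sketch of the worst-at-k reliability metric as described above: for each example, draw k of its sampled response scores, keep the worst, and average across examples and random draws. The helper below and its exact averaging are illustrative assumptions:

```python
import random
from statistics import mean

def worst_at_k(sample_scores: list[list[float]], k: int,
               n_trials: int = 1000, seed: int = 0) -> float:
    """Estimate worst-at-k from per-example lists of sampled-response scores.

    `sample_scores[i]` holds the scores of several independently sampled
    responses to example i; each trial picks k of them and keeps the minimum.
    """
    rng = random.Random(seed)
    trial_means = []
    for _ in range(n_trials):
        worst = [min(rng.sample(scores, k)) for scores in sample_scores]
        trial_means.append(mean(worst))
    return mean(trial_means)

# Example: 3 examples, 5 sampled responses each.
scores = [[0.60, 0.70, 0.50, 0.65, 0.55],
          [0.30, 0.40, 0.35, 0.50, 0.45],
          [0.90, 0.85, 0.80, 0.95, 0.88]]
print(worst_at_k(scores, k=3))
```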
HealthBench Consensus results show a significant reduction in error rates on critical behaviors across models, falling more than fourfold from GPT-3.5 Turbo to GPT-4.1. However, challenges remain in areas like context-seeking and appropriate response depth. HealthBench Hard results confirm its difficulty, with the highest-scoring model achieving only 32%, marking it as a target for future model development.
To establish a baseline, the authors evaluated responses written by physicians. Physicians were able to improve responses generated by older models (Sep 2024) but not those from newer models (Apr 2025). Physician-written responses without AI assistance scored lower than those produced by models, a finding the authors interpret with caution, noting that writing chatbot responses is not a typical physician task and response length may have played a role.
A key aspect of HealthBench is the trustworthiness of its model-based grading. Meta-evaluation, which compares the GPT-4.1 grader's assessments against physician consensus on the 34 critical criteria using macro F1 score, shows that the grader's agreement is comparable to or exceeds that of the average physician, placing it in the upper percentiles of individual physician performance. Overall HealthBench scores also show low variability across repeated runs.
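The macro F1 comparison can be illustrated with a small self-contained sketch: for a given consensus criterion, the grader's binary verdicts are compared against the physician-consensus labels, computing F1 with each class treated as positive in turn and averaging. The labels below are invented for illustration:

```python
def macro_f1(y_true: list[bool], y_pred: list[bool]) -> float:
    """Macro F1 over the two classes (criterion met / not met)."""
    def f1(positive: bool) -> float:
        tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
        fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
        fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)
    return (f1(True) + f1(False)) / 2

# Grader verdicts vs. physician-consensus labels for one criterion;
# one such number per criterion would then be aggregated across the 34.
consensus = [True, True, False, True, False, False]
grader    = [True, False, False, True, False, True]
print(macro_f1(consensus, grader))
```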
The authors acknowledge limitations, including inherent physician variability in judgment and the fact that example-specific criteria written by individual physicians were not double-validated (unlike consensus criteria). They emphasize that HealthBench provides a broad evaluation of single responses within multi-turn conversations, but specific workflow evaluations and measurement of health outcomes remain crucial for real-world application.
HealthBench is released open-source at https://github.com/openai/simple-evals, including code for running evaluations, meta-evaluations, and analyses. The authors request that examples not be revealed online to prevent leakage into training data. The benchmark aims to shape shared standards, provide evidence of model capabilities to the healthcare community, and demonstrate recent progress, ultimately accelerating the development of AI models that benefit human health.