HealthBench Hard: Challenging Clinical LLMs

Updated 23 February 2026

HealthBench Hard is a subset of 1,000 difficult, multi-turn clinical dialogues designed to reveal LLM limitations in decision-making under uncertainty.
It uses a detailed rubric system with criteria like accuracy and context awareness, ranking examples based on low normalized scores from top-performing models.
Advances in RL methods and dynamic verifier systems, such as MuSeR and agentic models, are key to addressing safety, hallucination, and context gaps in clinical LLM performance.

HealthBench Hard is a rigorously constructed evaluation subset within the HealthBench suite, designed to stress-test LLMs on the most difficult, open-ended, clinically realistic healthcare conversations. Composed of 1,000 challenging dialogues selected for their low average performance across top frontier models, it exposes the critical failure modes and unsolved edge cases that limit current LLM reliability in medical domains. Unlike conventional benchmarks based on simple queries or static QA, HealthBench Hard systematically targets decision-making under uncertainty, context fragmentation, and high-stakes safety scenarios, and is pivotal for benchmarking the trajectory and limitations of LLMs intended for clinical deployment (Arora et al., 13 May 2025, Team et al., 2 Sep 2025, Zhou et al., 13 Nov 2025, Ravichandran et al., 29 Aug 2025).

1. Construction and Scope

HealthBench Hard originates from the full 5,000-example HealthBench evaluation, which itself spans open-ended, multi-turn dialogues annotated with more than 48,000 physician-written rubric criteria (Arora et al., 13 May 2025, Ravichandran et al., 29 Aug 2025). The Hard subset is formed by ranking all examples according to their mean normalized score across five contemporaneous high-performing models (o3, Grok 3, Gemini 2.5 Pro, Claude 3.7 Sonnet, Llama 4 Maverick) and selecting the 1,000 lowest-scoring items, excluding those trivially unsolvable for all. This selection strategy retains maximal rubric diversity while focusing on the most challenging queries, as evidenced by model underperformance.
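The ranking step described above can be sketched in a few lines. This is an illustrative sketch, not the benchmark's released code: the function name `select_hard_subset`, the nested-dict input layout, and the tie-breaking by example id are all assumptions, and the exclusion of trivially unsolvable items is omitted because the criterion for it is not specified here.

```python
from statistics import mean

def select_hard_subset(scores_by_model, k=1000):
    """Return ids of the k examples with the lowest mean score across models.

    scores_by_model: dict of model name -> dict of example id -> normalized score
    (hypothetical layout, chosen for illustration).
    """
    # Rank only examples that every reference model was scored on.
    common_ids = set.intersection(*(set(s) for s in scores_by_model.values()))
    mean_score = {
        ex_id: mean(per_model[ex_id] for per_model in scores_by_model.values())
        for ex_id in common_ids
    }
    # Ascending by mean score; ties broken by id so the result is deterministic.
    ranked = sorted(common_ids, key=lambda ex_id: (mean_score[ex_id], ex_id))
    return ranked[:k]
```

With the five reference models as the keys of `scores_by_model` and `k=1000`, the function returns the lowest-scoring 1,000 example ids in ascending order of mean score.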

Examples in HealthBench Hard maintain the diversity of original HealthBench themes: emergency referrals, context-seeking, global health, health data tasks, expertise-tailored communication, responding under uncertainty, and response depth. Both single-turn and multi-turn (mean 2.6 turns, up to 19) dialogues are present, and the subset includes queries from both laypersons and clinicians, often in resource-constrained or ambiguous contexts (Arora et al., 13 May 2025, Zhou et al., 13 Nov 2025).

2. Rubric System and Scoring Methodology

Each Hard example is annotated with a detailed, example-specific rubric of about 11.5 items on average (occasionally several dozen), each assigned a point value in [–10, +10]. Criteria cover five behavioral axes: accuracy, completeness, context awareness, communication quality, and instruction following (Arora et al., 13 May 2025, Ravichandran et al., 29 Aug 2025). The scoring protocol is as follows. Let $p_{ij}$ be the point value of criterion $j$ in example $i$, and let $I[\cdot]$ be the indicator that equals 1 when the response meets the criterion and 0 otherwise. The example’s normalized score before clipping is

$s_i = \frac{\sum_j I[\text{response meets criterion } j] \cdot p_{ij}}{\sum_j \max(0, p_{ij})}$

Scores are clipped to $[0,1]$, and the overall HealthBench Hard score ($S_\text{Hard}$) is the mean of $s_i$ over all 1,000 examples, again clipped to $[0,1]$. Meta-evaluation comparing model-graded with physician-graded assessments finds macro-F1 close to the human agreement level, which supports the reliability of annotation and scoring (Arora et al., 13 May 2025, Ravichandran et al., 29 Aug 2025).
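A minimal sketch of this scoring rule follows, assuming each example is given as a list of point values together with a parallel list of booleans marking which criteria the response met. The function names and the input layout are illustrative assumptions; the clipping order mirrors the description above (per-example clip, then mean, then clip again).

```python
def example_score(points, met):
    """Unclipped normalized score s_i for one example.

    points: rubric point values p_ij (positive or negative)
    met:    booleans, True where the response meets criterion j
    """
    max_points = sum(p for p in points if p > 0)
    if max_points == 0:
        raise ValueError("rubric has no positive-point criteria")
    earned = sum(p for p, ok in zip(points, met) if ok)
    return earned / max_points

def clip01(x):
    """Clip a value to the interval [0, 1]."""
    return min(1.0, max(0.0, x))

def hard_score(examples):
    """Mean of clipped per-example scores, clipped again to [0, 1].

    examples: list of (points, met) pairs.
    """
    per_example = [clip01(example_score(p, m)) for p, m in examples]
    return clip01(sum(per_example) / len(per_example))
```

For a rubric with point values 5, 3, and -4 where the response meets the first and third criteria, the earned total is 1 out of 8 achievable points, so $s_i = 0.125$. A response that triggers more negative than positive points gets a negative raw score, which clips to 0.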

The same scoring is also reported at a finer grain on task-specific leaderboards, for example by aggregating per behavioral axis (Ravichandran et al., 29 Aug 2025), which makes failure modes visible axis by axis. Negative point values let the rubric penalize harmful or incomplete answers, so the score better reflects safety constraints.
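One plausible way to compute axis-level scores is to apply the same normalization to the criteria of each axis separately and then average over examples. The sketch below assumes each criterion carries an axis tag and that an example contributes to an axis only if it has positive-point criteria on that axis; both the data layout and that rule are assumptions, not a description of any published leaderboard code.

```python
from collections import defaultdict

def axis_scores(examples):
    """Per-axis mean of per-example normalized scores, clipped to [0, 1].

    examples: list of rubrics; each rubric is a list of
              (axis, points, met) tuples (hypothetical layout).
    """
    per_axis = defaultdict(list)
    for rubric in examples:
        earned = defaultdict(float)
        possible = defaultdict(float)
        for axis, points, met in rubric:
            if met:
                earned[axis] += points      # negative points reduce the total
            if points > 0:
                possible[axis] += points    # only positive points count as achievable
        # Axes with no positive-point criteria in this example are skipped.
        for axis, max_points in possible.items():
            per_axis[axis].append(earned[axis] / max_points)
    return {
        axis: min(1.0, max(0.0, sum(vals) / len(vals)))
        for axis, vals in per_axis.items()
    }
```

Reporting scores this way makes it possible to see, for instance, that a model loses most of its points on completeness rather than on accuracy.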

3. Task Characteristics and Failure Modes

HealthBench Hard is characterized by high clinical and linguistic complexity, intentional omission of key details, diversity in user roles (patient, caregiver, physician), and strong regional or situational dependencies. Scenarios frequently involve (Arora et al., 13 May 2025, Zhou et al., 13 Nov 2025, Ravichandran et al., 29 Aug 2025):

Ambiguous or incomplete information, necessitating active context-seeking.
Requirements to modulate terminology or advice for audience (lay vs. professional).
Demands to give safe guidance under uncertainty or risk of adverse outcomes.
Edge-case tasks that probe for overconfidence, hallucinated claims, and locale-incongruent advice.

The observed dominant failure modes for frontier LLMs on HealthBench Hard include: omission of essential steps or safety checks (completeness failures), failure to elicit or incorporate missing context (context awareness), overconfident or non-hedging conclusions, and misalignment with regional or scenario-specific constraints. For instance, even models with state-of-the-art general performance may propose unsafe recommendations or fail to clarify ambiguous elements in multi-step problem setups (Arora et al., 13 May 2025, Zhou et al., 13 Nov 2025).

4. Model Performance and Comparative Benchmarks

Historically, HealthBench Hard has marked progress among open-source and proprietary LLMs. Early frontier models (GPT-4o, GPT-3.5 Turbo) scored as low as 0.15 and 0.06, respectively; by contrast, o3 reached 0.32 and GPT-5 series models up to 0.46 (Arora et al., 13 May 2025, Ravichandran et al., 29 Aug 2025). Specialized model development, such as Baichuan-M2 and Qwen3-32B with Multifaceted Self-Refinement (MuSeR), produced open-source results of 34.7 and 43.1 on a 0–100 scale (0.347 and 0.431), respectively (Team et al., 2 Sep 2025, Zhou et al., 13 Nov 2025). The Baichuan-M3-235B model further advanced the state of the art to 44.4 (0.444), surpassing GPT-5.2-High and AntAngelMed (Team et al., 6 Feb 2026). Crucially, agentic and RAG-based assistants such as DR.INFO demonstrated substantial gains (0.51), exceeding all published frontier and open-source scores (Ravichandran et al., 29 Aug 2025).

Sample results reported for models evaluated on Hard are listed below, converted to a common 0–1 scale (some sources report the same score on a 0–100 scale):

Model	Score on Hard (0–1 scale)
DR.INFO	0.51
GPT-5 (closed)	0.46 (reported as 46.2 on a 0–100 scale)
Baichuan-M3-235B	0.444
Qwen3-32B + MuSeR	0.431
Baichuan-M2-32B	0.347
o3	0.32
GPT-4.1	approx. 0.26–0.31 (across studies)
GPT-3.5 Turbo	0.06

Axis-level breakdowns highlight persistent gaps even for leading models: accuracy and communication typically lead (e.g., up to 0.56 and 0.65 for DR.INFO), while completeness and context awareness lag, with scores below 0.45 and 0.40, respectively (Ravichandran et al., 29 Aug 2025).

5. Methodological Advances Targeting Hard Cases

Advances in training and RL methods have directly targeted HealthBench Hard limitations. The Baichuan-M2 and -M3 models implemented dynamic verifier systems that combine patient simulation with socio-cultural profiling, clinical rubric generation, and segmented RL pipelines, enabling models to proactively elicit missing information and to plan over multi-turn dialogues (Team et al., 2 Sep 2025, Team et al., 6 Feb 2026). Fact-aware RL objectives, such as hallucination-suppressed reward and dynamic gating coefficients, reduce the prevalence of unsupported or overconfident claims (Team et al., 6 Feb 2026).

The MuSeR method explicitly models decision-making (seeking missing context), communication (audience adaptation), and safety (risk identification), conducting self-evaluation and response refinement along these facets; this produces substantial performance uplifts, especially in context-awareness and safety (Zhou et al., 13 Nov 2025).
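The facet-wise self-evaluation and refinement idea can be illustrated with a generic critique-then-revise loop. This is not the MuSeR implementation: the function `refine`, the facet labels, and the `critique` and `revise` callables are hypothetical stand-ins for model calls, and the single pass per facet is an assumption made to keep the sketch short.

```python
FACETS = ("decision-making", "communication", "safety")

def refine(question, draft, critique, revise, facets=FACETS):
    """Run one critique-then-revise pass per facet and return the final response.

    critique(question, response, facet) -> feedback string, or None if no issue
    revise(question, response, facet, feedback) -> revised response string
    Both callables are hypothetical placeholders for model calls.
    """
    response = draft
    for facet in facets:
        feedback = critique(question, response, facet)
        if feedback is not None:
            # Revise only when the self-evaluation flags a problem on this facet.
            response = revise(question, response, facet, feedback)
    return response
```

In this framing, a critique on the decision-making facet might flag that the draft never asked for a missing detail such as the patient's age, and the revision would add that clarifying question.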

Agentic, retrieval-augmented generation systems (e.g., DR.INFO) further leverage external evidence to satisfy accuracy and completeness requirements in multi-document or ambiguous scenarios (Ravichandran et al., 29 Aug 2025).

6. HealthBench Hard in Safety and Refusal Evaluation

HealthBench Hard also intersects with safety calibration benchmarks such as Health-ORSC-Bench, which uses a “Hard-1K” partition to measure over-refusal and safe-completion rates on prompts that are benign but phrased close to a safety boundary (Zhang et al., 25 Jan 2026). These probes simulate cases where high ambiguity or misleading wording could trigger unnecessary refusals. Empirical results demonstrate a clear safety–helpfulness trade-off: frontier models tend toward “safety-pessimism,” erring on the side of refusal (e.g., an over-refusal rate (ORR) above 66% for the GPT-5 family), whereas open-weight or domain-specialized models may achieve lower over-refusal rates but remain deficient in toxic query rejection and nuanced safe completion (Zhang et al., 25 Jan 2026).
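As a rough illustration of the over-refusal metric, the rate can be computed as the fraction of benign prompts that a model refuses. The sketch assumes refusal labels are already available as booleans; how a response is judged to be a refusal is outside its scope, and the function name is hypothetical.

```python
def over_refusal_rate(refused_flags):
    """Fraction of benign prompts that were refused.

    refused_flags: iterable of booleans, one per benign prompt,
                   True where the model refused to answer.
    """
    flags = list(refused_flags)
    if not flags:
        raise ValueError("need at least one benign prompt")
    return sum(flags) / len(flags)
```

A model that refuses three of four benign prompts has a rate of 0.75 under this definition, which would fall in the "safety-pessimistic" regime described above.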

7. Implications and Future Directions

HealthBench Hard exposes entrenched limitations in LLM decision-making, most notably in completeness, context awareness, and calibration under uncertainty. It has become a reference “stress test” for claims of clinical readiness; regulatory and institutional policies now recommend evidence of robust performance on Hard cases before deployment in sensitive clinical roles (Arora et al., 13 May 2025). Ongoing research prioritizes training curricula that target these unsolved instances, including dynamic rubric evolution, context-aware chains-of-thought, and refined RLHF with adversarial telehealth scenarios (Team et al., 2 Sep 2025, Zhou et al., 13 Nov 2025, Team et al., 6 Feb 2026).

A persistent implication is that hybrid human–AI workflows, with clinician oversight on Hard-classified queries, are prudent until future models can close the performance gap in completeness and safety. Development of agentic reasoning, robust context-tracking, and heuristic fallback strategies may be key to achieving comprehensive, real-world-safe medical LLMs.