
VivaBench: Clinical Reasoning Benchmark

Updated 4 July 2026
  • VivaBench is a benchmark that assesses multi-turn, hypothesis-driven clinical reasoning rather than simple medical fact recall.
  • It simulates a sequential diagnostic process where LLM agents actively gather information, revise hypotheses, and calibrate confidence based on clinical data.
  • The benchmark evaluates performance via precision, recall, and confidence metrics, highlighting the gap between full-information synthesis and interactive reasoning.

VivaBench is a multi-turn, viva voce–style benchmark for evaluating whether LLM agents can carry out hypothesis-driven clinical reasoning under uncertainty rather than merely recall medical facts. It operationalizes diagnosis as an iterative process in which the agent starts from limited, ambiguous information, actively probes for history and physical examination findings, orders investigations, and then produces provisional and final diagnoses with explicit confidence values. In this formulation, conversational medical AI is assessed not only on endpoint correctness but also on sequential information gathering, hypothesis revision, and calibration, with the benchmark positioned as an open-source, standardized environment for medical decision-support evaluation and for broader research on agentic AI (Chiu et al., 11 Oct 2025).

1. Clinical reasoning target and benchmark rationale

VivaBench was introduced to address a specific limitation of prevailing medical LLM evaluation: most widely used benchmarks, including MedQA, PubMedQA, and MultiMedQA, assess knowledge recall from fully specified, single-turn inputs. These settings omit the central clinical task of forming and refining hypotheses under missing information, uncertainty, and competing explanations. VivaBench instead adopts the viva voce format, where information is gradually revealed only in response to targeted queries, forcing the model to decide what to ask, what to test, and when to commit diagnostically (Chiu et al., 11 Oct 2025).

The benchmark frames clinical decision-making as a partially observable process. Physicians are described as approximately updating $P(\text{specific diagnosis} \mid \text{generic findings})$ as evidence accrues through history, physical examination, imaging, and laboratory data. This emphasis matters because high single-turn accuracy can conceal metacognitive deficits, including overconfidence and failure to recognize knowledge gaps. A central implication of VivaBench is that strong performance on fully specified diagnostic synthesis does not guarantee competent sequential reasoning in underspecified clinical encounters.
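The belief update described above can be illustrated with a minimal Bayesian sketch. The candidate diagnoses, priors, and likelihoods below are invented for illustration and are not taken from the paper, and `update_posterior` is a hypothetical helper rather than part of the benchmark.

```python
def update_posterior(prior, likelihood):
    """Apply Bayes' rule: posterior is proportional to prior * P(finding | dx)."""
    unnormalized = {dx: prior[dx] * likelihood[dx] for dx in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("finding is impossible under every hypothesis")
    return {dx: p / total for dx, p in unnormalized.items()}


# Illustrative prior over three candidate diagnoses (assumed numbers).
prior = {"pancreatitis": 0.3, "peptic ulcer": 0.5, "cholecystitis": 0.2}

# Assumed P(elevated lipase | diagnosis) for each candidate.
lipase_high = {"pancreatitis": 0.9, "peptic ulcer": 0.1, "cholecystitis": 0.2}

posterior = update_posterior(prior, lipase_high)
leading = max(posterior, key=posterior.get)
```

With these assumed numbers, one discriminating finding moves the leading hypothesis from peptic ulcer to pancreatitis, which is the kind of revision the benchmark is meant to observe.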

The viva format also changes the object of evaluation. Instead of measuring only whether the final diagnosis matches the ground truth, the benchmark inspects the trajectory by which a diagnosis is reached. This includes the relevance of bedside review, the selectivity of investigations, the timing of provisional commitment, and the extent to which new evidence induces revision rather than confirmation of initial beliefs.

2. Corpus construction, curation, and scope

Cases were drawn from publicly available repositories: PubMed case reports, MedQA, and sample scenarios from physician colleges in Australia and the United Kingdom. Automated NLP and LLM screening prioritized common presentations with clear diagnostic pathways, multifaceted reasoning, and high educational value, after which human clinicians reviewed and corrected the structured cases. The generation pipeline produced 1,952 structured cases, and after clinician validation the curated evaluation set consists of 990 cases distributed across nine specialty groups (Chiu et al., 11 Oct 2025).

The specialty composition is reported as follows: Endocrine/Reproductive (150), Infectious Disease/Immunology (150), Cardiovascular/Metabolic (148), Gastrointestinal (147), Neurological/Psychiatric (136), Hematology/Oncology/Other (112), Pediatric (69), Respiratory (51), and Musculoskeletal/Pain (26). Each case includes demographics such as age, gender, and ethnicity and is structured to require multi-step reasoning. Aggregate demographic distributions are not reported.

A recurrent point of confusion concerns dataset size. The paper notes an earlier selection of 1,152 PubMed cases prior to consolidation; the figure 1,762 does not appear among its reported case counts. The released resources are the larger 1,952-case structured pool and the curated 990-case evaluation set. The intended use is evaluation only: no train/dev/test splits are defined.

This design choice signals that VivaBench is not primarily a dataset for supervised optimization. It is instead a measurement instrument intended to expose sequential reasoning behavior under controlled interactive conditions.

3. Case representation and interactive protocol

Each clinical case $C$ is represented through five components: History $H$, Physical Examination $P$, Imaging $I$, Laboratory investigations $L$, and a ground-truth Diagnosis set $D$ with accepted differentials $D'$. These items are mapped to SNOMED-CT for $H$ and $P$, LOINC for $I$ and $L$, and ICD-10 for $D$, enabling deterministic matching between free-text agent queries and the structured case state (Chiu et al., 11 Oct 2025).
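One way to picture this case structure is as a small typed container. The sketch below is an assumption about layout, not the released schema: the class names, field names, and the `reveal` helper are hypothetical, and the SNOMED-CT and LOINC codes are placeholders rather than real codes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    """One coded item that can be revealed to the agent."""
    code: str    # terminology code; placeholder values are used below
    value: str   # natural-language result


@dataclass
class ClinicalCase:
    history: dict = field(default_factory=dict)        # H, coded with SNOMED-CT
    physical: dict = field(default_factory=dict)       # P, coded with SNOMED-CT
    imaging: dict = field(default_factory=dict)        # I, coded with LOINC
    labs: dict = field(default_factory=dict)           # L, coded with LOINC
    diagnoses: list = field(default_factory=list)      # D, ICD-10 codes
    differentials: list = field(default_factory=list)  # D', ICD-10 codes

    def reveal(self, section: str, key: str) -> Optional[Finding]:
        """Return the finding stored under `key`, or None if it is absent."""
        return getattr(self, section).get(key)


case = ClinicalCase(
    history={"epigastric pain": Finding("SNOMED-PLACEHOLDER-1",
                                        "Severe epigastric pain radiating to the back")},
    labs={"lipase": Finding("LOINC-PLACEHOLDER-1", "Markedly elevated")},
    diagnoses=["K85.9"],  # ICD-10: acute pancreatitis, unspecified
)
```

Keying each section by a normalized item name is what makes lookups deterministic once a free-text query has been mapped to a key.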

The action space contains four information actions and two diagnosis actions. The information actions are query $H$, perform $P$, order $I$, and order $L$. The diagnosis actions are a provisional diagnosis after Review and a final diagnosis after Investigation. The protocol imposes four workflow constraints:

  1. First gather bedside information through $H$ and $P$.
  2. Submit a provisional diagnosis before ordering investigations.
  3. Once investigations are ordered, no further history or physical-examination queries are allowed.
  4. Execute one action per turn and provide a final diagnosis when sufficient information has been obtained.
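The workflow constraints above can be sketched as a small state machine. This is an illustrative reading, not the benchmark's implementation: the action names, `Phase`, `Workflow`, and `ProtocolError` are hypothetical, and the sketch assumes that bedside queries close as soon as the provisional diagnosis is submitted.

```python
from enum import Enum


class Phase(Enum):
    REVIEW = "review"
    INVESTIGATION = "investigation"
    DONE = "done"


class ProtocolError(Exception):
    """Raised when an action violates the two-phase workflow."""


REVIEW_ACTIONS = {"query_history", "perform_exam"}
INVESTIGATION_ACTIONS = {"order_imaging", "order_lab"}


class Workflow:
    def __init__(self):
        self.phase = Phase.REVIEW

    def step(self, action: str) -> None:
        """Validate one action per turn and advance the phase if needed."""
        if self.phase is Phase.DONE:
            raise ProtocolError("case already closed")
        if action in REVIEW_ACTIONS:
            if self.phase is not Phase.REVIEW:
                raise ProtocolError("bedside queries are closed")
        elif action == "provisional_diagnosis":
            if self.phase is not Phase.REVIEW:
                raise ProtocolError("provisional diagnosis already given")
            self.phase = Phase.INVESTIGATION
        elif action in INVESTIGATION_ACTIONS:
            if self.phase is not Phase.INVESTIGATION:
                raise ProtocolError("provisional diagnosis required first")
        elif action == "final_diagnosis":
            if self.phase is not Phase.INVESTIGATION:
                raise ProtocolError("final diagnosis follows investigation")
            self.phase = Phase.DONE
        else:
            raise ProtocolError(f"unknown action: {action}")


wf = Workflow()
for act in ["query_history", "perform_exam", "provisional_diagnosis",
            "order_lab", "final_diagnosis"]:
    wf.step(act)
```

The sketch does not enforce a minimum amount of bedside review before the provisional diagnosis, since the text does not say how strictly that ordering is checked.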

The interaction is bounded by a global turn limit of 20 steps per case, with category-specific caps of up to 10 History queries, 5 Physical Examination queries, 3 Imaging orders, and 3 Lab orders. Failed requests caused by formatting can be retried twice; these retries count toward the global limit but not the category-specific limits.
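These limits amount to a two-level budget, sketched below. The `TurnBudget` class and its method names are hypothetical. The sketch also assumes that a malformed request consumes one global turn without consuming a category slot, and it leaves the two-retry cap and the accounting of diagnosis actions to the caller, since the text does not specify them in detail.

```python
GLOBAL_LIMIT = 20
CATEGORY_LIMITS = {"history": 10, "physical": 5, "imaging": 3, "lab": 3}


class TurnBudget:
    def __init__(self):
        self.turns_used = 0
        self.category_used = {name: 0 for name in CATEGORY_LIMITS}

    def can_request(self, category: str) -> bool:
        """True if both the global and the category budget allow a request."""
        return (self.turns_used < GLOBAL_LIMIT
                and self.category_used[category] < CATEGORY_LIMITS[category])

    def record(self, category: str) -> None:
        """Charge a well-formed request to both budgets."""
        if not self.can_request(category):
            raise RuntimeError(f"budget exhausted for {category}")
        self.turns_used += 1
        self.category_used[category] += 1

    def record_failed(self) -> None:
        """Charge a malformed request to the global budget only (assumed)."""
        if self.turns_used >= GLOBAL_LIMIT:
            raise RuntimeError("global turn limit reached")
        self.turns_used += 1


budget = TurnBudget()
for _ in range(3):
    budget.record("lab")   # three lab orders exhaust the lab cap
budget.record_failed()     # a formatting failure costs a global turn only
```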

Information revelation is mediated by an Examiner module, which maps agent free-text queries to the structured case keys and returns natural-language responses. For history and physical examination, explicitly queried negative findings are returned. For common laboratory tests not specified in the case, default normal values with reference ranges are returned. Unavailable investigations are reported as “not available.” The agent must output up to five diagnoses at each diagnostic stage, each with a condition name, ICD-10 name, ICD-10 code, and a numeric confidence score.
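The response rules above can be sketched as a lookup with fallbacks. This is a simplified illustration under stated assumptions: `examiner_respond` is a hypothetical function, the query is assumed to be already mapped to a case key, the case contents and the default sodium entry are invented, and an undocumented bedside finding is assumed to be answered as a negative.

```python
BEDSIDE = {"history", "physical"}

# Illustrative default for one common laboratory test (not from the benchmark).
DEFAULT_NORMALS = {"sodium": "140 mmol/L (reference range 135-145 mmol/L)"}


def examiner_respond(case: dict, section: str, key: str) -> str:
    """Answer one already-mapped query against a structured case."""
    findings = case.get(section, {})
    if key in findings:
        return findings[key]
    if section in BEDSIDE:
        # Assumed reading: a queried but undocumented bedside finding
        # is reported as absent.
        return f"No {key}."
    if section == "lab" and key in DEFAULT_NORMALS:
        return DEFAULT_NORMALS[key]
    return "not available"


# Hypothetical case contents for demonstration only.
case = {
    "history": {"alcohol use": "Drinks heavily on weekends."},
    "lab": {"lipase": "Markedly elevated."},
}

replies = [
    examiner_respond(case, "history", "alcohol use"),
    examiner_respond(case, "history", "fever"),
    examiner_respond(case, "lab", "sodium"),
    examiner_respond(case, "imaging", "mri brain"),
]
```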

4. Retrieval, scoring, and calibration machinery

VivaBench measures performance at three diagnostic stages: provisional diagnosis before investigations, final diagnosis after interaction, and a Full Information pre-test in which all history, physical examination, imaging, and laboratory information is supplied upfront. Top-$k$ diagnostic accuracy is scored using both exact and approximate match criteria. Exact match requires correspondence to a ground-truth ICD-10 code or name at the appropriate level, and for the interactive stages it additionally requires that the agent ordered at least one relevant investigation supporting that diagnosis. Approximate match extends credit to accepted differentials, higher-level ICD-10 hierarchy matches, or predictions whose cosine similarity to a ground-truth diagnosis exceeds a threshold, again with at least one relevant supporting investigation during interactive evaluation (Chiu et al., 11 Oct 2025).
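The exact and approximate criteria can be sketched as a three-way classifier. This is a simplification under stated assumptions: the function names are hypothetical, "higher-level hierarchy match" is approximated by a shared three-character ICD-10 category, name-based matching is skipped, and the semantic-similarity route is omitted.

```python
def icd_category(code: str) -> str:
    """Return the three-character ICD-10 category, e.g. 'K85' for 'K85.9'."""
    return code.split(".")[0].upper()


def classify_prediction(pred, truth, differentials, supported=True):
    """Label one predicted ICD-10 code as exact, approximate, or unmatched."""
    if not supported:
        # Interactive stages require a relevant supporting investigation.
        return "unmatched"
    pred = pred.upper()
    if pred in truth:
        return "exact"
    if pred in differentials:
        return "approximate"
    if icd_category(pred) in {icd_category(t) for t in truth}:
        return "approximate"
    return "unmatched"


def top_k_exact(preds, truth, differentials, k):
    """True if any of the first k predictions is an exact match."""
    return any(classify_prediction(p, truth, differentials) == "exact"
               for p in preds[:k])


truth = {"K85.9"}          # acute pancreatitis, unspecified
differentials = {"K26.9"}  # duodenal ulcer, listed here as an accepted differential
labels = [classify_prediction(p, truth, differentials)
          for p in ["K85.9", "K85.1", "K26.9", "I63.9"]]
```

Passing `supported=False` models the interactive requirement: a correct code earns no credit if the agent never ordered an investigation relevant to it.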

Confidence is evaluated in two ways. The benchmark reports mean raw confidence at the provisional and final stages. It also reports a confidence-weighted score computed over the sets of exact, approximate, and unmatched predictions; in the appendix, the confidences are normalized to yield a normalized variant of this score.

Information-seeking efficiency is quantified through precision and recall relative to expert-annotated diagnosis-relevant items:

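$$\text{Precision} = \frac{|Q \cap R|}{|Q|}, \qquad \text{Recall} = \frac{|Q \cap R|}{|R|}$$

Here $Q$ denotes the set of items the agent requested and $R$ the set of expert-annotated diagnosis-relevant items. These are computed separately for Review and Investigation, both overall and targeted to active hypotheses.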
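A minimal sketch of these two quantities follows, using the standard set-based definitions of precision and recall. The function name and the example item sets are illustrative and not taken from the benchmark.

```python
def precision_recall(requested, relevant):
    """Set-based precision and recall of requested items against relevant ones."""
    requested, relevant = set(requested), set(relevant)
    hits = requested & relevant
    precision = len(hits) / len(requested) if requested else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical investigation phase: two of three orders were relevant,
# but two relevant investigations were never ordered.
requested = {"lipase", "amylase", "troponin"}
relevant = {"lipase", "amylase", "abdominal ultrasound", "endoscopy"}
precision, recall = precision_recall(requested, relevant)
```

Here precision is 2/3 and recall is 1/2, the "selective but incomplete" profile that the benchmark reports as typical.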

The retrieval layer is designed for reproducibility. A deterministic mapper uses cosine-similarity embeddings, SNOMED-CT mediated similarity, medical entity recognition, and domain synonym dictionaries such as LOINC and imaging modality lexicons. An LLM-based mapper using gpt-4.1 at temperature 0 interprets free-text queries grounded to the case’s available keys. On a 100-case calibration set, the LLM mapper showed higher precision and recall than the deterministic mapper and high determinism, with intersection-over-union greater than 0.99 over repeated identical inputs. Inter-rater reliability for mapping and query validation is reported as a weighted Cohen’s $\kappa$ indicating moderate agreement.
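The determinism check can be illustrated by computing intersection-over-union between the key sets returned for repeated identical queries. Averaging over all pairs of runs is an assumption about the aggregation, and the function names and example runs are hypothetical.

```python
from itertools import combinations


def iou(a, b):
    """Intersection-over-union (Jaccard index) of two sets of mapped keys."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty mappings agree perfectly
    return len(a & b) / len(a | b)


def mean_pairwise_iou(runs):
    """Average IoU over every pair of repeated runs on the same input."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(iou(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical mapper outputs for the same free-text query.
runs = [{"lipase", "amylase"}, {"lipase", "amylase"}, {"lipase"}]
score = mean_pairwise_iou(runs)
```

In this toy example one divergent run pulls the score down to 2/3; a value above 0.99 therefore implies near-identical outputs across repeats.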

5. Baseline evaluation and empirical findings

The benchmark was evaluated with Gemini 2.5 Pro, DeepSeek-R1, OpenAI o4-mini, Llama-4 Maverick, xAI Grok 3 Mini Beta, and Qwen 3 (235b-a22b), all run at temperature 0 via OpenRouter with a standardized system prompt implementing the two-phase protocol. First-pass completion success exceeded 97% for all models, and metrics were computed on the intersection of successful cases, totaling 934 cases. A full run consumed approximately 20–39 million tokens per model (Chiu et al., 11 Oct 2025).

The principal empirical result is a marked drop from single-turn synthesis under Full Information to multi-turn interaction. Across all models, Full Information accuracy was substantially higher than final interactive accuracy, indicating that models often “knew the answer” when given all evidence but failed to gather and use evidence effectively when acting sequentially.

Model               Top-1 exact (interactive final)   Top-1 exact (Full Information)   Average actions per case
Gemini 2.5 Pro      0.35                              0.69                             8.8
o4-mini             0.32                              0.63                             8.9
DeepSeek-R1         0.23                              0.61                             5.5
Llama-4 Maverick    0.23                              0.52                             8.5
Grok 3 Mini Beta    0.16                              0.60                             7.0
Qwen 3              0.21                              0.47                             5.5
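The size of the gap can be read directly off the table. The short calculation below uses only the reported top-1 exact figures and assumes that the first value in each pair is the interactive final score and the second the Full Information score, consistent with the surrounding text.

```python
# (interactive final, Full Information) top-1 exact accuracy from the table.
results = {
    "Gemini 2.5 Pro": (0.35, 0.69),
    "o4-mini": (0.32, 0.63),
    "DeepSeek-R1": (0.23, 0.61),
    "Llama-4 Maverick": (0.23, 0.52),
    "Grok 3 Mini Beta": (0.16, 0.60),
    "Qwen 3": (0.21, 0.47),
}

# Absolute drop when moving from full-information synthesis to interaction.
gaps = {model: round(full - interactive, 2)
        for model, (interactive, full) in results.items()}

largest = max(gaps, key=gaps.get)
smallest = min(gaps, key=gaps.get)
```

Under this reading, every model loses between 0.26 and 0.44 in absolute top-1 exact accuracy, with the largest drop for Grok 3 Mini Beta and the smallest for Qwen 3.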

The same degradation appears in top-3 exact accuracy: for Gemini 2.5 Pro, o4-mini, DeepSeek-R1, Llama-4 Maverick, Grok 3 Mini Beta, and Qwen 3 alike, the interactive score falls below the Full Information score. Expanded top-$k$ approximate accuracies follow the same pattern, including Gemini’s approximate top-1 accuracy.

Confidence rose sharply from provisional to final stages for most models, with Gemini 2.5 Pro and o4-mini given as examples. This increase was not uniformly desirable: the benchmark reports that most models increased confidence in maintained diagnoses regardless of correctness, indicating confirmation bias. Correlation analysis further showed that diagnosis removal and diagnosis maintenance were positively associated with improvements in exact accuracy, approximate accuracy, and the confidence-weighted score, while larger absolute confidence shifts correlated with performance gains. The paper notes that some models, particularly DeepSeek and Llama, showed better calibration of confidence shifts toward correctness.

Information-seeking efficiency exhibited a characteristic asymmetry. Precision was generally higher than recall, suggesting that models were selective but incomplete in their inquiries. Bedside review through history and physical examination was better targeted than investigation ordering, whereas laboratory and imaging requests often showed low recall and, in many cases, low precision due to generic or niche tests unrelated to the case.

Specialty-level analysis reported relative strengths in Infectious Disease/Immunology and Cardiovascular/Metabolic, and weaknesses in Pediatric and Neurological/Psychiatric cases, which the authors associate with nuanced symptom interpretation and developmental context.

6. Failure modes and illustrative diagnostic trajectories

The benchmark identified four recurrent cognitive errors that mirror familiar clinical pitfalls: fixation or anchoring on initial hypotheses, excessive or inappropriate investigations, premature diagnostic closure or satisfaction-of-search, and failure to screen for critical or time-sensitive conditions. A separate LLM-based classifier grouped failures across cases into inappropriate hypothesis generation (348), premature diagnostic closure (291), inadequate investigations (90), and ICD coding errors (9), indicating that reasoning deficits dominated over pure knowledge gaps (Chiu et al., 11 Oct 2025).

The example transcripts clarify how these errors arise in interaction. In a pancreatitis case involving a 28-year-old male with severe epigastric pain radiating to the back, the agent appropriately reviewed pain characteristics and risk factors, found epigastric tenderness, prioritized acute pancreatitis provisionally, ordered amylase and lipase, and then stopped after confirming pancreatitis. It failed to investigate the underlying etiology and missed the documented duodenal ulcer-induced obstruction. The proximal diagnosis was correct, but the workup was incomplete.

In a posterior fossa stroke case involving facial droop, dysphagia, and severe hypertension, the agent localized to the brainstem and suspected pontine infarct, but ordered only non-contrast CT, which has low sensitivity for posterior fossa infarcts. After chronic changes were seen, the model concluded transient ischemic attack rather than the documented pontomedullary stroke. The error lay not in total ignorance of the syndrome but in inappropriate sequencing and interpretation of investigations under uncertainty.

A pediatric example involved a 4-week-old infant with poor feeding, “trouble breathing,” normal saturation, and bilateral crackles. The agent anchored on heart failure and cascaded into BNP, echocardiography, and ECG while failing to prioritize more likely pediatric etiologies given the age and presentation. This case illustrates how early miscalibration of the hypothesis set can distort all downstream information acquisition.

These examples support the benchmark’s broader claim that sequential reasoning can derail even when knowledge of canonical presentations is present. The clinically salient problem is therefore not only wrong answers, but wrong paths.

7. Relation to other benchmarks, limitations, and broader significance

VivaBench is explicitly contrasted with knowledge-centric QA benchmarks such as MedQA, PubMedQA, and MultiMedQA, which test factual recall or short reasoning over fully specified inputs. It is also distinguished from interactive medical environments such as AI Hospital, AgentClinic, and AMIE. The paper argues that those environments often rely on human-in-the-loop or non-deterministic multi-LLM components, whereas VivaBench emphasizes open-source structured cases, standardized coding through SNOMED-CT, LOINC, and ICD-10, deterministic information retrieval, and an extensible evaluation framework centered on diagnostic sequential decision-making (Chiu et al., 11 Oct 2025).

The released repository is hosted at https://huggingface.co/datasets/chychiu/VivaBench/. Each entry includes source metadata, the free-text vignette, ground-truth diagnoses and differentials, and a machine-readable JSON case object aligned with the evaluation framework. Cases are derived from published case reports and other publicly available sources; no private patient datasets were used in the released benchmark. Ethical review and formal licensing terms are not specified in the paper. The authors also report a contamination analysis suggesting that the evaluated models did not simply memorize the source PubMed reports.

Several limitations are stated directly. The dataset is modest relative to large QA corpora and is largely derived from case reports, which may bias it toward academic or atypical narratives. The deterministic mapper cannot capture the full variability of clinical communication, while the LLM mapper, despite high empirical determinism, introduces a residual non-determinism concern. Each model was run once because of computational cost, so stochastic variability across long horizons was not explored. The viva protocol itself is a simplification of real encounters: it omits cost, time-to-result, and multi-user constraints and provides only a limited toolset.

The broader significance of VivaBench lies in the specific capability gap it exposes: a persistent difference between encoded medical knowledge and the agentic capacity to plan, gather, revise, and calibrate under uncertainty. This suggests that advancing medical LLMs for clinical decision support will require progress in information-seeking recall, hypothesis revision, confidence calibration, and safety-critical screening, not merely stronger single-turn diagnostic synthesis. A plausible implication is that benchmarks of this form will remain important beyond medicine, because they make sequential reasoning trajectories explicit and measurable in complex, partially observable decision environments.
