VivaBench: Clinical Reasoning Benchmark
- VivaBench is a benchmark that assesses multi-turn, hypothesis-driven clinical reasoning rather than simple medical fact recall.
- It simulates a sequential diagnostic process where LLM agents actively gather information, revise hypotheses, and calibrate confidence based on clinical data.
- The benchmark evaluates performance via precision, recall, and confidence metrics, highlighting the gap between full-information synthesis and interactive reasoning.
VivaBench is a multi-turn, viva voce–style benchmark for evaluating whether LLM agents can carry out hypothesis-driven clinical reasoning under uncertainty rather than merely recall medical facts. It operationalizes diagnosis as an iterative process in which the agent starts from limited, ambiguous information, actively probes for history and physical examination findings, orders investigations, and then produces provisional and final diagnoses with explicit confidence values. In this formulation, conversational medical AI is assessed not only on endpoint correctness but also on sequential information gathering, hypothesis revision, and calibration, with the benchmark positioned as an open-source, standardized environment for medical decision-support evaluation and for broader research on agentic AI (Chiu et al., 11 Oct 2025).
1. Clinical reasoning target and benchmark rationale
VivaBench was introduced to address a specific limitation of prevailing medical LLM evaluation: most widely used benchmarks, including MedQA, PubMedQA, and MultiMedQA, assess knowledge recall from fully specified, single-turn inputs. These settings omit the central clinical task of forming and refining hypotheses under missing information, uncertainty, and competing explanations. VivaBench instead adopts the viva voce format, where information is gradually revealed only in response to targeted queries, forcing the model to decide what to ask, what to test, and when to commit diagnostically (Chiu et al., 11 Oct 2025).
The benchmark frames clinical decision-making as a partially observable process. Physicians are described as approximately updating their working hypotheses as evidence accrues through history, physical examination, imaging, and laboratory data. This emphasis matters because high single-turn accuracy can conceal metacognitive deficits, including overconfidence and failure to recognize knowledge gaps. A central implication of VivaBench is that strong performance on fully specified diagnostic synthesis does not guarantee competent sequential reasoning in underspecified clinical encounters.
The viva format also changes the object of evaluation. Instead of measuring only whether the final diagnosis matches the ground truth, the benchmark inspects the trajectory by which a diagnosis is reached. This includes the relevance of bedside review, the selectivity of investigations, the timing of provisional commitment, and the extent to which new evidence induces revision rather than confirmation of initial beliefs.
2. Corpus construction, curation, and scope
Cases were drawn from publicly available repositories: PubMed case reports, MedQA, and sample scenarios from physician colleges in Australia and the United Kingdom. Automated NLP and LLM screening prioritized common presentations with clear diagnostic pathways, multifaceted reasoning, and high educational value, after which human clinicians reviewed and corrected the structured cases. The generation pipeline produced 1,952 structured cases, and after clinician validation the curated evaluation set consists of 990 cases distributed across nine specialty groups (Chiu et al., 11 Oct 2025).
The specialty composition is reported as follows: Endocrine/Reproductive (150), Infectious Disease/Immunology (150), Cardiovascular/Metabolic (148), Gastrointestinal (147), Neurological/Psychiatric (136), Hematology/Oncology/Other (112), Pediatric (69), Respiratory (51), and Musculoskeletal/Pain (26). Each case includes demographics such as age, gender, and ethnicity and is structured to require multi-step reasoning. Aggregate demographic distributions are not reported.
A recurrent point of confusion concerns dataset size. The paper notes an earlier selection of 1,152 PubMed cases prior to consolidation; the figure 1,762, by contrast, does not appear among the paper’s reported case counts. The released resources are the larger 1,952-case structured pool and the curated 990-case evaluation set. The intended use is evaluation only: no train/dev/test splits are defined.
This design choice signals that VivaBench is not primarily a dataset for supervised optimization. It is instead a measurement instrument intended to expose sequential reasoning behavior under controlled interactive conditions.
3. Case representation and interactive protocol
Each clinical case is represented through five components: History, Physical Examination, Imaging, Laboratory investigations, and a ground-truth Diagnosis set with accepted differentials. History and Physical Examination items are mapped to SNOMED-CT, Imaging and Laboratory items to LOINC, and diagnoses to ICD-10, enabling deterministic matching between free-text agent queries and the structured case state (Chiu et al., 11 Oct 2025).
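As a rough illustration of this representation, the five components can be pictured as keyed sections of a single record. The sketch below is a mock-up under stated assumptions, not the benchmark's released schema: the class name `ClinicalCase`, its field names, the `lookup` helper, and the placeholder keys are all hypothetical, and real cases would carry SNOMED-CT, LOINC, and ICD-10 identifiers where the placeholders appear.

```python
from dataclasses import dataclass, field
from typing import Optional

# Sections that hold queryable findings (hypothetical names).
SECTIONS = ("history", "physical_exam", "imaging", "labs")


@dataclass
class ClinicalCase:
    """Hypothetical container for the five case components (not the released schema)."""
    history: dict[str, str] = field(default_factory=dict)        # keyed by SNOMED-CT concepts
    physical_exam: dict[str, str] = field(default_factory=dict)  # keyed by SNOMED-CT concepts
    imaging: dict[str, str] = field(default_factory=dict)        # keyed by LOINC entries
    labs: dict[str, str] = field(default_factory=dict)           # keyed by LOINC entries
    diagnoses: set[str] = field(default_factory=set)             # ground-truth ICD-10 codes
    differentials: set[str] = field(default_factory=set)         # accepted differential ICD-10 codes

    def lookup(self, section: str, key: str) -> Optional[str]:
        """Return the stored finding for an exact key, or None if the case has no such key."""
        if section not in SECTIONS:
            raise ValueError(f"unknown section: {section}")
        return getattr(self, section).get(key)


# Illustrative content only; keys are placeholders rather than real terminology codes.
case = ClinicalCase(
    history={"<snomed: epigastric pain>": "Severe epigastric pain radiating to the back"},
    labs={"<loinc: lipase>": "Lipase elevated"},
    diagnoses={"K85.9"},  # illustrative ICD-10 code (acute pancreatitis, unspecified)
)
print(case.lookup("labs", "<loinc: lipase>"))
```

Keeping each section as a plain key-to-text mapping is one way to make retrieval deterministic once a free-text query has been resolved to a key; how that resolution is done is a separate concern.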
The action space contains four information actions and two diagnosis actions. The information actions are query History, perform Physical Examination, order Imaging, and order Laboratory investigations. The diagnosis actions are a provisional diagnosis after Review and a final diagnosis after Investigation. The protocol imposes four workflow constraints:
- First gather bedside information through History and Physical Examination.
- Submit a provisional diagnosis before ordering investigations.
- Once investigations are ordered, no further history or physical-examination queries are allowed.
- Execute one action per turn and provide a final diagnosis when sufficient information has been obtained.
The interaction is bounded by a global turn limit of 20 steps per case, with category-specific caps of up to 10 History queries, 5 Physical Examination queries, 3 Imaging orders, and 3 Lab orders. Failed requests caused by formatting can be retried twice; these retries count toward the global limit but not the category-specific limits.
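The workflow constraints and turn limits above can be read as a small state machine. The following is a minimal sketch under several assumptions: the class `VivaSession`, the action labels, and the error type are hypothetical; the sketch follows the literal wording (bedside queries close once an investigation is ordered); it assumes a final diagnosis requires a prior provisional one; and it does not model the two formatting retries.

```python
GLOBAL_TURN_LIMIT = 20
CATEGORY_CAPS = {"history": 10, "physical": 5, "imaging": 3, "lab": 3}


class ProtocolError(Exception):
    """Raised when an action violates the workflow constraints."""


class VivaSession:
    """Hypothetical per-case tracker; action names are illustrative labels."""

    def __init__(self) -> None:
        self.turns = 0
        self.counts = {name: 0 for name in CATEGORY_CAPS}
        self.provisional_submitted = False
        self.investigations_started = False
        self.closed = False

    def _check_cap(self, action: str) -> None:
        if self.counts[action] >= CATEGORY_CAPS[action]:
            raise ProtocolError(f"{action} cap of {CATEGORY_CAPS[action]} reached")

    def act(self, action: str) -> None:
        """Validate one action, then record it as a single turn."""
        if self.closed:
            raise ProtocolError("final diagnosis already submitted")
        if self.turns >= GLOBAL_TURN_LIMIT:
            raise ProtocolError("global turn limit reached")

        if action in ("history", "physical"):
            if self.investigations_started:
                raise ProtocolError("bedside queries close once investigations are ordered")
            self._check_cap(action)
        elif action == "provisional":
            if self.provisional_submitted:
                raise ProtocolError("provisional diagnosis already submitted")
        elif action in ("imaging", "lab"):
            if not self.provisional_submitted:
                raise ProtocolError("submit a provisional diagnosis before investigations")
            self._check_cap(action)
        elif action == "final":
            if not self.provisional_submitted:  # assumption, see lead-in
                raise ProtocolError("a provisional diagnosis must precede the final one")
        else:
            raise ValueError(f"unknown action: {action}")

        # All checks passed: commit the state change for this turn.
        if action in CATEGORY_CAPS:
            self.counts[action] += 1
        if action == "provisional":
            self.provisional_submitted = True
        if action in ("imaging", "lab"):
            self.investigations_started = True
        if action == "final":
            self.closed = True
        self.turns += 1
```

Validating before mutating any counters means a rejected action leaves the session unchanged, which keeps the category caps and the global limit easy to reason about.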
Information revelation is mediated by an Examiner module, which maps agent free-text queries to the structured case keys and returns natural-language responses. For history and physical examination, explicitly queried negative findings are returned. For common laboratory tests not specified in the case, default normal values with reference ranges are returned. Unavailable investigations are reported as “not available.” The agent must output up to five diagnoses at each diagnostic stage, each with a condition name, ICD-10 name, ICD-10 code, and a numeric confidence score.
4. Retrieval, scoring, and calibration machinery
VivaBench measures performance at three diagnostic stages: the provisional diagnosis before investigations, the final diagnosis after interaction, and a Full Information pre-test in which all case information is supplied upfront. Top-$k$ diagnostic accuracy is scored using both exact and approximate match criteria. Exact match requires correspondence to a ground-truth ICD-10 code or name at the appropriate level, and for the interactive stages it additionally requires that the agent ordered at least one relevant investigation supporting that diagnosis. Approximate match extends credit to accepted differentials, higher-level ICD-10 hierarchy matches, or predictions whose semantic (cosine) similarity to a ground-truth diagnosis exceeds a threshold, again with at least one relevant supporting investigation during interactive evaluation (Chiu et al., 11 Oct 2025).
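The exact and approximate criteria can be sketched as a small classification function. This is an illustrative reading rather than the benchmark's scoring code: the function names are hypothetical, the "higher-level hierarchy" check is approximated by a shared three-character ICD-10 category, the similarity value is assumed to be precomputed, the threshold is left as a caller-supplied parameter, and the supporting investigations are simplified to one set per case rather than one per diagnosis.

```python
def icd10_category(code: str) -> str:
    """Three-character ICD-10 category, e.g. 'K85.9' -> 'K85'."""
    return code.split(".")[0][:3].upper()


def match_level(
    pred_code: str,
    truth_codes: set[str],
    differential_codes: set[str],
    ordered: set[str],
    relevant: set[str],
    similarity: float,
    sim_threshold: float,
    interactive: bool = True,
) -> str:
    """Classify one predicted diagnosis as 'exact', 'approximate', or 'unmatched'."""
    # Interactive stages require at least one relevant investigation to have been ordered.
    if interactive and not (set(ordered) & set(relevant)):
        return "unmatched"
    if pred_code in truth_codes:
        return "exact"
    shares_category = icd10_category(pred_code) in {icd10_category(c) for c in truth_codes}
    if pred_code in differential_codes or shares_category or similarity >= sim_threshold:
        return "approximate"
    return "unmatched"


def top_k_hit(levels: list[str], k: int, accepted: set[str]) -> bool:
    """True if any of the first k ranked predictions reached an accepted match level."""
    return any(level in accepted for level in levels[:k])
```

For example, `top_k_hit(levels, 1, {"exact"})` would correspond to top-1 exact accuracy for one case, while passing `{"exact", "approximate"}` would give the approximate variant.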
Confidence is evaluated in two ways. The benchmark reports mean raw confidence at the provisional and final stages. It also reports a confidence-weighted score computed over three sets of predictions: exact matches, approximate matches, and unmatched predictions. In the appendix, the confidences are normalized, yielding a normalized version of the score.
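Information-seeking efficiency is quantified through precision and recall relative to expert-annotated diagnosis-relevant items. Writing $A$ for the set of items the agent requested and $R$ for the set of annotated relevant items (notation chosen here for readability), the standard set definitions are:

$$\text{Precision} = \frac{|A \cap R|}{|A|}, \qquad \text{Recall} = \frac{|A \cap R|}{|R|}$$

These are computed separately for the Review and Investigation phases, both overall and targeted to active hypotheses.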
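A minimal sketch of this computation, assuming the agent's requests have already been mapped to the same keys as the expert annotations. The function name, the example item names, and the convention of returning 0.0 for an empty set are assumptions made for illustration.

```python
def precision_recall(requested: set[str], relevant: set[str]) -> tuple[float, float]:
    """Set-based precision and recall of requested items against annotated relevant items."""
    requested, relevant = set(requested), set(relevant)
    hits = requested & relevant
    precision = len(hits) / len(requested) if requested else 0.0  # assumed convention
    recall = len(hits) / len(relevant) if relevant else 0.0       # assumed convention
    return precision, recall


# Hypothetical Investigation-phase example: four tests ordered, four annotated as relevant.
ordered = {"lipase", "amylase", "troponin", "d-dimer"}
annotated = {"lipase", "amylase", "abdominal CT", "liver panel"}
print(precision_recall(ordered, annotated))  # (0.5, 0.5)
```

Running the same function once on bedside items and once on investigation items gives the per-phase split described above.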
The retrieval layer is designed for reproducibility. A deterministic mapper uses cosine-similarity embeddings, SNOMED-CT mediated similarity, medical entity recognition, and domain synonym dictionaries such as LOINC and imaging modality lexicons. An LLM-based mapper using gpt-4.1 at temperature 0 interprets free-text queries grounded to the case’s available keys. On a 100-case calibration set, the LLM mapper showed higher precision and recall than the deterministic mapper and high determinism, with intersection-over-union greater than 0.99 over repeated identical inputs. Inter-rater reliability for mapping and query validation is reported as a weighted Cohen’s $\kappa$ in the moderate-agreement range.
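One way to picture the determinism check is as intersection-over-union between the key sets returned for the same query on repeated runs. The sketch below is an assumption about how such a check could be set up, not the paper's procedure: the function names are hypothetical, and averaging over all pairs of runs is one of several possible aggregations.

```python
from itertools import combinations


def iou(a: set[str], b: set[str]) -> float:
    """Intersection-over-union of two key sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_iou(runs: list[set[str]]) -> float:
    """Average IoU over every pair of repeated runs for the same input."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    return sum(iou(a, b) for a, b in pairs) / len(pairs)


# Hypothetical mapped keys from three repeated calls on one query.
runs = [{"lipase", "amylase"}, {"lipase", "amylase"}, {"lipase"}]
print(mean_pairwise_iou(runs))
```

A value near 1.0 would indicate that repeated identical inputs map to nearly identical key sets.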
5. Baseline evaluation and empirical findings
The benchmark was evaluated with Gemini 2.5 Pro, DeepSeek-R1, OpenAI o4-mini, Llama-4 Maverick, xAI Grok 3 Mini Beta, and Qwen 3 (235b-a22b), all run at temperature 0 via OpenRouter with a standardized system prompt implementing the two-phase protocol. First-pass completion success exceeded 97% for all models, and metrics were computed on the intersection of successful cases, totaling 934 cases. A full run consumed approximately 20–39 million tokens per model (Chiu et al., 11 Oct 2025).
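Restricting metrics to cases that every model completed amounts to a set intersection over per-model success lists. A minimal sketch, with a hypothetical function name and made-up case identifiers:

```python
def common_cases(successes: dict[str, set[str]]) -> set[str]:
    """Case IDs completed successfully by every model."""
    if not successes:
        return set()
    return set.intersection(*(set(ids) for ids in successes.values()))


# Made-up identifiers for illustration only.
successes = {
    "model_a": {"case-001", "case-002", "case-003"},
    "model_b": {"case-001", "case-003"},
    "model_c": {"case-001", "case-002", "case-003", "case-004"},
}
print(sorted(common_cases(successes)))  # ['case-001', 'case-003']
```

Scoring on the shared subset keeps comparisons across models on identical cases, at the cost of dropping any case that a single model failed to complete.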
The principal empirical result is a marked drop from single-turn synthesis under Full Information to multi-turn interaction. Across all models, Full Information accuracy was substantially higher than final interactive accuracy, indicating that models often “knew the answer” when given all evidence but failed to gather and use evidence effectively when acting sequentially.
| Model | Top-1 exact accuracy: final interactive vs. Full Information | Average actions per case |
|---|---|---|
| Gemini 2.5 Pro | 0.35 vs 0.69 | 8.8 |
| o4-mini | 0.32 vs 0.63 | 8.9 |
| DeepSeek-R1 | 0.23 vs 0.61 | 5.5 |
| Llama-4 Maverick | 0.23 vs 0.52 | 8.5 |
| Grok 3 Mini Beta | 0.16 vs 0.60 | 7.0 |
| Qwen 3 | 0.21 vs 0.47 | 5.5 |
The same degradation appears in top-3 exact accuracy for all six models: Gemini 2.5 Pro, o4-mini, DeepSeek-R1, Llama-4 Maverick, Grok 3 Mini Beta, and Qwen 3 each scored lower in the final interactive stage than under Full Information. Expanded top-$k$ approximate accuracies follow the same pattern, including Gemini’s approximate top-1 accuracy.
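The size of the gap can be read directly off the top-1 exact figures in the table. The short calculation below takes the first value in each pair as the final interactive accuracy and the second as Full Information accuracy, which is an interpretation consistent with the statement that Full Information accuracy is the higher of the two.

```python
# (final interactive, Full Information) top-1 exact accuracy, copied from the table.
top1_exact = {
    "Gemini 2.5 Pro": (0.35, 0.69),
    "o4-mini": (0.32, 0.63),
    "DeepSeek-R1": (0.23, 0.61),
    "Llama-4 Maverick": (0.23, 0.52),
    "Grok 3 Mini Beta": (0.16, 0.60),
    "Qwen 3": (0.21, 0.47),
}

# Absolute drop in accuracy when moving from Full Information to interaction.
gaps = {model: round(full - final, 2) for model, (final, full) in top1_exact.items()}

for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {gap:.2f}")
```

On these figures the absolute drop ranges from 0.26 (Qwen 3) to 0.44 (Grok 3 Mini Beta), so every model loses at least a quarter of a point of top-1 exact accuracy on a 0 to 1 scale.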
Confidence rose sharply from the provisional to the final stage for most models; Gemini 2.5 Pro and o4-mini are the examples given. This increase was not uniformly desirable: the benchmark reports that most models increased confidence in maintained diagnoses regardless of correctness, indicating confirmation bias. Correlation analysis further showed that diagnosis removal and diagnosis maintenance were positively associated with improvements in exact accuracy, approximate accuracy, and the confidence-weighted score, and that larger absolute confidence shifts correlated with performance gains. The paper notes that some models, particularly DeepSeek and Llama, showed better calibration of confidence shifts toward correctness.
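The confirmation-bias pattern can be made concrete by comparing confidence shifts on maintained diagnoses that turned out correct with shifts on those that did not. The sketch below is a hypothetical analysis helper, not the paper's method: the function name, the diagnosis labels, and the assumption that confidences lie between 0 and 1 are all illustrative.

```python
from typing import Optional


def confidence_shifts(provisional: dict[str, float], final: dict[str, float], correct: set[str]) -> dict:
    """Summarize how confidence moved between the provisional and final differentials."""
    correct = set(correct)
    maintained = provisional.keys() & final.keys()

    def mean_shift(names: set[str]) -> Optional[float]:
        if not names:
            return None
        return sum(final[n] - provisional[n] for n in names) / len(names)

    return {
        "maintained_correct": mean_shift(maintained & correct),
        "maintained_incorrect": mean_shift(maintained - correct),
        "removed": set(provisional.keys() - final.keys()),
        "added": set(final.keys() - provisional.keys()),
    }


# Made-up differentials with placeholder labels.
provisional = {"dx_a": 0.5, "dx_b": 0.25, "dx_c": 0.25}
final = {"dx_a": 0.75, "dx_b": 0.5}
summary = confidence_shifts(provisional, final, correct={"dx_a"})
print(summary)
```

In this made-up example both maintained diagnoses gain 0.25 in confidence even though only one is correct, which is the kind of correctness-independent increase the text describes.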
Information-seeking efficiency exhibited a characteristic asymmetry. Precision was generally higher than recall, suggesting that models were selective but incomplete in their inquiries. Bedside review through history and physical examination was better targeted than investigation ordering, whereas laboratory and imaging requests often showed low recall and, in many cases, low precision due to generic or niche tests unrelated to the case.
Specialty-level analysis reported relative strengths in Infectious Disease/Immunology and Cardiovascular/Metabolic, and weaknesses in Pediatric and Neurological/Psychiatric cases, which the authors associate with nuanced symptom interpretation and developmental context.
6. Failure modes and illustrative diagnostic trajectories
The benchmark identified four recurrent cognitive errors that mirror familiar clinical pitfalls: fixation or anchoring on initial hypotheses, excessive or inappropriate investigations, premature diagnostic closure or satisfaction-of-search, and failure to screen for critical or time-sensitive conditions. A separate LLM-based classifier grouped failures across cases into inappropriate hypothesis generation (348), premature diagnostic closure (291), inadequate investigations (90), and ICD coding errors (9), indicating that reasoning deficits dominated over pure knowledge gaps (Chiu et al., 11 Oct 2025).
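The relative weight of each classifier category follows from the reported counts. The calculation below treats the four counts as a single partition of classified failures, which is an assumption; the source does not state whether the categories are mutually exclusive.

```python
# Failure counts as reported for the LLM-based classifier.
failure_counts = {
    "inappropriate hypothesis generation": 348,
    "premature diagnostic closure": 291,
    "inadequate investigations": 90,
    "ICD coding errors": 9,
}

total = sum(failure_counts.values())
shares = {name: round(100 * count / total, 1) for name, count in failure_counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

Under that assumption, hypothesis generation and premature closure together account for roughly 87% of the 738 classified failures, while coding errors are about 1%.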
The example transcripts clarify how these errors arise in interaction. In a pancreatitis case involving a 28-year-old male with severe epigastric pain radiating to the back, the agent appropriately reviewed pain characteristics and risk factors, found epigastric tenderness, prioritized acute pancreatitis provisionally, ordered amylase and lipase, and then stopped after confirming pancreatitis. It failed to investigate the underlying etiology and missed the documented duodenal ulcer-induced obstruction. The proximal diagnosis was correct, but the workup was incomplete.
In a posterior fossa stroke case involving facial droop, dysphagia, and severe hypertension, the agent localized to the brainstem and suspected pontine infarct, but ordered only non-contrast CT, which has low sensitivity for posterior fossa infarcts. After chronic changes were seen, the model concluded transient ischemic attack rather than the documented pontomedullary stroke. The error lay not in total ignorance of the syndrome but in inappropriate sequencing and interpretation of investigations under uncertainty.
A pediatric example involved a 4-week-old infant with poor feeding, “trouble breathing,” normal saturation, and bilateral crackles. The agent anchored on heart failure and cascaded into BNP, echocardiography, and ECG while failing to prioritize more likely pediatric etiologies given the age and presentation. This case illustrates how early miscalibration of the hypothesis set can distort all downstream information acquisition.
These examples support the benchmark’s broader claim that sequential reasoning can derail even when knowledge of canonical presentations is present. The clinically salient problem is therefore not only wrong answers, but wrong paths.
7. Relation to other benchmarks, limitations, and broader significance
VivaBench is explicitly contrasted with knowledge-centric QA benchmarks such as MedQA, PubMedQA, and MultiMedQA, which test factual recall or short reasoning over fully specified inputs. It is also distinguished from interactive medical environments such as AI Hospital, AgentClinic, and AMIE. The paper argues that those environments often rely on human-in-the-loop or non-deterministic multi-LLM components, whereas VivaBench emphasizes open-source structured cases, standardized coding through SNOMED-CT, LOINC, and ICD-10, deterministic information retrieval, and an extensible evaluation framework centered on diagnostic sequential decision-making (Chiu et al., 11 Oct 2025).
The released repository is hosted at https://huggingface.co/datasets/chychiu/VivaBench/. Each entry includes source metadata, the free-text vignette, ground-truth diagnoses and differentials, and a machine-readable JSON case object aligned with the evaluation framework. Cases are derived from published case reports and other publicly available sources; no private patient datasets were used in the released benchmark. Ethical review and formal licensing terms are not specified in the paper. The authors also report a contamination analysis suggesting that the evaluated models did not simply memorize the source PubMed reports.
Several limitations are stated directly. The dataset is modest relative to large QA corpora and is largely derived from case reports, which may bias it toward academic or atypical narratives. The deterministic mapper cannot capture the full variability of clinical communication, while the LLM mapper, despite high empirical determinism, introduces a residual non-determinism concern. Each model was run once because of computational cost, so stochastic variability across long horizons was not explored. The viva protocol itself is a simplification of real encounters: it omits cost, time-to-result, and multi-user constraints and provides only a limited toolset.
The broader significance of VivaBench lies in the specific capability gap it exposes: a persistent difference between encoded medical knowledge and the agentic capacity to plan, gather, revise, and calibrate under uncertainty. This suggests that advancing medical LLMs for clinical decision support will require progress in information-seeking recall, hypothesis revision, confidence calibration, and safety-critical screening, not merely stronger single-turn diagnostic synthesis. A plausible implication is that benchmarks of this form will remain important beyond medicine, because they make sequential reasoning trajectories explicit and measurable in complex, partially observable decision environments.