MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology (2511.20490v1)

Published 25 Nov 2025 in cs.LG and cs.AI

Abstract: Multimodal LLMs hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question-answering, overlooking multi-agent decision-making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, where diagnostic and prognostic tasks require integrating heterogeneous data and evolving insights over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce MTBBench, an agentic benchmark simulating MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions. Ground truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance. We benchmark multiple open and closed-source LLMs and show that, even at scale, they lack reliability -- frequently hallucinating, struggling with reasoning from time-resolved data, and failing to reconcile conflicting evidence or different modalities. To address these limitations, MTBBench goes beyond benchmarking by providing an agentic framework with foundation model-based tools that enhance multi-modal and longitudinal reasoning, leading to task-level performance gains of up to 9.0% and 11.2%, respectively. Overall, MTBBench offers a challenging and realistic testbed for advancing multimodal LLM reasoning, reliability, and tool-use with a focus on MTB environments in precision oncology.

Summary

  • The paper introduces a novel benchmark that replicates Molecular Tumor Board workflows using multimodal, longitudinal clinical data.
  • It evaluates model performance across both multimodal and sequential tasks, demonstrating enhanced diagnostic accuracy with targeted tool integration.
  • The study underscores the importance of foundation model augmentation in improving clinical decision-making in oncology.

MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

Introduction and Motivation

The development of Multimodal LLMs (MLLMs) and their integration into clinical AI research have highlighted the gap between existing QA benchmarks and the realities of clinical workflows, particularly those exemplified by Molecular Tumor Boards (MTBs). MTBs require agents to integrate heterogeneous and temporally distributed multimodal data—including pathology slides, IHC images, genomics, and longitudinal clinical events—to support high-stakes diagnosis, prognosis, and therapeutic decision-making. Traditional medical AI benchmarks fail to adequately capture this complexity, generally restricting evaluation to unimodal, static, and context-deprived queries.

The "MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology" (2511.20490) directly addresses these shortcomings by introducing a new agentic benchmarking framework closely mirroring MTB-style workflows. The benchmark is designed to simulate realistic, multi-agent, and temporally evolving oncology decision-making, tightly coupled with expert clinical validation and tool-augmented agent architectures.

Benchmark Construction and Clinical Realism

MTBBench consists of two primary evaluation tracks: (1) Multimodal and (2) Longitudinal. In both setups, agents interact with patient cases in a staged, multi-turn dialogue, accessing only those files and data corresponding to the current clinical scenario. For the multimodal track, 26 cases with rich H&E, IHC, hematologic, and surgical data are drawn from the Hancock (HC) dataset, with 390 QA pairs representing fine-grained, expert-curated clinical tasks. The longitudinal track utilizes 40 deeply annotated cases from the MSK clinicogenomic cohort, focusing on outcome, recurrence risk, and therapy progression, with 183 expert-validated QA pairs.
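
To make the two-track setup concrete, the sketch below shows one plausible way to represent a case and its staged QA items. All names and fields (ClinicalFile, allowed_files, stage, etc.) are illustrative assumptions for this summary, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ClinicalFile:
    file_id: str            # e.g. "he_slide_03.svs" (hypothetical naming)
    modality: str           # "H&E", "IHC", "hematology", "timeline", ...
    stage: int              # point in the case timeline when the file becomes available

@dataclass
class QAItem:
    question: str
    answer: str             # clinician-validated ground truth
    stage: int              # clinical stage at which the question is posed
    allowed_files: list[str] = field(default_factory=list)

@dataclass
class Case:
    case_id: str
    track: str              # "multimodal" (Hancock) or "longitudinal" (MSK)
    files: list[ClinicalFile] = field(default_factory=list)
    qa_items: list[QAItem] = field(default_factory=list)

    def visible_files(self, stage: int) -> list[ClinicalFile]:
        # Expose only files already available at this stage, mirroring the
        # staged, scenario-gated access policy described above.
        return [f for f in self.files if f.stage <= stage]
```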

A central feature is the dynamic, agentic workflow: agents must actively request information (e.g., slides, labs, timelines) and manage non-persistent, context-limited memory, reflecting the actual constraints and practices of MTB processes (Figure 1).

Figure 1: The MTBBench framework simulates MTB workflows, demanding integration of multimodal, longitudinal data, and benchmarking agentic decision-making with realistic file management and tool use.
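
This staged dialogue can be pictured as a small environment loop. The sketch below is a minimal rendition under assumed interfaces (the agent's step method and the load_file helper are hypothetical): at each turn the agent either requests a permitted file or commits to an answer, and only requested files ever enter its limited context.

```python
def load_file(case, file_id):
    # Placeholder loader; a real harness would return slide images,
    # lab tables, or timeline entries for the requested file.
    return f"<contents of {file_id}>"

def run_episode(agent, case, qa_item, max_turns=10):
    context = [qa_item.question]      # non-persistent, context-limited memory
    files_accessed = []
    for _ in range(max_turns):
        # Assumed agent interface: returns ("request", file_id) or ("answer", text).
        kind, payload = agent.step(context)
        if kind == "request" and payload in qa_item.allowed_files:
            context.append(load_file(case, payload))  # serve only permitted files
            files_accessed.append(payload)
        elif kind == "answer":
            return payload, files_accessed
    return None, files_accessed       # turn budget exhausted without an answer
```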

Expert validation is built into the benchmark through a custom companion application, where clinicians review the exact context, access images and lab files, and provide feedback on each QA item. Inter-rater reliability metrics indicate high agreement and strong question quality, supporting the benchmark’s clinical validity.

Agentic Framework and Tool-Augmented Reasoning

A key innovation in MTBBench is its modular agentic framework, permitting integration with foundation model-based tools and structured databases. The benchmark framework exposes foundation models (FMs) for high-resolution digital pathology and IHC analysis as callable tools: agents can invoke CONCH for pathology images, UNI2+ABMIL for IHC quantification, as well as PubMed and DrugBank modules for literature and pharmacological knowledge. This design supports flexible, iterative, and context-aware tool use—an explicit reflection of MTB team dynamics, where specialists sequentially consult different data streams and domain experts.
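
A minimal sketch of such a tool registry is shown below; the wrapper functions are hypothetical stand-ins for the actual CONCH, UNI2+ABMIL, PubMed, and DrugBank integrations, included only to illustrate the dispatch pattern.

```python
# Placeholder wrappers; a real deployment would invoke the CONCH pathology FM,
# the UNI2+ABMIL IHC pipeline, and PubMed/DrugBank APIs here.
def run_conch(image):        return "pathology description"
def run_uni2_abmil(image):   return {"ihc_score": 0.0}
def search_pubmed(query):    return ["<abstract snippets>"]
def query_drugbank(drug):    return {"interactions": []}

TOOLS = {
    "pathology_fm": run_conch,       # H&E slide analysis
    "ihc_quant":    run_uni2_abmil,  # IHC marker quantification
    "literature":   search_pubmed,   # literature retrieval
    "pharmacology": query_drugbank,  # drug knowledge lookup
}

def call_tool(name, payload):
    """Dispatch an agent's tool call; unknown tools return an error string
    so the agent can recover mid-dialogue instead of crashing."""
    tool = TOOLS.get(name)
    return tool(payload) if tool else f"error: unknown tool '{name}'"
```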

Empirical analysis confirms that model accuracy is substantially correlated with active, targeted information access; models that request more data files (i.e., perform cross-modality integration) attain higher performance (Figure 2).

Figure 2: Model accuracy increases strongly with the number of files/modalities accessed per question across both multimodal and longitudinal benchmarks.
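
One way to quantify this relationship, assuming a simple log format of (files accessed, correctness) pairs per question, is a rank correlation together with per-count accuracies, as sketched below.

```python
import numpy as np
from scipy.stats import spearmanr

def access_accuracy_correlation(records):
    """records: iterable of (n_files_accessed, correct) pairs, one per question."""
    records = list(records)
    n_files = np.array([n for n, _ in records])
    correct = np.array([float(c) for _, c in records])
    rho, p = spearmanr(n_files, correct)   # monotonic association strength
    # Accuracy at each access count, mirroring the trend plotted in Figure 2.
    by_count = {int(k): correct[n_files == k].mean() for k in np.unique(n_files)}
    return rho, p, by_count
```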

Experimental Findings

A systematic evaluation of both open and closed-source LLMs/VLMs—including GPT-4o, Gemma, Qwen, InternVL, and others—across both tracks reveals significant, previously unquantified performance gaps:

  • Baseline accuracy for best-performing models peaks near 70% on multimodal tasks (InternVL-78B: 69.1%), with outcome and recurrence prediction tasks remaining at or near chance for most models.
  • Hematological reasoning is tractable for current LLMs, likely due to structured input, while digital pathology and longitudinal tasks expose consistent weaknesses.
  • Tool augmentation yields non-trivial performance gains: up to +9.0% (digital pathology) and +11.2% (longitudinal reasoning) at the task level; a gain-computation sketch follows this list.
  • Smaller and less capable models benefit disproportionately from tool integration, particularly in visually complex or reasoning-intensive tasks (Figure 3).

    Figure 3: Access to foundation model tools systematically improves agent accuracy across both multimodal and longitudinal tasks; the effect is pronounced in smaller models and visually demanding settings.
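
The task-level gains can be read as percentage-point differences between tool-augmented and baseline accuracy per task. A minimal sketch of that computation, under an assumed results layout, follows; for example, a digital-pathology task moving from 52% to 61% accuracy yields a +9.0-point gain.

```python
def task_level_gains(results_base, results_tools):
    """Each argument maps task name -> list of per-question correctness (bools)."""
    gains = {}
    for task, base in results_base.items():
        tools = results_tools[task]
        acc_base = sum(base) / len(base)          # baseline accuracy
        acc_tool = sum(tools) / len(tools)        # tool-augmented accuracy
        gains[task] = 100.0 * (acc_tool - acc_base)  # percentage-point gain
    return gains
```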

Qualitative analysis of agent traces demonstrates that correct answers are typically contingent on comprehensive file access and purposeful tool use, with robust grounding in explicit evidence. Models that shortcut information gathering, or fail to leverage cross-modal representations, exhibit high hallucination rates and reduced reliability—particularly in scenarios involving ambiguous, evolving, or conflicting clinical evidence.

Implications, Limitations, and Future Directions

The introduction of MTBBench represents a decisive shift from pattern-matching, static QA formulations toward agentic, decision-centric evaluation tailored to real-world MTB processes. This has several implications:

  • Benchmarking Clinical Agent Robustness: MTBBench enables comparative and quantitative study of reliability, information-seeking behavior, and reasoning capacity in next-generation MLLMs deployed in oncology. The explicit file-access record introduces a reproducible interpretability signal, supporting audit and analysis of reasoning failures (a log-audit sketch follows this list).
  • Tool Use as an Enabler of Clinical Performance: The integration of FMs for digital pathology and longitudinal evidence retrieval demonstrates empirically that tool augmentation, not just model scale, is a fundamental factor for robust clinical AI performance.
  • Resource for Generalist Agent Development: By making agent logs, data, and expert-annotated QA sets openly available, MTBBench lowers the barrier for the validation and development of generalist clinical agents capable of evidence synthesis, cross-modal retrieval, and longitudinal hypothesis management (Figure 4).

    Figure 4: MTB process schematic illustrating the iterative, multi-expert, data-integrative decision flow that motivates the agentic framework of MTBBench.
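
Building on the file-access record mentioned in the first bullet, a simple audit pass over episode logs can surface the shortcut behavior linked to hallucination above. The sketch below assumes a hypothetical log schema with "files_accessed" and "correct" fields, not the benchmark's actual log format.

```python
def audit_episodes(episodes):
    """episodes: list of dicts with "files_accessed" (list) and "correct" (bool)."""
    flagged = [e for e in episodes if not e["files_accessed"]]  # answered blind
    shortcut_rate = len(flagged) / max(len(episodes), 1)
    shortcut_acc = (
        sum(e["correct"] for e in flagged) / len(flagged) if flagged else float("nan")
    )
    return {"shortcut_rate": shortcut_rate, "shortcut_accuracy": shortcut_acc}
```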

However, some limitations persist. MTBBench remains an offline and controlled testbed—it does not yet address open-ended, ambiguous cases, adaptive clarification, or the full real-time interplay characterizing live MTB deliberations. Most current FMs are not trained for multi-stage or temporally causal clinical inference, and the benchmark presently centers on head and neck oncology cohorts; extension to other organ sites, radiology, or omics modalities would further generalize its scope.

Conclusion

MTBBench (2511.20490) establishes a new standard for realistic, agentic evaluation of clinical AI in precision oncology. Its contribution is two-fold: first, a thoroughly validated, multimodal, and longitudinal benchmark mirroring MTB complexity; second, an extensible agentic framework demonstrating the value and necessity of tool-augmented, foundation model-enabled reasoning. As the field advances toward interactive, autonomously reasoning clinical agents, MTBBench provides a rigorous substrate for both comparative evaluation and the safe, scalable development of future AI collaborators in oncology.
