MTBBench: Molecular Tumor Board Benchmark
- MTBBench is a clinically grounded evaluation suite that simulates multidisciplinary tumor board decision-making using multimodal clinical data including digital pathology, lab results, and genomics.
- Its agentic, multi-turn workflow requires models to actively retrieve data and perform temporal reasoning over evolving patient histories, mimicking real clinical interactions.
- Tool augmentation in MTBBench significantly boosts diagnostic, prognostic, and therapeutic decision accuracy, as validated by expert clinician reviews and comparative performance metrics.
MTBBench (Molecular Tumor Board Benchmark) is a clinically grounded, agentic evaluation suite designed to assess how well large language models and vision-language models can simulate the multidisciplinary decision-making workflow of a molecular tumor board (MTB) in oncology. Unlike earlier biomedical benchmarks that focus on static, unimodal question-answering, MTBBench embeds models within a multi-turn "doctor-agent" protocol that requires active data retrieval and complex reasoning over temporally unfolding patient histories and heterogeneous, multimodal inputs, including digital pathology, lab results, genomics, and longitudinal clinical timelines (Vasilev et al., 25 Nov 2025).
1. Benchmark Architecture and Modalities
MTBBench is organized to emulate MTB workflows where diagnostic, prognostic, and therapeutic decisions arise from longitudinal, multimodal information integration. The data sources cover:
- Digital Histopathology: Hematoxylin & eosin (H&E) and immunohistochemistry (IHC) whole-slide images annotated for regions of interest.
- Laboratory Data: Preoperative hematology (CRP, MPV, leukocyte differentials, creatinine), structured in CSV with reference ranges.
- Genomics & Pathology Reports: Somatic mutation calls, copy-number alterations, and pathology narratives.
- Clinical Timelines: Structured event sequences logging diagnoses, procedures, treatments, and outcomes.
The case corpus consists of two tracks:
- MTBBench-Multimodal: 26 head-and-neck cancer patients (from the Hancock dataset), each with ≈40 modality-specific files and ≈15 QA pairs per patient (390 total).
- MTBBench-Longitudinal: 40 patients from the MSK clinicogenomic cohort, each with ≈5 structured files and ≈4.6 QA pairs per patient (183 total). An illustrative case layout is sketched below.
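A minimal Python sketch of how one case bundle might be organized, inferred from the corpus description above; all field names are assumptions, not the released MTBBench schema:

```python
from dataclasses import dataclass, field

# Hypothetical case layout inferred from the corpus description;
# field names are illustrative, not the released schema.

@dataclass
class TimelineEvent:
    date: str     # ISO date of the event
    kind: str     # e.g., "diagnosis", "procedure", "treatment", "outcome"
    detail: str   # free-text or coded description

@dataclass
class PatientCase:
    patient_id: str
    # filename -> modality tag ("H&E", "IHC", "labs", "genomics", "report", ...)
    files: dict[str, str] = field(default_factory=dict)
    timeline: list[TimelineEvent] = field(default_factory=list)
    # (question, ground-truth answer) pairs, ~15 per multimodal patient
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)
```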
2. Agentic Workflow and Interaction Protocol
In the agentic setting, the benchmark frames model evaluation as interactive, multi-turn dialogues. At each round $t$, the agent receives a question $q_t$ and a list of available files $F_t$; it may request a subset $S_t \subseteq F_t$, retrieve the corresponding files (images, tables, reports), update its internal state, and respond. Files do not persist across turns, simulating the episodic nature of EMR access; they may be re-requested if needed for later inference. For the longitudinal track, an evolving timeline accumulates all previous events, forming the substrate for temporal reasoning.
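A minimal sketch of this protocol in Python; the names (`Agent.select_files`, `Agent.answer`) are hypothetical, since the source does not specify the harness API:

```python
# Hypothetical sketch of MTBBench's multi-turn doctor-agent protocol;
# Agent.select_files and Agent.answer are illustrative names, not the benchmark API.

def run_episode(agent, qa_rounds, file_catalog, timeline=None):
    """Run one patient case as a sequence of QA rounds."""
    for question in qa_rounds:
        # Each round starts fresh: files retrieved in earlier turns are NOT
        # carried over, so the agent must re-request anything it still needs.
        available = list(file_catalog)
        requested = agent.select_files(question, available, timeline)

        # Retrieve only the requested subset (images, tables, reports).
        context = {name: file_catalog[name] for name in requested if name in file_catalog}

        # The agent updates its internal state and commits to an answer.
        yield question, agent.answer(question, context, timeline)
```

On the longitudinal track, `timeline` would accumulate each new clinical event between rounds, which is what the temporal-reasoning tasks probe.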
Canonical clinical task categories include:
- Diagnostic Classification (Digital Pathology)
- Spatial Biomarker Interpretation (IHC)
- Preoperative Hematology Reasoning
- Prognostic Deliberation (Outcome & Recurrence)
- Longitudinal Outcome & Progression Forecasting
Illustrative interactions require the agent to identify tumor subtypes based on H&E, interpret IHC spatial patterns, estimate bleeding risk from lab values, predict survival, and infer progression risk after therapy and new molecular findings.
3. Ground Truth Annotation and Validation
All QA pairs within MTBBench were co-developed and validated with domain clinicians using a Streamlit-based companion application. Clinicians could:
- Explore patient demographics, grouped images, lab data, and genomics.
- Review each multiple-choice or true/false question with options to approve, reword, or flag as inappropriate.
For inter-rater reliability, ten external oncology experts reviewed overlapping question subsets (45 QA pairs), establishing high agreement: mean pairwise accuracy 0.94, Cohen's κ 0.81, Fleiss' κ 0.79, PABAK 0.91, Gwet's AC1 0.91, Krippendorff's α 0.79.
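The simpler of these agreement statistics can be reproduced with scikit-learn; a minimal sketch with made-up reviewer labels (illustrative only, not MTBBench's actual review data):

```python
from sklearn.metrics import cohen_kappa_score

# Toy approve/reject labels for two reviewers over the same ten QA pairs.
rater_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
rater_b = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]

pairwise_acc = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohen_kappa_score(rater_a, rater_b)  # chance-corrected agreement

print(f"pairwise accuracy={pairwise_acc:.2f}, Cohen's kappa={kappa:.2f}")
```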
4. Evaluation Metrics, Model Baselines, and Tool-Augmentation
Performance is measured using accuracy,

$$\mathrm{Acc} = \frac{N_{\text{correct}}}{N_{\text{total}}},$$

and, when relevant, binary F1-score,

$$F_1 = \frac{2PR}{P + R},$$

where $P$ and $R$ denote precision and recall, along with the relative percentage improvement of tool-augmented over baseline conditions,

$$\Delta\% = \frac{\mathrm{Acc}_{\text{tool}} - \mathrm{Acc}_{\text{base}}}{\mathrm{Acc}_{\text{base}}} \times 100.$$

File-access count per question is also reported, serving as a proxy for information-seeking depth, with higher counts correlating with more comprehensive multimodal reasoning.
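The metrics are simple enough to state in code; a minimal sketch matching the formulas above, checked against the Gemma-12B multimodal numbers quoted below:

```python
def accuracy(n_correct: int, n_total: int) -> float:
    return n_correct / n_total

def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

def relative_improvement(acc_tool: float, acc_base: float) -> float:
    """Percentage gain of the tool-augmented condition over baseline."""
    return (acc_tool - acc_base) / acc_base * 100

# Gemma-12B, multimodal track: 61.5% baseline -> 70.5% with tools
print(f"{relative_improvement(70.5, 61.5):.1f}%")  # 14.6%
```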
Ten open- and closed-source large models (e.g., Gemma-12B, GPT-4o, InternVL-78B, Qwen2.5-VL-7B, Llama-70B) were benchmarked. Three findings stand out:
- Tool augmentation consistently increased accuracy: e.g., Gemma-12B improved from 61.5% (baseline) to 70.5% (a +14.6% relative gain), and GPT-4o from 66.7% to 72.9% (+9.3%) on the multimodal track.
- Longitudinal reasoning also benefited: Qwen2-VL-32B rose from 67.0% to 74.2% (+10.7%) on longitudinal tasks.
- Maximum task-level improvements reached 9.0% (digital pathology) and 11.2% (progression forecasting).
Comparative Performance (selected models)
| Model | Multimodal Baseline | Multimodal +Tools | Longitudinal Baseline | Longitudinal +Tools |
|---|---|---|---|---|
| Gemma-12B | 61.5% | 70.5% | 58.0% | 63.5% |
| GPT-4o | 66.7% | 72.9% | 64.2% | 67.8% |
| Qwen2-VL-32B | – | – | 67.0% | 74.2% |
5. Foundation Model-Based Tool Stack
MTBBench integrates domain-specific foundation model-based tools as agent-callable functions for in-context expert reasoning:
- Digital Pathology (CONCH): Dual-encoder model trained on 1.17 M H&E image–caption pairs for image-to-label matching via dot-product in a joint embedding space.
- Immunohistochemistry Quantification (UNI2 + ABMIL): 1,536-dimensional patch embeddings generated by UNI2 are aggregated by an attention-based multiple-instance learning (ABMIL) head with a 5-layer fully connected regressor, trained for 70 epochs using Adam on NVIDIA A100 hardware (see the sketch after this list).
- Knowledge Bases: a PubMed tool (retrieves top abstracts using the BAAI-bge reranker) and a DrugBank tool (clinical drug metadata mapping); a reranking sketch closes this section.
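A minimal PyTorch sketch of the ABMIL head described above; only the 1,536-dimensional input matches the source, while the attention and hidden-layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class ABMILRegressor(nn.Module):
    """Attention-based MIL over UNI2 patch embeddings (sketch; hidden sizes
    are illustrative, only the 1536-dim input is given in the source)."""

    def __init__(self, dim: int = 1536, attn_dim: int = 256):
        super().__init__()
        # One attention score per patch (Ilse et al.-style ABMIL pooling).
        self.attn = nn.Sequential(
            nn.Linear(dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1)
        )
        # 5-layer fully connected regressor on the pooled slide embedding.
        self.regressor = nn.Sequential(
            nn.Linear(dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (num_patches, 1536) embeddings for one whole-slide image
        weights = torch.softmax(self.attn(patches), dim=0)  # (num_patches, 1)
        slide = (weights * patches).sum(dim=0)              # (1536,)
        return self.regressor(slide)                        # scalar IHC quantity
```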
Seamless agent-tool integration enables on-demand access to digital pathology quantification, biomarker detection, and biomedical evidence retrieval, mirroring a multidisciplinary team's (MDT) use of specialized resources during MTB meetings.
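For the literature tool, the reranking step might look like the following sketch, assuming the bge reranker is loaded as a sentence-transformers cross-encoder; the exact MTBBench retrieval pipeline is not detailed in the source:

```python
from sentence_transformers import CrossEncoder

# Assumed checkpoint; the source names only a "BAAI-bge reranker".
reranker = CrossEncoder("BAAI/bge-reranker-base")

query = "example oncology query"
abstracts = ["abstract one ...", "abstract two ...", "abstract three ..."]

# Score each (query, abstract) pair and keep the most relevant abstracts.
scores = reranker.predict([(query, a) for a in abstracts])
ranked = sorted(zip(scores, abstracts), key=lambda x: x[0], reverse=True)
print(ranked[0][1])  # top-ranked abstract
```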
6. Core Findings and Model Insights
Analysis revealed several critical phenomena:
- Hallucinations and Information-Seeking Failures: Baseline agents frequently fabricated file names, failed to retrieve novel critical files, or reused stale context—most commonly in outcome/recurrence prediction where performance hovered near random (≈50% accuracy).
- Temporal Reasoning Deficiencies: Models detected coarse survival signals but lacked fine-grained discrimination for progression-vs-recurrence without accurate re-access of evolving event timelines.
- Multimodal Integration: The number of files accessed, rather than model size, correlated most strongly with performance, underscoring that successful models must combine retrieval and integration across diverse modalities to achieve clinical accuracy.
Tool-augmented models demonstrated greater robustness to these challenges, though limitations persisted regarding deep longitudinal reasoning.
7. Limitations and Future Directions
Despite its rigor, MTBBench has several current limitations:
- Offline Simulation: Evaluation remains decoupled from live clinical settings; real-time patient-physician-agent dialogues are not yet simulated.
- Dataset Scale: The number of cases (66 patients) restricts statistical power for rare cancer subtypes and event types.
- Tool Coverage: No specialized agentic foundation model exists for temporal reasoning; longitudinal improvements stemmed primarily from general-purpose literature/drug lookups.
- Clarification and Ambiguity: Tasks assume complete, non-ambiguous data; interactive clarification protocols are planned for future benchmarks.
Planned expansions include broader cancer/organ coverage, support for additional imaging (e.g., radiology), real-time clinical trial database integration, and piloting MTBBench in simulated clinician–AI collaboration studies to assess usability and trust.
MTBBench establishes a rigorous, reproducible framework for measuring, analyzing, and improving the clinical reasoning abilities of AI agents in precision oncology, moving beyond traditional static QA to emulate and advance dynamic, tool-enabled workflows characteristic of real molecular tumor boards (Vasilev et al., 25 Nov 2025).