DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

Published 10 May 2026 in cs.CV and cs.AI | (2605.09679v1)

Abstract: Medical vision-language models (VLMs) and AI agents have made significant progress in learning to analyze and reason about clinical images. However, existing medical visual question answering (VQA) benchmarks collapse model capabilities into a single accuracy score, obscuring where and why models fail. We propose DeepTumorVQA, a hierarchical benchmark that follows the multi-stage evidence chain in tumor diagnosis and decomposes 3D CT reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning. Higher-level questions remain independently scorable, while their ground-truth evidence chains are defined over lower-level primitives. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. In addition to a direct reasoning mode for VLMs, DeepTumorVQA provides tool-interaction environments for agent evaluation, where a model can call external tools, including segmentation models, measurement programs, and medical knowledge modules, before answering the question. Evaluating over 30 model configurations, we find that reliable quantitative measurement is the primary bottleneck, making later-stage visual and medical reasoning harder for VLMs, while tool augmentation substantially mitigates this issue. When tools are available, leveraging medical knowledge and tools to reason about medical images becomes a new challenge. We further show that ground-truth step-by-step tool-use traces from DeepTumorVQA can supervise agents and reduce tool-use and reasoning failures. This stage-wise progression from recognition to measurement to visual and medical reasoning provides a concrete roadmap for future medical VLM and AI agent studies. All data and code are released at https://github.com/Schuture/DeepTumorVQA.


Authors (10)

Summary

The paper presents a novel hierarchical benchmark that decomposes 3D CT diagnostic tasks into recognition, measurement, visual reasoning, and medical reasoning stages.
It demonstrates that tool-augmented agents and task-specific alignment significantly improve measurement accuracy and downstream diagnostic reasoning.
The study highlights that even minor measurement errors cause severe accuracy degradation, emphasizing the need for robust quantitative tools.

DeepTumorVQA: A Hierarchical 3D CT Benchmark for Diagnostic Evaluation of Medical VLMs and AI Agents

Introduction and Motivation

DeepTumorVQA introduces a rigorously constructed hierarchical benchmark for evaluating medical vision-language models (VLMs) and tool-augmented AI agents on 3D computed tomography (CT) diagnostic tasks. The benchmark addresses a limitation of prevailing medical VQA datasets, which typically collapse diagnostic reasoning into a single aggregate score and thereby obscure specific failure modes (e.g., inability to recognize lesions versus failure to integrate clinical criteria). DeepTumorVQA captures the hierarchical, multi-stage nature of radiological diagnosis and offers a taxonomically structured framework for both direct image question answering and agent-based, tool-augmented inference.

Benchmark Design, Task Taxonomy, and Data Composition

DeepTumorVQA is underpinned by a four-level evidence hierarchy, decomposing each diagnostic question into a progression of atomic to complex clinical tasks:

Recognition: Binary detection of anatomic or pathologic entities (e.g., tumor presence in liver).
Measurement: Quantitative assessment (e.g., organ/lesion volume, mean HU, tumor burden).
Visual Reasoning: Compositional tasks requiring spatial relationships, comparison, and aggregation (e.g., which kidney is larger, multi-organ lesion burden).
Medical Reasoning: Clinical integration tasks demanding application of literature-grounded guidelines and multi-step evidence synthesis (e.g., steatosis grading, cancer staging).
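The four-level hierarchy and its stage-wise scoring can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual schema: the `Stage` enum, the `Question` fields (including `evidence`), and the `stagewise_accuracy` helper are all hypothetical names introduced here, and the sample questions and predictions are made up.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """The four stages named in the text, ordered from atomic to complex."""
    RECOGNITION = 1
    MEASUREMENT = 2
    VISUAL_REASONING = 3
    MEDICAL_REASONING = 4


@dataclass
class Question:
    qid: str
    stage: Stage
    text: str
    answer: str
    # Hypothetical field: IDs of lower-level questions in the evidence chain.
    evidence: list = field(default_factory=list)


def stagewise_accuracy(questions, predictions):
    """Report one accuracy per stage instead of a single aggregate score."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q.stage] += 1
        correct[q.stage] += int(predictions.get(q.qid) == q.answer)
    return {stage: correct[stage] / total[stage] for stage in total}


# Made-up questions; options "A".."D" stand for multiple-choice letters.
questions = [
    Question("q1", Stage.RECOGNITION, "Is there a tumor in the liver?", "yes"),
    Question("q2", Stage.MEASUREMENT, "Left kidney volume?", "B"),
    Question("q3", Stage.MEASUREMENT, "Right kidney volume?", "C"),
    Question("q4", Stage.VISUAL_REASONING, "Which kidney is larger?", "left",
             evidence=["q2", "q3"]),
]
# Made-up model outputs: one measurement is wrong, and so is the comparison.
predictions = {"q1": "yes", "q2": "A", "q3": "C", "q4": "right"}

scores = stagewise_accuracy(questions, predictions)
for stage, acc in sorted(scores.items()):
    print(f"{stage.name}: {acc:.2f}")
```

Here the aggregate score would be 2/4, while the per-stage view shows that recognition is intact and the error first appears at the measurement stage, which is the kind of attribution the hierarchy is meant to support.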

The dataset comprises 476K QA pairs across 42 subtypes, derived from 9,262 3D CT volumes collected from 17 diverse, publicly available datasets, with all annotations curated by a panel of 23 radiologists. The compositional logic for QA generation leverages deterministic programs grounded in validated clinical knowledge, ensuring reproducibility and supporting targeted capability attribution. Fine-grained distractor construction minimizes heuristic answer selection and enforces quantitative discrimination.
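To make the idea of fine-grained distractors concrete, the following sketch builds a four-option measurement question whose wrong answers sit close to the true value. The function name `make_options` and the relative offsets (10% and 20%) are assumptions chosen for illustration; the source does not state how the benchmark spaces its distractors.

```python
import random


def make_options(true_value, rel_steps=(-0.2, -0.1, 0.1), seed=0):
    """Return four shuffled options and the index of the correct one.

    rel_steps are hypothetical relative offsets for the three distractors;
    the closer they are to zero, the more precise a model must be.
    """
    rng = random.Random(seed)
    correct = round(true_value, 1)
    options = [round(true_value * (1 + step), 1) for step in rel_steps]
    options.append(correct)
    rng.shuffle(options)
    return options, options.index(correct)


# Made-up example: a liver volume of 1500 mL.
options, answer_idx = make_options(1500.0)
print(options, "correct index:", answer_idx)
```

With distractors this close, a model cannot pick the answer from a rough prior about typical organ sizes; it has to quantify the structure to within a few percent.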

The framework supports evaluation under three input modalities: (i) full 3D NIfTI volumes, (ii) organ-focused 2D slices, and (iii) multi-slice videos. It includes a ReAct-style tool-augmentation environment, where agents may invoke four functional primitives (segmentation, measurement, medical knowledge lookup, region cropping).

Three agentic evaluation regimes are defined:

Oracle: Tools return ground-truth values from consensus annotations.
Predicted: Tools are backed by state-of-the-art segmentation models (e.g., TotalSegmentator).
Vision: Agents rely on visual crops without access to explicit numeric measurements.

This interventionist design allows controlled experiments to isolate perceptual, quantification, and reasoning failures.
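The effect of switching regimes can be illustrated with a toy environment. This is a sketch under strong assumptions: `ToolEnv`, `larger_kidney`, and `run` are hypothetical names, only a measurement tool is modelled, a scripted policy stands in for the LLM agent, and the volumes are invented. It is not the benchmark's actual tool interface.

```python
from dataclasses import dataclass, field


@dataclass
class ToolEnv:
    """Toy tool environment whose measurement source depends on the regime."""
    regime: str          # "oracle", "predicted", or "vision"
    ground_truth: dict   # values from annotations
    predicted: dict      # values a segmentation model would yield (hard-coded)
    trace: list = field(default_factory=list)

    def measure(self, key):
        self.trace.append(("measure", key))  # log every tool call
        if self.regime == "oracle":
            return self.ground_truth[key]
        if self.regime == "predicted":
            return self.predicted[key]
        raise PermissionError("the vision regime exposes no numeric measurements")


def larger_kidney(env):
    """Scripted stand-in for an agent answering 'which kidney is larger?'."""
    try:
        left = env.measure("left_kidney_volume_ml")
        right = env.measure("right_kidney_volume_ml")
    except PermissionError:
        return "cannot measure"
    return "left" if left > right else "right"


# Invented values: the predicted volumes are slightly off and flip the answer.
GT = {"left_kidney_volume_ml": 160.0, "right_kidney_volume_ml": 150.0}
PRED = {"left_kidney_volume_ml": 148.0, "right_kidney_volume_ml": 152.0}


def run(regime):
    env = ToolEnv(regime, GT, PRED)
    return larger_kidney(env), env.trace


for regime in ("oracle", "predicted", "vision"):
    answer, trace = run(regime)
    print(regime, "->", answer, f"({len(trace)} tool calls)")
```

Because the policy is identical across regimes, any change in the answer is attributable to the tool outputs alone: a wrong answer under "predicted" but not under "oracle" points to quantification error rather than faulty reasoning.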

Experimental Findings and Model Analysis

The authors benchmarked 30+ models spanning zero/few-shot 2D VLMs, 3D foundation models, LoRA-finetuned 2D architectures, and commercial APIs. The evaluation reveals several high-salience findings:

Measurement as the Bottleneck: All zero-shot 2D and 3D VLMs degrade to chance on four-option measurement tasks (24–32%), and these errors propagate into all downstream reasoning tasks. Finetuned models improve markedly (M3D-Phi3: 59.8%, Meissa 2D: 66.3%), confirming the necessity of task-specific alignment. Human radiologists, even with full access to volumetric data, score below the finetuned models (junior: 45%, senior: 54.5%), reflecting human limitations in volumetric quantification without tool assistance.
Tool-Based Agent Augmentation: Access to oracle tools enables substantial accuracy gains in measurement (30%→65%) and medical reasoning (27–28%→51–54%). However, new failure modes emerge: errors in tool selection and sequencing, incorrect parameterization, and under-utilization of the knowledge tool (invoked in only 1.3% of cases, although it is required in 19%). Agent SFT (supervision on tool-use traces) resolves action-loop failures and improves medical reasoning, as corroborated by direct error analysis.
Precision Sensitivity: Accuracy is highly sensitive to measurement noise; even a 10% error in measurements causes a 25% drop in accuracy on quantitative subtypes. Reasoning over imperfect evidence is thus bottlenecked by segmentation and measurement fidelity.
Sufficiency and Generalization: For strong LLM backbones, measurements alone (i.e., without images) allow medical reasoning to match or exceed full-pipeline performance, demonstrating that accurate quantification is sufficient for the majority of benchmarked clinical tasks. Compositional generalization tests indicate that the benchmark is not prone to memorization; agent architectures do not degrade on out-of-distribution (organ, size) combinations.
2D Supervision vs. 3D Pretraining: LoRA-finetuned 2D models using only tiled multi-slice input outperform both 3D-pretrained and finetuned volumetric models. Task-aligned supervision is more decisive than input dimensionality.
Task-Type Specific Error Patterns: Recognition tasks are most robust to input and model variation; medical reasoning is universally difficult in the absence of quantitative evidence and explicit clinical knowledge integration.
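The precision-sensitivity finding above can be reproduced qualitatively with a small simulation. This is a toy model, not the authors' protocol: it assumes Gaussian multiplicative noise, four options spaced 10% apart, and an answerer that picks the option nearest to the noisy measurement. The function name `accuracy_under_noise` and all numeric settings are hypothetical, and the resulting numbers are not expected to match those reported in the paper.

```python
import random


def accuracy_under_noise(rel_noise, n_trials=2000, seed=0):
    """Fraction of four-option questions answered correctly when the
    measurement carries Gaussian relative noise with std rel_noise."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        truth = rng.uniform(100.0, 2000.0)  # made-up true value, e.g. a volume in mL
        # Assumed option spacing: distractors at -20%, -10%, and +10%.
        options = [truth * f for f in (0.8, 0.9, 1.0, 1.1)]
        measured = truth * (1 + rng.gauss(0.0, rel_noise))
        picked = min(options, key=lambda o: abs(o - measured))
        hits += picked == truth
    return hits / n_trials


for noise in (0.0, 0.05, 0.10):
    print(f"relative noise {noise:.2f}: accuracy {accuracy_under_noise(noise):.3f}")
```

Under these assumptions the correct option wins only when the relative error stays within about ±5%, so for noise with a 10% standard deviation the expected accuracy is roughly 38% (the probability that a standard normal variable falls within ±0.5). Tightly spaced distractors therefore turn modest measurement error into large accuracy losses.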

Implications, Limitations, and Future Directions

Practical Implications

The results empirically demonstrate how critical accurate, high-fidelity measurement tools are for AI medical diagnosis. The framework underlines that end-to-end VLM performance is dominated by quantification and tool orchestration, not simply visual pattern recognition. The benchmark also motivates fine-grained, evidence-level supervision in agent training, as SFT on tool-use traces reduces procedural and planning failures.

Clinically, the findings suggest that reliable, validated segmentation and measurement pipelines are prerequisite for deploying VLM or agent-based clinical decision support on 3D medical data. The transparent, stage-wise evaluation exposes shortcutting and superficially correct answers arising from training-set priors, which can be hazardous in a clinical deployment context.

Limitations

The evaluation remains constrained to abdominal CT and well-codified diagnostic subtypes, omitting numerous unsolved subtasks such as texture analysis, temporal progression, or rare pathologies.
The multiple-choice format, though carefully constructed, may not fully reflect the fidelity of open-ended diagnostic reasoning; however, extensive free-form response benchmarking indicates that the principal findings are robust to the answer format.
Annotation quality, while adjudicated at the expert level, is subject to the intrinsic limitations of visual detection for small or ambiguous lesions.

Theoretical and Community Impact

DeepTumorVQA establishes a new standard for diagnostic benchmark granularity and transparency, moving the field away from black-box aggregate scoring toward modular, interpretable evaluation. The evidence hierarchy and agentification protocol will inform the design and analysis of future medical VLMs and agent architectures.

Future Research Directions

Improved tool-based agent training techniques, including reinforcement learning from human feedback on tool-use chains, to optimize not only end answers but also intermediate evidence acquisition.
Expansion to multi-modal and multi-organ tasks, integration of temporal and clinical-contextual cues, and extension to underrepresented imaging modalities (MRI, PET, chest/brain/pelvis domains).
Research into robustifying measurement tools under domain shift, rare events, and imperfect segmentation, as well as dynamic agent replanning under uncertain tool outputs.

Conclusion

DeepTumorVQA delivers a rigorously validated, hierarchically decomposed, and agent-aware 3D medical VQA benchmark. It exposes the pivotal role of quantitative measurement, tool orchestration, and clinical knowledge integration for reliable AI diagnostic systems, and provides the necessary infrastructure for causal attribution of failure modes in complex medical reasoning pipelines. The framework is poised to become essential infrastructure for the next generation of medical vision-language agents and evidence-grounded model evaluation.
