DeepTumorVQA: CT Tumor Diagnosis Benchmark
- DeepTumorVQA is a tumor-centric visual question answering benchmark that leverages volumetric CT scans for staged evaluation of diagnostic reasoning.
- It employs a hierarchical approach across recognition, measurement, visual reasoning, and medical reasoning to pinpoint specific failure modes in clinical inference.
- The benchmark integrates structured tool interaction environments that facilitate precise, evidence-driven analysis of quantitative imaging data.
DeepTumorVQA is a tumor-centric medical visual question answering benchmark for volumetric computed tomography, introduced to test whether vision-language models can progress from lesion recognition to clinically grounded diagnostic reasoning. In its 2025 presentation, it was described as a diagnostic VQA benchmark for abdominal tumors in CT scans, comprising 9,262 CT volumes, 3.7M slices, and 395K expert-level question-answer pairs across four categories: Recognition, Measurement, Visual Reasoning, and Medical Reasoning (Chen et al., 25 May 2025). In its 2026 formulation, DeepTumorVQA was recast as a hierarchical 3D CT benchmark for stage-wise evaluation of medical VLMs and tool-augmented agents, with 476K questions across 42 clinical subtypes on the same 9,262 3D CT volumes, plus explicit tool-interaction environments and ground-truth tool-use traces (Chen et al., 10 May 2026). The benchmark’s defining premise is that tumor diagnosis is a multi-stage evidence chain, and that evaluation should localize failures at the level of perception, quantification, compositional reasoning, or clinical inference rather than obscure them within a single aggregate score.
1. Historical emergence and conceptual rationale
DeepTumorVQA was proposed against a background in which medical VQA benchmarks were characterized as mostly 2D, small, or overly simplistic, and therefore poorly matched to real radiological workflows for volumetric abdominal CT (Chen et al., 25 May 2025). The 2025 benchmark asked a direct question: whether current vision-language models are precise and intelligent enough for clinical diagnosis from 3D CT, especially in tumor-centric settings requiring small lesion recognition, numerical measurement, cross-slice integration, and application of medical rules.
The 2026 hierarchical benchmark sharpened that motivation. Its authors argued that existing evaluations collapse capability into a single accuracy score, even though a model may recognize lesions but fail to measure them, may measure correctly but fail in downstream reasoning, may answer from priors without genuine visual grounding, or may fail specifically in tool use rather than in clinical interpretation (Chen et al., 10 May 2026). DeepTumorVQA therefore frames diagnosis as a staged progression from lower-level perceptual primitives to higher-level clinical conclusions.
This design aligns with a broader methodological shift in medical multimodal research. Rather than treating VQA as generic image-text fusion, DeepTumorVQA treats oncologic QA as an evidence-tracing problem in which the validity of a high-level answer depends on access to lower-level measurable facts. This suggests that the benchmark is intended not merely as a leaderboard instrument, but as a failure-analysis framework for clinically oriented VLMs and agents.
2. Dataset composition and benchmark construction
Across its 2025 and 2026 descriptions, DeepTumorVQA is built from 9,262 3D CT volumes drawn from 17 public abdominal CT datasets spanning 88 medical centers (Chen et al., 25 May 2025). The 2025 paper reports 395K expert-level question-answer pairs, with a split of 355,962 training QA pairs from 8,334 CT and 39,650 testing QA pairs from 928 CT. The 2026 hierarchical version reports 476K questions total, including 428K train and 48K test, and evaluates models on a balanced 10,000-question subset from the test split, stratified across 42 subtypes and spanning 991 CT volumes (Chen et al., 10 May 2026).
The source corpora include abdominal datasets such as CHAOS, Pancreas-CT, BTCV, LiTS, CT-ORG, WORD, AMOS22, KiTS, the Medical Segmentation Decathlon CT tasks, AbdomenCT-1K, FLARE’23, and the RSNA 2023 Abdominal Trauma Detection dataset, among others (Chen et al., 25 May 2025). The 2025 description reports 7,629 lesions annotated by 23 board-certified radiologists, distributed across 3,067 liver, 4,078 kidney, 351 pancreatic, and 131 colon lesions. The 2026 benchmark reports 11,319 lesions, 43 anatomical structures per volume, and annotations by 23 radiologists (Chen et al., 10 May 2026).
Question generation is structured rather than free-form. The 2025 paper describes a modular, template-based process inspired by CLEVR-style compositional programs, in which structured metadata are extracted from segmentation masks and reports and then converted into benchmark questions (Chen et al., 25 May 2025). The 2026 benchmark states that subtype-specific deterministic programs generate QA pairs from 70+ structured attributes per scan, including organ volumes, lesion volumes, mean HU values, lesion counts, lesion diameters, cross-organ ratios, and staging fields (Chen et al., 10 May 2026). Higher-level answers are therefore programmatically grounded in measurable primitives rather than authored as unconstrained natural-language prompts.
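The idea of generating QA pairs from structured attributes with deterministic, subtype-specific programs can be sketched as follows. This is a minimal illustration, not the benchmark's actual generator: the attribute names, the question wording, and the tumor-burden formula (total lesion volume divided by organ volume) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ScanAttributes:
    """Hypothetical subset of the structured attributes extracted per scan."""
    liver_volume_cc: float
    liver_lesion_volumes_cc: list[float]


def liver_lesion_existence(attrs: ScanAttributes) -> dict:
    # Recognition-stage program: the answer is read directly from the metadata.
    answer = "yes" if attrs.liver_lesion_volumes_cc else "no"
    return {"stage": "recognition",
            "question": "Does the liver contain any lesions?",
            "answer": answer}


def liver_tumor_burden(attrs: ScanAttributes) -> dict:
    # Measurement-stage program. Assumed formula: lesion volume / organ volume.
    burden = 100.0 * sum(attrs.liver_lesion_volumes_cc) / attrs.liver_volume_cc
    return {"stage": "measurement",
            "question": "What percentage of the liver volume is occupied by lesions?",
            "answer": round(burden, 1)}


# One deterministic program per subtype (only two hypothetical ones shown here).
PROGRAMS = [liver_lesion_existence, liver_tumor_burden]


def generate_qa(attrs: ScanAttributes) -> list[dict]:
    """Apply every subtype program to one scan's attributes."""
    return [program(attrs) for program in PROGRAMS]


qa_pairs = generate_qa(ScanAttributes(liver_volume_cc=1500.0,
                                      liver_lesion_volumes_cc=[30.0, 15.0]))
```

Because each answer is computed from the same attribute record, a higher-level answer can always be traced back to the primitives that produced it.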
A concise summary of the 2026 stage structure is given below.
| Stage | Subtypes | Representative tasks |
|---|---|---|
| Recognition | 9 | liver lesion existence, kidney cyst existence, PDAC existence, splenomegaly detection |
| Measurement | 5 | organ volume, organ HU, lesion volume, organ HU ratio, tumor burden percentage |
| Visual reasoning | 16 | lesion counting, largest lesion diameter, lesion location, kidney volume comparison |
| Medical reasoning | 12 | fatty liver, pancreatic steatosis, portal hypertension, PDAC vs PNET, pancreatic T staging |
The 2025 description reports 29 subtypes, whereas the 2026 description reports 42 clinical subtypes. The increase suggests a later broadening of subtype coverage and a more explicit formalization of the benchmark’s evidence hierarchy.
3. Evidence hierarchy and clinical task semantics
DeepTumorVQA is organized around four reasoning stages: Recognition, Measurement, Visual Reasoning, and Medical Reasoning (Chen et al., 25 May 2025, Chen et al., 10 May 2026). The benchmark’s central innovation is that higher-level questions remain independently scorable as ordinary VQA items, while their ground-truth evidence chains are defined over lower-level primitives.
Recognition tests lesion or abnormality existence. In the 2025 version, it includes questions such as whether the liver, kidney, pancreas, or colon contains lesions, tumors, or cysts (Chen et al., 25 May 2025). In the 2026 hierarchy, recognition includes organ, lesion, and pathological finding detection, with splenomegaly detection explicitly noted as sitting near the recognition/measurement boundary because it depends on a quantitative threshold (Chen et al., 10 May 2026).
Measurement asks the model to extract numerical evidence from CT or metadata. Reported measurement tasks include organ volume, organ mean Hounsfield Unit, lesion volume, organ HU ratio, and tumor burden percentage (Chen et al., 25 May 2025, Chen et al., 10 May 2026). The 2025 paper evaluates numerical free-text answers with mean relative accuracy, whereas the 2026 benchmark also supports multiple-choice evaluation and tool-based retrieval of exact quantities.
Visual reasoning composes perceptual and quantitative primitives. Representative tasks include lesion counting, largest lesion diameter, lesion location, lesion attenuation, left-vs-right kidney volume comparison, organ aggregation, lesion outlier detection, adjacency, organ enlargement, bilateral asymmetry, and multi-organ burden comparison (Chen et al., 25 May 2025, Chen et al., 10 May 2026). These tasks require integrating multiple measurements or spatial relations rather than reading out a single attribute.
Medical reasoning applies medical rules or external clinical knowledge to image-derived evidence. Examples include fatty liver diagnosis, hepatic steatosis grading, pancreatic steatosis, pancreatic cyst resectability, pancreatic lesion resectability, pancreatic tumor staging, PDAC vs PNET, portal hypertension, renal mass characterization, pseudocyst determination, and splenomegaly grading (Chen et al., 25 May 2025, Chen et al., 10 May 2026). The 2025 appendix gives explicit task logic for several subtypes: lesion outlier is defined as whether the largest lesion is more than 3× the second largest, pancreatic cyst resectability is a binary classification based on cyst volume, and pancreatic steatosis is decided by a rule on the pancreas/spleen HU ratio (Chen et al., 25 May 2025).
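The lesion-outlier rule is simple enough to write down directly. The sketch below assumes that lesion size is given as a single number per lesion (the text does not say whether volume or diameter is used) and that a scan with fewer than two lesions has no outlier; both choices are assumptions for illustration.

```python
def is_lesion_outlier(lesion_sizes: list[float], ratio: float = 3.0) -> bool:
    """Return True if the largest lesion is more than `ratio` times the second largest.

    `lesion_sizes` holds one size value per lesion (unit and measure are assumed).
    With fewer than two lesions the comparison is undefined; this sketch returns False.
    """
    if len(lesion_sizes) < 2:
        return False
    largest, second = sorted(lesion_sizes, reverse=True)[:2]
    return largest > ratio * second
```

For example, sizes of 10.0, 3.0, and 1.0 yield an outlier because 10.0 exceeds 3 × 3.0, whereas sizes of 9.0 and 3.0 do not, since the rule requires strictly more than three times the second largest.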
The benchmark’s evidence-chain formalism is most apparent in questions such as fatty liver, where the final answer is clinically categorical but the ground truth depends on lower-level measurements such as liver HU and spleen HU (Chen et al., 10 May 2026). This structure makes it possible to distinguish a clinically correct answer supported by correct intermediate evidence from a correct final answer produced for the wrong reason.
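One way to operationalize the distinction between a supported and an unsupported correct answer is sketched below. The outcome labels, the 10% relative tolerance, and the evidence keys are hypothetical; the benchmark's own scoring may differ.

```python
def classify_outcome(pred_answer: str, gold_answer: str,
                     pred_evidence: dict, gold_evidence: dict,
                     rel_tol: float = 0.10) -> str:
    """Label one answer by whether its final answer and its evidence are correct.

    Evidence counts as correct when every ground-truth quantity was reported
    within `rel_tol` relative error (the tolerance is an assumed value).
    """
    def close(pred: float, gold: float) -> bool:
        return abs(pred - gold) <= rel_tol * abs(gold)

    evidence_ok = all(
        name in pred_evidence and close(pred_evidence[name], gold)
        for name, gold in gold_evidence.items()
    )
    answer_ok = pred_answer == gold_answer

    if answer_ok and evidence_ok:
        return "grounded_correct"      # right answer, right intermediate evidence
    if answer_ok:
        return "correct_unsupported"   # right answer for the wrong reason
    if evidence_ok:
        return "reasoning_error"       # evidence fine, clinical rule misapplied
    return "evidence_error"            # failure already at the measurement level


# Illustrative HU values only; they are not taken from the benchmark.
gold = {"liver_hu": 38.0, "spleen_hu": 52.0}
label = classify_outcome("yes", "yes", {"liver_hu": 60.0, "spleen_hu": 51.0}, gold)
```

In this example the final answer matches but the liver HU is far from the reference value, so the item is labeled as correct but unsupported.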
4. Evaluation protocols, metrics, and tool-interaction environments
The 2025 benchmark uses task-specific metrics: accuracy for multiple-choice questions, exact match for free-text categorical answers, and mean relative accuracy (MRA) for numerical free-text outputs, with the note that free-text numerical subtypes are evaluated using MRA and that higher is better (Chen et al., 25 May 2025). The 2026 hierarchical benchmark evaluates direct inference on 10,000 fixed multiple-choice questions with greedy decoding and reports answer accuracy by subtype, task type, and overall performance, alongside latency and trajectory statistics for tool-using agents (Chen et al., 10 May 2026).
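The 2026 paper defines MRA for free-form numeric scoring, with special handling of an edge case (Chen et al., 10 May 2026). For agent trajectories, it reports tool-set Jaccard, $J(A, B) = |A \cap B| / |A \cup B|$ for the set $A$ of tools an agent invoked and the set $B$ of tools in the ground-truth trace, as well as parameter accuracy, average steps, and valid prediction rate.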
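Tool-set Jaccard can be computed with the standard set-overlap ratio, as in the sketch below. Treating two empty tool sets as a perfect match is an assumption of this example, and the call lists are hypothetical.

```python
def tool_set_jaccard(predicted_calls: list[str], reference_calls: list[str]) -> float:
    """Jaccard overlap between the sets of tool names used in two trajectories.

    Repeated calls to the same tool collapse into one set element, so this
    metric ignores call order and call count.
    """
    pred, ref = set(predicted_calls), set(reference_calls)
    if not pred and not ref:
        return 1.0  # assumed convention: no tools needed and none used
    return len(pred & ref) / len(pred | ref)


score = tool_set_jaccard(
    ["segment_organ", "measure", "measure"],
    ["measure", "lookup_medical_knowledge"],
)
```

Here the two trajectories share one tool out of three distinct tools overall, so the score is one third. Because the metric is set-based, it has to be read alongside step counts and parameter accuracy to detect redundant or mis-parameterized calls.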
A major extension in the 2026 benchmark is the introduction of tool-interaction environments in a ReAct-style loop of up to 8 steps (Chen et al., 10 May 2026). The available tools are:
- segment_organ(target): returns segmentation statistics for 43 targets.
- measure(target, type): returns values such as volume, mean HU, diameter, and count.
- lookup_medical_knowledge(query): returns criteria from a curated 27-entry knowledge base.
- crop_region(organ): returns organ-focused crop images in vision mode.
The tools may return oracle quantities from expert masks, predicted quantities from TotalSegmentator, or crop-based visual outputs (Chen et al., 10 May 2026). DeepTumorVQA also provides deterministic ground-truth tool-use traces for all 42 subtypes, implemented with regex-style parameter matching. This makes it possible to evaluate not only whether an agent answered correctly, but also whether it selected the right tools, invoked them in the right order, avoided redundant loops, and completed the intended evidence chain.
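Scoring an agent trajectory against a reference trace with regex-style parameter matching might look like the following sketch. The reference trace, the argument-string format, the regular expressions, and the definitions of the returned fields are all hypothetical; only the tool names and the 8-step cap come from the description above.

```python
import re

MAX_STEPS = 8  # step cap of the ReAct-style loop

# Hypothetical reference trace: (tool name, regex the argument string must match).
REFERENCE_TRACE = [
    ("measure", r"liver.*mean[_ ]?hu"),
    ("measure", r"spleen.*mean[_ ]?hu"),
    ("lookup_medical_knowledge", r"fatty liver"),
]


def score_trace(calls: list[tuple[str, str]],
                reference: list[tuple[str, str]] = REFERENCE_TRACE,
                max_steps: int = MAX_STEPS) -> dict:
    """Match reference steps, in order, against an agent's (tool, args) calls."""
    matched, position = 0, 0
    for tool, pattern in reference:
        # Scan forward only, so reference steps must appear in the right order.
        for i in range(position, len(calls)):
            name, args = calls[i]
            if name == tool and re.search(pattern, args, flags=re.IGNORECASE):
                matched += 1
                position = i + 1
                break
    return {
        "matched_fraction": matched / len(reference),
        "complete_in_order": matched == len(reference),
        "redundant_calls": len(calls) - len(set(calls)),  # exact repeats
        "hit_step_limit": len(calls) >= max_steps,
    }


result = score_trace([
    ("measure", "liver, mean_hu"),
    ("measure", "spleen, mean_hu"),
    ("measure", "spleen, mean_hu"),            # redundant repeat
    ("lookup_medical_knowledge", "fatty liver criteria"),
])
```

The example trajectory completes the intended evidence chain in order but repeats one measurement, which is reported separately from answer correctness.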
This tool-augmented setting materially changes the meaning of benchmark success. Direct reasoning evaluates raw multimodal inference; agent evaluation additionally measures planning, quantification access, and the translation of tool outputs into clinically valid decisions.
5. Empirical results and the measurement bottleneck
The 2025 DeepTumorVQA evaluation benchmarks RadFM, M3D, Merlin, and CT-CHAT, with two M3D language-backbone variants, and finds a consistent performance ordering: measurement tasks are easiest, recognition and reasoning are harder, and medical reasoning is hardest (Chen et al., 25 May 2025). RadFM is reported as the strongest model, with 0.812 recognition accuracy in free-text, 0.629 medical reasoning accuracy in free-text, and the highest total average free-text score of 0.555; in multiple-choice evaluation, RadFM also leads with a total average of 0.662 (Chen et al., 25 May 2025). The same paper concludes that current VLMs are still not meeting clinical needs, especially for small tumor detection, nuanced lesion localization, 3D integration, and clinically grounded reasoning.
Several finer-grained findings from the 2025 study are notable. Lesion size alone is reported as not being a reliable predictor of recognition, whereas HU contrast is: for RadFM, sensitivity rises with larger lesion-to-organ HU difference, while size trends are inconsistent for liver and pancreatic lesions (Chen et al., 25 May 2025). The paper also shows that anatomical preprocessing can dominate nominal architecture differences: wrapping models with nnU-Net-based organ localization yields major recognition gains, and nnM3D-LLaMA2 improves kidney tumor sensitivity from 0% to 80.9%, surpassing RadFM on that task (Chen et al., 25 May 2025).
The 2026 hierarchical benchmark reframes these findings as a stage-wise bottleneck analysis. Across zero-shot 2D models, Recognition is reported at about 50–58%, whereas Measurement drops to about 24–32%, near chance (Chen et al., 10 May 2026). Pretrained 3D specialists without task-specific fine-tuning are described as surprisingly weak, while after fine-tuning the best 2D LoRA model reaches 66.3% and the best 3D fine-tuned model reaches 59.8% (Chen et al., 10 May 2026). Two radiologists evaluated on a subset achieve 45.0% and 54.5%, with the paper noting that exact quantitative measurement from raw CT is difficult even for humans (Chen et al., 10 May 2026).
The dominant conclusion of the 2026 benchmark is explicit: reliable quantitative measurement is the primary bottleneck (Chen et al., 10 May 2026). This is not a marginal observation. It is presented as the mechanism through which failures propagate upward: once measurement collapses, both visual reasoning and medical reasoning degrade because their prerequisite evidence is unavailable or unreliable.
6. Tool augmentation, transfer, and broader significance
Tool augmentation in DeepTumorVQA substantially alters the performance profile of current systems. When oracle tools are introduced, Measurement accuracy rises from roughly 29–31% to 64–65%, and Medical reasoning rises from roughly 27–28% to 51–54% (Chen et al., 10 May 2026). With predicted tools, the paper reports that about 93% of the gain of oracle tools over direct inference is retained overall, indicating that imperfect but structured quantification can still unlock much of the latent reasoning capability of VLMs (Chen et al., 10 May 2026). By contrast, crop-only visual tools can make performance worse than direct inference because models hallucinate numeric estimates from organ crops rather than accessing quantitative evidence directly.
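The retention figure can be read as the share of the oracle improvement that survives when tool outputs are predicted rather than exact. The sketch below assumes that reading, and the accuracies in the example are invented for illustration rather than taken from the paper.

```python
def gain_retention(direct: float, predicted: float, oracle: float) -> float:
    """Fraction of the oracle-over-direct accuracy gain kept by predicted tools.

    All three arguments are accuracies on the same scale (e.g. percent).
    """
    gain = oracle - direct
    if gain == 0:
        raise ValueError("oracle and direct accuracy are equal; retention is undefined")
    return (predicted - direct) / gain


# Hypothetical accuracies: direct 40.0, predicted tools 58.6, oracle tools 60.0.
retention = gain_retention(direct=40.0, predicted=58.6, oracle=60.0)
```

With these made-up values, predicted tools recover 18.6 of the 20.0 points gained by oracle tools, a retention of about 0.93.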
The same paper shows that ground-truth tool traces can supervise agents. Meissa-4B, trained on about 20K tool traces, improves overall accuracy from 46.0% to 63.8%, raises Measurement accuracy to 78.3%, and reduces the step-limit hit rate from about 17–18% to 0.2–0.3% (Chen et al., 10 May 2026). This result indicates that the benchmark is not only evaluative but also instructional: its traces can function as supervision for tool orchestration and evidence-grounded reasoning.
DeepTumorVQA has also become a reference point within tumor-VQA research. The tumor-centric reasoning framework TumorChain reports explicit zero-shot generalization on DeepTumorVQA, achieving 73.30 in recognition, 53.31 in visual reasoning, 45.93 in medical reasoning, and 57.51 average, which the paper attributes to interleaved chain-of-thought, ROI-guided local evidence extraction, and auxiliary abnormality supervision (Li et al., 6 Mar 2026). The phrase “DeepTumorVQA-style benchmark” is used directly in UCSF-PDGM-VQA, a brain-tumor MRI benchmark built from real 3D multi-sequence studies, indicating that DeepTumorVQA has come to denote a broader design pattern: clinically grounded, volumetric, tumor-centered VQA with meaningful reasoning demands (Ghosh et al., 16 May 2026).
Related benchmarks also clarify the confounds that DeepTumorVQA attempts to expose. The contamination-controlled oncology VQA benchmark of 2026 shows that visual reliance is highly dataset-specific: liver questions genuinely require the image, whereas Lung CT can be essentially solvable without it, with blind accuracy matching or exceeding sighted accuracy in some settings (Liu et al., 1 Jun 2026). In brain tumor MRI, UCSF-PDGM-VQA reports modality collapse, including cases in which multimodal systems perform similarly or better with a blank image and strong option-order bias in the original generated dataset (Ghosh et al., 16 May 2026). These findings do not arise from DeepTumorVQA itself, but they sharpen the significance of its hierarchical evidence-chain design: aggregate accuracy alone is not a trustworthy proxy for genuine visual reasoning.
DeepTumorVQA’s broader importance therefore lies in its redefinition of medical VQA evaluation. It treats tumor diagnosis as a structured pipeline from recognition to quantification to compositional visual reasoning to medical rule application, then extends that pipeline into explicit tool use and trace supervision (Chen et al., 10 May 2026). A plausible implication is that the benchmark’s enduring contribution is methodological rather than merely numeric: it provides a concrete framework for separating failures of seeing, measuring, reasoning, and acting in 3D oncologic AI systems.