MedFlow-Bench: Study-Level Imaging Benchmark
- MedFlow-Bench is a benchmark that assesses medical imaging agents navigating complete, uncurated clinical studies through interactive evidence collection and auditable protocols.
- It leverages the MedOpenClaw runtime and 3D Slicer to simulate realistic radiology workflows across multi-sequence MRI and CT/PET exams.
- The benchmark addresses challenges in tool-use and spatial grounding by emphasizing reproducible action logs and detailed performance metrics.
MedFlow-Bench is a study-level benchmark for evaluating medical imaging agents on full, uncurated clinical studies rather than pre-selected 2D images. Implemented on top of the MedOpenClaw runtime, it converts a radiology exam into an auditable “episode” in which an agent must navigate the study, gather evidence, and produce a final answer within a bounded action interface that operates through standard medical viewers such as 3D Slicer (Shen et al., 25 Mar 2026). The benchmark was introduced alongside MedOpenClaw in “MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies”. It is motivated by a discrepancy between conventional medical VQA setups and actual clinical workflow: static-image benchmarks suppress the need to inspect complete 3D volumes across multiple sequences or modalities, adjust viewing parameters, perform measurements, and integrate evidence over a study-level decision process (Shen et al., 25 Mar 2026).
1. Definition and conceptual scope
MedFlow-Bench is designed to evaluate medical imaging agents operating over full, uncurated clinical studies rather than curated image snippets. In the benchmark formulation, a full exam becomes an auditable episode that the agent must navigate, gather evidence for, and solve. This framing places emphasis not only on final-answer correctness but also on the execution trace: series selection, slice navigation, window or fusion adjustments, evidence capture, and, where permitted, tool invocation are all logged and replayable (Shen et al., 25 Mar 2026).
The benchmark differs from conventional 2D medical imaging benchmarks in several stated ways. It provides full-study interactive access rather than static inputs; includes cross-modality cases, specifically multi-sequence MRI and CT/PET; requires active exploration and agentic execution; and exposes a transparent execution trace with evidence objects. The required workflow may include selecting series, navigating slices, adjusting windowing or fusion, invoking expert tools where allowed, and justifying final answers. This design is explicitly intended to reflect the fact that radiologists inspect entire 3D exams across multiple sequences or modalities rather than answer localized questions on pre-selected views (Shen et al., 25 Mar 2026).
A central architectural feature is its dependence on MedOpenClaw. MedFlow-Bench is implemented on top of that runtime, which externally drives 3D Slicer through a bounded, documented action interface via WebServer REST endpoints and bridge handlers. The viewer source code is not modified. All interactions are logged, allowing replay and audit. This suggests that the benchmark is as much an evaluation protocol for controlled agentic execution as it is a dataset definition (Shen et al., 25 Mar 2026).
2. Runtime environment, auditability, and bounded interaction
The benchmark’s execution model is inseparable from its auditable runtime. MedOpenClaw exposes only a bounded callable surface: only predefined operations are available, and raw Python execution inside 3D Slicer is disallowed. Action logging records every tool invocation together with its arguments, the resulting viewer-state snapshot, and generated artifacts such as bookmarked views, masks, measurements, and exported SEG. Artifacts and outputs are tied to specific actions and remain externally inspectable (Shen et al., 25 Mar 2026).
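The bounded surface and per-action logging described above can be illustrated with a minimal sketch. It is not MedOpenClaw's implementation: the action names, the `BoundedRuntime` class, and the handler signatures are all hypothetical, and the viewer is replaced by an in-memory dictionary. The sketch only shows the two properties the paper emphasizes, namely that anything outside a predefined set of operations is rejected and that every call is recorded with its arguments, the resulting state snapshot, and any artifacts.

```python
import copy
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical action names: the paper's exact identifiers are not given here.
ALLOWED_ACTIONS = frozenset(
    {"select_series", "scroll_slice", "set_window_level", "bookmark_view"}
)

Handler = Callable[..., List[str]]


@dataclass
class ActionRecord:
    step: int
    action: str
    arguments: Dict[str, Any]
    viewer_state: Dict[str, Any]  # snapshot taken after the action ran
    artifacts: List[str] = field(default_factory=list)


class BoundedRuntime:
    """Dispatches only predefined actions and logs every invocation."""

    def __init__(self, handlers: Dict[str, Handler]) -> None:
        unknown = set(handlers) - ALLOWED_ACTIONS
        if unknown:
            raise ValueError(f"handlers outside the bounded surface: {sorted(unknown)}")
        self._handlers = dict(handlers)
        self.state: Dict[str, Any] = {}  # stand-in for the viewer state
        self.log: List[ActionRecord] = []

    def call(self, action: str, **arguments: Any) -> ActionRecord:
        if action not in self._handlers:
            raise PermissionError(f"action not in bounded surface: {action}")
        artifacts = self._handlers[action](self.state, **arguments)
        record = ActionRecord(
            step=len(self.log),
            action=action,
            arguments=dict(arguments),
            viewer_state=copy.deepcopy(self.state),
            artifacts=list(artifacts),
        )
        self.log.append(record)
        return record

    def export_log(self) -> str:
        """Serialize the full trace so it can be inspected or replayed."""
        return json.dumps([asdict(r) for r in self.log], indent=2)


def select_series(state: Dict[str, Any], name: str) -> List[str]:
    state["series"] = name
    return []


def scroll_slice(state: Dict[str, Any], index: int) -> List[str]:
    state["slice"] = index
    return []


def bookmark_view(state: Dict[str, Any], label: str) -> List[str]:
    state.setdefault("bookmarks", []).append(label)
    return [f"bookmark:{label}"]


runtime = BoundedRuntime(
    {
        "select_series": select_series,
        "scroll_slice": scroll_slice,
        "bookmark_view": bookmark_view,
    }
)
runtime.call("select_series", name="FLAIR")
runtime.call("scroll_slice", index=42)
runtime.call("bookmark_view", label="lesion")
print(runtime.export_log())
```

Because each record stores a deep copy of the state, later actions cannot silently alter earlier log entries, which is the property a replayable trace depends on.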
The permission model is also specified in operational terms. Interactions are mediated via documented 3D Slicer WebServer REST endpoints and named bridge handlers, including handlers for DICOM import, quantitative measurement, and DICOM SEG export. According to the paper, this minimizes attack surface and promotes auditability. Deterministic seed handling, however, is not specified (Shen et al., 25 Mar 2026).
Typical primitive viewer actions include selecting series, scrolling slices, adjusting window or level, toggling modality fusion, and bookmarking views. Evidence operations include capturing or exporting images of views, masks or segmentations, and measurement logs. Expert tools may include a MONAI-based tool pack with segmentation utilities and local thresholding, invoked with spatial parameters and configuration settings through MedOpenClaw. The agent interface operates through HTTP requests to Slicer REST endpoints and bridge handlers with explicit arguments, and outputs are returned as visual artifacts and logs (Shen et al., 25 Mar 2026).
This auditability differentiates MedFlow-Bench from many medical imaging benchmarks that score only end predictions. Here, the benchmark formalizes a replayable execution trace. A plausible implication is that it supports post hoc failure analysis at the level of navigation policy, spatial grounding, and tool-parameter choice rather than only task accuracy.
3. Clinical modules and benchmark composition
The initial release contains two clinical modules. The first is multi-sequence brain MRI for preoperative brain tumor diagnosis, built from UCSF-PDGM, the University of California San Francisco Preoperative Diffuse Glioma MRI dataset. Representative sequences in agent traces include T1 post-contrast, FLAIR, T2, and T1. The task is case-level diagnosis over a fixed label set (Shen et al., 25 Mar 2026).
The second module is lung CT/PET using an NSCLC radiogenomics cohort with paired CT/PET and pathology annotations. For each case, five structured predictions are defined: tumor location, pathological T stage, pathological N stage, histology, and histopathological grade (Shen et al., 25 Mar 2026).
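As an illustration of what a five-field structured answer for one lung case might look like, the following sketch defines a small record type and a validator. The field names and the example label values are assumptions made for illustration; the canonical schema and label sets used by the benchmark are not reproduced here.

```python
from dataclasses import asdict, dataclass
from typing import Dict

# Hypothetical field names for the five structured predictions.
LUNG_FIELDS = ("tumor_location", "t_stage", "n_stage", "histology", "grade")


@dataclass(frozen=True)
class LungCaseAnswer:
    tumor_location: str
    t_stage: str
    n_stage: str
    histology: str
    grade: str


def validate(answer: Dict[str, str]) -> LungCaseAnswer:
    """Reject answers that miss a field or carry unexpected ones."""
    missing = [f for f in LUNG_FIELDS if f not in answer]
    extra = [k for k in answer if k not in LUNG_FIELDS]
    if missing or extra:
        raise ValueError(f"missing={missing}, unexpected={extra}")
    return LungCaseAnswer(**{f: str(answer[f]) for f in LUNG_FIELDS})


# Illustrative values only; they are not taken from the benchmark.
example = validate(
    {
        "tumor_location": "right upper lobe",
        "t_stage": "T2",
        "n_stage": "N0",
        "histology": "adenocarcinoma",
        "grade": "G2",
    }
)
print(asdict(example))
```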
The paper also states several things that are not specified. It does not provide precise counts of studies or patients, institutions beyond the published datasets, or per-case series counts. It does not specify data formats such as DICOM versus NIfTI, voxel spacing or resolution, slice thickness, or preprocessing and standardization steps. Train, validation, and test splits, licensing terms, access instructions, and de-identification or privacy procedures are likewise not given, although the modules are sourced from public datasets (Shen et al., 25 Mar 2026).
These omissions are significant for reproducibility in the conventional dataset-release sense. MedFlow-Bench is therefore precisely specified as an interaction-and-evaluation framework over public study sources, but incompletely specified as a data-handling protocol.
4. Tracks, tasks, and operational taxonomy
MedFlow-Bench defines three tracks, all using the same cases and canonical answer schemas.
| Track | Allowed actions | Stated role |
|---|---|---|
| Track A: Viewer-Only | Primitive viewer operations through MedOpenClaw and 3D Slicer REST | Emphasizes visual search, slice-to-slice synthesis, and reasoning across sequences |
| Track B: Tool-Use | Viewer-Only actions plus evidence operations and optional expert tools | Tests whether tool-augmented execution improves diagnostic performance |
| Track C: Open-Method | Any pipeline that consumes the raw study and produces answers in the canonical schema | Keeps the benchmark open to any method, so that it can serve as a universal standard |
Track A permits primitive viewer operations including series selection, slice scrolling, window or level adjustment, fusion toggling, and bookmarking views. No expert analysis tools are allowed. The protocol is a study-level episode in which the agent must navigate and answer under a strict answering protocol (Shen et al., 25 Mar 2026).
Track B adds evidence operations such as artifact capture or export, masks, and measurement logs, together with optional expert tools such as MONAI-based segmentation or quantitative analysis modules. Agents must decide when and how to invoke these tools and integrate their outputs into reasoning, all within MedOpenClaw’s bounded API. Raw Python execution inside Slicer remains prohibited. The evaluation target is whether tool-augmented execution improves diagnostic performance and whether the agent can parameterize tools and interpret returned artifacts (Shen et al., 25 Mar 2026).
Track C allows any pipeline, potentially bypassing MedOpenClaw and 3D Slicer, provided it consumes the raw study and emits answers in the canonical schema. The paper describes this as having no constraints beyond standard benchmark rules and as preserving universality (Shen et al., 25 Mar 2026).
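The three tracks described above amount to different permission sets over the same cases. A minimal sketch of such a permission check follows; the operation names are hypothetical placeholders grouped according to the track descriptions, not identifiers from MedOpenClaw.

```python
from enum import Enum


class Track(Enum):
    VIEWER_ONLY = "A"
    TOOL_USE = "B"
    OPEN_METHOD = "C"


# Hypothetical operation names, grouped as in the track descriptions.
VIEWER_OPS = frozenset(
    {"select_series", "scroll_slice", "set_window_level", "toggle_fusion", "bookmark_view"}
)
EVIDENCE_OPS = frozenset({"capture_view", "export_mask", "log_measurement"})
EXPERT_TOOLS = frozenset({"monai_segment", "local_threshold"})


def is_allowed(track: Track, operation: str) -> bool:
    """Return True if the operation is permitted in the given track."""
    if track is Track.OPEN_METHOD:
        # Track C does not constrain the pipeline; only the answer schema is fixed.
        return True
    if track is Track.VIEWER_ONLY:
        return operation in VIEWER_OPS
    # Track B: viewer operations plus evidence operations and expert tools.
    return operation in (VIEWER_OPS | EVIDENCE_OPS | EXPERT_TOOLS)


for op in ("scroll_slice", "local_threshold", "exec_python"):
    print(op, {t.name: is_allowed(t, op) for t in Track})
```

In this sketch raw code execution is rejected in Tracks A and B simply because it is absent from every permitted set, which mirrors the allow-list design the paper describes.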
The task taxonomy spans study-level classification and triage, cross-modality evidence gathering, spatial localization, and structured clinical prediction. The brain module exemplifies case-level diagnosis from multi-sequence MRI. The lung module requires tumor localization by navigating CT and PET volumes and fused views, along with staging, histology, and grade prediction, potentially assisted by measurements or segmentations in Track B (Shen et al., 25 Mar 2026).
5. Evaluation protocol and reported baselines
The benchmark uses two answer protocols. Under the MCQ protocol, explicit answer options are provided. Under the open-ended protocol, options are removed and an LLM judge compares free-form outputs against canonical answers as a secondary robustness check. The paper defines brain MRI performance by case-level accuracy, and lung CT/PET by case-exact accuracy as the primary metric with question-level accuracy as auxiliary. Some results also report average tool calls as “Accuracy (Avg. Tool Calls)” to reflect interaction effort (Shen et al., 25 Mar 2026).
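The difference between case-exact and question-level accuracy is easiest to see on a toy example. The sketch below assumes the conventional definitions: a case counts as exactly correct only if every question in it is answered correctly, while question-level accuracy is the fraction of individual answers that are correct. The paper does not give formulas, so these definitions, the function names, and the toy labels are assumptions.

```python
from typing import Dict, List

Answers = Dict[str, str]  # question name -> predicted or reference label


def question_accuracy(preds: List[Answers], golds: List[Answers]) -> float:
    """Fraction of individual (case, question) answers that match."""
    total = sum(len(g) for g in golds)
    correct = sum(p.get(q) == a for p, g in zip(preds, golds) for q, a in g.items())
    return correct / total


def case_exact_accuracy(preds: List[Answers], golds: List[Answers]) -> float:
    """Fraction of cases in which every question is answered correctly."""
    exact = sum(all(p.get(q) == a for q, a in g.items()) for p, g in zip(preds, golds))
    return exact / len(golds)


# Toy data: two cases, five questions each; labels are illustrative only.
golds = [
    {"location": "RUL", "t": "T2", "n": "N0", "histology": "adeno", "grade": "G2"},
    {"location": "LLL", "t": "T1", "n": "N1", "histology": "squamous", "grade": "G3"},
]
preds = [
    {"location": "RUL", "t": "T2", "n": "N0", "histology": "adeno", "grade": "G2"},
    {"location": "LLL", "t": "T1", "n": "N1", "histology": "squamous", "grade": "G1"},
]

print(question_accuracy(preds, golds))    # 9 of 10 answers match
print(case_exact_accuracy(preds, golds))  # 1 of 2 cases fully correct
```

A single wrong field turns a case into a miss under the case-exact metric, which is why it is the stricter of the two.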
The paper does not give formal definitions of accuracy, success rate, time-to-solve, action efficiency, spatial grounding error, or measurement error. For lung CT/PET, “Overall” reflects aggregate performance across the five subtasks, but the weighting scheme is not specified. Auditing criteria are enforced through runtime logs and bounded APIs, but explicit scoring deductions beyond accuracy are not specified (Shen et al., 25 Mar 2026).
In the Viewer-Only track under the MCQ protocol, the paper reports the following baseline results.
| Model | Brain MRI case-level accuracy | Lung CT/PET Overall accuracy |
|---|---|---|
| GPT-5.4 | 0.61 | 0.32 |
| GPT-5-mini | 0.43 | 0.20 |
| gemini-3.1-flash-preview | 0.56 | 0.52 |
| gemini-3.1-pro-preview | 0.63 | 0.31 |
Average tool calls are also reported in the same results (Shen et al., 25 Mar 2026).
| Model | Brain MRI avg. tool calls | Lung CT/PET Overall avg. tool calls |
|---|---|---|
| GPT-5.4 | 5.9 | 11.5 |
| GPT-5-mini | 2.24 | 1.85 |
| gemini-3.1-flash-preview | 9.6 | 19.6 |
| gemini-3.1-pro-preview | 7.2 | 11.7 |
For lung CT/PET subtasks, the reported accuracies are as follows.
| Model | Tumor Location | T Stage | N Stage | Histology | Grade |
|---|---|---|---|---|---|
| GPT-5.4 | 0.46 | 0.32 | 0.38 | 0.36 | 0.07 |
| GPT-5-mini | 0.14 | 0.17 | 0.09 | 0.06 | 0.04 |
| gemini-3.1-flash-preview | 0.42 | 0.21 | 0.83 | 0.72 | 0.44 |
| gemini-3.1-pro-preview | 0.43 | 0.31 | 0.35 | 0.34 | 0.11 |
These results establish two benchmark observations stated in the paper. First, state-of-the-art LLMs and VLMs can successfully navigate the viewer to solve a meaningful fraction of study-level tasks. Second, performance varies substantially by module and subtask, with the lung module exposing stronger heterogeneity across tumor localization, staging, histology, and grade (Shen et al., 25 Mar 2026).
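Because the weighting behind the lung “Overall” column is not specified, one simple cross-check is to compare it with the unweighted mean of the five subtask accuracies quoted above. The sketch below does only that arithmetic on the reported numbers; the choice of an unweighted mean and the rounding tolerance are assumptions, and the result should not be read as the paper's aggregation rule.

```python
# Reported Viewer-Only MCQ accuracies, copied from the tables above.
SUBTASKS = {
    "GPT-5.4": [0.46, 0.32, 0.38, 0.36, 0.07],
    "GPT-5-mini": [0.14, 0.17, 0.09, 0.06, 0.04],
    "gemini-3.1-flash-preview": [0.42, 0.21, 0.83, 0.72, 0.44],
    "gemini-3.1-pro-preview": [0.43, 0.31, 0.35, 0.34, 0.11],
}
OVERALL = {
    "GPT-5.4": 0.32,
    "GPT-5-mini": 0.20,
    "gemini-3.1-flash-preview": 0.52,
    "gemini-3.1-pro-preview": 0.31,
}

# Two-decimal reporting allows a rounding error of up to half a unit.
TOLERANCE = 0.005 + 1e-9


def unweighted_mean(values):
    return sum(values) / len(values)


def consistent_with_mean(model):
    """True if the reported Overall matches the unweighted subtask mean."""
    return abs(unweighted_mean(SUBTASKS[model]) - OVERALL[model]) <= TOLERANCE


for model in SUBTASKS:
    mean = unweighted_mean(SUBTASKS[model])
    print(f"{model}: mean={mean:.3f} reported={OVERALL[model]:.2f} "
          f"consistent={consistent_with_mean(model)}")
```

On the figures as quoted here, the unweighted mean agrees with the reported Overall for GPT-5.4 and both Gemini models, but not for GPT-5-mini (mean 0.10 against a reported 0.20). The check cannot say whether that gap reflects a different aggregation or a discrepancy in the quoted numbers.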
6. Tool-use degradation and spatial grounding as the central bottleneck
One of the benchmark’s most distinctive findings is that performance can degrade when professional tools are made available. In the Tool-Use track, adding segmentation toolpacks via MedOpenClaw yields the following changes: GPT-5.4 drops from 0.61 to 0.57 on brain MRI and from 0.32 to 0.27 on lung CT/PET; GPT-5-mini increases slightly from 0.43 to 0.45 on brain MRI but drops from 0.20 to 0.14 on lung CT/PET (Shen et al., 25 Mar 2026).
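The changes quoted above can be tabulated as absolute differences between the two tracks. The sketch below uses only the four pairs of numbers reported in this section; the dictionary layout and function name are illustrative.

```python
# (model, module) -> accuracy, as quoted for the two tracks.
VIEWER_ONLY = {
    ("GPT-5.4", "brain MRI"): 0.61,
    ("GPT-5.4", "lung CT/PET"): 0.32,
    ("GPT-5-mini", "brain MRI"): 0.43,
    ("GPT-5-mini", "lung CT/PET"): 0.20,
}
TOOL_USE = {
    ("GPT-5.4", "brain MRI"): 0.57,
    ("GPT-5.4", "lung CT/PET"): 0.27,
    ("GPT-5-mini", "brain MRI"): 0.45,
    ("GPT-5-mini", "lung CT/PET"): 0.14,
}


def tool_use_delta(key):
    """Accuracy change when tools are added (negative means degradation)."""
    return round(TOOL_USE[key] - VIEWER_ONLY[key], 2)


for key in VIEWER_ONLY:
    model, module = key
    print(f"{model:<12} {module:<12} {tool_use_delta(key):+.2f}")
```

Three of the four settings show a drop, and the only gain is the two-point increase for GPT-5-mini on brain MRI.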
The paper attributes this degradation to insufficient fine-grained spatial grounding. Agents must provide precise spatial inputs, including millimeter-level coordinates, in order to seed segmentation algorithms such as Local Threshold Segmentation. Current VLM agents struggle with this level of control precision, often producing misaligned or anatomically incorrect masks. Those flawed artifacts then mislead downstream reasoning, leading to lower overall accuracy despite access to stronger tools (Shen et al., 25 Mar 2026).
This is an important corrective to a common assumption that more tools necessarily improve agent performance. In MedFlow-Bench, tool-use introduces a coordination problem between semantic reasoning and precise spatial control. The failure mode is concrete: imprecise coordinates lead to a misaligned mask, which leads to misleading quantitative evidence, which leads to an incorrect final decision. The benchmark therefore surfaces spatial grounding and parameter-control fidelity as first-order capabilities for medical agents rather than secondary engineering details (Shen et al., 25 Mar 2026).
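The first link in that failure chain, from an imprecise seed to a misaligned mask, can be reproduced on synthetic data. The sketch below is a toy stand-in and not 3D Slicer's Local Threshold tool: it assumes a 32-voxel cube with 1 mm isotropic spacing, a spherical lesion, and a simplified rule that keeps bright voxels within a fixed radius of the seed. All names and parameter values are hypothetical.

```python
from itertools import product

SIZE = 32              # voxels per axis; 1 mm isotropic spacing is assumed
CENTER = (16, 16, 16)  # lesion centre in voxel coordinates
LESION_RADIUS = 5      # lesion radius in mm


def dist2(a, b):
    """Squared Euclidean distance between two voxel coordinates."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def make_volume():
    """Synthetic volume: intensity 1.0 inside the lesion, 0.0 elsewhere."""
    return {
        v: 1.0 if dist2(v, CENTER) <= LESION_RADIUS ** 2 else 0.0
        for v in product(range(SIZE), repeat=3)
    }


def local_threshold(volume, seed, radius=6, threshold=0.5):
    """Toy rule: keep voxels near the seed whose intensity exceeds the threshold."""
    return {
        v for v, value in volume.items()
        if value > threshold and dist2(v, seed) <= radius ** 2
    }


def dice(a, b):
    """Dice overlap between two voxel sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


volume = make_volume()
truth = {v for v, value in volume.items() if value > 0.5}

accurate_mask = local_threshold(volume, seed=CENTER)
offset_mask = local_threshold(volume, seed=(24, 16, 16))  # seed placed 8 mm off

print("accurate seed Dice:", dice(accurate_mask, truth))
print("offset seed Dice:  ", round(dice(offset_mask, truth), 3))
print("volume ratio:      ", round(len(offset_mask) / len(truth), 3))
```

With the seed placed 8 mm from the lesion centre, the mask captures only a sliver of the lesion, so any volume or diameter measurement derived from it would understate the finding. That is the kind of misleading quantitative evidence the paper describes as propagating into the final decision.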
A plausible implication is that future progress on this benchmark will depend less on generic language reasoning gains than on architectures that couple cross-slice visual localization, spatial memory, and tool-parameter calibration.
7. Position within medical benchmark research, limitations, and expected trajectory
The paper situates MedFlow-Bench against static-image perception and report-generation datasets by stating that it moves from curated 2D slices to full-study interaction, requires navigation across sequences or modalities and active evidence gathering, and introduces agentic execution, differential diagnosis, and cross-modality reasoning with an auditable trace. According to the paper’s comparison, it is the only resource among those listed that provides full-study interactive access, cross-modality cases, active exploration, differential diagnosis, and agentic execution in a real viewer (Shen et al., 25 Mar 2026).
The benchmark also sits within a broader landscape of medical and biomedical benchmarking that emphasizes reproducibility, realistic workflows, and explicit validation conditions, although these are methodologically different traditions. For example, FDA nozzle and blood-pump benchmarks in biomedical flow modeling foreground reproducible geometry, instrumented validation, and explicit uncertainty reporting (Raben et al., 2014, Jain, 2020, Huang et al., 2022, Huang et al., 17 Apr 2026). Likewise, a cerebral aneurysm FSI benchmark stresses standardized geometry, material models, and solver comparisons for reproducibility (Goetz et al., 2023). This suggests a shared benchmark philosophy of auditable protocols and constrained comparison, even though MedFlow-Bench addresses agentic medical imaging rather than CFD or FSI.
The paper acknowledges several limitations. The initial release is foundational and covers only two modules, brain MRI and lung CT/PET. Tool-use remains bottlenecked by unreliable spatial grounding and imprecise tool control. The paper also does not specify dataset splits, data formats, voxel spacing, licensing, or de-identification (Shen et al., 25 Mar 2026).
Ethical and privacy considerations are noted only at a high level. The benchmark uses public datasets, but the paper does not detail de-identification pipelines or privacy measures within MedFlow-Bench. It states that clinical deployment would require strict auditing and privacy compliance (Shen et al., 25 Mar 2026).
The stated roadmap includes modality expansion to ultrasound, mammography, and longitudinal studies involving prior versus current exams; multi-turn conversational evaluation; integration of EHR context; and expansion of the tool ecosystem with wider MONAI algorithms and quantitative modules, aligned with improved spatial grounding (Shen et al., 25 Mar 2026). This suggests a trajectory toward richer multimodal, temporally contextual, and clinically embedded study-level episodes rather than a mere scaling of case count.
In summary, MedFlow-Bench defines a benchmark regime in which medical imaging agents are evaluated as interactive study navigators inside real viewers under bounded, logged, replayable interfaces. Its principal contribution is not only to replace static-image evaluation with full-study interaction, but to expose the dependency of competent medical agent behavior on auditable evidence gathering, spatially grounded control, and tool-use reliability (Shen et al., 25 Mar 2026).