MedOpenClaw: Secure Runtime for 3D Imaging

Updated 4 July 2026

MedOpenClaw is a secure, auditable runtime that enables agents to interact with full 3D studies instead of curated 2D images.
It integrates with 3D Slicer via REST endpoints and bridge handlers to control actions, log evidence, and support advanced tool use.
Its design emphasizes reproducibility, spatial grounding, and traceability, facilitating interactive diagnostics in benchmarks like MedFlow-Bench.

MedOpenClaw is a secure, auditable runtime/API layer that sits between a backbone vision–LLM agent and a clinical imaging viewer, concretely 3D Slicer, so that the agent can operate over complete, uncurated volumetric studies rather than over pre-selected diagnostically relevant 2D images. It was introduced together with MedFlow-Bench to evaluate full-study interactive reasoning in multi-sequence brain MRI and paired lung CT/PET under controlled, replayable protocols, with emphasis on study navigation, evidence acquisition, spatial grounding, and traceable decision support (Shen et al., 25 Mar 2026).

1. Problem setting and intended scope

MedOpenClaw was proposed in response to a limitation in prevailing medical VLM evaluation: most benchmarks reduce diagnosis to pre-selected 2D images that require substantial manual curation and do not reflect the study-level, interactive character of radiological work. In the formulation adopted by MedOpenClaw, a clinically relevant agent must load a full 3D examination, enumerate and compare sequences or modalities, scroll through slices, adjust display settings such as windowing or PET/CT fusion, and, when permitted, invoke professional tools such as segmentation or quantitative measurement before committing to a final answer (Shen et al., 25 Mar 2026).

This runtime therefore targets a different problem class from static-image perception. The operative unit is the study-level episode, not the isolated image. The benchmarked behavior is not only recognition but sequential search, evidence collection, and justified decision formation under an explicit action interface. This suggests that MedOpenClaw is best understood as infrastructure for interactive medical-imaging agents rather than as a conventional dataset or model family.

2. Runtime architecture and control surface

MedOpenClaw is a runtime, not a model. It runs externally and does not modify 3D Slicer’s source code. Control is mediated through 3D Slicer’s documented WebServer REST endpoints, while operations not cleanly covered by REST, including DICOM import, quantitative measurement, and DICOM SEG export, are exposed through named bridge handlers. To keep execution bounded and auditable, arbitrary Python execution inside Slicer’s embedded console is prohibited; only predefined operations are callable (Shen et al., 25 Mar 2026).
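The "predefined operations only" rule can be illustrated with a small allow-list dispatcher. This is a minimal sketch, not the MedOpenClaw implementation: the class name, the handler names (`import_dicom`, `measure_volume`), and the response fields are hypothetical stand-ins, since the paper names the operation categories but not their identifiers.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

Handler = Callable[..., Dict[str, Any]]


class OperationNotAllowed(Exception):
    """Raised when a caller requests an operation outside the allow-list."""


@dataclass
class BoundedRuntime:
    # Only operations registered here are callable; there is no eval/exec path.
    handlers: Dict[str, Handler]

    def call(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        handler = self.handlers.get(name)
        if handler is None:
            raise OperationNotAllowed(name)
        return handler(**kwargs)


# Hypothetical bridge handlers; real ones would talk to the viewer.
def import_dicom(path: str) -> Dict[str, Any]:
    return {"ok": True, "operation": "import_dicom", "path": path}


def measure_volume(segment_id: str) -> Dict[str, Any]:
    return {"ok": True, "operation": "measure_volume", "segment_id": segment_id}


runtime = BoundedRuntime(
    handlers={"import_dicom": import_dicom, "measure_volume": measure_volume}
)
result = runtime.call("import_dicom", path="/data/case01")
```

A request for an unregistered name, such as a hypothetical `exec_python`, raises `OperationNotAllowed` instead of reaching the viewer, which is the property the paragraph above describes.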

The runtime exposes a tiered action space aligned to the evaluation tracks of MedFlow-Bench.

Tier	Operations	Role
Primitive viewer actions	Select series or volumes; scroll through slices; adjust windowing/fusion; enumerate available sequences/modalities	Navigation and display control
Evidence operations	Bookmark views; draw masks; create measurement logs; export snapshots and evidence objects	Transparent artifact production
Expert tools	Segmentation and quantitative analysis via a MONAI-based reference tool pack; controlled parameterized calls	Advanced analysis

Every call returns a bounded, explicit response from the REST or bridge handler. After each action, the runtime logs a viewer-state snapshot containing the accessed series, display parameters such as window or fusion settings, current slice information, and any produced evidence artifacts such as masks or measurements. Tool arguments, outputs, and state snapshots collectively form a machine-verifiable trajectory (Shen et al., 25 Mar 2026).
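One way to picture the logged trajectory is as an append-only list of steps, each pairing a tool call with the viewer state observed afterwards. The sketch below is an assumption about shape only: the field names, the `Trajectory` class, and the JSON Lines serialization are hypothetical and not taken from the paper.

```python
import copy
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ViewerState:
    series: str
    window_level: float
    window_width: float
    slice_index: int
    artifacts: List[str] = field(default_factory=list)


@dataclass
class Step:
    tool: str
    arguments: Dict[str, Any]
    output: Dict[str, Any]
    state: ViewerState
    timestamp: float


class Trajectory:
    def __init__(self) -> None:
        self.steps: List[Step] = []

    def record(self, tool: str, arguments: Dict[str, Any],
               output: Dict[str, Any], state: ViewerState,
               timestamp: Optional[float] = None) -> Step:
        # Deep-copy the state so later viewer changes cannot rewrite history.
        step = Step(tool, dict(arguments), dict(output), copy.deepcopy(state),
                    time.time() if timestamp is None else timestamp)
        self.steps.append(step)
        return step

    def to_jsonl(self) -> str:
        # One JSON object per step, with sorted keys for stable diffs.
        return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in self.steps)


state = ViewerState(series="T1c", window_level=300.0, window_width=600.0, slice_index=45)
log = Trajectory()
log.record("scroll_slice", {"to": 45}, {"ok": True}, state, timestamp=0.0)
state.artifacts.append("bookmark-001")
log.record("bookmark_view", {}, {"ok": True, "id": "bookmark-001"}, state, timestamp=1.0)
```

Snapshotting the state at record time, rather than storing a reference, is what lets each logged step be checked independently later.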

Architecturally, this produces a narrow and inspectable mediation layer between the agent policy and the viewer. The significance of that design is twofold. First, it constrains the action surface to operations that are clinically interpretable. Second, it turns interactive image navigation into a logged computational process rather than an opaque exchange of screenshots and free-form text.

3. Auditability, reproducibility, and spatial grounding

Auditability is a defining property of MedOpenClaw. The runtime logs every tool invocation, its arguments, timestamps, the resulting viewer-state snapshot, and exported artifacts such as measurements, masks, and bookmarks. Because traces are reconstructable, a completed episode can be replayed to determine where the agent looked, what it did, and which evidence supported the final answer (Shen et al., 25 Mar 2026).
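As an illustration of what replaying a trace can answer, the following sketch summarizes a completed episode from its logged steps. The step layout (a `tool` name plus a `state` dictionary with `series`, `slice_index`, and `artifacts`) is a hypothetical format chosen for the example, not the runtime's actual log schema.

```python
from typing import Any, Dict, List


def summarize_trace(steps: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Report which slices were viewed per series and which evidence was produced."""
    visited: Dict[str, set] = {}
    evidence: List[str] = []
    for step in steps:
        state = step["state"]
        visited.setdefault(state["series"], set()).add(state["slice_index"])
        for artifact in state.get("artifacts", []):
            if artifact not in evidence:  # keep first-seen order, no duplicates
                evidence.append(artifact)
    return {
        "n_calls": len(steps),
        "visited": {series: sorted(slices) for series, slices in visited.items()},
        "evidence": evidence,
    }


# Toy trace with made-up values.
trace = [
    {"tool": "select_series", "state": {"series": "FLAIR", "slice_index": 60, "artifacts": []}},
    {"tool": "scroll_slice", "state": {"series": "FLAIR", "slice_index": 64, "artifacts": []}},
    {"tool": "select_series", "state": {"series": "T1c", "slice_index": 64, "artifacts": []}},
    {"tool": "bookmark_view", "state": {"series": "T1c", "slice_index": 64, "artifacts": ["bookmark-001"]}},
]
summary = summarize_trace(trace)
```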

Reproducibility follows from the same design. Episodes are standardized, the action interface is bounded, and runs are intended to be replayable across models and prompts. Safety is addressed primarily through interface restriction: predefined operations only, explicit and tiered tool access, and no arbitrary Python execution inside the viewer environment. A plausible implication is that MedOpenClaw operationalizes auditability not merely as logging but as a control principle: the runtime narrows the space of possible actions so that trace review remains meaningful.

Within this framework, “spatial grounding” has a specific technical meaning. It denotes the agent’s ability to specify precise spatial inputs in the study’s native world space, at millimeter-level accuracy, so that expert tools such as local-threshold segmentation receive anatomically correct seeds or extents. The runtime operates in the viewer’s world coordinate system, and the standard voxel-to-world relation is expressed as

$x_{\text{world}} = T_{\text{affine}} x_{\text{voxel}}.$

In the reported release, grounding is evaluated indirectly through task outcomes rather than by explicit coordinate-error metrics: imprecise coordinates yield misaligned masks, which then degrade downstream diagnostic accuracy (Shen et al., 25 Mar 2026).
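The voxel-to-world relation can be made concrete with a short worked example. The sketch assumes the usual convention in which the affine is a 4×4 matrix applied to homogeneous voxel indices (i, j, k, 1); the spacing and origin values are invented for illustration and do not come from the benchmark data.

```python
from typing import Sequence, Tuple

Matrix4 = Sequence[Sequence[float]]


def voxel_to_world(affine: Matrix4, voxel: Tuple[float, float, float]) -> Tuple[float, float, float]:
    """Apply x_world = T_affine * x_voxel using homogeneous coordinates."""
    homogeneous = (voxel[0], voxel[1], voxel[2], 1.0)
    x, y, z = (
        sum(row[c] * homogeneous[c] for c in range(4))
        for row in affine[:3]  # the last row (0, 0, 0, 1) carries no information
    )
    return (x, y, z)


# Illustrative affine: 0.5 x 0.5 x 2.0 mm spacing, origin at (-100, -120, 40) mm.
AFFINE = [
    [0.5, 0.0, 0.0, -100.0],
    [0.0, 0.5, 0.0, -120.0],
    [0.0, 0.0, 2.0, 40.0],
    [0.0, 0.0, 0.0, 1.0],
]

seed_mm = voxel_to_world(AFFINE, (10, 20, 5))
neighbor_mm = voxel_to_world(AFFINE, (10, 20, 6))
slice_step_mm = neighbor_mm[2] - seed_mm[2]
```

With this affine, voxel (10, 20, 5) maps to (-95.0, -110.0, 50.0) mm, and an error of a single slice index moves the seed by 2.0 mm along the third axis. That is the scale of error at which a seed for a local-threshold segmentation can fall outside the intended structure.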

4. MedFlow-Bench: benchmark structure and evaluation protocol

MedFlow-Bench is the study-level benchmark built on MedOpenClaw. The current release covers two modules: multi-sequence brain MRI from UCSF-PDGM with T1, T1 post-contrast/T1c, T2, and FLAIR; and paired lung CT/PET from NSCLC radiogenomics. Cases are delivered as full volumetric study packages with metadata, and each episode specifies a task prompt, an allowed action space determined by track, and a canonical answer schema for scoring (Shen et al., 25 Mar 2026).

Module	Task definition	Primary metric
Brain MRI	Case-level diagnosis over a fixed label set	Case-level accuracy
Lung CT/PET	Tumor location, pathological T stage, pathological N stage, histology, histopathological grade	Case-exact accuracy

The benchmark is organized into three tracks. Track A, Viewer-Only, permits only primitive viewer actions such as series selection, slice scrolling, and windowing or fusion control. Track B, Tool-Use, additionally permits expert modules and evidence operations, including MONAI-based segmentation and quantification. Track C, Open-Method, allows any pipeline that consumes raw cases and outputs the canonical answer schema, thereby decoupling the benchmark from the MedOpenClaw runtime itself (Shen et al., 25 Mar 2026).
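The relation between tracks and action tiers can be sketched as a simple permission check. The action names and the `is_allowed` helper are hypothetical; only the tier-to-track mapping follows the description above. Track C is deliberately absent from the table because it is defined independently of the runtime.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Tier(Enum):
    PRIMITIVE = "primitive viewer action"
    EVIDENCE = "evidence operation"
    EXPERT = "expert tool"


# Hypothetical action names, each assigned to one tier.
ACTION_TIER: Dict[str, Tier] = {
    "select_series": Tier.PRIMITIVE,
    "scroll_slice": Tier.PRIMITIVE,
    "set_window": Tier.PRIMITIVE,
    "bookmark_view": Tier.EVIDENCE,
    "export_snapshot": Tier.EVIDENCE,
    "segment_local_threshold": Tier.EXPERT,
}

# Track A: viewer only. Track B: viewer plus evidence operations and expert tools.
TRACK_TIERS: Dict[str, FrozenSet[Tier]] = {
    "A": frozenset({Tier.PRIMITIVE}),
    "B": frozenset(Tier),
}


def is_allowed(track: str, action: str) -> bool:
    if track not in TRACK_TIERS:
        # Track C pipelines consume raw cases and do not go through the runtime.
        raise ValueError(f"track {track!r} is not mediated by the runtime")
    tier = ACTION_TIER.get(action)
    return tier is not None and tier in TRACK_TIERS[track]
```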

Two response protocols are defined. The multiple-choice protocol supplies explicit options and serves as the primary protocol in the initial baselines. The open-ended protocol poses the same tasks without options and scores them with an LLM judge against canonical answers, as a secondary robustness check. The principal metrics are standard accuracy for single-task modules,

$A = \frac{N_{\text{correct}}}{N_{\text{total}}},$

case-exact accuracy for the five-task lung module,

$A_{\text{case-exact}} = \frac{N_{\text{case-all-correct}}}{N_{\text{cases}}},$

and question-level accuracy for each lung subtask,

$A_{\text{subtask}} = \frac{N_{\text{correct-subtask}}}{N_{\text{cases}}}.$

Some result tables additionally report average tool calls per episode in parentheses alongside accuracy values (Shen et al., 25 Mar 2026).
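The difference between subtask accuracy and case-exact accuracy is easiest to see on a toy example. The sketch below uses four invented cases with made-up labels and hypothetical task keys; it only illustrates the two formulas above and is not the benchmark's scoring code.

```python
from typing import Dict, List, Sequence, Tuple

Answer = Dict[str, str]
Case = Tuple[Answer, Answer]  # (prediction, gold)

LUNG_TASKS = ("location", "t_stage", "n_stage", "histology", "grade")


def subtask_accuracy(cases: List[Case], task: str) -> float:
    """Fraction of cases whose answer for one subtask matches the gold label."""
    if not cases:
        raise ValueError("no cases to score")
    return sum(pred[task] == gold[task] for pred, gold in cases) / len(cases)


def case_exact_accuracy(cases: List[Case], tasks: Sequence[str] = LUNG_TASKS) -> float:
    """Fraction of cases in which every subtask is answered correctly."""
    if not cases:
        raise ValueError("no cases to score")
    return sum(all(pred[t] == gold[t] for t in tasks) for pred, gold in cases) / len(cases)


# Invented gold labels, reused for all four toy cases.
GOLD: Answer = {"location": "RUL", "t_stage": "T2", "n_stage": "N0",
                "histology": "adenocarcinoma", "grade": "G2"}
CASES: List[Case] = [
    (dict(GOLD), GOLD),                                # all five correct
    ({**GOLD, "grade": "G3"}, GOLD),                   # grade wrong
    ({**GOLD, "t_stage": "T1", "grade": "G1"}, GOLD),  # T stage and grade wrong
    (dict(GOLD), GOLD),                                # all five correct
]

per_task = {task: subtask_accuracy(CASES, task) for task in LUNG_TASKS}
exact = case_exact_accuracy(CASES)
```

Because a case counts toward case-exact accuracy only when all five subtasks are right, the case-exact value can never exceed the lowest subtask accuracy. Here the grade subtask scores 0.5 and the case-exact accuracy is also 0.5.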

5. Empirical findings and the Tool-Use Paradox

Initial Viewer-Only baselines show that frontier VLMs can already perform nontrivial study-level interaction. In the scores below, the value in parentheses is the average number of tool calls per episode. On Brain MRI case-level accuracy, the reported scores are GPT-5.4 at 0.61 (5.9), GPT-5-mini at 0.43 (2.24), Gemini-3.1-flash-preview at 0.56 (9.6), and Gemini-3.1-pro-preview at 0.63 (7.2), the best among the listed models. On Lung CT/PET case-exact accuracy, the scores are GPT-5.4 at 0.32 (11.5), GPT-5-mini at 0.20 (1.85), Gemini-3.1-flash-preview at 0.52 (19.6), the best among the listed models, and Gemini-3.1-pro-preview at 0.31 (11.7) (Shen et al., 25 Mar 2026).

Performance varies substantially by lung subtask. Reported Viewer-Only accuracies are as follows.

Subtask	GPT-5.4	GPT-5-mini	Gemini-flash	Gemini-pro
Tumor Location	0.46	0.14	0.42	0.43
Pathological T Stage	0.32	0.17	0.21	0.31
Pathological N Stage	0.38	0.09	0.83	0.35
Histology	0.36	0.06	0.72	0.34
Histopathological Grade	0.07	0.04	0.44	0.11

Gemini-flash is best on Pathological N Stage, Histology, and Histopathological Grade. The pattern reported in the paper is that macroscopic tasks such as tumor location are moderately tractable, whereas fine-grained tasks such as histopathological grade remain difficult (Shen et al., 25 Mar 2026).

The paper’s central empirical claim is the “Tool-Use Paradox.” Enabling professional support tools does not automatically improve results and can reduce them. For GPT-5-mini, Viewer-Only versus segmentation-toolkits yields Brain MRI 0.43 to 0.45 and Lung CT/PET 0.20 to 0.14. For GPT-5.4, the corresponding change is Brain MRI 0.61 to 0.57 and Lung CT/PET 0.32 to 0.27. The authors attribute this degradation to imprecise spatial grounding: agents often fail to provide the millimeter-precise coordinates required by expert algorithms such as local-threshold segmentation, producing misaligned masks and misleading evidence rather than useful assistance (Shen et al., 25 Mar 2026).
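The size of the effect is easier to compare as per-model differences. The short sketch below only re-expresses the four reported score pairs from the paragraph above as Tool-Use minus Viewer-Only deltas; the variable and function names are illustrative.

```python
from typing import Dict, Tuple

# (Viewer-Only, with segmentation toolkits), as reported in the text above.
REPORTED: Dict[str, Dict[str, Tuple[float, float]]] = {
    "GPT-5-mini": {"Brain MRI": (0.43, 0.45), "Lung CT/PET": (0.20, 0.14)},
    "GPT-5.4": {"Brain MRI": (0.61, 0.57), "Lung CT/PET": (0.32, 0.27)},
}


def tool_use_deltas(scores: Dict[str, Dict[str, Tuple[float, float]]]) -> Dict[str, Dict[str, float]]:
    """Change in accuracy when expert tools are enabled, rounded to two decimals."""
    return {
        model: {module: round(with_tools - viewer_only, 2)
                for module, (viewer_only, with_tools) in modules.items()}
        for model, modules in scores.items()
    }


deltas = tool_use_deltas(REPORTED)
n_drops = sum(d < 0 for modules in deltas.values() for d in modules.values())
```

Three of the four reported comparisons are negative (-0.06, -0.04, and -0.05), and the single gain is +0.02 for GPT-5-mini on Brain MRI.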

This result constrains an otherwise common assumption that more powerful tools necessarily imply better agent performance. In the MedOpenClaw setting, the bottleneck is not only perceptual competence or tool availability but control precision in the viewer’s spatial frame.

6. Position in the literature, nomenclature, and current limitations

Among the resources compared in the MedOpenClaw paper, MedFlow-Bench is described as the only one that simultaneously provides full-study interactive access, cross-modality cases, active exploration, differential diagnosis, and required agentic execution. The contrast class includes benchmarks centered on static 2D inputs such as VQA-RAD, VQA-Med, SLAKE, PMC-VQA, OmniMedVQA, and MedXpertQA-MM; resources using pre-selected slices or full-study static inputs without interaction such as NOVA, MedThinkVQA, ReXGroundingCT, 3D-RAD, and ViMed-PET; simulated or synthetic tool environments such as RadABench; and text-only sequential diagnosis settings such as Nori et al. and MedCaseReasoning that do not require volumetric navigation (Shen et al., 25 Mar 2026).

The term itself also requires disambiguation. RealClawBench explicitly states that it does not define or mention a “MedOpenClaw” variant or any domain-specific medical extension (Lv et al., 2 Jun 2026). In a different line of work, OpenCLIPER uses “MEDOPENCLAW” to denote “medical OpenCL-based workloads and workflows,” which is an OpenCL-centered framing rather than the 3D Slicer-centered medical-agent runtime (Simmross-Wattenberg et al., 2018). The OpenClaw security paper "Clawdrain" studies tool-calling attacks in OpenClaw deployments and concerns a different runtime and threat model (Dong et al., 1 Mar 2026). An older OpenCL medical-computing paper, oclMC, illustrates cross-platform medical dose calculation via OpenCL and likewise belongs to a separate lineage (Tian et al., 2015). These usages indicate that the string “MEDOPENCLAW” has appeared in multiple contexts; in current medical-imaging agent literature, however, it denotes the auditable runtime introduced with MedFlow-Bench.

The present scope of MedOpenClaw remains limited. The initial release covers two modules only: multi-sequence brain MRI and lung CT/PET. The paper does not explicitly report train/validation/test splits or dataset sizes. Hardware, operating system, and preprocessing details are not specified. The primary failure mode identified is incorrect tool parameterization leading to misaligned masks and misleading evidence. More broadly, the paper treats auditability as a prerequisite for clinical trust, but also states that spatial control remains an open challenge before reliable end-to-end clinical workflow execution is feasible (Shen et al., 25 Mar 2026).

The resulting research agenda is correspondingly specific. The authors propose scaling to additional modalities such as ultrasound, mammography, and longitudinal comparison; broadening evaluation to multi-turn conversational tracks and EHR integration; enriching the expert-tool layer with more specialized MONAI-based algorithms; and improving spatial grounding, UI constraints, grounded policy training, multi-agent collaboration, and standardized clinical task sets. This suggests that MedOpenClaw is not merely an evaluation wrapper around 3D Slicer, but a controlled experimental substrate for studying how medical agents search, ground, act, and justify decisions over full studies under auditable conditions (Shen et al., 25 Mar 2026).