RadAgent: Chest CT Report Generation Agent

Updated 4 July 2026

RadAgent is a tool-using AI agent designed for chest CT report generation, employing a stepwise, checklist-guided ReAct workflow with a toolbox of ten specialized analysis tools.
It leverages reinforcement learning with GRPO to optimize sequential tool use, thereby enhancing clinical accuracy, adversarial robustness, and report interpretability.
Evaluated against CT-Chat, it gains 6.0 macro-F1 points and 5.4 micro-F1 points on the CT-RATE test set, with improvements that also hold on the external RadChestCT dataset.

RadAgent is a tool-using AI agent for stepwise interpretation of chest computed tomography, introduced for automated chest CT report generation from 3D non-contrast volumes through an explicit, inspectable sequence of diagnostic actions rather than a one-step vision-language mapping (Roschewitz et al., 16 Apr 2026). Its defining characteristics are a checklist-guided ReAct-style workflow, a toolbox of ten specialized CT analysis tools, a persistent scratchpad of preliminary findings, and reinforcement-learning optimization of tool-use behavior. The system is evaluated primarily against CT-Chat, the 3D vision-LLM that also supplies RadAgent’s initial draft report, and is presented as improving clinical accuracy, adversarial robustness, and faithfulness while making intermediate evidence available for clinician inspection (Roschewitz et al., 16 Apr 2026).

1. Scope and conceptual position

RadAgent is situated in chest CT report generation, specifically for 3D non-contrast chest CT volumes. The motivating claim is that chest CT reporting requires slice-by-slice inspection of a high-dimensional volume, spatial integration across anatomy and pathology, and transparent evidence for findings, whereas prior 3D VLM systems such as CT-Chat produce final reports without an interpretable intermediate reasoning process (Roschewitz et al., 16 Apr 2026).

The system is explicitly contrasted with one-shot report generation and with training-free agentic workflows. In the paper’s framing, prompt-designed or fixed tool-sequence agents can introduce multi-step behavior, but they still rely on the orchestrating LLM to devise medically adequate plans and to understand tool specifications without policy learning. RadAgent instead treats chest CT interpretation as a tool-augmented sequential decision problem and optimizes the tool-use policy with GRPO. This places it within a broader shift toward agentic radiology systems, but with a narrower and more concrete scope: chest CT report generation rather than general radiology assistance (Roschewitz et al., 16 Apr 2026).

A plausible implication is that RadAgent occupies a distinct position among radiology agents. MedRAX is a chest X-ray-specific ReAct agent built around multimodal tool use for CXR interpretation, while RadAgents and EviAgent are also chest-radiograph systems organized around multi-agent orchestration, grounding, and retrieval (Fallahpour et al., 4 Feb 2025). RadAgent differs primarily by targeting volumetric chest CT and by making reinforcement-learned tool policy, rather than only prompt-level orchestration, central to the method (Zhang et al., 24 Sep 2025).

2. Architecture and tool ecosystem

The architecture consists of an instruction-tuned Qwen3-14B policy model, a clinician-reviewed diagnostic checklist, a toolbox of ten specialized CT tools, a persistent scratchpad represented as preliminary_findings, and a stepwise ReAct-style loop (Roschewitz et al., 16 Apr 2026). The checklist covers nine categories routinely assessed in chest CT interpretation: Airways; Lung parenchyma; Pleura; Heart; Cardiovascular & mediastinum; Diaphragm & upper abdominal organs; Spine, ribs, sternum & clavicles; Chest wall, breasts, axillae; and Devices. It is described as “short and coarse,” intended as a planning scaffold rather than a rigid workflow (Roschewitz et al., 16 Apr 2026).

All tools are exposed via Model Context Protocol servers. The scratchpad persists across the trajectory and must reflect the current consensus without contradictions. Each tool-calling step is emitted as structured JSON with the fields "reasoning", "preliminary_findings", "action": "call_tool", "tool_name", and "arguments", while the terminating step uses "action": "final_answer" with an "answer" field (Roschewitz et al., 16 Apr 2026).
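As a minimal sketch of this message contract, assuming the fields travel in a flat JSON object, a single step could be checked as follows. The helper name validate_step, the "question" argument key, and the example payloads are hypothetical and not taken from the paper.

```python
import json

# Field names are the ones listed in the text; the function name,
# error messages, and example payloads are illustrative assumptions.
TOOL_CALL_FIELDS = {"reasoning", "preliminary_findings", "action", "tool_name", "arguments"}
FINAL_FIELDS = {"action", "answer"}


def validate_step(raw: str) -> dict:
    """Parse one agent step and check that it carries the required fields."""
    step = json.loads(raw)
    action = step.get("action")
    if action == "call_tool":
        required = TOOL_CALL_FIELDS
    elif action == "final_answer":
        required = FINAL_FIELDS
    else:
        raise ValueError(f"unknown action: {action!r}")
    missing = required - set(step)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return step


tool_step = json.dumps({
    "reasoning": "Check for devices before finalizing.",
    "preliminary_findings": {"Devices": "not yet assessed"},
    "action": "call_tool",
    "tool_name": "ct_vqa",
    "arguments": {"question": "Are any medical devices visible?"},
})
final_step = json.dumps({"action": "final_answer", "answer": "No acute findings."})

print(validate_step(tool_step)["tool_name"])
print(validate_step(final_step)["answer"])
```

Rejecting malformed steps before dispatch keeps the scratchpad and the tool log consistent, which matters when the whole trajectory is meant to be inspected afterwards.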

The toolbox is as follows:

Tool	Backbone	Function
`ct_vqa()`	CT-Chat	3D visual question answering
`slice_vqa()`	google/gemma-3-27b-it	2D slice-level VQA
`disease_classifier()`	CT-CLIP (VocabFine)	18-pathology screening
`report_generation()`	CT-Chat	Initial draft report generation
`anatomy_segmentation()`	TotalSegmentator	Anatomical masks
`effusion_segmentation()`	TotalSegmentator	Pleural/pericardial effusion masks
`biggest_slice_selection()`	—	Largest segmented slice per component
`get_several_slices_from_segmentation()`	—	Equidistant slices per region
`extract_slices_from_ct()`	—	Evenly spaced axial/coronal/sagittal slices
`windowing()`	—	Lung, bone, abdomen, mediastinum presets

Several of these tools are compositional rather than diagnostic in isolation. report_generation() supplies the initial CT-Chat draft; disease_classifier() performs broad pathology screening across 18 thoracic pathologies; ct_vqa() and slice_vqa() provide global and slice-local question answering; segmentation and slice-extraction utilities support spatial verification; and windowing() enables CT-specific display presets such as lung $(-600, 1500)$, bone $(300, 1500)$, abdomen $(60, 350)$, and mediastinum $(50, 350)$, which read most naturally as (window level, window width) pairs in Hounsfield units (Roschewitz et al., 16 Apr 2026).
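To make the windowing presets concrete, the sketch below applies them to a few Hounsfield-unit values. It assumes each pair is (window level, window width), which the text does not state explicitly, and the function name apply_window is hypothetical rather than the tool's actual interface.

```python
# Presets as quoted in the text. Reading each pair as
# (window level, window width) in Hounsfield units is an assumption.
PRESETS = {
    "lung": (-600, 1500),
    "bone": (300, 1500),
    "abdomen": (60, 350),
    "mediastinum": (50, 350),
}


def apply_window(hu_values, preset):
    """Clip HU values to the preset window and rescale them to [0, 1]."""
    level, width = PRESETS[preset]
    lower = level - width / 2
    upper = level + width / 2
    return [(min(max(v, lower), upper) - lower) / width for v in hu_values]


# Roughly: air, aerated lung, soft tissue, dense bone.
sample = [-1000, -600, 40, 1000]
lung_view = apply_window(sample, "lung")
mediastinum_view = apply_window(sample, "mediastinum")
print(lung_view)
print(mediastinum_view)
```

Under this reading, the lung preset spreads contrast across air and parenchyma, while the narrower mediastinum preset saturates both air and bone and reserves its range for soft tissue.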

3. Agent loop and policy optimization

At inference time, RadAgent starts from a report-generation request and first calls report_generation() to obtain a CT-Chat draft. It then walks through the checklist, deciding which category to inspect next, which tool to invoke, and what question or parameters to use. Tool outputs may be free text, pathology probability vectors, segmentation masks, extracted slices, or windowed images. After each tool call, the agent updates preliminary_findings and continues until it judges the investigation sufficient for final synthesis (Roschewitz et al., 16 Apr 2026).

The stepwise behavior is explicitly designed for adaptive recovery. One qualitative example in the paper shows ct_vqa() failing when the system checks for devices, after which the agent switches to extract_slices_from_ct() and then queries slice_vqa() on the extracted slices to recover the missing information. This indicates that the trajectory is not a fixed chain but a contingent sequence of tool choices conditioned on intermediate observations (Roschewitz et al., 16 Apr 2026).
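The recovery pattern in that example can be sketched with stub tools. In RadAgent the switch is chosen by the learned policy from intermediate observations; the fixed try/except below is only an illustration of the observed sequence, and all stub bodies, return values, and the check_devices helper are invented for this sketch.

```python
# Stub tools standing in for the MCP-served tools named in the text.
# Their behavior is invented purely to exercise the fallback path.
def ct_vqa(question):
    raise RuntimeError("tool failed")  # simulate the failed device query


def extract_slices_from_ct(n_slices):
    return [f"axial_slice_{i}" for i in range(n_slices)]


def slice_vqa(slices, question):
    return f"answer from {len(slices)} slices: central venous catheter present"


def check_devices():
    """Try the 3D tool first; on failure, fall back to slice-level VQA."""
    trajectory = []
    question = "Are any medical devices visible?"
    try:
        observation = ct_vqa(question)
        trajectory.append(("ct_vqa", "ok"))
    except RuntimeError:
        trajectory.append(("ct_vqa", "failed"))
        slices = extract_slices_from_ct(n_slices=5)
        trajectory.append(("extract_slices_from_ct", "ok"))
        observation = slice_vqa(slices, question)
        trajectory.append(("slice_vqa", "ok"))
    # The scratchpad is updated after the tool calls, as described above.
    preliminary_findings = {"Devices": observation}
    return preliminary_findings, trajectory


findings, trajectory = check_devices()
for tool_name, status in trajectory:
    print(tool_name, status)
```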

Training uses GRPO with LoRA adaptation of Qwen3-14B. The reported configuration is LoRA rank $16$, alpha $32$, 8 GH200 GPUs, 8 rollouts per example, batch size $6$, learning rate $0.00001$, and 150 training steps, with convergence defined by validation metrics no longer improving (Roschewitz et al., 16 Apr 2026). The reward combines report quality, tool success, tool diversity, tool-graph coherence, and a checklist/sequence judge:

$R_{\mathrm{quality}} = \text{F1}_{18} + \text{F1}_{\mathrm{abnorm}}$

$\text{Prec}_{\mathrm{abnorm}} = \frac{M_C + 0.5\,P_C}{C}, \qquad \text{Rec}_{\mathrm{abnorm}} = \frac{M_G + 0.5\,P_G}{G}$
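A small worked example may help with the partial-credit formulas above. The reading of the symbols is an assumption, since the text does not define them: M counts full matches, P counts partial matches, and C and G are the numbers of findings in the candidate and ground-truth reports. Combining precision and recall as a harmonic mean is likewise assumed.

```python
def abnormality_scores(m_c, p_c, c, m_g, p_g, g):
    """Partial-credit precision and recall following the formulas above.

    Assumed reading (not spelled out in the text): m_* are full matches,
    p_* are partial matches, c and g are the counts of candidate and
    ground-truth findings. The harmonic-mean F1 is also an assumption.
    """
    precision = (m_c + 0.5 * p_c) / c if c else 0.0
    recall = (m_g + 0.5 * p_g) / g if g else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 5 generated findings, 8 reference findings.
precision, recall, f1 = abnormality_scores(m_c=3, p_c=2, c=5, m_g=3, p_g=2, g=8)
print(precision, recall, round(f1, 4))
```

Each partial match contributes half a full match, so a report that gets the pathology right but the location wrong, for instance, is penalized without being scored as a complete miss.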

Early training emphasizes exploration, while late training adds stronger pressure for checklist adherence and coherent tool sequences. The paper does not provide an explicit GRPO loss formula, and there is no supervised trajectory imitation stage described (Roschewitz et al., 16 Apr 2026).
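Because no explicit loss is given, the sketch below shows only the group-relative advantage that GRPO uses in its general formulation: each rollout's scalar reward is normalized against the other rollouts for the same input. Whether RadAgent applies exactly this normalization is an assumption, and the reward values are invented.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight hypothetical scalar rewards, one per rollout of the same CT study,
# matching the 8 rollouts per example mentioned in the text.
rewards = [1.2, 0.8, 1.5, 0.9, 1.1, 0.7, 1.4, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Rollouts that score above their group mean receive positive advantages and are reinforced, which removes the need for a separate learned value function.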

4. Datasets, metrics, and empirical performance

RadAgent is trained and evaluated on CT-RATE and externally evaluated on RadChestCT. CT-RATE contains 25,692 non-contrast 3D chest CT scans from 21,304 patients with paired reports; the authors use the official train/test split and create an internal validation set of 1,000 scans from the training split. RadChestCT contains 36,316 non-contrast chest CT volumes in total, of which 3,632 scans from Duke University Health System are publicly released, with associated reports, 84 abnormality labels, and 52 anatomical location labels (Roschewitz et al., 16 Apr 2026).

Primary report-quality metrics are macro-F1 and micro-F1 over 18 pathology labels extracted from generated reports by the CT-RATE text classifier. The paper deliberately does not use BLEU or ROUGE as primary measures and reports 95% bootstrapped confidence intervals and two-sided permutation tests at 5% significance (Roschewitz et al., 16 Apr 2026).
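The difference between the two averages can be illustrated on a toy multi-label problem. The sketch uses three labels instead of eighteen, invented label vectors, and treats a label with no positives and no predictions as F1 = 0, which is a convention chosen here and not one stated in the paper.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """y_true and y_pred are lists of per-report binary label vectors."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # tp, fp, fn per label
    for truth, pred in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(truth, pred)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    # Macro: average the per-label F1 scores, so rare labels weigh equally.
    macro = sum(f1(*c) for c in counts) / n_labels
    # Micro: pool all counts first, so frequent labels dominate.
    totals = [sum(c[k] for c in counts) for k in range(3)]
    micro = f1(*totals)
    return macro, micro


y_true = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [1, 1, 0]]
macro, micro = macro_micro_f1(y_true, y_pred)
print(round(macro, 4), round(micro, 4))
```

Here the frequent first label is predicted perfectly while the rare second label is always wrong, so micro-F1 comes out higher than macro-F1. Reporting both exposes this kind of imbalance.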

Against CT-Chat, RadAgent improves macro-F1 by 6.0 points and micro-F1 by 5.4 points on CT-RATE test, corresponding to 36.4% relative improvement in macro-F1 and 19.6% relative improvement in micro-F1 (Roschewitz et al., 16 Apr 2026). The paper states that these gains are statistically significant and consistent across CT-RATE validation, CT-RATE test, and the external RadChestCT dataset. It also compares against a training-free RadAgent variant using the same checklist, tools, and prompt but no RL policy optimization. That training-free variant already improves over CT-Chat in macro-F1, while RL yields further gains and particularly strengthens out-of-domain generalization (Roschewitz et al., 16 Apr 2026).

The learned tool policy is not uniform across the toolbox. A Sankey analysis on CT-RATE validation shows a common pattern centered on report_generation, disease_classifier, and repeated ct_vqa calls, suggesting a practical workflow of initial drafting, broad screening, and iterative verification (Roschewitz et al., 16 Apr 2026).

5. Interpretability, faithfulness, and robustness

RadAgent’s interpretability is grounded in its explicit trajectory rather than in a post hoc explanation module. Each case includes reasoning text, the evolving preliminary_findings, tool names and arguments, tool outputs, the initial CT-Chat draft, and the final report. The paper argues that this allows clinicians to inspect, validate, or refine intermediate decisions rather than remaining “passive observers of final outputs” (Roschewitz et al., 16 Apr 2026).

Faithfulness is evaluated through hint injection. For 1,000 randomly sampled CT-RATE test studies, the authors add either a correct hint or a flipped incorrect hint to the report-generation prompt and measure whether the system explicitly acknowledges the hint when the hint changes its output. Under this metric, RadAgent attains 37.0% faithfulness, whereas CT-Chat is at 0.0% (Roschewitz et al., 16 Apr 2026). Hint acknowledgment is judged by Qwen3-235B-A22B-Instruct-2507, with reported label accuracy of 0.91 for RadAgent cases and 1.00 for CT-Chat cases; the paper therefore treats the faithfulness estimates as upper bounds (Roschewitz et al., 16 Apr 2026).
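Read as a ratio, the metric counts acknowledged hints only among cases where the hint actually changed the output. The sketch below follows that reading; the dictionary keys and the toy cases are hypothetical, and in the paper the acknowledgment flag comes from an LLM judge, not from a boolean field.

```python
def faithfulness(cases):
    """Share of hint-influenced cases in which the hint is acknowledged.

    Each case is a dict with hypothetical keys:
      'output_changed': the output changed once the hint was added
      'acknowledged':   an explicit mention of the hint was found
    """
    influenced = [c for c in cases if c["output_changed"]]
    if not influenced:
        return 0.0
    return sum(c["acknowledged"] for c in influenced) / len(influenced)


cases = [
    {"output_changed": True, "acknowledged": True},
    {"output_changed": True, "acknowledged": False},
    {"output_changed": True, "acknowledged": False},
    {"output_changed": True, "acknowledged": False},
    {"output_changed": False, "acknowledged": True},   # ignored: hint had no effect
    {"output_changed": False, "acknowledged": False},  # ignored: hint had no effect
]
print(faithfulness(cases))
```

Cases where the hint leaves the output unchanged are excluded from the denominator, so the score measures disclosure of influence and not how often hints are mentioned overall.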

Robustness is defined as the probability that a model remains correct under an incorrect hint, conditional on being correct without the hint. RadAgent reaches 83.7% robustness under adversarial hint injection, compared with 58.9% for CT-Chat, an absolute gain of 24.7 points and a relative improvement of 41.9% (Roschewitz et al., 16 Apr 2026). The paper attributes this to checklist-guided verification, explicit evidence gathering through tools, and a visible findings state that makes unsupported hints easier to detect.
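The conditional definition and the two kinds of gain can be sketched as follows. The dictionary keys and toy cases are hypothetical; only the two percentages passed to gains() come from the text.

```python
def robustness(cases):
    """P(correct with an incorrect hint | correct without the hint).

    Each case is a dict with hypothetical boolean keys
    'correct_clean' and 'correct_with_bad_hint'.
    """
    baseline_correct = [c for c in cases if c["correct_clean"]]
    if not baseline_correct:
        return 0.0
    kept = sum(c["correct_with_bad_hint"] for c in baseline_correct)
    return kept / len(baseline_correct)


def gains(new_pct, old_pct):
    """Absolute gain in percentage points and relative gain in percent."""
    absolute = new_pct - old_pct
    return absolute, 100 * absolute / old_pct


toy_cases = [
    {"correct_clean": True, "correct_with_bad_hint": True},
    {"correct_clean": True, "correct_with_bad_hint": True},
    {"correct_clean": True, "correct_with_bad_hint": False},
    {"correct_clean": False, "correct_with_bad_hint": False},  # excluded by the condition
]
toy_robustness = robustness(toy_cases)
absolute, relative = gains(83.7, 58.9)
print(round(toy_robustness, 3), round(absolute, 1), round(relative, 1))
```

Recomputing from the rounded percentages gives about 24.8 points and 42.1%. The small gap to the quoted 24.7 points and 41.9% is presumably due to rounding of the underlying values.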

The qualitative analysis reinforces this interpretation. In one successful trace, the system begins from a CT-Chat draft, performs disease screening, repeatedly checks checklist items with ct_vqa(), and produces a final report incorporating verified emphysema, nodules, and devices. In another, the agent recovers from a failed device query by switching tools, demonstrating resilience not only to adversarial prompts but also to imperfect tool execution (Roschewitz et al., 16 Apr 2026).

6. Relation to adjacent radiology-agent systems and current limitations

RadAgent belongs to a rapidly expanding class of radiology agents, but its specific combination of chest CT focus, tool specialization, and RL-trained orchestration differentiates it from most neighboring work. MedRAX is a CXR agent centered on GPT-4o, LangChain, and LangGraph, with chest-radiograph tools selected by an LLM-driven ReAct loop; RadAgents is a training-free multi-agent CXR framework structured around ABCDE specialists, V-RAG, and verification; EviAgent is an evidence-driven CXR report generator that plans, calls grounding and retrieval tools, extracts evidence, and then writes a report (Fallahpour et al., 4 Feb 2025). RadAgent extends the agentic pattern to 3D chest CT and adds explicit policy optimization for tool use (Qi et al., 14 Mar 2026).

Within chest CT report generation more broadly, RadAgent also sits alongside non-agentic or partially agentic alternatives. AdaRAG-CT argues that 3D CT report generation is bottlenecked by low-dimensional visual embeddings and reports Clinical F1 improving from 0.420 for CT-Agent to 0.480 through adaptive retrieval-augmented generation rather than RL-trained tool orchestration (Liang et al., 16 Mar 2026). This suggests a parallel line of development in which improved retrieval and generation control compensate for visual representation limits, whereas RadAgent emphasizes sequential tool-mediated verification.

Its limitations are substantial and explicitly acknowledged. The system is computationally heavy, requiring a multi-GPU setup with one node for the trained agent and another for distributed tools. The learned policy is optimized for the specific toolbox used during training, so major tool changes may require rerunning the RL pipeline. Faithfulness at 37.0% still leaves substantial room for improvement. The evaluation is confined to chest CT and does not demonstrate extension to other organs, contrast-enhanced CT, MRI, or other modalities. There is no prospective clinical deployment study, and the paper discusses human-in-the-loop opportunities and possible distillation into a fixed workflow rather than reporting operational clinical use (Roschewitz et al., 16 Apr 2026).

A broader implication is that RadAgent should be read as a chest-CT-specific prototype of transparent, tool-mediated radiology AI rather than as a complete radiology copilot. Later benchmarking work such as ABRA, which evaluates agents inside an OHIF/Orthanc environment, reinforces the importance of distinguishing tool orchestration from raw perception and localizes a major bottleneck to perception rather than viewer control (Maksudov et al., 11 May 2026). That diagnosis is consistent with RadAgent’s own design emphasis on specialized perception tools, explicit verification, and clinician-inspectable intermediate state.