This paper introduces ChestX-Reasoner (Fan et al., 29 Apr 2025 ), a multimodal LLM (MLLM) specifically designed for Chest X-ray diagnosis with enhanced reasoning capabilities. The core idea is to leverage the structured reasoning process radiologists follow and document in clinical reports, using this information as process supervision to train the model. Unlike prior medical AI models that often rely solely on outcome-based supervision, ChestX-Reasoner aims to generate step-by-step rationales that mirror clinical practice, improving interpretability and performance.
The research highlights that daily radiology reports are a rich source of structured reasoning chains, typically progressing from findings to impressions. To utilize this, the authors developed a pipeline to mine reasoning chains from routine clinical reports. This pipeline involves prompting GPT-4o (2410.21276) to construct structured reasoning plans based on image-report pairs, extracting diagnostic evidence (clinical observations) from the report for each plan, and refining these into coherent reasoning chains. This process, illustrated in Figure 1(b) and Supplementary Figure 1(a), provides a scalable way to generate high-quality, factual reasoning supervision data, contrasting with manual annotation or distillation from general LLMs.
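A minimal sketch of what such a mining pipeline might look like is shown below, assuming the OpenAI chat-completions API. The prompt wording, the text-only input (the paper conditions on image-report pairs), and the helper names are illustrative, not the authors' implementation.

```python
# Illustrative three-step mining pipeline: plan -> evidence -> refined reasoning chain.
# Assumes the OpenAI Python SDK; prompts and function names are hypothetical stand-ins.
from openai import OpenAI

client = OpenAI()

def ask_gpt4o(system_prompt: str, user_prompt: str) -> str:
    """Single GPT-4o call shared by every mining step (image input omitted for brevity)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return resp.choices[0].message.content

def mine_reasoning_chain(question: str, report: str) -> str:
    # Step 1: draft a structured reasoning plan for the diagnostic question.
    plan = ask_gpt4o(
        "You are a radiologist. Draft a step-by-step reasoning plan for the question.",
        f"Question: {question}\nReport: {report}",
    )
    # Step 2: extract diagnostic evidence (clinical observations) from the report for each plan step.
    evidence = ask_gpt4o(
        "For each step of the plan, extract the report findings that support it.",
        f"Plan: {plan}\nReport: {report}",
    )
    # Step 3: refine plan plus evidence into a coherent reasoning chain ending in the answer.
    return ask_gpt4o(
        "Rewrite the plan and evidence as a coherent step-by-step reasoning chain.",
        f"Plan: {plan}\nEvidence: {evidence}",
    )
```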
To support the development and evaluation of reasoning-enhanced medical MLLMs, the paper introduces RadRBench-CXR, a benchmark featuring 59K visual question answering (VQA) samples with 301K clinically validated reasoning steps. Samples are drawn from public datasets including MIMIC-CXR (Johnson et al., 2019), CheXpert (Irvin et al., 2019), and MS-CXR-T [2023], covering diverse tasks: binary, single, and multiple disease diagnosis, anomaly detection, and temporal comparison analysis. The benchmark includes the training data with mined reasoning, plus samples from SIIM [2019] and RSNA [2019] for cross-center validation that do not require mined reasoning.
A novel evaluation metric, RadRScore, is proposed to quantitatively measure reasoning quality. RadRScore assesses three dimensions (Figure 1(b)):
- Factuality: the proportion of generated reasoning entities that match the clinical report.
- Completeness: the proportion of ground-truth reasoning entities matched by the model's output.
- Effectiveness: the proportion of generated reasoning entities that are relevant to the ground-truth reasoning.
RadRScore is calculated as the mean of these three dimensions. The calculation uses an LLM (GPT-4o) to extract clinical entities from the model's output, the ground-truth reasoning, and the clinical report.
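As a rough illustration of how RadRScore could be computed once clinical entities have been extracted (the paper uses GPT-4o for the extraction step), here is a small Python sketch; exact matching over entity sets is an assumption, not the authors' matching procedure.

```python
# Hedged RadRScore sketch over pre-extracted clinical entity sets.
def radrscore(generated: set[str], reference: set[str], report: set[str]) -> dict[str, float]:
    """generated: entities in the model's reasoning; reference: entities in the
    ground-truth reasoning; report: entities in the clinical report."""
    def frac(hits: int, total: int) -> float:
        return hits / total if total else 0.0

    factuality    = frac(len(generated & report), len(generated))     # generated entities backed by the report
    completeness  = frac(len(generated & reference), len(reference))  # reference entities covered by the model
    effectiveness = frac(len(generated & reference), len(generated))  # generated entities relevant to the reference
    overall = (factuality + completeness + effectiveness) / 3.0
    return {"factuality": factuality, "completeness": completeness,
            "effectiveness": effectiveness, "radrscore": overall}

# Example: the model mentions one entity (pacemaker) that the reference reasoning does not need.
gen = {"cardiomegaly", "pleural effusion", "pacemaker"}
ref = {"cardiomegaly", "pleural effusion"}
rep = {"cardiomegaly", "pleural effusion", "pacemaker", "no pneumothorax"}
print(radrscore(gen, ref, rep))  # factuality 1.0, completeness 1.0, effectiveness ~0.67
```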
The ChestX-Reasoner model is based on Qwen2VL-7B (Wang et al., 18 Sep 2024 ) and trained using a two-stage framework (Figure 1(c), Figure 4):
- Stage I: Supervised Fine-Tuning (SFT): The base model is trained auto-regressively on both reasoning-augmented ($\mathcal{D}_R$) and answer-only ($\mathcal{D}_A$) data. This involves maximizing the likelihood of generating the expected output sequence (reasoning steps followed by the answer for $\mathcal{D}_R$, the answer only for $\mathcal{D}_A$).
$\mathcal{L}_{\text{SFT}}=-\mathbb{E}_{(\mathcal{X},\mathcal{Q},\mathcal{P},\mathcal{C},\mathcal{A})\sim\{\mathcal{D}_A,\mathcal{D}_R\}}\sum_{t=1}^{T} \log\pi_{\theta}(y_{t}\mid \mathcal{X},\mathcal{Q},y_{<t})$
where $y$ is the target sequence: the reasoning chain ($\mathcal{C}$) concatenated with the answer ($\mathcal{A}$) for $\mathcal{D}_R$, or just the answer ($\mathcal{A}$) for $\mathcal{D}_A$. This combined SFT approach is found to be crucial for initial domain alignment and for leveraging all available data.
- Stage II: Reinforcement Learning (RL) with Process Reward: Starting from the SFT model, the GRPO (Shao et al., 5 Feb 2024) algorithm is applied. This stage uses a reward function that incorporates:
  - Outcome Format Reward: rewards adherence to the expected output format (e.g., enclosing the reasoning steps and the final answer in dedicated tags such as <answer>).
  - Outcome Accuracy Reward: rewards correctness of the final predicted answer against the ground truth, using RaTEScore [2024] for open-ended tasks and exact match for closed-ended ones.
  - Process Factuality Reward: the key addition. For samples in $\mathcal{D}_R$, this reward is the factuality score (as defined for RadRScore) computed by comparing the generated reasoning steps against the clinical report, incentivizing factually correct intermediate reasoning.

The overall reward for an output combines the format, accuracy, and, for samples in $\mathcal{D}_R$, process factuality components (a minimal sketch of such a combined reward appears at the end of this overview). The process reward directly supervises the quality of the reasoning steps, a crucial aspect missing in prior outcome-only RL approaches.

Implementation details include training on 8 Tesla A100 GPUs with the AdamW optimizer [2019], a cosine learning-rate schedule, DeepSpeed-ZeRO2 [2020] for SFT, and PyTorch-FSDP [2023] with VeRL [2024] for RL (Supplementary Tables 1 and 2). The SFT stage took about 2 days and the RL stage about 3.5 days.

Evaluation results on RadRBench-CXR demonstrate ChestX-Reasoner's superiority (Figures 2 and 3). It significantly outperforms medical and general-domain MLLMs in both reasoning ability (RadRScore) and outcome accuracy across tasks, with substantial gains in RadRScore factuality, completeness, and effectiveness over baselines. On outcome accuracy, it surpasses state-of-the-art medical MLLMs such as CheXagent-3B [2024] and general MLLMs such as GPT-4o (2410.21276). The ablation study (Figure 5) confirms the critical role of both SFT and RL, the benefit of answer-only data for initial domain alignment, and the essential contribution of the process reward to reasoning ability and overall performance.

From a practical perspective, ChestX-Reasoner offers a path toward more interpretable and reliable medical AI systems. By generating step-by-step rationales, its outputs align better with clinical workflows, providing transparent decision support for radiologists in tasks such as reporting and differential diagnosis. The emphasis on factual correctness and traceability, enabled by process supervision from clinical reports, enhances auditability, which is vital for clinical deployment.

The authors acknowledge limitations, including the focus on chest X-rays (though the method is generalizable), the reliance on a specific base model (Qwen2VL-7B), and the current rule-based factuality score calculation. Future work could extend to other modalities and explore more sophisticated reward models.

All code, datasets, and models are planned for open-sourcing, facilitating further research in medical reasoning MLLMs. This work provides a practical framework for building medical AI models that not only provide accurate diagnoses but also demonstrate clinically relevant reasoning.
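To make the RL stage concrete, here is a minimal, hedged sketch of a combined reward in Python. It assumes a plain sum of the three components, `<think>`/`<answer>`-style output tags, and simple stand-in helpers for entity extraction and RaTEScore; none of these details are taken from the authors' implementation.

```python
import re

def extract_entities(text: str) -> set[str]:
    """Stand-in entity extractor: lowercase word tokens (the paper uses GPT-4o extraction)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def ratescore_stub(pred: str, gold: str) -> float:
    """Stand-in for RaTEScore, an entity-aware metric for radiology text."""
    p, g = extract_entities(pred), extract_entities(gold)
    return len(p & g) / max(len(p | g), 1)

def format_reward(output: str) -> float:
    """1.0 if reasoning and answer are wrapped in the expected tags (assumed <think>/<answer>)."""
    ok = re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", output, re.DOTALL)
    return 1.0 if ok else 0.0

def accuracy_reward(pred_answer: str, gold_answer: str, open_ended: bool) -> float:
    """Final-answer correctness: RaTEScore stand-in for open-ended, exact match for closed-ended."""
    if open_ended:
        return ratescore_stub(pred_answer, gold_answer)
    return 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0

def process_factuality_reward(reasoning: str, report: str) -> float:
    """Share of reasoning entities that also appear in the clinical report (the factuality score)."""
    gen, rep = extract_entities(reasoning), extract_entities(report)
    return len(gen & rep) / max(len(gen), 1)

def total_reward(output: str, pred_answer: str, gold_answer: str,
                 report: str, open_ended: bool, has_mined_reasoning: bool) -> float:
    """Assumed combination: a plain sum; the process term applies only to D_R samples."""
    reward = format_reward(output) + accuracy_reward(pred_answer, gold_answer, open_ended)
    if has_mined_reasoning:
        match = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
        reasoning = match.group(1) if match else output
        reward += process_factuality_reward(reasoning, report)
    return reward
```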