RadRBench-CXR: Clinically Grounded CXR Benchmark

  • The paper introduces RadRBench-CXR, a benchmark that evaluates multimodal language models on stepwise diagnostic reasoning using real radiology report chains.
  • It constructs the dataset by mining reasoning chains from major CXR report datasets, using GPT-4o to generate clinically relevant questions and diagnostic checkpoints.
  • The study presents RadRScore, a metric combining factuality, completeness, and effectiveness, used both to assess stepwise diagnostic reasoning and as a process reward during training.

RadRBench-CXR is a large-scale, clinically grounded benchmark for evaluating stepwise reasoning in chest X-ray (CXR) visual question answering (VQA) by multimodal LLMs (MLLMs). It was introduced to support process-supervised model training that more closely reflects the structured reasoning radiologists follow in clinical practice, and to address the insufficiency of existing benchmarks, which focus largely on outcome prediction without rigorously assessing intermediate diagnostic reasoning steps (Fan et al., 29 Apr 2025).

1. Benchmark Construction and Dataset Composition

RadRBench-CXR is built by mining reasoning chains directly from free-text clinical CXR reports in three major public datasets: MIMIC-CXR, CheXpert, and MS-CXR-T. The construction pipeline, sketched in code after the list, proceeds via:

  • Clinical VQA Item Generation: For each image-report pair, a prompt engine (GPT-4o) generates multiple VQA-style questions—a mix of binary, single-choice, multi-choice, anomaly detection, and temporal comparison tasks—grounded in the Impression section of clinical reports.
  • Reasoning Plan Extraction: GPT-4o receives the question, answer, and associated report, and returns a reasoning plan: a numbered sequence of diagnostic checkpoints that radiologists plausibly follow (e.g., “Assess heart size; Look for opacities; Examine costophrenic angles”).
  • Evidence Linking and Chain Refinement: Each step in the reasoning plan is mapped to corresponding report sentences. Absence of evidence is handled by assuming “normal” findings in accordance with conventional radiological inference.
  • Final Chains: Refined, clinically validated reasoning chains, typically 4–6 steps each, are consolidated to form the benchmark’s stepwise reference annotations.
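
A minimal sketch of this mining loop, assuming the OpenAI Python client for the GPT-4o calls; the prompt templates, the JSON output contract, and the lexical evidence linker are illustrative stand-ins, not the paper's actual prompts or matcher:

```python
import json
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()

# Illustrative prompt templates; the paper's exact prompts are not reproduced here.
QUESTION_PROMPT = ("From this CXR report impression, generate one VQA item as JSON: "
                   '{"task": ..., "question": ..., "answer": ...}')
PLAN_PROMPT = ("Given a question, answer, and full report, return a JSON array of the "
               "numbered diagnostic checkpoints a radiologist would follow.")

def gpt4o(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def link_evidence(checkpoint: str, report: str) -> list[str]:
    # Naive lexical linking for illustration; the paper maps checkpoints
    # to report sentences semantically.
    hits = [s.strip() for s in report.split(".")
            if any(w.lower() in s.lower() for w in checkpoint.split())]
    # No supporting sentence -> assume a "normal" finding, per conventional
    # radiological inference.
    return hits or ["normal"]

def mine_chain(report: str) -> dict:
    qa = json.loads(gpt4o(QUESTION_PROMPT, report))    # 1) VQA item generation
    plan = json.loads(gpt4o(PLAN_PROMPT,               # 2) reasoning plan extraction
                            json.dumps({**qa, "report": report})))
    steps = [{"checkpoint": c, "evidence": link_evidence(c, report)}  # 3) evidence linking
             for c in plan]
    return {**qa, "reasoning_chain": steps}            # 4) refined chain
```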

RadRBench-CXR comprises:

  • 59,000 reasoning-annotated VQA samples for training and 1,000 for held-out testing (totaling ~301,000 reasoning steps; average 5.1 steps per sample).
  • 1.2 million "answer-only" samples (from additional image-only cases such as RSNA and SIIM datasets, where only answers—not reasoning—are available).

Cases are filtered for high pre-chain factuality via comparison to report findings (initial $R_f = 0.82$; only chains reaching $R_f = 1.0$ are included, yielding ≈90% retention) (Fan et al., 29 Apr 2025).
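
A sketch of this retention rule, assuming observations have already been extracted from chains and reports as comparable string sets (the factuality metric $R_f$ is defined formally in Section 3):

```python
def factuality(chain_obs: set[str], report_obs: set[str]) -> float:
    """R_f: fraction of chain observations supported by the report."""
    return len(chain_obs & report_obs) / len(chain_obs) if chain_obs else 0.0

def filter_chains(samples: list[dict]) -> list[dict]:
    # Retain only samples whose refined chain is fully report-supported
    # (R_f == 1.0); per the paper, roughly 90% of mined chains survive.
    return [s for s in samples
            if factuality(set(s["chain_obs"]), set(s["report_obs"])) == 1.0]
```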

2. Task Taxonomy and Data Schema

RadRBench-CXR evaluates models on five clinically relevant VQA tasks (a schematic summary follows the list):

  1. Binary disease presence: Yes/no assessments for a specific abnormality (e.g., “Cardiomegaly present?”)
  2. Single-disease multi-choice: Four-way forced choice among alternate findings.
  3. Multiple-disease multi-choice: Select relevant combination(s) among four options.
  4. Anomaly detection: Open-ended list extraction; models enumerate all clinically supported findings.
  5. Temporal comparison: Assess change in findings (worsening, stable, improving) via prior/follow-up imaging context.
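
For concreteness, the five task types and their expected answer formats can be summarized as follows; the key names and example phrasings are hypothetical illustrations, not the benchmark's actual schema:

```python
TASKS = {
    "binary":              {"answer_format": "yes | no",
                            "example": "Is cardiomegaly present?"},
    "single_choice":       {"answer_format": "exactly one of four options",
                            "example": "Which finding best explains the opacity?"},
    "multi_choice":        {"answer_format": "any subset of four options",
                            "example": "Select all findings supported by the image."},
    "anomaly_detection":   {"answer_format": "open-ended list of findings",
                            "example": "List every abnormality visible in this CXR."},
    "temporal_comparison": {"answer_format": "worsened | stable | improved",
                            "example": "Has the effusion changed since the prior study?"},
}
```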

Each reasoning sample in the benchmark encodes the following fields (an illustrative record follows the list):

  • An instruction prompt (specifying task and, optionally, “think step by step”)
  • Embedded CXR images
  • For reasoning-augmented cases: a gold, numbered chain of diagnostic observations enclosed in custom reasoning-delimiter tokens
  • The final clinical answer in an <answer>…</answer> tag
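
An illustrative (entirely hypothetical) reasoning-augmented record; the <think> delimiter shown here is an assumption, whereas the <answer> tag is as described above:

```python
sample = {
    "instruction": "Is cardiomegaly present? Think step by step.",
    "images": ["path/to/frontal_view.jpg"],  # hypothetical image path
    "target": (
        "<think>"                            # reasoning delimiter (assumed)
        "1. Assess heart size: cardiothoracic ratio appears greater than 0.5. "
        "2. Look for opacities suggesting congestion: none seen. "
        "3. Examine costophrenic angles: sharp."
        "</think>"
        "<answer>Yes</answer>"               # final clinical answer
    ),
}
```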

3. Evaluation Metric: RadRScore

RadRBench-CXR introduces RadRScore, a structured, multi-component metric for objective evaluation of generated diagnostic reasoning:

  • Factuality ($R_f$): Proportion of generated model observations that are present in the original radiology report,

$$R_f = \frac{|\mathrm{obs}_{\mathrm{model}} \cap \mathrm{obs}_{\mathrm{report}}|}{|\mathrm{obs}_{\mathrm{model}}|}$$

  • Completeness ($R_c$): Proportion of ground-truth chain observations recovered by the model,

$$R_c = \frac{|\mathrm{obs}_{\mathrm{gt}} \cap \mathrm{obs}_{\mathrm{model}}|}{|\mathrm{obs}_{\mathrm{gt}}|}$$

  • Effectiveness ($R_e$): Fraction of the model’s generated observations that are also present in the ground-truth chain,

$$R_e = \frac{|\mathrm{obs}_{\mathrm{model}} \cap \mathrm{obs}_{\mathrm{gt}}|}{|\mathrm{obs}_{\mathrm{model}}|}$$

  • Aggregate RadRScore:

$$\mathrm{RadRScore} = \frac{R_f + R_c + R_e}{3}, \qquad \mathrm{RadRScore} \in [0, 1]$$

Semantic matching, performed with GPT-4o paraphrase alignment, ensures that observation matches are clinically meaningful rather than merely string-based (Fan et al., 29 Apr 2025).
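
A minimal reference implementation of the three components and their mean, substituting exact set intersection for the GPT-4o paraphrase matcher described above:

```python
def radrscore(model_obs: set[str], gt_obs: set[str], report_obs: set[str]) -> dict:
    """Factuality, completeness, effectiveness, and their unweighted mean.

    In the benchmark, observation matching uses GPT-4o paraphrase alignment;
    plain set intersection stands in for that semantic matching here.
    """
    r_f = len(model_obs & report_obs) / len(model_obs) if model_obs else 0.0
    r_c = len(gt_obs & model_obs) / len(gt_obs) if gt_obs else 0.0
    r_e = len(model_obs & gt_obs) / len(model_obs) if model_obs else 0.0
    return {"R_f": r_f, "R_c": r_c, "R_e": r_e, "RadRScore": (r_f + r_c + r_e) / 3}
```

Under semantic matching, "enlarged cardiac silhouette" and "cardiomegaly" would count as the same observation; the set-based stand-in above matches only identical strings.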

4. Role in Model Training and Process Supervision

RadRBench-CXR enables a two-stage model training framework for process-supervised MLLMs:

  • Supervised Fine-Tuning (SFT): Mixed batches of reasoning-augmented and answer-only samples are used for autoregressive training (cross-entropy minimization) over the concatenated “reasoning chain + answer” outputs.
  • Reinforcement Learning (RL) with Process Reward: Group Relative Policy Optimization (GRPO) is employed, using a composite reward signal that combines output-format adherence, answer correctness, and process factuality ($R_f$ from RadRScore), as sketched below. Process rewards specifically incentivize clinically valid stepwise reasoning, in contrast to models supervised only on outcome labels.
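
A sketch of the composite reward under stated assumptions: the format check, answer normalization, and equal weighting are illustrative choices, while the factuality term reuses $R_f$ from RadRScore:

```python
import re

def composite_reward(output: str, gold_answer: str,
                     model_obs: set[str], report_obs: set[str]) -> float:
    # 1) Format adherence: the final answer must sit inside <answer>...</answer>.
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    r_format = 1.0 if m else 0.0
    # 2) Outcome correctness against the gold label.
    r_answer = 1.0 if m and m.group(1).strip().lower() == gold_answer.lower() else 0.0
    # 3) Process factuality: R_f of generated observations vs. the source report.
    r_fact = len(model_obs & report_obs) / len(model_obs) if model_obs else 0.0
    # Equal weighting is an assumption, not the paper's exact scheme.
    return (r_format + r_answer + r_fact) / 3.0
```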

Empirical ablations confirm that inclusion of answer-only SFT improves domain alignment (i.e., answer accuracy), while integrating a process reward in RL further boosts both RadRScore and diagnostic accuracy (Fan et al., 29 Apr 2025).

5. Experimental Outcomes and Comparative Performance

RadRBench-CXR underpins rigorous comparative studies of process-supervised MLLMs:

  • Reasoning Ability: ChestX-Reasoner (7B) achieves an average RadRScore of 0.531:
    • +16% over the best prior medical MLLM (MedDr-40B: 0.367)
    • +5.9% over the best general MLLM (GPT-4o: 0.472)
    • +18% over its Qwen2VL-7B backbone without process supervision (0.449)
  • Diagnostic Accuracy: Overall across five tasks:
    • ChestX-Reasoner: 0.781
    • CheXagent-3B: 0.756
    • GPT-4o: 0.630
    • Qwen2VL-7B: 0.615

Task-wise, ChestX-Reasoner demonstrates 3.3%–27% increases in accuracy compared to previous models. Across reasoning sub-metrics, improvements of 5–25% are observed in factuality, completeness, and effectiveness (Fan et al., 29 Apr 2025).

6. Structural and Conceptual Advances

RadRBench-CXR introduces several key methodological advances:

  • Scale: 59,000 reasoning-validated, VQA-style CXR samples with 301,000+ annotated steps, greatly exceeding prior datasets in structured process annotation.
  • Factual grounding: Supervision uses reasoning chains mined from real clinician-authored reports rather than distilled or purely synthetic chains-of-thought, substantially improving stepwise chain factuality (0.82 after report-grounded mining vs. 0.60 for chains sampled directly from GPT-4o).
  • Multidimensional reasoning evaluation: RadRScore captures not only the presence of correct outcomes but the clinical validity and completeness of intermediate reasoning steps.
  • Open resource: The dataset and benchmark are fully open-sourced to promote reproducibility and community-driven extensions.

7. Limitations and Future Prospects

RadRBench-CXR’s current scope is limited to chest X-ray imaging; extension to other modalities (CT, MRI, ultrasound) via analogous chain extraction from domain reports is considered straightforward but has not yet been demonstrated. The released model builds on a Qwen2VL-7B backbone; applying process supervision to larger or fully medical-pretrained backbones is a logical next step. Finally, the current RL process reward is rule-based factuality; moving toward learned, agentic reward models for process correctness is identified as a promising direction (e.g., the proposal for agentic reward modeling in Peng et al.) (Fan et al., 29 Apr 2025).

RadRBench-CXR thus establishes a new paradigm for evaluating and training process-supervised radiology MLLMs via large-scale, real-world stepwise reasoning chains and multidimensional process-aware metrics.

References

1. Fan et al. (29 Apr 2025). ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification.
