
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays (2505.18087v1)

Published 23 May 2025 in cs.CV and cs.AI

Abstract: Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation including visual grounding via anatomical region selection and diagnostic measurements. Even the strongest of 10 evaluated LVLMs struggle with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at https://github.com/ttumyche/CXReasonBench

Summary

Evaluating Diagnostic Reasoning of LVLMs in Chest X-rays: CXReasonBench Overview

In the quest to leverage Large Vision-Language Models (LVLMs) for clinical applications, diagnostic reasoning from medical imagery remains a formidable challenge. Although LVLMs now handle tasks such as report generation and visual question answering (VQA), current benchmarks predominantly evaluate the accuracy of the final diagnostic answer, neglecting the intermediate reasoning process. The paper "CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays" addresses this gap by introducing CheXStruct and CXReasonBench, tools for assessing whether LVLMs conduct clinically grounded reasoning from chest X-rays.

Methodology and Dataset

CheXStruct provides a structured pipeline to extract and evaluate intermediate diagnostic reasoning steps from chest X-ray images, leveraging the MIMIC-CXR-JPG dataset. It automates the derivation of clinically relevant reasoning components such as anatomical segmentation, landmark extraction, measurement computation, and the application of diagnostic thresholds following clinical criteria. Importantly, CheXStruct incorporates task-specific quality control (QC) so that only anatomically valid and clinically reliable cases enter the benchmark.
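
To make this concrete, consider cardiomegaly assessment via the cardiothoracic ratio (CTR), one representative structured task. The minimal sketch below assumes binary NumPy segmentation masks from an upstream model; the function names and the QC rule are illustrative rather than CheXStruct's actual API, though the CTR > 0.5 cutoff on a posteroanterior film is the standard clinical criterion.

```python
import numpy as np

def mask_width(mask: np.ndarray) -> int:
    """Horizontal extent, in pixels, of a binary mask."""
    cols = np.flatnonzero(mask.any(axis=0))
    return int(cols[-1] - cols[0] + 1) if cols.size else 0

def assess_cardiomegaly(heart_mask: np.ndarray, thorax_mask: np.ndarray,
                        threshold: float = 0.5):
    """Derive the cardiothoracic ratio (CTR) and apply the clinical cutoff."""
    heart_w, thorax_w = mask_width(heart_mask), mask_width(thorax_mask)
    # Task-specific quality control: discard anatomically implausible
    # segmentations before deriving any measurement (illustrative check only).
    if thorax_w == 0 or heart_w == 0 or heart_w >= thorax_w:
        return None  # fails QC; the case would be excluded
    ctr = heart_w / thorax_w
    # Standard clinical criterion: CTR > 0.5 on a PA film suggests cardiomegaly.
    return {"ctr": round(ctr, 3), "cardiomegaly": ctr > threshold}
```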

CXReasonBench builds on CheXStruct's outputs to evaluate LVLM diagnostic reasoning at a finer granularity. It integrates visual grounding components and structured decision pathways, enabling detailed scrutiny of a model's alignment with clinical practice. Spanning multiple paths and stages, from direct reasoning evaluation to guided reasoning and re-evaluation, CXReasonBench examines whether models can internalize and generalize structured diagnostic reasoning across diverse clinical tasks.
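
The multi-path protocol can be pictured as a small evaluation driver: a model is first asked for a direct diagnosis and, on failure, is routed through guided intermediate steps and then re-evaluated. The sketch below is a hypothetical rendering that assumes a generic `model.answer(prompt, images)` interface and three illustrative guided stages; the benchmark's actual paths, prompts, and scoring are defined by CXReasonBench itself.

```python
from dataclasses import dataclass

@dataclass
class StageResult:
    stage: str
    correct: bool

# Illustrative stage names; the real benchmark defines its own stages.
GUIDED_STAGES = ("select_region", "report_measurement", "apply_threshold")

def evaluate_case(model, case) -> list[StageResult]:
    """Route one case through direct diagnosis, guided reasoning, and re-evaluation."""
    results = [StageResult("direct_diagnosis",
                           model.answer(case.question, case.images) == case.label)]
    if results[0].correct:
        return results  # direct path succeeded; no guidance needed
    # Guided path: each intermediate step is scored against the
    # CheXStruct-derived ground truth for that step.
    for stage in GUIDED_STAGES:
        pred = model.answer(case.prompts[stage], case.images)
        results.append(StageResult(stage, pred == case.ground_truth[stage]))
    # Re-evaluation: can the model internalize the guidance and now
    # reach the correct final decision?
    results.append(StageResult("re_evaluation",
                               model.answer(case.question, case.images) == case.label))
    return results
```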

Experimental Findings and Analysis

The paper evaluates 10 LVLMs, revealing widespread difficulty in executing valid structured diagnostic reasoning. Notably, even top-performing models like Gemini-2.5-Pro struggle to bridge abstract diagnostic knowledge with anatomically grounded visual interpretation. This disconnect exposes the limits of current LVLMs in contextually applying diagnostic criteria and points to a prevalent reliance on heuristic shortcuts rather than engagement with structured reasoning.

Performance Trends:

  • Closed-source models generally outperform open-source models across reasoning stages, though both categories share fundamental limitations in intermediate reasoning phases.
  • Recognition-type tasks (e.g., identifying tracheal deviation) typically show higher consistency in decision alignment than measurement-type tasks, emphasizing models' reliance on visual pattern recognition rather than precise computation.

Impact of Sample Variability:

  • Stochastic sampling results indicate greater performance variability among open-source models, pointing to fragile multimodal reasoning and difficulty maintaining consistency across reasoning stages; a simple way to quantify such variability is sketched below.
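
As a rough illustration, the sketch below samples the same question several times at nonzero temperature and reports agreement with the modal answer. The temperature-controlled `model.answer` call is an assumption made for illustration, not the paper's protocol.

```python
from collections import Counter

def answer_consistency(model, prompt, images, k: int = 5,
                       temperature: float = 0.7) -> float:
    """Fraction of k stochastic samples agreeing with the modal answer:
    1.0 is perfectly stable; values near 1/k mean nearly every run differed."""
    answers = [model.answer(prompt, images, temperature=temperature)
               for _ in range(k)]
    return Counter(answers).most_common(1)[0][1] / k
```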

Opportunity for Future Research:

  • Models demonstrated the capacity for accurate computation and instruction adherence when provided with step-by-step guidance, suggesting a path forward in training paradigms that explicitly align visual grounding with structured reasoning supervision; one possible form of such supervision is sketched below.
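
Purely as an illustration of what such a training paradigm could look like (the paper does not propose this), a multi-task objective might jointly supervise the final answer, the selected anatomical region, and the intermediate measurements. The sketch uses PyTorch with hypothetical output and target keys.

```python
import torch.nn.functional as F

def structured_reasoning_loss(outputs, targets, w_ground=1.0, w_steps=1.0):
    """Hypothetical multi-task objective: supervise the final answer, the
    grounded region selection, and the intermediate reasoning quantities."""
    # Final diagnostic decision.
    loss = F.cross_entropy(outputs["answer_logits"], targets["answer"])
    # Visual grounding term: which anatomical region supports the decision.
    loss = loss + w_ground * F.cross_entropy(outputs["region_logits"],
                                             targets["region"])
    # Stepwise supervision over intermediate measurements and indices.
    loss = loss + w_steps * F.mse_loss(outputs["measurements"],
                                       targets["measurements"])
    return loss
```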

Implications and Future Directions

The development of CXReasonBench represents a crucial step toward refining LVLMs for healthcare applications, emphasizing the need for transparent, criterion-driven assessment of diagnostic reasoning. Its findings point research toward training methodologies that explicitly teach structured reasoning skills. Future work should build on this foundation by broadening the set of diagnostic tasks, integrating additional datasets, and developing instruction-tuning techniques that further improve the clinical robustness and applicability of LVLMs in radiology.

By exposing and measuring the current diagnostic reasoning gap with a robust evaluation framework, this research sets a precedent in the pursuit of clinically grounded AI models ready to transform medical imaging and diagnostics.