Evaluating Diagnostic Reasoning of LVLMs in Chest X-rays: CXReasonBench Overview
In the quest to leverage Large Vision-Language Models (LVLMs) for clinical applications, diagnostic reasoning from medical imagery remains a formidable challenge. Although LVLMs now handle tasks such as report generation and visual question answering (VQA), current benchmarks predominantly evaluate the accuracy of final diagnostic outcomes while neglecting the intermediate reasoning that produces them. The paper "CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays" addresses this gap by introducing CheXStruct and CXReasonBench, a pipeline and benchmark for assessing whether LVLMs conduct clinically grounded reasoning from chest X-rays.
Methodology and Dataset
CheXStruct is a structured pipeline that extracts and evaluates intermediate diagnostic reasoning steps from chest X-ray images, built on the MIMIC-CXR-JPG dataset. It automates the derivation of clinically relevant reasoning components: anatomical segmentation, landmark extraction, measurement computation, and the application of diagnostic thresholds that follow clinical criteria. Importantly, CheXStruct incorporates task-specific quality control (QC) so that only anatomically valid and clinically reliable cases enter the benchmark, giving it a robust mechanism for modeling the full diagnostic process.
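To make this pipeline concrete, here is a minimal sketch of the kind of measurement-and-threshold step CheXStruct automates, using the cardiothoracic ratio (CTR) as an example. The binary segmentation masks, the `min_area` QC heuristic, and all function names are illustrative assumptions, not code from the paper:

```python
import numpy as np

def max_horizontal_extent(mask: np.ndarray) -> float:
    """Widest horizontal span (in pixels) covered by a binary mask."""
    cols = np.where(mask.any(axis=0))[0]
    if cols.size == 0:
        return 0.0
    return float(cols.max() - cols.min() + 1)

def cardiothoracic_ratio(heart_mask: np.ndarray, thorax_mask: np.ndarray) -> float:
    """CTR = maximal cardiac width / maximal thoracic width."""
    thorax_w = max_horizontal_extent(thorax_mask)
    if thorax_w == 0:
        raise ValueError("empty thorax mask")
    return max_horizontal_extent(heart_mask) / thorax_w

def passes_qc(heart_mask: np.ndarray, thorax_mask: np.ndarray, min_area: int = 500) -> bool:
    """Task-specific QC: reject anatomically implausible segmentations
    (tiny masks, or a heart wider than the thorax)."""
    return (heart_mask.sum() >= min_area
            and thorax_mask.sum() >= min_area
            and cardiothoracic_ratio(heart_mask, thorax_mask) < 1.0)

def diagnose_cardiomegaly(heart_mask: np.ndarray, thorax_mask: np.ndarray,
                          threshold: float = 0.5) -> bool:
    """Apply the standard clinical criterion: CTR > 0.5 on a PA film
    suggests cardiomegaly."""
    return cardiothoracic_ratio(heart_mask, thorax_mask) > threshold
```

The point of structuring the pipeline this way is that each intermediate quantity (mask extent, ratio, threshold decision) becomes a checkable step rather than an opaque part of a single diagnostic label.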
CXReasonBench builds on CheXStruct's outputs to enable a more granular evaluation of LVLM diagnostic reasoning. It integrates visual grounding components and structured decision pathways, allowing detailed scrutiny of a model's alignment with clinical practice. Spanning multiple paths and stages, from direct reasoning to guided reasoning and re-evaluation, CXReasonBench examines whether models can internalize and generalize structured diagnostic reasoning across diverse clinical tasks.
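The following is a schematic of that staged evaluation flow as described above, not the paper's actual implementation; the `model.answer` interface and the `Case`/`Step` fields are hypothetical placeholders for whatever prompt-and-score machinery the benchmark uses:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str     # e.g., "landmarks", "measurement", "threshold"
    prompt: str   # question probing this intermediate step
    answer: str   # gold answer derived by the CheXStruct-style pipeline

@dataclass
class Case:
    image: object                 # chest X-ray (array or file path)
    question: str                 # final diagnostic question
    label: str                    # gold final diagnosis
    reasoning_steps: list = field(default_factory=list)

def evaluate_case(model, case: Case) -> dict:
    """Schematic three-path evaluation: direct reasoning, guided reasoning
    over intermediate steps, then re-evaluation of the original question
    to test whether the guidance generalizes."""
    results = {}

    # Path 1: direct diagnosis from the image alone.
    results["direct"] = model.answer(case.image, case.question) == case.label

    # Path 2: guided reasoning -- grade each intermediate step separately.
    for step in case.reasoning_steps:
        pred = model.answer(case.image, step.prompt)
        results[f"guided:{step.name}"] = pred == step.answer

    # Path 3: re-ask the original question after guided reasoning.
    results["re-evaluation"] = model.answer(case.image, case.question) == case.label
    return results
```

Scoring each stage separately is what lets the benchmark distinguish a model that merely guesses the right diagnosis from one whose intermediate steps actually support it.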
Experimental Findings and Analysis
The paper evaluates 10 LVLMs and finds widespread difficulty in executing valid structured diagnostic reasoning. Notably, even top-performing models like Gemini-2.5-Pro struggle to bridge abstract diagnostic knowledge with anatomically grounded visual interpretation. This disconnect exposes the limitations of current LVLMs in contextually applying diagnostic criteria and points to a prevalent reliance on heuristic shortcuts rather than structured reasoning.
Performance Trends:
- Closed-source models generally outperform open-source models across reasoning stages, though both categories share fundamental limitations in intermediate reasoning phases.
- Recognition-type tasks (e.g., identifying tracheal deviation) typically show higher consistency in decision alignment than measurement-type tasks, suggesting the models rely on visual pattern recognition rather than precise computation; a sketch of such a decision-alignment check follows this list.
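Decision alignment here means checking whether a model's final label agrees with the verdict implied by its own intermediate measurement. A minimal sketch, assuming the evaluation harness can parse out the model's reported measurement and final decision per case (the tuple layout is hypothetical):

```python
def decision_alignment_rate(cases) -> float:
    """Fraction of cases where the model's final decision matches the
    decision implied by its own reported measurement and the clinical
    threshold. Each case is (measurement, threshold, final_decision)."""
    aligned = sum(
        (measurement > threshold) == final_decision
        for measurement, threshold, final_decision in cases
    )
    return aligned / len(cases)

# Example: a model reports CTR = 0.62 against a 0.5 threshold but still
# answers "no cardiomegaly" (False) -- that case is misaligned.
print(decision_alignment_rate([(0.62, 0.5, False), (0.45, 0.5, False)]))  # 0.5
```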
Impact of Sample Variability:
- Stochastic sampling results show greater performance variability among open-source models, exposing unstable multimodal reasoning and difficulty maintaining consistency across reasoning stages; see the consistency sketch below.
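One simple way to quantify this kind of variability, offered here as an illustrative sketch rather than the paper's metric, is to sample several generations per question and measure agreement with the modal answer:

```python
from collections import Counter

def sampling_consistency(answers: list) -> float:
    """Fraction of stochastic samples agreeing with the modal answer.
    1.0 means perfectly stable; values near 1/len(answers) mean the
    model is answering essentially at random."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

# Example: five temperature-sampled answers to the same question.
print(sampling_consistency(["yes", "yes", "no", "yes", "yes"]))  # 0.8
```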
Opportunity for Future Research:
- Models demonstrated a capacity for computation and instruction adherence when given step-by-step guidance, suggesting a path forward: training paradigms that explicitly align visual grounding with structured reasoning supervision.
Implications and Future Directions
The development of CXReasonBench represents a crucial step toward refining LVLMs for healthcare applications, emphasizing the need for transparent, criterion-driven assessment of diagnostic reasoning. The paper's findings point toward training methodologies that explicitly teach structured reasoning skills. Future work should expand on this foundation by incorporating broader diagnostic tasks, integrating additional datasets, and developing instruction-tuning techniques that further enhance the clinical robustness and applicability of LVLMs in radiology.
By bridging the current diagnostic reasoning gap with robust evaluation frameworks, this research sets a pivotal precedent in the pursuit of clinically grounded AI models ready to transform medical imaging and diagnostics.