- The paper introduces MedRAX, an agent-based framework integrating specialized AI tools via a ReAct loop for transparent and adaptable chest X-ray interpretation.
- MedRAX outperforms general and specialized medical models on the new ChestAgentBench, achieving 63.1% accuracy on complex clinical reasoning tasks.
- MedRAX's modular architecture allows integration of diverse tools without retraining, enhancing transparency and adaptability in clinical settings.
The paper introduces MedRAX, an agent-based framework that unifies state-of-the-art chest X-ray (CXR) interpretation tools and large-scale multimodal reasoning capabilities under a structured ReAct loop. The framework uses an LLM to decompose complex medical queries into sequential analytical steps of observation, reasoning, and action. This design enables dynamic tool selection and iterative refinement, integrating heterogeneous models specialized in visual QA, segmentation, grounding, report generation, disease classification, and image synthesis.
MedRAX leverages an underlying ReAct loop that coordinates multiple specialized modules (a minimal orchestration sketch follows this list):
- Visual Question Answering (VQA): Models such as CheXagent and LLaVA-Med provide fine-grained visual reasoning on CXRs, enabling the system to answer free-form clinical queries.
- Segmentation: Tools like MedSAM and a PSPNet model trained on ChestX-Det are used to partition images into anatomically meaningful regions.
- Grounding: The framework employs Maira-2 to accurately localize textual descriptions in the imaging data.
- Report Generation and Disease Classification: A SwinV2 Transformer paired with a BERT decoder produces structured reports, while a DenseNet-121-based classifier from TorchXRayVision handles multi-label pathology classification.
- Chest X-ray Generation: RoentGen synthesizes realistic CXRs, facilitating augmented tool-based reasoning.
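The paper's own orchestration code is not reproduced here, but the loop described above can be illustrated with a minimal, hypothetical Python sketch. The names `Tool`, `AgentState`, `plan_next_step`, and `run_agent` are illustrative rather than identifiers from the MedRAX codebase, and the planner is stubbed out where a real system would call the LLM with the question, tool descriptions, and prior steps.

```python
# Minimal sketch of a ReAct-style observation/reasoning/action loop that
# dispatches to registered tools. All names here are hypothetical.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Tool:
    name: str
    description: str           # surfaced to the planner when choosing the next action
    run: Callable[[str], str]  # maps a tool query to a textual observation

@dataclass
class AgentState:
    question: str
    steps: List[dict] = field(default_factory=list)  # memory of (thought, action, observation)

def plan_next_step(state: AgentState, tools: Dict[str, Tool]) -> dict:
    """Stand-in for the LLM planner: a real system would prompt the model with
    the question, tool descriptions, and prior steps, and parse its response
    into either a tool call or a final answer."""
    if not state.steps:
        return {"thought": "Classify findings first.", "tool": "classifier", "input": state.question}
    return {"thought": "Enough evidence gathered.",
            "answer": f"Answer based on: {state.steps[-1]['observation']}"}

def run_agent(question: str, tools: Dict[str, Tool], max_steps: int = 5) -> str:
    state = AgentState(question=question)
    for _ in range(max_steps):
        step = plan_next_step(state, tools)
        if "answer" in step:                                  # planner decided to stop
            return step["answer"]
        observation = tools[step["tool"]].run(step["input"])  # act, then observe
        state.steps.append({**step, "observation": observation})
    return "max steps reached without a final answer"

if __name__ == "__main__":
    tools = {"classifier": Tool("classifier", "multi-label CXR pathology classifier",
                                lambda q: "pneumothorax: 0.91, effusion: 0.12")}
    print(run_agent("Is there a pneumothorax in this CXR?", tools))
```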
The technical novelty of MedRAX lies in its ability to orchestrate these disparate tools without requiring additional training, thus promoting adaptability and transparency in decision-making. A memory component maintains past interactions and outputs, which is crucial for multi-turn dialogues and avoiding redundant computations. The framework is modular by design, allowing easy integration of new tools through well-defined input/output specifications, and it can be deployed in various clinical settings—from local installations to cloud-based architectures—fulfilling data privacy requirements.
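As a rough illustration of the memory component described above (again with hypothetical names, not the paper's actual implementation), a simple cache keyed by tool name and input lets a multi-turn dialogue reuse earlier tool outputs instead of recomputing them:

```python
# Hypothetical sketch of a memory layer: tool outputs are cached by
# (tool name, input) and dialogue turns are logged for later context.
from typing import Callable, Dict, List, Tuple

class ToolMemory:
    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, str], str] = {}
        self.dialogue: List[Dict[str, str]] = []  # chronological record of turns

    def call(self, tool_name: str, tool_fn: Callable[[str], str], tool_input: str) -> str:
        key = (tool_name, tool_input)
        if key not in self._cache:            # run the tool only on a cache miss
            self._cache[key] = tool_fn(tool_input)
        return self._cache[key]

    def log_turn(self, role: str, content: str) -> None:
        self.dialogue.append({"role": role, "content": content})
```

Repeated queries within a session then hit the cache rather than re-running expensive segmentation or report-generation models.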
A critical contribution of the paper is the development of ChestAgentBench, a benchmark built from 675 expert-curated clinical cases, from which 2,500 six-choice questions are generated. The benchmark organizes evaluation into seven clinically relevant categories (an illustrative scoring sketch follows the list):
- Detection
- Classification
- Localization
- Comparison
- Relationship
- Diagnosis
- Characterization
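A scoring harness for such a benchmark is straightforward to sketch. The snippet below is illustrative only; the field names `category`, `options`, and `answer` are assumptions about the question format, not the released ChestAgentBench schema. It reports overall accuracy together with a per-category breakdown.

```python
# Illustrative scoring sketch (not the released ChestAgentBench code):
# accuracy over six-choice questions, broken down by category.
from collections import defaultdict
from typing import Callable, Iterable

def evaluate(questions: Iterable[dict], answer_fn: Callable[[dict], str]) -> dict:
    """Each question dict is assumed to hold 'category', 'options' (6 choices),
    and 'answer' (the correct option letter)."""
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        total[q["category"]] += 1
        if answer_fn(q) == q["answer"]:
            correct[q["category"]] += 1
    per_category = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return {"overall": overall, "per_category": per_category}
```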
The quantitative evaluation shows that MedRAX achieves an overall accuracy of 63.1% on ChestAgentBench, outperforming general-purpose models such as GPT-4o (56.4%) and Llama-3.2-90B (57.9%), as well as specialized medical models like CheXagent (39.5%) and LLaVA-Med (28.7%). On the CheXbench evaluation, which comprises visual QA tasks (the Rad-Restruct and SLAKE datasets) and fine-grained image-text reasoning, the framework likewise outperforms the other baselines, reaching 68.7% on Rad-Restruct and 82.9% on SLAKE.
The paper also discusses detailed case studies demonstrating MedRAX's proficiency in resolving conflicting tool outputs. In one instance, MedRAX successfully identifies a chest tube by integrating conclusions from both report generation and visual QA, despite conflicting interpretations from one of its modules. In another case, the system accurately diagnoses a left pneumothorax by sequentially leveraging disease identification and segmentation outputs, rectifying misinterpretations exhibited by end-to-end models.
Key insights articulated in the discussion include:
- Task Decomposition: ReAct-based iterative reasoning improves transparency and interpretability and makes errors easier to trace when multiple domain-specific tools are integrated.
- Generalist versus Specialist Capabilities: The findings indicate that general-purpose foundation models, when equipped with structured tool orchestration, can surpass specialized medical models on complex reasoning tasks.
- Limitations and Future Directions: While the framework performs strongly, challenges remain in resolving contradictory outputs from different modules and in managing the computational overhead of running multiple specialized tools. The authors suggest exploring reinforcement learning strategies for uncertainty-aware reasoning and more balanced tool utilization to further optimize clinical decision-making.
Overall, the paper presents a cohesive and implementable framework that not only achieves superior performance on standardized benchmarks but also offers a transparent, modular, and adaptable architecture for clinically relevant CXR interpretation.