- The paper introduces MedRAX, an agent-based framework integrating specialized AI tools via a ReAct loop for transparent and adaptable chest X-ray interpretation.
- MedRAX outperforms general and specialized medical models on the new ChestAgentBench, achieving 63.1% accuracy on complex clinical reasoning tasks.
- MedRAX's modular architecture allows integration of diverse tools without retraining, enhancing transparency and adaptability in clinical settings.
The paper introduces MedRAX, an agent-based framework that unifies state-of-the-art chest X-ray (CXR) interpretation tools and large-scale multimodal reasoning capabilities under a structured ReAct loop. The framework uses an LLM to decompose complex medical queries into sequential analytical steps of observation, reasoning, and action. This design enables dynamic tool selection and iterative refinement, integrating heterogeneous models specialized in visual QA, segmentation, grounding, report generation, disease classification, and image synthesis.
MedRAX leverages an underlying ReAct loop that coordinates multiple specialized modules (a minimal orchestration sketch follows this list):
- Visual Question Answering (VQA): Models such as CheXagent and LLaVA-Med provide fine-grained visual reasoning on CXRs, enabling the system to answer free-form clinical queries.
- Segmentation: Tools like MedSAM and a PSPNet model trained on ChestX-Det are used to partition images into anatomically meaningful regions.
- Grounding: The framework employs Maira-2 to accurately localize textual descriptions in the imaging data.
- Report Generation and Disease Classification: A SwinV2 Transformer paired with a BERT decoder produces structured reports, while a DenseNet-121-based classifier from TorchXRayVision handles multi-label pathology classification.
- Chest X-ray Generation: RoentGen synthesizes realistic CXRs, facilitating augmented tool-based reasoning.
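The paper's own orchestration code is not reproduced here, but the loop described above can be illustrated with a minimal, hypothetical Python sketch. The names `Tool`, `AgentState`, `plan_next_step`, and `run_agent` are illustrative rather than identifiers from the MedRAX codebase, and the planner is stubbed out where a real system would call the LLM with the question, tool descriptions, and prior steps.

```python
# Minimal sketch of a ReAct-style observation/reasoning/action loop that
# dispatches to registered tools. All names here are hypothetical.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Tool:
    name: str
    description: str           # surfaced to the planner when choosing the next action
    run: Callable[[str], str]  # maps a tool query to a textual observation

@dataclass
class AgentState:
    question: str
    steps: List[dict] = field(default_factory=list)  # memory of (thought, action, observation)

def plan_next_step(state: AgentState, tools: Dict[str, Tool]) -> dict:
    """Stand-in for the LLM planner: a real system would prompt the model with
    the question, tool descriptions, and prior steps, and parse its response
    into either a tool call or a final answer."""
    if not state.steps:
        return {"thought": "Classify findings first.", "tool": "classifier", "input": state.question}
    return {"thought": "Enough evidence gathered.",
            "answer": f"Answer based on: {state.steps[-1]['observation']}"}

def run_agent(question: str, tools: Dict[str, Tool], max_steps: int = 5) -> str:
    state = AgentState(question=question)
    for _ in range(max_steps):
        step = plan_next_step(state, tools)
        if "answer" in step:                                  # planner decided to stop
            return step["answer"]
        observation = tools[step["tool"]].run(step["input"])  # act, then observe
        state.steps.append({**step, "observation": observation})
    return "max steps reached without a final answer"

if __name__ == "__main__":
    tools = {"classifier": Tool("classifier", "multi-label CXR pathology classifier",
                                lambda q: "pneumothorax: 0.91, effusion: 0.12")}
    print(run_agent("Is there a pneumothorax in this CXR?", tools))
```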
The technical novelty of MedRAX lies in its ability to orchestrate these disparate tools without requiring additional training, thus promoting adaptability and transparency in decision-making. A memory component maintains past interactions and outputs, which is crucial for multi-turn dialogues and avoiding redundant computations. The framework is modular by design, allowing easy integration of new tools through well-defined input/output specifications, and it can be deployed in various clinical settings—from local installations to cloud-based architectures—fulfilling data privacy requirements.
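As a rough illustration of the memory component described above (again with hypothetical names, not the paper's actual implementation), a simple cache keyed by tool name and input lets a multi-turn dialogue reuse earlier tool outputs instead of recomputing them:

```python
# Hypothetical sketch of a memory layer: tool outputs are cached by
# (tool name, input) and dialogue turns are logged for later context.
from typing import Callable, Dict, List, Tuple

class ToolMemory:
    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, str], str] = {}
        self.dialogue: List[Dict[str, str]] = []  # chronological record of turns

    def call(self, tool_name: str, tool_fn: Callable[[str], str], tool_input: str) -> str:
        key = (tool_name, tool_input)
        if key not in self._cache:            # run the tool only on a cache miss
            self._cache[key] = tool_fn(tool_input)
        return self._cache[key]

    def log_turn(self, role: str, content: str) -> None:
        self.dialogue.append({"role": role, "content": content})
```

Repeated queries within a session then hit the cache rather than re-running expensive segmentation or report-generation models.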
A critical contribution of the paper is the development of ChestAgentBench, a benchmark built from 675 expert-curated clinical cases, from which 2,500 six-choice questions are generated. The benchmark organizes evaluation into seven clinically relevant categories (an illustrative scoring sketch follows the list):
- Detection
- Classification
- Localization
- Comparison
- Relationship
- Diagnosis
- Characterization
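A scoring harness for such a benchmark is straightforward to sketch. The snippet below is illustrative only; the field names `category`, `options`, and `answer` are assumptions about the question format, not the released ChestAgentBench schema. It reports overall accuracy together with a per-category breakdown.

```python
# Illustrative scoring sketch (not the released ChestAgentBench code):
# accuracy over six-choice questions, broken down by category.
from collections import defaultdict
from typing import Callable, Iterable

def evaluate(questions: Iterable[dict], answer_fn: Callable[[dict], str]) -> dict:
    """Each question dict is assumed to hold 'category', 'options' (6 choices),
    and 'answer' (the correct option letter)."""
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        total[q["category"]] += 1
        if answer_fn(q) == q["answer"]:
            correct[q["category"]] += 1
    per_category = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return {"overall": overall, "per_category": per_category}
```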
The quantitative evaluation shows that MedRAX achieves an overall accuracy of 63.1% on ChestAgentBench, outperforming general-purpose models such as GPT-4o (56.4%) and Llama-3.2-90B (57.9%), as well as specialized medical models like CheXagent (39.5%) and LLaVA-Med (28.7%). On the CheXbench evaluation, which comprises visual QA tasks (the Rad-Restruct and SLAKE datasets) and fine-grained image-text reasoning, the framework likewise outperforms the other baselines, reaching 68.7% on Rad-Restruct and 82.9% on SLAKE.
The paper also discusses detailed case studies demonstrating MedRAX's proficiency in resolving conflicting tool outputs. In one instance, MedRAX successfully identifies a chest tube by integrating conclusions from both report generation and visual QA, despite conflicting interpretations from one of its modules. In another case, the system accurately diagnoses a left pneumothorax by sequentially leveraging disease identification and segmentation outputs, rectifying misinterpretations exhibited by end-to-end models.
Key insights articulated in the discussion include:
- Task Decomposition: ReAct-based iterative reasoning improves transparency and interpretability and makes errors easier to trace when multiple domain-specific tools are integrated.
- Generalist versus Specialist Capabilities: The findings indicate that general-purpose foundation models, when equipped with structured tool orchestration, can surpass specialized medical models on complex reasoning tasks.
- Limitations and Future Directions: While the framework performs strongly, challenges remain in resolving contradictory outputs from different modules and in managing the computational overhead of running multiple specialized tools. The authors suggest exploring reinforcement learning strategies for uncertainty-aware reasoning and more balanced tool utilization to further optimize clinical decision-making.
Overall, the paper presents a cohesive and implementable framework that not only achieves superior performance on standardized benchmarks but also offers a transparent, modular, and adaptable architecture for clinically relevant CXR interpretation.