- The paper presents a new benchmark designed to evaluate LLM agents' multi-hop reasoning in complex clinical scenarios, revealing significant performance gaps.
- It demonstrates that initializing agents with structured resources like HemOnc.org significantly boosts accuracy in tasks such as PMID extraction.
- The study underscores the need for robust retrieval, synthesis, and conflict resolution strategies to meet the rigorous demands of clinical decision support.
The paper "MedBrowseComp: Benchmarking Medical Deep Research and Computer Use" (2505.14963) introduces a new benchmark designed to evaluate the capability of LLM based agents to perform reliable medical information seeking and synthesis from live, domain-specific knowledge bases. The authors highlight the increasing vision of LLMs as clinical decision-support tools but point out that existing evaluations often fall short of the rigor required for safe clinical reasoning. These evaluations typically rely on synthetic data, focus on single-hop factoid queries, or blend reasoning with open-ended generation, which doesn't accurately reflect real-world clinical needs.
To address this gap, MedBrowseComp provides over 1,000 human-curated questions structured to mirror clinical scenarios. These questions require agents to integrate information from heterogeneous sources, such as clinical trial records, primary studies, and regulatory documents, and often involve reconciling fragmented or conflicting information to reach an up-to-date conclusion. A key characteristic of MedBrowseComp is its focus on multi-hop reasoning, where an agent must navigate through multiple pieces of information or steps to arrive at the answer. The benchmark also emphasizes retrieving information from live, domain-specific sources rather than static datasets or the model's pre-existing knowledge.
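To make the multi-hop framing concrete, here is a purely hypothetical sketch of how one such question might decompose into navigation-and-extraction hops; the structure and all values are placeholders for illustration, not items from the benchmark.

```python
# Purely illustrative: one way to picture a multi-hop question as the chain of
# lookups an agent must complete. Field names and values are placeholders and
# are not drawn from MedBrowseComp itself.
hypothetical_item = {
    "question": (
        "For the registration trial of drug X in condition Y, "
        "what is the PMID of its primary publication?"
    ),
    "hops": [
        {"step": 1, "navigate": "regulatory record",   "extract": "registration trial ID"},
        {"step": 2, "navigate": "trial registry entry", "extract": "primary publication"},
        {"step": 3, "navigate": "literature index",     "extract": "PMID"},
    ],
    "answer_type": "PMID",  # a short, verifiable identifier
}
```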
The paper evaluates several frontier agentic systems on MedBrowseComp and finds significant performance shortfalls. Current LLM agents struggle particularly with multi-hop questions: performance is higher on 1-hop queries but drops considerably as the number of reasoning steps increases. The benchmark introduces a "REAL accuracy" metric that excludes responses such as "Not applicable," penalizing models that avoid answering challenging questions and testing whether they can provide a definitive, correct answer when one exists. Even under the less strict total-accuracy metric, performance is low, and REAL accuracy highlights the critical gap between current capabilities and the strict accuracy demands of clinical settings. For instance, one of the best-performing models achieved a REAL accuracy of only 24.5% on the MedBrowseComp-605 subset, with near-zero performance on 4-hop and 5-hop questions under this strict metric.
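As a concrete, hedged illustration, the sketch below contrasts a total-accuracy score with a REAL-style score in which "Not applicable" responses never earn credit. The function names, answer normalization, and exact abstention handling are assumptions for illustration, not the paper's implementation.

```python
def total_accuracy(preds, golds):
    """Fraction of questions answered correctly, with every question counted."""
    correct = sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, golds))
    return correct / len(golds)


def real_accuracy(preds, golds, abstentions=("not applicable", "n/a")):
    """Stricter score: abstaining answers such as 'Not applicable' never earn
    credit, so a model cannot improve its score by dodging hard questions.
    One plausible reading of the metric described above, not the paper's
    exact definition."""
    correct = sum(
        p.strip().lower() == g.strip().lower() and p.strip().lower() not in abstentions
        for p, g in zip(preds, golds)
    )
    return correct / len(golds)


# Toy usage with placeholder answers (not benchmark data):
preds = ["12345678", "Not applicable", "87654321"]
golds = ["12345678", "Not applicable", "11111111"]
print(total_accuracy(preds, golds))  # 2/3: the abstention happens to match the reference
print(real_accuracy(preds, golds))   # 1/3: the abstention earns no credit
```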
The paper also explores the performance of a Computer Use Agent (CUA) and investigates the impact of starting the agent from a structured, domain-specific resource such as HemOnc.org. The authors found that initializing tasks from the HemOnc.org homepage significantly improved accuracy on structured extraction tasks compared to starting with no defined entry point: accuracy for PMID extraction, for example, rose from 11.57% to 42.98%. This suggests that giving agents structured, high-quality starting points can be crucial for complex real-world tasks, reducing ambiguity and minimizing navigation errors.
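The entry-point effect can be pictured as a small difference in how a browsing task is specified. The sketch below is hypothetical: the `BrowsingTask` structure and its fields are illustrative assumptions, not the paper's CUA interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BrowsingTask:
    """Hypothetical task specification for a computer-use agent."""
    instruction: str
    start_url: Optional[str] = None  # None = the agent must find an entry point itself


# Open-ended variant: the agent chooses where to begin, so ambiguity and
# navigation errors are more likely.
open_ended = BrowsingTask(
    instruction="Find the PMID of the primary publication for <trial>.",
)

# Anchored variant: starting from a curated resource such as HemOnc.org
# constrains the search space, which the paper associates with the jump
# from 11.57% to 42.98% accuracy on PMID extraction.
anchored = BrowsingTask(
    instruction="Find the PMID of the primary publication for <trial>.",
    start_url="https://hemonc.org",
)
```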
From a practical implementation perspective, the findings underscore the need for:
- Improved multi-hop reasoning capabilities in LLM agents designed for knowledge-intensive domains like medicine.
- Robust retrieval and synthesis mechanisms that can reliably extract and integrate information from diverse, dynamic sources.
- Strategies for handling the conflicting or fragmented information common in medical literature (a minimal heuristic is sketched after this list).
- Consideration of user interface and workflow design that provides agents (and potentially human users) with optimal entry points into complex knowledge bases.
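As a toy illustration of the conflict-handling point above, the sketch below prefers the most recently published claim and flags disagreement for review. The `Evidence` schema and the recency heuristic are assumptions chosen for illustration, not the paper's approach.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Evidence:
    """A retrieved claim with minimal provenance (hypothetical schema)."""
    claim: str
    source: str
    published: date


def reconcile(evidence: list[Evidence]) -> tuple[str, bool]:
    """Toy heuristic, not the paper's method: prefer the most recently
    published claim and flag whether the sources disagreed, so a downstream
    agent or human reviewer knows reconciliation took place."""
    latest = max(evidence, key=lambda e: e.published)
    conflicting = len({e.claim for e in evidence}) > 1
    return latest.claim, conflicting


# Example with placeholder claims (not real clinical guidance):
answer, disputed = reconcile([
    Evidence("regimen A is preferred first-line", "older guideline", date(2021, 3, 1)),
    Evidence("regimen B is preferred first-line", "updated trial report", date(2024, 6, 15)),
])
# answer == "regimen B is preferred first-line", disputed == True
```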
The benchmark and associated resources (dataset available on Hugging Face at https://huggingface.co/datasets/AIM-Harvard/MedBrowseComp and code on GitHub at https://github.com/shan23chen/MedBrowseComp) are intended to serve as a testbed for developing and evaluating future generations of medical AI systems, pushing them toward greater reliability and accuracy in complex information-seeking tasks. The authors also note important ethical considerations: the benchmark is built from publicly available data, it is intended for research purposes only, and real-world clinical deployment requires qualified human oversight given current system limitations. An appendix details the benchmark settings, LLM-as-judge prompts, and computer-use agent prompts, facilitating reproducibility and further research.
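For readers who want to inspect the questions, a minimal loading sketch is shown below. The repository name comes from the link above, while the available configurations, splits, and field names are assumptions to verify against the dataset card.

```python
from datasets import load_dataset

# Repository name taken from the link above; the available configs/splits are
# assumptions here, so check the dataset card if load_dataset asks for a config.
ds = load_dataset("AIM-Harvard/MedBrowseComp")

print(ds)               # list the available splits and their sizes
split = next(iter(ds))  # pick whichever split is listed first
print(ds[split][0])     # inspect one question record's fields
```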