MedBrowseComp: Benchmarking Medical Deep Research and Computer Use (2505.14963v1)

Published 20 May 2025 in cs.CL

Abstract: LLMs are increasingly envisioned as decision-support tools in clinical practice, yet safe clinical reasoning demands integrating heterogeneous knowledge bases -- trials, primary studies, regulatory documents, and cost data -- under strict accuracy constraints. Existing evaluations often rely on synthetic prompts, reduce the task to single-hop factoid queries, or conflate reasoning with open-ended generation, leaving their real-world utility unclear. To close this gap, we present MedBrowseComp, the first benchmark that systematically tests an agent's ability to reliably retrieve and synthesize multi-hop medical facts from live, domain-specific knowledge bases. MedBrowseComp contains more than 1,000 human-curated questions that mirror clinical scenarios where practitioners must reconcile fragmented or conflicting information to reach an up-to-date conclusion. Applying MedBrowseComp to frontier agentic systems reveals performance shortfalls as low as ten percent, exposing a critical gap between current LLM capabilities and the rigor demanded in clinical settings. MedBrowseComp therefore offers a clear testbed for reliable medical information seeking and sets concrete goals for future model and toolchain upgrades. You can visit our project page at: https://moreirap12.github.io/mbc-browse-app/


Summary

  • The paper presents a new benchmark designed to evaluate LLM agents' multi-hop reasoning in complex clinical scenarios, revealing significant performance gaps.
  • It demonstrates that initializing agents with structured resources like HemOnc.org significantly boosts accuracy in tasks such as PMID extraction.
  • The study underscores the need for robust retrieval, synthesis, and conflict resolution strategies to meet the rigorous demands of clinical decision support.

The paper "MedBrowseComp: Benchmarking Medical Deep Research and Computer Use" (2505.14963) introduces a new benchmark designed to evaluate the capability of LLM based agents to perform reliable medical information seeking and synthesis from live, domain-specific knowledge bases. The authors highlight the increasing vision of LLMs as clinical decision-support tools but point out that existing evaluations often fall short of the rigor required for safe clinical reasoning. These evaluations typically rely on synthetic data, focus on single-hop factoid queries, or blend reasoning with open-ended generation, which doesn't accurately reflect real-world clinical needs.

To address this gap, MedBrowseComp provides over 1,000 human-curated questions structured to mirror clinical scenarios. These questions require agents to integrate heterogeneous knowledge bases, such as clinical trials, primary studies, and regulatory documents, and often involve reconciling fragmented or conflicting information to reach an up-to-date conclusion. A key characteristic of MedBrowseComp is its focus on multi-hop reasoning, where an agent must navigate through multiple pieces of information or steps to arrive at the answer. The benchmark also emphasizes retrieving information from live, domain-specific sources, rather than static datasets or the model's pre-existing knowledge.
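To make the task format concrete, the sketch below shows one plausible way a multi-hop item could be represented. The field names and the example question are hypothetical illustrations, not the benchmark's actual schema; consult the released dataset for the real layout.

```python
from dataclasses import dataclass, field

@dataclass
class MultiHopItem:
    """Hypothetical record layout for a MedBrowseComp-style question (illustrative only)."""
    question: str                     # human-curated clinical question
    hops: int                         # number of retrieval/reasoning steps required
    gold_answer: str                  # single verifiable answer, e.g. a PMID or drug name
    sources: list[str] = field(default_factory=list)  # live knowledge bases involved

example = MultiHopItem(
    question=("For the phase 3 trial behind a given first-line regimen, "
              "what is the PMID of its primary publication?"),
    hops=3,
    gold_answer="00000000",           # placeholder, not a real PMID
    sources=["HemOnc.org", "ClinicalTrials.gov", "PubMed"],
)
```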

The paper evaluates several frontier agentic systems on the MedBrowseComp benchmark, revealing significant performance shortfalls. Current LLM agents struggle particularly with multi-hop questions: performance is higher on 1-hop queries but drops considerably as the number of reasoning steps increases. The benchmark introduces a "REAL accuracy" metric, which excludes responses like "Not applicable" so that models cannot earn credit by avoiding challenging questions; the aim is to test whether they can provide a definitive, correct answer when one exists. Even under the less strict total-accuracy metric, performance is low, and REAL accuracy highlights the critical gap between current capabilities and the strict accuracy demands of clinical settings. For instance, one of the best-performing models achieved a REAL accuracy of only 24.5% across the MedBrowseComp 605 dataset, with near-zero performance on 4-hop and 5-hop questions under this strict metric.
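Based on the description above, one plausible reading of the scoring is sketched below: abstentions such as "Not applicable" never count as correct, so REAL accuracy can only be at most the total accuracy. This is an illustrative interpretation, not the paper's actual scoring code, which also relies on an LLM-as-judge for answer matching.

```python
def total_accuracy(preds: list[str], golds: list[str]) -> float:
    """Plain accuracy: every question counts; exact-match scoring assumed."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def real_accuracy(preds: list[str], golds: list[str], abstain: str = "Not applicable") -> float:
    """Stricter variant: only definitive (non-abstaining) answers that match the
    gold label score, so dodging hard questions with 'Not applicable' earns no credit."""
    correct = sum(p != abstain and p == g for p, g in zip(preds, golds))
    return correct / len(golds)

# Toy illustration (assumes some gold labels can themselves be 'Not applicable'):
golds = ["12345678", "Not applicable", "22222222"]
preds = ["12345678", "Not applicable", "Not applicable"]
print(total_accuracy(preds, golds))  # 2/3: the abstention on item 2 happens to match the gold
print(real_accuracy(preds, golds))   # 1/3: abstentions never score
```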

The paper also examines a Computer Use Agent (CUA) and investigates the impact of starting the agent from a structured, domain-specific resource such as HemOnc.org. The authors found that initializing tasks from the HemOnc.org homepage significantly improved the agent's accuracy on structured extraction tasks compared to starting with no defined entry point; for example, accuracy on PMID extraction rose from 11.57% to 42.98%. This suggests that providing agents with structured, high-quality starting points can be crucial for performance on complex real-world tasks, reducing ambiguity and minimizing navigation errors.
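A minimal harness for reproducing this kind of comparison might look like the sketch below. The agent callable and the question format are assumptions standing in for whatever computer-use system and data loader are actually used; only the reported accuracy figures come from the paper.

```python
from typing import Callable, Optional

def entry_point_accuracy(
    questions: list[dict],
    agent: Callable[[str, Optional[str]], str],   # (task, start_url) -> answer; hypothetical interface
    entry_url: Optional[str] = None,
) -> float:
    """Score an agent on extraction questions, optionally anchoring every task
    at a fixed entry page (e.g. the HemOnc.org homepage) instead of a blank start."""
    correct = sum(agent(q["question"], entry_url) == q["gold_answer"] for q in questions)
    return correct / len(questions)

# Usage sketch (values in comments are those reported in the paper):
# no_anchor = entry_point_accuracy(pmid_questions, my_cua)                        # ~11.6%
# anchored  = entry_point_accuracy(pmid_questions, my_cua, "https://hemonc.org")  # ~43.0%
```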

From a practical implementation perspective, the findings underscore the need for:

  • Improved multi-hop reasoning capabilities in LLM agents designed for knowledge-intensive domains like medicine.
  • Robust retrieval and synthesis mechanisms that can reliably extract and integrate information from diverse, dynamic sources.
  • Strategies for handling conflicting or fragmented information common in medical literature.
  • Consideration of user interface and workflow design that provides agents (and potentially human users) with optimal entry points into complex knowledge bases.

The benchmark and associated resources (the dataset on Hugging Face at https://huggingface.co/datasets/AIM-Harvard/MedBrowseComp and the code on GitHub at https://github.com/shan23chen/MedBrowseComp) are intended to serve as a testbed for developing and evaluating future generations of medical AI systems, pushing them toward greater reliability and accuracy in complex information-seeking tasks. The authors also include important ethical considerations: the benchmark is built from publicly available data and is intended for research purposes only, and real-world clinical deployment requires qualified human oversight given current system limitations. An appendix details the benchmark settings, LLM-as-judge prompts, and computer-use agent prompts, facilitating reproducibility and further research.
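For readers who want to inspect the questions directly, the dataset can be pulled from the Hugging Face Hub. The snippet below is a minimal loading sketch; whether a default configuration exists is an assumption and should be checked against the dataset card.

```python
# Requires: pip install datasets
from datasets import load_dataset

# If the repository defines multiple configurations, pass the config name
# as the second argument (see the dataset card for the actual layout).
ds = load_dataset("AIM-Harvard/MedBrowseComp")

print(ds)                # available splits and column schema
split = next(iter(ds))   # first available split name
print(ds[split][0])      # one question record
```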
