Papers
Topics
Authors
Recent
Search
2000 character limit reached

MedAgent-Zero: Zero-Shot Medical Reasoning

Updated 17 March 2026
  • MedAgent-Zero is a framework that partitions a single LLM into specialized agents simulating expert medical reasoning without additional training.
  • It employs distinct roles for question analysis, option evaluation, summarization, and iterative consensus to ensure robust decision-making.
  • Empirical evaluation shows that MedAgent-Zero achieves state-of-the-art zero-shot accuracy on medical QA benchmarks, outperforming methods like CoT and few-shot baselines.

MedAgent-Zero is a collaborative, agent-based framework designed to harness the latent medical reasoning capabilities of LLMs in a zero-shot, training-free setting. Developed specifically to address domain-adaptation challenges in medicine, MedAgent-Zero systematically partitions a single LLM instance into specialized agents with distinct medical roles. Each agent role is invoked through carefully designed prompts, enabling nuanced, multi-expert discourse, iterative summarization, critical debate, and consensus-driven decision-making, all without in-context demonstrations or parameter updates. Empirical evaluation demonstrates that MedAgent-Zero achieves state-of-the-art zero-shot accuracy on a suite of medical QA benchmarks, surpassing both simple prompting and advanced chain-of-thought (CoT) methods (Tang et al., 2023).

1. Agent Roles and Overall Architecture

MedAgent-Zero models a single LLM, such as GPT-3.5-Turbo or GPT-4, as a coordinated team of virtual agents, each simulating expertise in distinct medical subfields. The primary roles include:

  • Question-Domain Experts (QD): A set of mm agents, each adopting a system role corresponding to a selected medical specialty (e.g., cardiology, pulmonology). These agents initially interpret the clinical scenario.
  • Option-Domain Experts (OD): nn agents, each responsible for adjudicating answer options based on their designated medical subfields.
  • Medical Report Assistant (Summarizer): This agent synthesizes all individual expert analyses into a coherent, structured report emphasizing “Key Knowledge” and “Total Analysis.”
  • Collaborative Voters/Editors: The union of QD and OD agents participate as reviewers, voting on report drafts, proposing edits, and facilitating iterative consensus.
  • Decision Maker: Given the unanimously approved report, a final role determines the single best answer.

Each agent is purely instantiated via prompt engineering using constructs such as “You are a [Clinical Domain] Expert,” with no model fine-tuning, gradient updates, or external context.

2. Stage-Wise Workflow and Mathematical Formulation

MedAgent-Zero executes medical reasoning in five critical stages:

  1. Expert Gathering:
    • QD = LLM(qq, rqdr_{qd}, promptqdprompt_{qd})
    • OD = LLM(qq, opop, rodr_{od}, promptodprompt_{od})
  2. Individual Analyses:
    • For each qdiQDqd_i \in QD: qai=LLM(q;rqa,promptqa)qa_i = LLM(q; r_{qa}, prompt_{qa})
    • For each odjODod_j \in OD: oaj=LLM(q,op,{qai};roa,promptoa)oa_j = LLM(q, op, \{qa_{i}\}; r_{oa}, prompt_{oa})
  3. Report Summarization:
    • Repo0=LLM({qai},{oaj};rrs,promptrs)\text{Repo}_0 = LLM(\{qa_i\}, \{oa_j\}; r_{rs}, prompt_{rs})
  4. Iterative Multi-Round Discussion:
    • For D=QDODD = QD \cup OD, perform collaborative consultation:
      • Each agent did_i votes (“yes”/“no”) using LLM(Rcur;role=di,prompt=pvote)LLM(R_{cur}; role=d_i, prompt=p_{vote})
      • If any vote is “no”, LLM(Rcur;role=di,prompt=pmod)LLM(R_{cur}; role=d_i, prompt=p_{mod}) proposes edits
      • Summarizer integrates all suggested edits with LLM(Rcur,{Modi};rrs,promptrev)LLM(R_{cur}, \{Mod_i\}; r_{rs}, prompt_{rev})
      • Repeat until all votes are “yes” or maximum iteration kk is reached
    • Final output is the unanimous report RfR_f, where consensus is defined as nagree=i=1N1(votei=yes)=Nn_{agree} = \sum_{i=1}^N \mathbb{1}(\text{vote}_i = “yes”) = N
  5. Final Decision Making:
    • ans=LLM(q,op,Rf;rdm,promptdm)ans = LLM(q, op, R_f; r_{dm}, prompt_{dm}), yielding an answer of the form “Option: [A/B/C/D/E]”

All information flow is unidirectional: experts \Rightarrow analyses \Rightarrow synthesis \Rightarrow consensus \Rightarrow decision, and no learning or memory is retained across tasks.

3. Zero-Shot Operational Principles

MedAgent-Zero strictly operates in a zero-shot regime—no in-context exemplars or few-shot demonstrations are used at any pipeline stage. Only natural-language prompts and role designations guide behavior. Sampling parameters are consistently set to temperature = 1.0, top_p = 1.0, except for self-consistency (SC) runs (temperature = 0.7, 5 samples). Expert agents are initialized with simple, explicit identity statements (e.g., “You are a [Cardiology Expert]”), which structure reasoning pipelines without reliance on additional data modalities. There is no model adaptation, training, or learning between queries.

4. Empirical Evaluation and Benchmarking

MedAgent-Zero was evaluated on nine medical QA datasets: MedQA (USMLE), MedMCQA (AIIMS/NEET PG), PubMedQA, and six medical subtasks from MMLU (Anatomy, Clinical Knowledge, College Medicine, Medical Genetics, Professional Medicine, College Biology). For each dataset, accuracy was measured on a randomly selected sample of 300 questions, using GPT-3.5-Turbo and GPT-4 accessed through Azure OpenAI endpoints. Results are summarized below:

Model Zero-shot Zero-shot+CoT+SC MedAgents
GPT-3.5 67.8% 70.9% 72.1%
GPT-4 80.6% 83.0% 86.7%

MedAgent-Zero achieved the highest average zero-shot accuracy across all datasets, outperforming not only zero-shot and CoT+SC (Chain-of-Thought with Self-Consistency) prompting, but also strong few-shot CoT baselines and Flan-PaLM.

5. Ablation Analysis and Component Contribution

Ablation studies on MedMCQA revealed the incremental effect of each modular stage within the MedAgent-Zero pipeline:

  • Direct prompting: 49.0% accuracy
  • Chain-of-Thought (CoT): 55.0% (+6.0)
  • +Analysis Proposition (“Anal”): 62.0% (+7.0)
  • +Summarization (“Summ”): 65.0% (+3.0)
  • +Collaborative Consultation (“Cons”): 67.0% (+2.0)

Peak performance was observed with m=5m=5 question experts and n=2n=2 option experts, with accuracy rising steadily as these counts increased. Error analysis (40 failures annotated by humans) attributed 77% of errors to domain-knowledge gaps (either omission or misretrieval), 8% to CoT hallucination errors, and the remainder to reasoning or consistency deviations. These results indicate that the principal performance gains are attributable to structured, role-based expert simulation and collaborative analysis (Tang et al., 2023).

6. Significance and Applicability

MedAgent-Zero demonstrates that partitioning a single LLM using role-engineered agents enables the mining and recombination of its embedded medical expertise for clinical reasoning tasks. The framework does not require model fine-tuning, external tools, or supplementary training data, which broadens its applicability to real-world, resource-constrained settings where zero-shot and training-free solutions are preferred. The methodology affords a template for extending LLM capability in other high-precision, domain-specific fields. Its training-free, multi-agent collaboration strategy provides a state-of-the-art procedural baseline for future zero-shot medical reasoning systems (Tang et al., 2023).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to MedAgent-Zero.