MedAgent-Zero: Zero-Shot Medical Reasoning
- MedAgent-Zero is a framework that partitions a single LLM into specialized agents simulating expert medical reasoning without additional training.
- It employs distinct roles for question analysis, option evaluation, summarization, and iterative consensus to ensure robust decision-making.
- Empirical evaluation shows that MedAgent-Zero achieves state-of-the-art zero-shot accuracy on medical QA benchmarks, outperforming methods like CoT and few-shot baselines.
MedAgent-Zero is a collaborative, agent-based framework designed to harness the latent medical reasoning capabilities of LLMs in a zero-shot, training-free setting. Developed specifically to address domain-adaptation challenges in medicine, MedAgent-Zero systematically partitions a single LLM instance into specialized agents with distinct medical roles. Each agent role is invoked through carefully designed prompts, enabling nuanced, multi-expert discourse, iterative summarization, critical debate, and consensus-driven decision-making, all without in-context demonstrations or parameter updates. Empirical evaluation demonstrates that MedAgent-Zero achieves state-of-the-art zero-shot accuracy on a suite of medical QA benchmarks, surpassing both simple prompting and advanced chain-of-thought (CoT) methods (Tang et al., 2023).
1. Agent Roles and Overall Architecture
MedAgent-Zero models a single LLM, such as GPT-3.5-Turbo or GPT-4, as a coordinated team of virtual agents, each simulating expertise in distinct medical subfields. The primary roles include:
- Question-Domain Experts (QD): A set of agents, each adopting a system role corresponding to a selected medical specialty (e.g., cardiology, pulmonology). These agents initially interpret the clinical scenario.
- Option-Domain Experts (OD): A set of agents, each responsible for adjudicating the answer options from the perspective of its designated medical subfield.
- Medical Report Assistant (Summarizer): This agent synthesizes all individual expert analyses into a coherent, structured report emphasizing “Key Knowledge” and “Total Analysis.”
- Collaborative Voters/Editors: The QD and OD agents together act as reviewers, voting on report drafts, proposing edits, and driving iterative consensus.
- Decision Maker: Given the unanimously approved report, a final role determines the single best answer.
Each agent is instantiated purely through prompt engineering, using constructs such as “You are a [Clinical Domain] Expert,” with no model fine-tuning, gradient updates, or external context.
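As an illustration of prompt-only role instantiation, the following minimal sketch builds a team of agents from plain strings. Only the “You are a [Clinical Domain] Expert” template comes from the text above; the `Agent` class, the `build_team` helper, and the exact wording of the summarizer and decision-maker prompts are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Agent:
    """A role is just a name plus a system prompt; no weights are changed."""
    name: str
    system_prompt: str


def make_expert(domain: str) -> Agent:
    # Identity template taken from the text: "You are a [Clinical Domain] Expert".
    return Agent(name=f"{domain} Expert", system_prompt=f"You are a {domain} Expert.")


def build_team(
    question_domains: List[str], option_domains: List[str]
) -> Dict[str, Union[Agent, List[Agent]]]:
    """Assemble the roles described above (prompt wording for the last two
    roles is an assumption, not the authors' wording)."""
    return {
        "question_experts": [make_expert(d) for d in question_domains],
        "option_experts": [make_expert(d) for d in option_domains],
        "summarizer": Agent("Medical Report Assistant",
                            "You are a Medical Report Assistant."),
        "decision_maker": Agent("Decision Maker",
                                "You are a Medical Decision Maker."),
    }


team = build_team(["Cardiology", "Pulmonology"], ["Cardiology"])
for expert in team["question_experts"]:
    print(expert.name, "->", expert.system_prompt)
```

Because every role reduces to a system prompt, the same underlying model can serve all agents; only the string passed as the system message differs between calls.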
2. Stage-Wise Workflow and Mathematical Formulation
MedAgent-Zero executes medical reasoning in five critical stages:
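- Expert Gathering: given the question $q$ and the answer options $op$, the LLM is prompted to name the relevant specialties. Each $r$ and $\mathrm{prompt}$ pair denotes the system role and guideline prompt for the corresponding stage.
  - $QD = \mathrm{LLM}(q, r_{qd}, \mathrm{prompt}_{qd})$
  - $OD = \mathrm{LLM}(q, op, r_{od}, \mathrm{prompt}_{od})$
- Individual Analyses:
  - For each question expert $qd_i \in QD$: $qa_i = \mathrm{LLM}(q, qd_i, r_{qa}, \mathrm{prompt}_{qa})$
  - For each option expert $od_i \in OD$: $oa_i = \mathrm{LLM}(q, op, od_i, QA, r_{oa}, \mathrm{prompt}_{oa})$, where $QA$ is the set of question analyses
- Report Summarization:
  - $Repo = \mathrm{LLM}(QA, OA, r_{rs}, \mathrm{prompt}_{rs})$, where $OA$ is the set of option analyses
- Iterative Multi-Round Discussion:
  - For each round, up to a maximum number of iterations, perform collaborative consultation:
    - Each agent votes (“yes”/“no”) on the current report
    - If any vote is “no”, the dissenting agent proposes edits
    - Summarizer integrates all suggested edits into a revised report
    - Repeat until all votes are “yes” or the maximum iteration is reached
  - Final output is the unanimous report $Repo_f$, where consensus is defined as every agent voting “yes”
- Final Decision Making:
  - $ans = \mathrm{LLM}(q, op, Repo_f, r_{dm}, \mathrm{prompt}_{dm})$, yielding an answer of the form “Option: [A/B/C/D/E]”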
All information flow is unidirectional: experts → analyses → synthesis → consensus → decision, and no learning or memory is retained across tasks.
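The five stages can be sketched end to end as a single function that takes any chat model as a callable. This is a rough sketch under stated assumptions, not the authors' implementation: the `LLM` callable signature, all prompt wordings, the comma-separated specialty format, and the default of `max_rounds = 3` are hypothetical.

```python
from typing import Callable, List

# Hypothetical interface: (system_role, user_prompt) -> completion text.
LLM = Callable[[str, str], str]


def _expert(domain: str) -> str:
    return f"You are a {domain} Expert."


def _gather(llm: LLM, prompt: str) -> List[str]:
    # Assumes the model replies with a comma-separated list of specialties.
    reply = llm("You are a medical expert who routes cases to specialties.", prompt)
    return [d.strip() for d in reply.split(",") if d.strip()]


def run_pipeline(llm: LLM, question: str, options: str, max_rounds: int = 3) -> str:
    # Stage 1: expert gathering.
    qd = _gather(llm, f"List the specialties relevant to the question, comma-separated.\n{question}")
    od = _gather(llm, f"List the specialties relevant to the options, comma-separated.\n{question}\n{options}")

    # Stage 2: individual analyses (option experts also see the question analyses).
    qa = [llm(_expert(d), f"Analyze the question.\n{question}") for d in qd]
    qa_text = "\n".join(qa)
    oa = [llm(_expert(d), f"Evaluate each candidate answer.\n{question}\n{options}\n{qa_text}") for d in od]

    # Stage 3: report summarization.
    summarizer = "You are a Medical Report Assistant."
    report = llm(summarizer, "Summarize into Key Knowledge and Total Analysis.\n" + "\n".join(qa + oa))

    # Stage 4: iterative consultation until unanimous approval or the round limit.
    experts = qd + od
    for _ in range(max_rounds):
        votes = [llm(_expert(d), f"Do you approve this report? Answer yes or no.\n{report}") for d in experts]
        dissenters = [d for d, v in zip(experts, votes) if not v.strip().lower().startswith("yes")]
        if not dissenters:
            break
        edits = [llm(_expert(d), f"Propose edits to the report.\n{report}") for d in dissenters]
        report = llm(summarizer, f"Revise the report using the edits.\n{report}\n" + "\n".join(edits))

    # Stage 5: final decision from the approved (or last) report.
    return llm("You are a Medical Decision Maker.",
               f"Choose the best option. Reply in the form 'Option: X'.\n{question}\n{options}\n{report}")
```

Passing the model in as a callable keeps the sketch independent of any particular client library; a thin wrapper around whichever chat endpoint is in use can be supplied at call time, and a deterministic stub can be supplied for testing.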
3. Zero-Shot Operational Principles
MedAgent-Zero operates strictly in a zero-shot regime: no in-context exemplars or few-shot demonstrations are used at any pipeline stage. Only natural-language prompts and role designations guide behavior. Sampling parameters are consistently set to temperature = 1.0, top_p = 1.0, except for self-consistency (SC) runs (temperature = 0.7, 5 samples). Expert agents are initialized with simple, explicit identity statements (e.g., “You are a [Cardiology Expert]”), which structure reasoning pipelines without reliance on additional data modalities. There is no model adaptation, training, or learning between queries.
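The self-consistency setting mentioned above amounts to sampling several answers and taking a majority vote. The sketch below illustrates that idea with the stated parameters; the `sample_answer` callable, the dictionary key names, and the tie-breaking behavior (first answer seen wins) are assumptions for illustration.

```python
from collections import Counter
from typing import Callable

# Sampling settings as stated in the text (key names are illustrative).
DEFAULT_SAMPLING = {"temperature": 1.0, "top_p": 1.0}
SC_SAMPLING = {"temperature": 0.7, "n_samples": 5}


def self_consistency(
    sample_answer: Callable[[float], str],
    temperature: float = SC_SAMPLING["temperature"],
    n_samples: int = SC_SAMPLING["n_samples"],
) -> str:
    """Draw n_samples answers at the given temperature and return the most
    frequent one. Ties resolve to the answer that appeared first."""
    answers = [sample_answer(temperature) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


# Usage with a canned sequence standing in for real model samples.
canned = iter(["A", "B", "B", "C", "B"])
print(self_consistency(lambda temperature: next(canned)))
```

With the canned sequence shown, "B" appears three times out of five and is returned as the majority answer.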
4. Empirical Evaluation and Benchmarking
MedAgent-Zero was evaluated on nine medical QA datasets: MedQA (USMLE), MedMCQA (AIIMS/NEET PG), PubMedQA, and six medical subtasks from MMLU (Anatomy, Clinical Knowledge, College Medicine, Medical Genetics, Professional Medicine, College Biology). For each dataset, accuracy was measured on a randomly selected sample of 300 questions, using GPT-3.5-Turbo and GPT-4 accessed through Azure OpenAI endpoints. Results are summarized below:
| Model | Zero-shot | Zero-shot + CoT + SC | MedAgent-Zero |
|---|---|---|---|
| GPT-3.5 | 67.8% | 70.9% | 72.1% |
| GPT-4 | 80.6% | 83.0% | 86.7% |
MedAgent-Zero achieved the highest average zero-shot accuracy across all datasets, outperforming not only zero-shot and CoT+SC (Chain-of-Thought with Self-Consistency) prompting, but also strong few-shot CoT baselines and Flan-PaLM.
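The evaluation protocol described above (accuracy on a random sample of 300 questions per dataset) can be sketched as follows. The function names and the fixed `seed` are hypothetical; the source does not state how the random sample was drawn or seeded.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_questions(dataset: Sequence[T], k: int = 300, seed: int = 0) -> List[T]:
    """Draw up to k items without replacement (seed value is an assumption,
    used here only to make the sketch reproducible)."""
    rng = random.Random(seed)
    return rng.sample(list(dataset), min(k, len(dataset)))


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


subset = sample_questions(range(1000))
print(len(subset))
print(accuracy(["A", "B", "C", "D"], ["A", "B", "C", "E"]))
```

Exact-match scoring is appropriate here because each answer takes the form of a single option letter.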
5. Ablation Analysis and Component Contribution
Ablation studies on MedMCQA revealed the incremental effect of each modular stage within the MedAgent-Zero pipeline:
- Direct prompting: 49.0% accuracy
- Chain-of-Thought (CoT): 55.0% (+6.0)
- +Analysis Proposition (“Anal”): 62.0% (+7.0)
- +Summarization (“Summ”): 65.0% (+3.0)
- +Collaborative Consultation (“Cons”): 67.0% (+2.0)
Accuracy also depended on the number of question experts and option experts, rising steadily as these counts increased up to the best-performing configuration. Error analysis (40 failures annotated by humans) attributed 77% of errors to domain-knowledge gaps (either omission or misretrieval), 8% to CoT hallucination errors, and the remainder to reasoning or consistency deviations. These results indicate that the principal performance gains are attributable to structured, role-based expert simulation and collaborative analysis (Tang et al., 2023).
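The incremental gains in the ablation list can be checked with a few lines of arithmetic. The accuracies below are copied from the list above; the variable and function names are illustrative only.

```python
from typing import List, Tuple

# MedMCQA ablation accuracies (%), in the order the stages are added.
ABLATION: List[Tuple[str, float]] = [
    ("Direct prompting", 49.0),
    ("CoT", 55.0),
    ("+Anal", 62.0),
    ("+Summ", 65.0),
    ("+Cons", 67.0),
]


def incremental_gains(stages: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Gain of each stage over the one before it, in percentage points."""
    return [
        (name, round(acc - prev_acc, 1))
        for (_, prev_acc), (name, acc) in zip(stages, stages[1:])
    ]


gains = incremental_gains(ABLATION)
total_gain = round(ABLATION[-1][1] - ABLATION[0][1], 1)
print(gains)       # [('CoT', 6.0), ('+Anal', 7.0), ('+Summ', 3.0), ('+Cons', 2.0)]
print(total_gain)  # 18.0
```

The per-stage gains sum to the 18.0-point difference between direct prompting and the full pipeline, with the analysis stage contributing the largest single step.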
6. Significance and Applicability
MedAgent-Zero demonstrates that partitioning a single LLM using role-engineered agents enables the mining and recombination of its embedded medical expertise for clinical reasoning tasks. The framework does not require model fine-tuning, external tools, or supplementary training data, which broadens its applicability to real-world, resource-constrained settings where zero-shot and training-free solutions are preferred. The methodology affords a template for extending LLM capability in other high-precision, domain-specific fields. Its training-free, multi-agent collaboration strategy provides a state-of-the-art procedural baseline for future zero-shot medical reasoning systems (Tang et al., 2023).