MDTeamGPT: Multi-disciplinary Medical LLM
- MDTeamGPT is a collaborative multi-agent system designed for multi-disciplinary medical consultations using expertise-aware recruitment.
- It utilizes residual discussion protocols and structured knowledge bases to aggregate specialist inputs into consensus-driven recommendations.
- Empirical evaluations demonstrate improved accuracy over single-agent models, ensuring robust, interpretable clinical decision support.
MDTeamGPT is a collaborative LLM framework for multi-disciplinary team (MDT) medical consultation, implementing expertise-aware multi-agent orchestration, evidence-driven consensus, and self-evolution via structured knowledge bases. It combines adaptive recruitment of specialist expert agents with adversarial and consensus-driven answer synthesis, continuous learning from consultation history, and robust evaluation against physician-generated benchmarks. MDTeamGPT is positioned as a high-accuracy, interpretable, and adaptable architecture for both clinical decision support and medical question answering, capable of handling real-world complexity in medical team workflows (Chen et al., 18 Mar 2025, Bao et al., 19 Aug 2025).
1. System Architecture and Agent Roles
MDTeamGPT operationalizes a multi-stage agent-based workflow, reflecting key MDT processes in clinical settings. The architecture typically comprises:
- Patient Agent: Presents the case background and the medical query.
- Primary Care Doctor Agent: Receives the query, selects the required specialist roles (e.g., General Internal Medicine, Radiology, Neurology), and justifies the selection.
- Specialist Doctor Agents: Instantiated in parallel, one per selected specialty, each generating a candidate answer per round.
- Lead Physician Agent: Aggregates and summarizes specialist outputs across four categories: Consistency, Conflict, Independence, Integration. Drives consensus aggregation.
- Chain-of-Thought Reviewer Agent: Extracts reasoning steps and updates knowledge bases (CorrectKB/ChainKB) with correct and erroneous chains, respectively.
- Safety & Ethics Reviewer Agent: Validates final output for safety/ethical compliance.
The system employs a residual discussion structure: each specialist prompt in a given round incorporates the summaries from prior rounds, minimizing information loss and enabling experience reuse across rounds (Chen et al., 18 Mar 2025).
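In schematic form, one round of this workflow can be sketched as follows. This is a minimal Python sketch: all role functions, their signatures, and the stubbed panel are illustrative stand-ins for prompted LLM calls, not the authors' implementation.

```python
# Hypothetical single-round sketch of the MDTeamGPT agent pipeline.
# All role functions are illustrative stand-ins for prompted LLM calls.

def primary_care(query):
    # Select and justify specialist roles for the query (stubbed panel).
    return ["General Internal Medicine", "Radiology", "Neurology"]

def specialist(role, query, prior_summary):
    # Each specialist sees the query plus the residual summary from the
    # previous round (empty in round one).
    return f"{role}: candidate answer for '{query}'"

def lead_physician(answers):
    # Aggregate specialist outputs into the four-category summary.
    return {"Consistency": answers, "Conflict": [],
            "Independence": [], "Integration": []}

def consult_round(query, prior_summary=""):
    roles = primary_care(query)
    answers = [specialist(r, query, prior_summary) for r in roles]
    return lead_physician(answers)  # becomes the residual input to the next round

summary = consult_round("55-year-old with chest pain and dyspnea")
```

In a full consultation, the returned summary would be fed back into the next round's specialist prompts, and the Chain-of-Thought and Safety & Ethics reviewers would post-process the converged answer.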
2. Expertise-Aware Recruitment and Dynamic Agent Selection
At its core, MDTeamGPT implements an Expertise-aware Multi-LLM Recruitment and Collaboration (EMRC) framework. Its primary stages are:
- LLM Expertise Table Construction: Offline, a publicly available corpus (e.g., MedQA-valid) is used to record, for each candidate LLM, its classification and answering accuracy across medical departments and difficulty levels.
- Dynamic Agent Selection: Online, given a medical query, MDTeamGPT classifies it by department and difficulty, consults the expertise table, and recruits the top-scoring LLMs via a scoring function over their recorded department- and difficulty-level accuracy.
- Confidence Fusion: Each agent returns an answer together with a self-reported confidence; a fused score combining table expertise and self-confidence weights the agent in voting or aggregation.
- Adversarial Validation: The highest-scoring agent is assigned as Judge, provides error signals or critiques for peer agents, and may trigger answer refinement.
- Final Aggregation: Aggregator LLM combines all responses, confidences, and error signals to produce the final diagnosis or recommendation (Bao et al., 19 Aug 2025).
Parameter choices (e.g., the number of recruited agents and the use of two collaboration layers) are validated empirically for robustness and efficiency.
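The recruitment and fusion steps can be sketched as follows. The expertise table contents, the fusion weight, and all numeric values are illustrative assumptions, not figures from the papers.

```python
# Hypothetical expertise table: accuracy per (model, department, difficulty),
# built offline from a labeled corpus such as MedQA-valid.
EXPERTISE = {
    ("llm_a", "cardiology", "hard"): 0.82,
    ("llm_b", "cardiology", "hard"): 0.74,
    ("llm_c", "cardiology", "hard"): 0.90,
}

def recruit(department, difficulty, k=2):
    """Return the top-k models by recorded accuracy for this query class."""
    scored = [(acc, model) for (model, dept, diff), acc in EXPERTISE.items()
              if dept == department and diff == difficulty]
    return [model for _, model in sorted(scored, reverse=True)[:k]]

def fused_score(expertise_acc, self_confidence, alpha=0.5):
    """Illustrative confidence fusion: a convex combination of the table's
    recorded accuracy and the agent's self-reported confidence."""
    return alpha * expertise_acc + (1 - alpha) * self_confidence
```

The fused score would then weight each recruited agent's answer during voting, with the highest-scoring agent assuming the Judge role.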
3. Consensus Aggregation, Residual Discussion, and Knowledge Bases
MDTeamGPT’s answer synthesis relies on both formal consensus protocols and persistent learning across sessions:
- Consensus Rule: After each round, outputs from all specialists are compared. If all agree, the answer is accepted; otherwise, after a preset maximum number of rounds, majority voting is applied.
- Residual Discussion: Specialist prompts in each round are augmented with the four-category summaries from the previous round, constructed by the Lead Physician Agent, facilitating memory-efficient, stable convergence (Chen et al., 18 Mar 2025).
- Structured Knowledge Bases:
- CorrectKB: Archives correct cases as question–answer–summary records for fast retrieval and case augmentation in future sessions.
- ChainKB: Captures full error-chain trajectories for incorrect cases, including analysis process and error reflections. Similarity-based retrieval injects top past chains to inform ongoing consultations.
- Self-Evolution: After each session, the final output is cross-verified against a gold label; CorrectKB or ChainKB is updated accordingly. Retrieval from these stores enables few-shot in-context learning without explicit model fine-tuning.
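A minimal sketch of the self-evolution step follows, using simple string similarity as a stand-in for the system's actual similarity-based retrieval; the data structures and names are hypothetical.

```python
import difflib

correct_kb = []  # verified-correct cases: (question, answer, summary)
chain_kb = []    # failed cases with error reflections

def update_kbs(question, answer, summary, gold_label):
    """Route a finished consultation into CorrectKB or ChainKB depending on
    whether the final answer matched the gold label."""
    if answer == gold_label:
        correct_kb.append((question, answer, summary))
    else:
        chain_kb.append((question, answer,
                         f"expected {gold_label}, got {answer}"))

def retrieve(kb, query, top_n=2):
    """Similarity-based retrieval of past cases for in-context injection
    (string ratio here stands in for embedding similarity)."""
    sim = lambda entry: difflib.SequenceMatcher(None, entry[0], query).ratio()
    return sorted(kb, key=sim, reverse=True)[:top_n]

update_kbs("q1: chest pain workup", "A", "summary-1", gold_label="A")
update_kbs("q2: chronic headache", "B", "summary-2", gold_label="C")
```

Retrieved entries would be injected into specialist prompts as few-shot context in subsequent consultations.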
4. Orchestration, Implementation, and Practical Guidance
MDTeamGPT is designed for scalability and maintainability in real-world settings. Key considerations include:
- Orchestration: Each agent is wrapped as a microservice; a central controller handles classification, agent selection, fanning out queries, collecting responses and confidences, running the Judge module, and aggregating.
- Scalability: Dynamic instantiation of up to four agents per query balances diversity and computational load; microservice caching and periodic expertise table refresh cycles improve performance.
- Latency and Throughput: Multi-round group discussions incur higher latency (seconds per query), justified by improved accuracy on benchmarks such as MedQA and PubMedQA (Chen et al., 18 Mar 2025). Adaptive recruitment in EMRC improves scalability over static agent pools (Bao et al., 19 Aug 2025).
- Safety and Monitoring: Role-based safety/effectiveness reviewers, logging, and consistent output vetting are advised for compliance and mitigation of hallucinations and ethical risk.
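The controller's fan-out/collect step can be sketched with a thread pool; the agent stubs and the confidence-weighted vote below are assumptions for illustration, not the system's actual service interfaces.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def make_agent(answer, confidence):
    """Stub for a specialist microservice returning (answer, confidence)."""
    return lambda query: (answer, confidence)

def fan_out_and_aggregate(query, agents):
    """Query all recruited agents in parallel, then pick the answer with the
    highest total confidence (a simple confidence-weighted vote)."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        responses = list(pool.map(lambda agent: agent(query), agents))
    votes = defaultdict(float)
    for answer, confidence in responses:
        votes[answer] += confidence
    return max(votes, key=votes.get)

agents = [make_agent("A", 0.9), make_agent("B", 0.6), make_agent("A", 0.5)]
final = fan_out_and_aggregate("case description", agents)
```

In the full system, the Judge module and Aggregator LLM would sit between collection and the final vote, injecting critiques and error signals before aggregation.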
5. Empirical Performance and Benchmarking
MDTeamGPT has been evaluated on standard medical QA benchmarks and ablation studies:
| Method / Dataset | MedQA Acc | PubMedQA Acc | Avg Acc |
|---|---|---|---|
| Single-Agent | 77.4% | 75.3% | 76.4% |
| Multi-expert Prompting | 86.6% | 82.7% | 84.2% |
| MedAgents [13] | 83.7% | 76.8% | 80.3% |
| MDTeamGPT (self-evolving, full) | 90.1% | 83.9% | 87.0% |
Ablation confirms that removing residual discussion, lead-physician aggregation, or either CorrectKB or ChainKB degrades accuracy by 2–10%. On MMLU-Pro-Health, the EMRC methodology achieves accuracy gains over both GPT-4-0613 and naive multi-agent selection (Bao et al., 19 Aug 2025; Chen et al., 18 Mar 2025).
6. Evaluation and Ethical Considerations
Rigorous evaluation, both quantitative and qualitative, is integral to MDTeamGPT’s deployment:
- Criteria: Span medical professional accuracy, logical coherence, informativeness, expansiveness, social interaction, empathy, and computational robustness (Xu et al., 2023), incorporating multi-expert contextual and conflict-resolution metrics.
- Evaluation Protocols: Blind scoring by clinicians, stratified specialty/difficulty performance monitoring, and dedicated ablation for conflict resolution and evidence synthesis robustness are recommended (Chen et al., 18 Mar 2025).
- Ethics and Safety: Advisory boards, user flagging mechanisms, bias audits, and privacy safeguards (HIPAA/GDPR compliance, de-identification) mitigate risks of misdiagnosis, data breaches, and health disparities (Mingole et al., 13 Jun 2025).
7. Future Directions and Limitations
MDTeamGPT’s foundational architecture is extensible across several axes:
- Incorporation of Retrieval-Augmented Generation (RAG): Integration of federated, multi-institutional knowledge sources (e.g., UMLS, SNOMED CT, PubMed) for real-time evidence retrieval (Mingole et al., 13 Jun 2025, Chen et al., 18 Mar 2025).
- Online Learning: Per-query expertise updating and agent performance tracking for dynamic expertise table refinement.
- Human-in-the-Loop: Deferred handoff to clinicians for low-confidence or high-risk cases.
- Multi-Modal Expansion: Enabling radiology, pathology, EHR text, and structured laboratory data handling via multi-modal LLMs (Kim et al., 2024).
- Efficiency Optimizations: Early-exit consensus, dynamic agent scheduling, and memory-efficient inference for deployment at clinical scale.
- Benchmarking: Continuous comparison against public and in-the-wild clinical dialogue datasets, adopting and extending the MedGPTEval rubric for robust, interpretable assessment (Xu et al., 2023).
Limitations include persistent risks from foundational LLM hallucinations, table staleness as new models emerge, and untested real-world robustness in clinical environments. Ongoing work is required on real-world datasets, dynamic agent orchestration, and further integration of ethical and regulatory frameworks.