- The paper introduces Aegle, a framework that virtualizes multi-disciplinary reasoning via a dynamic multi-agent system to reduce diagnostic biases.
- The paper details a structured SOAP schema and dynamic specialist activation to decouple evidence collection from diagnostic synthesis.
- The paper demonstrates significant improvements in IDEA and SOAP metrics with over 20% diagnostic accuracy lift relative to baseline models.
Virtualizing Multi-Disciplinary Medical Reasoning: Aegle’s Multi-Agent Framework for Clinical Intake
Introduction and Motivation
Aegle directly addresses structural limitations in clinical intake workflows, where single-physician documentation frequently suffers from cognitive bias, evidence omission, and lack of reasoning depth. While Multi-Disciplinary Team (MDT) decision-making is the standard of care for complex scenarios, real-time reliance on MDT processes remains infeasible at scale. The proposed Aegle framework virtualizes MDT-level multi-perspective reasoning within outpatient consults using a graph-based multi-agent system, bridging the operational divide between the robustness of MDT and the resource constraints of real-world outpatient intake.
Figure 1: Single-agent LLMs are prone to anchoring bias and fragmented evidence capture, while virtual MDTs (as instantiated by Aegle) support parallel specialty reasoning and coherent integration, enhancing coverage and diagnostic traceability.
Framework Architecture
Aegle’s architecture formally separates evidence collection from diagnostic synthesis by grounding its consultation state in a structured SOAP schema, $S_t = [F_t, P_t]$, where $F_t$ holds the collected evidence and $P_t$ the diagnostic synthesis at consultation step $t$. This separation enforces traceability: hypothesis generation is constrained to sit downstream of evidentiary sufficiency, which mitigates bias from early diagnostic fixation.
Agent roles are orchestrated via a state-aware, dynamic topology:
- Orchestrator: Implements a dynamic specialist activation policy, engaging agents based on case-specific uncertainty and the evolving evidence matrix.
- Specialist Agents: Execute independent, domain-constrained reasoning in parallel. Isolation delays premature consensus and supports hypothesis diversity.
- Aggregator: Integrates state proposals with explicit write-then-speak separation; internal state updates precede patient-facing utterances, guaranteeing documentation consistency.
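The three roles above can be illustrated with a minimal Python sketch of one consultation round. Everything here is an assumption made for illustration, not the paper's implementation: the names `ConsultState`, `select_specialists`, `specialist_proposal`, and `run_round` are hypothetical, the uncertainty threshold is an arbitrary placeholder for the activation policy, and the specialist function is a stand-in for an LLM call.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class ConsultState:
    """Hypothetical consultation state: running notes plus per-specialty uncertainty."""
    notes: list[str] = field(default_factory=list)
    uncertainty: dict[str, float] = field(default_factory=dict)


def select_specialists(state: ConsultState, threshold: float = 0.5) -> list[str]:
    """Orchestrator: activate only specialties whose uncertainty exceeds the threshold."""
    return sorted(s for s, u in state.uncertainty.items() if u > threshold)


def specialist_proposal(specialty: str, notes: tuple[str, ...]) -> str:
    """Specialist: stand-in for an LLM call; it sees a read-only snapshot, not peers' output."""
    return f"[{specialty}] follow-up question given {len(notes)} evidence items"


def run_round(state: ConsultState) -> str:
    """Run one round: select specialists, reason in parallel, then write before speaking."""
    active = select_specialists(state)
    snapshot = tuple(state.notes)  # immutable view keeps specialists isolated
    with ThreadPoolExecutor() as pool:
        proposals = list(pool.map(lambda s: specialist_proposal(s, snapshot), active))
    # Aggregator, step 1 (write): commit the proposals to the internal state.
    first_new = len(state.notes)
    state.notes.extend(proposals)
    # Aggregator, step 2 (speak): derive the utterance from the updated state only.
    if len(state.notes) > first_new:
        return state.notes[first_new]
    return "No further questions."
```

Because each specialist receives the same immutable snapshot, no agent can read another agent's proposal within a round, which is one simple way to obtain the isolation the list describes.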
Figure 2: Diagram of the Aegle consultation pipeline, demonstrating the transition from iterative history taking and on-demand specialist activation to stabilized evidence and diagnostic synthesis.
Figure 3: History-taking phase with structured parallel inquiry and iterative note refinement, exemplified by a pediatric cardiology case.
Stagewise execution ensures that only stabilized evidence feeds diagnostic synthesis; F is frozen prior to P generation, enforcing unidirectional logical dependency.
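One way to picture the freeze-then-synthesize rule is a state object that refuses evidence writes once frozen and refuses synthesis until frozen. The sketch below is an assumption for illustration only: the class `SoapState`, its methods, and the `diagnose` callback are hypothetical names, and the paper may enforce the dependency differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence


@dataclass
class SoapState:
    """Hypothetical state with evidence entries F and synthesis entries P."""
    F: list[str] = field(default_factory=list)
    P: list[str] = field(default_factory=list)
    frozen: bool = False

    def add_evidence(self, item: str) -> None:
        """Append to F; disallowed once the evidence has been frozen."""
        if self.frozen:
            raise RuntimeError("F is frozen; evidence can no longer change")
        self.F.append(item)

    def freeze(self) -> tuple[str, ...]:
        """Mark F as stabilized and return an immutable copy of it."""
        self.frozen = True
        return tuple(self.F)

    def synthesize(self, diagnose: Callable[[tuple[str, ...]], Sequence[str]]) -> None:
        """Generate P from frozen F only, so P depends on F and never the reverse."""
        if not self.frozen:
            raise RuntimeError("freeze F before generating P")
        self.P = list(diagnose(tuple(self.F)))
```

The `diagnose` callback receives a tuple rather than the list itself, so the synthesis step cannot modify the evidence it reads.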
Experimental Framework
Evaluation spans 24 departments and 53 metrics on two datasets, ClinicalBench and RAPID-IPN, chosen to stress both the breadth and the depth of diagnostic challenges:
- ClinicalBench: Synthetic cohort with strict data-leakage controls, emphasizing generalization and open-ended clinical generation.
- RAPID-IPN: Real-world abdominal pain cohort, annotated by senior physicians, demanding precision in differential and longitudinal documentation.
Baselines comprise proprietary/open LLMs, chain-of-thought (CoT) and tree-of-thought (ToT) reasoning strategies, and agent-based frameworks (MDAgents, MedAgents), isolating the contribution of Aegle’s architecture.
Results
Documentation Quality
Aegle achieves consistent, statistically significant improvements in both structured reasoning (IDEA) and documentation standardization (SOAP) across datasets and departments. With the backbone model held fixed (DeepSeek-V3.2), Aegle’s architecture yields absolute gains of up to 8.6 in IDEA and 3.5 in SOAP over the strongest agentic baselines, with the improvement concentrated in complex specialties that demand cross-domain integration.
Figure 5: Department-wise IDEA and SOAP scores, displaying consistent superiority of Aegle across 24 clinical domains, with notable gaps in high-ambiguity specialties.
Textual metrics (READ, chrF++) are saturated across advanced LLMs, but Aegle’s architectural constraints manifest in increased evidential coherence and decreased omission, not merely stylistic fluency.
Diagnostic Accuracy
In end-to-end consultation-to-diagnosis evaluation on ClinicalBench, Aegle reaches a diagnostic accuracy of 46.9%, outperforming CoT, ToT, MDAgents, and MedAgents and representing a lift of more than 20% over matched single-model performance. This indicates that multi-perspective evidence gathering, when structurally coordinated, directly improves downstream clinical correctness and not just documentation fidelity.
Consultation Efficiency and Resource Utilization
Aegle’s dynamic specialist activation substantially reduces agent invocations: per-round expert utilization is less than half that of static multi-agent baselines. This improves deployment feasibility by cutting redundant computation without degrading reasoning quality, which matters for real-time clinical applications.
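To make "per-round expert utilization" concrete, the sketch below computes the mean number of specialists invoked per round from activation logs. The function name and both logs are hypothetical, and the numbers are invented purely to show the calculation; they are not results from the paper.

```python
def mean_experts_per_round(activations: list[list[str]]) -> float:
    """Average number of specialists invoked per consultation round."""
    if not activations:
        return 0.0
    return sum(len(round_agents) for round_agents in activations) / len(activations)


# Hypothetical logs (illustrative only, not data from the paper).
specialist_pool = ["cardiology", "pulmonology", "gastroenterology", "neurology"]
static_log = [specialist_pool, specialist_pool, specialist_pool]  # every agent, every round
dynamic_log = [["cardiology", "pulmonology"], ["cardiology"], []]  # on-demand activation

# Utilization of the dynamic policy relative to the static one.
utilization_ratio = mean_experts_per_round(dynamic_log) / mean_experts_per_round(static_log)
```

A ratio below 0.5 under this definition would correspond to the "less than half" comparison described above.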
Component Contribution and Ablation
Systematic ablation identifies structured state grounding as the dominant performance driver. Excision of the explicit F/P split leads to catastrophic degradation in both reasoning quality and evidence traceability. Removing generative inquiry inflates consultation length and induces evidence scattering, while omitting dynamic topology or decoupled reasoning uniformly depresses reasoning diversity and accuracy.
Figure 6: Ablation analysis of Aegle components, demonstrating critical dependence on explicit state representation and generative, dynamic inquiry mechanisms.
Validation of LLM-as-a-Judge
Correlation with practicing-physician review (Pearson r > 0.66 across all metrics) supports the use of LLM-based rubric scoring for scalable, multi-dimensional evaluation and, in turn, the robustness of the presented results.
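For readers unfamiliar with the statistic, the following sketch computes Pearson's r between paired LLM-judge and physician scores using only the standard library. The function name and the example ratings are hypothetical placeholders, not data from the paper.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (spread_x * spread_y)


# Hypothetical rubric scores for five notes (illustrative only).
llm_judge_scores = [3, 4, 2, 5, 4]
physician_scores = [3, 5, 2, 4, 4]
agreement = pearson_r(llm_judge_scores, physician_scores)
```

In a validation like the one described, this value would be computed separately for each metric and compared against the reported threshold.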
Figure 7: Correlation matrix between LLM-judge and human ratings on IDEA, SOAP, and readability metrics, confirming rubric alignment.
Case Study Analysis
Aegle preserves clinically actionable low-salience features (e.g., IPSS score, granular pathology details) in high-complexity cases, resulting in more nuanced risk stratification and contextually appropriate diagnostic/management plans compared to CoT/ToT and prior multi-agent systems. The decoupled MDT emulation enables integration of weak signals that are frequently eclipsed by dominant clinical findings in single-agent models.
Figure 9: Example IPN progression generated by Aegle in a complex prostate cancer case, capturing granular evidence missed by baseline models.
Figure 11: Continuation of the case, demonstrating the plan’s coherence and evidentiary traceability up to specialist referral.
Theoretical Implications
Aegle’s framework formalizes essential best practices in collaborative medicine within a tractable, synchronous agentic system. By enforcing explicit evidence-diagnosis separation and dynamic role assignment, it bridges cognitive science insights with computational tractability. The architecture offers a reference standard for virtualized collaborative reasoning in high-variance, context-dependent AI decision tasks.
Practical and Future Directions
Practically, Aegle establishes a route toward scalable, bias-minimized AI-based clinical documentation. Its dynamic topology supports adaptive resource allocation, which is critical for real-world deployment under latency and cost constraints. The framework’s structure is directly extendable: targeted improvements in aggregation, redundancy penalization, and diversity-aware specialist prompting are immediate frontiers for optimizing inter-agent synthesis.
Future research should systematically balance hypothesis diversity (de-biasing) against aggregation redundancy and explore fine-grained, audit-friendly traceability—especially for regulatory and medico-legal downstream applications. Extending the architecture to support richer temporal reasoning and referral chain handoff would further align virtual AI systems with full-cycle MDT clinical workflows.
Conclusion
Aegle demonstrates that virtual MDT architectures, grounded in structured state evolution and controlled multi-agent specialization, significantly surpass both single-LLM and existing multi-agent models in clinical documentation and diagnostic performance. These findings have direct translational relevance for robust, scalable AI decision support and establish foundational principles for the next generation of agentic healthcare intelligence systems.
Reference: "Beyond the Individual: Virtualizing Multi-Disciplinary Reasoning for Clinical Intake via Collaborative Agents" (2604.08927)