Doctor Agent: Clinical AI Systems
- Doctor Agent is an AI system that emulates physician decision-making by combining interactive dialogue with modular, multimodal components.
- It integrates structured patient data, lab results, and medical images, and models inquiry-driven consultation as a partially observable Markov decision process (POMDP) or a Markov decision process (MDP) to support clinical reasoning.
- Advanced systems use consensus protocols, reinforcement learning, and safety layers to improve diagnostic accuracy and the robustness of treatment recommendations.
A Doctor Agent is an artificial intelligence system, typically built on large language models (LLMs), designed to emulate the decision-making, inquiry, and communication capabilities of a physician in clinical settings. These agents are implemented as autonomous or semi-autonomous multi-step dialogue systems that can actively collect patient history, synthesize multimodal data (text, labs, and medical images), form differential diagnoses, recommend treatments, and provide personalized medical advice. Doctor Agents may function independently, as the decision-making node in larger multi-agent healthcare systems, or as part of simulated healthcare environments for benchmarking and research.
1. Architectures and Organizational Patterns
Current Doctor Agent architectures are highly modular and often employ multi-agent or graph-based frameworks to mirror the complexity of clinical reasoning.
- Hierarchical and Modular Architectures: Doctor Agents are commonly structured as directed graphs whose nodes correspond to distinct functional modules, such as Knowledge Retrieval, Diagnostic Reasoner, Feature Extractor, Decision Synthesizer, and Tool Integrator. Workflows can be evolved through automated search over a large architecture space, using primitives that include node addition and removal, edge manipulation, conditional and loop constructs, and parallel execution (Zhuang et al., 15 Apr 2025). This enables dynamic adaptation to diverse diagnostic tasks and iterative refinement of diagnostic pipelines.
- Multi-Agent Team Composition: More advanced settings substitute monolithic agents with orchestrated ensembles of specialized sub-agents. For example, MATEC for sepsis care defines domains for Emergency Medicine, Hospitalist, Infectious Disease, Critical Care, and Senior Physician agents, each responsible for a slice of the clinical pathway (Cho et al., 9 Feb 2025). RareAgents for rare disease leverages an Attending Physician Agent to orchestrate a multidisciplinary team of specialist agents, each with its own dynamic memory, tool access, and reasoning capabilities (Chen et al., 2024).
- Consensus and Collaboration Protocols: Multi-agent frameworks often employ consensus-building mechanisms—such as message-passing, blackboard architectures, or iterative debate with weighted voting—to synthesize final recommendations, control for agent error, or defend against adversarial collusion (Bashir et al., 1 Dec 2025, Yang et al., 26 Nov 2025).
- Self-Improvement and Memory: Doctor Agents routinely update internal buffers of experience or validated principles (MedicalRecordLib, ExperienceBase), retrieve similar past cases, or engage in reflection to identify and learn from incorrect actions (Li et al., 2024, Almansoori et al., 28 Mar 2025). Experience repositories and retrieval-augmented prompting are common across these systems.
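The directed-graph organization described above can be illustrated with a minimal sketch. It assumes each module is a plain function that reads a shared state dictionary and returns new keys, and that edges are given as a mapping from each node to its predecessors. The module bodies, the toy symptom lookup, and the names `run_workflow`, `nodes`, and `deps` are hypothetical stand-ins, not the design of any cited system.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set

# A node reads the shared state and returns the keys it adds or updates.
Node = Callable[[dict], dict]


def run_workflow(nodes: Dict[str, Node], deps: Dict[str, Set[str]], state: dict) -> dict:
    """Execute nodes in dependency order; deps maps each node to its predecessors."""
    for name in TopologicalSorter(deps).static_order():
        state = {**state, **nodes[name](state)}
    return state


def knowledge_retrieval(state: dict) -> dict:
    # Toy lookup for illustration only, not clinical knowledge.
    kb = {"fever": ["influenza", "sepsis"], "cough": ["influenza", "pneumonia"]}
    return {"candidates": [d for s in state["symptoms"] for d in kb.get(s, [])]}


def diagnostic_reasoner(state: dict) -> dict:
    # Score each candidate by how many reported symptoms point to it.
    scores: Dict[str, int] = {}
    for d in state["candidates"]:
        scores[d] = scores.get(d, 0) + 1
    return {"scores": scores}


def decision_synthesizer(state: dict) -> dict:
    # Pick the highest-scoring candidate as the working diagnosis.
    return {"diagnosis": max(state["scores"], key=state["scores"].get)}


nodes = {
    "knowledge_retrieval": knowledge_retrieval,
    "diagnostic_reasoner": diagnostic_reasoner,
    "decision_synthesizer": decision_synthesizer,
}
deps = {
    "knowledge_retrieval": set(),
    "diagnostic_reasoner": {"knowledge_retrieval"},
    "decision_synthesizer": {"diagnostic_reasoner"},
}

result = run_workflow(nodes, deps, {"symptoms": ["fever", "cough"]})
print(result["diagnosis"])  # influenza
```

Because the workflow is just data (`nodes` plus `deps`), search primitives such as adding a node or rewiring an edge reduce to edits of these two dictionaries, which is one plausible reason graph representations suit automated architecture search.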
2. Clinical Reasoning, Dialogue, and Dynamic Inquiry
Doctor Agents operate through multi-turn, information-seeking dialogue modeled as a partially observable Markov decision process (POMDP) or Markov decision process (MDP):
- Inquiry Capability: Agents proactively interact with patient (or simulated-patient) agents, requesting information relevant to building a correct and efficient diagnosis (Gong et al., 29 Sep 2025, Feng et al., 26 May 2025). Inquiry proficiency is measured over dimensions such as coverage of atomic information units (AIUs), relevance, clarity, and coherence—in line with MAQuE, a benchmark for multi-turn medical dialogue (Gong et al., 29 Sep 2025).
- Dialogue Management: Agent environments provide rigorous turn-based protocols, typically alternating actions (questions, test requests) and observations (patient responses, measurement results), updating internal state accordingly (Almansoori et al., 28 Mar 2025, Schmidgall et al., 2024). Reflection and correction cycles may be triggered when a diagnosis fails or experiences accumulate, enabling on-policy adaptation and refinement (Dutta et al., 2024).
- Reward Shaping and Learning: Reinforcement learning frameworks (e.g., DoctorAgent-RL, Doctor-R1) learn from environment feedback, separating dense process rewards (for empathy, safety, coherence) from sparse terminal rewards (diagnostic/therapeutic accuracy) to jointly improve clinical reasoning and consultation dialogue (Feng et al., 26 May 2025, Lai et al., 5 Oct 2025). Learning is frequently guided by group or within-trajectory policy optimization, and by harvesting high-reward trajectories into experience repositories for prompt augmentation.
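The split between dense process rewards and a sparse terminal reward can be made concrete with a small sketch. It assumes each dialogue turn already carries a scalar process score (for example from a rubric or judge model) and that the terminal reward is granted only when the final diagnosis is correct. The group-normalized advantage shown here is one common way to realize group-based policy optimization; whether the cited frameworks use exactly this form is an assumption. All names (`Turn`, `trajectory_return`, `group_relative_advantages`) are hypothetical.

```python
import statistics
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    action: str            # question or test request issued by the agent
    observation: str       # patient reply or test result
    process_reward: float  # dense per-turn score (assumed to be given)


def trajectory_return(turns: List[Turn], final_correct: bool,
                      terminal_weight: float = 1.0, gamma: float = 1.0) -> float:
    """Discounted sum of dense process rewards plus a sparse terminal reward."""
    total = sum((gamma ** t) * turn.process_reward for t, turn in enumerate(turns))
    terminal = terminal_weight if final_correct else 0.0
    # The terminal reward is credited at the last step of the episode.
    total += (gamma ** max(len(turns) - 1, 0)) * terminal
    return total


def group_relative_advantages(returns: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize returns within a group of rollouts for the same case."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    return [(r - mean) / (std + eps) for r in returns]


consult = [
    Turn("When did the fever start?", "Three days ago.", 0.2),
    Turn("Any recent travel?", "No.", 0.3),
]
good = trajectory_return(consult, final_correct=True)
bad = trajectory_return(consult, final_correct=False)
advantages = group_relative_advantages([good, bad])
```

With these inputs, the rollout ending in a correct diagnosis receives a positive advantage and the other a negative one, even though both share the same process rewards.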
3. Data Integration, External Tools, and Multimodality
Modern Doctor Agents are increasingly tool-augmented and multimodal:
- External Tool Invocation: Integration with knowledge bases, medical device APIs, database queries, and computational models is facilitated by function-calling APIs and agent orchestration. Notable examples include the use of Phenomizer, LIRICAL, Phenobrain, DrugBank, and DDI-Graph for rare disease diagnosis and drug–drug interaction checking (Chen et al., 2024).
- Multimodal Reasoning: Some Doctor Agents, particularly in domains such as radiology (LungNoduleAgent), operate on both structured image data and text reports, using cross-modal attention to fuse feature embeddings and propagate knowledge through retrieval-augmented generation or knowledge graphs (Yang et al., 26 Nov 2025).
- Patient Data Handling: Agent environments ingest structured EHR, vitals, unstructured notes, and imaging via extractor modules and present them as unified state for agent consumption (Cho et al., 9 Feb 2025, Hayat et al., 27 Jun 2025).
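As an illustration of function-calling style tool invocation, the sketch below registers tools by name and dispatches a JSON-encoded call of the kind a language model might emit. The registry, the `check_interaction` tool, and its one-entry lookup table are hypothetical placeholders; they do not reproduce the interface of DrugBank, DDI-Graph, or any other named resource.

```python
import json
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name: str):
    """Decorator that registers a function under a tool name."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return register


@tool("check_interaction")
def check_interaction(drug_a: str, drug_b: str) -> dict:
    # Toy table standing in for a real drug-interaction resource.
    known = {frozenset({"warfarin", "aspirin"}): "increased bleeding risk"}
    note = known.get(frozenset({drug_a.lower(), drug_b.lower()}))
    return {"interaction": note is not None, "note": note}


def dispatch(call_json: str) -> dict:
    """Parse a JSON tool call and route it to the registered function."""
    call = json.loads(call_json)
    name = call["name"]
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return {"result": TOOLS[name](**call.get("arguments", {}))}


reply = dispatch(json.dumps({
    "name": "check_interaction",
    "arguments": {"drug_a": "Warfarin", "drug_b": "aspirin"},
}))
print(reply["result"]["note"])  # increased bleeding risk
```

Returning an explicit error for unknown tools, rather than raising, lets the orchestrating agent see the failure as an observation and recover within the dialogue.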
4. Consensus Building, Adversarial Robustness, and Safety Layers
- Consensus Mechanisms: Doctor Agents operating in multi-agent settings rely on iterative debate, majority voting, or weighted aggregation (possibly via self-reported confidence) to arrive at a final decision (Yang et al., 26 Nov 2025, Cho et al., 9 Feb 2025). In adversarial contexts, colluding assistant agents can push a Doctor Agent toward incorrect or harmful prescriptions, with Attack Success Rate and Harmful Recommendation Rate reaching up to 100% when no protection is in place (Bashir et al., 1 Dec 2025).
- Verifier and Safety Layers: Lightweight safety defenses, such as Verifier Agents that cross-reference prescriptions against gold-standard clinical guidelines, are reported to restore 100% accuracy under collusion attack scenarios (Bashir et al., 1 Dec 2025).
- Transparency and Explainability: Modern frameworks enforce explainability via chain-of-thought logging, audit trails, and explicit evidence citation. Modular architectures allow all intermediate reasoning steps, external queries, and decision points to be logged and reviewed (Rose et al., 26 Feb 2025).
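A minimal sketch of confidence-weighted voting followed by a verifier check shows why the two layers complement each other. It assumes each agent reports a recommendation with a self-reported confidence in [0, 1] and that the verifier has a table of guideline-approved options per condition. The placeholder names (`drug_A`, `drug_B`, `condition_X`) and both functions are hypothetical and carry no clinical meaning.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


def weighted_vote(proposals: List[Tuple[str, float]]) -> str:
    """Sum self-reported confidences per recommendation and return the top one."""
    totals: Dict[str, float] = defaultdict(float)
    for recommendation, confidence in proposals:
        totals[recommendation] += confidence
    return max(totals, key=totals.get)


def verify(recommendation: str, condition: str,
           guideline: Dict[str, Set[str]]) -> Optional[str]:
    """Pass the recommendation through only if the guideline lists it; else None."""
    allowed = guideline.get(condition, set())
    return recommendation if recommendation in allowed else None


# Two colluding assistants push drug_B with inflated confidence.
proposals = [("drug_B", 0.9), ("drug_B", 0.9), ("drug_A", 0.8)]
guideline = {"condition_X": {"drug_A"}}  # placeholder guideline table

consensus = weighted_vote(proposals)
checked = verify(consensus, "condition_X", guideline)
print(consensus, checked)  # drug_B None
```

Here the vote alone is captured by the colluding pair, while the verifier rejects the result because it is independent of the agents' confidences. A `None` outcome would typically trigger escalation or re-deliberation rather than silent substitution.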
5. Evaluation Paradigms and Benchmarks
Doctor Agents are subject to comprehensive, scenario-based evaluation:
- Dialogue-Focused Benchmarks: MAQuE and AgentClinic are simulated environments assessing Doctor Agent inquiry, dialogue competence, efficiency, and patient experience under realistic patient models, including noise, ambiguity, and emotional variability (Gong et al., 29 Sep 2025, Schmidgall et al., 2024).
- End-to-End Clinical Simulation: Doctorina MedBench extends scenario realism with the D.O.T.S. metric, quantifying agent performance across diagnosis, investigations, treatment, and dialogue efficiency, and supporting trap-based safety evaluation and category-based random sampling (Kozlova et al., 26 Mar 2026).
- Real-World Deployment: Autonomous Doctor Agents (e.g., Doctronic) have achieved 81% diagnostic concordance and 99.2% treatment plan consistency with board-certified clinicians over 500 real-world telehealth encounters, with zero hallucinated recommendations detected (Hayat et al., 27 Jun 2025). MATEC and RareAgents demonstrate similar viability in domain-specialist and rare-disease settings, respectively (Cho et al., 9 Feb 2025, Chen et al., 2024).
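Two of the quantities mentioned in this article, coverage of atomic information units (AIUs) and diagnostic concordance with clinicians, reduce to simple ratios once the matching is decided. The sketch below assumes AIUs are represented as string labels and that concordance means exact agreement between paired diagnoses. The actual benchmarks and studies may use more nuanced matching, so these definitions are illustrative only, and the case labels are placeholders.

```python
from typing import List, Set


def aiu_coverage(elicited: Set[str], reference: Set[str]) -> float:
    """Fraction of reference information units the agent actually elicited."""
    if not reference:
        raise ValueError("reference AIU set must not be empty")
    return len(elicited & reference) / len(reference)


def concordance(agent: List[str], clinician: List[str]) -> float:
    """Fraction of cases where agent and clinician give the same diagnosis."""
    if not agent or len(agent) != len(clinician):
        raise ValueError("need two non-empty lists of equal length")
    return sum(a == c for a, c in zip(agent, clinician)) / len(agent)


coverage = aiu_coverage(
    elicited={"onset", "fever", "diet"},
    reference={"onset", "fever", "travel", "medications"},
)
agreement = concordance(
    agent=["dx_1", "dx_2", "dx_1", "dx_3"],
    clinician=["dx_1", "dx_2", "dx_4", "dx_3"],
)
print(coverage, agreement)  # 0.5 0.75
```

Note that units the agent elicited but that are absent from the reference set (`"diet"` above) do not raise coverage, which is why benchmarks pair it with separate relevance or efficiency measures.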
6. Interaction Design, Communication, and Human Factors
- Persona and Communication Quality: Rule-based conversational Doctor Agents can boost perceived trust and acceptance by mimicking human physicians’ linguistic markers. In user studies, this was shown to enhance intimacy and trust relative to control or family-persona agents, even without deep technical automation (Hwang et al., 2021).
- Multimodal and UI Features: Several agents employ avatars, multiple input/output modalities (voice, gesture, text, click-based anatomical selection), and fallback dialogue hierarchies to humanize interaction, improve adherence, and cover non-clinical topics (Yan et al., 2021, Torkestani et al., 1 Feb 2025).
- Language and Cultural Adaptation: Prompt-optimized multi-agent systems like Dr.Copilot support real-time qualitative evaluation and communication feedback for doctor–patient interactions in low-resource languages such as Romanian (Niculae et al., 15 Jul 2025).
7. Socio-Cognitive and Reflexive Modeling
- Social Structure Modeling: Beyond pure diagnostic reasoning, some simulations embed Doctor Agents in cognitive social structures that model social-tie weighting, reflexive confidence updates, and feedback loops from network reputation, mirroring phenomena observed in real medical communities (Majumder, 2024).
- Evolution and Agent-Based Simulation: Simulacra such as Agent Hospital implement agent evolution entirely through memory augmentation and reflection without parameter fine-tuning, enabling rapid scaling and curriculum adaptation across tens of thousands of simulated patient interactions (Li et al., 2024).
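Improvement through memory augmentation rather than parameter updates can be sketched as retrieval over stored cases whose lessons are prepended to the next prompt. The example assumes cases are summarized as symptom sets and compared with Jaccard similarity; real systems are more likely to use learned embeddings. The `Case` fields, the placeholder diagnoses, and the prompt wording are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Case:
    symptoms: FrozenSet[str]
    diagnosis: str
    lesson: str  # reflection recorded after the case was resolved


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Set overlap in [0, 1]; defined as 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(base: List[Case], symptoms: Iterable[str], k: int = 2) -> List[Case]:
    """Return the k stored cases most similar to the query symptoms."""
    query = frozenset(symptoms)
    return sorted(base, key=lambda c: jaccard(c.symptoms, query), reverse=True)[:k]


def build_prompt(base: List[Case], symptoms: List[str], k: int = 2) -> str:
    """Prepend lessons from similar past cases to the new consultation."""
    lines = [f"- Past case ({c.diagnosis}): {c.lesson}"
             for c in retrieve(base, symptoms, k)]
    return ("Relevant experience:\n" + "\n".join(lines)
            + "\nNew patient symptoms: " + ", ".join(symptoms))


experience_base = [
    Case(frozenset({"fever", "cough"}), "dx_A", "Ask about symptom duration early."),
    Case(frozenset({"rash", "itching"}), "dx_B", "Check for new medications."),
    Case(frozenset({"fever", "headache"}), "dx_C", "Rule out red-flag signs first."),
]
prompt = build_prompt(experience_base, ["fever", "cough", "fatigue"])
print(prompt)
```

Appending a new `Case` after each resolved encounter changes future prompts immediately, which is what allows behavior to evolve without any fine-tuning step.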
In summary, the Doctor Agent paradigm fuses sequential interactive dialogue, modular and team-based reasoning, tool integration, robust data pipelines, and sophisticated consensus and reflection mechanisms to emulate and potentially scale key aspects of clinical expertise. These systems are now evaluated via multi-dimensional, scenario-based benchmarks under safety, robustness, and communication metrics closely aligned with real-world clinical competence (Schmidgall et al., 2024, Gong et al., 29 Sep 2025, Kozlova et al., 26 Mar 2026, Bashir et al., 1 Dec 2025). State-of-the-art instantiations combine dynamic adaptation, explicit safety layers, and ablation-tested architectures, aiming for both high diagnostic correctness and practical clinical efficacy.