Agent-Guided Symptom Elicitation

Updated 7 May 2026

Agent-guided symptom elicitation is an AI-driven approach that uses LLMs and multi-agent systems to iteratively collect and refine patient symptom data through structured dialogue.
The method partitions tasks among specialized agents—for history taking, question selection, evidence gathering, and diagnostic reasoning—to simulate clinical evaluations.
Empirical studies show that such strategies yield higher diagnostic accuracy and information coverage than traditional user-led or static evaluation methods.

Agent-guided symptom elicitation refers to the use of artificial intelligence agents—typically LLMs or multi-agent systems—to systematically and proactively collect symptom data from patients or simulated patients in a structured, multi-turn dialogue. Rather than passively responding to user inputs or relying on complete patient records upfront, agent-guided methods iteratively inquire about missing or discriminative symptoms, construct an evolving patient profile, and use this incrementally acquired evidence to refine differential diagnoses or clinical assessment. Recent work has demonstrated that such strategies substantially improve diagnostic accuracy, information completeness, and process explainability relative to user-led or static evaluation paradigms.

1. Architectures and Agent Coordination Mechanisms

Agent-guided symptom elicitation frameworks partition responsibilities among specialized modules or agents to simulate clinical reasoning and history taking. For example, MEDDxAgent (Rose et al., 26 Feb 2025) is built around three core modules orchestrated by the DDxDriver: a history-taking simulator combining a “doctor” LLM (question generation) and a “patient” LLM (simulated, vignette-consistent responses), a knowledge retrieval agent for external medical lookup, and a diagnosis strategy agent to periodically update the ranked differential. The DDxDriver maintains the evolving symptom/demographic profile and dialogue history, orchestrating module invocations in a ReAct-style loop: explicit reasoning (“Thought”) is interleaved with selected actions (“Action”), and each action determines which module to call next.
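
The driver's control flow can be sketched in a few lines. The sketch below is illustrative only and is not MEDDxAgent's implementation: the state fields, the `choose_action` rule, the fixed ordering of modules, and the stub modules are assumptions that stand in for LLM-backed components. It shows the loop shape described above, in which a reasoning step names the next module and that module then updates shared state.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class DriverState:
    """Shared state the driver maintains across turns (hypothetical fields)."""
    profile: Dict[str, str] = field(default_factory=dict)       # finding -> yes/no/unknown
    retrieved: List[str] = field(default_factory=list)          # external knowledge snippets
    differential: List[str] = field(default_factory=list)       # ranked candidate diagnoses
    trace: List[Tuple[str, str]] = field(default_factory=list)  # (thought, action) pairs


Module = Callable[[DriverState], None]


def choose_action(state: DriverState, min_answers: int) -> Tuple[str, str]:
    """Toy rule standing in for the LLM's Thought -> Action step."""
    if len(state.profile) < min_answers:
        return "Too few answers so far; keep taking the history.", "history_taking"
    if not state.retrieved:
        return "Profile is usable; look up supporting knowledge.", "knowledge_retrieval"
    if not state.differential:
        return "Knowledge is in hand; rank the differential.", "diagnosis"
    return "A ranked differential exists; stop.", "finish"


def run_driver(modules: Dict[str, Module], min_answers: int = 2,
               max_steps: int = 10) -> DriverState:
    state = DriverState()
    for _ in range(max_steps):
        thought, action = choose_action(state, min_answers)
        state.trace.append((thought, action))
        if action == "finish":
            break
        modules[action](state)  # the chosen module mutates the shared state
    return state


# Stub modules with placeholder content so the sketch runs without an LLM.
scripted_answers = iter([("finding_1", "yes"), ("finding_2", "no")])


def history_taking(state: DriverState) -> None:
    finding, answer = next(scripted_answers)
    state.profile[finding] = answer


def knowledge_retrieval(state: DriverState) -> None:
    state.retrieved.append("placeholder reference text")


def diagnosis(state: DriverState) -> None:
    state.differential = ["candidate_a", "candidate_b"]


final = run_driver({
    "history_taking": history_taking,
    "knowledge_retrieval": knowledge_retrieval,
    "diagnosis": diagnosis,
})
print([action for _, action in final.trace])
# ['history_taking', 'history_taking', 'knowledge_retrieval', 'diagnosis', 'finish']
```

A real driver would replace `choose_action` with a model call and would let the diagnosis module run repeatedly as answers accumulate, rather than once at the end.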

Triage systems such as AI Triage (Rashidian et al., 4 Jun 2025) decompose the process further, adding subagents for health data planning and retrieval, summarization, guideline verification, and multi-turn evidence gathering orchestrated through lightweight glue code. MAGI (Bi et al., 25 Apr 2025) in psychiatric assessment employs four agents: a navigation agent enforcing the branching interview logic, a judgment agent for real-time response classification, an adaptive question agent blending diagnostic, explanatory, and empathetic styles, and a diagnosis agent emitting explicit symptom-to-criterion chains of thought.
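
Branching interview logic of the kind a navigation agent enforces can be pictured as a lookup table over interview nodes. The example is a hypothetical sketch, not MAGI's code: the node names, the use of three judgment labels (POS, NEG, UNCERTAIN) as table keys, the retry cap, and the scripted judge are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# node -> {judgment: next node}; None ends the interview. Hypothetical tree.
TREE: Dict[str, Dict[str, Optional[str]]] = {
    "core_symptom": {
        "POS": "duration", "NEG": "secondary_symptom", "UNCERTAIN": "core_symptom",
    },
    "secondary_symptom": {
        "POS": "duration", "NEG": None, "UNCERTAIN": "secondary_symptom",
    },
    "duration": {
        "POS": None, "NEG": None, "UNCERTAIN": "duration",
    },
}


def navigate(judge: Callable[[str, int], str], start: str = "core_symptom",
             max_retries: int = 2) -> List[Tuple[str, str]]:
    """Walk the tree. UNCERTAIN re-asks the same node, up to max_retries times."""
    path: List[Tuple[str, str]] = []
    node: Optional[str] = start
    retries = 0
    while node is not None:
        judgment = judge(node, retries)
        path.append((node, judgment))
        next_node = TREE[node][judgment]
        if next_node == node:
            retries += 1
            if retries > max_retries:
                break  # give up on this node rather than loop forever
        else:
            retries = 0
        node = next_node
    return path


def scripted_judge(node: str, attempt: int) -> str:
    """Stand-in for a judgment agent: the first reply is vague, the second is clear."""
    script = {"core_symptom": ["UNCERTAIN", "POS"], "duration": ["POS"]}
    return script[node][attempt]


print(navigate(scripted_judge))
# [('core_symptom', 'UNCERTAIN'), ('core_symptom', 'POS'), ('duration', 'POS')]
```

Keeping the tree as data separates the interview protocol from the model calls, so the same walker can serve different instruments.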

“Strong Reasoning Isn’t Enough” (Long et al., 27 Jan 2026) and CA-MDA (Lin et al., 2020) argue for bifurcating the system into evidence collectors and diagnostic reasoners. The REFINE strategy (Long et al., 27 Jan 2026) employs a loop of information collector, evidence organizer, diagnosis reasoner, and diagnosis verifier, in which the verifier provides targeted feedback about missing evidences to guide subsequent queries.

2. Algorithms for Question Selection and Dialogue Management

Most agent-guided frameworks lack an explicit formal utility or information-gain formula for question selection, instead relying on chain-of-thought prompting or modular heuristics to identify high-impact questions. MEDDxAgent (Rose et al., 26 Feb 2025) uses the DDxDriver orchestrator to identify the “highest-impact missing symptom or antecedent” given the partial patient profile and may reference the current differential to guide discriminative queries, updating in fixed or dynamic iteration cycles.

SymptomAI (Breda et al., 5 May 2026) applies two general strategies: “canonical” interviews follow a checklist of history elements (onset, location, quality, severity, etc.), each queried in turn until all are addressed; “dynamic” interviews allow the LLM to generate its own follow-ups, with internal reasoning approximating an information-gain rule (which question most reduces diagnostic uncertainty). While no closed-form calculation is implemented, the model's chain-of-thought resembles choosing the question $q$ with the largest expected information gain about the diagnosis $D$, where $H$ denotes entropy and the expectation runs over the possible answers $a$:

$IG(q) = H[D\,|\, \text{history}] - E_{a}[ H[D\,|\,\text{history} \cup (q:a)] ]$
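
For a yes/no question, this quantity can be computed directly once a prior over diagnoses and per-diagnosis answer likelihoods are available. The condition names and probabilities below are made up, and the sketch illustrates the formula only. It is not how SymptomAI selects questions, since no closed-form calculation is implemented there.

```python
import math
from typing import Dict


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def posterior(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Bayes update: P(d | answer) is proportional to P(answer | d) * P(d)."""
    joint = {d: prior[d] * likelihood[d] for d in prior}
    total = sum(joint.values())
    return {d: v / total for d, v in joint.items()}


def expected_information_gain(prior: Dict[str, float],
                              p_yes_given_d: Dict[str, float]) -> float:
    """IG(q) = H[D] - E_a[H[D | a]] for a yes/no question q."""
    p_yes = sum(prior[d] * p_yes_given_d[d] for d in prior)
    p_no_given_d = {d: 1.0 - p for d, p in p_yes_given_d.items()}
    gain = entropy(prior)
    for p_answer, likelihood in ((p_yes, p_yes_given_d), (1.0 - p_yes, p_no_given_d)):
        if p_answer > 0:  # skip answers that cannot occur
            gain -= p_answer * entropy(posterior(prior, likelihood))
    return gain


prior = {"condition_a": 0.5, "condition_b": 0.5}
# A question whose answer separates the two conditions perfectly:
print(expected_information_gain(prior, {"condition_a": 1.0, "condition_b": 0.0}))  # 1.0
# A question answered the same way under both conditions:
print(expected_information_gain(prior, {"condition_a": 0.5, "condition_b": 0.5}))  # 0.0
```

The first question removes the full bit of uncertainty in the prior; the second removes none, so an information-gain rule would never ask it.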

The MAGI framework (Bi et al., 25 Apr 2025) frames each node of the interview tree as a decision: the judgment agent assigns POS/NEG/UNCERTAIN, and the navigation agent advances, branches, or loops accordingly. The adaptive question agent explicitly optimizes a utility function over question styles:

$U(s \,|\, \text{ctx}) = w_I \cdot IG(s;\text{ctx}) + w_C \cdot Clar(s;\text{ctx}) + w_E \cdot Emot(s;\text{ctx}) - \lambda \cdot Cost(s)$
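
Read as code, the utility is a weighted sum with a cost penalty, and the chosen style is the arg max over candidates. The weights, component scores, and style keys below are hypothetical placeholders; in a real system the components would come from model estimates rather than constants.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Weights:
    info: float = 0.5     # w_I
    clarity: float = 0.3  # w_C
    emotion: float = 0.2  # w_E
    cost: float = 0.1     # lambda


def utility(scores: Dict[str, float], w: Weights) -> float:
    """U = w_I*IG + w_C*Clar + w_E*Emot - lambda*Cost for one candidate style."""
    return (w.info * scores["ig"] + w.clarity * scores["clar"]
            + w.emotion * scores["emot"] - w.cost * scores["cost"])


def pick_style(candidates: Dict[str, Dict[str, float]], w: Weights) -> str:
    """Return the candidate style with the highest utility."""
    return max(candidates, key=lambda name: utility(candidates[name], w))


candidates = {
    "diagnostic":  {"ig": 0.9, "clar": 0.2, "emot": 0.1, "cost": 0.5},
    "explanatory": {"ig": 0.3, "clar": 0.9, "emot": 0.3, "cost": 0.3},
    "empathetic":  {"ig": 0.2, "clar": 0.4, "emot": 0.9, "cost": 0.2},
}

print(pick_style(candidates, Weights()))
# diagnostic
print(pick_style(candidates, Weights(info=0.2, clarity=0.2, emotion=0.6)))
# empathetic
```

Shifting weight from information gain to the emotional term changes the selected style without touching the candidates, which is the trade-off the formula encodes.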

In CA-MDA (Lin et al., 2020), the agent selects the next symptom based on a Q-network maximizing expected value conditioned on current confidence statistics from an ensemble of diagnostic heads. The decision to stop is set by a neuro-inspired threshold comparing the leading diagnosis's posterior probability against alternatives with variance margins.
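
One way to read the stopping rule is as a margin test over ensemble outputs: stop when the leading diagnosis's mean probability exceeds every rival's mean plus a multiple of that rival's standard deviation. This exact form, the multiplier `k`, and the probabilities below are assumptions for illustration and are not taken from the CA-MDA implementation.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def should_stop(head_probs: List[Dict[str, float]], k: float = 3.0) -> Optional[str]:
    """Return the leading diagnosis if it clears every rival's upper bound, else None.

    head_probs holds one probability table per ensemble head. The rule
    (leader's mean > rival's mean + k * rival's std) is an assumed form.
    """
    diagnoses = list(head_probs[0])
    mu = {d: mean([h[d] for h in head_probs]) for d in diagnoses}
    sigma = {d: pstdev([h[d] for h in head_probs]) for d in diagnoses}
    leader = max(diagnoses, key=lambda d: mu[d])
    for rival in diagnoses:
        if rival != leader and mu[leader] <= mu[rival] + k * sigma[rival]:
            return None  # a rival is still within the margin; keep asking
    return leader


early = [
    {"dx_a": 0.55, "dx_b": 0.45},
    {"dx_a": 0.45, "dx_b": 0.55},
    {"dx_a": 0.60, "dx_b": 0.40},
]
late = [
    {"dx_a": 0.95, "dx_b": 0.05},
    {"dx_a": 0.93, "dx_b": 0.07},
    {"dx_a": 0.96, "dx_b": 0.04},
]

print(should_stop(early))  # None
print(should_stop(late))   # dx_a
```

Early in the dialogue the heads disagree, so the rival's upper bound overlaps the leader and questioning continues. Once the heads agree, the margin clears and the agent can commit.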

REFINE (Long et al., 27 Jan 2026) periodically queries a diagnosis verifier, which enumerates missing atomic evidences still needed to support a hypothesis; this feedback is injected into the information collector’s context to actively target subsequent questions at uncollected evidences.
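
The feedback path from verifier to collector can be sketched as a queue that the verifier refills after every turn. Everything concrete here is hypothetical: the evidence identifiers, the per-hypothesis requirement table, and the toy reasoner stand in for model-driven components, and the loop is a simplified reading of the strategy rather than the REFINE implementation.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical table: atomic evidences a verifier expects for each hypothesis.
REQUIRED: Dict[str, Set[str]] = {
    "hypothesis_x": {"e1", "e2", "e3"},
    "hypothesis_y": {"e1", "e4"},
}


def reason(collected: Dict[str, bool]) -> str:
    """Toy reasoner: prefer the hypothesis with the most affirmed required evidences."""
    def support(hypothesis: str) -> int:
        return sum(collected.get(e, False) for e in REQUIRED[hypothesis])
    return max(sorted(REQUIRED), key=support)


def verify(hypothesis: str, collected: Dict[str, bool]) -> List[str]:
    """Verifier: list required evidences that have not been asked about yet."""
    return sorted(REQUIRED[hypothesis] - set(collected))


def refine_loop(patient: Dict[str, bool], first_question: str,
                max_turns: int = 10) -> Tuple[str, Dict[str, bool]]:
    collected: Dict[str, bool] = {}
    queue: List[str] = [first_question]
    hypothesis = ""
    for _ in range(max_turns):
        if not queue:
            break                                           # nothing left to ask
        evidence = queue.pop(0)                             # collector asks
        collected[evidence] = patient.get(evidence, False)  # organizer records
        hypothesis = reason(collected)                      # reasoner proposes
        queue = verify(hypothesis, collected)               # verifier feeds back
    return hypothesis, collected


patient = {"e1": True, "e2": True, "e3": False, "e4": False}
print(refine_loop(patient, "e1"))
# ('hypothesis_x', {'e1': True, 'e2': True, 'e3': False})
```

The loop ends only when the verifier reports no uncollected evidence for the current hypothesis, or when the turn budget runs out.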

3. Handling Patient Responses and Profile Updates

Across all frameworks, each response from the patient—affirmed, denied, or unknown—is logged and appended to the patient profile or history. In MEDDxAgent (Rose et al., 26 Feb 2025), the DDxDriver incorporates each (yes/no/unknown) answer, and the diagnosis strategy agent re-ranks differentials as new evidence accumulates. In refined psychiatric interviews (Bi et al., 25 Apr 2025), each node-action generates a tuple (node, response, judgment), which is explicitly tracked by the navigation agent. CA-MDA (Lin et al., 2020) maintains a visited-symptom mask and updates the dialogue state vector with each response. SymptomAI (Breda et al., 5 May 2026) continuously updates a free-text “history of present illness” (HPI) summary after each follow-up.
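
A minimal profile structure that supports this kind of logging keeps two views: an append-only turn history and a current-state map in which the latest answer wins. The class, field names, and sample replies below are illustrative assumptions and do not reproduce any of the cited systems.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class Answer(Enum):
    YES = "yes"
    NO = "no"
    UNKNOWN = "unknown"


@dataclass
class PatientProfile:
    findings: Dict[str, Answer] = field(default_factory=dict)
    history: List[Tuple[str, str, Answer]] = field(default_factory=list)

    def record(self, finding: str, raw_reply: str, answer: Answer) -> None:
        """Append every turn to the log; the latest answer wins in `findings`."""
        self.history.append((finding, raw_reply, answer))
        self.findings[finding] = answer

    def affirmed(self) -> List[str]:
        return [f for f, a in self.findings.items() if a is Answer.YES]

    def open_items(self, checklist: List[str]) -> List[str]:
        """Checklist items never asked about, or still unknown."""
        return [c for c in checklist
                if self.findings.get(c, Answer.UNKNOWN) is Answer.UNKNOWN]


profile = PatientProfile()
profile.record("fever", "Yes, since yesterday.", Answer.YES)
profile.record("cough", "I'm not sure.", Answer.UNKNOWN)
profile.record("cough", "No, no cough.", Answer.NO)

print(profile.affirmed())                              # ['fever']
print(profile.open_items(["fever", "cough", "rash"]))  # ['rash']
print(len(profile.history))                            # 3
```

Keeping the raw reply alongside the coded answer preserves an audit trail when a later turn revises an earlier one.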

The feedback loop is central: new evidence dynamically constrains further questioning and enables earlier stopping if sufficient discriminative features have been elicited.

4. Evaluation Metrics: Information Coverage and Diagnostic Accuracy

Recent work formalizes the objective of symptom elicitation via explicit coverage and accuracy metrics. Information Coverage Rate (ICR) (Long et al., 27 Jan 2026) quantifies the proportion of relevant evidences (atomic symptom/exam facts) uncovered by the agent:

$\mathrm{ICR} = \frac{|\widehat{E} \cap E|}{|E|}$

where $E$ is the gold set of necessary evidences and $\widehat{E}$ is those collected during the session. Diagnostic success rate (SR), top-k ground-truth prediction accuracy (e.g., GTPA@1, GTPA@5 (Rose et al., 26 Feb 2025)), average diagnosis rank, and progress rate (change in ground-truth rank per iteration) are also reported.
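
Both kinds of metric reduce to a few lines of set and list arithmetic. The evidence identifiers, rankings, and labels below are invented for illustration, and `top_k_accuracy` is a generic top-k hit rate of the kind reported as GTPA@k, not the cited benchmark's evaluation code.

```python
from typing import Iterable, Sequence


def icr(collected: Iterable[str], gold: Iterable[str]) -> float:
    """Information Coverage Rate: share of gold evidences the agent uncovered."""
    gold_set = set(gold)
    if not gold_set:
        raise ValueError("gold evidence set must not be empty")
    return len(set(collected) & gold_set) / len(gold_set)


def top_k_accuracy(rankings: Sequence[Sequence[str]], truths: Sequence[str],
                   k: int) -> float:
    """Fraction of cases whose true diagnosis appears in the top k of the ranking."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(rankings, truths))
    return hits / len(truths)


gold = ["e1", "e2", "e3", "e4"]
collected = ["e1", "e3", "e9"]  # e9 was asked but is not in the gold set
print(icr(collected, gold))     # 0.5

rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truths = ["a", "a", "a"]
print(round(top_k_accuracy(rankings, truths, k=1), 3))  # 0.333
print(round(top_k_accuracy(rankings, truths, k=2), 3))  # 0.667
```

Evidence collected outside the gold set does not raise ICR, so the metric rewards relevant questions rather than long interviews.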

SymptomAI (Breda et al., 5 May 2026) evaluates agentic vs. user-guided modes in a randomized controlled trial (RCT): top-5 diagnosis accuracy improves by 27.3% when the system guides symptom elicitation. Empirical evidence from (Long et al., 27 Jan 2026) demonstrates that high ICR correlates with diagnostic success and that REFINE, which tightly couples verification with collection, yields maximal coverage and accuracy across common and rare-disease case sets. In dementia assessment (Breithaupt et al., 14 Sep 2025), agent recall (sensitivity) reached 80.9% and specificity 90.8% compared to blinded specialist interviews.

5. Empirical Results and Comparative Performance

Agent-guided elicitation consistently yields superior performance over passive or user-guided conditions. MEDDxAgent (Rose et al., 26 Feb 2025) shows GTPA@1 improving from 0.45 (5 questions) to 0.86 (15 questions, 3 iterations) in interactive diagnosis settings, with rapid convergence observed within 10–15 turns. SymptomAI (Breda et al., 5 May 2026) demonstrates that enforcing a multi-turn, checklist-based history taking raises diagnostic hit rates by 25–30% relative to unconstrained user-initiated chats.

In simulation-based triage (Rashidian et al., 4 Jun 2025), clinician reviewers rated AI agent–patient interactions as 97.7% consistent with EHR-derived vignettes, and case summaries matched gold findings in 99.2% of cases. In psychiatric interviews (Bi et al., 25 Apr 2025), multi-agent orchestration advanced dialogue relevance, completeness, and overall diagnostic F₁ by large margins over single-agent or knowledge-only baselines.

REFINE (Long et al., 27 Jan 2026) closes the gap between interactive and static consultation by ensuring critical evidences are not omitted, particularly improving coverage and accuracy on rare or complex cases.

6. Design Principles and Common Methodological Challenges

Key design principles established across agent-guided frameworks include:

Enforce structured, multi-turn history taking before issuing a diagnosis (Breda et al., 5 May 2026)
Employ adaptive question selection, balancing checklist completeness with model-driven exploration (Breda et al., 5 May 2026, Bi et al., 25 Apr 2025)
Integrate modular agents specialized for evidence gathering, reasoning, and verification (Rose et al., 26 Feb 2025, Bi et al., 25 Apr 2025, Long et al., 27 Jan 2026)
Systematically log and update the evidence profile with every turn (Rose et al., 26 Feb 2025, Bi et al., 25 Apr 2025, Long et al., 27 Jan 2026)
Evaluate symptom elicitation using both evidence coverage and downstream diagnostic correctness metrics (Long et al., 27 Jan 2026)
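
The first principle, completing a structured history before any diagnosis is issued, amounts to a gate in front of the diagnosis step. The sketch below is a hypothetical illustration that uses four history elements as the checklist; the function name and return values are assumptions.

```python
from typing import Iterable, List

CANONICAL_CHECKLIST: List[str] = ["onset", "location", "quality", "severity"]


def next_step(addressed: Iterable[str],
              checklist: List[str] = CANONICAL_CHECKLIST) -> str:
    """Return the next checklist item to ask about, or 'diagnose' once all are addressed."""
    done = set(addressed)
    for item in checklist:
        if item not in done:
            return f"ask:{item}"
    return "diagnose"


print(next_step([]))                                            # ask:onset
print(next_step(["onset", "severity"]))                         # ask:location
print(next_step(["onset", "location", "quality", "severity"]))  # diagnose
```

Adaptive questioning can run alongside such a gate, provided the diagnosis step stays blocked until the checklist is complete.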

Challenges remain. Many systems lack formal policies for question utility, rely heavily on prompt engineering, and may over-query when strict stopping conditions are not met. Coverage of rare diseases and edge-cases is inherently limited by underlying data representativeness (Rashidian et al., 4 Jun 2025). Simulator realism is typically high in agent–agent evaluations but must be validated against noisier, real-world patient communications.

7. Extensions, Limitations, and Future Directions

Recent agents focus on modular extensibility: integrating additional specialty agents (e.g., mental health, pediatrics), deploying modular orchestration for model collaboration, and incorporating structured evidence verification (Rashidian et al., 4 Jun 2025, Long et al., 27 Jan 2026). Explicit inclusion of information-theoretic symptom-selection policies is identified as an approach for further efficiency gains (Rashidian et al., 4 Jun 2025).

Empirical studies validate substantial improvements in safety and reliability when enforcing minimum-coverage symptom interviews in real-world deployments (Breda et al., 5 May 2026). However, open challenges include handling uncertainty in patient responses, robustly managing comorbid or atypical presentations, and integrating real-time human-in-the-loop review in high-risk cases. Further work will involve hybrid model–clinician workflows, longitudinal evaluation on diverse patient populations, and rigorous privacy frameworks for sensitive health dialogs.

References:

(Rose et al., 26 Feb 2025) MEDDxAgent: A Unified Modular Agent Framework for Explainable Automatic Differential Diagnosis
(Rashidian et al., 4 Jun 2025) AI Agents for Conversational Patient Triage: Preliminary Simulation-Based Evaluation with Real-World EHR Data
(Breithaupt et al., 14 Sep 2025) Designing and Evaluating a Conversational Agent for Early Detection of Alzheimer's Disease and Related Dementias
(Bi et al., 25 Apr 2025) MAGI: Multi-Agent Guided Interview for Psychiatric Assessment
(Lin et al., 2020) Towards Causality-Aware Inferring: A Sequential Discriminative Approach for Medical Diagnosis
(Breda et al., 5 May 2026) SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment
(Long et al., 27 Jan 2026) Strong Reasoning Isn't Enough: Evidence Elicitation in Interactive Diagnosis