- The paper introduces an asynchronous oversight framework that decouples patient intake from diagnostic advice via a multi-agent system and strict guardrails.
- It implements g-AMIE to generate structured SOAP notes and draft patient messages, validated in a randomized, blinded OSCE study showing strong guardrail compliance.
- Empirical results indicate higher diagnostic accuracy and better-rated patient communication than the clinician control groups, albeit with additional oversight editing, largely to trim verbosity.
Physician-Centered Asynchronous Oversight for Conversational Diagnostic AI: A Technical Analysis
This paper presents a comprehensive framework and empirical evaluation for integrating LLM-based conversational diagnostic AI into clinical workflows under a paradigm of asynchronous, physician-centered oversight. The approach is instantiated via a multi-agent system, guardrailed-AMIE (g-AMIE), which is evaluated against nurse practitioners/physician assistants (NPs/PAs) and early-career primary care physicians (PCPs) in a randomized, blinded virtual OSCE. The paper addresses regulatory, safety, and workflow challenges inherent to deploying diagnostic AI in real-world healthcare.
Asynchronous Oversight Paradigm
The core contribution is the formalization and implementation of asynchronous oversight, decoupling patient intake (history-taking) from the delivery of individualized medical advice. g-AMIE is strictly constrained to abstain from providing diagnoses or management plans during patient interaction. Instead, it generates a structured summary (SOAP note) and a draft patient message, which are reviewed and authorized by an overseeing PCP (o-PCP) via a custom clinician cockpit interface. This design mirrors, but does not replicate, real-world models of physician oversight for NPs/PAs, adapting them to the capabilities and limitations of LLM-based agents.
Key technical elements of the oversight paradigm include:
- Guardrails: A dedicated guardrail agent screens all AI outputs for individualized medical advice, using few-shot prompting and a custom classifier validated on annotated dialogue data (95.96% accuracy).
- Multi-Agent System: The system comprises a dialogue agent (for intake), a guardrail agent (for safety), and a SOAP note generation agent (for structured summarization and plan proposal), all orchestrated to enforce strict separation of intake and advice.
- Clinician Cockpit: The oversight interface, co-designed with practicing PCPs, presents the transcript, editable SOAP note, and patient message, supporting efficient review, editing, and authorization.
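The guardrail mechanism above can be sketched as a screen-and-revise loop. The function names below (`llm_classify_advice`, `llm_revise`) are hypothetical stand-ins for the paper's few-shot-prompted LLM calls, and the keyword matching is a toy placeholder for the actual classifier:

```python
# Sketch of the guardrail's screen-and-revise loop. The two llm_* functions
# are illustrative placeholders for few-shot-prompted LLM calls; the keyword
# check below is NOT the paper's classifier.

MAX_REVISIONS = 3  # the paper caps revision attempts per turn for latency

def llm_classify_advice(text: str) -> bool:
    """Placeholder classifier: flags individualized medical advice."""
    banned = ("diagnosis", "you should take", "i recommend")
    return any(phrase in text.lower() for phrase in banned)

def llm_revise(text: str) -> str:
    """Placeholder reviser: drop flagged sentences from the turn."""
    kept = [s for s in text.split(". ") if not llm_classify_advice(s)]
    return ". ".join(kept) or "Let me ask you a few more questions."

def guardrail(turn: str) -> str:
    """Screen a candidate dialogue turn, revising up to MAX_REVISIONS times;
    if advice persists, defer to the overseeing clinician."""
    for _ in range(MAX_REVISIONS):
        if not llm_classify_advice(turn):
            return turn
        turn = llm_revise(turn)
    return "A clinician will review your case and follow up with any advice."
```

A compliant turn passes through unchanged; a turn containing advice is revised, and only if revision repeatedly fails does the system fall back to a deferral message.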
Evaluation Methodology
The evaluation leverages a three-step, randomized, virtual OSCE with 60 scenario packs:
- Intake with Guardrails: Patient actors interact with g-AMIE, g-PCP, or g-NP/PA, with all groups operating under the same guardrails.
- Oversight: Experienced o-PCPs review and edit the generated SOAP notes and patient messages, blinded to the source.
- Post-Evaluation: Independent PCPs rate the composite output using validated rubrics (PACES, PCCBP, GMCPQ, QNote) and custom oversight metrics. Auto-raters (LLM-based) provide additional quantitative assessment.
The scenario design ensures coverage of both routine and complex cases, with ground-truth diagnoses and management plans for objective benchmarking.
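Given ground-truth diagnoses, the top-k accuracy figures reported below can be computed from each ranked differential. The exact string matching here is a deliberate simplification; the study's human raters and auto-raters judge matches clinically:

```python
# Illustrative top-k scoring of ranked differential diagnoses (DDx) against
# per-scenario ground truth. String matching is a simplification of the
# clinical-equivalence judgment used by the study's raters.

def ddx_hit_at_k(ddx: list[str], ground_truth: str, k: int) -> bool:
    """True if the ground-truth diagnosis appears in the top k of the DDx."""
    top_k = [d.strip().lower() for d in ddx[:k]]
    return ground_truth.strip().lower() in top_k

def accuracy_at_k(cases: list[tuple[list[str], str]], k: int) -> float:
    """Fraction of (ddx, ground_truth) cases with a hit in the top k."""
    return sum(ddx_hit_at_k(ddx, gt, k) for ddx, gt in cases) / len(cases)
```

`accuracy_at_k(cases, 1)` corresponds to top-1 accuracy, while scoring with k equal to the full list length corresponds to the "full differential" figure.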
Empirical Results
Guardrail Adherence and Intake Quality
- g-AMIE and g-NP/PA: High compliance with guardrails (90.0% and 91.7% of cases, respectively), with minimal instances of individualized advice. g-PCP compliance was lower (71.7%).
- Intake Thoroughness: g-AMIE elicited a higher proportion of scenario-specific "red flag" symptoms (64.2%) compared to g-NP/PA (51.7%) and g-PCP (40.0%). PACES and PCCBP ratings consistently favored g-AMIE for history-taking and patient-centered communication.
SOAP Note and Communication Quality
- SOAP Note Ratings: g-AMIE's unedited notes were rated higher for readability, completeness, and accuracy across most sections, except for the Objective section (where all groups performed suboptimally due to the text-only setting).
- Patient Message: g-AMIE's messages were preferred by both o-PCPs and patient actors, particularly for clarity, empathy, and comprehensiveness.
- Oversight Time: Reviewing g-AMIE's outputs took slightly longer than reviewing the control groups' outputs, but oversight remained ~40% faster than the full PCP-led consultations reported in prior work.
- Edit Patterns: o-PCPs made clinically significant edits in ~40% of cases, most frequently in the Plan and patient message sections. Edits to g-AMIE outputs often targeted verbosity and conciseness, with less impact on diagnostic accuracy.
- Composite Quality: g-AMIE plus o-PCP oversight achieved higher diagnostic accuracy (81.7% top-1, 91.7% full differential) and management plan coverage (38.5%) than both control groups. However, oversight did not consistently improve, and sometimes reduced, the quality of g-AMIE's initial outputs.
Human Factors and User Experience
- Cognitive Load: o-PCPs reported higher cognitive load when editing verbose AI-generated notes, despite overall preference for g-AMIE over human controls.
- Patient Preference: Patient actors consistently rated g-AMIE higher for empathy, clarity, and overall satisfaction.
- Control Group Comparison: g-NP/PA outperformed g-PCP on several axes, possibly due to greater familiarity with constrained intake and higher median years of experience.
Implementation Considerations
System Architecture
- Dialogue Agent: Implements a three-phase protocol (intake, differential diagnosis validation, conclusion), using chain-of-thought reasoning and dynamic summarization at each turn.
- Guardrail Agent: Employs a robust few-shot-prompted classifier for real-time screening and revision of outputs, with at most three revision attempts per turn to meet latency constraints.
- SOAP Note Agent: Sequentially generates structured notes using constrained decoding (JSON schema, Markdown), ensuring machine-readability and alignment with clinical documentation standards.
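The agents described above can be wired into a minimal per-consultation pipeline. Everything below is stubbed for illustration: in the paper each agent is a prompted Gemini 2.0 Flash instance, the summary and reply are LLM outputs, and the SOAP note is produced by schema-constrained decoding rather than a hand-built dict:

```python
# Minimal orchestration sketch of the multi-agent pipeline; all agent
# internals are stubs standing in for prompted LLM calls.
import json
from dataclasses import dataclass, field

PHASES = ("intake", "ddx_validation", "conclusion")  # three-phase protocol

@dataclass
class Consultation:
    transcript: list[str] = field(default_factory=list)
    summary: str = ""        # dynamic summary, refreshed each turn
    phase: str = PHASES[0]

def dialogue_turn(state: Consultation, patient_msg: str) -> str:
    """One dialogue turn: record the message, refresh the running summary,
    and emit the next intake question (both stubbed)."""
    state.transcript.append(f"Patient: {patient_msg}")
    state.summary = " | ".join(state.transcript)  # stand-in for LLM summary
    reply = "Can you tell me more about that?"    # stand-in for LLM reply
    state.transcript.append(f"g-AMIE: {reply}")
    return reply

def generate_soap(state: Consultation) -> dict:
    """Stub SOAP agent: emits the four sections the o-PCP later edits;
    the real agent uses JSON-schema-constrained decoding."""
    note = {
        "subjective": state.summary,
        "objective": "No physical exam available (text-only intake).",
        "assessment": "Differential diagnosis for o-PCP review.",
        "plan": "Draft management plan; withheld pending authorization.",
    }
    return json.loads(json.dumps(note))  # round-trip guarantees valid JSON
```

In a deployed version, each `reply` would pass through the guardrail agent before reaching the patient, and the generated note would flow to the clinician cockpit for o-PCP review.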
Deployment and Scaling
- Model Base: Built on Gemini 2.0 Flash, accessible via Google Cloud APIs.
- Workflow Integration: The clinician cockpit is designed for EHR integration, supporting asynchronous review and minimizing workflow disruption.
- Resource Requirements: The multi-agent orchestration and constrained decoding introduce additional computational overhead, but the system is optimized for batch processing and parallelization in asynchronous settings.
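Because advice delivery is decoupled from intake, completed intakes can be batched into a review queue instead of blocking on a live physician. A hypothetical asyncio sketch of that pattern:

```python
# Sketch of batched, asynchronous intake: sessions run concurrently and
# their outputs land in a review queue for later o-PCP oversight. The
# helper names and the sleep are illustrative, not the paper's implementation.
import asyncio

async def run_intake(case_id: str) -> dict:
    """Simulate one intake session producing a note awaiting review."""
    await asyncio.sleep(0)  # stands in for multi-turn dialogue latency
    return {"case": case_id, "status": "pending_review"}

async def intake_batch(case_ids: list[str]) -> list[dict]:
    """Run intake sessions concurrently; oversight happens off this path."""
    return await asyncio.gather(*(run_intake(c) for c in case_ids))

review_queue = asyncio.run(intake_batch(["case-01", "case-02", "case-03"]))
```

Overseeing clinicians then drain `review_queue` on their own schedule, which is what removes the requirement for real-time physician availability.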
Limitations and Future Directions
- Generalizability: The paper's simulated, text-only OSCE setting does not capture the full complexity of real-world clinical workflows or patient heterogeneity.
- Human-AI Collaboration: The lack of explicit training for o-PCPs in the oversight paradigm may underestimate potential composite performance. Future work should explore onboarding, calibration, and adaptive interfaces.
- Documentation Optimization: Verbosity remains a challenge; research into adaptive summarization and audience-specific note generation is warranted.
- Cognitive Load: Further studies should quantify and mitigate cognitive load for overseeing clinicians, potentially via interface refinements and AI-driven prioritization of edits.
Implications and Outlook
This work demonstrates that LLM-based diagnostic AI, when operated under strict guardrails and asynchronous physician oversight, can generate high-quality clinical documentation and patient communication, with efficiency gains over traditional workflows. The asynchronous oversight paradigm addresses key regulatory and safety requirements, enabling scalable deployment without real-time physician availability.
Theoretically, the decoupling of intake and advice, enforced by multi-agent guardrails, provides a template for safe, accountable AI integration in other high-stakes domains. Practically, the findings suggest that AI systems can complement, rather than replace, clinicians—handling structured information gathering and documentation, while leaving final decision-making and patient communication to licensed professionals.
Future developments should focus on real-world clinical validation, adaptive oversight interfaces, and dynamic guardrail tuning to balance safety, efficiency, and user experience. The paradigm outlined here provides a robust foundation for responsible, scalable deployment of conversational diagnostic AI in healthcare.