Dr.Copilot: Enhancing Telemedicine Dialogue

Updated 16 July 2025
  • Dr.Copilot is a multi-agent large language model system designed to enhance patient-doctor communication in Romanian telemedicine by focusing on clarity and empathy.
  • It employs a sequential process with scoring, recommendation, and reconciliation agents to provide actionable feedback on doctor responses.
  • Empirical evaluations demonstrate significant improvements in communication quality and patient satisfaction, validating its structured, feedback-driven approach.

Dr.Copilot is a multi-agent LLM system specifically developed to enhance the quality of patient-doctor communication in Romanian telemedicine settings (Niculae et al., 15 Jul 2025). Unlike prior LLM-based medical copilots, which primarily target clinical accuracy or automation, Dr.Copilot focuses on evaluating and improving how doctors present written responses to patients—prioritizing clarity, empathy, and overall communicative effectiveness. By deploying a prompt-optimized, open-weight LLM pipeline capable of delivering structured, interpretable, and actionable feedback on real doctor answers, Dr.Copilot addresses the critical need for better presentation quality, data privacy, and language-specific adaptation in under-resourced medical environments.

1. System Architecture and Multi-Agent Design

Dr.Copilot consists of three LLM agents that process and refine doctor responses on a Romanian telemedicine platform:

  • Scoring Agent: Evaluates a doctor’s written answer along 17 clearly defined axes of presentation quality (e.g., empathy, problems addressed, grammatical correctness, use of platform functionalities). Each axis is associated with a DSPy prompt signature specifying its input (patient query and doctor response) and its expected output (typically a Likert-scale or binary label).
  • Recommender Agent: Based on the scoring, this agent suggests precise, actionable improvements in plain, constructive language. The recommendations are indicative, not prescriptive; for instance, advising a doctor to elaborate on symptom causes or clarify next steps for the patient.
  • Reconciliation Agent: Applies the suggestions from the Recommender Agent to produce a revised version of the original response and then re-runs the scoring process to objectively quantify improvements.

These agents run sequentially, creating a closed feedback loop: (1) scoring, (2) recommendation, (3) simulated revision, and (4) re-scoring. This approach allows immediate, interpretable quantification of the effect of a given recommendation, both in offline and live deployment contexts.
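The four-step loop above can be sketched in Python. This is a hedged illustration, not the paper's implementation: the agent internals (which are LLM calls in the real system) are stubbed out, and all function names, axis names, and threshold values are invented.

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    scores: dict[str, float]          # one score per presentation-quality axis
    recommendations: list[str]        # plain-language suggestions
    revised_scores: dict[str, float]  # scores after applying the suggestions

def score(query: str, answer: str) -> dict[str, float]:
    # Scoring Agent: in the real system, one prompt signature per axis (17 total).
    return {"empathy": 2.0, "problems_addressed": 3.0}

def recommend(scores: dict[str, float]) -> list[str]:
    # Recommender Agent: suggest improvements for low-scoring axes
    # (threshold of 3.0 is an invented placeholder).
    return [f"improve {axis}" for axis, s in scores.items() if s < 3.0]

def reconcile(answer: str, recs: list[str]) -> str:
    # Reconciliation Agent: apply the suggestions to produce a revision.
    return answer + " [revised per: " + "; ".join(recs) + "]"

def feedback_loop(query: str, answer: str) -> Feedback:
    initial = score(query, answer)                          # (1) scoring
    recs = recommend(initial)                               # (2) recommendation
    revised = reconcile(answer, recs)                       # (3) simulated revision
    return Feedback(initial, recs, score(query, revised))   # (4) re-scoring
```

The key design point the sketch preserves is that the same scoring function runs before and after revision, so any improvement is quantified on identical axes.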

2. Prompt and Feedback Optimization via DSPy

Dr.Copilot leverages the DSPy framework for automatic prompt optimization—critically important for effective LLM alignment in low-resource, highly specialized linguistic scenarios. Three distinct optimization protocols are utilized:

  • Labeled Few-Shot: Base prompts are enriched with a small set of handpicked, diverse examples to set a standard for output style and accuracy.
  • Bootstrap Few-Shot: The model itself generates and selects additional in-domain training examples from unlabeled data, expanding its coverage of edge cases and common failures.
  • SIMBA Optimizer: An iterative mechanism refines prompt candidates, evaluating them against human ground-truth annotations and selecting the best-performing variants (using metrics such as Pearson correlation for continuous scales and F₁ for binary tasks).

This results in highly targeted prompt signatures for each quality axis, balancing conciseness and expressiveness. The entire prompt optimization process can be formalized as

P^* = \mathop{\arg\max}_{P \in \mathrm{Candidates}} f_m(P)

where f_m(P) is the evaluation metric (e.g., Pearson correlation or F₁) for prompt candidate P.
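The selection objective can be made concrete with a small sketch. Everything here is illustrative: the candidate prompts, their predicted scores, and the ground-truth annotations are invented, and Pearson correlation stands in for f_m on a continuous (Likert) axis; a binary axis would swap in F₁.

```python
from statistics import mean
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    # Pearson correlation, computed directly from its definition.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / var

def select_prompt(candidates: dict[str, list[float]],
                  ground_truth: list[float]) -> str:
    # P* = argmax over candidates of f_m(P); here f_m is Pearson correlation
    # against human annotations.
    return max(candidates, key=lambda p: pearson(candidates[p], ground_truth))

truth = [1.0, 3.0, 5.0, 2.0]            # invented human Likert labels
variants = {
    "prompt_a": [2.0, 2.0, 4.0, 2.0],   # roughly tracks the truth
    "prompt_b": [5.0, 1.0, 1.0, 4.0],   # anti-correlated with the truth
}
best = select_prompt(variants, truth)   # -> "prompt_a"
```

In the actual pipeline the candidate set comes from the DSPy optimizers (Labeled Few-Shot, Bootstrap Few-Shot, SIMBA) rather than a hand-written dictionary.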

3. Feedback Dimensions and Structured Evaluation

Dr.Copilot provides feedback using 17 interpretable axes. Prominent examples include:

  • Empathy: Rating the emotional tone and warmth of the response (1–5 scale).
  • Problems Addressed: Assessing how completely the doctor addresses the patient’s concerns.
  • Grammatical Correctness: Flagging sentences with syntactic or orthographic errors.
  • Platform Functionality: Checking correct use of features (e.g., clarifications).
  • Explanation Completeness: Determining whether the answer adequately covers causes, symptoms, risk factors, next actions, and treatment options.
  • Other Checks: Inappropriate abbreviations, punctuation, and whether an in-person visit is mistakenly recommended online.

For instance, a doctor’s terse answer might trigger the following recommendation: “Răspunsul ar putea beneficia de detalii suplimentare referitoare la cauzele posibile ale simptomelor prezentate.” (Translation: “The answer could benefit from additional details regarding the possible causes of the presented symptoms.”)

This targeted, multi-axis approach allows granular guidance and supports iterative, user-driven refinement without interfering with the underlying clinical content.
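One way to picture the multi-axis feedback is as a typed registry in which each axis carries its scale (Likert vs. binary) and a plain-language hint surfaced when the score is low. This is a hedged sketch under invented assumptions: the axis subset, the hints, and the low-score thresholds are illustrative, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class Axis:
    name: str
    kind: Literal["likert", "binary"]  # 1-5 scale vs. yes/no check
    hint: str                          # surfaced when the score is low

# Three of the 17 axes, with invented hints.
AXES = [
    Axis("empathy", "likert", "Add a warmer, more empathetic tone."),
    Axis("grammatical_correctness", "binary", "Fix flagged sentences."),
    Axis("explanation_completeness", "likert",
         "Cover causes, risk factors, and next steps."),
]

def low_score(axis: Axis, value: float) -> bool:
    # Likert axes flag below the scale midpoint; binary axes flag on failure.
    return value < 3.0 if axis.kind == "likert" else value == 0.0

def hints(scores: dict[str, float]) -> list[str]:
    return [a.hint for a in AXES
            if a.name in scores and low_score(a, scores[a.name])]
```

Keeping hints per axis, rather than generating one free-form critique, is what makes the feedback granular and leaves the clinical content untouched.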

4. Data, Model Selection, and Deployment Strategies

Given the scarcity of labeled Romanian medical interaction data, the system was trained and validated using only 100 hand-annotated doctor-patient exchange pairs (with over 100,000 total consultations available). Human annotation focused on quality rather than quantity to maximize reliability—an essential adaptation for low-resource, privacy-constrained environments.

To ensure security and compliance with patient data regulations, Dr.Copilot strictly uses open-weight models (Gemma 12B, Gemma 27B, MedGemma-27B) that run entirely on-premise. For scalable, low-latency deployment, vLLM is used to run up to 17 simultaneous agent evaluations, resulting in an average response latency of approximately five seconds.
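Fanning the 17 per-axis evaluations out in parallel is what keeps latency around five seconds. The sketch below shows the shape of that fan-out with a thread pool and a stubbed model call; the deployed system instead batches these requests through an on-premise vLLM server, and the stub's return value is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

AXIS_NAMES = [f"axis_{i}" for i in range(17)]

def evaluate_axis(axis: str, query: str, answer: str) -> tuple[str, float]:
    # Placeholder for one prompt-signature LLM call against the local server.
    return axis, float(len(answer) % 5 + 1)

def evaluate_all(query: str, answer: str) -> dict[str, float]:
    # Submit all 17 axis evaluations concurrently and collect the results.
    with ThreadPoolExecutor(max_workers=17) as pool:
        futures = [pool.submit(evaluate_axis, a, query, answer)
                   for a in AXIS_NAMES]
        return dict(f.result() for f in futures)
```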

All doctor-patient data, modeling, and evaluation are contained within the telemedicine provider’s infrastructure, with no reliance on external API calls or third-party cloud computation.

5. Empirical Evaluation and Impact

The offline and live deployment results for Dr.Copilot highlight several outcomes:

  • Model Performance: Among the tested agents, MedGemma-27B coupled with the SIMBA prompt optimizer achieved the highest alignment with human annotators, measured by Pearson correlation on continuous axes and F₁ score on binary axes.
  • Self-Evaluation (Reconciliation): Using the LLM-as-a-Judge paradigm, revisions made per recommender advice led to an estimated 37% improvement across axes in simulation, and a 51% gain measured during live deployment with doctors.
  • Live Deployment with Doctors: Across 212 evaluation requests and 449 recommendations, 49 responses were actively revised by doctors according to system feedback. This intervention led to a 70.22% improvement in the like-to-response ratio (from 23.98% to 40.82%)—a direct indicator of increased patient satisfaction and communication quality on the platform.
  • Privacy and Safety: By focusing solely on communication and presentation (never modifying clinical content), Dr.Copilot avoids direct medical liability while substantially improving patient-perceived quality of service.
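The relative gain quoted above follows directly from the two ratios; a one-line check (with the figures from the deployment results) confirms the arithmetic to within rounding.

```python
def relative_gain(before: float, after: float) -> float:
    # Percentage improvement of `after` over `before`.
    return (after - before) / before * 100.0

# Like-to-response ratio: 23.98% before, 40.82% after -> ~70.2% relative gain.
gain = relative_gain(23.98, 40.82)
```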

6. Implications and Pioneering Contributions

Dr.Copilot demonstrates several novel characteristics in the context of digital health:

  • Targeting Low-Resource Languages: This system is, to date, one of the first LLM-based medical assistants optimized and deployed in a low-resource language setting (Romanian) with on-premise LLMs.
  • Structured Presentation Feedback: The use of explicit axes, prompt-optimized scoring, and plain-language recommendations provides a template for transparent, human-centric LLM feedback loops in safety-critical environments.
  • Adaptive, Multi-Agent Coordination: Sequential, agent-coordinated evaluation, recommendation, and reconciliation lead to objective, interpretable improvement quantification.
  • Real-World Measured Gains: Deployment with 41 doctors demonstrated clear, sustained improvements in patient satisfaction and communication quality, validating the multi-agent, prompt-optimized paradigm in live medical settings.

7. Conclusion

Dr.Copilot advances the field of medical AI copilots by shifting the focus from automation of clinical reasoning to optimization of communication quality, leveraging a multi-agent architecture, prompt optimization pipeline, and secure, language-adapted deployment (Niculae et al., 15 Jul 2025). Its empirical results underscore the value of structured, interpretable feedback; its open-weight, on-premise deployment strategy demonstrates the viability of LLMs in privacy-restricted, underrepresented language contexts. As such, Dr.Copilot provides a detailed, reproducible framework for integrating LLM-based feedback systems in global telemedicine and related domains.
