- The paper demonstrated that a retrieval-grounded LLM achieved higher overall quality scores than clinician-authored responses in CGM-informed diabetes counseling.
- The study employed a blinded multi-rater evaluation using structured prompts, CGM datasets, and clinical vignettes to assess clarity, personalization, and empathy.
- Results suggest that LLM-based conversational agents can enhance diabetes counseling efficiency by offering detailed, empathetic advice aligned with clinical guidelines.
Blinded Comparative Evaluation of Retrieval-Grounded LLM and Clinician Authored Responses for CGM-Driven Diabetes Counseling
Introduction
This study addresses the implementation and evaluation of a retrieval-grounded LLM-based conversational agent (CA) specifically focused on continuous glucose monitoring (CGM)-informed diabetes counseling. Amid an increasing prevalence of diabetes and widespread adoption of CGM in routine care, patient-facing interpretation of CGM data remains labor-intensive and inconsistent in clarity, empathy, and guideline concordance. The CA system, powered by GPT-5.1 with retrieval-augmented generation (RAG) grounded in domain-specific medical corpora, was developed for structured, patient-centered explanations of CGM outputs and associated diabetes management queries, explicitly without supporting autonomous clinical decision-making or medication modifications.
Methods
Data and Scenario Construction
Twelve CGM-informed vignettes, spanning both type 1 and type 2 diabetes, were synthesized from the ShanghaiT1DM/T2DM and OhioT1DM datasets. Each case comprised a de-identified CGM trace, a patient vignette, and associated visualizations. Six senior UK diabetes specialists each authored responses to 24 structured counseling questions for their assigned cases, yielding parallel clinician- and CA-generated answers for every scenario.
CA System: Architecture and Prompting
The CA used a non-fine-tuned GPT-5.1 LLM, orchestrated through structured prompts and RAG, drawing context from:
- Precomputed CGM summary statistics (mean, SD, CV, TIR/TBR/TAR per consensus thresholds)
- Synthetic patient vignettes
- Case-specific visualizations (CGM time series, AGP profiles)
- Guideline and educational content curated, segmented, and retrieved using FAISS-backed vector search
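As a rough illustration of the precomputed CGM summary statistics listed above, the following sketch computes mean, SD, CV, and time in, below, and above range from a list of readings. The 70–180 mg/dL target range is the commonly cited consensus range and is an assumption here, as the summary does not state the exact thresholds used; the function name, the use of population SD, and the sample readings are likewise illustrative and not taken from the study.

```python
from statistics import mean, pstdev


def cgm_summary(glucose_mg_dl, low=70.0, high=180.0):
    """Summarise a CGM trace given as a list of glucose readings in mg/dL.

    low/high default to the commonly cited 70-180 mg/dL target range
    (an assumption; the study's exact thresholds are not given here).
    """
    n = len(glucose_mg_dl)
    if n == 0:
        raise ValueError("empty CGM trace")
    avg = mean(glucose_mg_dl)
    sd = pstdev(glucose_mg_dl)  # population SD; sample SD is an equally valid choice
    below = sum(1 for g in glucose_mg_dl if g < low)
    above = sum(1 for g in glucose_mg_dl if g > high)
    in_range = n - below - above
    return {
        "mean": avg,
        "sd": sd,
        "cv_percent": 100.0 * sd / avg,        # coefficient of variation
        "tir_percent": 100.0 * in_range / n,   # time in range
        "tbr_percent": 100.0 * below / n,      # time below range
        "tar_percent": 100.0 * above / n,      # time above range
    }


# Made-up readings for illustration only (not study data).
readings = [65, 90, 110, 150, 185, 200, 140, 120, 100, 60]
summary = cgm_summary(readings)
print(summary)
```

Because each reading is treated as one equally weighted sample, the percentages correspond to time fractions only if the trace is evenly sampled.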
The CA was explicitly instructed to avoid direct therapeutic advice, instead focusing on clear, non-judgmental explanation, practical suggestions, and empathic support, with escalation instructions for patterns indicative of acute risk.
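The retrieval and prompting flow described in this section might be sketched roughly as below. This is not the authors' implementation: a brute-force cosine-similarity search stands in for the FAISS-backed vector search, the three-dimensional "embeddings" and passage texts are placeholders, and the prompt wording and the names `top_k` and `build_prompt` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, passages, k=2):
    """Brute-force nearest neighbours; a vector index would replace this sort."""
    ranked = sorted(passages, key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    return ranked[:k]


def build_prompt(question, stats, vignette, retrieved):
    """Assemble a structured prompt (hypothetical wording, not the study's prompt)."""
    context = "\n".join(f"- {p['text']}" for p in retrieved)
    return (
        "Explain the CGM findings clearly and without judgement. "
        "Do not give direct therapeutic advice; advise contacting the care team "
        "if the data suggest acute risk.\n"
        f"CGM summary: {stats}\n"
        f"Patient vignette: {vignette}\n"
        f"Guideline excerpts:\n{context}\n"
        f"Question: {question}"
    )


# Placeholder corpus with toy 3-dimensional vectors (illustrative only).
passages = [
    {"text": "placeholder excerpt on time-in-range targets", "vec": [0.9, 0.1, 0.0]},
    {"text": "placeholder excerpt on sensor troubleshooting", "vec": [0.0, 0.2, 0.9]},
    {"text": "placeholder excerpt on recognising low glucose", "vec": [0.7, 0.6, 0.1]},
]
query_vec = [1.0, 0.2, 0.0]  # toy embedding of the patient's question

retrieved = top_k(query_vec, passages, k=2)
prompt = build_prompt(
    question="What does my time in range mean?",
    stats={"tir_percent": 60.0, "tbr_percent": 20.0, "tar_percent": 20.0},
    vignette="Adult with type 2 diabetes (synthetic example).",
    retrieved=retrieved,
)
print(prompt)
```

Keeping the scope constraints in the fixed part of the prompt, separate from the retrieved context, is one way to make the "explain, do not prescribe" boundary independent of whichever passages are retrieved.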
Blinded Comparative Evaluation
In a multi-phase protocol:
- Each clinician generated responses to assigned case questions.
- All clinician and CA responses were then rated by three blinded diabetes specialists (excluding self-assessment), yielding 864 unique evaluations across 288 responses.
Quality was rated on six 1–5 Likert-scale dimensions: clinical accuracy, guideline adherence, actionability, personalization, clarity, and empathy/emotional support. Safety was assessed with a three-level flag. Raters also attempted to classify the presumed source (clinician vs CA).
Analysis employed linear mixed-effects models with random intercepts for case and rater, examining both main effects and domain/dimension-specific differences. Inter-rater reliability was quantified using ICC(2,1) and ICC(2,3) metrics.
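For readers unfamiliar with the reliability metrics, the sketch below computes ICC(2,1) and ICC(2,k) from a complete responses-by-raters table using the standard two-way random-effects mean-square formulas. It assumes every rater scored every response, which may not hold exactly in the study given that self-assessment was excluded; the function name and the ratings are illustrative, not study data. With three raters per response, the second value corresponds to ICC(2,3).

```python
def icc2(ratings):
    """Return (ICC(2,1), ICC(2,k)) for a complete targets-by-raters table."""
    n = len(ratings)       # number of rated responses (targets)
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: targets, raters, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-target mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


# Illustrative ratings: 6 responses (rows) scored by 4 raters (columns).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_single, icc_average = icc2(ratings)
print(round(icc_single, 2), round(icc_average, 2))  # 0.29 0.62
```

The gap between the two values shows why reliability improves with score aggregation: averaging over raters dampens rater-specific offsets and noise.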
Main Results
Quality Ratings
Across all cases and domains, the CA outperformed clinicians on overall quality (mean 4.37 vs 3.58; mean difference 0.782, 95% CI 0.692–0.872, P < 0.001).
Most pronounced differences were detected in:
- Empathy (mean difference 1.062; 95% CI 0.948–1.177)
- Actionability (0.992; 0.877–1.106)
- Personalization (0.867) and clarity (0.713), all P < 0.001
- Counseling domains involving blood glucose interpretation, medication/treatment guidance, and psychosocial/emotional support, where the CA's advantage was consistent
Performance differences were smallest in domains focused on long-term goals/motivation and technical device issues, indicating that the benefit of CA responses is particularly marked for data-driven interpretation, practical guidance, and affectively sensitive patient queries.
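As a back-of-envelope check on the reported intervals, the sketch below recovers an approximate standard error and z-statistic from an estimate and its 95% CI. It assumes a symmetric normal-approximation (Wald) interval, which is an assumption about how the intervals were constructed; the mixed-effects models may have used a different method, so the values are indicative only. The function name is illustrative.

```python
def wald_summary(estimate, ci_low, ci_high, z_crit=1.96):
    """Approximate SE and z-statistic implied by a symmetric 95% CI."""
    se = (ci_high - ci_low) / (2 * z_crit)
    return se, estimate / se


# Reported mean differences (CA minus clinician) with 95% CIs.
reported = {
    "overall": (0.782, 0.692, 0.872),
    "empathy": (1.062, 0.948, 1.177),
    "actionability": (0.992, 0.877, 1.106),
}

results = {name: wald_summary(*vals) for name, vals in reported.items()}
for name, (se, z) in results.items():
    print(f"{name}: SE ~ {se:.3f}, z ~ {z:.1f}")
```

Under this approximation the overall difference has a standard error of roughly 0.046, so the estimate lies many standard errors from zero, consistent with the reported P < 0.001.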
Safety and Source Attribution
Major safety concerns were rare and comparably distributed (CA: 0.7%, Clinician: 0.7%). However, qualitative safety comments more frequently accompanied CA responses, typically regarding alignment with NHS policy (e.g., GLP-1 eligibility criteria), the completeness of question addressing, or feasibility of advice.
Raters identified the source with high accuracy (88% overall among definitive judgments), although the ability to discriminate varied substantially between raters.
Response Length and Stylistic Determinants
CA responses were nearly three times longer (mean ~211 words) than clinician-authored ones (~73 words). However, additional analysis revealed that response length did not account for the quality difference, except for a negative association between empathy ratings and length for clinicians, which was not observed for the CA.
Inter-Rater Reliability
Consistency among raters was modest, as is typical for narrative clinical response assessment, and improved with score aggregation. Reliability was higher for CA-authored content in some dimensions, suggesting greater standardization and less variability than in human responses.
Implications
Practical Implementation
Findings indicate that high-capacity, retrieval-grounded LLM systems can provide patient-facing explanations of CGM patterns, actionable practical suggestions, and robust empathic communication that, under controlled vignette-based conditions, match or surpass the work of experienced diabetes clinicians on core quality metrics.
Potential clinical roles include:
- Adjunctive use in routine CGM review, preparing patients for clinical encounters
- Augmentation of standardized patient education and interpretive workflows
- Improved efficiency for clinicians by offloading explanatory and routine interpretive tasks, reserving human expertise for individualized clinical decision-making
Despite high structured ratings, the CA did generate responses that triggered interpretive, policy-related, or behavioral feasibility concerns. This underscores the necessity for continued oversight, consistency with rapidly evolving clinical standards, and careful configuration of retrieval content.
Theoretical Insights
The results underscore the capacity of LLMs, when appropriately constrained by RAG with high-quality guideline corpora and robust prompt engineering, to produce context-aware, domain-consistent, and emotionally resonant responses. The CA's relative advantage for actionability and empathy highlights an area where scalable automation can materially affect patient experience, especially for common counseling queries under bounded clinical scope.
However, the study exposes the model’s limitations for nuanced therapeutic strategy and real-time individualized reasoning, with structural and stylistic differences rendering outputs consistently distinguishable from those authored by clinicians.
Limitations and Future Work
- Vignette-based assessment, rather than deployment in live patient-clinician-CGM system workflows, restricts external validity.
- Limited to common-case scenarios; performance in complex, rare, or ambiguous presentations remains untested.
- Clinician diversity and inter-rater scoring heterogeneity influence comparative ratings.
- Findings pertain to GPT-5.1 and the specific RAG configuration; generalizability to other foundation models or domain corpora is not established.
Prospective, pragmatic trials in clinical settings are required to measure impacts on consultation efficiency, patient comprehension, subsequent management decisions, and risk signaling. Governance frameworks must address safe integration, source transparency, and oversight in real-world clinical deployment.
Conclusion
Retrieval-grounded LLM-based conversational agents can match and often exceed mean clinician quality ratings in CGM-informed diabetes counseling along multiple axes, particularly empathy and actionability, when used within strictly delineated explanatory and educational domains. These systems offer tangible value for scalable patient education and standardization of CGM interpretation. Nevertheless, current evidence does not support their use for autonomous therapeutic recommendations, personalized medication adjustments, or as replacements for clinical judgment. Prospective validation in real-world care pathways is essential to determine optimal roles, safety parameters, and patient/clinician acceptance.