Blinded Multi-Rater Comparative Evaluation of a Large Language Model and Clinician-Authored Responses in CGM-Informed Diabetes Counseling

Published 16 Apr 2026 in cs.CL | (2604.15124v1)

Abstract: Continuous glucose monitoring (CGM) is central to diabetes care, but explaining CGM patterns clearly and empathetically remains time-intensive. Evidence for retrieval-grounded LLM systems in CGM-informed counseling remains limited. This study aimed to evaluate whether a retrieval-grounded LLM-based conversational agent (CA) could support patient understanding of CGM data and preparation for routine diabetes consultations. We developed a retrieval-grounded LLM-based CA for CGM interpretation and diabetes counseling support. The system generated plain-language responses while avoiding individualized therapeutic advice. Twelve CGM-informed cases were constructed from publicly available datasets. Between Oct 2025 and Feb 2026, 6 senior UK diabetes clinicians each reviewed 2 assigned cases and answered 24 questions. In a blinded multi-rater evaluation, each CA-generated and clinician-authored response was independently rated by 3 clinicians on 6 quality dimensions. Safety flags and perceived source labels were also recorded. Primary analyses used linear mixed-effects models. A total of 288 unique responses (144 CA and 144 clinician) generated 864 ratings. The CA received higher quality scores than clinician responses (mean 4.37 vs 3.58), with an estimated mean difference of 0.782 points (95% CI 0.692-0.872; P<.001). The largest differences were for empathy (1.062, 95% CI 0.948-1.177) and actionability (0.992, 95% CI 0.877-1.106). Safety flag distributions were similar, with major concerns rare in both groups (3/432, 0.7% each). Retrieval-grounded LLM systems may have value as adjunct tools for CGM review, patient education, and preconsultation preparation. However, these findings do not support autonomous therapeutic decision-making or unsupervised real-world use.

Authors (9)

Summary

The paper demonstrated that a retrieval-grounded LLM achieved higher overall quality scores than clinician-authored responses in CGM-informed diabetes counseling.
The study employed a blinded multi-rater evaluation using structured prompts, CGM datasets, and clinical vignettes to assess six quality dimensions, including clarity, personalization, and empathy.
Results suggest that LLM-based conversational agents may support diabetes counseling as adjunct tools by offering detailed, empathetic explanations grounded in clinical guidelines; effects on consultation efficiency were not measured.

Blinded Comparative Evaluation of Retrieval-Grounded LLM and Clinician Authored Responses for CGM-Driven Diabetes Counseling

Introduction

This study addresses the implementation and evaluation of a retrieval-grounded LLM-based conversational agent (CA) specifically focused on continuous glucose monitoring (CGM)-informed diabetes counseling. Amid an increasing prevalence of diabetes and widespread adoption of CGM in routine care, patient-facing interpretation of CGM data remains labor-intensive and inconsistent in clarity, empathy, and guideline concordance. The CA system, powered by GPT-5.1 with retrieval-augmented generation (RAG) grounded in domain-specific medical corpora, was developed for structured, patient-centered explanations of CGM outputs and associated diabetes management queries, explicitly without supporting autonomous clinical decision-making or medication modifications.

Methods

Data and Scenario Construction

Twelve CGM-informed vignettes, spanning both type 1 and type 2 diabetes, were synthesized from the ShanghaiT1DM/T2DM and OhioT1DM datasets. Each case comprised a de-identified CGM trace, patient vignette, and associated visualizations. Six senior UK diabetes specialists each reviewed two assigned cases and authored responses to 24 structured counseling questions in total, yielding parallel clinician- and CA-generated answers for every question (144 from each source).

CA System: Architecture and Prompting

The CA utilized a non-finetuned GPT-5.1 LLM, orchestrated through structured prompts and RAG, drawing context from:

Precomputed CGM summary statistics (mean, SD, CV, TIR/TBR/TAR per consensus thresholds)
Synthetic patient vignettes
Case-specific visualizations (CGM time series, AGP profiles)
Guideline and educational content curated, segmented, and retrieved using FAISS-backed vector search
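The summary statistics in the first item can be illustrated with a minimal sketch. It assumes glucose readings in mg/dL at evenly spaced intervals and the commonly cited consensus range of 70–180 mg/dL for time in range; the function name `cgm_summary`, the use of the population standard deviation, and the toy trace are illustrative assumptions, not details taken from the study.

```python
from statistics import mean, pstdev

def cgm_summary(glucose_mg_dl, low=70.0, high=180.0):
    """Summarize a CGM trace given as a list of glucose readings in mg/dL.

    Assumes evenly spaced readings, so the share of readings in a band
    approximates the share of time spent in that band.
    """
    if not glucose_mg_dl:
        raise ValueError("empty CGM trace")
    n = len(glucose_mg_dl)
    avg = mean(glucose_mg_dl)
    sd = pstdev(glucose_mg_dl)  # population SD; a sample SD is an equally valid choice
    return {
        "mean": avg,
        "sd": sd,
        "cv_percent": 100.0 * sd / avg,
        "tir_percent": 100.0 * sum(low <= g <= high for g in glucose_mg_dl) / n,
        "tbr_percent": 100.0 * sum(g < low for g in glucose_mg_dl) / n,
        "tar_percent": 100.0 * sum(g > high for g in glucose_mg_dl) / n,
    }

# Invented toy trace, not study data.
trace = [60, 90, 120, 150, 200, 250, 110, 100]
summary = cgm_summary(trace)
print(summary)
```

Precomputing these values and passing them to the model as text, rather than asking the model to derive them from raw readings, keeps the arithmetic deterministic and auditable.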

The CA was explicitly instructed to avoid direct therapeutic advice, instead focusing on clear, non-judgmental explanation, practical suggestions, and empathic support, with escalation instructions for patterns indicative of acute risk.
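As a rough illustration of how retrieved guideline excerpts and the instruction to avoid therapeutic advice might be combined into one prompt, the sketch below uses a plain cosine-similarity ranking as a stand-in for the FAISS-backed search. Everything in it is hypothetical: the guardrail wording, the three-dimensional toy embeddings, and the helper names `top_k` and `build_prompt` are not from the study, and the case visualizations are omitted.

```python
import math

# Hypothetical guardrail wording; the study's actual prompt is not given here.
GUARDRAIL = (
    "Explain CGM patterns in plain, non-judgmental language. "
    "Do not give individualized therapeutic advice or medication changes. "
    "If the data suggest acute risk, advise contacting the care team."
)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k=2):
    """Return the k chunk texts most similar to the query vector.

    chunks is a list of (text, embedding) pairs.
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, stats_text, vignette, retrieved):
    """Assemble guardrail, case context, and retrieved excerpts into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved)
    return (
        f"{GUARDRAIL}\n\n"
        f"Patient vignette: {vignette}\n"
        f"CGM summary: {stats_text}\n"
        f"Guideline excerpts:\n{context}\n\n"
        f"Question: {question}"
    )

# Toy corpus with made-up embeddings.
chunks = [
    ("Time in range targets", [1.0, 0.0, 0.0]),
    ("Hypoglycemia recognition and escalation", [0.0, 1.0, 0.0]),
    ("Sensor troubleshooting", [0.0, 0.0, 1.0]),
]
retrieved = top_k([0.9, 0.4, 0.0], chunks, k=2)
prompt = build_prompt(
    "Why are my readings high after breakfast?",
    "TIR 62.5%, TAR 25%, TBR 12.5%",
    "Adult with type 2 diabetes (synthetic)",
    retrieved,
)
print(prompt)
```

Placing the guardrail first and limiting the context to the top-ranked excerpts is one common design; a production system would use learned embeddings and an approximate-nearest-neighbor index in place of the exhaustive ranking shown here.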

Blinded Comparative Evaluation

In a multi-phase protocol:

Each clinician generated responses to assigned case questions.
All clinician and CA responses were then rated by three blinded diabetes specialists (excluding self-assessment), yielding 864 unique evaluations across 288 responses.
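The rating counts follow from the design: 6 clinicians × 24 questions gives 144 clinician responses, each paired with a CA response (288 in total), and 3 raters per response gives 864 ratings. The sketch below reproduces that arithmetic with a simple rotation that never assigns a clinician to rate their own question set. The rotation scheme, the clinician labels, and the choice to also exclude the authoring clinician from rating the paired CA response are assumptions made for illustration; the study's actual allocation procedure is not described here.

```python
CLINICIANS = ["C1", "C2", "C3", "C4", "C5", "C6"]  # hypothetical labels
QUESTIONS_PER_CLINICIAN = 24
RATERS_PER_RESPONSE = 3

def assign_raters(author, pool, offset, k=RATERS_PER_RESPONSE):
    """Pick k distinct raters from the pool, excluding the author.

    The offset rotates the starting rater so workload is spread roughly evenly.
    """
    eligible = [c for c in pool if c != author]
    return [eligible[(offset + i) % len(eligible)] for i in range(k)]

ratings = []
for author in CLINICIANS:
    for question in range(QUESTIONS_PER_CLINICIAN):
        raters = assign_raters(author, CLINICIANS, offset=question)
        # The clinician answer and the paired CA answer share one rater panel.
        for source in ("clinician", "CA"):
            for rater in raters:
                ratings.append((author, question, source, rater))

responses = {(author, question, source) for author, question, source, _ in ratings}
print(len(responses), len(ratings))
```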

Quality was rated on six 1–5 Likert-scale dimensions: clinical accuracy, guideline adherence, actionability, personalization, clarity, and empathy/emotional support. Safety was assessed with a three-level flag. Raters also attempted to classify the presumed source (clinician vs CA).

Analysis employed linear mixed-effects models with random intercepts for case and rater, examining both main effects and domain/dimension-specific differences. Inter-rater reliability was quantified using ICC(2,1) and ICC(2,3) metrics.
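One plausible form of the primary model, written out under the assumption of crossed random intercepts and a single fixed effect for response source, is shown below. The notation is ours, and the study's exact specification (for example, additional fixed effects or interaction terms for dimension and domain) may differ.

```latex
y_{ijk} = \beta_0 + \beta_1\,\mathrm{CA}_{i} + u_{j} + v_{k} + \varepsilon_{ijk},
\qquad
u_{j} \sim \mathcal{N}(0, \sigma_u^2),\quad
v_{k} \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
```

Here $y_{ijk}$ is the score given to response $i$ (belonging to case $j$) by rater $k$, $\mathrm{CA}_{i}$ equals 1 for CA-generated responses and 0 for clinician-authored ones, $u_{j}$ and $v_{k}$ are the case and rater intercepts, and $\beta_1$ is the adjusted mean difference between sources. Dimension-specific contrasts would correspond to adding a dimension term and its interaction with $\mathrm{CA}_{i}$.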

Main Results

Quality Ratings

Across all cases and domains, the CA outperformed clinicians on overall quality (mean 4.37 vs 3.58; mean difference 0.782, 95% CI 0.692–0.872, P < 0.001).
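As a quick consistency check on the reported interval, the standard error can be backed out from the confidence limits. The sketch assumes a symmetric Wald-type interval with a normal critical value of 1.96; a mixed model may instead use a t-based critical value, so the result is approximate.

```python
def wald_se_from_ci(lower, upper, z_crit=1.96):
    """Approximate standard error implied by a symmetric 95% confidence interval."""
    return (upper - lower) / (2 * z_crit)

estimate = 0.782
lower, upper = 0.692, 0.872

se = wald_se_from_ci(lower, upper)   # roughly 0.046
midpoint = (lower + upper) / 2       # should coincide with the point estimate
z_stat = estimate / se               # large ratio, consistent with P < 0.001

print(f"SE ~ {se:.4f}, midpoint = {midpoint:.3f}, z ~ {z_stat:.1f}")
```

The model-based estimate (0.782) is slightly smaller than the difference in raw means (4.37 − 3.58 = 0.79), as expected when case and rater effects are adjusted for.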

The most pronounced differences were in:

Empathy (mean difference 1.062; 95% CI 0.948–1.177)
Actionability (mean difference 0.992; 95% CI 0.877–1.106)
Personalization and clarity (mean differences 0.867 and 0.713, respectively; both P < 0.001)

By question domain, the CA was rated consistently higher for blood glucose interpretation, medication/treatment guidance, and psychosocial/emotional support.

Performance differences were smallest in domains focused on long-term goals/motivation and technical device issues, indicating that the benefit of CA responses is particularly marked for data-driven interpretation, practical guidance, and affectively sensitive patient queries.

Safety and Source Attribution

Major safety concerns were rare and comparably distributed (3/432 ratings, 0.7%, in each group). However, qualitative safety comments more frequently accompanied CA responses, typically regarding alignment with NHS policy (e.g., GLP-1 eligibility criteria), whether the question was fully addressed, or the feasibility of the advice.

Raters identified the source with high accuracy (88% overall among definitive judgments), although the ability to discriminate varied substantially between raters.
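Accuracy "among definitive judgments" implies that some ratings carried no firm guess and were excluded from the denominator. A minimal sketch of that calculation follows; the label `"unsure"` for non-definitive judgments and the toy data are assumptions, since the actual response options are not listed here.

```python
def attribution_accuracy(judgments, undecided="unsure"):
    """Share of correct source guesses among definitive judgments only.

    judgments is a list of (true_source, perceived_source) pairs.
    Returns None when no definitive judgment exists.
    """
    definitive = [(t, p) for t, p in judgments if p != undecided]
    if not definitive:
        return None
    return sum(t == p for t, p in definitive) / len(definitive)

# Invented example: 10 judgments, 2 undecided, 7 of the remaining 8 correct.
toy = [
    ("CA", "CA"), ("CA", "CA"), ("CA", "CA"), ("CA", "clinician"),
    ("clinician", "clinician"), ("clinician", "clinician"),
    ("clinician", "clinician"), ("clinician", "clinician"),
    ("CA", "unsure"), ("clinician", "unsure"),
]
accuracy = attribution_accuracy(toy)
print(accuracy)
```

Because undecided ratings are dropped, this figure can overstate how often raters recognize the source across all ratings, so the share of undecided judgments is worth reporting alongside it.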

Response Length and Stylistic Determinants

CA responses were nearly three times as long (mean ~211 words) as clinician-authored ones (~73 words). However, additional analysis indicated that response length did not account for the quality difference. The one exception was a negative association between empathy ratings and length for clinician responses, which was not observed for the CA.

Inter-Rater Reliability

Consistency among raters was modest, which is typical for narrative clinical response assessment, and improved with score aggregation. Reliability was higher for CA-authored content in some dimensions, suggesting greater standardization and less variability than in human responses.
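Why reliability improves with score aggregation can be seen by computing single-rater and averaged-rater ICCs from the same ratings. The sketch below implements the standard two-way random-effects, absolute-agreement formulas, ICC(2,1) and ICC(2,k), from ANOVA mean squares. It assumes a fully crossed design in which every rater scores every response, which is a simplification of the study's rotating rater panels, and the 5 × 3 ratings matrix is invented.

```python
def icc2(ratings):
    """Return (ICC(2,1), ICC(2,k)) for an n-targets x k-raters matrix.

    Two-way random effects, absolute agreement; requires a complete matrix.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols                     # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average

# Invented 1-5 Likert scores: 5 responses (rows) rated by 3 raters (columns).
toy_ratings = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 4],
    [2, 3, 2],
    [4, 4, 3],
]
icc_single, icc_average = icc2(toy_ratings)
print(round(icc_single, 3), round(icc_average, 3))
```

For this toy matrix the single-rater value is 0.75 and the three-rater average is 0.90, far higher than the modest agreement reported in the study, but the direction of the effect is the same: averaging over raters reduces the contribution of rater-specific noise.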

Implications

Practical Implementation

Findings indicate that high-capacity, retrieval-grounded LLM systems can provide patient-facing explanations of CGM patterns, actionable practical suggestions, and robust empathic communication that, under controlled vignette-based conditions, match or surpass the work of experienced diabetes clinicians on core quality metrics.

Potential clinical roles include:

Adjunctive use in routine CGM review, preparing patients for clinical encounters
Augmentation of standardized patient education and interpretive workflows
Improved efficiency for clinicians by offloading explanatory and routine interpretive tasks, reserving human expertise for individualized clinical decision-making

Despite high structured ratings, the CA did generate responses that triggered interpretive, policy-related, or behavioral feasibility concerns. This underscores the necessity for continued oversight, consistency with rapidly evolving clinical standards, and careful configuration of retrieval content.

Theoretical Insights

The results underscore the capacity of LLMs, when appropriately constrained by RAG with high-quality guideline corpora and robust prompt engineering, to produce context-aware, domain-consistent, and emotionally resonant responses. The CA's relative advantage for actionability and empathy highlights an area where scalable automation can materially affect patient experience, especially for common counseling queries under bounded clinical scope.

However, the study exposes the model’s limitations for nuanced therapeutic strategy and real-time individualized reasoning, with structural and stylistic differences rendering outputs consistently distinguishable from those authored by clinicians.

Limitations and Future Work

Vignette-based assessment, rather than deployment in live patient-clinician-CA workflows, restricts external validity.
Limited to common-case scenarios; performance in complex, rare, or ambiguous presentations remains untested.
Clinician diversity and inter-rater scoring heterogeneity influence comparative ratings.
Findings pertain to GPT-5.1 and the specific RAG configuration; generalizability to other foundation models or domain corpora is not established.

Prospective, pragmatic trials in clinical settings are required to measure impacts on consultation efficiency, patient comprehension, subsequent management decisions, and risk signaling. Governance frameworks must address safe integration, source transparency, and oversight in real-world clinical deployment.

Conclusion

Under blinded, vignette-based conditions, retrieval-grounded LLM-based conversational agents can match and often surpass the mean rated quality of clinician-authored responses in CGM-informed diabetes counseling along multiple axes, particularly empathy and actionability, when used within strictly delineated explanatory and educational domains. These systems offer tangible value for scalable patient education and standardization of CGM interpretation. Nevertheless, current evidence does not support their use for autonomous therapeutic recommendations, personalized medication adjustments, or as replacements for clinical judgment. Prospective validation in real-world care pathways is essential to determine optimal roles, safety parameters, and patient/clinician acceptance.
